LISTSERV at the University of Georgia
Date:         Sun, 8 May 2011 14:27:51 -0400
Reply-To:     Dave Fournier <otter@OTTER-RSCH.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Dave Fournier <otter@OTTER-RSCH.COM>
Subject:      Re: Why still use SAS with a lot of open source applications?
Comments: To: Vincent Granville <vincentg@DATASHAPING.COM>

On Sat, 7 May 2011 13:59:17 -0400, Vincent Granville <vincentg@DATASHAPING.COM> wrote:

>This discussion was posted on our LinkedIn group. Here's my answer:
>
>SAS has some nice features, such as the SQL procedure or simple "group by"
>features. Try to compute correlations "by group" in R: say you have 2,000
>groups, 2 variables e.g. salary and education level, and 2 million
>observations - you want to compute correlation between salary and education
>within each group.
>
>It is not obvious, your best bet is to use some R package (see sample code on
>Analyticbridge to do it), and the solution is painful, you can not return both
>correlation and stdev "by group", as the function can return only one
>argument, not a vector. So if you want to return not just two, but say 100
>metrics, it becomes a nightmare.
>
>Read discussion at http://bit.ly/jRJQvj

This is a trivially small problem for a fast compiled language. I used the open source C++ code in AD Model Builder to create and analyze data as you describe it. For 2 million records with 2000 groups and a 2x2 matrix per group, the code ran in about 1 second on my laptop.

Increasing the size to 10 million records, 2000 groups, and a 10x10 matrix per group took about 25 seconds. Here is the code.

// Assumed includes (omitted in the original post): the AUTODIF header
// that ships with AD Model Builder, plus <fstream> for the report file.
#include <fvar.hpp>
#include <fstream>
using std::ofstream;
using std::endl;

int main()
{
  int nobs=10000000;
  int ngroups=2000;
  int ndim=10;
  dmatrix obs(1,nobs,1,ndim);
  dvector dgroups(1,nobs);
  dmatrix means(1,ngroups,1,ndim);
  ivector groups(1,nobs);
  random_number_generator rng(101);
  obs.fill_randn(rng);                // simulated data
  dgroups.fill_randu(rng);
  groups=ivector(dgroups*ngroups+1);  // randomly assign data to groups

  d3_array covar(1,ngroups,1,ndim,1,ndim);
  ivector gtot(1,ngroups);            // number of records in each group
  gtot.initialize();
  means.initialize();
  covar.initialize();

  // single pass: accumulate sums and sums of outer products by group
  for (int i=1;i<=nobs;i++)
  {
    means(groups(i))+=obs(i);
    covar(groups(i))+=outer_prod(obs(i),obs(i));
    gtot(groups(i))+=1;
  }

  // covariance = mean of outer products minus outer product of means
  for (int i=1;i<=ngroups;i++)
  {
    means(i)/=gtot(i);
    covar(i)/=gtot(i);
    covar(i)-=outer_prod(means(i),means(i));
  }

  ofstream ofs("report");
  for (int i=1;i<=ngroups;i++)
  {
    ofs << "group " << i << endl;
    ofs << covar(i) << endl << endl;
  }
  return 0;
}
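
The code above reports covariance matrices, while the original question asked for the correlation and standard deviation in each group. For readers without AD Model Builder, here is a minimal sketch of the same one-pass idea in standard C++ for the two-variable case. It is not the code I timed: the names GroupSums, GroupStats, and stats_by_group are made up for illustration, group indices are assumed to be 0-based, and, as in the code above, variances divide by n rather than n-1.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Running sums for one group (two variables, x and y).
struct GroupSums {
    std::size_t n = 0;
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
};

// Several metrics returned together for each group.
struct GroupStats {
    std::size_t n;
    double mean_x, mean_y, sd_x, sd_y, corr;
};

// One pass over the data. group[i] must be a 0-based index < ngroups.
std::vector<GroupStats> stats_by_group(const std::vector<int>& group,
                                       const std::vector<double>& x,
                                       const std::vector<double>& y,
                                       int ngroups)
{
    std::vector<GroupSums> acc(ngroups);
    for (std::size_t i = 0; i < group.size(); ++i) {
        GroupSums& g = acc[group[i]];
        g.n   += 1;
        g.sx  += x[i];
        g.sy  += y[i];
        g.sxx += x[i] * x[i];
        g.syy += y[i] * y[i];
        g.sxy += x[i] * y[i];
    }

    std::vector<GroupStats> out(ngroups);
    for (int k = 0; k < ngroups; ++k) {
        const GroupSums& g = acc[k];
        GroupStats s{g.n, 0.0, 0.0, 0.0, 0.0, 0.0};
        if (g.n > 0) {
            const double n = static_cast<double>(g.n);
            s.mean_x = g.sx / n;
            s.mean_y = g.sy / n;
            // mean of products minus product of means (divides by n)
            const double vx  = g.sxx / n - s.mean_x * s.mean_x;
            const double vy  = g.syy / n - s.mean_y * s.mean_y;
            const double cxy = g.sxy / n - s.mean_x * s.mean_y;
            s.sd_x = std::sqrt(vx > 0.0 ? vx : 0.0);
            s.sd_y = std::sqrt(vy > 0.0 ? vy : 0.0);
            // correlation is left at 0 when either variable is constant
            if (s.sd_x > 0.0 && s.sd_y > 0.0)
                s.corr = cxy / (s.sd_x * s.sd_y);
        }
        out[k] = s;
    }
    return out;
}
```

Returning 100 metrics per group instead of six would only mean adding fields to the struct. One caveat: the sums-of-squares formula can lose precision when the means are large relative to the spread, so centering the data first may be worth it for real salary-type numbers.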

