Date: Sun, 8 May 2011 14:27:51 -0400
Reply-To: Dave Fournier <otter@OTTER-RSCH.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Dave Fournier <otter@OTTER-RSCH.COM>
Subject: Re: Why still use SAS with a lot of open source applications?
On Sat, 7 May 2011 13:59:17 -0400, Vincent Granville
<vincentg@DATASHAPING.COM> wrote:
>This discussion was posted on our LinkedIn group. Here's my answer:
>
>SAS has some nice features, such as the SQL procedure or simple "group by"
>features. Try to compute correlations "by group" in R: say you have 2,000
>groups, 2 variables e.g. salary and education level, and 2 million
>observations - you want to compute correlation between salary and education
>within each group.
>
>It is not obvious: your best bet is to use some R package (see sample code on
>Analyticbridge to do it), and the solution is painful. You cannot return both
>correlation and stdev "by group", as the function can return only one
>argument, not a vector. So if you want to return not just two, but say 100
>metrics, it becomes a nightmare.
>
>Read discussion at http://bit.ly/jRJQvj
This is a trivially small problem with a fast compiled language.
I used the open source C++ code in AD Model Builder to create
and analyze data as you describe it. For 2 million records with
2000 groups and a 2x2 matrix, the code ran in about 1 second on my laptop.
Increasing the size to 10 million records, 2000 groups, and a 10x10 matrix
took about 25 seconds. Here is the code; it writes each group's
covariance matrix to a file named "report".
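#include <fvar.hpp>   // AUTODIF header from AD Model Builder (dvector, dmatrix, ...)
#include <fstream>

int main()
{
  int nobs=10000000;
  int ngroups=2000;
  int ndim=10;
  dmatrix obs(1,nobs,1,ndim);        // one row per observation
  dvector dgroups(1,nobs);
  dmatrix means(1,ngroups,1,ndim);   // per-group sums, later per-group means
  ivector groups(1,nobs);
  random_number_generator rng(101);
  obs.fill_randn(rng); // simulated data
  dgroups.fill_randu(rng);
  groups=ivector(dgroups*ngroups+1); // randomly assign data to groups
  d3_array covar(1,ngroups,1,ndim,1,ndim); // one ndim x ndim matrix per group
  ivector gtot(1,ngroups);           // number of observations in each group
  gtot.initialize();
  means.initialize();
  covar.initialize();
  // single pass over the data: accumulate sums and cross-products by group
  for (int i=1;i<=nobs;i++)
  {
    means(groups(i))+=obs(i);
    covar(groups(i))+=outer_prod(obs(i),obs(i));
    gtot(groups(i))+=1;
  }
  // covariance = E[x x'] - mean mean' (divides by n, not n-1)
  for (int i=1;i<=ngroups;i++)
  {
    if (gtot(i)==0) continue;        // skip empty groups to avoid dividing by zero
    means(i)/=gtot(i);
    covar(i)/=gtot(i);
    covar(i)-=outer_prod(means(i),means(i));
  }
  std::ofstream ofs("report");
  for (int i=1;i<=ngroups;i++)
  {
    ofs << "group " << i << std::endl;
    ofs << covar(i) << std::endl << std::endl;
  }
  return 0;
}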
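
The code above reports covariance matrices, while the original question
asked for correlation and stdev by group. Both follow from the same
accumulated sums. Here is a minimal sketch of that step in plain C++
(no ADMB required) for the two-variable case. The names GroupStats and
summarize are made up for illustration, group ids are assumed to run from
1 to ngroups as in the code above, and the variance divides by n like the
code above. It is not tuned for speed or for numerical stability with
very large values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running sums for two variables x and y within one group.
// (Hypothetical helper type, not part of ADMB.)
struct GroupStats {
    long n = 0;
    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_yy = 0, sum_xy = 0;

    void add(double x, double y) {
        ++n;
        sum_x += x;      sum_y += y;
        sum_xx += x * x; sum_yy += y * y;
        sum_xy += x * y;
    }

    // Population moments (divide by n), NaN for an empty group.
    double mean_x() const { return n ? sum_x / n : nan_(); }
    double mean_y() const { return n ? sum_y / n : nan_(); }
    double var_x() const { return n ? sum_xx / n - mean_x() * mean_x() : nan_(); }
    double var_y() const { return n ? sum_yy / n - mean_y() * mean_y() : nan_(); }
    double cov_xy() const { return n ? sum_xy / n - mean_x() * mean_y() : nan_(); }
    double sd_x() const { return std::sqrt(var_x()); }
    double sd_y() const { return std::sqrt(var_y()); }

    // Correlation = cov / (sd_x * sd_y); NaN if either variance is not positive.
    double corr() const {
        if (n == 0 || !(var_x() > 0) || !(var_y() > 0)) return nan_();
        return cov_xy() / (sd_x() * sd_y());
    }

private:
    static double nan_() { return std::numeric_limits<double>::quiet_NaN(); }
};

// One pass over the data; returns one GroupStats per group.
// Element g-1 of the result holds the statistics for group id g.
std::vector<GroupStats> summarize(const std::vector<int>& group,
                                  const std::vector<double>& x,
                                  const std::vector<double>& y,
                                  int ngroups) {
    assert(group.size() == x.size() && x.size() == y.size());
    std::vector<GroupStats> stats(ngroups);
    for (std::size_t i = 0; i < group.size(); ++i) {
        assert(group[i] >= 1 && group[i] <= ngroups);
        stats[group[i] - 1].add(x[i], y[i]);
    }
    return stats;
}
```

Each group's entry carries every metric at once (mean, stdev, covariance,
correlation), so returning 100 metrics per group is only a matter of adding
more fields or member functions to the struct.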