Date: Mon, 4 May 1998 16:59:14 -0400
Reply-To: "Dorfman, Paul" <pdorfma@UCS.ATT.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Dorfman, Paul" <pdorfma@UCS.ATT.COM>
Subject: Compare Variables from 2 data sets
Content-Type: text/plain; charset="iso-8859-1"
<abdu.elnagheeb@NATIONSBANK.COM>, in particular, wrote:
...
I hope someone will respond to this question which is best asked
using
an example. I have 2 data sets (A and B) (sorted for convenience)
as
below:
Data A : has one variable (X): 5, 9, 11, 23, 29, 35
Data B : has one variable (Y): 3, 10, 15, 19, 27, 31, 37, 43
I want to do the following. For EVERY observation in data A I want
to
compare X to Y (in data B) and create a variable that will be the #
of
times that that OBSERVATION (of X) was less or equal to (<=) Y (all
observations). In our example, comparing the value X=5 to all the Y
values, I will get that X=5 <= Y 7 times (observations), 9 <= Y 7
times, 11 <= Y 6 times, 23 <= Y 4 times ...etc.
Hence, I want to get a data set (C) that will look like this:
data C:
X NewVar
5 7
9 7
11 6
23 4
29 3
35 2
I have used arrays. The problem is, the number of observations in
data A and B are in thousands (> 20,000) and I ran into problems
because of that.
What would the most efficient way of doing this?
...
Abed,
For your task, I still don't see any viable alternative to using an
array. Resorting to ANY other approach would be equivalent to having to
read at least more than 20000*20000=400 million records file.
If your problem has been running out of memory, its cure appears hazy
without having seen the code and/or knowing what your platform is. I've
never experienced a difficulty to allocate such small an array (since
your variables are numeric, it occupies only 164K of task memory) under
Windows, to say nothing of MVS. However, let me suggest a couple of
things:
1. Does the fact that you have used the word "arrays", in plural,
indicate that in your code, you have allocated more that one
array? If so, it is unnecessary, since only one is needed.
2. Have you used non-temporary array(s)? If so, don't. A quick look at
FULLSTIMER diagnostics reveals
Task memory - 1737K (949K data, 788K program)
Task memory - 952K (164K data, 788K program)
for a non-temporary and temporary array of 20000 elements, respectively.
So, my code would look like
DATA _NULL_;
IF 0 THEN SET Y NOBS=YNOBS;
CALL SYMPUT('YNOBS',LEFT(PUT(YNOBS,BEST.)));
STOP;
RUN;
DATA COMPARE(KEEP=X COUNT);
ARRAY AY(&YNOBS) _TEMPORARY_;
IF _N_ = 1 THEN DO I=1 TO YNOBS;
SET Y NOBS=YNOBS;
AY(I) = Y;
END;
SET X;
COUNT = 0;
DO I=1 TO DIM(AY);
IF X <= AY(I) THEN COUNT + 1;
END;
RUN;
I've tested it with your test data, and it results into
OBS X COUNT
1 5 7
2 9 7
3 11 6
4 23 4
5 29 3
6 35 2
.
To demonstrate that the approach is viable for "large" files, let's
simulate them using a random generator:
DATA X; DROP I; DO I=1 TO 20000; X = CEIL(RANUNI(0)*100); OUTPUT; END;
DATA Y; DROP I; DO I=1 TO 20000; Y = CEIL(RANUNI(0)*100); OUTPUT; END;
Running the 'compare' program against the datasets created in the above
manner under OS/390 6.09e results into the following log message:
NOTE: The data set WORK.COMPARE has 20000 observations and 2 variables.
NOTE: The DATA statement used 289.59 CPU seconds and 3944K.
Under W95 (166 MHz, 32M) it's not as speedy but still implementable:
NOTE: The data set WORK.COMPARE has 20000 observations and 2 variables.
NOTE: The DATA statement used 17 minutes 56.03 seconds.
I hope it may help you.
Regards,
Paul.
++++++++++++++++++++++++++++++
Paul M. Dorfman
Citibank UCS Decision Support Systems
Jacksonville, FL
++++++++++++++++++++++++++++++