Date: Fri, 27 Oct 2006 12:43:49 -0400
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: Looking for a more efficient way to merge
On Wed, 25 Oct 2006 23:47:09 -0500, Doug Cacialli <cacialli.sas@GMAIL.COM> wrote:
>Suppose I have two data sets (and suppose, for the sake of argument, that
>you care dearly about my data sets :-). Data-set-1 has multiple
>observations per BY variable, whereas data-set-2 has single observations per
>BY variable. Suppose further that data-set-1 has a good amount of missing
>data, and that data-set-2 may contain that data. What I'd like to do is
>merge the two data sets while meeting the following conditions:
>1. I only want to retain observations that are contained in data-set-1
>(data-set-2 will contain data from participants who do not exist in
>data-set-1).
>2. I want to replace missing values in data-set-1 with non-missing values in
>data-set-2 where appropriate.
>What I've got is this:
> MERGE work.data_set_1 (IN = inset1) work.data_set_2;
> BY ID;
>IF (inset1 = 1);
>ARRAY fixmiss (4) var1_nomiss var2_nomiss var3_nomiss var4_nomiss;
>ARRAY original (4) var1 var2 var3 var4;
> DO i = 1 to 4;
>    IF (original(i) = .)
>       THEN original(i) = fixmiss(i);
> END;
>... and that works, but it's really ugly and I know there's got to be a more
>efficient way to accomplish what I'm trying to do.
Ugly and efficient are not opposites; they are not even mutually exclusive.
Ugly is also pretty subjective.
So I'm not sure what dissatisfies you about the code shown. It looks like a
pretty good solution.
You could use the abbreviated form

   var1-var4

in place of

   var1 var2 var3 var4
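As a sketch (assuming the same four variables from the post), the ARRAY
statements could use a numbered range list where the names permit it:

```sas
/* Numbered range lists work only when the names share a prefix and end */
/* in consecutive numeric suffixes, as var1-var4 do.                    */
ARRAY original (4) var1-var4;

/* var1_nomiss ... var4_nomiss cannot be abbreviated this way, because  */
/* the numeric part is not at the end of each name.                     */
ARRAY fixmiss (4) var1_nomiss var2_nomiss var3_nomiss var4_nomiss;
```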
If you were to normalize your data structures (turn each existing
observation into 4 observations, each holding just one VAR value), you could
get rid of the arrays and loops and instead simply add a BY variable.
The same abbreviation would work for the second table if the names were
formed with numeric suffixes (eg, var_nomiss2 instead of var2_nomiss), making
var_nomiss1-var_nomiss4 a valid numbered list.
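To make the normalized approach concrete, here is a sketch of the merge step,
assuming both tables have already been restructured to one row per ID and
variable (hypothetical names: tables TALL_1 and TALL_2, with BY variables ID
and VARNUM and values in VALUE and VALUE_NOMISS):

```sas
/* Sketch: with normalized tables, one extra BY variable replaces the   */
/* arrays and the loop. TALL_1 may have many rows per (ID, VARNUM);     */
/* TALL_2 contributes at most one, so its value carries across.         */
DATA want;
   MERGE tall_1 (IN = inset1) tall_2;
   BY ID varnum;
   IF inset1;                               /* keep TALL_1 rows only    */
   IF value = . THEN value = value_nomiss;  /* fill in missing values   */
RUN;
```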
I don't see much opportunity for improved performance, especially if the
given tables are already sorted by ID. But I'm guessing on that, since I
don't know the sizes of the tables or other pertinent characteristics.
>I thought UPDATE might
>be what I wanted, but the multiple observations per BY variable on what
>would be the master data set precludes its use. I thought maybe SQL, but I
>don't know it nearly well enough to make something like this work. Does
>anyone have any suggestions on this?
The thing with SQL is that it has nothing like arrays and loops to handle
a lot of columns that require parallel treatment.
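To illustrate (a sketch, assuming the table and variable names from the
post): a PROC SQL version has to spell out a separate COALESCE for every
column, with no shorthand to generate them:

```sas
/* Each column gets its own COALESCE call; SQL offers no array or loop  */
/* construct to apply the same logic across many columns at once.      */
PROC SQL;
   CREATE TABLE want AS
   SELECT a.ID,
          COALESCE(a.var1, b.var1_nomiss) AS var1,
          COALESCE(a.var2, b.var2_nomiss) AS var2,
          COALESCE(a.var3, b.var3_nomiss) AS var3,
          COALESCE(a.var4, b.var4_nomiss) AS var4
   FROM work.data_set_1 AS a
        LEFT JOIN work.data_set_2 AS b
        ON a.ID = b.ID;   /* LEFT JOIN keeps only data_set_1's IDs */
QUIT;
```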
>Thanks in advance,
>- Doug Cacialli