Date: Fri, 9 Mar 2007 09:24:54 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Can I transpose and dedup several column values up front in
PROC SQL?
Content-Type: text/plain; charset="iso-8859-1"
Howard:
I agree (as usual) with your comment about not needing the DISTINCT qualifier in this case. I'll have to look at a test with data to see whether a UNION of views actually scans the same source dataset in the views once per view. The SAS compiler has an opportunity to optimize the query. Whether it does or not, I can't say a priori. At least the query has a simple form and should take very little programming time.
S
________________________________
From: owner-sas-l@listserv.uga.edu on behalf of Howard Schreier <hs AT dc-sug DOT org>
Sent: Fri 3/9/2007 9:15 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Can I transpose and dedup several column values up front in PROC SQL?
On Fri, 9 Mar 2007 08:19:44 -0500, Sigurd Hermansen <HERMANS1@WESTAT.COM> wrote:
>Martin:
>I'd create three SQL views of the dataset that have the same alias in each
for an account column, and the person column, then UNION the three views:
>
>create view vwA1 as select INIT_ACCT_NR as XREF_ACCT_NR,PERSON_NR
>;
><etc.>
>create table A123 as
>select distinct * from
>(select * from vwA1)
>union
><etc.>
>where NOT XREF_ACCT_NR is missing
>order by PERSON_NR,XREF_ACCT_NR
>;
>
>S
That should work. No need for the DISTINCT, since that's implicit behavior
for the UNION operator.
However, Martin asked for "more efficient ideas", but this SQL approach is
going to pass the data three times and probably perform an avoidable sort.
>
>
>
>________________________________
>
>From: owner-sas-l@listserv.uga.edu on behalf of Martin Mathis
>Sent: Wed 3/7/2007 5:01 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Cc: Martin Mathis
>Subject: Can I transpose and dedup several column values up front in PROC SQL?
>
>
>
>Hello SAS-L,
>
>I'm solicting more efficient ideas for the following data processing task:
>
>I have an account numbers to person number cross-reference DB2 table with
>columns and example values (there are 3 different possible account numbers
>for a person):
>
>INIT_ACCT_NR CUR_ACCT_NR PREV_ACCT_NR PERSON_NR
> 2 3 1 22
> 4 5 33
> 6 6 44
> 9 9 8 55
>
>
>I need a "all possible accounts to person" reference SAS dataset with
>init_acct_nr, cur_acct_nr, prev_acct_nr transposed to a single account
>column by person_nr, deduped and excluding any blank previous account
>numbers, looking like this:
>
>XREF_ACCT_NR PERSON_NR
> 1 22
> 2 22
> 3 22
> 4 33
> 5 33
> 6 44
> 8 55
> 9 55
>
>My current idea is to use PROC SQL and to first pull init_acct_nr,
>cur_acct_nr, prev_acct_nr, person_nr into a SAS dataset "dump", ordering by
>person_nr. Then:
>
>proc transpose data=dump out=have(drop=_name_ _label_);
> by person_nr;
> var init_acct_nr cur_acct_nr prev_acct_nr;
>run;
>
>proc sort data=have out=need(rename=(col1=xref_acct_nr)) nodupkey;
> by col1;
> where col1 ne "";
>run;
>
>Trouble is, that's a lot of processing starting with 126 million obs, times
>3, before starting to narrow it down...
>
>Any idea how this could be done more efficiently with less steps directly
>in PROC SQL? How I could select the distinct, non-blank values from 3
>columns and output them as 1-3 rows, repeating the person_nr in the process?
>
>Thanks!
>Martin
|