Date: Fri, 30 Apr 2010 13:10:57 -0500
Reply-To: Joe Matise <snoopy369@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Joe Matise <snoopy369@GMAIL.COM>
Subject: Re: Error in using Hash Objects
In-Reply-To: <y2l2fc7f3341004301010tfa823174wcebd1f226365ef0c@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Can you explain that further? I thought that first statement was just
defining the PDV and the second set statement fully brought in the hash
table... perhaps I don't understand hash tables adequately though?
-Joe
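
For anyone who wants to check this on their own system, here is a minimal
sketch. It assumes SASHELP.CLASS is available as a stand-in for HAVE, and the
PUT messages are only illustrative. As far as I understand the DATA step, a
SET statement that actually executes and finds no observations to read ends
the step right there, whereas IF 0 THEN SET is used only at compile time to
set up the PDV.

```sas
/* Sketch only: SASHELP.CLASS stands in for HAVE (an assumption). */

/* Version 1: unconditional SET with OBS=0, as in the posted code. */
data _null_;
  set sashelp.class (keep=name obs=0);
  put 'Version 1: reached the statement after SET';
run;

/* Version 2: SET is compiled but never executed. */
data _null_;
  if 0 then set sashelp.class (keep=name);
  put 'Version 2: reached the statement after SET';
  stop;  /* no SET ever executes, so end the step explicitly */
run;
```

If only the Version 2 message shows up in the log, then the unconditional
OBS=0 SET is ending the step before the hash loop ever runs, which would
explain an empty WANT.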
On Fri, Apr 30, 2010 at 12:10 PM, Muthia Kachirayan <muthia.kachirayan@gmail.com> wrote:
> Naresh Kumar,
>
> In Mark's code the statement
>
> set have (keep=tn order_number obs=0);
> prevents anything from being added to the hash table. Try replacing the first part with
>
> data want(drop = rc);
>   if _n_ = 1 then do;
>     if 0 then set have;
>     declare hash found_keys();
>     found_keys.definekey('tn', 'order_number');
>     found_keys.definedone();
>   end;
>
> This will output deduplicated records with all variables.
>
> Muthia Kachirayan
>
>
> On Fri, Apr 30, 2010 at 12:48 PM, Joe Matise <snoopy369@gmail.com> wrote:
>
> > I think that's what Mark's code does. It only puts two variables into the
> > hash table, but it outputs the entire row (174 vars) to the dataset.
> >
> > -Joe
> >
> > On Fri, Apr 30, 2010 at 11:42 AM, naresh kmar <nareshkmar@yahoo.co.in> wrote:
> >
> > > Mark,
> > >
> > > Thanks. I would like to get the other 174 variables as well in my output
> > > dataset. Actually, I don't need to sort the dataset but would like to remove
> > > duplicates on the composite key (TN+ORDER_NUMBER). I don't think a hash table
> > > will be able to take in all those variables through definedata().
> > >
> > > Any thoughts??
> > >
> > > Thanks,
> > > Naresh
> > >
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
> > > To: SAS-L@LISTSERV.UGA.EDU
> > > Sent: Fri, 30 April, 2010 8:18:42 PM
> > > Subject: Re: Error in using Hash Objects
> > >
> > > Naresh:
> > >
> > > You are asking for WAY too much memory. So PROC SORT, which substitutes
> > > disk I/O for memory, may be the preferred tactic.
> > >
> > > BUT ... you could use a hash if, by "removing duplicates", you mean keeping
> > > only one record for each combination of identification variables, say TN and
> > > ORDER_NUMBER. That's apparently your intention in your code.
> > >
> > > If so, consider the below. Here the hash table only accommodates the two id
> > > variables, merely for maintaining a list tracking which id values have
> > > already been encountered at any point in your progress through dataset HAVE.
> > >
> > >
> > > data want (drop=rc);
> > >   ** Get variable attributes of the key variables into the PDV **;
> > >   set have (keep=tn order_number obs=0);
> > >
> > >   declare hash found_keys (hashexp:16);
> > >   found_keys.definekey('TN','ORDER_NUMBER');
> > >   found_keys.definedone();
> > >
> > >   do until (end_of_have);
> > >     set have end=end_of_have;
> > >     rc=found_keys.check();
> > >     if rc^=0 then do;        /* If not yet in table ...   */
> > >       rc=found_keys.add();   /* .. add to the table ...   */
> > >       output;                /* .. and write to WANT      */
> > >     end;
> > >   end;
> > >   stop;
> > > run;
> > >
> > >
> > > Whenever a record is encountered whose TN/ORDER_NUMBER are already in
> > > FOUND_KEYS, then no OUTPUT statement is executed.
> > >
> > > Note this will NOT sort the data, but it will write out only one record per
> > > TN/ORDER_NUMBER combination.
> > >
> > > Regards,
> > > Mark
> > >
> > > > -----Original Message-----
> > > > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> > > > naresh kmar
> > > > Sent: Friday, April 30, 2010 10:24 AM
> > > > To: SAS-L@LISTSERV.UGA.EDU
> > > > Subject: Error in using Hash Objects
> > > >
> > > > Hi All,
> > > >
> > > > I am running the below code on a 14-million-record dataset and received
> > > > an error. Could anyone let me know how to resolve this? work.indsn has
> > > > 14 million records and 176 variables. My objective is to sort the input
> > > > dataset and remove duplicates based on the key. I could have used PROC
> > > > SORT but heard that hash objects are more efficient.
> > > >
> > > > DATA _NULL_ ;
> > > > IF _N_=1 THEN SET work.indsn ;
> > > > DECLARE HASH HH ( DATASET: 'work.indsn', HASHEXP: 16, ORDERED: 'A') ;
> > > > HH.DEFINEKEY ( 'TN', 'ORDER_NUMBER' ) ;
> > > > HH.DEFINEDATA ( 'var1','var2',....,'var176') ; /****** ADD ALL VARIABLES ****/
> > > > HH.DEFINEDONE () ;
> > > > HH.OUTPUT(DATASET:'work.outdsn');
> > > > STOP;
> > > > RUN;
> > > >
> > > > ERROR: Hash object added 131056 items when memory failure occurred.
> > > > FATAL: Insufficient memory to execute data step program. Aborted during the EXECUTION phase.
> > > >
> > > > Thanks,
> > > > Naresh
> > > >
> > >
> > >
> > >
> > >
> >
>