Joe,
Dorfman noted the following in several of his papers:
set have point = _n_ ; * get key/data attributes for parameter type matching ;
* set have (obs = 1) ; * this will work, too :-)! ;
* if 0 then set have ; * and so will this :-)! ;
* set have (obs = 0) ; * but for some reason, this will not :-( ;
Mark had used the last statement by mistake. He might not have tested his
code.
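
For reference, here is a minimal sketch of Mark's step with only the priming statement changed to the "if 0 then set" form. It assumes the dataset HAVE and the keys TN and ORDER_NUMBER from this thread, and I have not run it against the real data. My understanding of why obs=0 fails is that an unconditionally executed SET with nothing to read hits end of file at once and ends the step before the loop starts, though I have not confirmed that this is the whole story.

```sas
data want (drop = rc);
  /* Never executed: at compile time this only places TN and       */
  /* ORDER_NUMBER in the PDV so the hash can match their attributes */
  if 0 then set have (keep = tn order_number);

  /* Key-only table, as in Mark's code: it just tracks which */
  /* TN/ORDER_NUMBER pairs have been seen so far              */
  declare hash found_keys (hashexp: 16);
  found_keys.definekey('tn', 'order_number');
  found_keys.definedone();

  do until (end_of_have);
    set have end = end_of_have;     /* reads the full record      */
    rc = found_keys.check();
    if rc ne 0 then do;             /* key pair not yet in table  */
      rc = found_keys.add();
      output;                       /* first occurrence goes out  */
    end;
  end;
  stop;
run;
```

Only the two keys sit in memory, while each output record still carries every variable in HAVE.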
On Fri, Apr 30, 2010 at 2:10 PM, Joe Matise <snoopy369@gmail.com> wrote:
> Can you explain that further? I thought that first statement was just
> defining the PDV and the second set statement fully brought in the hash
> table... perhaps I don't understand hash tables adequately though?
>
> -Joe
>
>
> On Fri, Apr 30, 2010 at 12:10 PM, Muthia Kachirayan <muthia.kachirayan@gmail.com> wrote:
>
>> Naresh Kumar,
>>
>> In Mark's code the statement
>>
>> set have (keep=tn order_number obs=0);
>> prevents anything from being added to the hash table. Try replacing the first part with
>>
>> data want(drop = rc);
>> if _n_ = 1 then do;
>> if 0 then set have;
>> declare hash found_keys();
>> found_keys.definekey('tn', 'order_number');
>> found_keys.definedone();
>> end;
>>
>> This will output the unduplicated records with all variables.
>>
>> Muthia Kachirayan
>>
>>
>> On Fri, Apr 30, 2010 at 12:48 PM, Joe Matise <snoopy369@gmail.com> wrote:
>>
>> > I think that's what Mark's code does. It only puts two variables into the
>> > hash table, but it outputs the entire row (174 vars) to the dataset.
>> >
>> > -Joe
>> >
>> > On Fri, Apr 30, 2010 at 11:42 AM, naresh kmar <nareshkmar@yahoo.co.in> wrote:
>> >
>> > > Mark,
>> > >
>> > > Thanks. I would like to get the other 174 variables as well in my output
>> > > dataset. Actually, I don't need to sort the dataset but would like to remove
>> > > duplicates on the composite key (TN+ORDER_NUMBER). I don't think a hash table
>> > > will be able to take in all those variables through definedata().
>> > >
>> > > Any thoughts??
>> > >
>> > > Thanks,
>> > > Naresh
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > ________________________________
>> > > From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
>> > > To: SAS-L@LISTSERV.UGA.EDU
>> > > Sent: Fri, 30 April, 2010 8:18:42 PM
>> > > Subject: Re: Error in using Hash Objects
>> > >
>> > > Naresh:
>> > >
>> > > You are asking for WAY too much memory. So PROC SORT, which substitutes
>> > > disk I/O for memory, may be the preferred tactic.
>> > >
>> > > BUT ... you could use a hash if, by "removing duplicates", you mean keeping
>> > > only one record for each combination of identification variables, say TN and
>> > > ORDER_NUMBER. That's apparently your intention in your code.
>> > >
>> > > If so, consider the code below. Here the hash table only accommodates the two id
>> > > variables, merely to maintain a list tracking which id values have
>> > > already been encountered at any point in your progress through dataset HAVE.
>> > >
>> > >
>> > > data want (drop=rc);
>> > > ** Get variable attributes of the key variables into the PDV **;
>> > > set have (keep=tn order_number obs=0);
>> > >
>> > > declare hash found_keys (hashexp:16);
>> > > found_keys.definekey('TN','ORDER_NUMBER');
>> > > found_keys.definedone();
>> > >
>> > > do until (end_of_have);
>> > > set have end=end_of_have;
>> > > rc=found_keys.check();
>> > > if rc^=0 then do; /* If not yet in table ... */
>> > > rc=found_keys.add(); /* .. add to the table ... */
>> > > output; /* .. and write to WANT */
>> > > end;
>> > > end;
>> > > stop;
>> > > run;
>> > >
>> > >
>> > > Whenever a record is encountered whose TN/ORDER_NUMBER are already in
>> > > FOUND_KEYS, then no OUTPUT statement is executed.
>> > >
>> > > Note this will NOT sort the data, but it will write out only one record per
>> > > TN/ORDER_NUMBER combination.
>> > >
>> > > Regards,
>> > > Mark
>> > >
>> > > > -----Original Message-----
>> > > > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
>> > > > naresh kmar
>> > > > Sent: Friday, April 30, 2010 10:24 AM
>> > > > To: SAS-L@LISTSERV.UGA.EDU
>> > > > Subject: Error in using Hash Objects
>> > > >
>> > > > Hi All,
>> > > >
>> > > > I am running the code below on a 14-million-record dataset and received
>> > > > an error. Could anyone let me know how to resolve this? work.indsn has
>> > > > 14 million records and 176 variables. My objective is to sort the input
>> > > > dataset and remove duplicates based on the key. I could have used PROC
>> > > > SORT but heard that hash objects are more efficient.
>> > > >
>> > > > DATA _NULL_ ;
>> > > > IF _N_=1 THEN SET work.indsn ;
>> > > > DECLARE HASH HH ( DATASET: 'work.indsn', HASHEXP: 16, ORDERED: 'A') ;
>> > > > HH.DEFINEKEY ( 'TN', 'ORDER_NUMBER' ) ;
>> > > > HH.DEFINEDATA ( 'var1','var2',....,'var176') ; /****** ADD ALL VARIABLES ****/
>> > > > HH.DEFINEDONE () ;
>> > > > HH.OUTPUT(DATASET:'work.outdsn');
>> > > > STOP;
>> > > > RUN;
>> > > >
>> > > > ERROR: Hash object added 131056 items when memory failure occurred.
>> > > > FATAL: Insufficient memory to execute data step program. Aborted during
>> > > > the EXECUTION phase.
>> > > >
>> > > > Thanks,
>> > > > Naresh
>> > > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>
>