LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2002, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Sun, 5 May 2002 19:39:55 -0400
Reply-To:     sashole@bellsouth.net
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Paul M. Dorfman" <sashole@BELLSOUTH.NET>
Organization: Sashole of Florida
Subject:      Re: Sorting linked transaction data
In-Reply-To:  <F197GCctbPnQ9c318Xj000095cf@hotmail.com>
Content-Type: text/plain; charset="us-ascii"

> From: "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM> > Paul and Ian: Thank you for your solutions and reminders. I was > definitely looking for the hash based approach since my data would > not lend itself to direct addressing.

Richard,

In my dictionary, hashing is one of direct-addressing algorithms. I guess you meant direct addressing in its purest form (that is, devoid of any comparisons between keys) - key-indexing or bit-mapping.

> I think I would be right in stating it's a problem of finding an inverse > hash map. Ian showed a concise solution with the hash function and > inversing managed implicitly by a table index.

I am not sure where Ian was using a hash function of any kind. Technically, he built and searched a disk-resident B-tree. Nor the searching performance delivered by the tree is O(1), which would be typical for hashing. However, if we forget about the technical details and look at Ian's code from the standpoint of logical programming, I would agree that it looks like hashing in the sense that a piece of data is retrieved directly by the key.

> Paul shows the same with data step arrays and a custom hash function, with > it's expected speedy results. > > I am speculating a v9 solution (based on the excellent read at > http://www.sas.com/rnd/papers/sugi27/dsv9-sugi.pdf) would look something > like > > data sortedTransactionsLinkInfo; > declare Hash ht(); > > ht.defineKey("currId"); > ht.defineData("prevId","nextId"); > ht.defineDone(); > > do while (not allKeysRead); > set transactions end=allKeysRead; > ht.add(); > if prevId = . then lookForId = currId; > end; > > do while (0 = ht.find(lookForId)); > * presume successful find automatically moves > * hash data to data step PDV; > output; > lookForId = nextId; > end; > end;

Right on the money! To test, I have coded approximately along the lines you suggested, expanded your test data a little bit (from N=20 to N=500000), and ran all three approaches (Ian's, mine, and SI's - or rather yours) under XP batch. Please see the log after the sig line (I have cleaned it up for better viewing, leaving only essential stuff).

Kind regards, ================= Paul M. Dorfman Jacksonville, FL =================

NOTE: Copyright (c) 2002 by SAS Institute Inc., Cary, NC, USA.

NOTE: SAS (r) Proprietary Software Pre-Production Version 9.00 (TS B0)

Licensed to SAS Institute, Internal Test Release, Site 0000000001.

NOTE: This session is executing on the XP_PRO platform.

%let N = 500000 ;

data transactions (keep = previd currid nextid) ;

array use ( &N ) _temporary_ ;

seed = 1 ;

do i = 1 to &N ;

previd = currid ;

currid = nextid ;

nextid = int( &N * ranuni(seed) ) + 1 ;

do while (use(nextid) ne .) ;

if nextid >= &N then nextid = 1 ;

else nextid = nextid + 1 ;

end;

use(nextid) = 1;

nextid = nextid * (1 + int (1000*ranuni(seed)));

output;

end;

run;

NOTE: The data set WORK.TRANSACTIONS has 500000 observations and 3 variables. NOTE: DATA statement used (Total process time):

real time 31.10 seconds

cpu time 30.76 seconds

proc sort data = transactions ;

by nextid ;

where currid ;

run;

NOTE: There were 499999 observations read from the data set WORK.TRANSACTIONS.

WHERE currid;

NOTE: The data set WORK.TRANSACTIONS has 499999 observations and 3 variables. NOTE: PROCEDURE SORT used (Total process time):

real time 1.75 seconds

cpu time 2.70 seconds

*** Ian Whitlock: SAS index ;

data trans ( index = ( currid ) ) ;

set transactions ;

if previd = . then call symput ( "start" , put(currid,best.-l)) ;

run ;

NOTE: There were 499999 observations read from the data set WORK.TRANSACTIONS.

NOTE: The data set WORK.TRANS has 499999 observations and 3 variables.

NOTE: DATA statement used (Total process time):

real time 2.21 seconds

cpu time 2.18 seconds

data isorted ( drop = next ) ;

retain next &start ;

currid = next ;

set trans key = currid /unique nobs = nobs ;

if _n_ > nobs or _iorc_ then

do ;

_error_ = 0 ;

stop ;

end ;

next = nextid ;

run ;

NOTE: The Data set WORK.ISORTED has 499999 observations and 3 variables.

NOTE: The DATA statement used (Total process time):

real time 31.51 seconds

cpu time 31.43 seconds

*** Paul Dorfman: Hand-coded hash with linear probing ;

%let h = 1000001 ;

data hsorted (keep = previd currid nextid) ;

array x (0 : &h) _temporary_ ;

array c (0 : &h) _temporary_ ;

do _n_ = 1 to n ;

set transactions (rename = (previd = p)) nobs = n ;

link hash ;

x (j) = currid ;

c (j) = nextid ; ;

if p = . then start = currid ;

end ;

currid = start ;

do _n_ = 1 to n ;

link hash ;

nextid = c (j) ;

output ;

previd = currid ;

currid = nextid ;

end ;

stop ;

hash: do j = mod(currid,&h) by 1 until (x(j) = . or x(j) = currid) ;

if j = &h then j = 0 ;

end ;

run ;

NOTE: There were 499999 observations read from the data set WORK.TRANSACTIONS.

NOTE: The data set WORK.HSORTED has 499999 observations and 3 variables.

NOTE: DATA statement used (Total process time):

real time 1.37 seconds

cpu time 1.32 seconds

*** SI: Hash table from upcoming V9 ;

data asorted ( keep = previd currid nextid ) ;

declare associativearray x (dataset: 'work.transactions', hashexp: 10) ; x.defineKey ('currid') ;

x.defineData ('nextid') ;

x.defineDone ( ) ;

do _n_ = 1 by 1 until (p = .) ;

set transactions (rename = (previd = p)) nobs = n ;

end ;

do _n_ = 1 to n ;

rc = x.find () ;

output ;

previd = currid ;

currid = nextid ;

end ;

stop ;

run ;

NOTE: There were 169360 observations read from the data set WORK.TRANSACTIONS.

NOTE: There were 499999 observations read from the data set WORK.TRANSACTIONS.

NOTE: The data set WORK.ASORTED has 499999 observations and 3 variables.

NOTE: DATA statement used (Total process time):

real time 2.78 seconds

cpu time 2.78 seconds


Back to: Top of message | Previous page | Main SAS-L page