```
Date:         Fri, 15 Feb 2002 11:50:34 -0600
Reply-To:     Jonathan Goldberg
Sender:       "SAS(r) Discussion"
From:         Jonathan Goldberg
Subject:      Re: Tricky sampling problem
Content-Type: TEXT/PLAIN; charset=US-ASCII

Talbot Michael Katz wrote:

>... I got the impression from Brad's original posting
>that that wasn't an issue.  Now, I may have misinterpreted what Brad
>wanted, but the problem with what you and some others have said about
>choosing the unique ids first is that you don't sample them with
>respect to their frequency in the initial population.  Whereas, if you
>use the weighted sampling technique I suggested to choose the unique
>ids, you can always match this set of sample ids against the original
>(sorted) population and then choose a random record uniformly within
>each of the ids that were chosen on the basis of weight.
>
>By the way, as a side note to Jonathan Goldberg, you can choose an
>exact size random sample.

My take on Brad's problem is that he wants an exact-sized random sample
of transactions that is also an exact-sized sample of ids.  There wasn't
anything about randomly picking ids and then getting a transaction from
each.

BTW, I am aware that it is straightforward to pick an exact-sized random
sample.  I just don't see how you do both at once, at least in one pass
(as I mentioned, trimming is an option).  If you randomly sample ids,
you'll get fluctuation in the exact number of transactions; if you
sample transactions, you don't exactly control ids.  What am I missing?

The attached code seems aimed at generating one transaction from a
percent sample of ids, since it computes sample size as

   sampsize = FLOOR(unicount * &pct.) ;

where unicount is the number of unique ids and &pct is the proportion
desired, here set to .01.

I generated some test data like this:

data initial;
   do i = 1 to 100000;
      if ranuni(432) > .9 then id + 1; /*about 1 id for 10 obs*/
      output;
   end;
   stop;
run;

and ended up with 100000 "transactions" (naturally) and 10037 unique
ids.  Running the rest of the (corrected) code against it produced a
data set of 637 ids representing 12377 transactions.  The final data set
contained 637 transactions, 1 per id.  This isn't .01 of either ids or
transactions.

The entire log is below.

Jonathan Goldberg
Missouri Alcoholism Research Center
Dept. of Psychiatry
Washington University School of Medicine
40 N. Kingshighway, Suite One
St. Louis, MO 63108
314-286-2212

1    *-- some test data;
2
3    data initial;
4       do i = 1 to 100000;
5          if ranuni(432) > .9 then id + 1; /*about 1 id for 10 obs*/
6          output;
7       end;
8       stop;
9    run;

NOTE: The data set WORK.INITIAL has 100000 observations and 2 variables.
NOTE: DATA statement used:
      real time           0.26 seconds
      cpu time            0.08 seconds

10
11   %let seed = 46 ; %* ranuni seed, choose your favorite ;
12   %let pct = 0.01 ; %* percent sample ;
13
14   * sort without deduping ;
15   PROC SORT DATA = initial
16      OUT = allsort ;
17      BY id ;
18   RUN ;

NOTE: There were 100000 observations read from the data set WORK.INITIAL.
NOTE: The data set WORK.ALLSORT has 100000 observations and 2 variables.
NOTE: PROCEDURE SORT used:
      real time           0.37 seconds
      cpu time            0.20 seconds

^L2 The SAS System 11:39 Friday, February 15, 2

19
20   * dedupe and add counts ;
21   DATA ddw ;
22      SET allsort
23         END = last
24         NOBS = allct ;
25      * assume NOBS works and gives correct count ;
26      BY id ;
27
28      KEEP id idct ;
29      RETAIN idct allcount ;
30      IF FIRST.id THEN DO ;
31         idct = 0 ;
32         IF _N_ = 1 THEN DO ;
33            * NOBS has some usage technicalities ;
34            allcount = allct ;
35         END ;
36      END ;
37      idct + 1 ;
38      unicount + 1 ;
39      IF LAST.id THEN DO ;
40         OUTPUT ;
41         IF last THEN DO ;
42            sampsize = FLOOR(unicount * &pct.) ;
43            * could use CEIL ;
44            unalrat = unicount / _n_ ;
45            CALL SYMPUT("sampsize",sampsize) ;
46            CALL SYMPUT("unicount",unicount) ;
47            CALL SYMPUT("unalrat",unalrat) ;
48            CALL SYMPUT("allcount",_n_) ;
49         END ; * last ;
50      END ; * last.id ;
51   RUN ;

NOTE: Numeric values have been converted to character values at the
      places given by: (Line):(Column).
      45:27   46:27   47:26   48:27
NOTE: There were 100000 observations read from the data set WORK.ALLSORT.
NOTE: The data set WORK.DDW has 10037 observations and 2 variables.
NOTE: DATA statement used:
      real time           0.19 seconds
      cpu time            0.10 seconds

52
53   * choose a random sample of sampsize according to frequency ;
54   DATA sampids ;
55      KEEP id idct ;
56      SET ddw ;
57      RETAIN
58         sampsize &sampsize.
59         unicount &unicount.

^L3 The SAS System 11:41 Friday, February 2

60      ;
61      IF sampsize = unicount THEN DO ;
62         * account for very low probability event ;
63         compval = 1 ;
64      END ;
65      ELSE DO ;
66         * adjusted weight for picking current id ;
67         compval = idct * sampsize * &unalrat. / unicount ;
68      END ;
69      IF RANUNI(&seed.) LE compval THEN DO ;
70         * pick according to (adjusted) weight in initial population ;
71         OUTPUT ;
72         sampsize = sampsize - 1 ;
73         * weight adjustment of numerator for size exactness ;
74         IF sampsize = 0 THEN DO ;
75            * done, guaranteed to get here eventually ;
76            STOP ;
77         END ; * stop if ;
78      END ; * choose if ;
79      unicount = unicount - 1 ;
80      * weight adjustment of denominator for size exactness ;
81   RUN ;

NOTE: There were 10037 observations read from the data set WORK.DDW.
NOTE: The data set WORK.SAMPIDS has 637 observations and 2 variables.
NOTE: DATA statement used:
      real time           0.07 seconds
      cpu time            0.01 seconds

82
83   proc print data = sampids;
84      sum idct;
85      title 'test print of sampids data set';
86   run;

NOTE: There were 637 observations read from the data set WORK.SAMPIDS.
NOTE: The PROCEDURE PRINT printed pages 1-13.
NOTE: PROCEDURE PRINT used:
      real time           0.01 seconds
      cpu time            0.00 seconds

87   * match back to sorted population for choice of individuals
88     within each unique id group ;
89   DATA sample ;
90      KEEP id x1 x2 ;
91
92      RETAIN
93         prob
94         match 0
95      ;

^L4 The SAS System 11:41 Friday, February 15, 2

96      MERGE
97         sampids (IN = ins)
98         allsort (IN = ina)
99      ;
100     BY id ;
101     IF ins THEN DO ;
102        * only need to check ins, ina guaranteed ;
103        IF FIRST.id THEN DO ;
104           prob = 1 / idct ;
105           match = 0 ;
106        END ;
107        IF match = 0 AND RANUNI(&seed.) LE prob
108        THEN DO ;
109           * equal chance to pick each obs for this id ;
110           OUTPUT ;
111           match = 1 ;
112        END ;
113        IF match = 0 AND LAST.id
114        THEN DO ;
115           * choose last in this id if none chosen yet ;
116           * could also use the adjustment algorithm from
117             previous data step ;
118           OUTPUT ;
119        END ;
120     END ; * ins ;
121  RUN ;

WARNING: The variable x1 in the DROP, KEEP, or RENAME list has never
         been referenced.
WARNING: The variable x2 in the DROP, KEEP, or RENAME list has never
         been referenced.
NOTE: There were 637 observations read from the data set WORK.SAMPIDS.
NOTE: There were 100000 observations read from the data set WORK.ALLSORT.
NOTE: The data set WORK.SAMPLE has 637 observations and 1 variables.
NOTE: DATA statement used:
      real time           0.21 seconds
      cpu time            0.16 seconds
```
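
The trade-off raised in the message (an exact number of ids does not pin down the number of transactions) can be checked against the same `initial` test data. The following is only a minimal sketch and is not part of the original thread. It draws an exact-size sample of ids by plain, unweighted sequential selection, picking each id with probability (ids still needed) / (ids still to be read), and then totals the transactions those ids carry. The names `idcounts`, `sampids2`, `k`, and `seed2` are hypothetical, and the values assigned to `k` and `seed2` are assumptions chosen for illustration.

```sas
%let k     = 100 ;   %* number of ids wanted, an assumed value ;
%let seed2 = 46 ;    %* ranuni seed, an assumed value ;

/* One row per id. COUNT holds the number of transactions for that id. */
PROC FREQ DATA = initial NOPRINT ;
   TABLES id / OUT = idcounts (KEEP = id count) ;
RUN ;

/* Unweighted sequential selection of exactly k ids (assumes k <= number of ids). */
DATA sampids2 ;
   SET idcounts NOBS = total ;
   RETAIN need &k. left ;
   DROP need left ;
   IF _N_ = 1 THEN left = total ;          /* ids not yet examined          */
   IF RANUNI(&seed2.) * left < need THEN DO ;
      OUTPUT ;                             /* keep this id                  */
      need = need - 1 ;                    /* one fewer id still to pick    */
   END ;
   left = left - 1 ;                       /* one fewer id still to examine */
   IF need = 0 THEN STOP ;
RUN ;

/* N is the number of ids chosen. SUM is the number of transactions they represent. */
PROC MEANS DATA = sampids2 N SUM ;
   VAR count ;
RUN ;
```

Rerunning with different values of `seed2` should leave N fixed at `k` while SUM moves around, which is the fluctuation described above. Because this sketch ignores id frequency, it does not reproduce the weighted selection in the logged `sampids` step.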
