Date: Fri, 15 Feb 2002 11:50:34 -0600
Reply-To: Jonathan Goldberg <jonathan@MATLOCK.WUSTL.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jonathan Goldberg <jonathan@MATLOCK.WUSTL.EDU>
Subject: Re: Tricky sampling problem
Content-Type: TEXT/PLAIN; charset=US-ASCII
Talbot Michael Katz <TopKatz@EMAIL.MSN.COM> wrote:
>... I got the impression from Brad's original posting
>that that wasn't an issue. Now, I may have misinterpreted what Brad
>wanted, but the problem with what you and some others have said about
>choosing the unique ids first is that you don't sample them with
>respect to their frequency in the initial population. Whereas, if you
>use the weighted sampling technique I suggested to choose the unique
>ids, you can always match this set of sample ids against the original
>(sorted) population and then choose a random record uniformly within
>each of the ids that were chosen on the basis of weight.
>
>By the way, as a side note to Jonathan Goldberg, you can choose an
>exact size random sample.
My take on Brad's problem is that he wants an exact-sized random sample of
transactions that is also an exact-sized sample of ids. There wasn't
anything about randomly picking ids and then getting a transaction from
each.
BTW, I am aware that it is straightforward to pick an exact-sized random
sample. I just don't see how you do both at once, at least in one
pass (as I mentioned, trimming is an option). If you randomly sample ids,
you'll get fluctuation in the exact number of transactions; if you sample
transactions, you don't exactly control ids. What am I missing?
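For reference, the exact-size part on its own can be sketched as below. This
is the usual sequential "pick with probability needed/remaining" selection,
not anything taken from Brad's or Talbot's code; the output data set name
sample_trans, the variables need and left, the sample size of 1000, and the
seed are placeholders chosen for illustration.

```sas
data sample_trans (drop = need left);
   retain need 1000;                 /* records still to pick (assumed size)   */
   retain left;                      /* records not yet examined               */
   if _n_ = 1 then left = total;     /* NOBS= value is available before SET    */
   set initial nobs = total;
   if ranuni(46) * left < need then do;   /* pick with probability need/left   */
      output;
      need = need - 1;
   end;
   left = left - 1;
   if need = 0 then stop;            /* quota filled, no need to read further  */
run;
```

Provided the input has at least 1000 observations, this should return exactly
1000 transactions in one pass, but the number of distinct ids among them is
whatever it happens to be, which is the difficulty described above.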
The attached code seems aimed at generating one transaction from each id in
a percent sample of ids, since it computes the sample size as
    sampsize = FLOOR(unicount * &pct.) ;
where unicount is the number of unique ids and &pct. is the proportion
desired, here set to .01.
I generated some test data like this:
    data initial;
       do i = 1 to 100000;
          if ranuni(432) > .9 then id + 1; /*about 1 id for 10 obs*/
          output;
       end;
       stop;
    run;
and ended up with 100000 "transactions" (naturally) and 10037 unique ids.
Running the rest of the (corrected) code against it produced a data set of
637 ids representing 12377 transactions. The final data set contained 637
transactions, 1 per id. This isn't .01 of either the ids (which would be
about 100) or the transactions (1000).
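A quick way to check counts like these, as an alternative to the PROC PRINT
with SUM used in the log, might be a PROC SQL step along the following lines.
It assumes the data sets initial and sampids and the variable idct
(transactions per id) exist as in the log; the column aliases are made up
for illustration.

```sas
proc sql;
   /* population: total transactions and distinct ids */
   select count(*)           as n_trans,
          count(distinct id) as n_ids
      from initial;

   /* sampled ids: how many were picked and how many
      transactions in the population they account for */
   select count(*)  as n_sample_ids,
          sum(idct) as n_trans_covered
      from sampids;
quit;
```

Dividing n_sample_ids by n_ids, and n_trans_covered by n_trans, gives the
realized sampling fractions to compare against &pct.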
The entire log is below.
Jonathan Goldberg
Missouri Alcoholism Research Center
Dept. of Psychiatry
Washington University School of Medicine
40 N. Kingshighway, Suite One
St. Louis, MO 63108
314-286-2212
1 *-- some test data;
2
3 data initial;
4 do i = 1 to 100000;
5 if ranuni(432) > .9 then id + 1; /*about 1 id for 10 obs*/
6 output;
7 end;
8 stop;
9 run;
NOTE: The data set WORK.INITIAL has 100000 observations and 2 variables.
NOTE: DATA statement used:
real time 0.26 seconds
cpu time 0.08 seconds
10
11 %let seed = 46 ; %* ranuni seed, choose your favorite ;
12 %let pct = 0.01 ; %* percent sample ;
13
14 * sort without deduping ;
15 PROC SORT DATA = initial
16 OUT = allsort ;
17 BY id ;
18 RUN ;
NOTE: There were 100000 observations read from the data set WORK.INITIAL.
NOTE: The data set WORK.ALLSORT has 100000 observations and 2 variables.
NOTE: PROCEDURE SORT used:
real time 0.37 seconds
cpu time 0.20 seconds
^L2 The SAS System 11:39 Friday, February
15,
2
19
20 * dedupe and add counts ;
21 DATA ddw ;
22 SET allsort
23 END = last
24 NOBS = allct ;
25 * assume NOBS works and gives correct count ;
26 BY id ;
27
28 KEEP id idct ;
29 RETAIN idct allcount ;
30 IF FIRST.id THEN DO ;
31 idct = 0 ;
32 IF _N_ = 1 THEN DO ;
33 * NOBS has some usage technicalities ;
34 allcount = allct ;
35 END ;
36 END ;
37 idct + 1 ;
38 unicount + 1 ;
39 IF LAST.id THEN DO ;
40 OUTPUT ;
41 IF last THEN DO ;
42 sampsize = FLOOR(unicount * &pct.) ;
43 * could use CEIL ;
44 unalrat = unicount / _n_ ;
45 CALL SYMPUT("sampsize",sampsize) ;
46 CALL SYMPUT("unicount",unicount) ;
47 CALL SYMPUT("unalrat",unalrat) ;
48 CALL SYMPUT("allcount",_n_) ;
49 END ; * last ;
50 END ; * last.id ;
51 RUN ;
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
45:27 46:27 47:26 48:27
NOTE: There were 100000 observations read from the data set WORK.ALLSORT.
NOTE: The data set WORK.DDW has 10037 observations and 2 variables.
NOTE: DATA statement used:
real time 0.19 seconds
cpu time 0.10 seconds
52
53 * choose a random sample of sampsize according to frequency ;
54 DATA sampids ;
55 KEEP id idct ;
56 SET ddw ;
57 RETAIN
58 sampsize &sampsize.
59 unicount &unicount.
60 ;
61 IF sampsize = unicount THEN DO ;
62 * account for very low probability event ;
63 compval = 1 ;
64 END ;
65 ELSE DO ;
66 * adjusted weight for picking current id ;
67 compval = idct * sampsize * &unalrat. / unicount ;
68 END ;
69 IF RANUNI(&seed.) LE compval THEN DO ;
70 * pick according to (adjusted) weight in initial population ;
71 OUTPUT ;
72 sampsize = sampsize - 1 ;
73 * weight adjustment of numerator for size exactness ;
74 IF sampsize = 0 THEN DO ;
75 * done, guaranteed to get here eventually ;
76 STOP ;
77 END ; * stop if ;
78 END ; * choose if ;
79 unicount = unicount - 1 ;
80 * weight adjustment of denominator for size exactness ;
81 RUN ;
NOTE: There were 10037 observations read from the data set WORK.DDW.
NOTE: The data set WORK.SAMPIDS has 637 observations and 2 variables.
NOTE: DATA statement used:
real time 0.07 seconds
cpu time 0.01 seconds
82
83 proc print data = sampids;
84 sum idct;
85 title 'test print of sampids data set';
86 run;
NOTE: There were 637 observations read from the data set WORK.SAMPIDS.
NOTE: The PROCEDURE PRINT printed pages 1-13.
NOTE: PROCEDURE PRINT used:
real time 0.01 seconds
cpu time 0.00 seconds
87 * match back to sorted population for choice of individuals
88 within each unique id group ;
89 DATA sample ;
90 KEEP id x1 x2 ;
91
92 RETAIN
93 prob
94 match 0
95 ;
96 MERGE
97 sampids (IN = ins)
98 allsort (IN = ina)
99 ;
100 BY id ;
101 IF ins THEN DO ;
102 * only need to check ins, ina guaranteed ;
103 IF FIRST.id THEN DO ;
104 prob = 1 / idct ;
105 match = 0 ;
106 END ;
107 IF match = 0 AND RANUNI(&seed.) LE prob
108 THEN DO ;
109 * equal chance to pick each obs for this id ;
110 OUTPUT ;
111 match = 1 ;
112 END ;
113 IF match = 0 AND LAST.id
114 THEN DO ;
115 * choose last in this id if none chosen yet ;
116 * could also use the adjustment algorithm from
117 previous data step ;
118 OUTPUT ;
119 END ;
120 END ; * ins ;
121 RUN ;
WARNING: The variable x1 in the DROP, KEEP, or RENAME list has never been
referenced.
WARNING: The variable x2 in the DROP, KEEP, or RENAME list has never been
referenced.
NOTE: There were 637 observations read from the data set WORK.SAMPIDS.
NOTE: There were 100000 observations read from the data set WORK.ALLSORT.
NOTE: The data set WORK.SAMPLE has 637 observations and 1 variables.
NOTE: DATA statement used:
real time 0.21 seconds
cpu time 0.16 seconds