Date: Mon, 19 Jun 2000 11:08:08 -0400
Reply-To: Mark.K.Moran@CCMAIL.CENSUS.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mark Moran <Mark.K.Moran@CCMAIL.CENSUS.GOV>
Subject: Percent Relative Differences
Content-type: text/plain; charset=us-ascii
Dear guardians of the SAS-L world, the goal for this 6.12 program is to
calculate the percent relative difference (disregarding sign)
between a contaminated weighted estimate and an uncontaminated weighted
estimate. This would be the calculation
((abs(CONTAM - UNCONTAM))/UNCONTAM)*100.
For every value of a three-digit number, KB3, this calculation has to be
repeated. These KB3's are likes-with-likes groupings of the CFN's.
The contamination comes from values which imitate an old survey (more precisely,
they look like they were sampled from the old
distribution) instead of the current year's survey.
CFN is a record identifier. The contaminating data has a mean 250 and a
standard deviation of 40. The good data is supposed to have
a mean of 259 and a standard deviation of 80, but 5% of the records are taken
from the contaminating distribution. A record that comes
from the contaminating dataset in this program is called an "inlier." The
program has no mistakes that are obvious to me, and there are
no errors in the log; however, my output is:
100 observations with smallest percent relative difference
OBS CFN PERCDIFF UNCNTAMW CNTAMW
1 inlier . 898.82
18272.07
2 inlier . 25010.77
21913.24
What I expected were percentages like 7%, 6%, 5%, 2%, 2%. I am convinced there
is a logic issue toward the end, because I think what I really
need is two different things, and it's not there in the SAS. At the level of
the KB3 grouping, what I seem to need is to know which KB's have an
estimate whose percent relative difference from the uncontaminated estimate is
bigger than a preset cutoff value and if so, call that KB " INLIER
CONTAMINATED". At the record level, what I also seem to need is to know is
which CFN's have a value different from their own KB's true
mean, using the same percdiff comparison except between each record and the
KB-specific mean. Any clues how SAS might be able to accomplish
this?
Mark
137
138 ************************************************************/
139 * SIMULATION PROGRAM */
140 * */
141 ************************************************************/
142
143 options pageno=1 pagesize=54 linesize=90;
144
145 title1 "Simulation Program";
146
147 * SIMULATE THE UNCONTAMINATED DATA ;
148
149 data work.noco(keep=cfn prewuncn);
150 retain seed3 7613936;
151 do cfn=1 to 1000;
152 prewuncn=259 + 80*normal(seed3);
153 output noco;
154 end;
155 run;
NOTE: The data set WORK.NOCO has 1000 observations and 2 variables.
NOTE: The DATA statement used 0.07 seconds.
156
157 data work.nocontam (drop= wild1 wild2);
158 set work.noco;
159 wild1 = 2999+100*normal(729315);
160 KB3 = INT(substr(wild1,1,3));
161 if KB3 < 10 then KB3= KB3 + 10;
162 if KB3 < 100 then KB3= KB3 +100;
163 wild2 = 3999 + 99*normal(5021931);
164 weight= 2*INT(substr(wild2,2,1));
165 if weight = 0 then weight = 15;
166 uncntamw = prewuncn * weight;
167 run;
NOTE: Numeric values have been converted to character values at the places given
by:
(Line):(Column).
160:20 164:24
NOTE: Character values have been converted to numeric values at the places given
by:
(Line):(Column).
160:13 164:17
NOTE: Missing values were generated as a result of performing an operation on
missing values.
Each place is given by: (Number of times) at (Line):(Column).
2 at 160:9 2 at 161:29 2 at 162:30 8 at 164:12 8 at 164:13 8 at
166:23
NOTE: The data set WORK.NOCONTAM has 1000 observations and 5 variables.
NOTE: The DATA statement used 0.07 seconds.
168
169 proc sort data=work.nocontam;
170 by KB3;
171 run;
NOTE: The data set WORK.NOCONTAM has 1000 observations and 5 variables.
NOTE: The PROCEDURE SORT used 0.14 seconds.
172
173 proc freq data=work.nocontam;
174 title2 'for the weighted uncontaminated data';
175 table prewuncn uncntamw weight;
176 by KB3;
177 run;
NOTE: The PROCEDURE FREQ used 1.11 seconds.
178
179 proc means sum noprint maxdec=3 data=work.nocontam;
180 var uncntamw;
181 by KB3;
182 id cfn;
183 output out=work.sumnotam sum=;
184 run;
NOTE: The data set WORK.SUMNOTAM has 67 observations and 5 variables.
NOTE: The PROCEDURE MEANS used 0.04 seconds.
185
186 proc sort data=work.sumnotam;
187 by KB3;
188 run;
NOTE: The data set WORK.SUMNOTAM has 67 observations and 5 variables.
NOTE: The PROCEDURE SORT used 0.02 seconds.
189
190 * SIMULATE THE CONTAMINATED DATA as a mixture of two normal distributions;
191 * 5% with mean=250 and standard deviation=40;
192 * and 95% with mean=259 and standard deviation=80;
193
194 data work.co(keep=cfn prewcont);
195 retain seed1 7438215 seed2 9346811;
196 pastmean = 250;
197 do cfn=1 to 1000;
198 if (cfn le 50)
199 then prewcont=pastmean + 40*normal(seed1);
200 else prewcont=259 + 80*normal(seed2);
201 output co;
202 end;
203 run;
NOTE: The data set WORK.CO has 1000 observations and 2 variables.
NOTE: The DATA statement used 0.06 seconds.
204
205 data work.contam (drop= wild3 wild4);
206 set work.co;
207 wild3 = 2189+105*normal(831015);
208 KB3 = INT(substr(wild3,1,3));
209 if KB3 < 10 then KB3= KB3 + 10;
210 if KB3 < 100 then KB3= KB3 +100;
211 wild4 = 4009 + 222*normal(1329353);
212 weight= 2*INT(substr(wild4,2,1));
213 if weight = 0 then weight = 15;
214 cntamw = prewcont * weight;
215 run;
NOTE: Numeric values have been converted to character values at the places given
by:
(Line):(Column).
208:20 212:24
NOTE: Character values have been converted to numeric values at the places given
by:
(Line):(Column).
208:13 212:17
NOTE: Missing values were generated as a result of performing an operation on
missing values.
Each place is given by: (Number of times) at (Line):(Column).
3 at 208:9 3 at 209:29 3 at 210:30 16 at 212:12 16 at 212:13
16 at 214:21
NOTE: The data set WORK.CONTAM has 1000 observations and 5 variables.
NOTE: The DATA statement used 0.19 seconds.
216
217 proc sort data=work.contam;
218 by KB3;
219 run;
NOTE: The data set WORK.CONTAM has 1000 observations and 5 variables.
NOTE: The PROCEDURE SORT used 0.05 seconds.
220
221 proc freq data=work.contam;
222 title2 'for the weighted uncontaminated';
223 table prewcont cntamw weight;
224 by KB3;
225 run;
NOTE: The PROCEDURE FREQ used 0.2 seconds.
226
227 proc means sum noprint maxdec=3 data=work.contam;
228 var cntamw;
229 by KB3;
230 id cfn;
231 output out=work.sumtam sum=;
232 run;
NOTE: The data set WORK.SUMTAM has 66 observations and 5 variables.
NOTE: The PROCEDURE MEANS used 0.04 seconds.
233
234 proc sort data=work.sumtam;
235 by KB3;
236 run;
NOTE: The data set WORK.SUMTAM has 66 observations and 5 variables.
NOTE: The PROCEDURE SORT used 0.02 seconds.
237
238 * Merge two datasets by id number(=cfn);
239 * and calculate percent relative difference between the two weighted
values;
240
241 data work.both;
242 merge sumnotam(in=pure)
243 sumtam(in=impure);
244 by KB3;
245 if ((pure eq 1) and
246 (impure eq 1)) then do;
247 percdiff = ((abs (contam - uncontam))/uncontam) * 100.;
248 output both;
249 end;
250 run;
NOTE: Variable UNCONTAM is uninitialized.
NOTE: Variable CONTAM is uninitialized.
NOTE: Missing values were generated as a result of performing an operation on
missing values.
Each place is given by: (Number of times) at (Line):(Column).
2 at 247:19 2 at 247:31 2 at 247:43 2 at 247:54
NOTE: The data set WORK.BOTH has 2 observations and 9 variables.
NOTE: The DATA statement used 0.05 seconds.
251
252 * Sort data in ascending order of this percent relative difference;
253
254 proc sort data=both;
255 by percdiff;
256 run;
NOTE: The data set WORK.BOTH has 2 observations and 9 variables.
NOTE: The PROCEDURE SORT used 0.02 seconds.
257
258 * Create a format to identify observations;
259 * from distribution with mean=250 and standard deviation=80 in 2000;
260
261 proc format;
262 value cfnfmt 0-49='uncontam' 50-1000='inlier';
WARNING: Format CFNFMT is already on the library.
NOTE: Format CFNFMT has been output.
263 run;
NOTE: The PROCEDURE FORMAT used 0.02 seconds.
264
265 * Print first 100 observations;
266 * with smallest percent relative differences in values;
267
268 proc print data=work.both(obs=100);
269 var cfn percdiff uncntamw cntamw;
270 title2 "100 observations with smallest percent relative difference";
271 format cfn cfnfmt.;
272 run;
NOTE: The PROCEDURE PRINT used 0.05 seconds.