LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2000, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 19 Jun 2000 11:08:08 -0400
Reply-To:     Mark.K.Moran@CCMAIL.CENSUS.GOV
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Mark Moran <Mark.K.Moran@CCMAIL.CENSUS.GOV>
Subject:      Percent Relative Differences
Comments: To: SAS-L@LISTSERV.VT.EDU
Content-type: text/plain; charset=us-ascii

Dear guardians of the SAS-L world, the goal for this 6.12 program is to calculate the percent relative difference (disregarding sign) between a contaminated weighted estimate and an uncontaminated weighted estimate. This would be the calculation

((abs(CONTAM - UNCONTAM))/UNCONTAM)*100.

For every value of a three-digit number, KB3, this calculation has to be repeated. These KB3's are likes-with-likes groupings of the CFN's. The contamination comes from values which imitate an old survey (more precisely, they look like they were sampled from the old distribution) instead of the current year's survey.

CFN is a record identifier. The contaminating data has a mean 250 and a standard deviation of 40. The good data is supposed to have a mean of 259 and a standard deviation of 80, but 5% of the records are taken from the contaminating distribution. A record that comes from the contaminating dataset in this program is called an "inlier." The program has no mistakes that are obvious to me, and there are no errors in the log; however, my output is:

100 observations with smallest percent relative difference

OBS CFN PERCDIFF UNCNTAMW CNTAMW

1 inlier . 898.82 18272.07 2 inlier . 25010.77 21913.24

What I expected were percentages like 7%, 6%, 5%, 2%, 2%. I am convinced there is a logic issue toward the end, because I think what I really need is two different things, and it's not there in the SAS. At the level of the KB3 grouping, what I seem to need is to know which KB's have an estimate whose percent relative difference from the uncontaminated estimate is bigger than a preset cutoff value and if so, call that KB " INLIER CONTAMINATED". At the record level, what I also seem to need is to know is which CFN's have a value different from their own KB's true mean, using the same percdiff comparison except between each record and the KB-specific mean. Any clues how SAS might be able to accomplish this?

Mark

137 138 ************************************************************/ 139 * SIMULATION PROGRAM */ 140 * */ 141 ************************************************************/ 142 143 options pageno=1 pagesize=54 linesize=90; 144 145 title1 "Simulation Program"; 146 147 * SIMULATE THE UNCONTAMINATED DATA ; 148 149 data work.noco(keep=cfn prewuncn); 150 retain seed3 7613936; 151 do cfn=1 to 1000; 152 prewuncn=259 + 80*normal(seed3); 153 output noco; 154 end; 155 run;

NOTE: The data set WORK.NOCO has 1000 observations and 2 variables. NOTE: The DATA statement used 0.07 seconds.

156 157 data work.nocontam (drop= wild1 wild2); 158 set work.noco; 159 wild1 = 2999+100*normal(729315); 160 KB3 = INT(substr(wild1,1,3)); 161 if KB3 < 10 then KB3= KB3 + 10; 162 if KB3 < 100 then KB3= KB3 +100; 163 wild2 = 3999 + 99*normal(5021931); 164 weight= 2*INT(substr(wild2,2,1)); 165 if weight = 0 then weight = 15; 166 uncntamw = prewuncn * weight; 167 run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 160:20 164:24 NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 160:13 164:17 NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 2 at 160:9 2 at 161:29 2 at 162:30 8 at 164:12 8 at 164:13 8 at 166:23 NOTE: The data set WORK.NOCONTAM has 1000 observations and 5 variables. NOTE: The DATA statement used 0.07 seconds.

168 169 proc sort data=work.nocontam; 170 by KB3; 171 run;

NOTE: The data set WORK.NOCONTAM has 1000 observations and 5 variables. NOTE: The PROCEDURE SORT used 0.14 seconds.

172 173 proc freq data=work.nocontam; 174 title2 'for the weighted uncontaminated data'; 175 table prewuncn uncntamw weight; 176 by KB3; 177 run;

NOTE: The PROCEDURE FREQ used 1.11 seconds.

178 179 proc means sum noprint maxdec=3 data=work.nocontam; 180 var uncntamw; 181 by KB3; 182 id cfn; 183 output out=work.sumnotam sum=; 184 run;

NOTE: The data set WORK.SUMNOTAM has 67 observations and 5 variables. NOTE: The PROCEDURE MEANS used 0.04 seconds.

185 186 proc sort data=work.sumnotam; 187 by KB3; 188 run;

NOTE: The data set WORK.SUMNOTAM has 67 observations and 5 variables. NOTE: The PROCEDURE SORT used 0.02 seconds.

189 190 * SIMULATE THE CONTAMINATED DATA as a mixture of two normal distributions; 191 * 5% with mean=250 and standard deviation=40; 192 * and 95% with mean=259 and standard deviation=80; 193 194 data work.co(keep=cfn prewcont); 195 retain seed1 7438215 seed2 9346811; 196 pastmean = 250; 197 do cfn=1 to 1000; 198 if (cfn le 50) 199 then prewcont=pastmean + 40*normal(seed1); 200 else prewcont=259 + 80*normal(seed2); 201 output co; 202 end; 203 run;

NOTE: The data set WORK.CO has 1000 observations and 2 variables. NOTE: The DATA statement used 0.06 seconds.

204 205 data work.contam (drop= wild3 wild4); 206 set work.co; 207 wild3 = 2189+105*normal(831015); 208 KB3 = INT(substr(wild3,1,3)); 209 if KB3 < 10 then KB3= KB3 + 10; 210 if KB3 < 100 then KB3= KB3 +100; 211 wild4 = 4009 + 222*normal(1329353); 212 weight= 2*INT(substr(wild4,2,1)); 213 if weight = 0 then weight = 15; 214 cntamw = prewcont * weight; 215 run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 208:20 212:24 NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 208:13 212:17 NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 3 at 208:9 3 at 209:29 3 at 210:30 16 at 212:12 16 at 212:13 16 at 214:21 NOTE: The data set WORK.CONTAM has 1000 observations and 5 variables. NOTE: The DATA statement used 0.19 seconds.

216 217 proc sort data=work.contam; 218 by KB3; 219 run;

NOTE: The data set WORK.CONTAM has 1000 observations and 5 variables. NOTE: The PROCEDURE SORT used 0.05 seconds.

220 221 proc freq data=work.contam; 222 title2 'for the weighted uncontaminated'; 223 table prewcont cntamw weight; 224 by KB3; 225 run;

NOTE: The PROCEDURE FREQ used 0.2 seconds.

226 227 proc means sum noprint maxdec=3 data=work.contam; 228 var cntamw; 229 by KB3; 230 id cfn; 231 output out=work.sumtam sum=; 232 run;

NOTE: The data set WORK.SUMTAM has 66 observations and 5 variables. NOTE: The PROCEDURE MEANS used 0.04 seconds.

233 234 proc sort data=work.sumtam; 235 by KB3; 236 run;

NOTE: The data set WORK.SUMTAM has 66 observations and 5 variables. NOTE: The PROCEDURE SORT used 0.02 seconds.

237 238 * Merge two datasets by id number(=cfn); 239 * and calculate percent relative difference between the two weighted values; 240 241 data work.both; 242 merge sumnotam(in=pure) 243 sumtam(in=impure); 244 by KB3; 245 if ((pure eq 1) and 246 (impure eq 1)) then do; 247 percdiff = ((abs (contam - uncontam))/uncontam) * 100.; 248 output both; 249 end; 250 run;

NOTE: Variable UNCONTAM is uninitialized. NOTE: Variable CONTAM is uninitialized. NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 2 at 247:19 2 at 247:31 2 at 247:43 2 at 247:54 NOTE: The data set WORK.BOTH has 2 observations and 9 variables. NOTE: The DATA statement used 0.05 seconds.

251 252 * Sort data in ascending order of this percent relative difference; 253 254 proc sort data=both; 255 by percdiff; 256 run;

NOTE: The data set WORK.BOTH has 2 observations and 9 variables. NOTE: The PROCEDURE SORT used 0.02 seconds.

257 258 * Create a format to identify observations; 259 * from distribution with mean=250 and standard deviation=80 in 2000; 260 261 proc format; 262 value cfnfmt 0-49='uncontam' 50-1000='inlier'; WARNING: Format CFNFMT is already on the library. NOTE: Format CFNFMT has been output. 263 run;

NOTE: The PROCEDURE FORMAT used 0.02 seconds.

264 265 * Print first 100 observations; 266 * with smallest percent relative differences in values; 267 268 proc print data=work.both(obs=100); 269 var cfn percdiff uncntamw cntamw; 270 title2 "100 observations with smallest percent relative difference"; 271 format cfn cfnfmt.; 272 run;

NOTE: The PROCEDURE PRINT used 0.05 seconds.


Back to: Top of message | Previous page | Main SAS-L page