LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2000, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 6 Jun 2000 08:24:38 -0400
Reply-To:   Mark.K.Moran@CCMAIL.CENSUS.GOV
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Mark Moran <Mark.K.Moran@CCMAIL.CENSUS.GOV>
Subject:   Inliers Simulation Program
Comments:   To: SAS-L@LISTSERV.VT.EDU
Content-type:   text/plain; charset=us-ascii

I have had help from Matthew Zack with the following program. Recently sent him a message but didn't hear back from him. I am trying to simulate a data quality problem. Think about data from income taxes (which this is not, but it has similarities). An outlier would be data that is extreme, compared to everyone else's data. However, oftentimes a tax filer will draw their numbers from their previous year's filing. This is a helpful practice to the extent that the new year's data matches the previous year's data. However, at some point the numbers become fictitious and bad, just like an outlier -- except that these numbers look *like* most other numbers rather than appear among the extremes. In our case, we have annual data (like income tax), from a survey, and the goal is to weight the data and add up the data within categories which are defined by values of a variable called KB3. The weighted total within the KB3 category is called the "Estimate" ... obviously there are as many of these as there are KB3 categories. CFN is a record identifier. The goal is to simulate the inliers problem (having values wrongly copied from the previous year). The program below creates an artificial dataset for a previous year, and another one for the present year. The older data has a mean 250 and a standard deviation of 80. The new data is supposed to have a mean of 259 and a standard deviation of 80, but 5% of the records are taken from a distribution with mean of 250 (like the older data) except with a smaller standard deviation (because trying to look inconspicuous I would think reduces the dispersion in the data), in this case 40. In the program, where it says "STARTING HERE" at that point the code takes each record (each CFN) and takes the difference between the value for it from the older dataset and the value for it from the newer dataset. A colleague criticized what I was doing by saying that I should have the same variable, y1 or y2, in each dataset. I do not understand her point of view, however I know there are a couple problems. One problem is that the comparison should not be between the older data and the newer data, but between the newer data with the contamination and the newer data without the contamination. Wouldn't that be the case? Also, the data to be used must be the weighted data, not the unweighted values.

It seems what I really need to do is to compare the WEIGHTED TOTAL OF ALL Y's WITHIN a particular KB3 value, from the contaminated new data, to the WEIGHTED TOTAL OF ALL Y's WITHIN that same KB3 value, from an uncontaminated new dataset. I also want to know what is the effect on the correct standard deviation. Any insights how SAS could do this? Any logic and/or programming insights are quite welcome.

Mark

************************************************************/ * SIMULATION PROGRAM */ *

/ ************************************************************/

options pageno=1 pagesize=54 linesize=90;

title1 "Outliers and Inliers Simulation";

* Simulate Older Year data with mean=250 and standard deviation=80;

data work.past_yrb(keep=cfn y1); retain seed 5834409; do cfn=1 to 1000; y1=250 + 80*normal(seed); output past_yrb; end; run;

data work.past_yr (drop= wild1 wild2); set work.past_yrb; wild1 = 2999+100*normal(729315); KB3 = INT(substr(wild1,1,3)); if KB3 < 10 then KB3= KB3 + 10; if KB3 < 100 then KB3= KB3 +100; wild2 = 3999 + 99*normal(5021931); weight= 2*INT(substr(wild2,2,1)); if weight = 0 then weight = 15; y1_wtd = y1 * weight; run;

proc sort data=work.past_yr; by KB3; run;

proc freq data=work.past_yr; title2 'for the past year'; table y1 y1_wtd weight; by KB3; run;

proc means sum noprint maxdec=3 data=work.past_yr; var y1_wtd; by KB3; id cfn; output out=work.margpast sum=; run;

proc sort data=work.past_yr; by CFN; run;

* Simulate Newer Year data as a mixture of two normal distributions; * 5% with mean=250 and standard deviation=40; * and 95% with mean=259 and standard deviation=80;

data work.pres_yc(keep=cfn y2); retain seed1 7438215 seed2 9346811; do cfn=1 to 1000; if (cfn le 50) then y2=250 + 40*normal(seed1); else y2=259 + 80*normal(seed2); output pres_yc; end; run;

data work.pres_yr (drop= wild3 wild4); set work.pres_yc; wild3 = 2189+105*normal(831015); KB3 = INT(substr(wild3,1,3)); if KB3 < 10 then KB3= KB3 + 10; if KB3 < 100 then KB3= KB3 +100; wild4 = 4009 + 222*normal(1329353); weight= 2*INT(substr(wild4,2,1)); if weight = 0 then weight = 15; y1_wtd = y1 * weight; run;

proc sort data=work.pres_yr; by KB3; run;

proc freq data=work.pres_yr; title2 'for the present year'; table y1 y1_wtd weight; by KB3; run;

proc means sum noprint maxdec=3 data=work.pres_yr; var y1_wtd; by KB3; id cfn; output out=work.margpres sum=; run;

proc sort data=work.pres_yr; by cfn; run;

* Merge two datasets by id number(=cfn); * and calculate absolute values of difference between the two values; * STARTING HERE ;

data work.both; merge past_yr(in=sim1999) pres_yr(in=sim2000); by cfn; if ((sim1999 eq 1) and (sim2000 eq 1)) then do; absdiff=abs(y2-y1); output both; end; run;

* Sort data in ascending order of this absolute difference;

proc sort data=both; by absdiff; run;

* Create a format to identify observations; * from distribution with mean=250 and standard deviation=80 in 2000;

proc format; value cfnfmt 0-49="No change"; run;

* Print first 100 observations; * with smallest absolute differences in values;

proc print data=work.both(obs=100); var cfn absdiff y1 y2; title2 "100 observations with smallest absolute difference"; format cfn cfnfmt.; run;

>>>>>>>>>>>>>>>>>>>> END OF PROGRAM<<<<<<<<<<


Back to: Top of message | Previous page | Main SAS-L page