Date: Fri, 11 Jun 2004 15:02:58 -0400
Reply-To: Steve Albert <salbert@AOL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Steve Albert <salbert@AOL.COM>
Subject: Re: Truncating Data series
MS,
You probably don't want to actually delete those records, just omit them
from the analysis.
Here are a few approaches that come to mind:
1. Use proc univariate to generate the 1st and 99th percentile values,
then either merge those values onto your data or hard code them, so you can
use where-clauses to restrict the data to the range you're looking for.
2. Sort the data on your key variable, then assign every record a
percentile:
data withpctl;
set sorteddata nobs=ntotal; * nobs= supplies the record count, so nothing is hard coded;
frac=_n_/ntotal; * fractional rank, running from 1/ntotal up to 1;
run;
Now you can use where clauses to trim 1%, or 5%, or .01%, or whatever you
want; e.g.
%let lowlim=.01;
%let uplim=.99;
proc whatever data=withpctl(where=(&lowlim < frac and frac < &uplim));
*proc details;
title3 "Trimmed data -- lower pctile &lowlim, upper pctile &uplim";
run;
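
Here is a minimal sketch of approach 1, assuming a dataset called MYDATA and a key variable X (both names are placeholders, not taken from the original question). PROC UNIVARIATE writes the 1st and 99th percentiles to a one-record dataset, which is then read alongside every record so that a flag can drive the where-clause:

```sas
* Hypothetical names: dataset MYDATA, key variable X;
proc univariate data=mydata noprint;
   var x;
   output out=limits pctlpts=1 99 pctlpre=p;  * creates variables P1 and P99;
run;

data flagged;
   if _n_ = 1 then set limits;   * read the one-record limits dataset once, values are retained;
   set mydata;
   inrange = (p1 < x < p99);     * 1 if strictly inside the limits, else 0 (missing X gives 0);
run;

* Nothing is deleted: the flag just restricts what the procedure sees;
proc means data=flagged(where=(inrange=1)) n mean std min max;
   var x;
run;
```

Note that the strict inequalities drop any records tied exactly at the percentile values; use <= if you would rather keep them.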
I'm assuming that there's only one variable of concern for the
trimming, though the second method is readily extended to trimming on
more than one dimension. It also lets you readily investigate the
robustness of the results to changes in your trimming rule. (You might
also want to see what Winsorizing does; see the recent thread on how to do
that.)
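
One possible way to extend this to two dimensions is sketched below. It swaps the sort-and-count step for PROC RANK with the FRACTION option, which assigns each variable its own fractional rank in a single pass; that is a substitute technique, not the method described above. The dataset MYDATA and the variables X and Y are placeholders.

```sas
%let lowlim=.01;
%let uplim=.99;

* Hypothetical names: dataset MYDATA, variables X and Y;
* FRACTION divides each rank by the number of nonmissing values of that variable;
proc rank data=mydata out=ranked fraction;
   var x y;
   ranks fx fy;   * FX and FY hold the fractional ranks of X and Y;
run;

* Keep only records that fall inside the limits on both variables;
* Records with a missing X or Y have a missing rank and are excluded;
proc means data=ranked(where=(&lowlim < fx and fx < &uplim
                          and &lowlim < fy and fy < &uplim)) n mean std;
   var x y;
run;
```

Tied values get the mean of their ranks by default, so they are kept or dropped together.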
By the way, I'd suggest you do some exploration to see how sensitive any
results are to your trimming. If the results are very sensitive to your
treatment of outliers, then I'd recommend you look at the data very
carefully before drawing any conclusions.
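
A rough sketch of that kind of sensitivity check, assuming the WITHPCTL dataset built above (with its FRAC variable) and a placeholder analysis variable X. The macro name TRIMCHECK and its parameters are made up for illustration, and PROC MEANS stands in for whatever analysis you are really running:

```sas
%macro trimcheck(data=withpctl, var=x, cuts=0 .001 .01 .05);
   %local i cut upper;
   %let i = 1;
   /* A blank is the only delimiter, so the decimal points stay inside each value */
   %let cut = %scan(&cuts, &i, %str( ));
   %do %while (%length(&cut) > 0);
      %let upper = %sysevalf(1 - &cut);
      title3 "Trimming &cut from each tail";
      /* A cut of 0 keeps every record, which gives the untrimmed baseline */
      proc means data=&data(where=(&cut < frac and frac <= &upper)) n mean std min max;
         var &var;
      run;
      %let i = %eval(&i + 1);
      %let cut = %scan(&cuts, &i, %str( ));
   %end;
   title3;   /* clear the title when done */
%mend trimcheck;

%trimcheck(data=withpctl, var=x)
```

If the means and standard deviations move a lot between the runs, that is the signal to go back and look at the extreme records themselves.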
Steve Albert