| Date: | Thu, 13 Jan 2000 13:45:45 -0800 |
| Reply-To: | David Cassell <cassell@MERCURY.COR.EPA.GOV> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | David Cassell <cassell@MERCURY.COR.EPA.GOV> |
| Organization: | OAO Corp. |
| Subject: | Re: outliers |
| Content-Type: | text/plain; charset=us-ascii |
ray wrote:
> Paige Miller wrote:
> > No. In fact, there is no generally agreed upon method for identifying
> > outliers, nor is there any way to decide whether or not they should be
> > dropped without using subject matter knowledge.
>
> Dear Paige: Thanks for the answer and I absolutely agree with it. In my
> case, I am simply trying to replicate a previous result which dropped obs
> for certain variable values that were greater than 3 stds from the mean.
[I re-arranged and trimmed so it was easier to read - blame me if
anything is amiss.]
Ray, if all you want to do is check whether values are within 3 sd of
the sample mean, you can do that with a PROC MEANS and a DATA step:
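* Compute the sample mean and standard deviation of yourvar;
PROC MEANS DATA=yourdata NOPRINT;
  VAR yourvar;
  OUTPUT OUT=OUTVAR MEAN=SAMPMEAN STD=SAMPSTD;
RUN;
* Read the two statistics once, then read the data and drop rows beyond 3 sd;
DATA NEW;
  RETAIN SAMPMEAN SAMPSTD;
  IF _N_=1 THEN SET OUTVAR(KEEP = SAMPMEAN SAMPSTD);
  SET yourdata;
  IF ABS(yourvar - SAMPMEAN) > 3*SAMPSTD THEN DELETE;
RUN;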
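If you would rather do it in one step, here is a rough sketch with
PROC SQL. It assumes the same placeholder names (yourdata, yourvar) and
relies on PROC SQL remerging the summary statistics back onto each row,
which the log reports with a NOTE. Treat it as an alternative to try,
not as the method the previous result used:

```sas
/* Sketch only: yourdata and yourvar are placeholders for your own names */
PROC SQL;
  CREATE TABLE NEW AS
  SELECT *
  FROM yourdata
  /* keep rows whose distance from the sample mean is at most 3 sd */
  HAVING ABS(yourvar - MEAN(yourvar)) <= 3*STD(yourvar);
QUIT;
```

With no GROUP BY, MEAN() and STD() are computed over the whole table, so
this should keep the same rows as the PROC MEANS / DATA step version.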
I think that's what you asked for. But that's fairly naive, and may
have any number of drawbacks [as you agreed above]. If you decide to
evaluate the performance of said previous result, you may want to
look at some papers on outlier detection, like:
Rosner 1975 Technometrics #17
Tietjen & Moore 1972 Technometrics #14
Walsh 1950 Annals of Math. Stat. #21
Walsh 1958 Annals of the Inst. of Stat. Math #10
Those are the ones I found taking a fast look in my reference lists,
but this is hardly comprehensive. The bottom line: if you have a
mixture of distributions, nothing may do the job well.
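To make one drawback of the 3-sd rule concrete: an extreme value inflates
the very standard deviation it is judged against. The sketch below uses a
made-up data set (TINY, TINYSTAT, TINYZ and Z are hypothetical names for
illustration only) with nine values of 10 and one value of 1000. The
sample mean is 109 and the sample sd is about 313, so the value 1000 sits
only about 2.85 sd from the mean and the rule keeps it.

```sas
/* Hypothetical toy data: nine values of 10 and one value of 1000 */
DATA TINY;
  INPUT yourvar @@;
  DATALINES;
10 10 10 10 10 10 10 10 10 1000
;
RUN;

/* Same summary step as before, applied to the toy data */
PROC MEANS DATA=TINY NOPRINT;
  VAR yourvar;
  OUTPUT OUT=TINYSTAT MEAN=SAMPMEAN STD=SAMPSTD;
RUN;

/* Standardized distance of each value from the sample mean */
DATA TINYZ;
  IF _N_=1 THEN SET TINYSTAT(KEEP = SAMPMEAN SAMPSTD);
  SET TINY;
  Z = (yourvar - SAMPMEAN) / SAMPSTD;  /* about 2.85 for the value 1000 */
RUN;

PROC PRINT DATA=TINYZ;
RUN;
```

In general, with n observations and the usual n-1 divisor, no value can be
more than (n-1)/sqrt(n) sample sd from the sample mean. For n=10 that is
about 2.85, so with ten or fewer observations the rule can never drop
anything.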
David
--
David Cassell, OAO cassell@mail.cor.epa.gov
Senior Computing Specialist
mathematical statistician