Date: Tue, 23 Sep 2008 16:23:14 +0200
Reply-To: Marta García-Granero <mgarciagranero@gmail.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Marta García-Granero <mgarciagranero@gmail.com>
Subject: Re: IQR and outliers
Chris Vaughan wrote:
> I need some help with syntax, please. I have a dataset with a number of
> variables representing different test scores. I am trying to exclude outliers
> based upon a 1.5 times IQR cut-off, for each individual variable separately.
> I don't want to exclude the entire case, just that case's value for the one
> variable (e.g., test score). Unfortunately, this has to be done separately
> for each of the 88 different variables (11 vars by 8 age groups). Obviously
> I could use the Explore procedure, gather the 3rd and 1st quartile points,
> then calculate IQR and 1.5 * IQR and select the cases between the cut-offs.
> The problem is that I don't want to have to do this separately 88 times after
> manually entering each quartile score into the syntax. I would also like to be
> able to re-run this command after additional data are added.
>
> Any help on how to get this done would be greatly appreciated. I have
> attempted to use some similar syntax from the archive list that uses mean and
> sd as the basis for exclusion; however, I don't know what the correct terms
> are for SPSS to look for the 1st and 3rd quartiles, and so I haven't been
> able to translate it.
>
>
First of all, the wisdom of eliminating outliers has been discussed
several times, and the general idea is that it shouldn't be done lightly.
Now, concerning your question: I'm sure you can modify this macro to work
with the whole list of variables (I'm rather busy right now, so I've
written something quickly; it is not the best possible answer). It works
on a single quantitative variable and a single grouping variable. It takes
the quartiles from the "Tukey's Hinges" row of the EXAMINE percentiles
table, and any value more than 1.5 times the IQR below the lower hinge or
above the upper hinge is set to system-missing.
WARNING: outliers will be replaced by missing values, so you should keep
an untouched copy of the dataset!
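One simple way to keep that untouched copy, as a minimal sketch: save the
active dataset to a separate file before calling the macro. The file name
here is hypothetical; use whatever path suits you, and save the cleaned
data under a different name afterwards.

```spss
* Hypothetical file name: a backup made BEFORE any outliers are removed.
SAVE OUTFILE='scores_before_cleaning.sav'.
```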
DEFINE CleanData(!POS !TOKENS(1) / !POS !TOKENS(1)).
* This is needed for matching percentile data later to dataset *.
PRESERVE.
SET TNumbers=Values.
SET OLANG=ENGLISH.
DATASET NAME OriginalData.
DATASET DECLARE Percentiles.
OMS
  /SELECT TABLES
  /IF COMMANDS = ["Explore"]
      SUBTYPES = ["Percentiles"]
  /DESTINATION FORMAT = SAV
      OUTFILE = Percentiles.
EXAMINE
  VARIABLES=!1 BY !2
  /PLOT NONE
  /PERCENTILES(25,50,75)
  /STATISTICS NONE
  /MISSING PAIRWISE
  /NOTOTAL.
OMSEND.
DATASET ACTIVATE Percentiles.
SELECT IF (Var1 = "Tukey's Hinges").
EXE./* Needed before any "DELETE VARIABLES" *.
DELETE VARIABLES Command_ TO Var2 @50.
RENAME VARIABLES Var3=!2.
DATASET ACTIVATE OriginalData.
SORT CASES BY !2(A).
MATCH FILES /FILE=*
  /TABLE='Percentiles'
  /BY !2.
DATASET CLOSE Percentiles.
COMPUTE IQR1.5=1.5*(@75-@25).
COMPUTE Lower=@25-IQR1.5.
COMPUTE Upper=@75+IQR1.5.
IF (!1 LT Lower) OR (!1 GT Upper) !1=$SYSMIS.
EXE.
DELETE VARIABLES @25 TO Upper.
RESTORE.
!ENDDEFINE.
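To run it over a whole list of variables, one possibility (a rough sketch,
not tested on your data) is a second macro that loops over the list and
calls CleanData once per variable. The macro name CleanAll and its keywords
vars and by are my own invention for this example. The variables must be
listed explicitly, because the loop treats every token as a variable name
(so "var1 TO var11" would not work here).

```spss
* Hypothetical wrapper: calls CleanData for each variable in the list.
* vars = list of quantitative variables, ended by a slash.
* by   = the single grouping variable.
DEFINE CleanAll(vars = !CHAREND('/') / by = !TOKENS(1)).
!DO !v !IN (!vars)
CleanData !v !by.
!DOEND
!ENDDEFINE.
```

A call would then look like "CleanAll vars = score1 score2 score3 / by =
agegroup." (these variable names are placeholders). CleanData must already
be defined when CleanAll is called.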
* I have used "1991 U.S. General Social Survey.sav", variables age (as
quantitative) & race (as grouping),
and since there were no outliers in the three race groups, I added
some false data at the end of the file (clear outliers)
to be sure that they were detected and cleaned correctly * .
CleanData age race.
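A quick way to see what the macro did (just a suggestion, using the same
two example variables) is to run a small summary before and after the call
and compare the results: the valid N in each group drops by the number of
values that were set to missing, and the minimum and maximum move inside
the cut-offs.

```spss
* Run once before and once after CleanData, then compare the two tables.
MEANS TABLES=age BY race
  /CELLS=COUNT MIN MAX.
```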
HTH,
Marta García-Granero
--
For miscellaneous statistical stuff, visit:
http://gjyp.nl/marta/