Date: Wed, 10 Apr 2002 11:48:18 -0700
Reply-To: "Grichuhin, Theodore J" <tgrichuh@FHCRC.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Grichuhin, Theodore J" <tgrichuh@FHCRC.ORG>
Subject: Re: hospital charge data
Content-Type: text/plain; charset="iso-8859-1"
In the older datasets, circa 1990, a value of 999999.99 meant that the claim
is a cost outlier and that you should look for a continuation record, which
will have the amount over 999,999.99.
There is a separate field that flags these records.
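A minimal sketch of keeping those capped values out of the ordinary statistics, in Python rather than SAS. The record layout and the field name "charge" are assumptions made up for illustration, and the sketch does not try to combine a capped record with its continuation record, since that layout is not described here.

```python
from decimal import Decimal

# Cap value described above; treated as a marker, not a real charge.
CAP = Decimal("999999.99")

def split_capped(records):
    """Separate records whose charge equals the cap from ordinary records.

    `records` is a list of dicts with a hypothetical "charge" key.
    Returns (ordinary, capped).
    """
    ordinary, capped = [], []
    for rec in records:
        # Go through str() so floats such as 999999.99 compare exactly.
        if Decimal(str(rec["charge"])) == CAP:
            capped.append(rec)
        else:
            ordinary.append(rec)
    return ordinary, capped
```

The capped records can then be reviewed alongside the flag field, while means and percentiles are computed on the ordinary records only.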
-----Original Message-----
From: Robert Virgile [mailto:virgile@ATTBI.COM]
Sent: Wednesday, April 10, 2002 10:30
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: hospital charge data
Frank,
Some discussion has already taken place here, but I'll add a couple of
ideas.
First, you may want to distinguish between cleaning your data vs. finding
outliers. What does 9999999 really mean? What does 0 really mean?
Second, a practical approach might work backwards. How many data points do
you really want to check? It's easy enough for proc univariate to find the
99th percentile for each variable. Is checking 1% of the data values too
much work? In similar terms, if mean + 3 SD generates too many points to
check, then change it to mean + 4 SD. Alternatively, if the data points are
largely invalid using your initial cutoff method, then relax it to include
more data points.
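
A minimal sketch of that working-backwards idea, in Python rather than SAS, assuming the charges for one variable are already in a plain list. The function names and the max_to_check budget are made up for illustration, and the percentile here uses simple linear interpolation, which may not match the definition proc univariate uses.

```python
import statistics

def percentile(values, pct):
    """Return the pct-th percentile (0-100) by linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def flag_above_sd(values, k):
    """Return (cutoff, flagged) where cutoff = mean + k * sample SD."""
    cutoff = statistics.mean(values) + k * statistics.stdev(values)
    return cutoff, [v for v in values if v > cutoff]

def choose_k(values, max_to_check, start_k=3, max_k=10):
    """Raise k from start_k until no more than max_to_check values are flagged.

    Returns (k, cutoff, flagged). If even max_k flags too many values,
    the max_k result is returned so the caller can decide what to do.
    """
    k = start_k
    cutoff, flagged = flag_above_sd(values, k)
    while len(flagged) > max_to_check and k < max_k:
        k += 1
        cutoff, flagged = flag_above_sd(values, k)
    return k, cutoff, flagged
```

For example, choose_k(charges, max_to_check=200) starts at mean + 3 SD and tightens the cutoff only if more than 200 values would need checking.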
Good luck.
Bob V.
-----Original Message-----
From: Frank Schiffel <SchifF@DHSS.STATE.MO.US>
Newsgroups: bit.listserv.sas-l
To: SAS-L@LISTSERV.UGA.EDU <SAS-L@LISTSERV.UGA.EDU>
Date: Tuesday, April 09, 2002 4:38 PM
Subject: hospital charge data
>we're trying to determine outliers in a data set of a few million variables.
>
>obviously there are pure errors, some high values, and some things that we just don't want to report because they're not meaningful.
>
>it's not a nice Gaussian distribution; there is some skewness in the data.
>
>what's a good way to do this? put a floor at whatever the insurance pays for ER visits (say $50), look at the mean plus 3 SD? cap at the 1% and 99% in proc univariate? (obviously I'm running out of ideas)
>
>I haven't seen how this is dealt with nationally in some of the analyses (sometimes they just sample and don't do anything, assuming their large n will cover it). we're going to report at a county level and some are pretty small. plus once you slice and dice data, you know how that goes. we'll do some demographics on it also in the reporting.
>
>it helps we can't legally report n less than 20 for an average value.
>
>but the cleaning is a real problem.
>
>any comments or suggestions would be helpful.
>
>
>Frank Schiffel, Research Analyst III
>Bureau of Health Care Performance Monitoring
>Center for Health Information Management and Evaluation
>PO Box 570
>Jefferson City, MO 65102-0570
>
>573 751-6279