Date: Mon, 28 Nov 2005 14:59:16 +0000
Reply-To: toby dunn <tobydunn@HOTMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: toby dunn <tobydunn@HOTMAIL.COM>
Subject: Re: Data Cleaning Books
In-Reply-To: <E7512ADD0779B74592AF6E0E74034716104056@qccd-krh-prod20>
Content-Type: text/plain; format=flowed
Ron,
Well basically you have three options in data cleaning:
1.) Check and Change the way you get the information (ie. make the user
intering the data only enter data that conforms to your standard). <Not an
option here>.
2.) Check and Change the data after the fact.
3.) A combionation of both.
Now once you have decided to check and clean the data you always have have
to break the problem down into two types: one for character and one for
numeric. While some aspects of checking will be the same each does have
their own special thingys that need checking.
A simple check would be this:
For characters:
Upcase, lowcase, or propcase all values
Left, center, or right align all values
Check against some valid set of values
Check distribution agianst some predefined distibution
For numeric:
Check against some predefined set of valid values,
or
You can start using things like checking the distribution or using standard
deviation from the mean and looking for outliers. Which if you really get
to reading gets pretty complicated.
You can also if one wanted to a couple of things to check for valid values
like check the file against itself or against another file known to have
good data or in some cases have the data supplier resend the data.
Now that, that is done you have to look at removing duplicate obeservations,
which can get complicated but the actual code is pretty simple in SAS (thank
God SAS got that right).
And finally you have to go through and look for duplicate information.
Which gets hairy when you start look into 'association rule based'
methodology. In the end you should get a well defined data structure that
at least is in 1st normal form and has all the duplicate info stripped out
and clean data.
Toby Dunn
From: "Fehd, Ronald J" <rjf2@CDC.GOV>
Reply-To: "Fehd, Ronald J" <rjf2@CDC.GOV>
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Data Cleaning Books
Date: Mon, 28 Nov 2005 09:28:56 -0500
> From: toby dunn
>
> Does anyone have any favorite data cleaning or Data Quality
> Management books
> other than Ron Cody's book that they would like to recommend?
> I think I have started going way beyond Ron's book.
Toby:
What is the scope of your Questions?
* how to identify stuff?
* what to do with this stuff?
* how to update the stuff in our data sets?
In my own work I resolved 80% of my interminable questions
by having
* the data collection form
* the data dictionary
* and a freq of all variables
see the quote, which is the summary
or head-slap
of my decade of data cleansing.
Ron Fehd the macro maven CDC Atlanta GA USA RJF2 at cdc dot gov
Your task is simple: remove the difference
between how things should be
and how they really are.
-- Ashleigh Brilliant pot-shot #4247
got user-defined formats?
then 80% of -your- job is done.
80% of -somebody- else's job is to review the reports.
%INVALID: a data review macro
using proc FORMAT option other=INVALID to identify and list outliers
http://www.pace.edu/nesug/proceedings/nesug01/at/At1008.pdf
PharmaSUG 2004
http://www.lexjansen.com/pharmasug/2004/DataManagement/DM06.pdf