LISTSERV at the University of Georgia
Date:         Mon, 28 Nov 2005 14:59:16 +0000
Reply-To:     toby dunn <tobydunn@HOTMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         toby dunn <tobydunn@HOTMAIL.COM>
Subject:      Re: Data Cleaning Books
Comments: To: rjf2@CDC.GOV
In-Reply-To:  <E7512ADD0779B74592AF6E0E74034716104056@qccd-krh-prod20>
Content-Type: text/plain; format=flowed


Well, basically you have three options in data cleaning: 1.) Check and change the way you get the information (i.e., make the user entering the data enter only data that conforms to your standard). <Not an option here>.

2.) Check and Change the data after the fact.

3.) A combination of both.

Now, once you have decided to check and clean the data, you have to break the problem down into two types: one for character data and one for numeric. While some aspects of checking will be the same, each has its own special things that need checking.

Some simple checks would be these:

For characters:
* Upcase, lowcase, or propcase all values
* Left, center, or right align all values
* Check against some valid set of values
* Check the distribution against some predefined distribution
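The character checks above can be sketched in a data step like the following (the data set MYDATA and the variables NAME and STATE are hypothetical, as is the valid-value list):

```sas
data check;
   set mydata;                        /* hypothetical input data set   */
   name  = propcase(strip(name));     /* standardize case, left-align  */
   state = upcase(strip(state));      /* force a single case for tests */
   /* flag values outside the valid set in the log */
   if state not in ('TX','GA','NY') then
      put 'NOTE: invalid STATE ' state= _n_=;
run;
```

PROC FREQ on the cleaned variables then gives you the distribution check against whatever you expected to see.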

For numerics:
* Check against some predefined set of valid values, or
* Check the distribution, or use standard deviations from the mean to look for outliers -- which, if you really get into the reading, gets pretty complicated.
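A minimal sketch of the standard-deviation outlier check (MYDATA and WEIGHT are hypothetical; the 3-sigma cutoff is just one common choice):

```sas
/* compute mean and std dev once */
proc means data=mydata noprint;
   var weight;
   output out=stats mean=mu std=sigma;
run;

/* flag observations more than 3 std devs from the mean */
data outliers;
   if _n_ = 1 then set stats;   /* load mu and sigma onto the PDV */
   set mydata;
   if abs(weight - mu) > 3*sigma then output;
run;
```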

You can also, if one wanted, do a couple of things to check for valid values, like checking the file against itself or against another file known to have good data, or in some cases having the data supplier resend the data.

Now that that is done, you have to look at removing duplicate observations, which can get complicated -- but the actual code is pretty simple in SAS (thank God SAS got that right).
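The simple SAS code in question is PROC SORT with its duplicate-removal options (MYDATA and the key variable ID are hypothetical):

```sas
/* drop exact duplicate records */
proc sort data=mydata out=norecdups noduprecs;
   by _all_;
run;

/* or keep one observation per key value */
proc sort data=mydata out=oneperid nodupkey;
   by id;
run;
```

Note the difference: NODUPRECS compares entire records, while NODUPKEY keeps only the first observation for each BY-group.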

And finally you have to go through and look for duplicate information, which gets hairy when you start looking into 'association rule based' methodology. In the end you should get a well-defined data structure that is at least in 1st normal form, with all the duplicate info stripped out and clean data.

Toby Dunn

From: "Fehd, Ronald J" <rjf2@CDC.GOV>
Reply-To: "Fehd, Ronald J" <rjf2@CDC.GOV>
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Data Cleaning Books
Date: Mon, 28 Nov 2005 09:28:56 -0500

> From: toby dunn
>
> Does anyone have any favorite data cleaning or Data Quality
> Management books other than Ron Cody's book that they would like
> to recommend? I think I have started going way beyond Ron's book.

Toby: What is the scope of your Questions?

* how to identify stuff?
* what to do with this stuff?
* how to update the stuff in our data sets?

In my own work I resolved 80% of my interminable questions by having
* the data collection form
* the data dictionary
* and a freq of all variables
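The "freq of all variables" can be produced in one step (MYDATA is a hypothetical data set name; the MISSING option counts missing values as a level so they show up in the review):

```sas
proc freq data=mydata;
   tables _all_ / missing;
run;
```

For continuous numerics with many distinct values, PROC MEANS or PROC UNIVARIATE is usually a better summary than a frequency table.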

see the quote, which is the summary or head-slap of my decade of data cleansing.

Ron Fehd  the macro maven  CDC Atlanta GA USA
RJF2 at cdc dot gov

Your task is simple: remove the difference between how things should be and how they really are. -- Ashleigh Brilliant pot-shot #4247

got user-defined formats? then 80% of -your- job is done.

80% of -somebody- else's job is to review the reports.

%INVALID: a data review macro using proc FORMAT option other=INVALID to identify and list outliers PharmaSUG 2004
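A minimal sketch of the OTHER=INVALID idea behind that paper: a user-defined format maps every expected value to itself and everything else to a flag, so a freq lists the outliers for review (the format $SEXCHK, data set MYDATA, and variable SEX are hypothetical):

```sas
proc format;
   value $sexchk 'M','F' = 'valid'       /* expected values        */
                 other   = 'INVALID';    /* everything else flagged */
run;

proc freq data=mydata;
   tables sex / missing;
   format sex $sexchk.;     /* freq groups values via the format */
run;
```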
