LISTSERV at the University of Georgia
Date:         Mon, 4 Jan 2010 16:21:31 -0500
Reply-To:     "Kevin F. Spratt" <Kevin.F.Spratt@DARTMOUTH.EDU>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Kevin F. Spratt" <Kevin.F.Spratt@DARTMOUTH.EDU>
Subject:      Re: Data Validation/Cleansing Tool Query
Comments: To: Joe Matise <snoopy369@GMAIL.COM>
In-Reply-To:  <b7a7fa631001041302j27c32ba8s85843ff924322565@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"; format=flowed

At 04:02 PM 1/4/2010, Joe Matise wrote:
>If you have macros defined for it already, then a non-programmer can do it
>trivially.
>
>I however would disagree about it being a waste; a data-savvy programmer can
>be highly useful in data cleaning, as it's not necessarily trivial to make
>decisions and/or see issues that require additional cleaning steps. Trivial
>data cleaning is, well, trivial, and shouldn't take an appreciable amount of
>a programmer's actual physical time; data cleaning that is not truly
>trivial, but instead requires analysis, should be done by a programmer, in
>my book.
>
>-Joe

I second Joe's comments.

Data cleaning can be particularly non-trivial when the data is gathered according to various normalization rules across a number of tables.

The "trivial" part is documenting the variable names and creating formats. The non-trivial part is making sure that the various "joins" you often need to structure the data for particular analyses are done correctly. Handling missing data can also be a major issue when the database uses different numeric codes to indicate missing values, since you need to convert these to valid SAS missing values.
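
A minimal sketch of both checks, written in Python rather than SAS purely for illustration. The sentinel codes (-9, 99, 999), the function names, and the column names are all assumptions for the example; a real study's codebook would define the actual missing-value codes, and None stands in here for a SAS missing value.

```python
from collections import Counter

# Hypothetical sentinel codes; the study codebook defines the real ones.
MISSING_CODES = {-9, 99, 999}


def recode_missing(rows, columns, missing_codes=MISSING_CODES):
    """Return copies of rows with sentinel codes in `columns` replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original extract is left untouched
        for col in columns:
            if new_row.get(col) in missing_codes:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned


def check_merge_keys(left, right, key):
    """Report key problems that would make a one-to-one merge go wrong."""
    left_counts = Counter(row[key] for row in left)
    right_counts = Counter(row[key] for row in right)
    return {
        "duplicated_in_left": sorted(k for k, n in left_counts.items() if n > 1),
        "duplicated_in_right": sorted(k for k, n in right_counts.items() if n > 1),
        "only_in_left": sorted(set(left_counts) - set(right_counts)),
        "only_in_right": sorted(set(right_counts) - set(left_counts)),
    }
```

Running the key check before every merge, and recoding sentinels only in the columns where the codebook says they mean "missing", keeps a legitimate value such as an age of 99 from being silently blanked.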

In my experience, even when attempting to get this done in a coherent way, some preliminary analyses often result in identifying some additional cleaning problems, which, of course, is much better than when some late stage analysis results in identifying such problems.

The biggest problem I tend to have when some extract comes my way is when the comma delimited file has the response string in a cell rather than the corresponding numeric code.

For example, "Much of the time" rather than 4. This is especially troublesome when the forms have multiple versions and the version to version documentation does not make it clear that in version 1 "Much of the time" corresponds to 4, but in version 2 it corresponds to 3.
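
One way to guard against this is to key the label-to-code lookup on the form version and to fail loudly on anything undocumented. The sketch below is in Python for illustration only; the CODEBOOKS table and the label_to_code function are hypothetical, and the only entries included are the two mappings mentioned above.

```python
# Hypothetical per-version codebooks. Only the mapping discussed above is
# filled in; the rest of each scale would come from the form documentation.
CODEBOOKS = {
    1: {"Much of the time": 4},
    2: {"Much of the time": 3},
}


def label_to_code(label, form_version, codebooks=CODEBOOKS):
    """Translate a response string to its numeric code for one form version."""
    if form_version not in codebooks:
        raise ValueError(f"no codebook for form version {form_version!r}")
    codebook = codebooks[form_version]
    key = label.strip()  # tolerate stray whitespace from the delimited file
    if key not in codebook:
        raise ValueError(
            f"label {label!r} is not documented for form version {form_version!r}"
        )
    return codebook[key]
```

Raising an error instead of guessing means an undocumented version or label surfaces during cleaning rather than in a late stage analysis.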

When these kinds of things get discovered, the programmer who made the version 1 to version 2 changes is often following the instructions of a PI, who wants this change but has not actually consulted with the study methodologist and/or statistician who would typically argue against such a mid-stream change.

PIs often seem so surprised that such a "little" thing can cause so much angst. The worst of it is that, after you explain why this is a problem, and one that potentially is not easily corrected, the same PI on the next study does it again.

All I can say is that it's good when you get to a point in your career when you can just say no when asked to work with someone.

______________________________________________________________________

Kevin F. Spratt, Ph.D.
Department of Orthopaedic Surgery
Dartmouth Medical School
One Medical Center Drive
DHMC
Lebanon, NH USA 03756
(603) 653-6012 (voice)
(603) 653-6013 (fax)
Kevin.F.Spratt@Dartmouth.Edu (e-mail)
_______________________________________________________________________

