Date: Mon, 4 Jan 2010 16:21:31 -0500
Reply-To: "Kevin F. Spratt" <Kevin.F.Spratt@DARTMOUTH.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Kevin F. Spratt" <Kevin.F.Spratt@DARTMOUTH.EDU>
Subject: Re: Data Validation/Cleansing Tool Query
In-Reply-To: <b7a7fa631001041302j27c32ba8s85843ff924322565@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"; format=flowed
At 04:02 PM 1/4/2010, Joe Matise wrote:
>If you have macros defined for it already, then a non-programmer can do it
>trivially.
>
>I however would disagree about it being a waste; a data-savvy programmer can
>be highly useful in data cleaning, as it's not necessarily trivial to make
>decisions and/or see issues that require additional cleaning steps. Trivial
>data cleaning is, well, trivial, and shouldn't take an appreciable amount of
>a programmer's actual physical time; data cleaning that is not truly
>trivial, but instead requires analysis, should be done by a programmer, in
>my book.
>
>-Joe
I second Joe's comments.

Data cleaning can be particularly non-trivial when the data is gathered
according to various normalization rules across a number of tables.

The "trivial" part is documenting the variable names and creating formats.
The non-trivial part is making sure that the various "joins" you often need
to structure the data for particular analyses are merged correctly. Handling
missing data can also be a major issue when the database is coded with
different numeric values indicating missing, as you need to convert these
to valid SAS missing data values.
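
A minimal sketch of that recoding step, written in Python rather than SAS
purely for illustration. The sentinel codes (-9, 98, 99), the column name
"pain", and the helper recode_missing are all hypothetical; a real project
would take the codes from the database documentation.

```python
# Hypothetical sentinel codes that a source database might use for "missing".
MISSING_CODES = {-9, 98, 99}


def recode_missing(rows, columns, missing_codes=MISSING_CODES):
    """Return copies of rows with sentinel codes in the named columns set to None.

    Only the listed columns are touched, so a legitimate value such as
    id == 99 is never blanked by accident.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the raw extract stays untouched
        for col in columns:
            if new_row.get(col) in missing_codes:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned


# Made-up rows standing in for an extract.
rows = [
    {"id": 1, "pain": 4},
    {"id": 99, "pain": 99},
    {"id": 3, "pain": -9},
]
cleaned = recode_missing(rows, ["pain"])
```

Restricting the recode to named columns is a deliberate choice here: the
same number can be a valid value in one variable and a missing code in
another.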

In my experience, even when attempting to get this done in a coherent way,
some preliminary analyses often result in identifying some additional
cleaning problems, which, of course, is much better than when some
late-stage analysis results in identifying such problems.

The biggest problem I tend to have when some extract comes my way is when
the comma-delimited file has the response string in a cell rather than the
response's numeric code. For example, "Much of the time" rather than 4.
This is especially troublesome when the forms have multiple versions and
the version-to-version documentation does not make it clear that in
version 1 "Much of the time" corresponds to 4, but in version 2 it
corresponds to 3.
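
One way to keep this straight is to key the label-to-code lookup on the
form version. The sketch below is in Python purely for illustration; the
CODEBOOKS structure and the label_to_code helper are hypothetical, and only
the two "Much of the time" entries come from the example above.

```python
# Hypothetical per-version codebooks. Only the "Much of the time" entries
# (4 in version 1, 3 in version 2) come from the example in the text.
CODEBOOKS = {
    1: {"Much of the time": 4},
    2: {"Much of the time": 3},
}


def label_to_code(label, version, codebooks=CODEBOOKS):
    """Look up the numeric code for a response label under a given form version.

    Raises KeyError for an unknown version or label, so an undocumented
    response surfaces immediately instead of being silently miscoded.
    """
    return codebooks[version][label.strip()]


v1_code = label_to_code("Much of the time", 1)
v2_code = label_to_code("Much of the time", 2)
```

Failing loudly on an unknown label or version is intentional: a response
that the documentation does not cover is exactly the kind of thing you want
to find in preliminary checks rather than in a late-stage analysis.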

When these kinds of things get discovered, the programmer who made the
version 1 to version 2 changes is often following the instructions of a PI,
who wants this change but has not actually consulted with the study
methodologist and/or statistician who would typically argue against such a
mid-stream change.

PIs often seem so surprised that such a "little" thing can cause so much
angst. The worst of it is, after explaining why this is a problem and one
that potentially is not easily corrected, the same PI on the next study
does it again.

All I can say is that it's good when you get to a point in your career when
you can just say no when asked to work with someone.

______________________________________________________________________
Kevin F. Spratt, Ph.D.
Department of Orthopaedic Surgery
Dartmouth Medical School
One Medical Center Drive
DHMC
Lebanon, NH USA 03756
(603) 653-6012 (voice)
(603) 653-6013 (fax)
Kevin.F.Spratt@Dartmouth.Edu (e-mail)
_______________________________________________________________________