Dear Jonathan and SAS-L (ers)
Data cleaning is certainly important and the value of clean data is often
underestimated. Interestingly it is for us (GJI) often the easiest service
The biggest bang for buck that we use in the data cleaning process and that
I can recommend as an excellent starting point is Characterise under
Enterprise Guide. I am sure it use to be macro's developed and some SAS-L
ers will be able to refer you to it. (If you do not have Enterprise Guide
available let me know and I will provide you with an extract of the macro's)
Characterise provides a frequency analysis for all alpha fields (top 30 by
default) that we use to quickly identify problems like blank fields or lots
of "N/A", "TBA", "TEST", "HJKL" etc in fields. We ask business owners to
identify the valid versus invalid values in a extracted spreadsheet. We also
use it to generate a translation table that for example translate the
Australian state "Victoria", "VIC.", "V.I.C." etc to a consistent "VIC".
For numeric fields Characterise provides number of missings averages,
maximums, minimums etc.
On Tue, 5 Jan 2010 12:36:14 -0500, Jonathan Goldberg
>Well, I can't complain that I didn't get responses to my query. True,
>most of them were a bit snippy...
>The messages mostly were about how big an investment a piledriver is when
>all I'm looking for is a hammer. Our situation is relatively simple; we
>don't have compliciated normalization schemes or most other possible
>The idea of a tool is to put data cleaning as much as possible *in the
>hands and under the control of the people who know the data*. Who are
>also the people who will deal with any problems found. The need to
>involve programmers slows down the projects and increases costs. Also,
>there is no way whatsoever that a programmer could do data cleaning on
>his/her own. I think trying to streamline that process is a quite
>reasonable thing to do.
>Besides, my management wants me to do it, with something we write
>ourselves (which will involve non-SAS programming for the front end) or
>with a third party product. So, reasonable or not, here we go.
>My thanks for the pointers to Dataflux, SDD (no, I'm not familiar with all
>the SI vertical-market products), and SAS Data Quality Solution.
>Michael, I hope that when you finish your evaluation you will post your
>conclusions here. Inquiring minds whan to know! And thanks again for the
>It's true my cubicle is somewhat messy... :-)
>"(believe it or not SAS programmers are not necessarily the highest paid
>employees in some organizations)."
>In my case I guarentee it.