| Date: | Mon, 14 May 2007 19:44:21 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | =?windows-1256?Q?RE=3A_Multiple_obs_per_subject_=E2=80=93_data_cleaning__?=
=?windows-1256?Q?_______question?= |
|
| In-Reply-To: | <200705142224.l4EHjdAW018703@malibu.cc.uga.edu> |
| Content-Type: | text/plain; charset="windows-1256" |
Lynne:
This form of data cleaning could detect two opposite types of misclassification: duplicates of observations that appear to be different observations and observations of different entities that appear to belong to the same entity. Your focus seems to be on the latter.
I'd suggest a simple SQL GROUP BY query that selects tuples (observations) with the same NEWIDNUM but different values the sources of that value:
proc sql;
create table toReview as
select * from <ds>
group by NEWIDNUM having min(F_NAME) NE max(F_NAME) OR .... expand OR conditions to include other attributes ...
;
quit;
S
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Lynne
Sent: Monday, May 14, 2007 6:25 PM
To: SAS-L@LISTSERV.UGA.EDU
Cc: Lynne Sampson
Subject: Multiple obs per subject – data cleaning question
I am finding a lot of information about how to analyze data with multiple observations but not very much on how to get the data into shape for analysis! I hope someone out there can help.
I have a dataset with multiple observations for some subjects (some people are in there once while others are in there 2 or more times). I will need to be able to identify distinct individuals in the analysis (I plan to use GEE…). Right now the dataset does not contain a reliable unique identifier that can be used for this.
The data is structured like this:
F_NAME L_NAME DOB NEWIDNUM DATE VAR1…VARn
I have created a dummy identifier (NEWIDNUM) using some letters from the first and last names, some numbers from the date of birth, etc. I would like to be able to verify the accuracy of this identifier by examining all the situations in which people have the same NEWIDNUM do NOT match on all of the variables used to make NEWIDNUM. Then I can manually correct any problems I find and proceed from there.
For each set of observations with the same NEWIDNUM, all values for F_NAME, L_NAME, DOB should be the same but they will have different DATE values. If this is the case, then I don’t need to do anything. If not, then I want the observations to be output to a dataset so I can look at them. I thought I could do this by somehow using proc compare with NEWIDNUM as a by-group but was not successful. Does anyone have any other suggestions?
Thank you so much,
--Lynne
|