LISTSERV at the University of Georgia
Date:   Mon, 14 May 2007 19:44:21 -0400
Reply-To:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:   RE: Multiple obs per subject – data cleaning question
Comments:   To: Lynne <Lynne@UNC.EDU>
In-Reply-To:   <200705142224.l4EHjdAW018703@malibu.cc.uga.edu>
Content-Type:   text/plain; charset="windows-1256"

Lynne: This form of data cleaning could detect two opposite types of misclassification: duplicates of observations that appear to be different observations and observations of different entities that appear to belong to the same entity. Your focus seems to be on the latter.

I'd suggest a simple SQL GROUP BY query that selects tuples (observations) with the same NEWIDNUM but different values in the attributes from which that value was built:

proc sql;
  create table toReview as
  select *
  from <ds>
  group by NEWIDNUM
  having min(F_NAME) NE max(F_NAME)
      OR .... expand OR conditions to include other attributes ...
  ;
quit;

S
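For illustration, here is a minimal sketch of how the query above might look once the OR conditions are expanded to the three attributes named in the original question (F_NAME, L_NAME, DOB). The dataset name visits, the sample rows, and the ORDER BY clause are assumptions made up for this example and do not come from the thread.

```sas
/* Hypothetical sample data: names, dates, and ID values are invented. */
data visits;
  input F_NAME :$12. L_NAME :$12. DOB :yymmdd10. NEWIDNUM :$12. DATE :yymmdd10.;
  format DOB DATE yymmdd10.;
  datalines;
Ann Smith 1970-01-15 ANSM0115 2006-03-01
Ann Smith 1970-01-15 ANSM0115 2006-09-12
Anna Smith 1970-01-15 ANSM0115 2007-01-20
Bob Jones 1965-07-04 BOJO0704 2006-05-05
Bob Jones 1965-07-04 BOJO0704 2006-11-30
;
run;

proc sql;
  /* Keep every observation of any NEWIDNUM group in which at least
     one of the source attributes takes more than one value. */
  create table toReview as
  select *
  from visits
  group by NEWIDNUM
  having min(F_NAME) ne max(F_NAME)
      or min(L_NAME) ne max(L_NAME)
      or min(DOB)    ne max(DOB)
  order by NEWIDNUM, DATE;
quit;
```

With this sample data, toReview should hold the three ANSM0115 observations (F_NAME is both Ann and Anna) and none of the BOJO0704 ones. Because the query selects * while grouping, PROC SQL remerges the group-level min/max values with the detail rows and writes a note to that effect in the log; this is SAS-specific behavior, so other SQL dialects would typically need a subquery to return all rows of a flagged group.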

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Lynne
Sent: Monday, May 14, 2007 6:25 PM
To: SAS-L@LISTSERV.UGA.EDU
Cc: Lynne Sampson
Subject: Multiple obs per subject – data cleaning question

I am finding a lot of information about how to analyze data with multiple observations but not very much on how to get the data into shape for analysis! I hope someone out there can help.

I have a dataset with multiple observations for some subjects (some people are in there once while others are in there 2 or more times). I will need to be able to identify distinct individuals in the analysis (I plan to use GEE…). Right now the dataset does not contain a reliable unique identifier that can be used for this.

The data is structured like this:

F_NAME L_NAME DOB NEWIDNUM DATE VAR1…VARn

I have created a dummy identifier (NEWIDNUM) using some letters from the first and last names, some numbers from the date of birth, etc. I would like to be able to verify the accuracy of this identifier by examining all the situations in which observations have the same NEWIDNUM but do NOT match on all of the variables used to make NEWIDNUM. Then I can manually correct any problems I find and proceed from there.

For each set of observations with the same NEWIDNUM, all values for F_NAME, L_NAME, DOB should be the same but they will have different DATE values. If this is the case, then I don’t need to do anything. If not, then I want the observations to be output to a dataset so I can look at them. I thought I could do this by somehow using proc compare with NEWIDNUM as a by-group but was not successful. Does anyone have any other suggestions?

Thank you so much,

--Lynne

