| Date: | Fri, 2 Jun 2006 10:01:14 -0700 |
| Reply-To: | Mak <makgeha@GMAIL.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Mak <makgeha@GMAIL.COM> |
| Organization: | http://groups.google.com |
| Subject: | Messy Data editing HELP!!!!!!!! |
| Content-Type: | text/plain; charset="iso-8859-1" |
|---|
I am trying to clean a large data file (about 1 million records) for a
survival analysis of dairy cattle. These are field data collected from
farmers, and they contain a lot of irregularities that I want to get
rid of. The data set has 12 variables for each cow, with records on
different lactations. The response variable is disease, coded from 0
to 9 (each code represents a certain disease, and 0 means no disease
reported).
The format is as follows:
Herd| Cow# | Lactation# | Disease etc...
The first problem is that some cows are reported as having a disease
in, for example, the second lactation (and were thus taken out of the
herd), yet the same cow appears again in the third lactation, which
makes no sense at all. The second problem is that some cows have two
disease scores for the same lactation, one showing no disease and the
other showing a disease. I want to write a program that deals with
these cases.

For the first case, the program should look up the lactation number
and check whether the cow shows up again in the next lactation after a
disease is reported. If it does, I want the disease score changed to 0.

For the second case, the program should check whether the cow appears
in the next lactation. If it does, the record that shows a disease is
obviously wrong, so I want to delete it and keep the other record. If
the cow does not appear in the next lactation, I want to delete both
records, since there is no basis for judging which one is correct.
An example of the cases is as follows:
Herd|Cow# | Lactation#| Disease
1 | 1 | 01 | 0
1 | 1 | 02 | 5
1 | 1 | 03 | 2
1 | 1 | 04 | 0
1 | 1 | 05 | 5
Obviously, in this case the disease reports in lactations 2 and 3 are
wrong, and I want them changed to 0 or, to be on the safe side, all the
records for this cow deleted. The same cow number can occur under
different herd numbers (herds are the blocking factor), so a cow has to
be identified by herd and cow number together.
Herd|Cow# | Lactation#| Disease
1 | 2 | 01 | 0
1 | 2 | 01 | 5
1 | 2 | 02 | 0
1 | 2 | 03 | 0
1 | 2 | 04 | 5
Obviously, in this case we should keep the lactation 1 record that
shows no disease and delete the other one or, as in the previous case,
delete all the information about cow# 2.
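
To make the two rules concrete, here is a minimal sketch of the logic I have in mind. It is written in Python only because that is compact, and it is not a tested solution for the real file. The function name `clean_records` and the tuple layout (herd, cow, lactation, disease) are my own placeholders. It also rests on three assumptions: "next lactation" means lactation number + 1, exact duplicate records collapse to one, and any set of conflicting scores is treated like the 0-versus-disease conflict described above.

```python
from collections import defaultdict


def clean_records(records):
    """Apply the two correction rules to (herd, cow, lactation, disease) rows."""
    # Group disease scores by animal and lactation. The key includes the herd
    # because the same cow number can occur in different herds.
    scores_by_cow = defaultdict(lambda: defaultdict(list))
    for herd, cow, lactation, disease in records:
        scores_by_cow[(herd, cow)][lactation].append(disease)

    cleaned = []
    for (herd, cow), by_lactation in sorted(scores_by_cow.items()):
        for lactation in sorted(by_lactation):
            # Assumption: exact duplicate scores collapse to a single value.
            scores = set(by_lactation[lactation])
            # Assumption: "next lactation" means lactation number + 1.
            seen_again = (lactation + 1) in by_lactation
            if len(scores) > 1:
                # Case 2: conflicting scores for the same lactation.
                if seen_again:
                    cleaned.append((herd, cow, lactation, 0))
                # Otherwise every record for this lactation is dropped.
            else:
                disease = scores.pop()
                # Case 1: a disease followed by another lactation becomes 0.
                if disease != 0 and seen_again:
                    disease = 0
                cleaned.append((herd, cow, lactation, disease))
    return cleaned


# The two example cows from this post.
sample = [
    (1, 1, 1, 0), (1, 1, 2, 5), (1, 1, 3, 2), (1, 1, 4, 0), (1, 1, 5, 5),
    (1, 2, 1, 0), (1, 2, 1, 5), (1, 2, 2, 0), (1, 2, 3, 0), (1, 2, 4, 5),
]
for row in clean_records(sample):
    print(row)
```

On these rows, the scores for cow 1 in lactations 2 and 3 become 0, and cow 2 keeps only the no-disease record for lactation 1. The disease in each cow's last lactation is left alone because there is no later record to contradict it.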
A third problem that I am facing is gaps between lactation numbers,
for example:
Herd|Cow# | Lactation#| Disease
1 | 3 | 01 | 0
1 | 3 | 03 | 0
1 | 3 | 04 | 2
Information about lactation 2 is missing in this case, so I want to
delete all the records on that particular cow.
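
The missing-lactation rule can be sketched the same way. Again this is only an illustration in Python under my own assumptions. The name `drop_cows_with_gaps` is hypothetical, records are (herd, cow, lactation, disease) tuples held in a list, and only gaps between a cow's first and last recorded lactation count, so a cow whose records simply start at lactation 2 is not flagged.

```python
from collections import defaultdict


def drop_cows_with_gaps(records):
    """Remove every record of any cow whose lactation numbers skip a value.

    `records` must be a list, because it is scanned twice.
    """
    lactations = defaultdict(set)
    for herd, cow, lactation, _disease in records:
        lactations[(herd, cow)].add(lactation)

    def has_gap(numbers):
        # Consecutive numbers cover exactly max - min + 1 distinct values.
        return max(numbers) - min(numbers) + 1 != len(numbers)

    gappy = {key for key, numbers in lactations.items() if has_gap(numbers)}
    return [row for row in records if (row[0], row[1]) not in gappy]


sample = [
    (1, 3, 1, 0), (1, 3, 3, 0), (1, 3, 4, 2),  # cow 3: lactation 2 is missing
    (1, 2, 2, 0), (1, 2, 3, 0), (1, 2, 4, 5),  # consecutive, so these are kept
]
kept = drop_cows_with_gaps(sample)
print(kept)
```

Because the check works on the set of lactation numbers per herd and cow, it does not matter whether the file is sorted or whether a lactation appears twice.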
I have been cracking my skull on this issue for the past couple of
months with no successful result.
I would really appreciate it if someone out there could help me out.
Thanks, everybody.