Date: Wed, 28 Jun 2006 10:05:44 -0400
Reply-To: Kevin Roland Viel <kviel@EMORY.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Kevin Roland Viel <kviel@EMORY.EDU>
Subject: Re: UNIX datastep question
In-Reply-To: <7.0.1.0.2.20060627151242.03591018@viergever.net>
Content-Type: TEXT/PLAIN; charset=US-ASCII
Jennifer,
Having been stung once by ICD-9 codes, the first thing I would do is
obtain a frequency listing of *all* codes. There might not be a
difference between, say, 714, 714.0, and 714.00, but there may be...
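
To make the idea concrete, here is a minimal sketch of such a frequency
listing, written in Python rather than SAS purely for illustration. The
sample codes are made up, and the point is only that each spelling is
counted exactly as stored, so variants such as 714 and 714.0 show up as
separate rows.

```python
from collections import Counter

# Hypothetical sample of diagnosis codes as they might appear in the raw data.
codes = ["714", "714.0", "714.00", "714.0", "250.00", "714"]

# Count every distinct spelling exactly as stored, without any cleanup.
freq = Counter(codes)

# Print in sorted order so near-duplicate spellings sit next to each other.
for code, n in sorted(freq.items()):
    print(f"{code:<8}{n}")
```

Whether those variants should then be collapsed is a decision for whoever
knows the coding conventions of the source data.
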
I am pretty surprised that no one has questioned the form of the data.
Obviously, many, many patients do NOT have 15 Dx's. The diagnoses should be
held in a separate table.
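
A rough sketch of what moving the diagnoses into a separate table could look
like, again in Python for illustration only. The field names (patient_id,
dx) and the sample records are hypothetical, and only three of the 15 Dx
slots are shown.

```python
# Hypothetical wide records: one row per patient, up to 15 Dx slots (3 shown).
wide = [
    {"patient_id": 1, "dx": ["714.0", "250.00", None]},
    {"patient_id": 2, "dx": ["401.9", None, None]},
]

# Long layout: one row per (patient, position, code); empty slots are dropped.
long_rows = [
    (rec["patient_id"], pos, code)
    for rec in wide
    for pos, code in enumerate(rec["dx"], start=1)
    if code is not None
]

for row in long_rows:
    print(row)
```

The long table stores only the diagnoses that exist, instead of 15 slots
for every patient.
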
Also, when against the wall, Ian's sage suggestion (no intention to
slight others, but I stopped reading intensely at this point) of using a
VIEW will serve you well. A VIEW, however, is re-created on the fly *each*
time you hit it. This means it is potentially dynamic and could require
more CPU time, but if you don't have the memory or disk space, you have no
alternative, assuming your code is already efficient.
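
As a loose analogy only (this is Python, not a SAS VIEW, and the table and
function names are hypothetical), the trade-off looks like this: nothing is
stored, the subset is recomputed on every use, and the result follows any
change in the underlying data.

```python
# Hypothetical base table of (patient, code) rows.
source = [("p1", "714.0"), ("p2", "401.9")]

def ra_view():
    """Recompute the subset from the base table on every call; nothing is stored."""
    return (row for row in source if row[1].startswith("714"))

first = list(ra_view())           # evaluated now, against the current rows
source.append(("p3", "714.30"))   # the base table changes
second = list(ra_view())          # evaluated again, so the new row appears
```

Each call pays the cost of scanning the base table again, which is the extra
CPU time mentioned above.
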
You might be able to dispense with the flag altogether by using either
formats or a hash. An exercise like this will hone your attention to
efficiency, in either execution or space. This is why I always made my
students aware of the little things, even if our classroom datasets were
only a few hundred observations: at some point, they will have a "big"
dataset.
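
A minimal sketch of dispensing with the flag by looking codes up directly,
in Python for illustration. The set below merely plays the role that a
format or a hash would play in SAS; the codes of interest and the sample
rows are assumptions, not taken from the original question.

```python
# Hypothetical set of codes of interest; stands in for a hash or format.
target_codes = {"714", "714.0", "714.00"}

# Hypothetical long table of (patient_id, code) rows.
dx_rows = [(1, "714.0"), (1, "250.00"), (2, "401.9"), (3, "714")]

# Select patients directly by lookup, with no stored flag variable.
patients = sorted({pid for pid, code in dx_rows if code in target_codes})
print(patients)
```

Because membership is tested at the point of use, no extra flag column has to
be written to disk.
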
Yours is exactly the example I suggest to the genetics folks who are
astounded by the size of our data, which soon will be the entire 3
billion base-pairs of the genome (genome = one person's collection of
DNA). Relish the thought!!! I guess I cite the financial industry's
likely datasets, too.
Good luck,
Kevin
Kevin Viel
Department of Epidemiology
Rollins School of Public Health
Emory University
Atlanta, GA 30322