LISTSERV at the University of Georgia
Date:   Wed, 28 Jun 2006 12:46:04 -0400
Reply-To:   Kevin Roland Viel <kviel@EMORY.EDU>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Kevin Roland Viel <kviel@EMORY.EDU>
Subject:   Re: UNIX datastep question
Comments:   To: plessthanpoinohfive <plessthanpointohfive@gmail.com>
In-Reply-To:   <200606281526.k5SFQCYt027647@virginia.cc.emory.edu>
Content-Type:   TEXT/PLAIN; charset=US-ASCII

On Wed, 28 Jun 2006, plessthanpoinohfive wrote:

> Hi, Kevin,
>
> You're just upstairs from me, I think! I'm in rm 348.
>
> I'm still honing my basic skills with SAS, even though I've been out of
> school (masters) for 4 years. This is my first time round with a big file.
>
> The nuances of the codes isn't a trifling issue I agree. I am looking for
> specific diagnoses so I have an ICD9 dictionary to look them up. After I
> find the one for the diagnoses of interest I look to see what are the
> ordered codes around it. So, when I used only 714 I've already looked to
> see if 714.0 is different from 714.00.
>
> This data is from the National Inpatient Sample. I can't imagine there are
> many people with 15 diagnoses but I know there are some. In some cases it's
> just a diagnoses of things like smoking as opposed to actual disease. So,
> some of that 15 Dx isn't actually a diagnoses of a disease.
>
> I'm afraid I don't understand what you mean by "You might be able to
> dispense with the flag altogether, either by using formats or a hash." I'm
> ultimately going to be doing some logistic regression and the dummies will
> be my predictors. However, I'm all for efficiency, so I'm happy to try new
> ways to code it.

Jen,

Unfortunately, I am a little to the west. I have begun a pre-post doc and no longer live in Atlanta. I do miss it, though :)

The hash is a new data step object available in v9. It seems that your best choice might be to use an array with a format. Are you familiar with using the CNTLIN option of the FORMAT procedure? Typically, the list of ICD codes will come as a file. You can read this file, create a SAS dataset with the layout that CNTLIN expects, and use it to create a format. It is then incumbent upon your collaborators to be precise in the case definitions. You alter nothing (immediately write-protect the file on arrival) and do not risk typing errors:

data _null_ ;
   file "C:\codes.txt" ;
   case_def = "714.00" ;
   put case_def ;
run ;

NOTE: The file "C:\codes.txt" is:
      File Name=C:\codes.txt,
      RECFM=V,LRECL=256

NOTE: 1 record was written to the file "C:\codes.txt".
      The minimum record length was 6.
      The maximum record length was 6.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds

data CNTLIN ;
   infile "C:\codes.txt" end = end ;
   length label $ 5 ;
   retain fmtname "RA" label "Yes" type "c" ;
   do until ( end ) ;
      input start : $6. ;
      output ;
   end ;
   label = "Other" ;
   hlo = "o" ;
   start = " " ;
   output ;
   stop ;
datalines ;

NOTE: The infile "C:\codes.txt" is:
      File Name=C:\codes.txt,
      RECFM=V,LRECL=256

NOTE: 1 record was read from the infile "C:\codes.txt".
      The minimum record length was 6.
      The maximum record length was 6.
NOTE: The data set WORK.CNTLIN has 2 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

;
run ;

proc format cntlin = cntlin fmtlib ;
NOTE: Format $RA is already on the library.
NOTE: Format $RA has been output.
run ;

NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

NOTE: There were 2 observations read from the data set WORK.CNTLIN.

data _null_ ;
   input ICD $ ;
   if put( ICD , $RA. ) = "Yes" then put ICD= ICD= $RA. ;
   else put ICD= ;
datalines ;

ICD=714.00 ICD=Yes
ICD=250.00
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

;
run ;

As you can see, you can control the LABELs. I had a different format for each type of arthritis (osteoarthritis, rheumatoid, etc). Then I had a format that identified any of the 514 ICD-9 codes that fit our case definition of arthritis.
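For comparison, the hash object mentioned above could do the same membership test without building a format. This is only a minimal sketch: the data set WORK.CODES, its variable START, and the second code 714.10 are made up for illustration and are not from the original post.

```sas
/* Hypothetical code list: one character variable, START. */
data codes ;
   length start $ 6 ;
   input start ;
datalines ;
714.00
714.10
;
run ;

data _null_ ;
   length start $ 6 ;
   if _n_ = 1 then do ;
      /* Load the code list into memory once, keyed on START. */
      declare hash ra ( dataset: "codes" ) ;
      ra.defineKey( "start" ) ;
      ra.defineDone() ;
      call missing( start ) ;
   end ;
   input ICD : $6. ;
   /* check() returns 0 when the key is found in the hash. */
   if ra.check( key: ICD ) = 0 then put ICD= "matches the case definition" ;
   else put ICD= ;
datalines ;
714.00
250.00
;
run ;
```

A format keeps the lookup available to any later step or procedure, while a hash lives only inside the data step that declares it, which is one reason the format route may suit this task better.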

It seems to me that 1) some patients will be repeated, 2) some patients will have multiple "flags", and 3) case-mix might confound your analyses.

You can process part of the file to ease resource constraints:

data MI ( keep = ID ) RA ( keep = ID ) ;
   array dx ( 15 ) $ 6 ;
   set big ( firstobs = 1 obs = 100000 ) ;
   do _n_ = 1 to 15 ;
      if dx( _n_ ) = "" then leave ; /* CAUTION!!! */
      else if put( dx( _n_ ) , $MI. ) = "Yes" then output MI ;
      else if put( dx( _n_ ) , $RA. ) = "Yes" then output RA ;
   end ;
run ;

proc append base = final_MI data = MI ;
run ;

proc datasets library = WORK nolist ;
   delete MI ;
quit ;

<repeat: potential macro?>
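One way the repeat might be wrapped in a macro is sketched below. The macro name %CHUNK, its TOTAL= and SIZE= parameters, and the FINAL_RA data set are hypothetical. The sketch assumes that WORK.BIG holds ID and DX1-DX15 and that the $MI. and $RA. formats already exist; TOTAL= should be the number of observations in BIG, and 250000 is only a placeholder.

```sas
/* Sketch only: %CHUNK, TOTAL=, SIZE= and FINAL_RA are hypothetical names. */
%macro chunk( total = , size = 100000 ) ;
   %local first last ;
   %do first = 1 %to &total %by &size ;
      /* Last observation of this slice, capped at the file size. */
      %let last = %sysfunc( min( &total , %eval( &first + &size - 1 ) ) ) ;

      data MI ( keep = ID ) RA ( keep = ID ) ;
         array dx ( 15 ) $ 6 ;
         set big ( firstobs = &first obs = &last ) ;
         do _n_ = 1 to 15 ;
            if dx( _n_ ) = "" then leave ; /* same CAUTION as above */
            else if put( dx( _n_ ) , $MI. ) = "Yes" then output MI ;
            else if put( dx( _n_ ) , $RA. ) = "Yes" then output RA ;
         end ;
      run ;

      /* Accumulate this slice, then clear the work copies. */
      proc append base = final_MI data = MI ;
      run ;

      proc append base = final_RA data = RA ;
      run ;

      proc datasets library = WORK nolist ;
         delete MI RA ;
      quit ;
   %end ;
%mend chunk ;

%chunk( total = 250000 )
```

Note that OBS= names the last observation to read, not a count, which is why each slice passes both FIRSTOBS= and OBS=.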

Note, there will be repeated IDs. Remember, once you identify the patients, be sure to go back and obtain *all* of their ICD-9 and CPT codes, not just those that qualify.
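Going back for the complete records could look something like the sketch below. It assumes a data set FINAL_RA of (possibly repeated) IDs and that WORK.BIG carries every record for each patient; both names, and the output name RA_ALL, are placeholders.

```sas
/* Sketch only: FINAL_RA, BIG and RA_ALL are assumed names. */
proc sql ;
   create table RA_all as
   select *
   from big
   where ID in ( select distinct ID from final_RA ) ;
quit ;
```

The DISTINCT takes care of the repeated IDs. On a very large file a sorted merge or a hash lookup may run faster, but this has not been timed.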

I make no claims of efficiency (either CPU or space). A search of the archives could be of great assistance.

Regards,

Kevin

PS You might want to check out the Johns Hopkins case-mix software?

Kevin Viel
Department of Epidemiology
Rollins School of Public Health
Emory University
Atlanta, GA 30322

