Date: Thu, 25 May 2006 20:14:06 -0700
Reply-To: sophe88@YAHOO.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: sophe88@YAHOO.COM
Organization: http://groups.google.com
Subject: Re: A proc logistic question
In-Reply-To: <4471CD1F020000C900006B07@mail.NDRI.ORG>
Content-Type: text/plain; charset="iso-8859-1"
Thank you for your reply.
Here is some background.
1. The DV is binary.
2. About 120 IVs, some continuous, some categorical.
3. 31,000 records in the learning universe. About the same count in the
validation holdout.
4. About 3.1% of records have DV = 1.
5. All 120 IVs have a poly corr of at least 0.2 with the DV, negative or
positive.
6. While I am not saying our steps are perfect or best, all 120 IVs
have been checked and appear reasonably free of severe multicollinearity
problems.
7. No survey data.
8. No treatment vs. control type of incremental effect measurement.
Plain, classic binary logistic.
Now the story behind the variable in question:
Some of my statisticians have a tendency to use classifiers to
recode variables. Sometimes the tool is naive Bayes macros, sometimes
trees. Sometimes Clementine, sometimes KXEN. This specific variable was
originally just the first 2 bytes of a 4-byte SIC code (SIC = Standard
Industrial Classification). In some previous projects, the so-called SIC
clustering worked out well, in terms of boosting top decile lifts. But I
have been advocating a reasonably high top lift.
This time one of them tested this recoding using CHAID in SPSS Clementine
and came up with 7 new values/cuts on the SIC data. She plugged it into the
logistic model, since this new variable somehow survived her usual
multicollinearity tests. The new SIC variable squeezed out 3 other
variables, smoothed out the lift (actually boosting it to 638, from 497),
and made the lift more consistent on both the learning (L) and validation
(V) samples. The only problem was the '>999.999' scar in the odds ratio
that I mentioned. I did not feel good about it, but could not call up any
reference right away.
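For anyone who wants to check the lift comparison, here is a minimal
sketch in Python rather than SAS. It assumes the lift figures above (638
vs. 497) are index values where 100 means "no better than random", which
is only my reading of the convention. The function name and the toy data
are hypothetical.

```python
def top_decile_lift(scores, outcomes):
    """Lift index: 100 * (response rate in top 10% by score) / (overall rate).

    Assumes at least one responder overall; otherwise the ratio is undefined.
    """
    # Rank records from highest to lowest model score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(outcomes) / len(outcomes)
    return 100.0 * top_rate / overall_rate


# Toy data: 100 records with distinct scores and 5 responders,
# 3 of which fall among the 10 highest-scored records.
scores = [100 - i for i in range(100)]
outcomes = [0] * 100
for idx in (0, 1, 2, 50, 90):
    outcomes[idx] = 1

lift = top_decile_lift(scores, outcomes)
print(round(lift))
```

Running the same function on the learning and the validation samples
separately gives the two numbers to compare for consistency.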
After testing several other variables, I see this tends to happen to
variables that carry ratios or percentage values, while the other variables
in the IV pool are original 'numbers'. Perhaps, in the end, the scale is
essentially the problem?
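A minimal sketch of the scale idea, in Python rather than SAS. The odds
ratio for a change of c units in a predictor is exp(beta * c), so a ratio
stored on a 0-1 scale can produce a huge odds ratio per "one unit" even
when the effect per percentage point is modest. The coefficient 8.0 and
the helper names are hypothetical, not taken from my model. In PROC
LOGISTIC itself, the UNITS statement is the usual way to request odds
ratios for a different unit of change.

```python
import math


def odds_ratio(beta, unit=1.0):
    """Odds ratio for a change of `unit` in the predictor: exp(beta * unit)."""
    return math.exp(beta * unit)


def format_like_report(value, cap=999.999):
    """Mimic a report that prints '>999.999' for anything above the cap."""
    return f">{cap:.3f}" if value > cap else f"{value:.3f}"


beta = 8.0  # hypothetical coefficient for a predictor stored as a 0-1 proportion

per_full_unit = odds_ratio(beta)         # change from 0 to 1, i.e. 0% to 100%
per_point = odds_ratio(beta, unit=0.01)  # change of one percentage point

print(format_like_report(per_full_unit))
print(format_like_report(per_point))
```

The model fit is the same either way; only the unit in which the odds
ratio is reported changes.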
Thanks.
PD
Peter Flom wrote:
> Peter L. Flom, PhD
> Assistant Director, Statistics and Data Analysis Core
> Center for Drug Use and HIV Research
> National Development and Research Institutes
> 71 W. 23rd St
> New York, NY 10010
> http://cduhr.ndri.org
> www.peterflom.com
> (212) 845-4485 (voice)
> (917) 438-0894 (fax)
>
>
> >>> <sophe88@YAHOO.COM> 05/22/06 2:20 PM >>> wrote
> <<<
> I see this in my proc logistic output
>
>              Odds Ratio Estimates
>
>                Point          95% Wald
> Effect      Estimate      Confidence Limits
>
> var1           1.029       1.002       1.057
> sr_cust     >999.999    >999.999    >999.999
>
>
> What does >999.999 mean? Does it mean sr_cust is a 'bad' var and should
> not stay in the model? Removing it will 'crash' the model lift table
> (bumpy), although I may find others to replace it. Its Chisq and others
> look OK to me. Thanks.
> >>>
>
> You haven't given us much to go on. Could you give some context?
> What's your DV, what are your IVs, what is N? Was it a survey?
> (Paging Dr. Cassell)
>
> One thing might be that the scale is wrong. Something like this could
> happen if, say, the IV was personal income measured in millions of
> dollars per year, and the outcome was probability of owning a
> home... Try changing the unit.
>
> But please also write back to SAS-L with more information.
>
> Peter