Date: Wed, 26 Mar 2008 13:21:57 -0400
Reply-To: Peter Flom <peterflomconsulting@mindspring.com>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: PROC LOGISTIC MODEL--Standardize vars?
Content-Type: text/plain; charset=UTF-8
Tom White wrote:
>(4) Back to my original question: Let's not assume what the IVs represent. The majority of them represent
> percentages, i.e. numbers between 0 and 1 (inclusive). Other IVs represent various counts and their values
> could range from 0 up to whatever integer (say, 23, or 45, or 7000, etc).
>
> My question is again: If I have a bunch of IVs to consider for model inclusion, do I have to worry about their ranges? In other words, do I need to worry about (their sizes) scaling them, transforming them, standardizing them, etc.?
Yes, you do. But standardizing them by dividing by the SD is probably *not* the transformation you want.
> The purpose of this is to find out which ones are significant to include in the model?
>
> As Peter said, working with the original IVs will give me one model.
> Transforming the IVs in some way will give me another model.
> So, which model do I want?
> Obviously I want the model which will catch most fraudulent claims!
> So, should I work with the original IVs or transform them somehow?
>
Well, given the size of your data, I suggest that you divide the data you've got and test on part of it. David Cassell argues for cross-validation because it is more powerful, but you have so many observations that you needn't worry about that. So, divide the data into TRAIN, TEST, and VALIDATE, either purely randomly or stratified on some key variables. Run various models on TRAIN. Test them on TEST. Play around. Leave VALIDATE alone until you are totally satisfied, and then use VALIDATE to see how well you should do on future data. (You won't do quite as well, because VALIDATE is a random sample from your existing data, whereas the new data will probably be different in various ways.)
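A minimal sketch of that three-way split, in plain Python rather than SAS. The function name split_three_ways, the 60/20/20 fractions, the seed, and the toy claims data are all assumptions for illustration; nothing in this thread fixes those choices.

```python
import random

def split_three_ways(rows, outcome, fractions=(0.6, 0.2, 0.2), seed=2008):
    """Randomly split rows into TRAIN, TEST, VALIDATE, stratified on outcome(row)."""
    rng = random.Random(seed)

    # Group the rows by the stratification key (here, the fraud flag).
    strata = {}
    for row in rows:
        strata.setdefault(outcome(row), []).append(row)

    train, test, validate = [], [], []
    for members in strata.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_test = round(fractions[1] * len(members))
        train += members[:n_train]
        test += members[n_train:n_train + n_test]
        # Whatever is left in the stratum goes to VALIDATE.
        validate += members[n_train + n_test:]
    return train, test, validate

# Hypothetical toy data: 10,000 claims with a 1% fraud rate.
claims = [{"id": i, "fraud": 1 if i % 100 == 0 else 0} for i in range(10_000)]
train, test, validate = split_three_ways(claims, lambda row: row["fraud"])

for name, part in (("TRAIN", train), ("TEST", test), ("VALIDATE", validate)):
    print(name, len(part), sum(row["fraud"] for row in part))
```

Stratifying on the outcome keeps the 1% fraud rate the same in all three pieces, which matters when the event is this rare.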
> For example, if I use two IVs VAR1 (values from 0 to 1) and VAR2 (values from 0 to 1000) do I need to worry
> that the VAR2 which has much bigger values than VAR1 will somehow overtake VAR1 when it comes to parameter
> estimation, i.e. when it comes to choosing the best model?
>
No, you don't need to worry about this, or at least, not exactly. Rescaling a variable changes its parameter estimate, but it changes the standard error by the same factor, so it balances out. The problem is that when the range of a variable is very large, it is hard to interpret the parameter estimate. It will MEAN the same thing... but you wouldn't want to measure the height of humans in nanometers. If the range is REALLY big, it can even cause problems with floating point arithmetic. So, you might want to divide the 0 to 1000 variable by 1000, but mostly for your ease, not the computer's.
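To see the "balances out" point concretely, here is a minimal sketch in plain Python. It is not PROC LOGISTIC and makes no claim about SAS's algorithm: it fits a one-predictor logistic regression by Newton-Raphson on simulated data. The names grad_hess and fit_logistic, the simulated coefficients, the seed, and the sample size are all assumptions for illustration.

```python
import math
import random

def grad_hess(b0, b1, x, y):
    """Gradient and information-matrix sums for logit(p) = b0 + b1*x."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def fit_logistic(x, y, iterations=25):
    """Return (slope, standard error of slope) from a Newton-Raphson fit."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = grad_hess(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error of the slope from the inverse information matrix.
    _, _, h00, h01, h11 = grad_hess(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b1, se_b1

# Simulated data (assumption): VAR2 runs from 0 to 1000.
rng = random.Random(1)
var2 = [rng.uniform(0, 1000) for _ in range(2000)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(-2 + 0.003 * v))) else 0 for v in var2]

b_raw, se_raw = fit_logistic(var2, y)
b_scaled, se_scaled = fit_logistic([v / 1000 for v in var2], y)

print("raw:    estimate, SE, z =", b_raw, se_raw, b_raw / se_raw)
print("scaled: estimate, SE, z =", b_scaled, se_scaled, b_scaled / se_scaled)
```

Dividing VAR2 by 1000 multiplies both the estimate and its standard error by 1000, so the z statistic (and hence the test) should come out the same up to rounding. Only the units of the coefficient change.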
>One other question I'd like to ask now since I gave you the background.
>
>Peter(?) and possibly others(?) have said that if you use many obs (remember, I have 5M obs to develop the model--
>about 1% fraud rate), then any minute (insignificant?) IVs will show up as significant.
>
I've said this, others have said it, it's pretty fundamental. Statistical significance is *not* the key thing. p-values are almost never the most useful output from a statistical procedure.
>I was thinking of using all 5M obs to develop the model (I have another year's worth put aside for validation, etc.).
>Is this appropriate? David has said the more the better! That's why I am thinking of using all of them.
>(The data obs go back to around year 2000 or so.) Or am I better off to randomly select a smaller number,
>say, 250K obs out of the 5M (keeping the 1%--99% ratio the same), to work with?
>I think David would say use all 5M of them?
>
Yes, use them all. The more data, the more precise your estimates. Precision is good. But the p-values will be pointless. Why? Well, OK. Suppose there is some YES/NO variable, and people who say yes are fraudulent very slightly more often than people who say no. With 5 million cases, a difference far too small to matter can still be significant. But meaningless. It won't save you enough money to pay for the time of the person who has to check the data.
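A rough illustration of why the p-value stops being informative at this sample size, as a plain-Python two-proportion z-test. The fraud rates of 1.03% and 1.00%, the even split between the two groups, and the function name two_proportion_z are made-up assumptions; only the 5 million total comes from the thread.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical YES/NO variable: 1.03% fraud among YES, 1.00% among NO.
z_big, p_big = two_proportion_z(0.0103, 2_500_000, 0.0100, 2_500_000)
z_small, p_small = two_proportion_z(0.0103, 5_000, 0.0100, 5_000)

print("5,000,000 obs:", z_big, p_big)
print("   10,000 obs:", z_small, p_small)
```

The same 0.03 percentage point difference is comfortably "significant" with 5 million observations and nowhere near it with 10,000. The size of the effect, and what it is worth in money, is the thing to look at.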
>The other issue is my small 1% fraud rate.
>
>Do I need to do some kind of weighting here to give more "strength" to the # of obs with fraud status=1?
>
I don't think so. As I understand it, the problem is when a cell SIZE is small, not when it's a small proportion.
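The arithmetic behind that, as a short Python sketch using the figures from this thread (5M observations, about 1% fraud, and the proposed 250K subsample). The helper name cell_sizes is an assumption for illustration.

```python
def cell_sizes(n_obs, event_rate):
    """Return (number of events, number of non-events) for a given sample."""
    events = round(n_obs * event_rate)
    return events, n_obs - events

full = cell_sizes(5_000_000, 0.01)     # all the data
subsample = cell_sizes(250_000, 0.01)  # the proposed 250K subsample

print("full data: fraud, non-fraud =", full)
print("subsample: fraud, non-fraud =", subsample)
```

A 1% proportion of 5 million is still tens of thousands of fraud cases, so the fraud cell is small only in relative terms.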
Peter
Statistical Consultant
www DOT peterflom DOT com