Date: Thu, 22 Sep 2005 23:21:47 -0700
Reply-To: pa pa <ctll04@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: pa pa <ctll04@YAHOO.COM>
Subject: Data Preprocessing questions
Content-Type: text/plain; charset=iso-8859-1
I am using the KDD 99 dataset as input to a neural network, and I have read some papers about this dataset.
The data contains both nominal variables and numerical variables.
The papers I read recommend preprocessing the data as follows:
+ nominal variables (some have more than 60 distinct values) -> encode as integers (0, 1, 2, 3, 4, ...)
+ numerical variables:
 - After encoding the nominal variables, my data is all numeric.
 - Most variables are binary (0 and 1).
 - Some have range [0-256].
 - A few have range [0-1 billion].
The papers SCALE the variables in range [0-256] to the range [0-1],
and the variables in range [0-1 billion] to [0-10] by logarithmic scaling.
My questions are:
Q1: Why do we have to encode the nominal data as integers? If a nominal value A is encoded as 1 and B as 2, A is not "less than" B, even though 1 < 2. I suspect this is done to speed up the learning of the NN.
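To make the question concrete, here is a rough sketch of the kind of encoding I mean, in a data step. (The variable name PROTOCOL and its values are just made-up examples, not the actual KDD 99 columns.)

```sas
/* Sketch only: map a hypothetical nominal variable PROTOCOL */
/* ('tcp', 'udp', 'icmp') to arbitrary integer codes.        */
data encoded;
   set raw;
   select (protocol);
      when ('tcp')  protocol_code = 0;
      when ('udp')  protocol_code = 1;
      when ('icmp') protocol_code = 2;
      otherwise     protocol_code = .;  /* unseen value -> missing */
   end;
   drop protocol;
run;
```

My worry is exactly that the codes 0/1/2 here are arbitrary, yet the NN will treat them as ordered quantities.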
Q2: Is it true that the imbalance in the numerical data (some variables in range [0-256], others in range [0-1 billion]) will affect the learning process? Do we have to do such preprocessing?
Q3: If I have to do such preprocessing, how can I scale [0-256] to [0-1] by NORMAL (min-max) SCALING? Is it true that I just divide every value by the maximum value?
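What I have in mind for Q3 is something like the following sketch (dataset and variable names X and MYDATA are placeholders; this assumes the minimum really is 0, so dividing by the maximum alone is enough):

```sas
/* Sketch: min-max scale X from [0, max] to [0, 1].          */
/* Grab the observed maximum into a macro variable first.    */
proc sql noprint;
   select max(x) into :xmax from mydata;
quit;

data scaled;
   set mydata;
   x_scaled = x / &xmax;   /* assumes min(x) = 0 */
run;
```

If the minimum is not 0, I suppose I would need (x - min) / (max - min) instead. Is that the right idea?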
Q4: How can I use LOGARITHMIC SCALING to scale [0-1 billion] to [0-1] in SAS?
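For Q4, is something like this sketch the right approach? (Again, names are placeholders; I am guessing at LOG10 here, and adding 1 only to avoid taking the log of 0. Since log10 of 1 billion is 9, this would put the values roughly in [0-10].)

```sas
/* Sketch: logarithmic scaling of X in [0, 1e9].             */
/* log10(1 + x) maps 0 -> 0 and 1e9 -> about 9.              */
data logscaled;
   set mydata;
   x_log = log10(1 + x);
run;
```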
Have a nice weekend.