| Date: | Thu, 22 Sep 2005 23:21:47 -0700 |
| Reply-To: | pa pa <ctll04@YAHOO.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | pa pa <ctll04@YAHOO.COM> |
| Subject: | Data Preprocessing questions |
| Content-Type: | text/plain; charset=iso-8859-1 |
Hi there,
I am using a dataset called KDD 99 as input to a neural network. I have also read some papers about this dataset.
The data contains some nominal variables and some numerical variables.
The papers that I read recommend preprocessing the data as follows:
+ nominal variables (some have more than 60 different values) -> encode to integers (0,1,2,3,4 ...)
+ numerical variables:
- After encoding the nominal variables, all of my data is numeric.
- Most of the variables are binary (0 or 1).
- Some have range [0-256].
- A few have range [0-1 billion].
The papers then SCALE the variables in range [0-256] to the range [0-1],
and those in range [0-1 billion] to [0-10] by logarithmic scaling.
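The preprocessing described above can be sketched roughly as follows. This is only an illustration in Python, not code from the papers: the function names, the sample values, the choice to divide by a fixed maximum of 256, and the +1 offset inside the logarithm (so that 0 maps to 0) are all assumptions.

```python
import math

def encode_nominal(values):
    """Map each distinct nominal value to an integer code 0, 1, 2, ...
    in order of first appearance (hypothetical helper)."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes

def scale_linear(x, max_value=256.0):
    """Scale a value from [0, max_value] to [0, 1] by dividing by the maximum."""
    return x / max_value

def scale_log(x):
    """Base-10 log scaling; the +1 offset is an assumption so that 0 stays 0.
    Values up to 1 billion land just above 9, i.e. inside [0, 10]."""
    return math.log10(x + 1.0)

# Illustrative sample values only
encoded, mapping = encode_nominal(["tcp", "udp", "tcp", "icmp"])
small = [scale_linear(v) for v in (0, 128, 256)]
large = [scale_log(v) for v in (0, 999, 1_000_000_000)]
```

Note that the integer codes produced this way carry an arbitrary order, which is exactly the concern behind Q1 below.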
My questions are:
Q1: Why do we have to encode the nominal data as integers? If a nominal value A is encoded as 1 and B as 2, the codes imply an order (1 < 2), but A is not less than B. I suspect that this is done to speed up the learning of the NN.
Q2: Is it true that the imbalance in the numerical data (range [0-256] versus range [0-1 billion]) will affect the learning process? Do we have to do such preprocessing?
Q3: If I have to do such preprocessing, how could I scale [0-256] to [0-1] by NORMAL SCALING? Is it true that I just divide every value by the maximum value?
Q4: How can I use LOGARITHMIC SCALING to scale [0-1 billion] to [0-1] in SAS?
Thanks
Have a nice weekend.
Patrick Tran