Romel:
If you think of a believable story as prior knowledge that holds you
back from making a false leap based on very limited data, then you will
be OK. In the real world it works better to start with a believable
story about what determines what, and then use data to make a case for
that story or reject it. In any situation you'll need enough statistical
power to reject or accept a model. With enough data, you can also
validate a model against data not used in the original estimation, and
evaluate its prediction errors in a data subset held out from
estimation.
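A minimal sketch of that hold-out idea, assuming the GNPPredictor dataset from the original post. The names GNPSplit, GNP_fit, holdout, Scored, and HoldoutErrors, the 30% split, and the seed are all hypothetical choices for illustration. PROC REG leaves rows with a missing response out of the estimation but still writes predicted values for them, which is what the sketch relies on.

```sas
/* Hypothetical hold-out split: hide GNP for roughly 30% of the nations. */
data GNPSplit;
    set GNPPredictor;
    if GNP = . then delete;               /* drop nations with no GNP value    */
    holdout = (ranuni(20061212) < 0.3);   /* 1 = validation row, 0 = fit row   */
    if holdout then GNP_fit = .;          /* response hidden from estimation   */
    else GNP_fit = GNP;
run;

/* Estimate on the fit rows only; predictions are still produced for all rows. */
proc reg data=GNPSplit;
    model GNP_fit = LBR DeathRate InfantDeath LifeEXPM LifeEXPF;
    output out=Scored p=pred;
run;
quit;

/* Prediction errors on the rows the model never saw. */
data HoldoutErrors;
    set Scored;
    where holdout = 1;
    abs_err = abs(GNP - pred);
run;

proc means data=HoldoutErrors n mean max;
    var abs_err;
run;
```

With fewer than 97 usable nations, the hold-out subset will be small, so the error summary is only a rough check.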
I agree that the group variable seemed more a prediction than a
predictor. As for exponential smoothing, you may find it more
instructive to start with simpler transforms of predictors
(y = f(x, x**2, ...)).
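As an illustration of a simple predictor transform, the sketch below adds a squared term for one predictor and fits both forms. It assumes the GNPPredictor dataset from the original post; the dataset name GNPPoly, the variable InfantDeathSq, the choice of InfantDeath, and the model labels are hypothetical.

```sas
/* Hypothetical example: add a squared term for a single predictor. */
data GNPPoly;
    set GNPPredictor;
    InfantDeathSq = InfantDeath**2;
run;

/* Fit the straight-line and quadratic forms side by side for comparison. */
proc reg data=GNPPoly;
    Linear:    model GNP = InfantDeath;
    Quadratic: model GNP = InfantDeath InfantDeathSq;
run;
quit;
```

Comparing the two fits and their residuals shows whether the curvature term is doing any work.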
I had to line up columns in the CARDS statement to read data copied from
your email, but I would hope that you have printed your SAS dataset and
checked to make sure that the program reads the data correctly. I also
substituted a period '.' for each '*' because SAS treats a period in
numeric input as a missing value. The procedures appeared to work fine.
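A short sketch of that check, using three rows from the posted data with the '*' replaced by a period. The dataset name GNPCheck is hypothetical. The LENGTH statement is an addition not in the original program: with list input a character variable defaults to 8 characters, so longer country names would otherwise be truncated.

```sas
/* Hypothetical check dataset: three rows copied from the posted data. */
data GNPCheck;
    length Country $ 24;   /* default of 8 would truncate long names */
    input index LBR DeathRate InfantDeath LifeEXPM LifeEXPF GNP Group Country $;
    cards;
3 13.4 11.7 11.3 71.8 77.7 2980 1 Czechoslovakia
4 12 12.4 7.6 69.8 75.9 . 1 Former_E._Germany
5 11.6 13.4 14.8 65.4 73.8 2780 1 Hungary
;
run;

/* Print the rows and count missing values to confirm the data were read as intended. */
proc print data=GNPCheck;
run;

proc means data=GNPCheck n nmiss min max;
run;
```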
Sig
-----Original Message-----
From: Romel Lira [mailto:rlira007@gmail.com]
Sent: Tuesday, December 12, 2006 3:59 PM
To: Sigurd Hermansen
Cc: sas-l@listserv.uga.edu
Subject: Re: Stats SAS Project for college statistics course,
issue with results
So in other words, if I can tell a believable story of how the
model I have fits the data, I should be ok?
I decided to remove the group designation and use that info
(group/GNP) for a separate calculation to demonstrate that there is a
significant difference between groups in terms of GNP. Mainly because of
the very reasons you state, I wasn't sure if it was possible to throw
both types of data into the same modeling, but I find it telling that
the most significant data from the ANOVA print out was the group number.
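One way to set up that separate group comparison is a one-way analysis of variance that treats Group as a classification variable, not as a number. This is a sketch assuming the GNPPredictor dataset from the original post; the Tukey comparison is one possible choice of follow-up test, not something specified in the assignment.

```sas
/* One-way ANOVA of GNP by country group, with Group as a CLASS variable. */
proc glm data=GNPPredictor;
    class Group;
    model GNP = Group;
    means Group / tukey;   /* pairwise comparisons of group means */
run;
quit;
```

In the posted PROC REG step, Group enters the model as a continuous predictor, so its coefficient there describes a linear trend across group numbers and not differences among the six groups.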
My main fear is not that the data is wrong, but that my SAS
programming is incorrect.
Should I use exponential smoothing on the GNP values?
Thanks for taking the time to write your response, you've
certainly provided a new outlook on my 2nd semester of statistics.
Romel
On 12/12/06, Sigurd Hermansen <HERMANS1@westat.com> wrote:
'Best model' in this situation might not be good enough for prime time,
but may help you learn much about data analysis and statistical
modelling. I'll offer a few comments about the data analysis and
predictive modelling side of your questions, but leave the statistical
estimation side of your questions to real statisticians.
First you'll need to tighten up your description of your data. It looks
as if you are trying to fit a model that has six continuous rate and
estimated mean variables, plus a categorical variable (Group) to
estimated GNP per capita. No respectable econometrician would claim that
your predictors would support a good predictive model for per capita
GNP, so let's assume that you are trying as an academic exercise to
develop the best model for these data. I see no reason to expect
predictors of a model based on these data to be 'in whack'.
SAS provides, and your instructor has helped by specifying, a number of
useful exploratory and diagnostic statistics and graphs. Focus on the
residuals (prediction errors). A bad model usually violates a
statistical estimation method's assumptions about the distribution of
prediction errors; for example, marked deviations from a random
distribution of prediction errors (OLS). Residuals much larger than most
(outliers) may indicate that a model omits important predictive
variables, and patterns in sets of residuals suggest a wrong form of
statistical model.
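A sketch of how the residuals could be saved and screened, assuming the GNPPredictor dataset from the original post. The dataset name Diag, the output variable names, and the cutoff of 2 for studentized residuals are hypothetical, rule-of-thumb choices.

```sas
/* Save predictions, residuals, and studentized residuals for inspection. */
proc reg data=GNPPredictor;
    model GNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF;
    output out=Diag p=pred r=resid rstudent=rstud;
    plot r.*p.;            /* residuals against predicted values */
run;
quit;

/* List the nations whose residuals stand out from the rest. */
proc print data=Diag;
    where abs(rstud) > 2;
    var Country GNP pred resid rstud;
run;
```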
A quick review of documentation of diagnostic statistics, such as Cook's
D, and attention to plots of residuals, will help you learn more about
assumptions underlying statistical models. Simpler linear models
certainly can generate negative predictions of variables that have
strictly positive domains. Exponential transformations of dependent and
predictive values may improve the fit of the model to data.
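One common way to keep predictions of a strictly positive variable from going negative is to model its logarithm and transform the predictions back, since exp() of any value is positive. The sketch below is an illustration under that assumption and not necessarily the transformation meant above. It assumes the GNPPredictor dataset from the original post; GNPLog, logGNP, LogFit, LogFitBack, and the output variable names are hypothetical.

```sas
/* Hypothetical log-scale response. */
data GNPLog;
    set GNPPredictor;
    if GNP > 0 then logGNP = log(GNP);   /* stays missing where GNP is missing */
run;

/* Fit on the log scale and keep Cook's D alongside the predictions. */
proc reg data=GNPLog;
    model logGNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF;
    output out=LogFit p=predLog r=residLog cookd=cooksD;
run;
quit;

/* Back-transform: exp() of any prediction is positive. */
data LogFitBack;
    set LogFit;
    predGNP = exp(predLog);
run;
```

The back-transformed value is closer to a median than a mean prediction on the original scale, so it should be read with that in mind.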
Always keep in mind the central question: what determines the variable
that you are trying to predict (per capita GNP)? As George Box once
said, "All models are wrong. Some are useful." You will learn that
eventually everything determines everything else. Since death rates and
life expectancies likely depend as much on per capita GNP as per capita
GNP depends on them, I'd focus on the fit of variables that could have
positive or adverse impact on economic conditions. Remember that an
implausible linear relation between a predictor and an outcome requires
support from extraordinary evidence. In the sample that you have (likely
summarized over a short interval of time), the chances of discovering a
misleading sample likely exceed the chances of discovering a previously
unknown truth. A good fit does not necessarily yield a good model.
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of rlira007@gmail.com
Sent: Tuesday, December 12, 2006 4:31 AM
To: sas-l@uga.edu
Subject: Stats SAS Project for college statistics course, issue with results
Hi, I'm attempting to analyze data for 97 nations with 7 or 8
predictors, I believe. Anyway, here is the direct request: Analyze these
data to estimate the best model that describes the relationship between
the response (Gross National Product) and the predictors (all other
variables except for country). Can these variables be used to predict
GNP? Which of these variables are the most important? Are there
significant differences among the 6 groups of countries?
This is the program I put together for this problem:
data GNPPredictor;
    input index LBR DeathRate InfantDeath LifeEXPM LifeEXPF GNP Group Country $;
    cards;
1 24.7 5.7 30.8 69.6 75.5 600 1 Albania
2 12.5 11.9 14.4 68.3 74.7 2250 1 Bulgaria
3 13.4 11.7 11.3 71.8 77.7 2980 1 Czechoslovakia
4 12 12.4 7.6 69.8 75.9 * 1 Former_E._Germany
5 11.6 13.4 14.8 65.4 73.8 2780 1 Hungary
6 14.3 10.2 16 67.2 75.7 1690 1 Poland
7 13.6 10.7 26.9 66.5 72.4 1640 1 Romania
8 14 9 20.2 68.6 74.5 * 1 Yugoslavia
9 17.7 10 23 64.6 74 2242 1 USSR
10 15.2 9.5 13.1 66.4 75.9 1880 1 Byelorussian_SSR
11 13.4 11.6 13 66.4 74.8 1320 1 Ukrainian_SSR
12 20.7 8.4 25.7 65.5 72.7 2370 2 Argentina
13 46.6 18 111 51 55.4 630 2 Bolivia
14 28.6 7.9 63 62.3 67.6 2680 2 Brazil
15 23.4 5.8 17.1 68.1 75.1 1940 2 Chile
16 27.4 6.1 40 63.4 69.2 1260 2 Columbia
17 32.9 7.4 63 63.4 67.6 980 2 Ecuador
18 28.3 7.3 56 60.4 66.1 330 2 Guyana
19 34.8 6.6 42 64.4 68.5 1110 2 Paraguay
20 32.9 8.3 109.9 56.8 66.5 1160 2 Peru
21 18 9.6 21.9 68.4 74.9 2560 2 Uruguay
22 27.5 4.4 23.3 66.7 72.8 2560 2 Venezuela
23 29 23.2 43 62.1 66 2490 2 Mexico
24 12 10.6 7.9 70 76.8 15540 3 Belgium
25 13.2 10.1 5.8 70.7 78.7 26040 3 Finland
26 12.4 11.9 7.5 71.8 77.7 22080 3 Denmark
27 13.6 9.4 7.4 72.3 80.5 19490 3 France
28 11.4 11.2 7.4 71.8 78.4 22320 3 Germany
29 10.1 9.2 11 65.4 74 5990 3 Greece
30 15.1 9.1 7.5 71 76.7 9550 3 Ireland
31 9.7 9.1 8.8 72 78.6 16830 3 Italy
32 13.2 8.6 7.1 73.3 79.9 17320 3 Netherlands
33 14.3 10.7 7.8 67.2 75.7 23120 3 Norway
34 11.9 9.5 13.1 66.5 72.4 7600 3 Portugal
35 10.7 8.2 8.1 72.5 78.6 11020 3 Spain
36 14.5 11.1 5.6 74.2 80 23660 3 Sweden
37 12.5 9.5 7.1 73.9 80 34064 3 Switzerland
38 13.6 11.5 8.4 72.2 77.9 16100 3 U.K.
39 14.9 7.4 8 73.3 79.6 17000 3 Austria
40 9.9 6.7 4.5 75.9 81.8 25430 3 Japan
41 14.5 7.3 7.2 73 79.8 20470 3 Canada
42 16.7 8.1 9.1 71.5 78.3 21790 3 U.S.A.
43 40.4 18.7 181.6 41 42 168 5 Afghanistan
44 28.4 3.8 16 66.8 69.4 6340 4 Bahrain
45 42.5 11.5 108.1 55.8 55 2490 4 Iran
46 42.6 7.8 69 63 64.8 3020 4 Iraq
47 22.3 6.3 9.7 73.9 77.4 10920 4 Israel
48 38.9 6.4 44 64.2 67.8 1240 4 Jordan
49 26.8 2.2 15.6 71.2 75.4 16150 4 Kuwait
50 31.7 8.7 48 63.1 67 * 4 Lebanon
51 45.6 7.8 40 62.2 65.8 5220 4 Oman
52 42.1 7.6 71 61.7 65.2 7050 4 Saudi_Arabia
53 29.2 8.4 76 62.5 65.8 1630 4 Turkey
54 22.8 3.8 26 68.6 72.9 19860 4 United_Arab_Emirates
55 42.2 15.5 119 56.9 56 210 5 Bangladesh
56 41.4 16.6 130 47 49.9 * 5 Cambodia
57 21.2 6.7 32 68 70.9 380 5 China
58 11.7 4.9 6.1 74.3 80.1 14210 5 Hong_Kong
59 30.5 10.2 91 52.5 52.1 350 5 India
60 28.6 9.4 75 58.5 62 570 5 Indonesia
61 23.5 18.1 25 66.2 72.7 * 5 Korea
62 31.6 5.6 24 67.5 71.6 2320 5 Malaysia
63 36.1 8.8 68 60 62.5 110 5 Mongolia
64 39.6 14.8 128 50.9 48.1 170 5 Nepal
65 30.3 8.1 107.7 59 59.2 380 5 Pakistan
66 33.2 7.7 45 62.5 66.1 730 5 Philippines
67 17.8 5.2 7.5 68.7 74 11160 5 Singapore
68 21.3 6.2 19.4 67.8 71.7 470 5 Sri_Lanka
69 22.3 7.7 28 63.8 68.9 1420 5 Thailand
70 31.8 9.5 64 63.7 67.9 * 5 Vietnam
71 35.5 8.3 74 61.6 63.3 2060 6 Algeria
72 47.2 20.2 137 42.9 46.1 610 6 Angola
73 48.5 11.6 67 52.3 59.7 2040 6 Botswana
74 46.1 14.6 73 50.1 55.3 1010 6 Congo
75 38.8 9.5 49.4 57.8 60.3 600 6 Egypt
76 48.6 20.7 137 42.4 45.6 120 6 Ethiopia
77 39.4 16.8 103 49.9 53.2 390 6 Gabon
78 47.4 21.4 143 41.4 44.6 260 6 Gambia
79 44.4 13.1 90 52.2 55.8 390 6 Ghana
80 47 11.3 72 56.5 60.5 370 6 Kenya
81 44 9.4 82 59.1 62.56 5310 6 Libya
82 48.3 25 130 38.1 41.2 200 6 Malawi
83 35.5 9.8 82 59.1 62.5 960 6 Morocco
84 45 18.5 141 44.9 48.1 80 6 Mozambique
85 44 12.1 135 55 57.5 1030 6 Namibia
86 48.5 15.6 105 48.8 52.2 360 6 Nigeria
87 48.2 23.4 154 39.4 42.6 240 6 Sierra_Leone
88 50.1 20.2 132 43.4 46.6 120 6 Somalia
89 32.1 9.9 72 57.5 63.5 2530 6 South_Africa
90 44.6 15.8 108 48.6 51 480 6 Sudan
91 46.8 12.5 118 42.9 49.5 810 6 Swaziland
92 31.1 7.3 52 64.9 66.4 1440 6 Tunisia
93 52.2 15.6 103 49.9 52.7 220 6 Uganda
94 50.5 14 106 51.3 54.7 110 6 Tanzania
95 45.6 14.2 83 50.3 53.7 220 6 Zaire
96 51.1 13.7 80 50.4 52.5 420 6 Zambia
97 41.7 10.3 66 56.5 60.1 640 6 Zimbabwe
;
proc gplot data=GNPPredictor;
    plot GNP*LBR;
    plot GNP*DeathRate;
    plot GNP*InfantDeath;
    plot GNP*LifeEXPM;
    plot GNP*LifeEXPF;
    plot GNP*Group;
run;
proc univariate data=GNPPredictor plot normal;
    var GNP;
    histogram GNP / normal kernel(L=2);
    qqplot GNP / normal (L=1 mu=est sigma=est);
run;
proc reg data=GNPPredictor;
    model GNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF Group / p r clb cli clm;
    plot r.*p.;
run;
quit;
Okay, my issue is that after running, the predicted GNP values are
completely out of whack compared with my stated GNP data. How can I
have a predicted GNP that's in the negative? Where is my program wrong?
Hopefully someone can give me at least a hint of what to do.