Date: Tue, 12 Dec 2006 15:49:40 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Stats SAS Project for college statistics course,
issue with results
In-Reply-To: <1165915833.853418.239770@80g2000cwy.googlegroups.com>
'Best model' in this situation might not be good enough for prime time,
but may help you learn much about data analysis and statistical
modelling. I'll offer a few comments about the data analysis and
predictive modelling side of your questions, but leave the statistical
estimation side of your questions to real statisticians.
First you'll need to tighten up your description of your data. It looks
as if you are trying to fit a model that relates five continuous rate and
estimated mean variables, plus a categorical variable (Group), to
estimated GNP per capita. No respectable econometrician would claim that
your predictors would support a good predictive model for per capita
GNP, so let's assume that you are trying as an academic exercise to
develop the best model for these data. I see no reason to expect
predictions of a model based on these data to be 'in whack'.
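
Since Group is categorical, here is a minimal sketch of one way to enter it
as a classification variable. It assumes the GNPPredictor data set from the
program quoted below. PROC REG has no CLASS statement, so it reads Group as
a numeric score from 1 to 6; PROC GLM is my substitution here, not something
the assignment specifies.

```sas
/* Sketch only: treat Group as a classification variable, not a number. */
proc glm data=GNPPredictor;
  class Group;                       /* one effect level per country group */
  model GNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF Group / solution;
  lsmeans Group / pdiff adjust=tukey; /* pairwise group comparisons */
run;
quit;
```

The LSMEANS output speaks to the question about differences among the six
groups after adjusting for the continuous predictors.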
SAS provides, and your instructor has helped by specifying, a number of
useful exploratory and diagnostic statistics and graphs. Focus on the
residuals (prediction errors). A bad model usually violates a
statistical estimation method's assumptions about the distribution of
prediction errors; for example, OLS assumes that prediction errors
scatter randomly around zero, so marked deviations from a random
pattern signal trouble. Residuals much larger than most (outliers) may
indicate that a model omits important predictive variables, and
patterns in sets of residuals suggest a wrong form of statistical model.
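
As a rough sketch of how to get at the residuals, the OUTPUT statement of
PROC REG can write them to a data set for sorting and plotting. The data set
name GNPDiag and the variable names Pred, Resid, StudResid, and CooksD are
made up for illustration. Group is left out here because PROC REG would
treat it as a numeric score.

```sas
/* Sketch only: save predictions, residuals, and Cook's D for inspection. */
proc reg data=GNPPredictor;
  model GNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF;
  output out=GNPDiag p=Pred r=Resid student=StudResid cookd=CooksD;
run;
quit;

/* List the most influential countries first. */
proc sort data=GNPDiag;
  by descending CooksD;
run;

proc print data=GNPDiag(obs=10);
  var Country GNP Pred Resid StudResid CooksD;
run;

/* Residuals against predictions, with a reference line at zero. */
proc gplot data=GNPDiag;
  plot Resid*Pred / vref=0;
run;
quit;
```

A funnel or curve in the last plot, rather than an even band around zero,
would point to the wrong functional form.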
A quick review of documentation of diagnostic statistics, such as Cook's
D, and attention to plots of residuals, will help you learn more about
assumptions underlying statistical models. Simpler linear models
certainly can generate negative predictions of variables that have
strictly positive domains. Logarithmic or exponential transformations of
the dependent and predictor variables may improve the fit of the model to
data.
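
One way to try a transformation, offered as a sketch and not as the right
answer for your project, is to model the logarithm of GNP. That amounts to
an exponential model on the original scale, so back-transformed predictions
cannot fall below zero. The names GNPLog, LogGNP, GNPLogPred, PredLog,
ResidLog, and PredGNP are my own placeholders.

```sas
/* Sketch only: fit log(GNP), then back-transform the predictions. */
data GNPLog;
  set GNPPredictor;
  if GNP > 0 then LogGNP = log(GNP);  /* missing GNP stays missing */
run;

proc reg data=GNPLog;
  model LogGNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF;
  output out=GNPLogPred p=PredLog r=ResidLog;
run;
quit;

data GNPLogPred;
  set GNPLogPred;
  PredGNP = exp(PredLog);  /* always positive on the original scale */
run;
```

Keep in mind that exp(PredLog) estimates something closer to a median than
a mean of GNP, so compare residuals on the log scale when judging fit.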
Always keep in mind the central question: what determines the variable
that you are trying to predict (per capita GNP). As George Box once
said, "All models are wrong. Some are useful." You will learn that
eventually everything determines everything else. Since death rates and
life expectancies likely depend as much on per capita GNP as per capita
GNP depends on them, I'd focus on the fit of variables that could have
positive or adverse impact on economic conditions. Remember that an
implausible linear relation between a predictor and an outcome requires
support from extraordinary evidence. In the sample that you have (likely
summarized over a short interval of time), the chances of discovering a
misleading sample likely exceed the chances of discovering a previously
unknown truth. A good fit does not necessarily yield a good model.
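
A rough check on whether a fit depends heavily on individual countries is
to compare ordinary residuals with leave-one-out (PRESS) residuals. This is
a sketch under my own naming (GNPPress, Resid, PressResid), not a full
validation.

```sas
/* Sketch only: compare in-sample error with leave-one-out error. */
proc reg data=GNPPredictor;
  model GNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF;
  output out=GNPPress r=Resid press=PressResid;
run;
quit;

/* USS is the uncorrected sum of squares: the error sum of squares   */
/* for Resid and the PRESS statistic for PressResid.                 */
proc means data=GNPPress n uss;
  var Resid PressResid;
run;
```

A PRESS value far above the ordinary error sum of squares suggests that a
few observations drive the fit.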
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of rlira007@gmail.com
Sent: Tuesday, December 12, 2006 4:31 AM
To: sas-l@uga.edu
Subject: Stats SAS Project for college statistics course, issue with
results
Hi, I'm attempting to analyze data for 97 nations with 7 or 8
predictors, I believe. Anyway, here is the direct request: Analyze these data to
estimate the best model that describes the relationship between the
response (Gross National Product) and the predictors (all other
variables except for country). Can these variables be used to predict
GNP? Which of these variables are the most important? Are there
significant differences among the 6 groups of countries?
This is the program I put together for this problem:
data GNPPredictor;
input index LBR DeathRate InfantDeath LifeEXPM LifeEXPF GNP
Group Country $;
cards;
1 24.7 5.7 30.8 69.6 75.5 600 1 Albania
2 12.5 11.9 14.4 68.3 74.7 2250 1 Bulgaria
3 13.4 11.7 11.3 71.8 77.7 2980 1 Czechoslovakia
4 12 12.4 7.6 69.8 75.9 * 1 Former_E._Germany
5 11.6 13.4 14.8 65.4 73.8 2780 1 Hungary
6 14.3 10.2 16 67.2 75.7 1690 1 Poland
7 13.6 10.7 26.9 66.5 72.4 1640 1 Romania
8 14 9 20.2 68.6 74.5 * 1 Yugoslavia
9 17.7 10 23 64.6 74 2242 1 USSR
10 15.2 9.5 13.1 66.4 75.9 1880 1 Byelorussian_SSR
11 13.4 11.6 13 66.4 74.8 1320 1 Ukrainian_SSR
12 20.7 8.4 25.7 65.5 72.7 2370 2 Argentina
13 46.6 18 111 51 55.4 630 2 Bolivia
14 28.6 7.9 63 62.3 67.6 2680 2 Brazil
15 23.4 5.8 17.1 68.1 75.1 1940 2 Chile
16 27.4 6.1 40 63.4 69.2 1260 2 Columbia
17 32.9 7.4 63 63.4 67.6 980 2 Ecuador
18 28.3 7.3 56 60.4 66.1 330 2 Guyana
19 34.8 6.6 42 64.4 68.5 1110 2 Paraguay
20 32.9 8.3 109.9 56.8 66.5 1160 2 Peru
21 18 9.6 21.9 68.4 74.9 2560 2 Uruguay
22 27.5 4.4 23.3 66.7 72.8 2560 2 Venezuela
23 29 23.2 43 62.1 66 2490 2 Mexico
24 12 10.6 7.9 70 76.8 15540 3 Belgium
25 13.2 10.1 5.8 70.7 78.7 26040 3 Finland
26 12.4 11.9 7.5 71.8 77.7 22080 3 Denmark
27 13.6 9.4 7.4 72.3 80.5 19490 3 France
28 11.4 11.2 7.4 71.8 78.4 22320 3 Germany
29 10.1 9.2 11 65.4 74 5990 3 Greece
30 15.1 9.1 7.5 71 76.7 9550 3 Ireland
31 9.7 9.1 8.8 72 78.6 16830 3 Italy
32 13.2 8.6 7.1 73.3 79.9 17320 3 Netherlands
33 14.3 10.7 7.8 67.2 75.7 23120 3 Norway
34 11.9 9.5 13.1 66.5 72.4 7600 3 Portugal
35 10.7 8.2 8.1 72.5 78.6 11020 3 Spain
36 14.5 11.1 5.6 74.2 80 23660 3 Sweden
37 12.5 9.5 7.1 73.9 80 34064 3 Switzerland
38 13.6 11.5 8.4 72.2 77.9 16100 3 U.K.
39 14.9 7.4 8 73.3 79.6 17000 3 Austria
40 9.9 6.7 4.5 75.9 81.8 25430 3 Japan
41 14.5 7.3 7.2 73 79.8 20470 3 Canada
42 16.7 8.1 9.1 71.5 78.3 21790 3 U.S.A.
43 40.4 18.7 181.6 41 42 168 5 Afghanistan
44 28.4 3.8 16 66.8 69.4 6340 4 Bahrain
45 42.5 11.5 108.1 55.8 55 2490 4 Iran
46 42.6 7.8 69 63 64.8 3020 4 Iraq
47 22.3 6.3 9.7 73.9 77.4 10920 4 Israel
48 38.9 6.4 44 64.2 67.8 1240 4 Jordan
49 26.8 2.2 15.6 71.2 75.4 16150 4 Kuwait
50 31.7 8.7 48 63.1 67 * 4 Lebanon
51 45.6 7.8 40 62.2 65.8 5220 4 Oman
52 42.1 7.6 71 61.7 65.2 7050 4 Saudi_Arabia
53 29.2 8.4 76 62.5 65.8 1630 4 Turkey
54 22.8 3.8 26 68.6 72.9 19860 4 United_Arab_Emirates
55 42.2 15.5 119 56.9 56 210 5 Bangladesh
56 41.4 16.6 130 47 49.9 * 5 Cambodia
57 21.2 6.7 32 68 70.9 380 5 China
58 11.7 4.9 6.1 74.3 80.1 14210 5 Hong_Kong
59 30.5 10.2 91 52.5 52.1 350 5 India
60 28.6 9.4 75 58.5 62 570 5 Indonesia
61 23.5 18.1 25 66.2 72.7 * 5 Korea
62 31.6 5.6 24 67.5 71.6 2320 5 Malaysia
63 36.1 8.8 68 60 62.5 110 5 Mongolia
64 39.6 14.8 128 50.9 48.1 170 5 Nepal
65 30.3 8.1 107.7 59 59.2 380 5 Pakistan
66 33.2 7.7 45 62.5 66.1 730 5 Philippines
67 17.8 5.2 7.5 68.7 74 11160 5 Singapore
68 21.3 6.2 19.4 67.8 71.7 470 5 Sri_Lanka
69 22.3 7.7 28 63.8 68.9 1420 5 Thailand
70 31.8 9.5 64 63.7 67.9 * 5 Vietnam
71 35.5 8.3 74 61.6 63.3 2060 6 Algeria
72 47.2 20.2 137 42.9 46.1 610 6 Angola
73 48.5 11.6 67 52.3 59.7 2040 6 Botswana
74 46.1 14.6 73 50.1 55.3 1010 6 Congo
75 38.8 9.5 49.4 57.8 60.3 600 6 Egypt
76 48.6 20.7 137 42.4 45.6 120 6 Ethiopia
77 39.4 16.8 103 49.9 53.2 390 6 Gabon
78 47.4 21.4 143 41.4 44.6 260 6 Gambia
79 44.4 13.1 90 52.2 55.8 390 6 Ghana
80 47 11.3 72 56.5 60.5 370 6 Kenya
81 44 9.4 82 59.1 62.56 5310 6 Libya
82 48.3 25 130 38.1 41.2 200 6 Malawi
83 35.5 9.8 82 59.1 62.5 960 6 Morocco
84 45 18.5 141 44.9 48.1 80 6 Mozambique
85 44 12.1 135 55 57.5 1030 6 Namibia
86 48.5 15.6 105 48.8 52.2 360 6 Nigeria
87 48.2 23.4 154 39.4 42.6 240 6 Sierra_Leone
88 50.1 20.2 132 43.4 46.6 120 6 Somalia
89 32.1 9.9 72 57.5 63.5 2530 6 South_Africa
90 44.6 15.8 108 48.6 51 480 6 Sudan
91 46.8 12.5 118 42.9 49.5 810 6 Swaziland
92 31.1 7.3 52 64.9 66.4 1440 6 Tunisia
93 52.2 15.6 103 49.9 52.7 220 6 Uganda
94 50.5 14 106 51.3 54.7 110 6 Tanzania
95 45.6 14.2 83 50.3 53.7 220 6 Zaire
96 51.1 13.7 80 50.4 52.5 420 6 Zambia
97 41.7 10.3 66 56.5 60.1 640 6 Zimbabwe
;
proc gplot data=GNPPredictor;
plot GNP*LBR;
plot GNP*DeathRate;
plot GNP*InfantDeath;
plot GNP*LifeEXPM;
plot GNP*LifeEXPF;
plot GNP*Group;
run;
proc univariate data=GNPPredictor plot normal;
var GNP;
histogram GNP / normal kernel(L=2);
qqplot GNP / normal (L=1 mu=est sigma=est);
run;
proc reg data=GNPPredictor;
model GNP = LBR DeathRate InfantDeath LifeEXPM LifeEXPF
Group/p r clb cli clm;
plot r.*p.;
run;
quit;
Okay, my issue is that after running, the end results of predicted GNP
as opposed to my stated GNP data are completely out of whack. How can I
have a predicted GNP that's in the negative? Where is my program wrong?
Hopefully someone can give me at least a hint of what to do.