| Date: | Sat, 19 Dec 2009 11:28:08 -0600 |
| Reply-To: | Satindra Chakravorty <satindra@GMAIL.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Satindra Chakravorty <satindra@GMAIL.COM> |
| Subject: | Re: OT: World Cup Soccer Model |
|
| In-Reply-To: | <1ca751e30912190816s3e9f7364h832727daad252d99@mail.gmail.com> |
| Content-Type: | text/plain; charset=ISO-8859-1 |
|---|
This is an interesting topic. However, your email raises several questions.
1. The formulation of your model suggests that the predictor variables are
based on information known only AFTER the dependent variable data (i.e. the
outcome of the World Cup) is known. I am not sure how you would predict the
winner of the 2010 tournament if this is the case.
2. I am assuming the unit of observation in your modeling sample is a
participating country; hence your N of 32, since there are currently 32
nations in the World Cup. There have been 18 World Cups held so far. If each
of the 32 nations had played in each of the 18 World Cups, you might have 18
records per country. However, not every one of the 32 countries scheduled to
play in the 2010 Cup has appeared in each of the previous Cups. In fact, if I
am not mistaken, about half of these 32 countries may have participated
in only 3-4 Cups, so you have a significant missing data issue on your
hands.
3. I don't know what your predictor variables are. For the teams that have
had several Cup appearances, you are using observations over a very long
period of time (1930 - present). There have been significant changes in many
factors that might influence the level of play for a given country over
time. Typically, one would want to model using data that is representative
of the future data that the model will be applied to. Since you don't have the
luxury of simply discarding old data, which would significantly reduce your
sample size, are you doing anything else to account for time-based effects
on the predictor variables?
4. For validation purposes, one would typically have a portion of data
similar to that on which the model is trained held out from model
fitting. The same predictor variables used in the model would be constructed
using the holdout validation data and this would then be scored using the
model. Again, you probably can't set aside any portion of the modeling data
for validation due to a restricted sample size. In such cases one option
might be to find somewhat similar data to test the model on. The FIFA
Confederations Cup comes to mind. I don't know how long a history this
tournament has; however, it is a dress rehearsal for the World Cup and many
World Cup participating teams play in the Confederations Cup. If you could
construct the same model attributes using Confederations Cup data, maybe you
could use outcomes from this tournament to validate your World Cup winner
prediction model?
5. Finally, have you considered other modeling techniques? A decision tree
comes to mind - non-parametric, robust, easily handles missing data,
naturally handles interactions, etc.
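
To make points 4 and 5 a bit more concrete, here is a rough Python sketch
(not SAS, and not a recommendation of any particular procedure). It fits a
one-split tree on one predictor, sends rows with a missing value down their
own branch, and then scores a separate validation set. Everything in it is
an assumption for illustration: the function names, the single predictor,
and the numbers, which are made up and are not real tournament data. Real
tree software handles missing values in various ways (surrogate splits, for
instance), so a separate "missing" branch is just one simple option.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels, default):
    """Most common 0/1 label (ties go to 0), or `default` for an empty branch."""
    if not labels:
        return default
    return 1 if 2 * sum(labels) > len(labels) else 0


def fit_stump(values, labels):
    """Fit a one-split tree; rows whose value is None get their own branch.

    Returns None when no threshold separates the non-missing rows.
    """
    present = [(v, y) for v, y in zip(values, labels) if v is not None]
    missing = [y for v, y in zip(values, labels) if v is None]
    overall = majority(labels, 0)
    best = None
    for threshold in sorted({v for v, _ in present}):
        left = [y for v, y in present if v <= threshold]
        right = [y for v, y in present if v > threshold]
        if not left or not right:
            continue
        # Size-weighted impurity over the three branches.
        impurity = (len(left) * gini(left) + len(right) * gini(right)
                    + len(missing) * gini(missing)) / len(labels)
        if best is None or impurity < best["impurity"]:
            best = {"threshold": threshold, "impurity": impurity,
                    "left": majority(left, overall),
                    "right": majority(right, overall),
                    "missing": majority(missing, overall)}
    return best


def predict(stump, value):
    """Predicted 0/1 label for one predictor value (None means missing)."""
    if value is None:
        return stump["missing"]
    return stump["left"] if value <= stump["threshold"] else stump["right"]


def accuracy(stump, values, labels):
    """Share of rows whose predicted label matches the observed label."""
    hits = sum(predict(stump, v) == y for v, y in zip(values, labels))
    return hits / len(labels)


# Made-up modeling data: one predictor (None = missing) and a 0/1 outcome.
train_values = [1, 2, None, 8, 9]
train_labels = [0, 0, 0, 1, 1]

# Made-up validation data, standing in for a similar but separate tournament.
valid_values = [3, None, 7, 1]
valid_labels = [0, 0, 1, 1]

stump = fit_stump(train_values, train_labels)
valid_accuracy = accuracy(stump, valid_values, valid_labels)
print(stump["threshold"], valid_accuracy)
```

The gap between the perfect fit on the modeling rows and the weaker result
on the validation rows is the kind of thing an outside data set is meant to
expose.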
Satindra.
On Sat, Dec 19, 2009 at 10:16 AM, sudip chatterjee
<sudip.memphis@gmail.com> wrote:
> Dear Users,
>
> I must start with the fact that I am a fanatic soccer fan. Most of you might
> know that in June there will be World Cup soccer in South Africa. So,
> prediction models are floating around in terms of who will win the World Cup
> this time. My interest, knowledge and experience provoked me to make a
> prediction model (who will win the World Cup) this year. I went to the FIFA
> website & collected all relevant information about the teams taking part in
> this year's World Cup & also about past World Cup facts. I made the model & it
> seems I need to validate the model before I start discussing the results, so
> here are my questions:
>
> 1) My data collection forced me to model in this way
> depVar(t-1) = predVar(t)
>
> I was wondering if this kind of modeling sounds OK? Do I need to add any
> special remark while doing this kind of modeling? I am using simple logistic
> regression, where my N = 32 and my depvar is the information on whether a
> country has won the World Cup before, from 1930 - 2006. My predictors are
> current information.
>
> 2) After fitting my logistic regression model I want to check the results
> through a simulation process. What kind of procs will help me to do that?
>
> I must say that I have no commercial interest but sheer interest & I work on
> this model only during weekends.
>
> I wish all of you a Merry Christmas in advance!
>
> regards
>
|