Date: Sat, 9 Apr 2005 13:49:31 -0400
Reply-To: Peter Flom <flom@NDRI.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <flom@NDRI.ORG>
Subject: Interesting data and regression question (long)
Content-Type: text/plain; charset=US-ASCII
Here's another one for the stats guys.....I think I've mentioned this
data before, but will give context.
I am writing a paper, a large part of which will be a comparison of
different ways of analyzing the following:
I have a data set based on several thousand drug injectors.
The DV is number of people you shared needles with. There are a LOT of
zeros, a lot of ones, and a long long right tail, with a max of 200.
Clearly, some of the high numbers are guesses.
There are about 15 IVs, including categorical and continuous variables.
These include demographics and drug use questions, as well as a question
on 'year of interview', included as a covariate as an orthogonal
polynomial (we aren't particularly interested in it, but it needs to be
accounted for).
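
For readers unfamiliar with the orthogonal polynomial coding mentioned above, here is a minimal sketch in plain Python. The function name orthogonal_poly and the example years are hypothetical and not taken from the data; the idea is simply to center the year, raise it to each power, and remove the parts already explained by lower-order terms (Gram-Schmidt), so the linear and quadratic year terms are uncorrelated with each other and with the intercept.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_poly(x, degree):
    """Return `degree` orthonormal polynomial columns for the values in x.

    Assumes x has more distinct values than `degree`; otherwise a column
    collapses to zero and the normalization below divides by zero.
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    basis = [[1.0] * n]  # constant column, used only for orthogonalization
    for d in range(1, degree + 1):
        col = [c ** d for c in centered]
        # Subtract the projection onto every lower-order column.
        for b in basis:
            coef = _dot(col, b) / _dot(b, b)
            col = [ci - coef * bi for ci, bi in zip(col, b)]
        norm = math.sqrt(_dot(col, col))
        basis.append([ci / norm for ci in col])
    return basis[1:]  # drop the constant column


# Hypothetical interview years, one entry per respondent.
years = [1990, 1990, 1991, 1992, 1993, 1995]
linear, quadratic = orthogonal_poly(years, 2)
```

The returned columns can be entered as ordinary covariates in any of the models discussed below.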
Issues: 1) The changes in the DV over time are mostly due to outliers.
2) There is substantial interest in the outliers - while the people who
say they shared with a lot of people are guessing, it's likely that
someone who guesses "200" is higher than someone who guesses "100", etc.
There are substantive reasons to want to be able to explain these values
in particular. 3) There are actually 2 DVs: Distributive and receptive
partners - right now, I am treating these as separate, and not doing
anything multivariate with them, but may need to change this.
Methods: Thus far, I've applied a variety of models, none of which do
all I want: 1) Ordinal logistic based on a categorized DV (this lumps
all DV values over a certain number together - not ideal). 2)
Regression trees (useful, but they don't give parameter estimates). 3) A
variety of count regression models (Poisson, neg. binomial, and zero
inflated versions of both) (the models without zero-inflation don't fit
well; the zero inflated models are hard to interpret, and not very
numerically stable for these data).
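
To make the interpretation issue concrete, here is a minimal sketch of the zero-inflated Poisson probability function in plain Python. The parameter names lam (Poisson mean) and pi (probability of a structural zero) and the values used are illustrative assumptions, not estimates from these data. A zero can come from either part of the mixture, which is why each covariate ends up with two sets of coefficients (one for pi, one for lam).

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability of observing count k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson probability of observing count k.

    With probability pi the respondent is a structural zero (never
    shares); otherwise the count is drawn from Poisson(lam).
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Illustrative values only.
p_zero_plain = poisson_pmf(0, 2.0)
p_zero_inflated = zip_pmf(0, 2.0, 0.3)
total = sum(zip_pmf(k, 2.0, 0.3) for k in range(60))
```

The zero-inflated negative binomial has the same two-part structure, with the Poisson piece replaced by a negative binomial.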
I am now considering a series of (regular) logistic regressions: 0 vs.
more; 1 vs. more among sharers; less than 10 vs. more, etc. I've never
seen this done; there doesn't seem to be any reason NOT to do it, but I
was curious if anyone had a reference to this.
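
Here is a minimal sketch, in plain Python, of how the binary outcomes for such a series could be set up. The function name sequential_outcomes, the cutpoints, and the example counts are hypothetical. It also assumes that every stage after the first is restricted to people who passed the previous cutpoint; the "less than 10 vs. more" split could instead be run on the full sample.

```python
def sequential_outcomes(counts, cutpoints=(1, 2, 10)):
    """Build one binary outcome per cutpoint.

    Stage with cutpoint c: keep respondents whose count reached the
    previous cutpoint, and code y = 1 if count >= c, else 0.
    Each stage's rows are (respondent index, y) pairs.
    """
    stages = []
    floor = 0  # the first stage uses everyone
    for cut in cutpoints:
        rows = [(i, 1 if c >= cut else 0)
                for i, c in enumerate(counts) if c >= floor]
        stages.append({"cutpoint": cut, "rows": rows})
        floor = cut  # the next stage conditions on passing this one
    return stages


# Hypothetical partner counts for seven respondents.
counts = [0, 0, 1, 1, 3, 12, 200]
stages = sequential_outcomes(counts)
for stage in stages:
    print(stage["cutpoint"], [y for _, y in stage["rows"]])
```

Each stage's rows would then be merged back to the covariates by respondent index and fit with an ordinary logistic regression.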
Any other ideas also welcome
Thanks
Peter
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)