```
Date: Sat, 9 Apr 2005 13:49:31 -0400
Reply-To: Peter Flom
Sender: "SAS(r) Discussion"
From: Peter Flom
Subject: Interesting data and regression question (long)
Content-Type: text/plain; charset=US-ASCII

Here's another one for the stats guys... I think I've mentioned this data
before, but will give context.

I am writing a paper, a large part of which will be a comparison of
different ways of analyzing the following:

I have a data set based on several thousand drug injectors. The DV is the
number of people you shared needles with. There are a LOT of zeros, a lot of
ones, and a long, long right tail, with a max of 200. Clearly, some of the
high numbers are guesses.

There are about 15 IVs, including categorical and continuous variables.
These include demographics and drug use questions, as well as a question on
'year of interview', included as a covariate as an orthogonal polynomial (we
aren't particularly interested in it, but it needs to be accounted for).

Issues:

1) The changes in the DV over time are mostly due to outliers.

2) There is substantial interest in the outliers - while the people who say
   they shared with a lot of people are guessing, it's likely that someone
   who guesses "200" is higher than someone who guesses "100", etc. There
   are substantive reasons to want to be able to explain these values in
   particular.

3) There are actually 2 DVs: distributive and receptive partners - right
   now, I am treating these as separate, and not doing anything multivariate
   with them, but may need to change this.

Methods:

Thus far, I've applied a variety of models, none of which do all I want:

1) Ordinal logistic based on a categorized DV (this lumps all the DV values
   that are over a certain number together - not ideal).

2) Regression trees (useful, but they don't give parameter estimates).

3) A variety of count regression models (Poisson, neg. binomial, and
   zero-inflated versions of both). The models without zero-inflation don't
   fit well; the zero-inflated models are hard to interpret, and not very
   numerically stable for these data.

I am now considering a series of (regular) logistic regressions: 0 vs. more;
1 vs. more among sharers; less than 10 vs. more, etc. I've never seen this
done; there doesn't seem to be any reason NOT to do it, but I was curious if
anyone had a reference to this.

Any other ideas also welcome.

Thanks

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)
```
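The series of logistic regressions proposed in the post can be pictured as a set of nested binary outcomes, each defined only for the people who "passed" the previous split. The sketch below is only an illustration of that coding step, not the poster's actual procedure. It assumes the three cut points named in the post (0, 1, and fewer than 10) and assumes each split is restricted to those above the previous cut, which the post states explicitly only for the second split. The names `CUTS` and `sequential_outcomes` are hypothetical. Fitting the logistic models themselves is left out.

```python
from typing import List, Optional, Sequence

# Assumed cut points, read off the post's examples:
#   stage 1: 0 vs. more
#   stage 2: 1 vs. more (among sharers only)
#   stage 3: fewer than 10 vs. 10 or more (assumed: among those above 1)
CUTS = (0, 1, 9)


def sequential_outcomes(count: int, cuts: Sequence[int] = CUTS) -> List[Optional[int]]:
    """Code one count as a list of nested binary outcomes.

    For each cut, the outcome is 1 if the count exceeds the cut and 0 if it
    does not. Once a person falls at or below a cut, they drop out of all
    later stages, which is recorded as None (not in that stage's sample).
    """
    outcomes: List[Optional[int]] = []
    still_in_sample = True
    for cut in cuts:
        if not still_in_sample:
            outcomes.append(None)
            continue
        exceeded = int(count > cut)
        outcomes.append(exceeded)
        still_in_sample = exceeded == 1
    return outcomes


if __name__ == "__main__":
    # Made-up counts, for illustration only (not data from the study).
    for n in (0, 1, 5, 200):
        print(n, sequential_outcomes(n))
```

Each position in the returned list would then serve as the response for one separate logistic regression, fitted only on the rows where that position is not `None`. This stage-by-stage conditioning resembles what is usually called a continuation-ratio (sequential) logit, which may be a useful search term for references.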

