Date: Wed, 11 May 2005 11:32:37 -0700
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion"
From: "David L. Cassell"
Subject: Re: Dependent sample difference in mean test
In-Reply-To: <1115826748.025038.172760@g44g2000cwa.googlegroups.com>
Content-type: text/plain; charset=US-ASCII

gblockhart@YAHOO.COM wrote:
> I have two dependent samples with different numbers of observations. I
> need to know whether the means of the two samples are statistically
> different from each other.
>
> My sample_1 has approximately 800,000 observations. Sample_2 has
> approximately 130,000 observations.
>
> I have run a regression on sample_1 to generate coefficients. I then
> "fit" the coefficients from sample_1 to the characteristics of sample_2
> observations. This gives me a predicted value for sample_2 based on
> sample_1 coefficients. I then calculate a residual by subtracting each
> sample_2 observation actual value from the predicted value (predicted
> from the sample_1 coefficients applied to the sample_2
> characteristics).
>
> Then I take the mean of the residuals from sample_2.
>
> I repeat the process in the opposite, i.e., I run a regression on
> sample_2, get coefficients, then fit the coefficients from sample_2
> to the sample_1 characteristics. This generates a predicted value,
> which I subtract from each sample_1 actual - this generates the
> sample_1 residuals. I then take the mean sample_1 residual.
>
> I expect the sample_1 and sample_2 residuals to be of opposite sign. I
> need to test the difference in the mean residuals. I have two
> dependent samples (of residuals) and I have very different sample sizes
> (of residuals).
>
> I can make the assumption that they are perfectly negatively correlated
> and proceed with a t-test. Then assume that they are perfectly
> uncorrelated and proceed with a t-test. This will give me a range of
> t-stats for my test.
>
> But, I was hoping someone could help me with a stronger (or more
> direct) test. I'm afraid the range won't give strong enough results.
>
> So, this is a statistical theory question instead of a direct SAS
> question.

Hey, stat questions are allowed here too.

But first... Why are you doing this? This doesn't make much sense
to me, and your resulting data are NOT directly comparable.

You cannot do either t-test. Period. You want to assume that you
have something in between perfectly correlated and uncorrelated, so
your t-statistic would be bracketed. It doesn't work that way.

Even worse, both of the t-statistics you have in mind assume that the
observations are independent. In a paired t-test, one assumes that
the *differences* are independent. In a two-sample test, one assumes
that all n1+n2 observations are independent of one another. You have
created residuals which are (by construction) all inter-related. You
have no independent observations here, and you shouldn't be considering
a basic t-test.

So, step back. Write to SAS-L (not to me personally) and explain why
you are doing this, and what you hope to achieve. The big picture
would be helpful. Perhaps someone here can point you toward a more
productive approach.

BTW, with sample sizes like you have, your statistical tests will be
really flaky, since the size of n will drive virtually anything to
appear significant. Why do you have such large samples, and where do
they come from, and what do they represent?

HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician
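
A rough illustration of the large-n remark in the message above, added as a sketch and not taken from the original post: with a fixed, practically negligible mean difference, a two-sample statistic grows with the sample sizes until it looks "significant". The numbers below (a difference of 0.01 standard deviations, unit standard deviations) and the function names are hypothetical, chosen only for illustration. The calculation also assumes independent observations, which the message argues is NOT true of the residuals in question, so this shows only the effect of n, not a valid test for that problem.

```python
import math

def welch_t(mean_diff, sd1, n1, sd2, n2):
    """Welch two-sample t statistic for a given difference in sample means.

    Assumes independent observations within and between samples.
    """
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return mean_diff / se

def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation (reasonable for large n)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical inputs: a tiny difference of 0.01 SD, both SDs equal to 1.
diff, sd = 0.01, 1.0

# Same difference, same spread; only the sample sizes change.
for n1, n2 in [(800, 130), (800_000, 130_000)]:
    t = welch_t(diff, sd, n1, sd, n2)
    p = two_sided_p_normal(t)
    print(f"n1={n1:>7}, n2={n2:>7}: t={t:.2f}, p={p:.4g}")
```

With the smaller sample sizes the statistic is far from any conventional threshold; at 800,000 and 130,000 the same 0.01 SD difference crosses it, even though nothing about the size of the effect has changed.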
