Date: Tue, 21 Mar 2006 12:20:24 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Statistical Question--PROC LOGISTIC
In-Reply-To: <OF944F7851.4083BF19-ONC2257138.004551A3-C2257138.0045DD7C@hsbc.com.tr>
Content-Type: text/plain; format=flowed
BoraYavuz@hsbc.com.tr wrote:
>David,
>
>Can you expand a bit more on "certainty sampling" which you mentioned.
>
>We frequently find ourselves in pretty much the same situation Nick
>described and deal with it - mostly - using weights manually calculated in
>Excel (after deciding on the stratification variables through common
>sense). And I believe it would be great if you could provide some more
>info (examples, your comments, etc.) and pointers on the subject matter.
>
>I also didn't understand the "size multipliers" that you mentioned. :-(
>I'd be grateful if you could expand on that one too.
When we build a sample in which records get different weights (within a
stratum, or with no strata at all), we do it by giving each record a
'multiplier', so that some records are picked with a higher likelihood than
others. That multiplier is the variable we list in the SIZE statement of
PROC SURVEYSELECT.
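To see how a multiplier turns into a selection rate, here is a small sketch
with made-up numbers (the counts and the sample size are assumptions for
illustration only, not anything from a real database). Under PPS sampling
without replacement, a record's selection probability is the sample size times
its size value divided by the total of all size values.

```sas
/* Toy numbers, invented for illustration only */
data _null_;
   n     = 110;                    /* requested sample size               */
   n_low = 1000;                   /* records with multiplier 1           */
   n_mid = 100;                    /* records with multiplier 10          */
   total = n_low*1 + n_mid*10;     /* sum of the size values: 2000        */
   p_low = n*1  / total;           /* selection prob. for multiplier 1    */
   p_mid = n*10 / total;           /* selection prob. for multiplier 10   */
   put p_low= p_mid=;
run;
```

The log shows p_low=0.055 and p_mid=0.55, so the multiplier-10 records are
sampled at ten times the rate of the multiplier-1 records.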
If our boss comes to us and says:
"Okay Bora, here's what I need and I need it last week. So step on it!
I want a sample of 40,000 from the database. Yeah, yeah, I know you
pulled one yesterday, but this time I need it different. I need the people
with incomes under $10,000 sampled at only one-tenth the rate we use
on the people with incomes over $10,000. And I need every single one
of the people with an income over $100,000. I expect to see this in
my inbox by close of business today!"
Okay, maybe our boss isn't that nice. :-)
But now we have certainty sampling (we have to get all the high-income
people) and we have PPS sampling. PPS = Probability Proportional to Size.
Let's do this now. The SIZE variable gets used in the certainty sampling
part too, so we need to think about this. We want a multiplier which is
10 times larger for the medium-income class than for the low-income class:
    if income > 10000 then mult = 10;
    else mult = 1;
Or we could use a Boolean and write it as:
    mult = 1 + 9*(income > 10000);
But we also need that certainty sample. The SIZE variable works with the
certainty option CERTSIZE= like this: we give the largest values of the
multiplier to the records to be sampled for certain, and we use the
CERTSIZE= option to tell the procedure what the cut-off is. Every record
whose SIZE value is at or above that cut-off goes into the sample. So let's
tack that extra bit on:
    if income > 100000 then mult = 20;
    else if income > 10000 then mult = 10;
    else mult = 1;
Or we use our little Boolean trick again:
    mult = 1 + 9*(income > 10000) + 10*(income > 100000);
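Those assignments have to live in a DATA step so that mult ends up on the
table you sample from. A minimal sketch, assuming the table before mult is
added is called RawData (a hypothetical name) and that mult_check is just a
throwaway variable for comparing the two ways of writing it:

```sas
/* RawData is a hypothetical name for the table before mult is added */
data YourBigData;
   set RawData;
   if income > 100000 then mult = 20;       /* certainty class  */
   else if income > 10000 then mult = 10;   /* medium incomes   */
   else mult = 1;   /* low incomes; a missing income also lands here */
   /* same value via the Boolean form, kept only as a cross-check */
   mult_check = 1 + 9*(income > 10000) + 10*(income > 100000);
run;

/* every record should fall on the diagonal: 1/1, 10/10, 20/20 */
proc freq data=YourBigData;
   tables mult*mult_check / missing norow nocol nopercent;
run;
```

Note that SAS treats a missing income as smaller than any number, so those
records get mult = 1. If that is not what you want, handle them before
sampling.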
Now we can do both the certainty sampling *and* the weighted sampling
together:
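    proc surveyselect data=YourBigData out=YourSample
          seed=40589584     /* fixed seed so the draw can be repeated  */
          method=pps        /* probability proportional to size        */
          certsize=20       /* records with mult >= 20 taken for sure  */
          sampsize=40000;   /* requested sample size                   */
       size mult;           /* the multiplier built above              */
    run;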
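Once the sample is drawn, it is worth a quick look at what came back. The
sketch below assumes the OUT= data set carries the variables Certain,
SelectionProb and SamplingWeight, which is what I recall the documentation
describing for METHOD=PPS with CERTSIZE=. Run PROC CONTENTS first to confirm
the names in your release.

```sas
/* confirm which variables PROC SURVEYSELECT added to the output */
proc contents data=YourSample;
run;

/* how many records were taken with certainty, by multiplier */
proc freq data=YourSample;
   tables mult*Certain / missing norow nocol nopercent;
run;

/* selection probabilities and sampling weights, by multiplier */
proc means data=YourSample n min max;
   class mult;
   var SelectionProb SamplingWeight;
run;
```

If everything worked, the mult=10 records should show a selection probability
ten times that of the mult=1 records, and the mult=20 records should all have
a selection probability of 1. These are the weights you would otherwise be
building by hand in Excel.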
Does that make more sense now?
As for what I wrote before about certainty sampling, I'm copying the
URL so you can find it in the SAS-L archives.
http://listserv.uga.edu/cgi-bin/wa?A2=ind0603C&L=sas-l&P=R28796
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330