LISTSERV at the University of Georgia
Date:         Mon, 6 Jul 2009 16:34:02 -0400
Reply-To:     oloolo <dynamicpanel@YAHOO.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         oloolo <dynamicpanel@YAHOO.COM>
Subject:      Re: Need help with self-selection bias....
Comments: To: Pete <phlarsen@YAHOO.COM>

According to your first post, it sounds like these omitted projects are not missing at random, so it is inappropriate to simply draw a random subset of the observations in your dataset to generate summary statistics.
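
To illustrate the point, here is a minimal Python sketch (not SAS, and not based on Pete's actual data). Everything in it is an assumption made up for illustration: the lognormal cost distribution, the population size, and the selection rule that only the costliest 60% of projects get submitted. It only shows that a random subsample of an already selected sample inherits the selection bias.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 project costs in arbitrary units.
population = [random.lognormvariate(3.0, 0.5) for _ in range(1000)]

# Assumed selection rule: only the costliest 60% of projects are submitted,
# so the unsubmitted projects are missing NOT at random.
submitted = sorted(population)[400:]

# A simple random subsample drawn from the submitted projects only.
subsample = random.sample(submitted, 200)

true_mean = statistics.mean(population)
submitted_mean = statistics.mean(submitted)
subsample_mean = statistics.mean(subsample)

print(f"population mean: {true_mean:.2f}")
print(f"submitted mean:  {submitted_mean:.2f}")
print(f"subsample mean:  {subsample_mean:.2f}")
```

Under these assumptions the subsample mean tracks the submitted mean rather than the population mean, because random sampling cannot undo a selection that happened before the data reached you.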

Verbeke discusses this issue fairly extensively; you probably need to simultaneously estimate both the missing values and the parameters. Check out the book "Linear Mixed Models for Longitudinal Data" by Geert Verbeke and Geert Molenberghs, and the citations there.

On Sat, 4 Jul 2009 12:57:00 -0400, Pete <phlarsen@YAHOO.COM> wrote:

>On Wed, 1 Jul 2009 19:12:49 -0700, Daniel Nordlund <djnordlund@VERIZON.NET>
>wrote:
>
>>> -----Original Message-----
>>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Pete
>>> Sent: Wednesday, July 01, 2009 5:48 PM
>>> To: SAS-L@LISTSERV.UGA.EDU
>>> Subject: Need help with self-selection bias....
>>>
>>> Hi Folks-
>>>
>>> Longtime SAS programmer here, but a newbie to SAS procedures that correct
>>> for self-selection bias.
>>>
>>> I have non-random data on energy project costs that was self-reported by a
>>> handful of companies. Some companies censored the project data by only
>>> sending us a share of their projects and the associated costs (most likely
>>> their best performing and most costly projects were submitted). Anyway, I
>>> do have some information on the total number of projects that each company
>>> undertook as well as the number of projects that were actually submitted to
>>> us.
>>>
>>> I have two questions:
>>> 1) Is there a way to re-weight the sample data using acceptable bias
>>> corrections so that I can report average project costs that may be more
>>> indicative of the true population?
>>>
>>> 2) If I were to model project costs (dependent) against a handful of
>>> independent variables, does it sound like the Heckman two-stage method using
>>> the proportion of submitted projects makes the most sense? Are there better
>>> methods out there?
>>>
>>> Many thanks~you folks have been very helpful in the past.
>>
>>Pete,
>>
>>I think we are going to need a lot more information before you get "useful"
>>advice.
>>1. What are your study questions? Are you trying to evaluate some
>>intervention or are you just doing descriptive analyses?
>>2. How many companies are we talking?
>>3. What is the range of "proportion of submitted projects"? Are we talking
>>about a range like 90-100% or is it more like 60-100% or even worse?
>>
>>Depending on your answers (especially to 3.) you may or may not have any
>>options. Your main problem appears to be missing data, and the data is not
>>missing at random. You might want to look at multiple imputation methods.
>>If the proportion of submitted projects is never very low, you might just
>>downweight those companies proportional to their submission percentage. If
>>the submission percentage is very low or a lot of companies have held back
>>projects, I am not hopeful about doing anything useful.
>>
>>You asked about reweighting based on "acceptable bias corrections." Did you
>>have something particular in mind?
>>
>>Your missing data problem is different from the typical selection bias
>>scenario where observation units (patients, companies, ...) self-select into
>>a treatment/intervention group. In this scenario, one could use a 2-stage
>>Heckman (or other type of propensity score analysis) to try to adjust for
>>differences based on observable characteristics. I don't know if this
>>applies in your situation, but I suspect it won't be helpful in dealing with
>>your missing data. Although, if you know something about the
>>characteristics of the projects not submitted (not just the percentage of
>>unsubmitted projects) you might be able to do something, but you will need
>>to be able to model the probability of a project being submitted.
>>
>>If you can provide more context to your research/evaluation task, someone
>>may be able to give you better help.
>>
>>Dan
>>
>>Daniel Nordlund
>>Bothell, WA USA
>
>Hi Dan (and group)-
>
>Thanks for the early feedback. Here is some more information (sorry for the
>bullets):
>
>*There are probably 50 companies that have submitted projects into this
>database, which includes about 4000 records in total. Some companies submit
>100% of their projects, while others submit smaller shares. I would say, in
>general, that there is a 50-60% overall submission rate for projects for the
>entire database. Right now, I'm interested in a number of variables, but I
>just mentioned "project costs" to keep my original question to the group
>fairly simple.
>
>*I am mostly interested in communicating "unbiased" descriptive statistics
>about the population based on this sample (or a sub-sample?). However, I
>might also be interested in testing hypotheses.
>
>*You are correct in your earlier point...that is, I don't know much about
>project data that WAS NOT submitted. I also don't know much about the
>companies that chose not to submit any projects at all (i.e., there are
>other companies out there that didn't respond). We do know, though, that we
>have a majority of data for the entire industry, but not everything. I only
>really know the project data for the submitted projects and the share of
>projects that were submitted by that particular company. We might be able
>to reasonably determine how much of the total industry we have covered in
>the database, if that helps.
>
>*Here's a question back to you: Is it reasonable to randomly draw and
>report population stats on a sub-sample of our database projects that A) had
>complete data for all variables and B) had a 100% submission rate? Also,
>could we use this information to minimize the bias across the entire database?
>
>Just trying to think outside the box on this one....again, thanks for your help.
>
>-Pete
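
On the reweighting question raised in the quoted thread, here is a minimal Python sketch (not SAS) of one common approach: weighting each submitted project by the inverse of its company's submission rate. This is a swapped-in illustration and not necessarily what Dan meant by downweighting. The company names, counts, and costs are hypothetical, as are the function names. Note the built-in assumption: the weights are only valid if each company's submitted projects are a random sample of its own projects, which is exactly what is in doubt here.

```python
# Hypothetical data: total projects undertaken and costs of submitted projects.
companies = {
    "A": {"total": 4, "costs": [100.0, 110.0, 90.0, 100.0]},  # 100% submitted
    "B": {"total": 8, "costs": [200.0, 220.0]},               # 25% submitted
}


def unweighted_mean_cost(data):
    """Plain mean over every submitted project, ignoring submission rates."""
    all_costs = [c for info in data.values() for c in info["costs"]]
    return sum(all_costs) / len(all_costs)


def weighted_mean_cost(data):
    """Mean with each project weighted by total / submitted for its company."""
    weighted_sum = 0.0
    weight_total = 0.0
    for info in data.values():
        n_submitted = len(info["costs"])
        weight = info["total"] / n_submitted  # inverse submission rate
        weighted_sum += weight * sum(info["costs"])
        weight_total += weight * n_submitted
    return weighted_sum / weight_total


print(f"unweighted mean: {unweighted_mean_cost(companies):.2f}")
print(f"weighted mean:   {weighted_mean_cost(companies):.2f}")
```

The weights restore each company's share of the total project count, so they correct for some companies being under-represented. They do nothing about a company cherry-picking which of its own projects to send.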

