Date: Mon, 6 Jul 2009 17:36:52 -0400
Reply-To: Pete <phlarsen@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Pete <phlarsen@YAHOO.COM>
Subject: Re: Need help with self-selection bias....
Thanks for the reference. This data is not technically longitudinal,
though, because I am not receiving a given company's project information
over multiple periods of time. Does that matter....does your suggestion to
simultaneously estimate both the missing value and the parameters hold even
if the data is not longitudinal?
Basically, we receive data from one company at a point in time and another
company at another point in time. We don't receive updated information from
companies at later dates. Make sense? Many thanks, Pete
On Mon, 6 Jul 2009 16:34:02 -0400, oloolo <dynamicpanel@YAHOO.COM> wrote:
>according to your first post, sounds like these omitted projects are not
>missing at random, therefore it is inappropriate to simply randomly select
>a portion of the observations from your dataset to generate summary
>statistics.
>Verbeke discussed this issue fairly extensively, you probably need to
>simultaneously estimate both the missing value and the parameters.
>check out the book: "Linear Mixed Models for Longitudinal Data" by Geert
>Molenberghs, Geert Verbeke and citations there.
>On Sat, 4 Jul 2009 12:57:00 -0400, Pete <phlarsen@YAHOO.COM> wrote:
>>On Wed, 1 Jul 2009 19:12:49 -0700, Daniel Nordlund <djnordlund@VERIZON.NET>
>>wrote:
>>>> -----Original Message-----
>>>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
>>>> Behalf Of Pete
>>>> Sent: Wednesday, July 01, 2009 5:48 PM
>>>> To: SAS-L@LISTSERV.UGA.EDU
>>>> Subject: Need help with self-selection bias....
>>>> Hi Folks-
>>>> Longtime SAS programmer here, but a newbie to SAS procedures that
>>>> correct for self-selection bias.
>>>> I have non-random data on energy project costs that was self-reported
>>>> by a handful of companies. Some companies censored the project data by
>>>> only sending us a share of their projects and the associated costs
>>>> (most likely their best performing and most costly projects were
>>>> submitted). Anyway, I do have some information on the total number of
>>>> projects that each company undertook as well as the number of projects
>>>> that were actually submitted to us.
>>>> I have two questions:
>>>> 1) Is there a way to re-weight the sample data using acceptable bias
>>>> corrections so that I can report average project costs that may be
>>>> more indicative of the true population?
>>>> 2) If I were to model project costs (dependent) against a handful of
>>>> independent variables, does it sound like the Heckman two-stage method
>>>> using the proportion of submitted projects makes the most sense? Are
>>>> there better methods out there?
>>>> Many thanks~you folks have been very helpful in the past.
>>>I think we are going to need a lot more information before you
>>>1. What are your study questions? Are you trying to evaluate some
>>>intervention or are you just doing descriptive analyses?
>>>2. How many companies are we talking?
>>>3. What is the range of "proportion of submitted projects"? Are we talking
>>>about a range like 90-100% or is it more like 60-100% or even worse?
>>>Depending on your answers (especially to 3.) you may or may not have any
>>>options. Your main problem appears to be missing data, and the data is not
>>>missing at random. You might want to look at multiple imputation methods.
>>>If the proportion of submitted projects is never very low, you might just
>>>downweight those companies proportional to their submission percentage. If
>>>the submission percentage is very low or a lot of companies have held back
>>>projects, I am not hopeful about doing anything useful.
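A minimal sketch of the reweighting idea above, in Python rather than SAS for brevity. It assumes each submitted project is weighted by its company's total project count divided by its submitted count, which is only one possible reading of weighting by submission percentage. The company labels, costs, and the name `weighted_mean_cost` are hypothetical. This rebalances companies against each other; it cannot undo a company having sent only its most costly projects.

```python
# Hypothetical data: (company, project_cost). Values are made up for illustration.
submitted = [
    ("A", 100.0), ("A", 120.0),  # company A submitted 2 of its 2 projects
    ("B", 200.0),                # company B submitted 1 of its 4 projects
]
# Hypothetical count of all projects each company undertook.
total_projects = {"A": 2, "B": 4}


def weighted_mean_cost(submitted, total_projects):
    """Mean cost with each project weighted by total/submitted for its company."""
    # Count how many projects each company actually submitted.
    counts = {}
    for company, _cost in submitted:
        counts[company] = counts.get(company, 0) + 1

    numerator = 0.0
    denominator = 0.0
    for company, cost in submitted:
        # Assumption: the weight is the inverse of the company's submission rate.
        weight = total_projects[company] / counts[company]
        numerator += weight * cost
        denominator += weight
    return numerator / denominator


unweighted = sum(cost for _c, cost in submitted) / len(submitted)
weighted = weighted_mean_cost(submitted, total_projects)
print(unweighted, weighted)
```

With these made-up numbers the weighted mean moves toward company B, because its single submitted project stands in for four undertaken projects.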
>>>You asked about reweighting based on "acceptable bias corrections." Did you
>>>have something particular in mind?
>>>Your missing data problem is different from the typical selection bias
>>>scenario where observation units (patients, companies, ...) self-select into
>>>a treatment/intervention group. In this scenario, one could use a 2-stage
>>>Heckman (or other type of propensity score analysis) to try to adjust for
>>>differences based on observable characteristics. I don't know if this
>>>applies in your situation, but I suspect it won't be helpful in dealing with
>>>your missing data. Although, if you know something about the
>>>characteristics of the projects not submitted (not just the percentage of
>>>unsubmitted projects) you might be able to do something, but you will need
>>>to be able to model the probability of a project being submitted.
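For reference, a small sketch of the correction term used in the second stage of a Heckman two-step estimator, written in Python with only the standard library. It assumes a probit model of submission has already been fitted at the project level, which the thread suggests may not be possible here. The argument `z` (the fitted probit index for a submitted project) and the name `inverse_mills` are illustrative assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_mills(z):
    """Inverse Mills ratio pdf(z) / cdf(z) for a selected observation.

    z is assumed to be the fitted index from a first-stage probit model of
    whether a project was submitted. In the second stage this ratio is added
    as an extra regressor in the cost model.
    """
    return _STD_NORMAL.pdf(z) / _STD_NORMAL.cdf(z)


# A project that was very likely to be submitted (large z) gets a small
# correction; an unlikely submission (negative z) gets a large one.
for z in (-1.0, 0.0, 2.0):
    print(z, inverse_mills(z))
```

The ratio decreases as `z` grows, so projects that were almost certain to be submitted contribute little selection correction.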
>>>If you can provide more context to your research/evaluation task, someone
>>>may be able to give you better help.
>>>Bothell, WA USA
>>Hi Dan (and group)-
>>Thanks for the early feedback. Here is some more information (sorry for
>>*There are probably 50 companies that have submitted projects into this
>>database, which includes about 4000 records in total. Some companies submit
>>100% of their projects, while others submit smaller shares. I would say, in
>>general, that there is a 50-60% overall submission rate for projects for the
>>entire database. Right now, I'm interested in a number of variables, but I
>>just mentioned "project costs" to keep my original question to the group
>>*I am mostly interested in communicating "unbiased" descriptive statistics
>>about the population based on this sample (or a sub-sample?). However, I
>>might also be interested in testing hypotheses.
>>*You are correct in your earlier point...that is, I don't know much about the
>>project data that WAS NOT submitted. I also don't know much about the
>>companies that chose not to submit any projects at all (i.e., there are
>>other companies out there that didn't respond). We do know, though, that we
>>have a majority of data for the entire industry, but not everything. I only
>>really know the project data for the submitted projects and the share of
>>projects that were submitted by that particular company. We might be able
>>to reasonably determine how much of the total industry we have covered in
>>the database, if that helps.
>>*Here's a question back to you: Is it reasonable to randomly draw and
>>report population stats on a sub-sample of our database projects that A) have
>>complete data for all variables and B) had a 100% submission rate? Also,
>>could we use this information to minimize the bias across the entire
>>Just trying to think outside the box on this one....again, thanks for your
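The sub-sample idea in the last question can be sketched as follows, in Python for brevity. The record layout, the company labels, and the name `full_submitter_mean` are hypothetical, and a missing cost is represented as `None`. The result describes the wider population only under the assumption that companies submitting 100% of their projects resemble the companies that did not, which the data alone cannot confirm.

```python
from statistics import mean


def full_submitter_mean(records, submission_rate):
    """Mean cost over complete records from companies that submitted everything.

    records: iterable of (company, cost) pairs, where cost may be None.
    submission_rate: dict mapping company to its share of projects submitted.
    """
    costs = [
        cost
        for company, cost in records
        if cost is not None and submission_rate[company] >= 1.0
    ]
    if not costs:
        raise ValueError("no complete records from full-submission companies")
    return mean(costs)


# Hypothetical data, made up for illustration.
records = [("A", 100.0), ("A", None), ("A", 120.0), ("B", 200.0)]
submission_rate = {"A": 1.0, "B": 0.25}

print(full_submitter_mean(records, submission_rate))
```

Comparing this figure with the mean over all submitted projects gives a rough sense of how far the partial submitters pull the overall average.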