Date: Thu, 4 May 2006 15:53:26 -0400
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Nat Wooding <Nathaniel_Wooding@DOM.COM>
Subject: Re: Lab data / Documentation of Below Detectable Limits (BDL) or
Above Threshhold Value (ATV)
Content-Type: text/plain; charset="ISO-8859-1"
For quite a number of years I have maintained a couple of databases of
laboratory results or of data from reports generated from them (i.e., I don't
have access to the original lab data but can harvest what is in the
reports). These data are from environmental water, soil, etc. samples or
from water discharged by power plants. They include just about all of
the categories that you mention -- detected amounts, values that are below
detection (I'll call these bdl) represented by "<some number" or by
My solution has always been to store the data as a character value and
store exactly what the lab or report gave me. In our case, we report these
data to various agencies who have differing thoughts as to how to handle
bdl data -- use the number, use half, use 0, and so on. Hence, I cannot give up
the information stored in that "<". Also, there are times when we need to
estimate loadings, i.e., the mass of stuff discharged over some time. Here,
I multiply the flow by the concentration. If the concentration is bdl, then
I definitely want to be able to state that the loading is less than the
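
To make the idea above concrete, here is a minimal sketch, written in Python
only for illustration. The names parse_result, substitute and loading, and the
policy labels "limit", "half" and "zero", are hypothetical and not from any
agency rule or from my own programs. It assumes the stored strings look like
"<2.0" or "3.5" and that flow is a positive number in whatever units apply.

```python
from dataclasses import dataclass


@dataclass
class Result:
    qualifier: str  # "<", ">", or "" for a detected value
    value: float


def parse_result(raw: str) -> Result:
    """Split a stored lab string such as '<2.0' into qualifier and number."""
    text = raw.strip()
    if text[:1] in ("<", ">"):
        return Result(text[0], float(text[1:]))
    return Result("", float(text))


def substitute(result: Result, policy: str) -> float:
    """Apply one reporting rule to a below-detection value.

    policy is a hypothetical label: 'limit' uses the number as given,
    'half' uses half of it, 'zero' uses 0. Detected values pass through.
    """
    if result.qualifier != "<":
        return result.value
    factors = {"limit": 1.0, "half": 0.5, "zero": 0.0}
    return result.value * factors[policy]


def loading(flow: float, raw_conc: str) -> str:
    """Flow times concentration, carrying any '<' or '>' onto the product.

    Assumes flow > 0, so the direction of the inequality is unchanged.
    """
    r = parse_result(raw_conc)
    return f"{r.qualifier}{flow * r.value:g}"
```

Because the "<" is kept until the last step, the same stored value can serve
each agency's rule, and loading(10.0, "<2.0") still reports a "less than"
result rather than a plain number.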
As I recall, you are associated with the medical college of a university. I
have zero knowledge of any standards that may exist for medical data or
what the practices are in analyzing these data.
As to using special missing values, I would question whether you have too
many parameters and possible values for each parameter to use this approach
(this assumes that a single data set would contain data on multiple
parameters and that there may be a number of detection limits). This
approach, if it would work, would avoid the step of having to strip off the
"<" and create a number, but I would hope that someone would use the
non-detect value (or values) as part of the analysis. In the case of our data,
at least, this is important information, albeit a bit sticky to deal with.
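
One rough way to check whether the special-missing approach is even feasible
is to count how many distinct censored strings each parameter carries. SAS
provides 27 special missing values (._ and .A through .Z), so that is the
ceiling for any one coding scheme. The sketch below is in Python for
illustration; the function name and the sample records are made up and are
not real lab data.

```python
from collections import defaultdict

# SAS special missing values: ._ plus .A through .Z
SPECIAL_MISSING_CODES = 27


def distinct_censored_values(records):
    """Map each parameter to the set of distinct '<' or '>' strings seen."""
    per_parameter = defaultdict(set)
    for parameter, raw in records:
        text = raw.replace(" ", "")
        if text.startswith(("<", ">")):
            per_parameter[parameter].add(text)
    return per_parameter


# Made-up records: (parameter, value exactly as reported)
sample = [
    ("ammonia", "<0.02"),
    ("ammonia", "0.05"),
    ("ammonia", "<0.04"),
    ("nitrate", "<0.02"),
    ("nitrate", ">100"),
]

found = distinct_censored_values(sample)
overall = set().union(*found.values())

for parameter, values in sorted(found.items()):
    print(parameter, len(values), "distinct censored values")
print("fits one shared coding scheme:", len(overall) <= SPECIAL_MISSING_CODES)
```

If each variable gets its own format, the per-parameter counts are what
matter; if one format has to serve the whole data set, the overall count is
the one to compare against 27.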
You asked about a public file:
The United States Geological Survey (USGS) has started posting various
water data online. The following very lengthy url is for water quality data
from the Potomac River in the state of West Virginia
A perhaps easier way to reach the site would be to go to
there, check the box in the second column labeled "site name". Submit this
and on the next page enter "Potomac" in the site name field and click on
the 'table of sites' radio button. If you scan the columns labeled
"ammonia", you will see a few values with "<" prefixes.
This particular site is relatively new. In dealing with old tape-format
USGS data, I seem to recall that they would present the number and include
a column with a flag which indicated when something was bdl.
I do have one suggestion if you are going to offer these data to general
users: offer a link to some sort of narrative that discusses values that
are not detected and how one may need to use them in analyses.
Thanks for an interesting topic. I hope that we see some more replies.
Sent by: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
Date: 05/04/2006 05:01
Subject: Lab data / Documentation of Below Detectable Limits (BDL) or
Above Threshhold Value (ATV)
Please respond to
We just had a very engaged discussion in our group about how to represent
special non-numerical off-the-scale (OTS) data from laboratory analyses.
The original data from the lab software are character vars, with OTS
values represented as "<2.0" or ">100" and on-scale data as "3.0",
There are also (really) missing data like "no analysis made -->not
(A little extra complication is that, through (ir)regular calibration of
lab machines, the scale limits undergo small changes from time to time,
e.g. <2.0 could change to <2.1, then <1.9, etc.)
Our task is to make the data _publicly_ available in numerical format
for many (>100) SAS and SPSS users (in their own data formats).
The SPSS party (mostly medical/dental folk, some rather fresh in
likes the data in pure numerical form, e.g. "<2.0" transformed to 2.0
and then labeled as "<2.0"
(to make sure OTS data are not dropped in (numerical) analyses when
coded as missing).
The SAS party (statisticians, (bio-)mathematicians) sees the potential
errors (bias, high influence etc.),
when these "imputed" values enter regression analyses and thus rather
using methods that can handle censored data, and represent the data
a) coding them as special .X-like missing values (i.e. technically
missing, but not semantically) with relevant labels ("<2.0") or
b) leaving the data in character form as they come from the lab and having
each analyst decide him/herself what to do.
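
As an illustration of what methods for censored data typically need, the
sketch below (Python, for illustration only) turns each lab string into a
(lower, upper) pair, with None standing for an open side. This mirrors the
two-bound convention used by interval-censoring procedures such as the
two-variable response syntax of SAS PROC LIFEREG, where a missing bound marks
the censored side. The function name to_bounds is hypothetical, and strings
that are not numeric (really missing data) are assumed to be handled
separately.

```python
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]


def to_bounds(raw: str) -> Bounds:
    """Turn a lab string into (lower, upper); None marks an open side."""
    text = raw.strip()
    if text.startswith("<"):
        return (None, float(text[1:]))  # left-censored: only an upper bound
    if text.startswith(">"):
        return (float(text[1:]), None)  # right-censored: only a lower bound
    value = float(text)  # raises ValueError for non-numeric text
    return (value, value)  # on-scale value, observed exactly


for raw in ["<2.0", ">100", "3.0"]:
    print(raw, "->", to_bounds(raw))
```

A drifting limit such as <2.0, <2.1, <1.9 needs no special treatment in this
layout, since each observation simply carries its own upper bound.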
1) What are your experiences with this kind of data?
2) What is the best way to handle this task for a public-use file (with many
users, some of whom we do not know)?
3) Are there any official/international rules on how to do it?
Very interested in your input!
DIETRICH ALTE, Dipl.-Statistiker, Dr. rer. med.
Projektmanager "Study of Health in Pomerania (SHIP)"
Institut für Epidemiologie & Sozialmedizin
EMA-Universität Greifswald - Medizinische Fakultät
Walther-Rathenau-Str. 48, D-17487 Greifswald, Germany
Phone ++49(0)3834-867713, Fax ++49(0)3834-866684