LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2000, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 15 May 2000 03:42:09 GMT
Reply-To:     "Paul M. Dorfman" <sashole@MEDIAONE.NET>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Paul M. Dorfman" <sashole@MEDIAONE.NET>
Organization: KInPh
Subject:      Re: subset the dataset dynamically?
Content-Type: text/plain; charset=us-ascii

Roger,

Right. I knew that others would offer if-the-else, select, or other conventional logic, so I intentionally decided to demonstrate a 'single experssion' approach.

Roger Lustig wrote:

> Very elegant, Paul! > > A few comments... > > --do we really need to use the MIN operator *and* the MIN function > in the same statement? That can confuse other users.

No, of course not. But it makes one guess 'Why the heck did he do that?', and trying to understand the tricks incorporated in the expression triggers curiosity. Those who already know or do not want to know, need it not; those who do not know and want to dissect the expression in order to know, may find it instructive.

> --Your formula depends on the fact that 1000 is 900 + 100.

If you know of any deviations from this rule, please relate them to me at once.

> Either we need to generalize, or (if the pattern of observation > counts required isn't so regular) replace the formula with the > result of a SELECT block.

True, but in this particular case, I was not intending to create the General Theory of Everything.

> Yes, the IF/THEN/ELSE logic that > Shiling Zhang and Steve James offered is OK too, but the SELECT > is ultimately easier to type, structure, and read. A hobby-horse > of mine...

Agreed with the advantages of the case structure, even though algorithmically, it is but a disguised bunch of GOTOs converging to the same exit. It has a subtle caveat (a SAS design bug in my opinion): It cannot be used in a loop where LEAVE is intended to terminate iteration, because inside the SELECT, LEAVE bears its own meaning. Say, a typical binary search inner loop has no problem executing correctly if coded with if-then-else as e.g.

do while (lo > hi); m = floor ((lo+hi)*.5); if key < a(m) then hi = m-1; else if key > a(m) then lo = m+1; else do; inxfound = m; leave; end; end;

But try to code it as

do while (lo > hi); m = floor ((lo+hi)*.5); select; when (key < a(m)) hi = m-1; when (key > a(m)) lo = m+1; otherwise do; inxfound = m; leave; end; end; end;

and in the case of a successful search, it will loop forever: LEAVE here directs control past the bottom of SELECT, _not_ out of the DO WHILE loop.

> /**************************************************************/ > data sub (drop=max); > set a nobs=tot_obs; > > retain max; **** We'll only calculate this > once, but keep it around > throughout; > > if _n_=1 then select; **** Only need to do this at the > start; > > **** From here to the END statement > only the first true WHEN > will be executed; > when (tot_obs >= 1000) max=1000; > when (tot_obs >= 900) max=900; > when (tot_obs >= 600) max=600; > when (tot_obs >= 300) max=300; > otherwise stop; **** Under 300? We're done ; > *otherwise; **** Also legit, by the way; > end; > **** Now we know how many we need; > > if _n_ > max then stop; > run; > /**************************************************************/ > This has the advantage of being somewhat closer to "beginner's SAS", > i.e., the SET comes right after the DATA statement. I agree--it's > important to learn what SET actually does, but there's something to > be said for not having to learn it all at once.

On the other hand, it might be important to learn it right from the very beginning. One of the comments above contains a self- contradiction. "Only need to do this at the start" sounds good, yet you are checking whether it is the 'start' at each observation. Generally, the construct <if _n_=1> or <if end> should be considered harmful. First, it performs an unnecessary test or two at each observation; second, it replaces the common-sense logic "do smth; read the file and process it; do smth;" with "read file and process it; if first record then do smth; if last record then do smth;". Your step above, if rewritten in the 'natural' logical sequence, could spell

data sub (drop=max); set a nobs=max; select; when (max => 1000) max = 1000; when (max => 900) max = 900; when (max => 600) max = 600; when (max => 300) max = 300; otherwise max = 0; end; do _n_=1 to max; set a nobs=max; output; end; stop; run;

> That said, this is a borderline case for *another* hobby-horse of > mine, which involves keeping data out of the DATA step code. To > my way of thinking, the numbers 0-300-600-900-1000 are data, i.e., > things you're likely to change later. > > My solution? PROC FORMAT, of course. > /**************************************************************/ > proc format; > value max_num > low-<300='0' > 300-<600='300' > 600-<900='600' > 900-<1000='900' > 1000-high='1000' > ; > run; > > data sub (drop=max); > set a nobs=tot_obs; > > retain max; **** We'll only calculate this > once, but keep it around > throughout; > > if _n_=1 then max=input(put(tot_obs,max_num.),12.); > > **** Now we know how many we need; > > if _n_ > max then stop; > run; > /**************************************************************/ > Two steps, to be sure--but the second one works no matter what > kind of interval scheme you want, and doesn't require modifying > a formula & some parameters (your solution), or a bunch of WHEN > statements (my solution above), or a bunch of IF/THEN/ELSE statements, > either contiguous (Steve) or throughout the program (Shiling). > Instead, the spec gets its own little PROC up front. If you're > doing this sort of thing often, you can store the DATA step > elsewhere and %include it after the PROC FORMAT, which makes > it *really* hard to introduce errors.

Usign PROC FORMAT in this manner is surely a common practice. However, in my opinion, it is a poor practice. It does not really eliminate data from code but merely moves the data to code elsewhere, namely to PROC FORMAT. The strength of an analytic formula lies exactly in the ease of modifying it. Imagine how your PROC FORMAT would end up looking like if some new specifications changed 1,000 to 10,000, and how long it would take to modify the formula. Of course, the latter has not to be embedded in the code - it can reside in an autocall library as a fuction-style macro.

> For checking purposes, one might also write a note to the log, > right after the SET statement in either program above: > > put '******* INPUT DATA SET HAD ' tot_obs ' OBSERVATIONS ********';

Yup, in V<7 it is handy, else it is already taken care of.

> That way one can see whether one got the intended result > without having to look back at the previous data set--the input > and output counts will be right next to one another in the log. > > You're right--there's no limit on the number of ways to do this. > However, good SAS style not only involves knowing the actual function > of the statements you're using (as you very nicely urged Kathy > to find out), but also making the program easy to read, easy to > use, easy to modify, and easy to hand off to others.

Could not agree more.

> End of sermon.

Amen.

Kind regards, =================== Paul M. Dorfman Jacksonville, Fl ===================

> Roger Lustig > > Paul Dorfman wrote: > > > > Kathy, > > > > The logic is simple: First, compute the number of records to be read and > > written out depending on the number of observations present in the data set, > > and then read and output exactly as many records as you have computed. This > > can be done in infinite number of ways, and I am looking forward to see > > rather diverse replies. I will show just two ways of doing the same thing. > > First, a 2-step method (the input data set name is A): > > > > data _null_; > > retain l 300 h 1000; > > cut = (min(h,l*(int(n/l)))+100*(n ge h)) min h; > > call symput ('CUT', left(put(cut,best.))); > > stop; > > set A nobs=n; > > run; > > > > Now, you can supply the value of the macro variable CUT to the data set > > option OBS= whenever you reference the dataset A, as in > > > > proc append data=A(obs=&CUT) base=SUB; run; > > > > It is not even necessary to create a separate data set SUB, just reference > > the original data set A as A(obs=&CUT). Beware, though, that WHERE clause is > > incompatible with such a reference. > > > > Now, the 1-step way: > > > > data sub; > > l=300; h=1000; > > do _n_=1 to (min(h,l*(int(n/l)))+100*(n ge h)) min h; > > set a nobs=n; > > output; > > end; > > stop; > > run; > > > > It exactly parallels the logic outlined in the beginning. First, we compute > > the necessary number of output records using the expression > > > > (min(h,l*(int(n/l)))+100*(n ge h)) min h > > > > Then the internal value it produces, say some n_int, becomes the upper bound > > in the DO loop causing it to iterate exactly n_int times. At each iteration, > > SET reads 1 record from A and OUTPUT writes the record to SUB. When the loop > > is done, STOP terminates the step. Without it, control would be transferred > > back to the top of the step and, since non-read record remain in A, next > > n_int records would be read, and so on. But we do not want it, hence the > > STOP. > > > > If you understand how the expression above works, excellent! If you do not, > > try to understand it on your own. If you succeed, you have learned a couple > > of tricks. If not, write to sas-l, and I am sure someone will explain it to > > you eagerly. > > > > Kind regards, > > ===================== > > Paul M. Dorfman > > Jacksonville, Fl > > ===================== > > >From: Kathy White <kathyzhi@HOTMAIL.COM> > > >Reply-To: Kathy White <kathyzhi@HOTMAIL.COM> > > >To: SAS-L@LISTSERV.UGA.EDU > > >Subject: subset the dataset dynamically? > > >Date: Fri, 12 May 2000 15:23:38 EDT > > > > > >Suppose I have a sas dataset with many records. What I want to do now is to > > >select the records dynamically. That means, The most records I need is > > >1000. > > >Then the number of records selected should be the times of 300. So if the > > >current dataset have more than 1000 records, I will just pick up the first > > >1000 records, How to make it. Following is my logic: > > > > > >Data subset; > > > set mydata; > > > count= _N_; > > > if count > 1000 then /* just output the first 1000 to the subdatset; > > > if 900< count <1000 then just pick up the first 900; > > > if 600<count < 900 then just pick up the first 600; > > > if 300 < 600 then pick up the first 300; > > > if count <300 no records should be selected; > > > > > >How to make this? Is it possible to impletment with SAS? If any expert can > > >provide your advice, it would be highly appreciated. Thanks! > > > > > >Kathy


Back to: Top of message | Previous page | Main SAS-L page