LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2011, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 21 Jun 2011 18:35:27 -0400
Reply-To:     Arthur Tabachneck <art297@ROGERS.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Arthur Tabachneck <art297@ROGERS.COM>
Subject:      Re: finding maximum of multiple variables over multiple date
              windows
Comments: To: "Data _Null_;" <iebupdte@GMAIL.COM>

datanull,

I'm not going to pretend that I understand your code as, honestly, I'd need some more time to study it. That said, I'm sure that you could assign the original entries to a two dimensional array, and then proceed with your code.

The following doesn't do that but, rather, takes the brute force approach. Even on an aging computer, with little memory, the task took less than twenty minutes which included building a test data set. It sure sounds better than three hours. I took the liberty of changing one small part of your code, namely regarding the naming of the arrays.

Art

data start; input id year (a b c d e f)(:$1.); datalines; 1 2006 1 0 0 1 1 1 1 2008 1 0 0 0 1 1 1 2009 0 0 . 0 1 1 1 2010 0 0 0 0 1 1 1 2011 0 0 0 0 1 0 1 2012 0 0 0 0 1 0 2 2008 0 1 1 0 1 1 2 2009 1 1 1 0 0 0 3 2006 1 0 0 1 1 1 3 2007 1 1 1 0 0 0 3 2008 1 0 0 0 1 1 3 2009 0 0 . 0 1 1 3 2010 0 0 0 0 1 1 3 2011 0 0 0 0 1 0 3 2012 0 0 0 0 1 0 4 2006 1 0 0 1 1 0 4 2007 1 1 1 0 0 0 4 2008 1 0 0 0 1 0 4 2009 0 0 . 0 1 0 4 2010 0 0 0 0 1 . 4 2011 0 0 0 0 1 0 4 2012 0 0 0 0 1 0 4 2013 0 0 0 0 1 0 4 2014 0 0 0 0 1 0 4 2015 0 0 0 0 1 0 5 2006 1 0 0 1 1 1 5 2008 1 0 0 0 1 1 5 2009 0 0 . 0 1 1 5 2010 0 0 0 0 1 1 5 2011 0 0 0 0 1 0 5 2012 0 0 0 0 1 0 6 2008 0 1 1 0 1 1 6 2009 1 1 1 0 0 0 7 2006 1 0 0 1 1 1 7 2007 1 1 1 0 0 0 7 2008 1 0 0 0 1 1 7 2009 0 0 . 0 1 1 7 2010 0 0 0 0 1 1 7 2011 0 0 0 0 1 0 7 2012 0 0 0 0 1 0 8 2006 1 0 0 1 1 0 8 2007 1 1 1 0 0 0 8 2008 1 0 0 0 1 0 8 2009 0 0 . 0 1 0 8 2010 0 0 0 0 1 . 8 2011 0 0 0 0 1 0 8 2012 0 0 0 0 1 0 8 2013 0 0 0 0 1 0 8 2014 0 0 0 0 1 0 8 2015 0 0 0 0 1 0 9 2006 1 0 0 1 1 1 9 2008 1 0 0 0 1 1 9 2009 0 0 . 0 1 1 9 2010 0 0 0 0 1 1 9 2011 0 0 0 0 1 0 9 2012 0 0 0 0 1 0 10 2008 0 1 1 0 1 1 10 2009 1 1 1 0 0 0 ; data have; set start (rename=(id=i)); do j=1 to 400000; id=j*10+i-10; output; end; run; proc sort data=have; by id year; run; data expanded(drop=last_year yr); set have; array current[*] a b c d e f; retain last_year; by id; yr=year; output; if first.id then do; if year gt 2006 then do; call missing(of current(*)); do year=2006 to yr-1; output; end; end; end; else if yr gt last_year+1 then do; call missing(of current(*)); do year=last_year+1 to yr-1; output; end; end; if last.id and yr lt 2012 then do; call missing(of current(*)); do year=yr+1 to 2012; output; end; end;

last_year=yr; run; proc sort data=expanded; by id year; run; data need; set expanded; by id; array y[*] a b c d e f; array l1_[6] $1 _temporary_; array l2_[6] $1 _temporary_; array l3_[6] $1 _temporary_; array l4_[6] $1 _temporary_; array l5_[6] $1 _temporary_; array l6_[6] $1 _temporary_; array l7_[6] $1 _temporary_; do i=1 to dim(y); l1_[i] = lag1(y[i]); l2_[i] = lag2(y[i]); l3_[i] = lag3(y[i]); l4_[i] = lag4(y[i]); l5_[i] = lag5(y[i]); l6_[i] = lag6(y[i]); l7_[i] = lag7(y[i]); end; if first.id then count = 0; count + 1; if count le 1 then call missing(of l1_[*]); if count le 2 then call missing(of l2_[*]); if count le 3 then call missing(of l3_[*]); if count le 4 then call missing(of l4_[*]); if count le 5 then call missing(of l5_[*]); if count le 6 then call missing(of l6_[*]); if count le 7 then call missing(of l7_[*]); array m2_[6] $1 m2_a m2_b m2_c m2_d m2_e m2_f; array m3_[6] $1 m3_a m3_b m3_c m3_d m3_e m3_f; array m4_[6] $1 m4_a m4_b m4_c m4_d m4_e m4_f; array m5_[6] $1 m5_a m5_b m5_c m5_d m5_e m5_f; array m6_[6] $1 m6_a m6_b m6_c m6_d m6_e m6_f; array m7_[6] $1 m7_a m7_b m7_c m7_d m7_e m7_f; array m8_[6] $1 m8_a m8_b m8_c m8_d m8_e m8_f; do i = 1 to dim(y); m2_[i] = y[i] max l1_[i]; m3_[i] = m2_[i] max l2_[i]; m4_[i] = m3_[i] max l3_[i]; m5_[i] = m4_[i] max l4_[i]; m6_[i] = m5_[i] max l5_[i]; m7_[i] = m6_[i] max l6_[i]; m8_[i] = m7_[i] max l7_[i]; end; drop count i; run; ------ On Tue, 21 Jun 2011 12:10:57 -0500, Data _null_; <iebupdte@GMAIL.COM> wrote:

>I did not understand comment about the missing year. I though is was >to be ignored. You are correct that my method needs to account for >the missing year. Let me see if I can figure a method that won't be >too slow. > >On Tue, Jun 21, 2011 at 12:05 PM, Ya Huang <ya.huang@amylin.com> wrote: >> I have a feeling that you overlooked the requirement that >> window is based on the calendar year, not the number of obs. >> So for a window of 3 years, the id=1 year 2009, we should not include 2006, >> since it is too old. With lag(), you will pick up 2006, since it is 3 obs >> ahead. I think to use lag, we need to merge a shell (which >> has all the concecutive years) data first, to make sure obs match >> concecutive years. >> >> >> On Tue, 21 Jun 2011 11:55:51 -0500, Data _null_; <iebupdte@GMAIL.COM> wrote: >> >>>Here is my go at it. I think I understand the problem but you will >>>have to check that. You know I like my data steps so that is what I >>>used. Please check closely there are lots of enumerated lists and >>>arrays named as enumerated lists. >>> >>>data HAVE; *note: dates may not be continuous - e.g. id 1 is missing year >> 2007; >>> input id year (a b c d e f)(:$1.); >>> datalines; >>>1 2006 1 0 0 1 1 1 >>>1 2008 1 0 0 0 1 1 >>>1 2009 0 0 . 0 1 1 >>>1 2010 0 0 0 0 1 1 >>>1 2011 0 0 0 0 1 0 >>>1 2012 0 0 0 0 1 0 >>>2 2008 0 1 1 0 1 1 >>>2 2009 1 1 1 0 0 0 >>>3 2006 1 0 0 1 1 1 >>>3 2007 1 1 1 0 0 0 >>>3 2008 1 0 0 0 1 1 >>>3 2009 0 0 . 0 1 1 >>>3 2010 0 0 0 0 1 1 >>>3 2011 0 0 0 0 1 0 >>>3 2012 0 0 0 0 1 0 >>>4 2006 1 0 0 1 1 0 >>>4 2007 1 1 1 0 0 0 >>>4 2008 1 0 0 0 1 0 >>>4 2009 0 0 . 0 1 0 >>>4 2010 0 0 0 0 1 . >>>4 2011 0 0 0 0 1 0 >>>4 2012 0 0 0 0 1 0 >>>4 2013 0 0 0 0 1 0 >>>4 2014 0 0 0 0 1 0 >>>4 2015 0 0 0 0 1 0 >>>;;;; >>> Run; >>>proc print; >>> run; >>>data need; >>> set have; >>> by id; >>> array y[*] a b c d e f; >>> array l1_[6] $1 _temporary_; >>> array l2_[6] $1 _temporary_; >>> array l3_[6] $1 _temporary_; >>> array l4_[6] $1 _temporary_; >>> array l5_[6] $1 _temporary_; >>> array l6_[6] $1 _temporary_; >>> array l7_[6] $1 _temporary_; >>> do i=1 to dim(y); >>> l1_[i] = lag1(y[i]); >>> l2_[i] = lag2(y[i]); >>> l3_[i] = lag3(y[i]); >>> l4_[i] = lag4(y[i]); >>> l5_[i] = lag5(y[i]); >>> l6_[i] = lag6(y[i]); >>> l7_[i] = lag7(y[i]); >>> end; >>> if first.id then count = 0; >>> count + 1; >>> if count le 1 then call missing(of l1_[*]); >>> if count le 2 then call missing(of l2_[*]); >>> if count le 3 then call missing(of l3_[*]); >>> if count le 4 then call missing(of l4_[*]); >>> if count le 5 then call missing(of l5_[*]); >>> if count le 6 then call missing(of l6_[*]); >>> if count le 7 then call missing(of l7_[*]); >>> >>> array m2_[6] $1 m2_a m2_b m2_c m2_d m2_e m2_f; >>> array m3_[6] $1; >>> array m4_[6] $1; >>> array m5_[6] $1; >>> array m6_[6] $1; >>> array m7_[6] $1; >>> array m8_[6] $1; >>> >>> do i = 1 to dim(y); >>> m2_[i] = y[i] max l1_[i]; >>> m3_[i] = m2_[i] max l2_[i]; >>> m4_[i] = m3_[i] max l3_[i]; >>> m5_[i] = m4_[i] max l4_[i]; >>> m6_[i] = m5_[i] max l5_[i]; >>> m7_[i] = m6_[i] max l6_[i]; >>> m8_[i] = m7_[i] max l7_[i]; >>> end; >>> >>> drop count i; >>> run; >>>proc print; >>> run; >>> >>> >>>On Tue, Jun 21, 2011 at 11:07 AM, Bolotin Yevgeniy >>><YBolotin@schools.nyc.gov> wrote: >>>> That's almost exactly like what i have now, actually :) except i don't >>>> transpose first, just have a list (variable names changed to match the >>>> example below) >>>> >>>> t1.id >>>> , t1.year >>>> , max(t2.a) as >>>> max_of_latest_&y.yr_a >>>> , max(t2.b) as >>>> max_of_latest_&y.yr_b >>>> ... >>>> >>>> Takes 3 hours, because those joins take quite a while (and i have to do >>>> them multiple times), and merging the datasets afterwards is also a >>>> massive timesink... >>>> >>>> thanks for the answer, though :) >>>> >>>> -----Original Message----- >>>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Ya >>>> Huang >>>> Sent: Tuesday, June 21, 2011 11:58 AM >>>> To: SAS-L@LISTSERV.UGA.EDU >>>> Subject: Re: finding maximum of multiple variables over multiple date >>>> windows >>>> >>>> Here is one. We transpose the data from wide and short >>>> to narrow and long, so SQL have handle more easily. After >>>> SQL to aggregate, we transpose back: >>>> >>>> %let nyear=3; >>>> >>>> proc transpose data=have out=have1; >>>> by id year; >>>> var a b c d e f; >>>> run; >>>> >>>> proc sql; >>>> create table have2 as >>>> select distinct a.id,a._name_,a.year,max(b.col1) as max_3year >>>> from have1 a, have1 b >>>> where a.id=b.id and >>>> a._name_=b._name_ and >>>> a.year - %eval(&nyear -1) <= b.year <=a.year >>>> group by a.id,a._name_,a.year >>>> order by a.id,a.year,a._name_ >>>> ; >>>> >>>> proc transpose data=have2 out=have3 (drop=_name_) prefix=max&nyear; >>>> by id year; >>>> id _name_; >>>> var max_3year; >>>> run; >>>> >>>> proc print; >>>> run; >>>> >>>> id year max3a max3b max3c max3d max3e max3f >>>> >>>> 1 2006 1 0 0 1 1 1 >>>> 1 2008 1 0 0 1 1 1 >>>> 1 2009 1 0 0 0 1 1 >>>> 1 2010 1 0 0 0 1 1 >>>> 1 2011 0 0 0 0 1 1 >>>> 1 2012 0 0 0 0 1 1 >>>> 2 2008 0 1 1 0 1 1 >>>> 2 2009 1 1 1 0 1 1 >>>> >>>> This might be a bit slower when you have 20 mil records, but the code >>>> more readable. >>>> >>>> Change nyear, you will get different window. >>>> >>>> HTH >>>> >>>> Ya >>>> >>>> >>>> On Tue, 21 Jun 2011 11:32:04 -0400, Bolotin Yevgeniy >>>> <YBolotin@SCHOOLS.NYC.GOV> wrote: >>>> >>>>>Slight simplification: for purposes of this calculation, missing values >>>>>can be treated as '0' (i.e. we only care about the '1' outcomes) >>>>> >>>>>-----Original Message----- >>>>>From: Bolotin Yevgeniy >>>>>Sent: Tuesday, June 21, 2011 11:30 AM >>>>>To: SAS-L@LISTSERV.UGA.EDU >>>>>Subject: finding maximum of multiple variables over multiple date >>>>>windows >>>>> >>>>>my google-fu is insufficient to find anything, although I'm reasonably >>>>>sure it's been done before... >>>>> >>>>> >>>>>I have a reasonably large (20 million-ish records) dataset with about a >>>>>dozen variables I need to aggregate (sample data at the end of the >>>>>email. each variable is trinary: character '0', '1', or missing) >>>>> >>>>>For each of these variables, I need to find its maximum value in the >>>>>last 2 years, last 3 years, ... etc up to 8, for a given set of years >>>>> >>>>> >>>>>Requirements: >>>>>* should be easy to read/understand/modify (primary requirement) and, >>>>>* if at all possible, should be reasonably fast >>>>>(i have a variant which does this with a few macro loops and a bunch of >>>>>PROC SQLs, but this takes about 3 hours to run - it should >>>> theoretically >>>>>be doable much faster by doing them all in one pass, rather than one at >>>>>a time like it's being done now...) >>>>> >>>>>I tried writing a single-datastep version with macro logic, but the >>>>>result consisted almost entirely of macro variables and practically no >>>>>code, and was completely unreadable. >>>>> >>>>> >>>>>Sample input data: >>>>> >>>>>data HAVE; *note: dates may not be continuous - e.g. id 1 is missing >>>>>year 2007; >>>>>input >>>>> id year a :$1. b :$1. c :$1. d :$1. e :$1. f :$1.; >>>>>datalines; >>>>>1 2006 1 0 0 1 1 1 >>>>>1 2008 1 0 0 0 1 1 >>>>>1 2009 0 0 . 0 1 1 >>>>>1 2010 0 0 0 0 1 1 >>>>>1 2011 0 0 0 0 1 0 >>>>>1 2012 0 0 0 0 1 0 >>>>>2 2008 0 1 1 0 1 1 >>>>>2 2009 1 1 1 0 0 0 >>>>>; >>>>>Run; >>>>> >>>>>/* >>>>>Output should have something like the following variables: >>>>> id year max_of_last_2_years_a max_of_last_3_years_a >>>>>max_of_last_4_years_a ... max_of_last_8_years_a max_of_last_2_years_b >>>>>... max_of_last_8_years_f; >>>>> >>>>>with the following values for e.g. max_of_last_3_years_a: >>>>>id year max_of_last_3_years_a comments >>>>>1 2006 1 value was 1 in the last 3 years inclusive (in >>>>>this case, just in 2006) >>>>>1 2008 1 value was 1 in the last 3 years inclusive (in >>>>>2006 and 2008). note: no entry for 2007, so it's not in the output >>>>>1 2009 1 value was 1 in the last 3 years inclusive (in >>>>>2008; 2006 is now too old to be considered for this variable) >>>>>1 2010 1 value was 1 in the last 3 years inclusive (in >>>>>2008) >>>>>1 2011 0 value was not 1 in the last 3 years inclusive >>>>>(in 2009, 2010 or 2011) >>>>>1 2012 0 value was not 1 in the last 3 years inclusive >>>>>(in 2010, 2011 or 2012) >>>>>2 2008 0 value was not 1 in the last 3 years inclusive >>>>>(in 2008) >>>>>2 2009 1 value was 1 in the last 3 years inclusive (in >>>>>2009) >>>>>*/ >>>> >>


Back to: Top of message | Previous page | Main SAS-L page