Date: Thu, 4 Nov 2010 17:53:04 -0500
Reply-To: Joe Matise <snoopy369@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Joe Matise <snoopy369@GMAIL.COM>
Subject: Re: Bit type
In-Reply-To: <AANLkTikykyimU4XQX_JTzkT0SC=Hb9WErwtSi5rw5gHV@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
If you just make a function that does the extracting for you, you don't have
to use a view - you can just WHERE on that function's results (using BAND,
or some math) and never uncompress, if you don't plan on asking for a
frequency of the YN variables. (You don't have to make an actual function
if you don't want to, though it might be easier for your users - a macro, or
even just plain math/BAND usage in the PROC, should do).
-Joe
On Thu, Nov 4, 2010 at 5:44 PM, Sterling Paramore <gnilrets@gmail.com>wrote:
> I took a 6 million record sample of my data and converted the Y/N fields
> into a compressed value:
>
>
> data Membs_Enr_Compress (drop = MembsEnr_Eff_Flag_:);
> set Membs_Enr;
>
> length compr_value 3;
> compr_value = 0;
> i = 0;
> do iField = 200912,201001 to 201009;
> i = i + 1;
> compr_value = compr_value +
> (vvaluex(cats("MembsEnr_Eff_Flag_",put(iField,6.0)))="Y")*2**(i-1);
> end;
> run;
>
>
>
>
> I then created a dataset version and a view version of the uncompressed
> values:
>
>
> data Membs_Enr_Uncompress;
> set Membs_Enr_Compress;
>
> array YNField_(12) $1;
>
> do i = 1 to 12;
> if band(2**(i-1),compr_value) then
> YNField_(i) = "Y";
> else
> YNField_(i) = "N";
> end;
>
> run;
>
> data Membs_Enr_Uncompress_V / view = Membs_Enr_Uncompress_V;
> set Membs_Enr_Compress;
>
> array YNField_(12) $1;
>
> do i = 1 to 12;
> if band(2**(i-1),compr_value) then
> YNField_(i) = "Y";
> else
> YNField_(i) = "N";
> end;
>
> run;
>
>
> Here's a typical use of these tables:
>
> 15 proc means missing nway data = Membs_Enr_Uncompress SUM;
> 16 where YNField_3 = "Y";
> 17 var Membs_Counter;
> 18 class Membs_Enroll_yrmo;
> 19 run;
>
> NOTE: There were 1978518 observations read from the data set
> WORK.MEMBS_ENR_UNCOMPRESS.
> WHERE YNField_3='Y';
> NOTE: PROCEDURE MEANS used (Total process time):
> real time 1.93 seconds
> cpu time 2.32 seconds
>
>
> 20
> 21 proc means missing nway data = Membs_Enr_Uncompress_V SUM;
> 22 where YNField_3 = "Y";
> 23 var Membs_Counter;
> 24 class Membs_Enroll_yrmo;
> 25 run;
>
> NOTE: View WORK.MEMBS_ENR_UNCOMPRESS_V.VIEW used (Total process time):
> real time 53.40 seconds
> cpu time 48.53 seconds
>
> NOTE: There were 6011208 observations read from the data set
> WORK.MEMBS_ENR_COMPRESS.
> NOTE: There were 1978518 observations read from the data set
> WORK.MEMBS_ENR_UNCOMPRESS_V.
> WHERE YNField_3='Y';
> NOTE: PROCEDURE MEANS used (Total process time):
> real time 53.43 seconds
> cpu time 48.54 seconds
>
>
> As you can see, the view has terrible performance!
>
> -Sterling
>
>
>
>
> On Thu, Nov 4, 2010 at 2:44 PM, Joe Matise <snoopy369@gmail.com> wrote:
>
> > How about making a PROC FCMP function, or even a macro, to fix it?
> >
> > I mean, the code would be simple:
> >
> > compr_value=0;
> > do __t = 1 to dim(vars);
> > compr_value = compr_value + ifn(vars(__t)='Y',1,0)*2**(__t-1);
> > end;
> >
> > to compress, and then for simple uses (ie, filtering by one variable's
> > value) you could have a function/macro that took an argument "varnum" and
> > returned the value (using BAND). If you are doing frequencies of the
> > individual variables, then of course you'd need to use a view, but the
> view
> > wouldn't be any worse than storing the whole uncompressed dataset, and if
> > you could WHERE the view based on the value of one of them first, it'd be
> > much faster.
> >
> > -Joe
> >
> >
> > On Thu, Nov 4, 2010 at 3:50 PM, Sterling Paramore <gnilrets@gmail.com
> >wrote:
> >
> >> I've tried to build datastep views to hide complexity several times in
> the
> >> past, but they always end up being ridiculously slow.
> >>
> >> On Thu, Nov 4, 2010 at 1:45 PM, Fehd, Ronald J. (CDC/OCOO/ITSO) <
> >> rjf2@cdc.gov> wrote:
> >>
> >> > > From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-
> >> > > l@listserv.uga.edu] On Behalf Of Sterling Paramore
> >> > > Sent: Thursday, November 04, 2010 4:29 PM
> >> > > To: SAS-L@LISTSERV.UGA.EDU
> >> > > Subject: Re: Bit type
> >> > >
> >> > > It would be worth the disk space savings but it would not be worth
> the
> >> > > overhead of dealing with bitwise functions or explaining binary to
> my
> >> > > end users.
> >> >
> >> >
> >> > [ ] cheap
> >> > [ ] easy
> >> > [ ] fast
> >> >
> >> > { } job security <---<<<
> >> >
> >> > seems to me that hiding the complexity in data step views
> >> > both for the compression and decompression
> >> > would be the way to go w/this problem
> >> >
> >> > "I hear that employees who retire
> >> > and then come back as consultants make good money!"
> >> >
> >> > Ron Fehd the module/routine/subroutine maven CDC Atlanta GA USA RJF2
> >> at
> >> > cdc dot gov
> >> >
> >>
> >
> >
>
|