Date: Wed, 6 Jun 2007 11:08:16 -0400
Reply-To: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject: Re: PROC FREQ--DATA STEP--MODELING QUESTION
In-Reply-To: <20070605222639.A37A3478077@ws1-5.us4.outblaze.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Tom,
I wonder how you are getting in the first OUTPUT, the first row having ID =
1271 and Count = 1 and the third row having ID = 1435 and Count = 3.
To beat the SAS memory limitation, you can use arrays to hold the Row Totals
and Column Totals in the first pass of data step and output the records with
required PCTs based on the two arrays in the second pass.
Hope this idea helps you to go ahead.
Regards,
Muthia Kachirayan
On 6/5/07, Tom White <tw2@mail.com> wrote:
>
> Hello SAS-L,
>
> The data below are already sorted (or can be sorted if need be)
> by ID (first) and CODE (second).
>
> The data set of interest contains about 10 mil records and about 100
> fields.
>
> In this example, I show 20 obs and two fields of interest.
>
>
> data foo;
> input ID $ CODE $;
> cards;
> 1271 .
> 1271 201
> 1435 .
> 1435 842
> 1435 842
> 1435 307
> 1435 307
> 1435 307
> 1435 309
> 1435 .
> 1434 .
> 8393 070
> 8393 070
> 8393 070
> 8393 070
> 8393 070
> 8393 070
> 8393 070
> 8393 070
> 8393 070
> ;
> run;
>
>
> I would like to produce two datasets like:
>
> (1)
>
> ID CODE COUNT PERCENT PCT_ROW PCT_COL
> 1271 1
> 1271 201 1 6.25 100 100
> 1435 3
> 1435 842 2 12.5 33.33 100
> 1435 307 3 18.75 50 100
> 1435 309 1 6.25 16.67 100
> 8393 070 9 56.25 100 100
>
> This data set is easily created by using;
>
> proc freq data=foo;
> tables ID * CODE/outpct out=stats;
> run;
>
>
> However, when I run this PROC FREQ on the entire dataset of 10 mil obs,
> I get a message error
>
>
> ERROR: The requested table is too large to process.
> NOTE: The SAS System stopped processing this step because of errors.
> NOTE: There were 10154176 observations read from the data set WORK.FOO.
> WARNING: The data set WORK.STATS may be incomplete. When this step was
> stopped there were 0
> observations and 6 variables.
> WARNING: Data set WORK.STATS was not replaced because this step was
> stopped.
> NOTE: PROCEDURE FREQ used:
> real time 54.70 seconds
> cpu time 53.28 seconds
>
>
>
> (2)
>
> The other table I would like to get is
>
> ID CODE COUNT PERCENT PCT_ROW PCT_COL
> 1271
> 1271 201 1 6.25 100 100
> 1435 1
> 1435 842 1 6.25 16.665 50
> 1435 842 1 6.25 16.665 50
> 1435 307 1 6.25 16.667 33.33
> 1435 307 1 6.25 16.667 33.33
> 1435 307 1 6.25 16.667 33.33
> 1435 309 1 6.25 16.67 100
> 1435 1
> 1434 1
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
> 8393 070 1 6.25 11.11 11.11
>
> This table is simply the same one as table (1)
> except that it is not rolled-up like table (1) is.
>
> For example, in (1) I show,
>
> ID CODE COUNT PERCENT PCT_ROW PCT_COL
> 1435 3
> 1435 842 2 12.5 33.33 100
> 1435 307 3 18.75 50 100
> 1435 309 1 6.25 16.67 100
>
> That's 9 instances of ID 1435 as shown in tbale (2) above--not rolled-up.
>
> So I am thinking, since I will need table (2) at some point for modeling,
> I need to keep (2) in it's original form and not like in form (1) coming
> out of PROC FREQ. yet, I still need to create the new modeling variables
> COUNT, PERCENT, PCT_ROW, and maybe PCT_COL as shown in (1) and populate
> these
> corect values in the form of table (2).
>
> These new variables in (2) will become inputs to my logistic model I am
> trying
> to build.
>
> So, then, since in (1) PERCENT=12.5 for ID=1435 having COUNT=2, then,
> I am thinking, maybe wrongly, that if I divide 12.5 by 2, I should
> put 6.25 in above table (2).
>
> And so on with all the rest of the numbers, I just divide numbers in
> (1) by however many counts (COUNT) I have, and hopefully I get something
> like
> table (2).
>
> Please give me some guidance as to how to create table (1) and then,
> for modeling purposes, how can I keep the original data intact,
> yet create new modeling fields like counts and percentages as shown above
> in (2).
>
> Thank you.
>
> tom
>
>
>
>
> =
> Pedometers as Low as $1 - Free Shipping
> Huge Selection of Quality Brands Like, Yamax, Sportline, Freestyle, and
> More. Customer Logos, Free Shipping. Fast Delivery.
>
> http://a8-asy.a8ww.net/a8-ads/adftrclick?redirectid=1ed972f8bac8e902c6afbaf95c7b7930
>
|