Date: Wed, 26 Jan 2005 15:27:27 -0800
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: Utility to scan string variables for non-printable characters?
In-Reply-To: <35po6eF4pt0p7U1@individual.net>
Content-type: text/plain; charset=US-ASCII
"Richard A. DeVenezia" <radevenz@IX.NETCOM.COM> sagely replied [in
part]:
> %let seed = 4669211;
>
> data strings;
> do i = 0 to 255;
> length s $20;
> s = repeat (byte(i),20*ranuni(&seed));
> output;
> end;
> run;
>
> data foo;
/* Richard pointed out in a subsequent note that there's an invisible */
retain unwantedCharSet;
/* right here, so pretend you can see it :-) */
> set strings;
>
> * Let ASCII printables be codes 32..128;
> * vary as needed per platform or application;
> if _n_ = 1 then
> unwantedCharSet = collate(0,31) || collate(129,255);
>
> unwantedCountC = countc(s,unwantedCharSet,'t');
> drop unwantedCharSet;
> run;
Nice code. (As always!) Have you considered using the 'o' option
of COUNTC() as well? You're not changing the unwantedCharSet variable
anywhere in your data step, so you could go with 'to' instead of 't'
to get the 'compile only once' option.
We can check by enlarging that test data set. On my machine, I have a
lot of other stuff currently running, so the times vary wildly, but
this is fairly typical:

51   %let seed = 4669211;
52
53   data strings;
54      length s $20;
55      do j = 1 to 1000;
56         do i = 0 to 255;
57            s = repeat (byte(i),20*ranuni(&seed));
58            output;
59         end;
60      end;
61   run;

NOTE: The data set WORK.STRINGS has 256000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.23 seconds
      cpu time            0.23 seconds

Now we'll use SASFILE so the times will be somewhat less dependent on
I/O and memory buffering.

171  sasfile work.strings load;
NOTE: The file WORK.STRINGS.DATA has been opened by the SASFILE statement.
172
173  data foo2(drop=unwantedCharSet);
174     retain unwantedCharSet;
175     set strings;
176     if _n_ = 1 then unwantedCharSet = collate(0,31) || collate(129,255);
177     unwantedCountC = countc(s,unwantedCharSet,'t');
178  run;

NOTE: There were 256000 observations read from the data set WORK.STRINGS.
NOTE: The data set WORK.FOO2 has 256000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           3.01 seconds
      cpu time            0.37 seconds

179
180  data foo1(drop=unwantedCharSet);
181     retain unwantedCharSet;
182     set strings;
183     if _n_ = 1 then unwantedCharSet = collate(0,31) || collate(129,255);
184     unwantedCountC = countc(s,unwantedCharSet,'to');
185  run;

NOTE: There were 256000 observations read from the data set WORK.STRINGS.
NOTE: The data set WORK.FOO1 has 256000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.25 seconds
      cpu time            0.25 seconds

186
187  sasfile work.strings close;
NOTE: The file WORK.STRINGS.DATA has been closed by the SASFILE statement.

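So we get a fairly significant improvement in processing time: CPU
time drops from 0.37 seconds with 't' to 0.25 seconds with 'to'. (The
3.01 seconds of real time on the 't' run probably owes more to the
other jobs on my machine than to COUNTC() itself.) Still, YMMV.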
I suppose that makes the 'o' option worth using when the 'character
set' to be checked never gets modified. But be aware that if you
DO use the 'o' option and then modify the character set, the SAS
docs say that COUNTC() will ignore your modifications. So
_caveat_utor_ here.
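
To make that caveat concrete, here is a minimal sketch. It is not taken
from the SAS docs or from Richard's code: the literal 'aaabb' and the
names chars, n_plain and n_once are made up for illustration. It
changes the character list on each pass of a loop and calls COUNTC()
both without modifiers and with 'o'. If the documented behavior holds,
n_plain should follow the current value of chars, while n_once should
keep counting against the first character list it saw. No expected
output is shown; check the log on your own release.

```sas
data _null_;
   length chars $1;
   /* Hypothetical test: change the character list on each iteration */
   do chars = 'a', 'b';
      /* No modifiers: the list is re-read on every call */
      n_plain = countc('aaabb', chars);
      /* 'o': the list and modifiers are processed only once */
      n_once  = countc('aaabb', chars, 'o');
      put chars= n_plain= n_once=;
   end;
run;
```

If the two counts disagree on the second pass, that is the 'o' option
ignoring the modified list, which is exactly the case to avoid in
production code.
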
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician