Lou:
Your point is well taken.
However, the poster was reading with $char20., so the leading spaces
would have been preserved. Yes, if he'd read with $20. and inadvertently
dropped the leading spaces, that could have caused the excessive number of
dups being dropped. I also understood the data to be in binary / packed
format, given the statement that the 20 bytes "look like a mess".
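
For anyone following along, here is a minimal, untested sketch of the
difference Lou describes between the two informats. The data set name
informat_demo and the variable names trimmed, preserved and same are made up
for illustration, and the single data line is assumed to start with two
blanks.

```sas
/* Hypothetical demo: read the same five bytes with $5. and $char5. */
data informat_demo;
   infile datalines truncover;
   input @1 trimmed   $5.        /* $w. drops leading blanks and left-aligns */
         @1 preserved $char5.;   /* $CHARw. keeps leading blanks */
   same = (trimmed = preserved); /* 1 if the two values compare equal, else 0 */
   /* Brackets make leading and trailing blanks visible in the log. */
   put '[' trimmed $char5. '] [' preserved $char5. '] ' same=;
   datalines;
  abc
;
run;
```

If the informats behave as described, the two bracketed values in the log
should differ only in where the blanks fall, and same should be 0, which is
why the two values would land in different places in a sort.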
-----Original Message-----
From: Lou [mailto:lpogodajr292185@COMCAST.NET]
Sent: Monday, February 16, 2004 3:54 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: sorting data on mainframe
"Droogendyk, Harry" <Harry.Droogendyk@CIBC.COM> wrote in message
news:F0161D3F7AC5D411A5BE009027E774D60E61D02E@gemmrd-scc013eu.gem.cibc.com...
> If you want to verify the duplicates, i.e. to go back to the users and prove
> that it ain't 1%, use something like the following to keep the dups. It may
> be that they meant that 1% of the sort keys were duplicated. However, each
> duplicate key has many duplicates.
>
> data dedupped
>      dups;
>   set a;
>   by i;
>   if first.i then
>     output dedupped;
>   else
>     output dups;
> run;
>
> I wouldn't think the informat matters and based on the test below, it
> doesn't.
The INFORMAT most definitely could matter. When the original poster is
reading in a flat file, reading a 20 byte character variable with a $20.
informat will drop any leading spaces in the value, while reading the
variable with a $char20. informat will preserve any leading spaces as part
of the value.
You need to view this with a monospace font to be sure of seeing it
correctly, but if we have five bytes with the values of space/space/a/b/c,
reading those bytes with a $5. informat will result in a value of
"abc  "
while reading them in with a $char5. informat will result in a value of
"  abc".
And of course, these two values sort differently.
> 1
> 2 data a;
> 3 informat fld $20.;
> 4 do fld = '01'x, '3F'x, '9E'x, 'ff'x;
> 5 output;
> 6 output;
> 7 end;
> 8 run;
>
> NOTE: The data set WORK.A has 8 observations and 1 variables.
> NOTE: The DATA statement used 0.01 CPU seconds.
>
> 9
> 10 proc sort data=a nodupkey;
> 11 by fld;
> 12 run;
>
> NOTE: 4 observations with duplicate key values were deleted.
> NOTE: There were 8 observations read from the data set WORK.A.
> NOTE: The data set WORK.A has 4 observations and 1 variables.
> NOTE: The PROCEDURE SORT used 0.00 CPU seconds.
>
>
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of PD
> Sent: Monday, February 16, 2004 1:16 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: sorting data on mainframe
>
> I have a data set that has a 20 byte long 'character' variable, on
> mainframe. The data set is a flat text file.
>
> When browsing the data on the mainframe, without turning on the HEX
> command at ISPF, the 20 bytes look like a mess. With HEX command
> turned on, the 20 bytes look like Ok, clean 20 bytes with numbers and
> characters, the way it is supposed to be.
>
> Now I need to read it into SAS for some processing. I need to sort it
> first.
>
> Question 1 is the informat: which informat should I use, $char20. or
> else? I tried $char20. and did a sorting,
>
> proc sort nodupkeys; by the_variable; run;
>
> Then I lost 2 thirds of the values / observations. Our business people
> told us there should be only about 1% duplicates.
>
> Question 2 is about proc sort: I read SAS documents about hosts using
> ASCII sort order vs. EBCDIC sorting order. I am not sure if this is
> relevant to my case here. Should I have added any options when sorting
> on the mainframe?
>
> Thanks.
>
> PD