PD -
SAS is about 99% environment independent: INFORMAT $char20. and proc sort
nodupkeys should do the trick on MVS just as on a PC. $Char20. will read
anything except perhaps special end-of-file markers and CR/LFs, and even that
can be circumvented by reading the data as a stream.
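One way to check what $char20. actually read is to print the bytes in hex and compare them with the ISPF HEX display. This is only a sketch, untested: 'mvs.input.file' is a placeholder dataset name, keyhex is a made-up variable name, and it assumes the 20-byte variable starts in column 1.

```sas
/* Sketch only, untested. 'mvs.input.file' is a placeholder name   */
/* and column 1 is an assumed start position for the variable.     */
data datax;
  infile 'mvs.input.file';
  input @1 key $char20.;          /* read the 20 bytes unchanged    */
  keyhex = put(key, $hex40.);     /* two hex digits per byte        */
run;

proc print data=datax(obs=10);
  var keyhex;
run;
```

If the read is correct, each pair of hex digits should match one column of the ISPF display (zone digit on top, numeric digit below), so a record starting with zip 00601 should begin F0F0F6F0F1.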
Why don't you keep the offending duplicate records and look at them?
(untested)
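data datax;
  input.....;
run;

proc sort data=datax;
  by key;
run;

data _null_;
  set datax;
  by key;
  if first.key and last.key then do;   /* key occurs exactly once */
    file 'mvs.file.singles';
    put key $char20.;                  /* write all 20 bytes as read */
  end;
  else do;   /* not first.key or not last.key: key occurs more than once */
    file 'mvs.file.dups';
    put key $char20.;
  end;
run;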
And then look at them with ISPF?
When you first read the data in you could create a line counter and output
it also, allowing you to compare back to the source.
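A minimal sketch of the line-counter idea, untested. The dataset name 'mvs.input.file' and the variable name recno are placeholders, and column 1 is an assumed start position.

```sas
/* Sketch only, untested. Names and column position are assumptions. */
data datax;
  infile 'mvs.input.file';
  input @1 key $char20.;
  recno = _n_;   /* one record is read per iteration, so _N_ is the
                    record number in the source file */
run;
```

Writing recno next to the key in the later step, for example with put recno 8. +1 key $char20.; would let you find each suspect record in the source by its line number.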
Or simply read the file in and out to a new file and then use ISPF's compare
utility (=3.12) to see if SAS reads it correctly?
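The read-in, write-out test could look roughly like this (untested). Both dataset names are placeholders, and the sketch assumes the output dataset is already allocated with the same record format and record length as the input.

```sas
/* Sketch only, untested. Both dataset names are placeholders. */
data _null_;
  infile 'mvs.input.file';
  file 'mvs.file.copy';
  input;            /* read the next record into the input buffer */
  put _infile_;     /* write the buffer back out unchanged        */
run;
```

If the compare utility then reports differences between the two files, the problem is in how the file is being read rather than in the sort.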
Your log will tell you how many records are read and if SAS went to a new
line, etc. You may want to post it.
Good luck
Paul Choate
DDS Data Extraction
(916) 654-2160
-----Original Message-----
From: PD [mailto:sophe88@YAHOO.COM]
Sent: Wednesday, February 18, 2004 1:06 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: sorting data on mainframe
Thanks for all your replies.
1. I did use $char20. to read it in.
2. It is a flat, FB text file generated on mainframe and intended for
use on the mainframe.
3. The var in question is NOT packed Decimal of any kind.
Below are the 'messy' records without HEX turned on:
00601.-.¦-¤á.²[!Ah§k
00601.-.¦-¤á.²¥;5r_Ø
00601.-.[...:ÐÆ<"þÁè
00602.-. -ú©À [d¼i.+
00602.-.¥.z\|K[qd¤%ù
00602.-.ó-úE'+¯þ¥½>*
00602.-.--H .-·¬Ì¥uª
00602.-.÷..º2És;òþË(
00602.-.9-.©¾s;òº[¤
The first 5 bytes are zip code. No problem there. After HEX is turned
on, the data look like this (for the first two records)
00601.-.¦-¤á.²[!Ah§k
FFFFF06166941EB5C8B9
0060100FA0F5FAAA1852
-------------------
00601.-.¦-¤á.²¥;5r_Ø
FFFFF06166941EB5F968
0060100FA0F5FA2E59D0
My concern is this, and only this so far:
If INFORMAT $char20. is not the correct way to read this variable, that
may have contributed to my losing 2/3 of the records when using
"proc sort nodupkeys". That is why I am NOT ready to tell my business
people that their notion of 1% dups is wrong. In other words, I don't
yet have evidence from the data to support my claim that they are wrong.
Perhaps I should have used another INFORMAT instead of $char20. OR
(especially if $char20. is the right informat) I should have added
something to proc sort to accommodate the fact that my host is OS/390,
a system that, unlike Windows, is not ASCII; the sort order built into
proc sort may differ from the one used when the SAS program runs on
Windows.
Thanks again for your input on this.
PD
ghellrieg@T-ONLINE.DE (Gerhard Hellriegel) wrote in message
news:<200402180853.i1I8rn919598@listserv.cc.uga.edu>...
> On Tue, 17 Feb 2004 09:24:27 -0800, Choate, Paul@DDS <pchoate@DDS.CA.GOV> wrote:
>
> >Hi PD -
> >
> >Why not post a little of the data (with hex=on) for us to look at?
> >
> >What is supposed to be in the file? You said character data, but is it one
> >long string like a comment or address, or are there separate variables such
> >as dates, dollar amounts, id's etc?
> >
> >What is the source of the data, a PC file? If the file's source was other
> >than MVS, then how did it get on the mainframe (FTP, ind$file, proc upload)?
> >
> >What are the Data Set Information values of the dataset (ISPF 3.2)?
> >
> >I'd guess that maybe it's an ASCII file that wasn't properly uploaded, but
> >I'd have to see it first. ASCII and EBCDIC are related by a translation
> >table. Depending on the file type, you would move the file with a binary or
> >text transfer; if the wrong transfer was used, the data might look
> >garbled as you describe.
> >
> >hth
> >
> >Paul Choate
> >DDS Data Extraction
> >(916) 654-2160
> >
> >-----Original Message-----
> >From: PD [mailto:sophe88@YAHOO.COM]
> >Sent: Monday, February 16, 2004 10:16 AM
> >To: SAS-L@LISTSERV.UGA.EDU
> >Subject: sorting data on mainframe
> >
> >I have a data set that has a 20 byte long 'character' variable, on
> >mainframe. The data set is a flat text file.
> >
> >When browsing the data on the mainframe without turning on the HEX
> >command in ISPF, the 20 bytes look like a mess. With the HEX command
> >turned on, the 20 bytes look OK: a clean 20 bytes of numbers and
> >characters, the way they are supposed to be.
> >
> >Now I need to read it into SAS for some processing. I need to sort it
> >first.
> >
> >Question 1 is the informat: which informat should I use, $char20. or
> >something else? I tried $char20. and then sorted:
> >
> >proc sort nodupkeys; by the_variable; run;
> >
> >Then I lost 2 thirds of the values / observations. Our business people
> >told us there should be only about 1% duplicates.
> >
> >Question 2 is about proc sort: I read SAS documents about hosts using
> >ASCII sort order vs. EBCDIC sorting order. I am not sure if this is
> >relevant to my case here. Should I have added any options when sorting
> >on the mainframe?
> >
> >Thanks.
> >
> >PD
>
>
> How do you read it in SAS? The EBCDIC vs. ASCII difference should not
> affect anything except the sort order.
>
> Try to read it like:
>
> data in;
> length c $50; /* what is your LRECL?? */
> infile xxx;
> input;
> c=_infile_;
> run;
>
> proc sort nodupkey;
> by c;
> run;
>
> What are the results now?