Date: Mon, 14 Aug 2006 22:08:01 -0400
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: Proving records with bad quality in a file before reading the
file into a data set
Content-Type: text/plain; charset=ISO-8859-1
On Sat, 12 Aug 2006 20:54:09 +0200, Rune Runnestø <rune@FASTLANE.NO> wrote:
>Hi,
>
>I have written a program to read an external file into a data set. Here is
>the file. It has three records, separated from each other by '----' lines:
>
>----------------------------------------------------------------------------
>SAKSNR: 1994000004
>ARKIV: 326.12
>TITTEL: First title line is this
> and here is the second title line
>SAKSDATO: 04.01.1994 SISTE DOK.: 26.11.2001 ANT.DOK: 26
>----------------------------------------------------------------------------
>SAKSNR: 1994000007
>ARKIV: 341.9
>TITTEL: First title line
>SAKSDATO: 06.01.1994 SISTE DOK.: 16.03.2001 ANT.DOK: 22
>----------------------------------------------------------------------------
>SAKSNR: 1994000008
>ARKIV: 326.10
>TITTEL: First title line
>SAKSDATO: 06.01.1994 SISTE DOK.: 15.05.2003 ANT.DOK: 17
>----------------------------------------------------------------------------
>
>
>A successful reading of the file depends on the labels being present. The
>output from the data set looks like this:
>
>Obs  Saksnr      Arkiv_nokkelkode  Sakstittel                                                  Saksdato    Siste_dok   Antall_dok
>1    1994000004  326.12            First title line is this and here is the second title line  04/01/1994  26/11/2001  26
>2    1994000007  341.9             First title line                                            06/01/1994  16/03/2001  22
>3    1994000008  326.10            First title line                                            06/01/1994  15/05/2003  17
>
>The labels are anchors for their respective data values. If one or more of
>the labels are missing, there will be logical errors in the data set: the
>program I have written may jump to the next record in the middle of a
>record, or the record may be left out of the data set.
>
>The program I want to write should check that all the labels exist and, if
>at least one of them is absent, write the whole record (all data lines
>between two consecutive '-----------' lines) to a file. For the record, the
>labels are:
>SAKSNR:
>ARKIV:
>TITTEL:
>SAKSDATO:
>SISTE DOK.:
>ANT.DOK:
>
>An external file with bad quality might look like this:
>----------------------------------------------------------------------------
>SAKSNR: 1994000004
>ARKIV: 326.12
>TITTEL: First title line is this
> and here is the second title line
>SAKSDATO: 04.01.1994 SISTE DOK.: 26.11.2001 ANT.DOK: 26
>----------------------------------------------------------------------------
> 1994000007
>ARKIV: 341.9
>TITTEL: First title line
>SAKSDATO: 06.01.1994 SISTE DOK.: 16.03.2001 ANT.DOK: 22
>----------------------------------------------------------------------------
>SAKSNR: 1994000008
>ARKIV: 326.10
>TITTEL: First title line
> 06.01.1994 SISTE DOK.: 15.05.2003 ANT.DOK: 17
>----------------------------------------------------------------------------
>
>Here, only the first record is OK. The second is missing the label SAKSNR:
>and the third is missing the label SAKSDATO:
>So in this case, I would want the output file to look like this:
>
>----------------------------------------------------------------------------
><--- data line # 7
> 1994000007
>ARKIV: 341.9
>TITTEL: First title line
>SAKSDATO: 06.01.1994 SISTE DOK.: 16.03.2001 ANT.DOK: 22
>----------------------------------------------------------------------------
><---data line # 12
>SAKSNR: 1994000008
>ARKIV: 326.10
>TITTEL: First title line
> 06.01.1994 SISTE DOK.: 15.05.2003 ANT.DOK: 17
>----------------------------------------------------------------------------
>
>Identifying the data line # where each bad record starts is very helpful
>when going back to the original external file to find the records,
>especially when the file holds 20,000 records and may have 100,000 data
>lines.
>
>Can anyone help me with this program logic ?
>
>Regards, Rune
I think the key is to avoid trying to do it in one step. Here is a
multi-step approach.
First create a test file:
filename demo 'c:\temp\demo';
data _null_;
   file demo;
   put '-----------------------------------------------------------';
   put 'SAKSNR: 1994000004';
   put 'ARKIV: 326.12';
   put 'TITTEL: First title line is this';
   put ' and here is the second title line';
   put 'SAKSDATO: 04.01.1994 SISTE DOK.: 26.11.2001 ';
   put '-----------------------------------------------------------';
   put ' 1994000007';
   put 'ARKIV: 341.9';
   put 'TITTEL: First title line';
   put 'SAKSDATO: 06.01.1994 SISTE DOK.: 16.03.2001 ';
   put '-----------------------------------------------------------';
   put 'SAKSNR: 1994000008';
   put 'ARKIV: 326.10';
   put 'TITTEL: First title line';
   put ' 06.01.1994 SISTE DOK.: 15.05.2003 ';
   put '-----------------------------------------------------------';
run;
I've chopped off the ends of the lines to avoid wrapping.
The first step is to put the lines into a data set:
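data lines;
   infile demo;
   input;
   /* A line with nothing but hyphens is a delimiter: it starts a new group */
   if missing(compress(_infile_,'-') ) then do;
      group + 1;
      delete;
   end;
   /* Any other line is kept, tagged with the current group number */
   else line = _infile_;
run;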
Now analyze:
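data notOK(keep=group);
   labelcount = 0;
   /* Read all the lines of one group, counting the labels found */
   do until (last.group);
      set lines;
      by group;
      /* The text before the first colon is the candidate label */
      label = scan(line,1,':');
      select (label);
         when("SAKSNR") labelcount + 1;
         when("ARKIV") labelcount + 1;
         when("TITTEL") labelcount + 1;
         when("SAKSDATO") do;
            labelcount + 1;
            /* SISTE DOK. is expected to start in column 36 of this line */
            if scan(substr(line,36),1,':')='SISTE DOK.'
               then labelcount + 1;
         end;
         otherwise;
      end;
   end;
   /* Output one observation for each group that is short of labels */
   if labelcount<5;
run;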
You'll have to add an IF statement for ANT.DOK and change the threshold from
5 to 6.
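As a rough sketch of that change (not tested against the real file): here I
have swapped the fixed-column SUBSTR test for INDEX, which searches the whole
line, because I am only assuming where ANT.DOK starts. If the columns are
reliable, the SUBSTR form works just as well.

```sas
data notOK(keep=group);
   labelcount = 0;
   do until (last.group);
      set lines;
      by group;
      label = scan(line, 1, ':');
      select (label);
         when ("SAKSNR")   labelcount + 1;
         when ("ARKIV")    labelcount + 1;
         when ("TITTEL")   labelcount + 1;
         when ("SAKSDATO") do;
            labelcount + 1;
            /* Assumption: look anywhere on the line, not at a fixed column */
            if index(line, 'SISTE DOK.:') > 0 then labelcount + 1;
            if index(line, 'ANT.DOK:')    > 0 then labelcount + 1;
         end;
         otherwise;
      end;
   end;
   /* Six labels are now expected in every record */
   if labelcount < 6;
run;
```

Note that the test file above has the ANT.DOK part chopped off, so the full
lines would have to be restored before trying this on it.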
Finally present the problem cases:
data _null_;
   merge notOK(in=dump) lines;
   by group;
   if dump;
   if first.group then put '--------';
   put line $char75.;
   if last.group then put '--------';
run;
Results:
--------
1994000007
ARKIV: 341.9
TITTEL: First title line
SAKSDATO: 06.01.1994 SISTE DOK.: 16.03.2001
--------
--------
SAKSNR: 1994000008
ARKIV: 326.10
TITTEL: First title line
06.01.1994 SISTE DOK.: 15.05.2003
--------
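
The steps above do not report the data line # that Rune asked for. One
possible way to get it, as a sketch only: remember the line number of the
delimiter that opens each record when the file is read. The variable name
STARTLINE is my own invention, and this relies on _N_ matching the line
number, which holds here because each iteration of the step reads exactly one
line.

```sas
data lines;
   infile demo;
   input;
   retain startline;
   if missing(compress(_infile_, '-')) then do;
      group + 1;
      startline = _n_;   /* line number of the delimiter opening the record */
      delete;
   end;
   else line = _infile_;
run;

/* Run the analysis step (data notOK) again here, unchanged */

data _null_;
   merge notOK(in=dump) lines;
   by group;
   if dump;
   /* Print the delimiter, then the marker with the line number below it */
   if first.group then put '--------' / '<--- data line # ' startline;
   put line $char75.;
   if last.group then put '--------';
run;
```

With the test file above, the two bad records should be reported at data
lines 7 and 12, the same numbers as in Rune's example.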