LISTSERV at the University of Georgia
Date:         Mon, 14 Aug 2006 22:08:01 -0400
Reply-To:     "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject:      Re: Proving records with bad quality in a file before reading the
              file into a data set
Content-Type: text/plain; charset=ISO-8859-1

On Sat, 12 Aug 2006 20:54:09 +0200, Rune Runnestø <rune@FASTLANE.NO> wrote:

> Hi,
>
> I have written a program to read an external file into a data set. Here is
> the file, it has three records, each record separated by '----':
>
> ----------------------------------------------------------------------------
> SAKSNR:       1994000004
> ARKIV:        326.12
> TITTEL:       First title line is this
>               and here is the second title line
> SAKSDATO:     04.01.1994           SISTE DOK.: 26.11.2001  ANT.DOK: 26
> ----------------------------------------------------------------------------
> SAKSNR:       1994000007
> ARKIV:        341.9
> TITTEL:       First title line
> SAKSDATO:     06.01.1994           SISTE DOK.: 16.03.2001  ANT.DOK: 22
> ----------------------------------------------------------------------------
> SAKSNR:       1994000008
> ARKIV:        326.10
> TITTEL:       First title line
> SAKSDATO:     06.01.1994           SISTE DOK.: 15.05.2003  ANT.DOK: 17
> ----------------------------------------------------------------------------
>
> A successful reading of the file depends on that the labels are present. The
> output from the data set looks like this:
>
> Obs  Saksnr      Arkiv_nokkelkode  Sakstittel                                                    Saksdato    Siste_dok   Antall_dok
>   1  1994000004  326.12            First title line is this and here is the second title line    04/01/1994  26/11/2001  26
>   2  1994000007  341.9             First title line                                              06/01/1994  16/03/2001  22
>   3  1994000008  326.10            First title line                                              06/01/1994  15/05/2003  17
>
> The labels are anchors for their respective data values. If one or more of
> the labels are missing, there will be logical errors in the data set. This
> may cause that the program I have written, will jump to the next record in
> the middle of a record, or that the record will be skipped from the data set.
>
> The program I want to write shall search for the existence of all the
> labels, and if at least one of them are absent, then print the whole record
> (all data lines between the two subsequent '-----------' to be written out
> to a file. For orders sake, the labels are:
> SAKSNR:
> ARKIV:
> TITTEL:
> SAKSDATO:
> SISTE DOK.:
> ANT.DOK:
>
> An external file with bad quality might look like this:
> ----------------------------------------------------------------------------
> SAKSNR:       1994000004
> ARKIV:        326.12
> TITTEL:       First title line is this
>               and here is the second title line
> SAKSDATO:     04.01.1994           SISTE DOK.: 26.11.2001  ANT.DOK: 26
> ----------------------------------------------------------------------------
>               1994000007
> ARKIV:        341.9
> TITTEL:       First title line
> SAKSDATO:     06.01.1994           SISTE DOK.: 16.03.2001  ANT.DOK: 22
> ----------------------------------------------------------------------------
> SAKSNR:       1994000008
> ARKIV:        326.10
> TITTEL:       First title line
>               06.01.1994           SISTE DOK.: 15.05.2003  ANT.DOK: 17
> ----------------------------------------------------------------------------
>
> Here, just the first record is OK, the second is short of the label SAKSNR:
> and the third is short of the label SAKSDATO:
> So in this case, I would want the output file to look like this:
>
> ----------------------------------------------------------------------------
> <--- data line # 7
>               1994000007
> ARKIV:        341.9
> TITTEL:       First title line
> SAKSDATO:     06.01.1994           SISTE DOK.: 16.03.2001  ANT.DOK: 22
> ----------------------------------------------------------------------------
> <---data line # 12
> SAKSNR:       1994000008
> ARKIV:        326.10
> TITTEL:       First title line
>               06.01.1994           SISTE DOK.: 15.05.2003  ANT.DOK: 17
> ----------------------------------------------------------------------------
>
> The identifying og the data line # where the bad records start is of very
> much help when looking back into the original external file where to find
> the records. Especially when the file is 20.000 records of size and may have
> 100.000 data lines.
>
> Can anyone help me with this program logic ?
>
> Regards, Rune

I think the key is to avoid trying to do it in one step. Here is a multi-step approach.

First create a test file:

filename demo 'c:\temp\demo';

   /* The archive collapsed runs of blanks. The spacing below is        */
   /* reconstructed so that SISTE DOK. starts in column 36, which the   */
   /* analysis step relies on; the exact original columns are assumed.  */
   data _null_;
      file demo;
      put '-----------------------------------------------------------';
      put 'SAKSNR:       1994000004';
      put 'ARKIV:        326.12';
      put 'TITTEL:       First title line is this';
      put '              and here is the second title line';
      put 'SAKSDATO:     04.01.1994           SISTE DOK.: 26.11.2001';
      put '-----------------------------------------------------------';
      put '              1994000007';
      put 'ARKIV:        341.9';
      put 'TITTEL:       First title line';
      put 'SAKSDATO:     06.01.1994           SISTE DOK.: 16.03.2001';
      put '-----------------------------------------------------------';
      put 'SAKSNR:       1994000008';
      put 'ARKIV:        326.10';
      put 'TITTEL:       First title line';
      put '              06.01.1994           SISTE DOK.: 15.05.2003';
      put '-----------------------------------------------------------';
   run;

I've chopped off the ends of the lines to avoid wrapping.

The first step is to put the lines into a data set:

   data lines;
      infile demo;
      input;
      /* A line made up only of hyphens starts a new record */
      if missing(compress(_infile_,'-')) then do;
         group + 1;
         delete;
      end;
      else line = _infile_;
   run;

Now analyze:

   data notOK(keep=group);
      labelcount = 0;
      /* Read one whole record (group) per data step iteration */
      do until (last.group);
         set lines;
         by group;
         label = scan(line,1,':');
         select (label);
            when ("SAKSNR")   labelcount + 1;
            when ("ARKIV")    labelcount + 1;
            when ("TITTEL")   labelcount + 1;
            when ("SAKSDATO") do;
               labelcount + 1;
               /* SISTE DOK. is expected to start in column 36 */
               if scan(substr(line,36),1,':') = 'SISTE DOK.' then labelcount + 1;
            end;
            otherwise;
         end;
      end;
      /* Keep only the records that lack at least one label */
      if labelcount < 5;
   run;

You'll have to add an IF statement for ANT.DOK and change the threshold from 5 to 6.
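
As a rough sketch of that change (not tested here), the analysis step might look like the following. The column where ANT.DOK: starts is not known, because the line ends were chopped off, so this version assumes it is enough to look for the label anywhere on the SAKSDATO line with INDEX. If the layout is fixed, a column-based test like the one for SISTE DOK. would be stricter. The test file would also need its ANT.DOK: fields restored, otherwise every record falls below the new threshold.

   data notOK(keep=group);
      labelcount = 0;
      do until (last.group);
         set lines;
         by group;
         label = scan(line,1,':');
         select (label);
            when ("SAKSNR")   labelcount + 1;
            when ("ARKIV")    labelcount + 1;
            when ("TITTEL")   labelcount + 1;
            when ("SAKSDATO") do;
               labelcount + 1;
               if scan(substr(line,36),1,':') = 'SISTE DOK.' then labelcount + 1;
               /* Assumption: the label may sit anywhere on this line */
               if index(line,'ANT.DOK:') > 0 then labelcount + 1;
            end;
            otherwise;
         end;
      end;
      /* Six labels are now expected per record */
      if labelcount < 6;
   run;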

Finally present the problem cases:

   data _null_;
      merge notOK(in=dump) lines;
      by group;
      /* Print only the lines that belong to a flagged record */
      if dump;
      if first.group then put '--------';
      put line $char75.;
      if last.group then put '--------';
   run;

Results:

   --------
                 1994000007
   ARKIV:        341.9
   TITTEL:       First title line
   SAKSDATO:     06.01.1994           SISTE DOK.: 16.03.2001
   --------
   --------
   SAKSNR:       1994000008
   ARKIV:        326.10
   TITTEL:       First title line
                 06.01.1994           SISTE DOK.: 15.05.2003
   --------
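
The original question also asked for the line number at which each bad record starts. One possible way to carry that through is sketched below; it is not tested here. It assumes that the first step executes exactly one INPUT per iteration, so that _N_ equals the line number in the file, and it introduces a variable called startline (a name chosen for this sketch) that holds the line number of the separator opening each record. The analysis step runs unchanged between the two steps shown.

   data lines;
      infile demo;
      input;
      retain startline;
      if missing(compress(_infile_,'-')) then do;
         group + 1;
         startline = _n_;   /* line number of the separator line */
         delete;
      end;
      else line = _infile_;
   run;

   data _null_;
      merge notOK(in=dump) lines;
      by group;
      if dump;
      /* The slash moves to a new output line before the marker */
      if first.group then put '--------' / '<--- data line # ' startline;
      put line $char75.;
      if last.group then put '--------';
   run;

With the test file above, the separators that open the two bad records are lines 7 and 12, so those are the numbers this should report.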

