Date: Mon, 3 Jan 2005 08:57:29 -0500
Reply-To: Arthur Tabachneck <art297@NETSCAPE.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@NETSCAPE.NET>
Subject: Re: Parsing Raw Data to remove Carriage Returns
Kevin,
On Sun, 2 Jan 2005 15:40:32 -0500, Kevin Christensen wrote:
>I have a large XML file that I am trying to convert to a sas dataset.
>Unfortunately there are carriage returns scattered throughout the dataset
>that I need to eliminate, or at least disregard in order to read the
>records properly.
>
>P.S. Anyone know a good site or book that will help with parsing data?
>Something on SAS functions and data steps perhaps?
As shown below, Kevin sent me a sample of his data offline.

A search of the SAS-L archives will turn up plenty of useful hints and
references to helpful guides.

Unfortunately, getting what you want probably isn't as trivial as the
following code makes it look but, then again, that is something you will
have to discover for yourself.

Your sample file appears to come from a structured database, so the
problem could still end up being trivial. I didn't see carriage returns
posing any significant problem in accomplishing what you want:
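/* The DATA statement was missing from the post; the name WANT is assumed */
data want;
  infile "C:\sample.XML" truncover end=lst;
  retain dnum date;  /* carry DNUM forward until its DATE line is read */
  if not(lst) then do;
    input record $255. ;
    /* Document number: the text between <DNUM><PDAT> and </PDAT */
    if index(record,'<DNUM><PDAT>') > 0 then do;
      start=index(record,'<DNUM><PDAT>')+12;
      numchars=index(record,'</PDAT')-start;
      dnum=substr(record,start,numchars);
    end;
    /* Date: same approach; write one row per DNUM/DATE pair, then reset */
    else if index(record,'<DATE><PDAT>') > 0 then do;
      start=index(record,'<DATE><PDAT>')+12;
      numchars=index(record,'</PDAT')-start;
      date=substr(record,start,numchars);
      output;
      dnum='';
      date='';
    end;
  end;
run;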
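
To see the position arithmetic at work, here is a small sketch that runs
the same INDEX/SUBSTR steps on a single line copied from the sample. The
DATA _NULL_ step and the PUT statement are there only for illustration
and are not part of the code above.

```sas
/* Illustration only: trace the INDEX/SUBSTR arithmetic on one sample line */
data _null_;
  record = '<B110><DNUM><PDAT>D0484671</PDAT></DNUM></B110>';
  /* '<DNUM><PDAT>' is 12 characters long, so +12 skips past the tags */
  start = index(record, '<DNUM><PDAT>') + 12;
  /* the value ends just before the first closing </PDAT */
  numchars = index(record, '</PDAT') - start;
  dnum = substr(record, start, numchars);
  /* writes to the log: start=19 numchars=8 dnum=D0484671 */
  put start= numchars= dnum=;
run;
```

The same arithmetic applies to the date lines, because '<DATE><PDAT>' is
also 12 characters long.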
Art
---------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PATDOC SYSTEM "ST32-US-Grant-025xml.dtd" [
<!ENTITY USD0484671-20040106-D00000.TIF SYSTEM "USD0484671-20040106-
D00000.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00001.TIF SYSTEM "USD0484671-20040106-
D00001.TIF" NDATA TIF>
]>
<PATDOC DTD="2.5" STATUS="Build 20030724">
<SDOBI>
<B100>
<B110><DNUM><PDAT>D0484671</PDAT></DNUM></B110>
<B130><PDAT>S1</PDAT></B130>
<B140><DATE><PDAT>20040106</PDAT></DATE></B140>
<B190><PDAT>US</PDAT></B190>
</B100>
<B200>
<B210><DNUM><PDAT>29174009</PDAT></DNUM></B210>
<B211US><PDAT>29</PDAT></B211US>
<B220><DATE><PDAT>20030110</PDAT></DATE></B220>
</B200>
<B400>
<B472>
<B474><PDAT>14</PDAT></B474>
</B472>
</B400>
<B500>
<B510>
<B511><PDAT>0201</PDAT></B511>
<B516><PDAT>7</PDAT></B516>
</B510>
<B520>
<B521><PDAT>D 2712</PDAT></B521>
</B520>
<B540><STEXT><PDAT>Apparel</PDAT></STEXT></B540>
<B560>
<B561>
<PCIT>
<DOC><DNUM><PDAT>1998140</PDAT></DNUM>
<DATE><PDAT>19350400</PDAT></DATE>
<KIND><PDAT>A</PDAT></KIND>
</DOC>
<PARTY-US>
<NAM><SNM><STEXT><PDAT>Loew</PDAT></STEXT></SNM></NAM>
</PARTY-US>
</PCIT><CITED-BY-OTHER/>
</B561>
<B561>
<PCIT>