LISTSERV at the University of Georgia
Date:         Mon, 3 Jan 2005 08:57:29 -0500
Reply-To:     Arthur Tabachneck <art297@NETSCAPE.NET>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Arthur Tabachneck <art297@NETSCAPE.NET>
Subject:      Re: Parsing Raw Data to remove Carriage Returns
Comments: To: Kevin Christensen <chriske2@UFL.EDU>

Kevin,

On Sun, 2 Jan 2005 15:40:32 -0500, Kevin Christensen wrote:

>I have a large XML file that I am trying to convert to a sas dataset.
>Unfortunately there are carriage returns scattered throughout the dataset
>that I need to eliminate, or at least disregard in order to read the
>records properly.
>
>
>P.S. Anyone know a good site or book that will help with parsing data?
>Something on SAS functions and data steps perhaps?

As shown below, Kevin sent me a sample of his data offline.

A search of the SAS-L archives will provide a lot of useful hints and references to helpful guides.

Unfortunately, getting what you want probably isn't as trivial as the following code makes it look but, then again, that is something you will have to discover for yourself.

Having reviewed your sample file, I think it comes from a structured database, so the problem could end up being trivial. I didn't see the carriage returns posing any significant problem for what you want:

data want;  /* output data set name is a placeholder */
  infile "C:\sample.XML" truncover end=lst;
  retain dnum date;
  if not(lst) then do;
    input record $255. ;
    /* document number: keep it until the matching date turns up */
    if index(record,'<DNUM><PDAT>') > 0 then do;
      start=index(record,'<DNUM><PDAT>')+12;
      numchars=index(record,'</PDAT')-start;
      dnum=substr(record,start,numchars);
    end;
    /* date: write out the dnum/date pair, then reset both */
    else if index(record,'<DATE><PDAT>') > 0 then do;
      start=index(record,'<DATE><PDAT>')+12;
      numchars=index(record,'</PDAT')-start;
      date=substr(record,start,numchars);
      output;
      dnum='';
      date='';
    end;
  end;
run;
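To try the INDEX/SUBSTR logic above without the full file, here is a rough sketch that runs the same extraction against a few lines copied from the sample. The data set name CHECK, the $20 lengths, and the COMPRESS call that drops any stray carriage-return characters ('0D'x) are my assumptions for illustration, not something the sample file required.

```sas
/* Sketch only: same tag-matching logic, fed from in-stream lines.    */
/* CHECK, the $20 lengths and the COMPRESS step are assumptions.      */
data check;
  length record $255 dnum date $20;
  retain dnum;
  infile datalines truncover;
  input record $char255.;

  /* remove any carriage-return characters left inside the record */
  record = compress(record, '0D'x);

  if index(record, '<DNUM><PDAT>') > 0 then do;
    /* 12 = length of '<DNUM><PDAT>'; the value starts right after it */
    start    = index(record, '<DNUM><PDAT>') + 12;
    numchars = index(record, '</PDAT') - start;
    dnum     = substr(record, start, numchars);
  end;
  else if index(record, '<DATE><PDAT>') > 0 then do;
    start    = index(record, '<DATE><PDAT>') + 12;
    numchars = index(record, '</PDAT') - start;
    date     = substr(record, start, numchars);
    output;          /* one observation per dnum/date pair */
    dnum = '';
  end;

  keep dnum date;
datalines;
<B110><DNUM><PDAT>D0484671</PDAT></DNUM></B110>
<B130><PDAT>S1</PDAT></B130>
<B140><DATE><PDAT>20040106</PDAT></DATE></B140>
<B210><DNUM><PDAT>29174009</PDAT></DNUM></B210>
<B220><DATE><PDAT>20030110</PDAT></DATE></B220>
;
run;

proc print data=check noobs;
run;
```

With these five lines the step should write two observations, one for each DNUM line that is followed by a DATE line. If an element is ever split across two records by a carriage return, this line-at-a-time approach would miss it and the records would have to be rejoined first.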

Art
---------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PATDOC SYSTEM "ST32-US-Grant-025xml.dtd" [
<!ENTITY USD0484671-20040106-D00000.TIF SYSTEM "USD0484671-20040106-D00000.TIF" NDATA TIF>
<!ENTITY USD0484671-20040106-D00001.TIF SYSTEM "USD0484671-20040106-D00001.TIF" NDATA TIF>
]>
<PATDOC DTD="2.5" STATUS="Build 20030724">
<SDOBI>
<B100>
<B110><DNUM><PDAT>D0484671</PDAT></DNUM></B110>
<B130><PDAT>S1</PDAT></B130>
<B140><DATE><PDAT>20040106</PDAT></DATE></B140>
<B190><PDAT>US</PDAT></B190>
</B100>
<B200>
<B210><DNUM><PDAT>29174009</PDAT></DNUM></B210>
<B211US><PDAT>29</PDAT></B211US>
<B220><DATE><PDAT>20030110</PDAT></DATE></B220>
</B200>
<B400>
<B472>
<B474><PDAT>14</PDAT></B474>
</B472>
</B400>
<B500>
<B510>
<B511><PDAT>0201</PDAT></B511>
<B516><PDAT>7</PDAT></B516>
</B510>
<B520>
<B521><PDAT>D 2712</PDAT></B521>
</B520>
<B540><STEXT><PDAT>Apparel</PDAT></STEXT></B540>
<B560>
<B561>
<PCIT>
<DOC><DNUM><PDAT>1998140</PDAT></DNUM>
<DATE><PDAT>19350400</PDAT></DATE>
<KIND><PDAT>A</PDAT></KIND>
</DOC>
<PARTY-US>
<NAM><SNM><STEXT><PDAT>Loew</PDAT></STEXT></SNM></NAM>
</PARTY-US>
</PCIT><CITED-BY-OTHER/>
</B561>
<B561>
<PCIT>

