Date: Wed, 6 Oct 2010 13:25:28 -0400
Reply-To: "Gerstle, John (CDC/OID/NCHHSTP)" <yzg9@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Gerstle, John (CDC/OID/NCHHSTP)" <yzg9@CDC.GOV>
Subject: Re: Resubmit: Reading XML files via XML92 getting 0 observation datasets
In-Reply-To: <AANLkTin0iF2fc_=BeV9ovjyvdpa1=1_UPmyHqL92yH4N@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"
Joe,
That's a good idea, and I did try something similar, but it did not work.
The structure of the file is not conducive to splitting, though. There's a
small node at the top (Header Info, the file's metadata), and the
second major node is split into 2 smaller nodes, both with a lot of data
within. I did split out a file with only the Header node, but that
doesn't speak to the rest of the file. Neither the map nor a
shortened version of the map (Header only) worked on it (the 1
observation is still not read).
I do have a Tech support ticket open.
thanks
John Gerstle
Scientific Information Specialist
Centers for Disease Control and Prevention
NCHHSTP\DHAP-SE\QSDMB\Data Management Team
Phone: 404-639-3980
Fax: 404-639-8642
Email: yzg9 at cdc dot gov
Socrates, proclaimed: "I came to know one thing; that I know nothing".
"Every question I answer will simply lead to another question."
From: Joe Matise [mailto:snoopy369@gmail.com]
Sent: Wednesday, October 06, 2010 1:12 PM
To: Gerstle, John (CDC/OID/NCHHSTP)
Cc: SAS-L@listserv.uga.edu
Subject: Re: Resubmit: Reading XML files via XML92 getting 0 observation datasets
Not sure what the structure is, but is it splittable into multiple
files? If so, can you do that and see whether some specific high-level
node(s) fail, or possibly even whether it's just the size?
I.e., if you have 4000 nodes at the second-highest level (or thereabouts)
with 70 lines each, can you split that into 1000-node files, or even
100-node files, and try reading each in? If some read in and some don't,
then you might be able to pinpoint the issue, if it's data related.
-Joe
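
The splitting test suggested above can be tried outside SAS. What follows is only a minimal sketch in Python (standard library only), not something from the thread: the helper name split_children and the element names in the sample are made up, and it assumes the repeating nodes all sit directly under one parent element.

```python
import copy
import xml.etree.ElementTree as ET


def split_children(parent, chunk_size):
    """Return a list of new elements, each holding at most chunk_size
    children of `parent`, with document order preserved."""
    children = list(parent)
    chunks = []
    for start in range(0, len(children), chunk_size):
        # Re-create the parent tag and attributes so that every chunk
        # is a well-formed document with the same top-level element.
        part = ET.Element(parent.tag, dict(parent.attrib))
        for child in children[start:start + chunk_size]:
            part.append(copy.deepcopy(child))
        chunks.append(part)
    return chunks


# Tiny stand-in document; the element names are hypothetical.
sample = ET.fromstring(
    "<clients>"
    + "".join(f"<client id='{i}'/>" for i in range(5))
    + "</clients>"
)
parts = split_children(sample, chunk_size=2)
sizes = [len(p) for p in parts]
print(sizes)
```

Each chunk can then be written out with ET.ElementTree(part).write(path) and read through the same map. If the real file uses namespaces, ElementTree may rewrite the prefixes on output, so the chunks are worth checking against the map paths before drawing conclusions.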
On Wed, Oct 6, 2010 at 9:14 AM, Gerstle, John (CDC/OID/NCHHSTP) <yzg9@cdc.gov> wrote:
Alan,
I have XMLSpy (and DiffDog) and have tried looking for XML code issues
but haven't found anything definitive. The problem file is over 280k
lines, so it is not easy to eyeball. I compared it with a smaller XML file
that SAS has no issue reading and haven't found anything apart from what
looks like some child-child-child nodes not aligned, but that could be
data driven (some clients have the data and some do not).
SAX vs DOM - could you define these terms?
Thanks
John Gerstle
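
On the SAX vs DOM question: in general terms, a SAX parser streams through the file once and reports each element as an event, keeping nothing behind it, while a DOM parser loads the whole document into a tree that can be indexed at will. A reader that works sequentially cannot jump to "observation n" without reading everything before it. The thread does not say how the XML92 engine works internally, so the following is only a generic illustration in Python with a made-up document.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical three-record document.
DOC = "<clients><client>a</client><client>b</client><client>c</client></clients>"


class ClientCounter(xml.sax.ContentHandler):
    """SAX style: the parser pushes events at the handler, front to back."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Each record is seen once, in file order; there is no way to ask
        # for "the second client" without streaming past the first.
        if name == "client":
            self.count += 1


handler = ClientCounter()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_count = handler.count

# DOM style: the whole document is held in memory as a tree, so any
# record can be reached directly by position.
dom = xml.dom.minidom.parseString(DOC)
clients = dom.getElementsByTagName("client")
second = clients[1].firstChild.data

print(sax_count, second)
```

The trade-off is memory: the streaming approach copes with very large files because it never holds more than the current element, whereas the tree approach needs room for the entire document.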
>>-----Original Message-----
>>From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Alan Churchill
>>Sent: Tuesday, October 05, 2010 6:13 PM
>>To: SAS-L@LISTSERV.UGA.EDU
>>Subject: RE: Resubmit: Reading XML files via XML92 getting 0 observation datasets
>>
>>John,
>>
>>Look at SAX vs DOM on why access is limited. It depends on the engine
>>chosen.
>>
>>It is hard to guess what is happening w/o seeing the XML in question.
>>Have you opened up the files in something like XMLSpy to look for
>>differences?
>>
>>Alan
>>
>>Alan Churchill
>>Savian
>>Work: 719-687-5954
>>Cell: 719-310-4870
>>
>>-----Original Message-----
>>From: Gerstle, John (CDC/OID/NCHHSTP) [mailto:yzg9@CDC.GOV]
>>Sent: Tuesday, October 05, 2010 9:26 AM
>>Subject: Resubmit: Reading XML files via XML92 getting 0 observation
>>datasets
>>
>>SAS v9.22, WinXP, XML Mapper
>>
>>I've manually created a map file from a complex schema and am using the
>>XML92 engine to read in the XML data files. I have successfully tested
>>this method on 3 XML files, 1 of which is close to 450MB in size.
>>Recently, I received a new sample file (only 14MB) and now it's failing
>>(well, it's failing in the sense that no data observations are being
>>read by SAS). Interestingly, within XML Mapper, I can use the Table View
>>tab to see the data, correctly mapped. But Base SAS is unable to
>>replicate this. Even SAS Explorer is unable to open any 'tables' to view.
>>
>>Code:
>>
>>libname incoming xml92 "&xml_file"
>> xmlmap="&xml_map"
>> xmlschema="&xml_schema"
>> xmltype=xmlmap
>> xmlmeta=schemadata;
>>proc print data=incoming.x_headerinfo; run;
>>
>>...where the x_headerinfo is the first node of data in the file.
>>
>>Log:
>>NOTE: Processing XMLMap version 1.9.
>>NOTE: Libref INCOMING was successfully assigned as follows:
>> Engine: XML92
>> Physical Name: W:\Data_Management\test.xml
>>2111 proc print data=incoming.x_headerinfo; run;
>>
>>NOTE: Access by observation number not available. Observation numbers will
>>be counted by PROC PRINT.
>>NOTE: No observations in data set INCOMING.x_headerinfo.
>>NOTE: There were 0 observations read from the data set
>>INCOMING.x_headerinfo.
>>
>>
>>I've added an End Path for the table, which is the same as the Path,
>>set as End, and added an automatic enumerator to the table. No luck on
>>the Base SAS side, but I see correct mapping in the Table View of XML
>>Mapper.
>>
>>I've been researching this problem for the past 2 weeks and have read
>>several really good papers on the subject (Larry Hoyle's recent papers
>>and Lex Jansen's workshop at SGF2010), but haven't found a reference to
>>this specific problem.
>>
>>I feel that I've missed something in my map, though the map does work
>>for the other data files, so it's possible that the data file in
>>question is problematic.
>>
>>3 Questions:
>>1) What are the reasons why Base SAS is unable to achieve access by
>>observation number in an XML file? (something to do with Sequential
>>Reading of the file instead of Random reading?)
>>2) Any references to suggest?
>>3) Any suggestions for the above problem?
>>
>>I'm considering having the sender re-create their XML file. The only
>>thing I can find in their file that might be problematic is that the
>>order of nodes is not the same as one of the other test files that does
>>work.
>>
>>
>>John Gerstle
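
The node-order observation in the original post (the failing file orders its nodes differently from a file that works) is hard to confirm by eye in a 280k-line file. One way to check it, sketched here in Python under the assumption that a plain listing of element paths in first-seen order is enough, is to produce that listing for each file and compare the two. The function name path_order and the sample element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def path_order(source):
    """Return the distinct element paths of an XML document in the
    order in which each path is first encountered."""
    seen = []
    known = set()
    stack = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            path = "/" + "/".join(stack)
            if path not in known:
                known.add(path)
                seen.append(path)
        else:
            stack.pop()
            elem.clear()  # drop finished content to limit memory use
    return seen


# Two made-up documents with the same nodes in a different order.
good = b"<f><h/><d><a/><b/></d></f>"
other = b"<f><d><b/><a/></d><h/></f>"
order_good = path_order(io.BytesIO(good))
order_other = path_order(io.BytesIO(other))

same_nodes = set(order_good) == set(order_other)
same_order = order_good == order_other
print(order_good)
print(order_other)
```

If the two listings contain the same paths but in a different sequence, that supports the ordering theory without proving it; paths present in only one file would point to a content difference instead.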