Date: Tue, 11 Sep 2007 22:34:57 -0400
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: Problem importing some pipe-delimited text files
On Mon, 10 Sep 2007 12:19:58 -0400, Jack Wozek <Jack.A.Wozek@BOS.FRB.ORG> wrote:
>Hi Howard,
>
>Thanks for the follow-up. It's very informative. I'll respond in reverse
>to your questions.
>
>Once the files are read, my thought is to stack them into one database to
>be analyzed (simple statistics by groupings). In some inital runs, I've
>kept the files separate, run summaries on them, and then stacked the
>summary information. The summary information will be output in reports.
>To stack all of the files into one, I anticipate that I would only take
>the columns I need in a particular order from each file.
How about sheding the unneeded ones earlier, when you are reading the
pipe-separated stuff?
>
>The columns come from a known and limited set. When a particular column
>name occurs in different files, the column attributes are consistent.
>Unfortunately, not every column name occurs in every file. All of the
>columns I need are in every file. I believe I have enough information to
>build a meta data table. How would that help in this case?
You would be able to supply informats instead of having SAS guess. You will
find many posts in the SAS-L archives about problems which arise when SAS
has to guess.
>
>All in all, each delimited file has 100-120 columns of data; but I only
>need a small subset of those for analysis and output.
>
>Thanks again in advance for any help you can offer.
I would first run a step which just reads the first line of each file, to
get the lists of included variables. Then I would combine that info with the
known metadata (esp. informats) and build a long data step with a bunch of
file-specific INPUT statements. Things I might use include: %INCLUDE, FILEVAR=.
>
>--Jack
>
>
>
>
>
>"Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
>Sent by: "SAS(r) Discussion" <SAS-L@listserv.vt.edu>
>09/09/2007 11:28 PM
>Please respond to
>"Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
>
>
>To
>SAS-L@LISTSERV.VT.EDU
>cc
>
>Subject
>Re: Problem importing some pipe-delimited text files
>
>
>
>
>
>
>On Fri, 7 Sep 2007 09:48:24 -0400, Jack Wozek <Jack.A.Wozek@BOS.FRB.ORG>
>wrote:
>
>>Hi.
>>
>>I need to read in a sizable volume of pipe-delimited text files.
>
>The use of pipe rather than comma or tab or other more commonly used
>separator is incidental to the problems.
>
>>Using
>>some code adapted from "Reading a Pipe-Delimited File with in a
>>Comma-Delimited File" (Carey Smoak), my plan is to read the file names
>>from the directory, put the file names into separate macrovariables, and
>>use proc import inside a do loop (copied below). I opted for the simple
>>proc import because the files are not standard (different combinations of
>>column, but all have a common denominator of columns I need).
>
>Do the columns come from a known and limited set? When a particular column
>name occurs in different files, are the column attributes (type, length,
>informat, etc.) consistent? I'm wondering if a master data dictionary can
>be
>built and used in controlling the process of reading these files.
>
>>I want this
>>to be as automated and maintenance-free as possible since this will be a
>>repeated process completed by others.
>
>Using the phrases "proc import" and "maintenance free" in the same
>paragraph
>is kind of dubious.
>
>>However, I am running into an error
>>with some files. One of the columns I am reading has a text
>description,
>>and in some cases it is very long. When a long description appears in
>the
>>first row of data (or the first 20 I gather), proc import guesses an
>>informat/format that is unknown to SAS (see error below). A past
>posting
>>(Howard Schreier, Dec 26, 2002) suggested block copying the data step
>code
>>generated in the log by proc import, fixing the error, and then
>re-running
>>the data step code. I tried this and it works. However, I would like
>>something more efficient since I am dealing with thousands of files.
>
>I've replicated the problem (see below). Basically, PROC IMPORT internally
>seems to apply a BEST4. format to the detected width of a field, to
>construct an informat. When you have a field longer than 9999, this
>generates scientific notation, which does not produce a valid informat.
>
>>
>>Any ideas how to code around this problem? Or, any ideas ideas how to
>>uniformly identify when this error occurs to speed up the manual fix
>>process?
>
>I can't help wondering about the bigger picture, since it may influence
>the
>direction to take in solving this problem. What happens when the thousands
>of files are read? Is there a need to integrate their content? Will
>normailization reduce or eliminate the discrepancies in structure?
>
>>
>>Thanks.
>>
>>Jack
>>
>>/* Import Code
>>
>>%macro importras;
>>%do i = 1 %to &filecount;
>>
>>PROC IMPORT OUT= data.&&fileb&i
>> DATAFILE= "&datalocation2.&&file&i...txt"
>> DBMS=DLM REPLACE;
>> Delimiter = "|";
>> getnames = yes;
>>RUN;
>>%end;
>>%mend;
>>
>>
>>/* Error
>>
>>130 informat SDR_DESCRIPTION_TEXT $12E3. ;
>> -----
>> 228 22
>> -
>> 200
>>ERROR 228-185: Informat 12E3 is unknown.
>>
>>ERROR 22-322: Syntax error, expecting one of the following: a name, -, ;,
>>DEFAULT, _ALL_,
>> _CHARACTER_, _CHAR_, _NUMERIC_.
>>
>>ERROR 200-322: The symbol is not recognized and will be ignored.
>
>To generate this problem, run
>
> data _null_;
> file 'test' lrecl=10000;
> length long $ 10000;
> long = repeat('x', 9999);
> put 'myvar' / long;
> run;
>
> PROC IMPORT OUT= test
> DATAFILE= "test"
> DBMS=DLM REPLACE;
> Delimiter = ",";
> getnames = yes;
> RUN;
>
>The field is 10,000 characters long (remember that the second argument to
>REPEAT is the number of instances of the string to be appended to the
>given
>instance). It generates $1E4. as an informat.
>
>Change 9999 to 9998 and re-run. The informat is $9999. and there is no
>problem.
|