Date: Sun, 9 Sep 2007 23:17:56 -0400
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: Problem importing some pipe-delimited text files
On Fri, 7 Sep 2007 09:48:24 -0400, Jack Wozek <Jack.A.Wozek@BOS.FRB.ORG> wrote:
>I need to read in a sizable volume of pipe-delimited text files.
The use of pipe rather than comma or tab or other more commonly used
separator is incidental to the problems.
>some code adapted from "Reading a Pipe-Delimited File with in a
>Comma-Delimited File" (Carey Smoak), my plan is to read the file names
>from the directory, put the file names into separate macrovariables, and
>use proc import inside a do loop (copied below). I opted for the simple
>proc import because the files are not standard (different combinations of
>column, but all have a common denominator of columns I need).
Do the columns come from a known and limited set? When a particular column
name occurs in different files, are the column attributes (type, length,
informat, etc.) consistent? I'm wondering if a master data dictionary can be
built and used in controlling the process of reading these files.
>I want this
>to be as automated and maintenance-free as possible since this will be a
>repeated process completed by others.
Using the phrases "proc import" and "maintenance free" in the same paragraph
is kind of dubious.
>However, I am running into an error
>with some files. One of the columns I am reading has a text description,
>and in some cases it is very long. When a long description appears in the
>first row of data (or the first 20 I gather), proc import guesses an
>informat/format that is unknown to SAS (see error below). A past posting
>(Howard Schreier, Dec 26, 2002) suggested block copying the data step code
>generated in the log by proc import, fixing the error, and then re-running
>the data step code. I tried this and it works. However, I would like
>something more efficient since I am dealing with thousands of files.
I've replicated the problem (see below). Basically, PROC IMPORT internally
seems to apply a BEST4. format to the detected width of a field, to
construct an informat. When you have a field longer than 9999, this
generates scientific notation, which does not produce a valid informat.
>Any ideas how to code around this problem? Or, any ideas ideas how to
>uniformly identify when this error occurs to speed up the manual fix
I can't help wondering about the bigger picture, since it may influence the
direction to take in solving this problem. What happens when the thousands
of files are read? Is there a need to integrate their content? Will
normailization reduce or eliminate the discrepancies in structure?
>/* Import Code
>%do i = 1 %to &filecount;
>PROC IMPORT OUT= data.&&fileb&i
> DATAFILE= "&datalocation2.&&file&i...txt"
> DBMS=DLM REPLACE;
> Delimiter = "|";
> getnames = yes;
>130 informat SDR_DESCRIPTION_TEXT $12E3. ;
> 228 22
>ERROR 228-185: Informat 12E3 is unknown.
>ERROR 22-322: Syntax error, expecting one of the following: a name, -, ;,
> _CHARACTER_, _CHAR_, _NUMERIC_.
>ERROR 200-322: The symbol is not recognized and will be ignored.
To generate this problem, run
file 'test' lrecl=10000;
length long $ 10000;
long = repeat('x', 9999);
put 'myvar' / long;
PROC IMPORT OUT= test
Delimiter = ",";
getnames = yes;
The field is 10,000 characters long (remember that the second argument to
REPEAT is the number of instances of the string to be appended to the given
instance). It generates $1E4. as an informat.
Change 9999 to 9998 and re-run. The informat is $9999. and there is no problem.