LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2007, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Sun, 9 Sep 2007 23:17:56 -0400
Reply-To:     "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject:      Re: Problem importing some pipe-delimited text files

On Fri, 7 Sep 2007 09:48:24 -0400, Jack Wozek <Jack.A.Wozek@BOS.FRB.ORG> wrote:

>Hi. > >I need to read in a sizable volume of pipe-delimited text files.

The use of pipe rather than comma or tab or other more commonly used separator is incidental to the problems.

>Using >some code adapted from "Reading a Pipe-Delimited File with in a >Comma-Delimited File" (Carey Smoak), my plan is to read the file names >from the directory, put the file names into separate macrovariables, and >use proc import inside a do loop (copied below). I opted for the simple >proc import because the files are not standard (different combinations of >column, but all have a common denominator of columns I need).

Do the columns come from a known and limited set? When a particular column name occurs in different files, are the column attributes (type, length, informat, etc.) consistent? I'm wondering if a master data dictionary can be built and used in controlling the process of reading these files.

>I want this >to be as automated and maintenance-free as possible since this will be a >repeated process completed by others.

Using the phrases "proc import" and "maintenance free" in the same paragraph is kind of dubious.

>However, I am running into an error >with some files. One of the columns I am reading has a text description, >and in some cases it is very long. When a long description appears in the >first row of data (or the first 20 I gather), proc import guesses an >informat/format that is unknown to SAS (see error below). A past posting >(Howard Schreier, Dec 26, 2002) suggested block copying the data step code >generated in the log by proc import, fixing the error, and then re-running >the data step code. I tried this and it works. However, I would like >something more efficient since I am dealing with thousands of files.

I've replicated the problem (see below). Basically, PROC IMPORT internally seems to apply a BEST4. format to the detected width of a field, to construct an informat. When you have a field longer than 9999, this generates scientific notation, which does not produce a valid informat.

> >Any ideas how to code around this problem? Or, any ideas ideas how to >uniformly identify when this error occurs to speed up the manual fix >process?

I can't help wondering about the bigger picture, since it may influence the direction to take in solving this problem. What happens when the thousands of files are read? Is there a need to integrate their content? Will normailization reduce or eliminate the discrepancies in structure?

> >Thanks. > >Jack > >/* Import Code > >%macro importras; >%do i = 1 %to &filecount; > >PROC IMPORT OUT= data.&&fileb&i > DATAFILE= "&datalocation2.&&file&i...txt" > DBMS=DLM REPLACE; > Delimiter = "|"; > getnames = yes; >RUN; >%end; >%mend; > > >/* Error > >130 informat SDR_DESCRIPTION_TEXT $12E3. ; > ----- > 228 22 > - > 200 >ERROR 228-185: Informat 12E3 is unknown. > >ERROR 22-322: Syntax error, expecting one of the following: a name, -, ;, >DEFAULT, _ALL_, > _CHARACTER_, _CHAR_, _NUMERIC_. > >ERROR 200-322: The symbol is not recognized and will be ignored.

To generate this problem, run

data _null_; file 'test' lrecl=10000; length long $ 10000; long = repeat('x', 9999); put 'myvar' / long; run;

PROC IMPORT OUT= test DATAFILE= "test" DBMS=DLM REPLACE; Delimiter = ","; getnames = yes; RUN;

The field is 10,000 characters long (remember that the second argument to REPEAT is the number of instances of the string to be appended to the given instance). It generates $1E4. as an informat.

Change 9999 to 9998 and re-run. The informat is $9999. and there is no problem.


Back to: Top of message | Previous page | Main SAS-L page