LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (September 2007, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 11 Sep 2007 22:34:57 -0400
Reply-To:     "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject:      Re: Problem importing some pipe-delimited text files

On Mon, 10 Sep 2007 12:19:58 -0400, Jack Wozek <Jack.A.Wozek@BOS.FRB.ORG> wrote:

>Hi Howard, > >Thanks for the follow-up. It's very informative. I'll respond in reverse >to your questions. > >Once the files are read, my thought is to stack them into one database to >be analyzed (simple statistics by groupings). In some inital runs, I've >kept the files separate, run summaries on them, and then stacked the >summary information. The summary information will be output in reports. >To stack all of the files into one, I anticipate that I would only take >the columns I need in a particular order from each file.

How about sheding the unneeded ones earlier, when you are reading the pipe-separated stuff?

> >The columns come from a known and limited set. When a particular column >name occurs in different files, the column attributes are consistent. >Unfortunately, not every column name occurs in every file. All of the >columns I need are in every file. I believe I have enough information to >build a meta data table. How would that help in this case?

You would be able to supply informats instead of having SAS guess. You will find many posts in the SAS-L archives about problems which arise when SAS has to guess.

> >All in all, each delimited file has 100-120 columns of data; but I only >need a small subset of those for analysis and output. > >Thanks again in advance for any help you can offer.

I would first run a step which just reads the first line of each file, to get the lists of included variables. Then I would combine that info with the known metadata (esp. informats) and build a long data step with a bunch of file-specific INPUT statements. Things I might use include: %INCLUDE, FILEVAR=.

> >--Jack > > > > > >"Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM> >Sent by: "SAS(r) Discussion" <SAS-L@listserv.vt.edu> >09/09/2007 11:28 PM >Please respond to >"Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM> > > >To >SAS-L@LISTSERV.VT.EDU >cc > >Subject >Re: Problem importing some pipe-delimited text files > > > > > > >On Fri, 7 Sep 2007 09:48:24 -0400, Jack Wozek <Jack.A.Wozek@BOS.FRB.ORG> >wrote: > >>Hi. >> >>I need to read in a sizable volume of pipe-delimited text files. > >The use of pipe rather than comma or tab or other more commonly used >separator is incidental to the problems. > >>Using >>some code adapted from "Reading a Pipe-Delimited File with in a >>Comma-Delimited File" (Carey Smoak), my plan is to read the file names >>from the directory, put the file names into separate macrovariables, and >>use proc import inside a do loop (copied below). I opted for the simple >>proc import because the files are not standard (different combinations of >>column, but all have a common denominator of columns I need). > >Do the columns come from a known and limited set? When a particular column >name occurs in different files, are the column attributes (type, length, >informat, etc.) consistent? I'm wondering if a master data dictionary can >be >built and used in controlling the process of reading these files. > >>I want this >>to be as automated and maintenance-free as possible since this will be a >>repeated process completed by others. > >Using the phrases "proc import" and "maintenance free" in the same >paragraph >is kind of dubious. > >>However, I am running into an error >>with some files. One of the columns I am reading has a text >description, >>and in some cases it is very long. When a long description appears in >the >>first row of data (or the first 20 I gather), proc import guesses an >>informat/format that is unknown to SAS (see error below). A past >posting >>(Howard Schreier, Dec 26, 2002) suggested block copying the data step >code >>generated in the log by proc import, fixing the error, and then >re-running >>the data step code. I tried this and it works. However, I would like >>something more efficient since I am dealing with thousands of files. > >I've replicated the problem (see below). Basically, PROC IMPORT internally >seems to apply a BEST4. format to the detected width of a field, to >construct an informat. When you have a field longer than 9999, this >generates scientific notation, which does not produce a valid informat. > >> >>Any ideas how to code around this problem? Or, any ideas ideas how to >>uniformly identify when this error occurs to speed up the manual fix >>process? > >I can't help wondering about the bigger picture, since it may influence >the >direction to take in solving this problem. What happens when the thousands >of files are read? Is there a need to integrate their content? Will >normailization reduce or eliminate the discrepancies in structure? > >> >>Thanks. >> >>Jack >> >>/* Import Code >> >>%macro importras; >>%do i = 1 %to &filecount; >> >>PROC IMPORT OUT= data.&&fileb&i >> DATAFILE= "&datalocation2.&&file&i...txt" >> DBMS=DLM REPLACE; >> Delimiter = "|"; >> getnames = yes; >>RUN; >>%end; >>%mend; >> >> >>/* Error >> >>130 informat SDR_DESCRIPTION_TEXT $12E3. ; >> ----- >> 228 22 >> - >> 200 >>ERROR 228-185: Informat 12E3 is unknown. >> >>ERROR 22-322: Syntax error, expecting one of the following: a name, -, ;, >>DEFAULT, _ALL_, >> _CHARACTER_, _CHAR_, _NUMERIC_. >> >>ERROR 200-322: The symbol is not recognized and will be ignored. > >To generate this problem, run > > data _null_; > file 'test' lrecl=10000; > length long $ 10000; > long = repeat('x', 9999); > put 'myvar' / long; > run; > > PROC IMPORT OUT= test > DATAFILE= "test" > DBMS=DLM REPLACE; > Delimiter = ","; > getnames = yes; > RUN; > >The field is 10,000 characters long (remember that the second argument to >REPEAT is the number of instances of the string to be appended to the >given >instance). It generates $1E4. as an informat. > >Change 9999 to 9998 and re-run. The informat is $9999. and there is no >problem.


Back to: Top of message | Previous page | Main SAS-L page