Date: Mon, 31 Aug 2009 11:30:38 -0500
Reply-To: matt.pettis@THOMSONREUTERS.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Matthew Pettis <matt.pettis@THOMSONREUTERS.COM>
Subject: Re: Reading a PDF File
In-Reply-To: A<966B4B225F74914599E617416969BF1A4384CD8953@DOM-MBX02.mbu.ad.dominionnet.com>
Content-Type: text/plain; charset="us-ascii"
I've had success with pdftotext, a free command-line utility. It does
what Nat says you need to do: convert a PDF to a text file, which you
can then parse. Here is some info on it:
http://en.wikipedia.org/wiki/Pdftotext
Here's where you can download it:
http://www.foolabs.com/xpdf/download.html
For a sane output format of the text, I recommend including the
'-layout' switch on the command line.
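A minimal sketch of scripting that call, assuming pdftotext is installed
and on the PATH. The Python wrapper, the function names, and the file
names are illustrative assumptions; only the pdftotext command and its
'-layout' switch come from the utility itself.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Build the argument list for: pdftotext -layout <pdf> <txt>."""
    # -layout asks pdftotext to keep the physical layout of the page text
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_pdf_to_text(pdf_path, txt_path=None):
    """Run pdftotext and return the path of the text file it wrote."""
    pdf_path = Path(pdf_path)
    # Default output name (an assumption): same as the input, with .txt
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext reports a failure
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling convert_pdf_to_text("report.pdf") (a hypothetical file name)
should leave a report.txt next to the PDF, which can then be read into
SAS like any other text file.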
Hopefully, the Acrobat 'save as text' option works well for you, but if
not, this might be a good backup plan. Either way, the output file may
need some manual massaging.
HTH,
Matt
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Nathaniel Wooding
Sent: Monday, August 31, 2009 10:57 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Reading a PDF File
Roger
You will need to convert the file to a TXT file and then parse the data.
Hopefully, it will have a simple standard layout.
Acrobat Reader 7 has a save as text feature. I do not know whether this
is available on Linux but you should be able to do the translation on a
Windows box and then move it to LINUX.
I have a couple of versions of a paper posted on the web, but these deal
with reading a lot of PDFs where a simple open and save-as was not
practical.
The big issue for you will be dealing with the file layout. Depending on
how long the file is, some manual editing may simplify the parsing
process.
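As a rough illustration of the parsing step, here is a sketch that
splits the converted text into columns. It assumes a simple tabular
layout in which columns are separated by two or more spaces; real files
may well need different rules, and the function names are hypothetical.

```python
import re


def split_layout_line(line):
    """Split one line of converted text into column values."""
    # Assumption: two or more spaces mark a column boundary, so a single
    # space inside a value (e.g. "Red Hat") does not split it.
    return re.split(r" {2,}", line.strip())


def parse_layout_text(text):
    """Return a list of rows, skipping blank lines."""
    rows = []
    # Treat form feeds (often used as page breaks) as line breaks
    for line in text.replace("\f", "\n").splitlines():
        if line.strip():
            rows.append(split_layout_line(line))
    return rows
```

Rows that come out with the wrong number of columns are a quick way to
spot the lines that need manual editing.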
Nat Wooding
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
NOMAIL Roger S. Clark
Sent: Monday, August 31, 2009 11:49 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Reading a PDF File
Hi, SAS-L Group;
I am programming Independent Verification and Validation of a product
that my division will deliver to an internal customer.
I have just learned (about 45 minutes ago) that one of the files I will
need to read into SAS is a file with a .pdf extension.
I've found considerable information in the online documentation for
creating pdf output, but nothing regarding using a pdf file as input.
Is it possible? If so, could someone advise how it is done?
This program is in the planning stage, so I have no code developed to
include in the E-mail.
My program will be running SAS 9.1.3 SP4 in a Red Hat LINUX system.
Thanx,
Roger S. Clark
Address Products Management Branch
763-9177 4H584U