Date: Mon, 19 Dec 2005 09:31:13 -0600
Reply-To: Kevin Myers <KMyers@PROCOMINC.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Kevin Myers <KMyers@PROCOMINC.NET>
Subject: Re: Simple URL Filename Access Problem
"David L Cassell" <davidlcassell@msn.com> wrote:
> KMyers@PROCOMINC.NET replied:
> >It turns out that the experimental fix mentioned in
> >http://support.sas.com/techsup/unotes/SN/011/011102.html is exactly
> >what it takes to fix the problem discussed in this thread. Furthermore,
> >this problem appears likely to occur when accessing data via URL on any
> >web server that has been upgraded to Windows 2003. Apparently this
> >issue is corrected in SAS 9.1.2. It seems like a fix with this much
> >potential impact should be upgraded to an officially supported hotfix...
> >
> >Thanks very much to George Fernandez for providing additional information
> >regarding this issue!!!
>
> Bad news, Kevin.
>
> There are other things which can make your attempts with the URL engine
> of the FILENAME statement go astray. This is not the most robust tool SAS
> has built. It may not do all the IE/FireFox/Nyetscape/... tricks of
> adjusting the URL if needed. It will not automatically handle ports for
> you if needed. It probably will not handle a page which has dynamic
> programming. It may not be able to get past a robot-rejector. It may not
> handle a proxy server properly.
>
> You may be a lot better off using a tool like curl in a pipe, so you can
> get the text fed into a data step, or at least read off the errors that
> get spit back. I like using Perl, preferably with something like the
> LWP::Simple module to handle simple stuff, or one of a dozen other
> modules for further trickiness. But you knew I was going to say the word
> 'Perl' in there somewhere...
>
Yes, I have used Perl for this kind of thing before. But using Perl is like
pulling teeth for me. I find the structure and syntax of that language
completely arcane. It is *SO* different from everything else that I have
ever used. I use it very infrequently, and each time I do, it is almost
like learning it from scratch.

After working through yesterday's difficulties I am much farther along on my
HTTP learning curve. Garth Helf's paper was a big help once I finally came
across it. I may end up using curl as you suggested, but am also
considering SAS macros based on the socket access method, similar to those
in Garth's paper.
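
For anyone who has not worked at the socket level, the text that such a macro has to write to the socket, and then pick apart on the way back, looks roughly like the following. This is a minimal sketch in Python rather than SAS; the function names and the host and path used with them are illustrative assumptions and are not taken from Garth's paper.

```python
def build_get_request(host, path="/"):
    """Return the raw text of a minimal HTTP/1.0 GET request."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}"]
    # A blank line (CRLF CRLF) marks the end of the request header.
    return "\r\n".join(lines) + "\r\n\r\n"


def split_response(raw):
    """Split a raw HTTP response into status line, header lines, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    return status_line, header_lines, body
```

The request string is what gets sent over the connection; the header lines returned by split_response are where any Set-Cookie: record would show up.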

It seems to me that the URL access method could be greatly improved by
providing a mechanism to support the use of cookies, possibly by storing
them in macro variables. For example, one might extend the FILENAME
statement along the following lines:

    filename myFile url 'http://myURL' cookieVar=myCookie;

The above statement would use the contents of macro variable myCookie (if
non-blank) to generate a Cookie: record in the HTTP request header. Then
the contents of this same macro variable would be updated based on the value
of any Set-Cookie: record in the response header (or set to blank if no
Set-Cookie: record is returned). The user could of course alter the
contents of the macro variable, if desired, between individual filename
statements, and could also specify the use of a different macro variable for
different filename statements.
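
The round trip described above can be sketched as follows, in Python rather than SAS since the cookieVar option is only a proposal. The function names are hypothetical, and two simplifications are assumptions of this sketch rather than part of the suggestion: only the first Set-Cookie: record is used, and everything after the first semicolon (attributes such as Path or Expires) is dropped.

```python
def request_cookie_header(cookie_var):
    """Return the Cookie: record to send, or None if the variable is blank."""
    value = cookie_var.strip()
    return f"Cookie: {value}" if value else None


def updated_cookie_var(response_headers):
    """Return the new variable value given the response header lines.

    Keeps the name=value part of the first Set-Cookie: record, or returns
    a blank string when the response sets no cookie.
    """
    for line in response_headers:
        name, _, value = line.partition(":")
        if name.strip().lower() == "set-cookie":
            # Attributes such as Path or Expires follow the first ';'.
            return value.split(";", 1)[0].strip()
    return ""
```

A caller would pass the current variable value to request_cookie_header before each request and store the result of updated_cookie_var afterwards, which mirrors the behavior suggested for the macro variable.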

My knowledge of cookies is pretty limited at this time, so there might be
some reason that the above handling would be inadequate. But FWIW, I do
know that something along these lines would work for the scenarios given in
Garth's paper and for the web site that I am presently working with.

Another extremely useful enhancement would be to support the POST method,
probably through additional filename statement options. If specified, such
an option would use the POST method rather than the GET method to request
URL content. The user would also be allowed to specify a macro variable (or
even a file?) containing data for the content portion of the POST request.
For example, the user might specify:

    /* POST content from Garth's paper; %nrstr keeps the & characters */
    /* from being treated as macro variable references */
    %let myPostContent=%nrstr(j_username=helf&j_password=notmypw&Logon=Log+On);
    filename myFile url 'http://myPostURL' method=POST cookieVar=myCookie
        contentVar=myPostContent;
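
To show what such a request amounts to on the wire, here is a rough Python sketch that builds (but does not send) a POST request carrying the same form content. The helper name build_post_request is hypothetical, the URL is the placeholder from the example above, and the form-urlencoded content type is an assumption based on the shape of the content string.

```python
from urllib.request import Request


def build_post_request(url, content, cookie=""):
    """Build a POST request with form content and an optional cookie."""
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if cookie:
        headers["Cookie"] = cookie
    # Supplying a body and method="POST" makes this a POST instead of a GET.
    return Request(url, data=content.encode("ascii"), headers=headers,
                   method="POST")


req = build_post_request(
    "http://myPostURL",
    "j_username=helf&j_password=notmypw&Logon=Log+On",
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without network access.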

With these two enhancements, SAS could handle *ALL* of the web pages from
which I have ever attempted to extract data content. I know there are more
sophisticated techniques that some web pages use in an attempt to defeat
bots, but so far I have never had the need to try to get around such extreme
measures, and I don't believe most other SAS users would need to either. It
seems to me that the above enhancements would far exceed the 80/20 rule
regarding URL data extraction needs for most SAS users, whereas the existing
filename URL capabilities are probably adequate much less than half the
time.

So, what do you think about these suggestions?

Regards,
s/KAM