| Date: | Tue, 24 Jun 2008 23:01:30 -0400 |
| Reply-To: | Arthur Tabachneck <art297@NETSCAPE.NET> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Arthur Tabachneck <art297@NETSCAPE.NET> |
| Subject: | Re: help: reading in open-ended text word-by-word |
|
Matt,
As Jack indicated, what you are seeking is far from trivial. However, if
all of your sentences are like those in your example (each one ends with a
period and the next one starts with a capital letter), something like the
following might come close:
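data want (drop=hold_word);
  length word hold_word $50;
  retain hold_word;  /* previous word as read, period and case intact */
  infile cards;
  input word $ @@;   /* @@ holds the line, so each pass reads the next word */
  /* New sentence: the very first word, or the previous word ended in a */
  /* period and this word starts with A-Z (ranks 65-90 on ASCII hosts). */
  if _n_ eq 1 or
     (substr(reverse(trim(hold_word)),1,1) eq '.' and
      rank(substr(word,1,1)) in (65:90)) then do;
    count=1;
    sentence+1;
  end;
  else count+1;
  hold_word=word;
  /* strip periods and lowercase the word for the output data set */
  word=lowcase(compress(word,'.'));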
cards;
Siddhartha was thus loved by everyone. He was a source of joy for
everybody.
;
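
Since the real input is a text file rather than in-stream data, the same
logic could be pointed at an external file. The following is only an
untested sketch: the fileref BOOK, the path, and the LAST_CHAR helper
variable are hypothetical, and treating '?' and '!' as sentence enders and
stripping commas, semicolons and colons are assumptions beyond the
original example. It still misses cases such as abbreviations ("Mr.
Smith") or a sentence that ends with a closing quote.

```sas
/* Sketch only: the fileref BOOK and the path are placeholders. */
filename book 'C:\temp\book.txt';

data want (drop=hold_word last_char);
  length word hold_word $50 last_char $1;
  retain hold_word;
  /* LRECL=32767 allows long lines; anything longer is still truncated */
  infile book lrecl=32767;
  input word $ @@;
  /* last non-blank character of the previous word */
  last_char = substr(reverse(trim(hold_word)),1,1);
  if _n_ eq 1 or
     (last_char in ('.','?','!') and
      rank(substr(word,1,1)) in (65:90)) then do;
    count = 1;
    sentence + 1;
  end;
  else count + 1;
  hold_word = word;
  /* assumption: also strip common punctuation, not just periods */
  word = lowcase(compress(word,'.,;:?!'));
run;
```

Words longer than 50 characters would be truncated by the LENGTH
statement, so that value may need to be raised for real text.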
HTH,
Art
---------
On Tue, 24 Jun 2008 16:05:02 -0700, Matthew Yurdin
<Matthew.Yurdin@UCOP.EDU> wrote:
>I am looking for a way to read a txt file containing one very long piece
>of open-ended text (i.e., a book-like document) into a
>one-word-per-observation dataset with a within-sentence word count and a
>sentence count. For example, I'd like to turn these two sentences:
>
>Siddhartha was thus loved by everyone. He was a source of joy for
>everybody.
>
>into these rows:
>
>VALUE COUNT SENTENCE
>siddhartha 1 1
>was 2 1
>thus 3 1
>loved 4 1
>by 5 1
>everyone 6 1
>he 1 2
>was 2 2
>a 3 2
>source 4 2
>of 5 2
>joy 6 2
>for 7 2
>everybody 8 2
>
>Any advice on how to go about this? My apologies if this is something
>that's already been asked and answered; I didn't see examples with
>anything this unstructured in the list archive.
>
>Thanks
>-Matt