And even if you create a list of all English words, many of those are also words in other languages.
A hard task you have there. "Rendez-vous a Tours" is all French and also all English (since Rendezvous has pretty much become English). French is probably the hardest case, because there's been so much interchange of words over the centuries. Legal English contains a lot of French words - voir dire, for example.
--
JackHamilton@FirstHealth.com
Manager, Technical Development
Metrics Department, First Health
West Sacramento, California USA
>>> "David Jackson" <david.jackson@EUROPE.PPDI.COM> 01/08/2004 1:34 AM >>>
Thanks to all who have suggested solutions.
To Dave Andrae
   if length(COMMENT) gt length(compress(COMMENT, '<foreign characters>'))
      then <conditions, etc.>;
and Gerhard Hellriegel
data test; /* test data */
   a="esfrjkh kjfd kfdjskjhfd skjhsdjhgfdsxyzXYZ äßÄ"; output;
   a="jhedfhjhsdhjfgjdsjh g dsjhfghjsdgjhfgsdh"; output;
run;

data x;
   set test;
   foreign=0;
   do i=1 to length(a);
      c=substr(a,i,1);
      x=rank(c);                /* position in the collating sequence */
      if x>127 then foreign+1;  /* anything above 7-bit ASCII */
   end;
   put foreign "strange CHARs in string";
run;
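
The same count can probably be had without the loop. A rough sketch only, assuming an ASCII-based, single-byte session encoding (in a UTF-8 session each accented letter would count as more than one byte); the data set name x2 and the variable leftover are just my own illustrations:

```sas
data x2;
   set test;
   /* collate(0,127) returns the 7-bit ASCII characters;      */
   /* compress() removes them, leaving only the other bytes.  */
   leftover = compress(a, collate(0,127));
   foreign  = lengthn(leftover);   /* lengthn gives 0 for an empty string */
   put foreign= a=;
run;
```

This reuses the test data set from above and should flag the same rows as the loop.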
However, Greg has pointed out that not all foreign words contain foreign characters.
Richard Crawley-Boevey has suggested "PROC SPELL to check whether it is English or
not. This would require an exhaustive list of any English words. Using the word list,
you will (in theory) find any words that do not match those of the dictionary."
It looks like I might have to create a data set that contains every English word. This
could take me some time.
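
Once such a word list exists, the lookup itself might go something like the sketch below. It is untested against real data and rests on several assumptions of mine: a data set english_words with one lowercase word per observation in a variable WORD, a data set comments holding the COMMENT column, a SAS release that has the hash object and COUNTW, and a delimiter list that is only a first guess.

```sas
data flagged;
   length word $40;                       /* host variable for the hash key */
   if _n_ = 1 then do;
      /* load the (hypothetical) dictionary into memory once */
      declare hash dict(dataset: "english_words");
      dict.defineKey("word");
      dict.defineDone();
   end;
   set comments;                          /* hypothetical input data set */
   unmatched = 0;
   /* split each comment on blanks and common punctuation */
   do i = 1 to countw(COMMENT, " .,;:!?()");
      word = lowcase(scan(COMMENT, i, " .,;:!?()"));
      /* check() returns 0 when the key is in the dictionary */
      if word ne "" and dict.check() ne 0 then unmatched + 1;
   end;
   drop i word;
run;
```

Rows with unmatched greater than 0 would then be the candidates to review by hand.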
Dave
Greg Woolridge wrote:
> Depending on your definition of foreign language, your method may not work.
> Azul is the Spanish word for blue, but your method would not pick it up.
>
> Best solution I can think of is to get a copy of Webster's dictionary on
> disk and create a SAS data set containing each word as an observation.
> Then parse your comments into individual words and pass against your
> dictionary data set to see where you get no match. Unfortunately, I
> suspect this would take longer than you want to spend on this task. Maybe
> someone will have a better idea.
>
> Greg M. Woolridge
> Manager, Study Programming
> TAP Pharmaceutical Products Inc.
> e-mail: greg.woolridge@tap.com
> phone: 847-582-2332
> fax: 847-582-2403
>
> David Jackson <david.jackson@EUROPE.PPDI.COM>
> Sent by: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
> 01/07/2004 10:53 AM
> Please respond to David Jackson
>
> To: SAS-L@LISTSERV.UGA.EDU
> cc:
> Subject: Foreign Languages
>
> SAS-L
>
> I'm expecting delivery of a data set that will contain a "Comments"
> column.
>
> My task is to search the comments and pick out any text that has been
> written in a "foreign" language (not English).
>
> My (very long) solution involves checking each field to see if it
> contains any one of the following characters using the index() function.
>
> ... ... ... ÀÀÁÂÃÄÅÆÇãåæçèéêëìíîïñóõöûü ... ... ... (a subset of all
> foreign characters)
>
> Any ideas that might improve this?
>
> Thanks
>
> Dave
>
> _______________________________________________________
> This e-mail transmission and any documents, files or previous email
> messages attached to it may contain information that is confidential or
> legally privileged. If you are not the intended recipient or a person
> responsible for delivering this transmission to the intended recipient, you
> are hereby notified that you must not read this transmission and that any
> disclosure, copying, printing, distribution or use of this transmission is
> strictly prohibited. If you have received this transmission in error,
> please immediately notify the sender by telephone or return email and
> delete the original transmission and its attachments without reading or
> saving in any manner.