LISTSERV at the University of Georgia
Date:   Thu, 8 Jan 2004 10:42:36 -0700
Reply-To:   Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Subject:   Re: Foreign Languages
Comments:   To: david.jackson@EUROPE.PPDI.COM
Content-Type:   text/plain; charset=iso-8859-1

And even if you create a list of all English words, many of those are non-English words also.

A hard task you have there. "Rendez-vous a Tours" is all French and also all English (since Rendezvous has pretty much become English). French is probably the hardest case, because there's been so much interchange of words over the centuries. Legal English contains a lot of French words - voir dire, for example.

--
JackHamilton@FirstHealth.com
Manager, Technical Development
Metrics Department, First Health
West Sacramento, California USA

>>> "David Jackson" <david.jackson@EUROPE.PPDI.COM> 01/08/2004 1:34 AM >>>

Thanks to all who have suggested solutions.

To Dave Andrae

if length(compress(COMMENT)) gt length(compress(COMMENT,'<foreign characters>')) then <conditions, etc.>;

and Gerhard Hellriegel

data test; /* test data */
  a="esfrjkh kjfd kfdjskjhfd skjhsdjhgfdsxyzXYZ äßÄ"; output;
  a="jhedfhjhsdhjfgjdsjh g dsjhfghjsdgjhfgsdh"; output;
run;

data x;
  set test;
  foreign=0;
  do i=1 to length(a);
    c=substr(a,i,1);
    x=rank(c);
    if x>122 then foreign+1;
  end;
  put foreign "strange CHARs in string";
run;
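
One possible variation on the loop above, offered only as a sketch. rank() returns a position in the session's collating sequence, so the result depends on the encoding, and in ASCII a cutoff of 122 also counts {, |, } and ~ as foreign. Listing the characters you consider plain and letting verify() look for anything else sidesteps both points. The plain list and the data set name flagged are assumptions for illustration; test and a come from the test data above.

```sas
data flagged;
  set test;
  /* Assumption: the characters treated as plain; extend the list as needed */
  retain plain "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?'()-";
  /* verify() returns the position of the first character of a that is */
  /* not in plain, or 0 when every character is in the list            */
  has_foreign = (verify(trim(a), plain) > 0);
  drop plain;
run;
```

Like the loop, this only catches comments that contain an unusual character; it says nothing about foreign words written with plain letters.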

However, Greg has pointed out that not all foreign words contain foreign characters.

Richard Crawley-Boevey has suggested "PROC SPELL to check whether it is English or not. This would require an exhaustive list of any English words. Using the word list, you will (in theory) find any words that do not match those of the dictionary."

It looks like I might have to create a data set that contains every English word. This could take me some time.

Dave
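
The dictionary lookup discussed here might be sketched roughly as follows. Everything about the inputs is an assumption: a data set comments with a key variable id and the text in COMMENT, and a data set dict holding one lowercase English word per observation in a variable word. None of these names come from the actual data.

```sas
/* Split each comment into one observation per word.           */
/* Hypothetical inputs: comments(id, comment) and dict(word).  */
data words;
  set comments;
  length word $40;
  i = 1;
  word = lowcase(scan(comment, i, ' .,;:!?()'));
  /* scan() returns a blank once i runs past the last word */
  do while (word ne ' ');
    output;
    i + 1;
    word = lowcase(scan(comment, i, ' .,;:!?()'));
  end;
  keep id word;
run;

/* Keep the words that have no match in the dictionary */
proc sql;
  create table unmatched as
  select id, word
  from words
  where word not in (select word from dict);
quit;
```

The rows in unmatched are only candidates for review. Names, abbreviations, and misspellings will show up there too, and a foreign word that also appears in the English list will not.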

Greg Woolridge wrote:

> Depending on your definition of foreign language, your method may not work.
> Azul is the Spanish word for blue, but your method would not pick it up.
>
> Best solution I can think of is to get a copy of Webster's dictionary on
> disk and create a SAS data set containing each word as an observation.
> Then parse your comments into individual words and pass against your
> dictionary data set to see where you get no match. Unfortunately, I
> suspect this would take longer than you want to spend on this task. Maybe
> someone will have a better idea.
>
> Greg M. Woolridge
> Manager, Study Programming
> TAP Pharmaceutical Products Inc.
> e-mail: greg.woolridge@tap.com
> phone: 847-582-2332
> fax: 847-582-2403
>
> David Jackson <david.jackson@EUROPE.PPDI.COM>
> Sent by: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
> 01/07/2004 10:53 AM
> Please respond to David Jackson
> To: SAS-L@LISTSERV.UGA.EDU
> cc:
> Subject: Foreign Languages
>
> SAS-L
>
> I'm expecting delivery of a data set that will contain a "Comments"
> column.
>
> My task is to search the comments and pick out any text that has been
> written in a "foreign" language (not English).
>
> My (very long) solution involves checking each field to see if it
> contains any one of the following characters using the index() function.
>
> ... ... ... ÀÀÁÂÃÄÅÆÇãåæçèéêëìíîïñóõöûü ... ... ... (a subset of all
> foreign characters)
>
> Any ideas that might improve this?
>
> Thanks
>
> Dave
>
> _______________________________________________________
> This e-mail transmission and any documents, files or previous email
> messages attached to it may contain information that is confidential or
> legally privileged. If you are not the intended recipient or a person
> responsible for delivering this transmission to the intended recipient, you
> are hereby notified that you must not read this transmission and that any
> disclosure, copying, printing, distribution or use of this transmission is
> strictly prohibited. If you have received this transmission in error,
> please immediately notify the sender by telephone or return email and
> delete the original transmission and its attachments without reading or
> saving in any manner.

