LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (October 2001, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 11 Oct 2001 08:23:42 -0400
Reply-To:   "Diskin, Dennis" <Dennis.Diskin@PHARMA.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Diskin, Dennis" <Dennis.Diskin@PHARMA.COM>
Subject:   Re: Most common value for a string var
Comments:   To: Laurel Copeland <lacop@UMICH.EDU>
Content-Type:   text/plain

Several approaches:

FREQ can give you a quick count of the occurrences but, if there are too many values (depending on your system memory) it will blow up. The same results can be accomplished via arrays in a datastep. but you must set a limit on the maximum number of values and be willing to do the coding. This may be more efficient depending on your coding skills.. SORT can put the dataset in order on the variable and then you can count them in a data step. You could also do this with a large array. SQL can probably do this with an ORDER statement and COUNT but I'm not good enough in SQL to code this without a lot of effort. You should be aware that SQL will do a lot of work in the background to accomplish this.

HTH, Dennis Diskin

638 data a; 639 length i $ 30; 640 i='this is an alpa field'; 641 drop a; 642 do a = 1 to 100000; 643 output; 644 end; 645 i='this is an alpa field also'; 646 do a = 1 to 100000; 647 output; 648 end; 649 i='this is a test field'; 650 do a = 1 to 100000; 651 output; 652 end; 653 i='this another is'; 654 do a = 1 to 150000; 655 output; 656 end; 657 658 run;

NOTE: The data set WORK.A has 450000 observations and 1 variables. NOTE: DATA statement used: real time 2.74 seconds cpu time 1.01 seconds 659 proc freq data=a order=freq noprint; 660 table i / out=freq (keep=i count); 661

NOTE: The data set WORK.FREQ has 4 observations and 2 variables. NOTE: PROCEDURE FREQ used: real time 2.52 seconds cpu time 2.27 seconds

662 data _null_; 663 set freq; 664 call symput ('FREQ',i); 665 stop; 666 run;

NOTE: DATA statement used: real time 0.06 seconds cpu time 0.02 seconds

667 668 proc sort data=a out=t; 669 by i; 670

NOTE: The data set WORK.T has 450000 observations and 1 variables. NOTE: PROCEDURE SORT used: real time 22.03 seconds cpu time 19.56 seconds

671 data _null_; 672 set t; 673 by i; 674 retain maxx 0; 675 n+1; 676 if last.i; 677 if n > maxx then 678 do; 679 maxx = n; 680 call symput ('SORT',i); 681 end; 682 n = 0; 683 684 run;

NOTE: DATA statement used: real time 2.43 seconds cpu time 2.33 seconds

685 %put &freq; this another is 686 %put &sort; this another is 687 run;

> -----Original Message----- > From: Laurel Copeland [SMTP:lacop@UMICH.EDU] > Sent: Wednesday, October 10, 2001 3:08 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Most common value for a string var > > Hello, > From a dataset with many records per person, how would I (efficiently) > select the most common value of a character variable? > For ex., if I have 0.5M recs on 50K people, and each rec contains an ID, > date, diagnosis indicators, and location indicator, I can get the summary > of each person's diagnoses with functions: > > /* data x contains... > ID Date AIDS Lung Pancr Loc > 101 21oct1999 0 1 0 222AA > 101 14nov1999 0 0 1 222AA > 101 22dec1999 0 0 1 543E > 101 01jan2000 0 1 0 222AA > 102... > */ > > PROC SQL; CREATE TABLE x AS SELECT > id, min(date) as firstdat, max(aids) as aids, > max(lung) as lungca, max(pancr) as pancr > FROM y > GROUP BY id; > > *Now select most common location*; > How now? I am willing to settle for arbitrary (non-random) choice in case > of tie. > > Thanks, Laurel > Laurel Copeland > VA


Back to: Top of message | Previous page | Main SAS-L page