|
Several approaches:
FREQ can give you a quick count of the occurrences but, if there are too
many values (depending on your system memory) it will blow up.
The same results can be accomplished via arrays in a datastep. but
you must set a limit on the maximum number of values and be willing to do
the coding. This may be more efficient depending on your coding skills..
SORT can put the dataset in order on the variable and then you can count
them in a data step. You could also do this with a large array.
SQL can probably do this with an ORDER statement and COUNT but I'm not good
enough in SQL to code this without a lot of effort. You should be aware that
SQL will do a lot of work in the background to accomplish this.
HTH,
Dennis Diskin
638 data a;
639 length i $ 30;
640 i='this is an alpa field';
641 drop a;
642 do a = 1 to 100000;
643 output;
644 end;
645 i='this is an alpa field also';
646 do a = 1 to 100000;
647 output;
648 end;
649 i='this is a test field';
650 do a = 1 to 100000;
651 output;
652 end;
653 i='this another is';
654 do a = 1 to 150000;
655 output;
656 end;
657
658 run;
NOTE: The data set WORK.A has 450000 observations and 1 variables.
NOTE: DATA statement used:
real time 2.74 seconds
cpu time 1.01 seconds
659 proc freq data=a order=freq noprint;
660 table i / out=freq (keep=i count);
661
NOTE: The data set WORK.FREQ has 4 observations and 2 variables.
NOTE: PROCEDURE FREQ used:
real time 2.52 seconds
cpu time 2.27 seconds
662 data _null_;
663 set freq;
664 call symput ('FREQ',i);
665 stop;
666 run;
NOTE: DATA statement used:
real time 0.06 seconds
cpu time 0.02 seconds
667
668 proc sort data=a out=t;
669 by i;
670
NOTE: The data set WORK.T has 450000 observations and 1 variables.
NOTE: PROCEDURE SORT used:
real time 22.03 seconds
cpu time 19.56 seconds
671 data _null_;
672 set t;
673 by i;
674 retain maxx 0;
675 n+1;
676 if last.i;
677 if n > maxx then
678 do;
679 maxx = n;
680 call symput ('SORT',i);
681 end;
682 n = 0;
683
684 run;
NOTE: DATA statement used:
real time 2.43 seconds
cpu time 2.33 seconds
685 %put &freq;
this another is
686 %put &sort;
this another is
687 run;
> -----Original Message-----
> From: Laurel Copeland [SMTP:lacop@UMICH.EDU]
> Sent: Wednesday, October 10, 2001 3:08 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Most common value for a string var
>
> Hello,
> From a dataset with many records per person, how would I (efficiently)
> select the most common value of a character variable?
> For ex., if I have 0.5M recs on 50K people, and each rec contains an ID,
> date, diagnosis indicators, and location indicator, I can get the summary
> of each person's diagnoses with functions:
>
> /* data x contains...
> ID Date AIDS Lung Pancr Loc
> 101 21oct1999 0 1 0 222AA
> 101 14nov1999 0 0 1 222AA
> 101 22dec1999 0 0 1 543E
> 101 01jan2000 0 1 0 222AA
> 102...
> */
>
> PROC SQL; CREATE TABLE x AS SELECT
> id, min(date) as firstdat, max(aids) as aids,
> max(lung) as lungca, max(pancr) as pancr
> FROM y
> GROUP BY id;
>
> *Now select most common location*;
> How now? I am willing to settle for arbitrary (non-random) choice in case
> of tie.
>
> Thanks, Laurel
> Laurel Copeland
> VA
|