LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2002, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 31 Jul 2002 06:29:24 -0700
Reply-To:   paula D <sophe@USA.NET>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   paula D <sophe@USA.NET>
Organization:   http://groups.google.com/
Subject:   Re: Consolidating categorical data
Content-Type:   text/plain; charset=ISO-8859-1

I am really thrilled by all these quick and quality responses.

1. To present the core of the question more effectively, I left out many background info. Here are some details.

Every token actually looks like Visa42665020...,Mast5424....., and every one is 32 character long and even. Visa42665020 probably means BP Visa, master5424 probably means Citi Sony Master card and so on. Amex cards are no different. You may say, Amex card numbers are different from visa and master, but they are really not card numbers. They are kind of portfolio numbers. The total number of tokens in each value/obs line is 36, long and even.

To anwer Dale's Q, I don't need to leave a combination in the data set. I guess Dale's response leans toward solving the out-of-control magnitudes (1,10,100, ,,,,1000000000). I like the idea is to use only 36 of them, output the qualified data, sit them aside and re-use the 36 on the remaining data. Finally, combine the data sets together. My initial intention is to seek a better algorithm. Now, I agree efficiency should be factored in at this design stage, since I have about 102 million records involved.

The whole project is more like a RFM in database marketing area. Actually, I need to summarize a financial column in the data set by this "problem" variable. The data answered my first Q easily: in that Q, the order in each of value of Slogan matters, since I wanted to know which token has the strongest "ability" to pull the next two tokens to max the financials. What pins down the order are 36 time points. The next Q, which is this outstanding one, is to see which combo is the best. This project so far is pretty ad hoc. It may be suitable later to split the tokens (or not) for a conjoint analysis, for example.

2. I tried ED's receipe and Paul Dorfman's expanded version. Paul's version, as I observed during the past at this forcum, is always more thougtful and thorough. This case is no exception. Special salute to the chief.

3. My macro is still kind of slow. I need more time to study the two papers Fred, Ronald recommended.

After trying for a little more than 2 years with SAS, now I increasingly feel the design, not the coding, is far more important. As long as I know the keywords to use, I can find almost any solution at this forcum (except for the "access violation" problem with variable length when reading data directly from EBCDIC into Windows SAS).

Thanks again. Paula D

sashole@bellsouth.net (Paul M. Dorfman) wrote in message news:<001b01c23789$920566f0$0101a8c0@sasholewmis4a7>... > Paula, > > Ed Heaton has shown you one way of creating a scramble identifier using > integer arithmetic (I think you had thought of it yourself). It is a > fine method when the subkeys (tokens) consist of explicit integers > falling in a limited range. > > A method relying on comparisons between the tokens instead would be > more > general. At due time, Sig Hermansen used a similar idea to perform a > fuzzy record linkage on parts of SSNs whose digits might have been > inadvertently transposed. Namely, if you extract the tokens from the > string and order them, all transpositions of the same tokens will sort > into the same ordered triplet. > > In your case, it is especially easy to achieve since there are only > three tokens to deal with, so the choice of sorting method is > practically irrelevant. Below, the tokens are essentially bubble-sorted > by hand. > > Let us consider the simplest case when amex can have only two suffixes, > and visa and mast - just one (the first step just generates test data): > > %let a 2 ; > %let m 1 ; > %let v 1 ; > > data input (keep slogan) ; > length slogan $21. ; > do v 1 to &v ; > vv 'visa' || put(v,2.-l) ; > do m 1 to &m ;> > mm 'mast' || put(m,2.-l) ; > do a 1 to &a ; > aa 'amex' || put(a,3.-l) ; > slogan trim(vv) || '-' || trim(mm) || '-' || aa ; output > ; > slogan trim(vv) || '-' || trim(aa) || '-' || mm ; output > ; > slogan trim(mm) || '-' || trim(vv) || '-' || aa ; output > ; > slogan trim(mm) || '-' || trim(aa) || '-' || vv ; output > ; > slogan trim(aa) || '-' || trim(mm) || '-' || vv ; output > ; > slogan trim(aa) || '-' || trim(vv) || '-' || mm ; output > ; > end ; > end ; > end ; > run ; > > Now let us break SLOGAN in tokens in each observation and order the > tokens: > > data vcommon (keep slogan c:) ; > array token $7. c1-c3 ; > set input ; > do over token ; > token scan (slogan, i , '-') ; > end ; > if c1 > c2 then do ; t c1 ; c1 c2 ; c2 t ; end ; > if c2 > c3 then do ; t c2 ; c2 c3 ; c3 t ; end ; > if c1 > c2 then do ; t c1 ; c1 c2 ; c2 t ; end ; > run ; > > C1-C3 form a composite key identical for all variations of a given > triplet of tokens C1-C3, so now we can use it to consolidate slogans > differing only by the order of their tokens in a single by-group: > > proc sort data vcommon out common equals ; > by c: ; > run ; > > Printing these data shows: > > proc print noobs u ; > by c: ; > id c: ; > var slogan ; > run; > > c1 c2 c3 slogan > > amex1 mast1 visa1 visa1-mast1-amex1 > visa1-amex1-mast1 > mast1-visa1-amex1 > mast1-amex1-visa1 > amex1-mast1-visa1 > amex1-visa1-mast1 > > amex2 mast1 visa1 visa1-mast1-amex2 > visa1-amex2-mast1 > mast1-visa1-amex2 > mast1-amex2-visa1 > amex2-mast1-visa1 > amex2-visa1-mast1 > > > Note that because of the EQUALS sort option, the original relative > order > of the SLOGAN values has been retained. Now you can decide which value > of SLOGAN within the same transposition group you want to serve as > common. For example, if you want the first value in each group, > > data choosefirst ( keep slogan: ) ; > set common (rename (sloganslogan r)) ; > by c: ; > retain slogan ; > if first.c3 then slogan slogan r ; > run ; > > Which prints: > > proc print noobs u ; > by slogan ; > id slogan ; > var slogan r ; > run; > > slogan slogan r > > visa1-mast1-amex1 visa1-mast1-amex1 > visa1-amex1-mast1 > mast1-visa1-amex1 > mast1-amex1-visa1 > amex1-mast1-visa1 > amex1-visa1-mast1 > > visa1-mast1-amex2 visa1-mast1-amex2 > visa1-amex2-mast1 > mast1-visa1-amex2 > mast1-amex2-visa1 > amex2-mast1-visa1 > amex2-visa1-mast1 > > Which of one of the values of slogan r to pick up? It is either up to > you or your supervisor. > > Kind regards, > > Paul M. Dorfman > Jacksonville, FL > > > > > > > > > From: paula D <sophe@USA.NET> > > > > > > I have one character variable (slogan) with values like this: > > > > > > visa1-mast2-amex3; > > > visa2-mast3-amex9; > > > mast7-visa2-amex8; > > > .... > > > > > > that is, every value randomly takes 3 of (visa1-visa50, > mast1-mast50, > > > amex1-amex150) and combine with 2 "-" in between. It is Ok if the > > > value is > > > > > > visa1-visa1-visa1. > > > > > > My problem is if the 3 components in any 2 values are the same, such > > > as > > > > > > visa1-mast2-amex3 and mast2-visa1-amex3, > > > > > > I want to treat them as the same, taking 1 of the 2 values to > > > represent them. > > > > > > So far I have tried one idea. It works, but too complicated. The > idea > > > is to assign a value to each component in such as way that > > > > > > 1. comp1substr(slogan, 1, 5); comp2substr(slogan, 7,5); > > > comp3substr(slogan, 13,5); > > > 2. if comp1"visa1" then tag11; else if comp1"visa2" then tag1 ; > > > else if comp1"visa3" then tag1 0......; > > > 3. if comp2"visa1" then tag21; else if comp2"visa2" then tag2 ; > > > else if comp2"visa3" then tag2 0......; > > > > > > then the total of tag1-tag3 is at a unique range to correspond to > each > > > unique original value in slogan (in this sense, the magnitude does > not > > > have to be 10). Finally, re-format the totals back using original > > > Slogan values. > > > > > > I would appreciate if someone can help with any better solution. I > > > guess I am not the first person who has ever needed to do this, but > I > > > have no idea what keywords to use to query the SAS-L to find > previous > > > listings in this regard. Thank you in advance. > > > > > > Again, my old email sophe@usa.net is no longer in use. Please use > > > sophe88@yahoo.com for emails. Thanks. > > > > --


Back to: Top of message | Previous page | Main SAS-L page