LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2009, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 17 Dec 2009 09:57:14 -0800
Reply-To:     ChrisG <chris.godlewski@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         ChrisG <chris.godlewski@GMAIL.COM>
Organization: http://groups.google.com
Subject:      Re: matching names + cleaning data
Comments: To: sas-l@uga.edu
Content-Type: text/plain; charset=ISO-8859-1

On 17 déc, 14:55, muthia.kachira...@GMAIL.COM (Muthia Kachirayan) wrote: > Chris, > > Here are two solutions. The first one is done with external sorting which > may not be efficient to deal with a large data set. The second one is based > on hash objects and does not depend on external sorting and hence considered > to be more efficient. No testing is done. > Here goes the data set. > data have; > input @1 loanid $8. @10 Bank_Name &$42. @54 Type &$18.; > cards; > BF486261 Corp Credit Lyonnais Mandated Arranger > BF486265 Corp JP Morgan Bookrunner > BF486265 Corp JP Morgan Mandated Arranger > BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated Arranger > BF486269 Corp Societe Generale Agent(s) > BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau Arranger(s) > BF486269 Corp Societe Generale Bookrunner > BF486269 Corp Societe Generale Mandated Arranger > BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s) > BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s) > BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner > BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s) > BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s) > BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger > BF497955 Corp Societe Generale Bookrunner > BF497955 Corp Societe Generale Mandated Arranger > BF500064 Corp Royal Bank of Scotland Agent(s) > BF500064 Corp Royal Bank of Scotland Bookrunner > BF500064 Corp BNP Paribas Group Lender(s) > BF500064 Corp HSBC Financial Products (France) SNC Mandated Arranger > BF500064 Corp Royal Bank of Scotland Mandated Arranger > BF500797 Corp ING Bank NV Bookrunner > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF500797 Corp ING Bank NV Mandated Arranger > BF501049 Corp Credit Lyonnais Agent(s) > BF501049 Corp CIT Group Lender(s) > BF501049 Corp Credit Lyonnais Lender(s) > BF501049 Corp ING Groep NV Lender(s) > BF501049 Corp CIT Group Mandated Arranger > BF501049 Corp Credit Lyonnais Mandated Arranger > BF501049 Corp ING Groep NV Mandated Arranger > LN059143 Corp Banca Commerciale Italiana Lender(s) > LN059143 Corp Banca di Roma (France) Lender(s) > LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s) > LN059143 Corp Banco Espanol de Credito Lender(s) > LN059143 Corp Bank One NA Lender(s) > LN059143 Corp Barclays Bank PLC (US) Lender(s) > LN059143 Corp Deutsche Bank AG Lender(s) > LN059143 Corp Dexia Banque SA Lender(s) > LN059143 Corp ING Bank NV/United States Lender(s) > LN059143 Corp Rabobank International Lender(s) > LN059143 Corp Raiffeisen Zentralbank Oesterreich AG Lender(s) > LN059143 Corp Sanpaolo IMI SpA Lender(s) > LN059143 Corp Unicredito Italiano Lender(s) > LN059143 Corp Deutsche Bank AG Mandated Arranger > LN059143 Corp IntesaBci SpA Mandated Arranger > ; > run; > > Both the solutions depend on the logic of finding the frequency of > occurrence of BANK_NAMEs and to visually inspect the condensed list (Step > 1). The short-list will give clues to re-group the BANK_NAMEs. > Here is the list of names. > > Obs Bank_Name > count > 1 Corp Banco Espanol de Credito > 1 > 2 Corp Deutsche Bank AG > 2 > 3 Corp Rabobank International > 1 > 4 Corp JP Morgan > 2 > 5 Corp ING Groep NV > 2 > 6 Corp KFW Kreditanstalt Fur Wiederaufbau > 2 > 7 Corp ING Bank NV > 13 > 8 Corp Landesbank Baden-Wuerttemberg > 2 > 9 Corp Credit Lyonnais > 4 > 10 Corp Sanpaolo IMI SpA > 1 > 11 Corp Societe Generale > 5 > 12 Corp Raiffeisen Zentralbank Oesterreich AG > 1 > 13 Corp Banca di Roma (France) > 1 > 14 Corp ING Bank NV/United States > 1 > 15 Corp Banco Bilbao Vizcaya Argentaria > 1 > 16 Corp Bank One NA > 1 > 17 Corp IntesaBci SpA > 1 > 18 Corp CIT Group > 2 > 19 Corp Unicredito Italiano > 1 > 20 Corp Bank of Tokyo-Mitsubishi UFJ Ltd > 4 > 21 Corp Banca Commerciale Italiana > 1 > 22 Corp HSBC Financial Products (France) SNC > 1 > 23 Corp Barclays Bank PLC (US) > 1 > 24 Corp BNP Paribas Group > 1 > 25 Corp Royal Bank of Scotland > 3 > 26 Corp Dexia Banque SA > 1 > > The names in Obs 5 and Obs 14 are similar and they can be re-grouped as one > name. The logic to re-group can be different to different people. I look for > the string "ING" in the names and replace all of them with a typical name. > In this case, all are replaced with "ING Bank NV" (Step 2). The Steps 1 and > 2 can be repeated as many times as desired until one gets a reasonable list > of names. > Once the clean-up is done, we examine the frequency of TYPEs(Step 3) to > determine the hierarchy of them to short-list the banks within each ORDERID. > There are 5 of them and each one is given an order(level). This will help to > retain the top-most row in the final data set. Step 4 gives order to the > TYPEs as found in the last data set. > Step 5 gets the desired output. > The output of Type: > > Obs loanid Bank_Name > Type count > 1 BF486269 Corp Societe Generale > Agent(s) 4 > 2 BF486273 Corp Landesbank Baden-Wuerttemberg > Arranger(s) 2 > 3 BF497955 Corp Societe Generale > Bookrunner 6 > 4 LN059143 Corp Unicredito Italiano > Lender(s) 19 > 5 BF497955 Corp Societe Generale Mandated > Arranger 25 > There are 5 Types and are used to assign 1 to 5 for defining level. > The Final data set: > > Obs loanid Bank_Name > Type level > 1 BF486261 Corp Credit Lyonnais Mandated > Arranger 2 > 2 BF486265 Corp JP Morgan Mandated > Arranger 2 > 3 BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated > Arranger 2 > 4 BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau > Arranger(s) 3 > 5 BF486269 Corp Societe Generale > Agent(s) 1 > 6 BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd > Agent(s) 1 > 7 BF486273 Corp Landesbank Baden-Wuerttemberg > Arranger(s) 3 > 8 BF497955 Corp Societe Generale Mandated > Arranger 2 > 9 BF500064 Corp BNP Paribas Group > Lender(s) 4 > 10 BF500064 Corp HSBC Financial Products (France) SNC Mandated > Arranger 2 > 11 BF500064 Corp Royal Bank of Scotland > Agent(s) 1 > 12 BF500797 ING Bank NV Mandated > Arranger 2 > 13 BF501049 Corp CIT Group Mandated > Arranger 2 > 14 BF501049 Corp Credit Lyonnais > Agent(s) 1 > 15 BF501049 ING Bank NV Mandated > Arranger 2 > 16 LN059143 Corp Banca Commerciale Italiana > Lender(s) 4 > 17 LN059143 Corp Banca di Roma (France) > Lender(s) 4 > 18 LN059143 Corp Banco Bilbao Vizcaya Argentaria > Lender(s) 4 > 19 LN059143 Corp Banco Espanol de Credito > Lender(s) 4 > 20 LN059143 Corp Bank One NA > Lender(s) 4 > 21 LN059143 Corp Barclays Bank PLC (US) > Lender(s) 4 > 22 LN059143 Corp Deutsche Bank AG Mandated > Arranger 2 > 23 LN059143 Corp Dexia Banque SA > Lender(s) 4 > 24 LN059143 Corp IntesaBci SpA Mandated > Arranger 2 > 25 LN059143 Corp Rabobank International > Lender(s) 4 > 26 LN059143 Corp Raiffeisen Zentralbank Oesterreich AG > Lender(s) 4 > 27 LN059143 Corp Sanpaolo IMI SpA > Lender(s) 4 > 28 LN059143 Corp Unicredito Italiano > Lender(s) 4 > 29 LN059143 ING Bank NV > Lender(s) 4 > > ********************** Datastep Non-Hash solution ************; > *** Step 1: Count of Bank_Name ***; > proc sort data = have; > by Bank_Name; > run; > data out_01; > do until(last.Bank_Name); > set have; > by Bank_Name; > if first.Bank_Name then count = 0; > count + 1; > end; > run; > *** Step 2: Replacing duplicate Names with unique Name ***; > data have2; > set have; > *** Assumes presence of ING in > ... > > plus de détails »

Hello

thanks a lot ! For the moment i have tested the non hash way (i am discovering this one so i prefer to also understand how it works) and it works great ! I will test the rest later ...

Best CG


Back to: Top of message | Previous page | Main SAS-L page