Date: Thu, 17 Dec 2009 09:57:14 -0800
Reply-To: ChrisG <chris.godlewski@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: ChrisG <chris.godlewski@GMAIL.COM>
Organization: http://groups.google.com
Subject: Re: matching names + cleaning data
Content-Type: text/plain; charset=ISO-8859-1
On 17 déc, 14:55, muthia.kachira...@GMAIL.COM (Muthia Kachirayan)
wrote:
> Chris,
>
> Here are two solutions. The first one is done with external sorting which
> may not be efficient to deal with a large data set. The second one is based
> on hash objects and does not depend on external sorting and hence considered
> to be more efficient. No testing is done.
> Here goes the data set.
> data have;
> input @1 loanid $8. @10 Bank_Name &$42. @54 Type &$18.;
> cards;
> BF486261 Corp Credit Lyonnais Mandated Arranger
> BF486265 Corp JP Morgan Bookrunner
> BF486265 Corp JP Morgan Mandated Arranger
> BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated Arranger
> BF486269 Corp Societe Generale Agent(s)
> BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau Arranger(s)
> BF486269 Corp Societe Generale Bookrunner
> BF486269 Corp Societe Generale Mandated Arranger
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger
> BF497955 Corp Societe Generale Bookrunner
> BF497955 Corp Societe Generale Mandated Arranger
> BF500064 Corp Royal Bank of Scotland Agent(s)
> BF500064 Corp Royal Bank of Scotland Bookrunner
> BF500064 Corp BNP Paribas Group Lender(s)
> BF500064 Corp HSBC Financial Products (France) SNC Mandated Arranger
> BF500064 Corp Royal Bank of Scotland Mandated Arranger
> BF500797 Corp ING Bank NV Bookrunner
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF501049 Corp Credit Lyonnais Agent(s)
> BF501049 Corp CIT Group Lender(s)
> BF501049 Corp Credit Lyonnais Lender(s)
> BF501049 Corp ING Groep NV Lender(s)
> BF501049 Corp CIT Group Mandated Arranger
> BF501049 Corp Credit Lyonnais Mandated Arranger
> BF501049 Corp ING Groep NV Mandated Arranger
> LN059143 Corp Banca Commerciale Italiana Lender(s)
> LN059143 Corp Banca di Roma (France) Lender(s)
> LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s)
> LN059143 Corp Banco Espanol de Credito Lender(s)
> LN059143 Corp Bank One NA Lender(s)
> LN059143 Corp Barclays Bank PLC (US) Lender(s)
> LN059143 Corp Deutsche Bank AG Lender(s)
> LN059143 Corp Dexia Banque SA Lender(s)
> LN059143 Corp ING Bank NV/United States Lender(s)
> LN059143 Corp Rabobank International Lender(s)
> LN059143 Corp Raiffeisen Zentralbank Oesterreich AG Lender(s)
> LN059143 Corp Sanpaolo IMI SpA Lender(s)
> LN059143 Corp Unicredito Italiano Lender(s)
> LN059143 Corp Deutsche Bank AG Mandated Arranger
> LN059143 Corp IntesaBci SpA Mandated Arranger
> ;
> run;
>
> Both the solutions depend on the logic of finding the frequency of
> occurrence of BANK_NAMEs and to visually inspect the condensed list (Step
> 1). The short-list will give clues to re-group the BANK_NAMEs.
> Here is the list of names.
>
> Obs Bank_Name
> count
> 1 Corp Banco Espanol de Credito
> 1
> 2 Corp Deutsche Bank AG
> 2
> 3 Corp Rabobank International
> 1
> 4 Corp JP Morgan
> 2
> 5 Corp ING Groep NV
> 2
> 6 Corp KFW Kreditanstalt Fur Wiederaufbau
> 2
> 7 Corp ING Bank NV
> 13
> 8 Corp Landesbank Baden-Wuerttemberg
> 2
> 9 Corp Credit Lyonnais
> 4
> 10 Corp Sanpaolo IMI SpA
> 1
> 11 Corp Societe Generale
> 5
> 12 Corp Raiffeisen Zentralbank Oesterreich AG
> 1
> 13 Corp Banca di Roma (France)
> 1
> 14 Corp ING Bank NV/United States
> 1
> 15 Corp Banco Bilbao Vizcaya Argentaria
> 1
> 16 Corp Bank One NA
> 1
> 17 Corp IntesaBci SpA
> 1
> 18 Corp CIT Group
> 2
> 19 Corp Unicredito Italiano
> 1
> 20 Corp Bank of Tokyo-Mitsubishi UFJ Ltd
> 4
> 21 Corp Banca Commerciale Italiana
> 1
> 22 Corp HSBC Financial Products (France) SNC
> 1
> 23 Corp Barclays Bank PLC (US)
> 1
> 24 Corp BNP Paribas Group
> 1
> 25 Corp Royal Bank of Scotland
> 3
> 26 Corp Dexia Banque SA
> 1
>
> The names in Obs 5 and Obs 14 are similar and they can be re-grouped as one
> name. The logic to re-group can be different to different people. I look for
> the string "ING" in the names and replace all of them with a typical name.
> In this case, all are replaced with "ING Bank NV" (Step 2). The Steps 1 and
> 2 can be repeated as many times as desired until one gets a reasonable list
> of names.
> Once the clean-up is done, we examine the frequency of TYPEs(Step 3) to
> determine the hierarchy of them to short-list the banks within each ORDERID.
> There are 5 of them and each one is given an order(level). This will help to
> retain the top-most row in the final data set. Step 4 gives order to the
> TYPEs as found in the last data set.
> Step 5 gets the desired output.
> The output of Type:
>
> Obs loanid Bank_Name
> Type count
> 1 BF486269 Corp Societe Generale
> Agent(s) 4
> 2 BF486273 Corp Landesbank Baden-Wuerttemberg
> Arranger(s) 2
> 3 BF497955 Corp Societe Generale
> Bookrunner 6
> 4 LN059143 Corp Unicredito Italiano
> Lender(s) 19
> 5 BF497955 Corp Societe Generale Mandated
> Arranger 25
> There are 5 Types and are used to assign 1 to 5 for defining level.
> The Final data set:
>
> Obs loanid Bank_Name
> Type level
> 1 BF486261 Corp Credit Lyonnais Mandated
> Arranger 2
> 2 BF486265 Corp JP Morgan Mandated
> Arranger 2
> 3 BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated
> Arranger 2
> 4 BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau
> Arranger(s) 3
> 5 BF486269 Corp Societe Generale
> Agent(s) 1
> 6 BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd
> Agent(s) 1
> 7 BF486273 Corp Landesbank Baden-Wuerttemberg
> Arranger(s) 3
> 8 BF497955 Corp Societe Generale Mandated
> Arranger 2
> 9 BF500064 Corp BNP Paribas Group
> Lender(s) 4
> 10 BF500064 Corp HSBC Financial Products (France) SNC Mandated
> Arranger 2
> 11 BF500064 Corp Royal Bank of Scotland
> Agent(s) 1
> 12 BF500797 ING Bank NV Mandated
> Arranger 2
> 13 BF501049 Corp CIT Group Mandated
> Arranger 2
> 14 BF501049 Corp Credit Lyonnais
> Agent(s) 1
> 15 BF501049 ING Bank NV Mandated
> Arranger 2
> 16 LN059143 Corp Banca Commerciale Italiana
> Lender(s) 4
> 17 LN059143 Corp Banca di Roma (France)
> Lender(s) 4
> 18 LN059143 Corp Banco Bilbao Vizcaya Argentaria
> Lender(s) 4
> 19 LN059143 Corp Banco Espanol de Credito
> Lender(s) 4
> 20 LN059143 Corp Bank One NA
> Lender(s) 4
> 21 LN059143 Corp Barclays Bank PLC (US)
> Lender(s) 4
> 22 LN059143 Corp Deutsche Bank AG Mandated
> Arranger 2
> 23 LN059143 Corp Dexia Banque SA
> Lender(s) 4
> 24 LN059143 Corp IntesaBci SpA Mandated
> Arranger 2
> 25 LN059143 Corp Rabobank International
> Lender(s) 4
> 26 LN059143 Corp Raiffeisen Zentralbank Oesterreich AG
> Lender(s) 4
> 27 LN059143 Corp Sanpaolo IMI SpA
> Lender(s) 4
> 28 LN059143 Corp Unicredito Italiano
> Lender(s) 4
> 29 LN059143 ING Bank NV
> Lender(s) 4
>
> ********************** Datastep Non-Hash solution ************;
> *** Step 1: Count of Bank_Name ***;
> proc sort data = have;
> by Bank_Name;
> run;
> data out_01;
> do until(last.Bank_Name);
> set have;
> by Bank_Name;
> if first.Bank_Name then count = 0;
> count + 1;
> end;
> run;
> *** Step 2: Replacing duplicate Names with unique Name ***;
> data have2;
> set have;
> *** Assumes presence of ING in
> ...
>
> plus de détails »
Hello
thanks a lot !
For the moment i have tested the non hash way (i am discovering this
one so i prefer to also understand how it works)
and it works great !
I will test the rest later ...
Best
CG
|