Date: Wed, 16 Dec 2009 14:20:27 -0800
Reply-To: ChrisG <chris.godlewski@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: ChrisG <chris.godlewski@GMAIL.COM>
Organization: http://groups.google.com
Subject: Re: matching names + cleaning data
Content-Type: text/plain; charset=ISO-8859-1
On 16 déc, 22:56, HERMA...@WESTAT.COM (Sigurd Hermansen) wrote:
> CG:
> You really need a mnemonic Bank ID that can be mapped to a standard bank name. We have to do this type of standardization often. It never works perfectly in new situations.
>
> I'd experiment first with different length constraints on the SOUNDEX() function. Once you have a relation of soundex values and standard bank names, then you may be able to use one of the similarity functions (edit distance or rearrangement cost [e.g., SPEDIS()]) to check to see if the relation contains a close enough match to a standard name. If not, it would have to be reconciled off-line.
>
> Here's a rough cut of how you might create a relation of standardized bank names and soundex values:
> data banks;
> infile cards pad truncover;
> input @1 loanid $char8. @10 class $4. @16 Bank_Name $char39. @55 Type $char20.
> ;
> datalines;
> BF486261 Corp Credit Lyonnais Mandated Arranger
> BF486265 Corp JP Morgan Bookrunner
> BF486265 Corp JP Morgan Mandated Arranger
> BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated Arranger
> BF486269 Corp Societe Generale Agent(s)
> BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau Arranger(s)
> BF486269 Corp Societe Generale Bookrunner
> BF486269 Corp Societe Generale Mandated Arranger
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger
> BF497955 Corp Societe Generale Bookrunner
> BF497955 Corp Societe Generale Mandated Arranger
> BF500064 Corp Royal Bank of Scotland Agent(s)
> BF500064 Corp Royal Bank of Scotland Bookrunner
> BF500064 Corp BNP Paribas Group Lender(s)
> BF500064 Corp HSBC Financial Products (France) SNC Mandated Arranger
> BF500064 Corp Royal Bank of Scotland Mandated Arranger
> BF500797 Corp ING Bank NV Bookrunner
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF501049 Corp Credit Lyonnais Agent(s)
> BF501049 Corp CIT Group Lender(s)
> BF501049 Corp Credit Lyonnais Lender(s)
> BF501049 Corp ING Groep NV Lender(s)
> BF501049 Corp CIT Group Mandated Arranger
> BF501049 Corp Credit Lyonnais Mandated Arranger
> BF501049 Corp ING Groep NV Mandated Arranger
> LN059143 Corp Banca Commerciale Italiana Lender(s)
> LN059143 Corp Banca di Roma (France) Lender(s)
> LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s)
> LN059143 Corp Banco Espanol de Credito Lender(s)
> LN059143 Corp Bank One NA Lender(s)
> LN059143 Corp Barclays Bank PLC (US) Lender(s)
> LN059143 Corp Deutsche Bank AG Lender(s)
> LN059143 Corp Dexia Banque SA Lender(s)
> LN059143 Corp ING Bank NV/United States Lender(s)
> LN059143 Corp Rabobank International Lender(s)
> LN059143 Corp Raiffeisen Zentralbank Oesterreich AG Lender(s)
> LN059143 Corp Sanpaolo IMI SpA Lender(s)
> LN059143 Corp Unicredito Italiano Lender(s)
> LN059143 Corp Deutsche Bank AG Mandated Arranger
> LN059143 Corp IntesaBci SpA Mandated Arranger
> ;
> run;
> proc sql;
> create table BankStan as
> select distinct Bank_Name,substr(soundex(Bank_Name),1,8) as stanBankName
> from banks
> group by stanBankName having Bank_Name=min(Bank_Name)
> ;
> quit;
>
> The subsequent comparisons of observed bank names to standardized bank names using SPEDIS() or other similarity functions would come next, and would return the standardized name. I am assuming here that the bank name doesn't change with Type.
> S
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of ChrisG
> Sent: Wednesday, December 16, 2009 9:40 AM
> To: SA...@LISTSERV.UGA.EDU
> Subject: matching names + cleaning data
>
> Hi guys
>
> Thank you in advance for any clue, advice and so on ...
>
> Here is my big issue
>
> I have data on the composition of bank loan syndicates (i.e. several
> banks make a loan together to a borrower)
> I have a variable LOANID to identify the loan, a variable BANK_NAME
> for the name of the bank, and TYPE to classify the "rank" of the bank
> in the syndicate
> I have copy paste a portion of the data below :
>
> loanid Bank_Name Type
> BF486261 Corp Credit Lyonnais Mandated Arranger
> BF486265 Corp JP Morgan Bookrunner
> BF486265 Corp JP Morgan Mandated Arranger
> BF486265 Corp KFW Kreditanstalt Fur Wiederaufbau Mandated Arranger
> BF486269 Corp Societe Generale Agent(s)
> BF486269 Corp KFW Kreditanstalt Fur Wiederaufbau Arranger(s)
> BF486269 Corp Societe Generale Bookrunner
> BF486269 Corp Societe Generale Mandated Arranger
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Agent(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Bookrunner
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Lender(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Lender(s)
> BF486273 Corp Bank of Tokyo-Mitsubishi UFJ Ltd Mandated Arranger
> BF497955 Corp Societe Generale Bookrunner
> BF497955 Corp Societe Generale Mandated Arranger
> BF500064 Corp Royal Bank of Scotland Agent(s)
> BF500064 Corp Royal Bank of Scotland Bookrunner
> BF500064 Corp BNP Paribas Group Lender(s)
> BF500064 Corp HSBC Financial Products (France) SNC Mandated Arranger
> BF500064 Corp Royal Bank of Scotland Mandated Arranger
> BF500797 Corp ING Bank NV Bookrunner
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF500797 Corp ING Bank NV Mandated Arranger
> BF501049 Corp Credit Lyonnais Agent(s)
> BF501049 Corp CIT Group Lender(s)
> BF501049 Corp Credit Lyonnais Lender(s)
> BF501049 Corp ING Groep NV Lender(s)
> BF501049 Corp CIT Group Mandated Arranger
> BF501049 Corp Credit Lyonnais Mandated Arranger
> BF501049 Corp ING Groep NV Mandated Arranger
> LN059143 Corp Banca Commerciale Italiana Lender(s)
> LN059143 Corp Banca di Roma (France) Lender(s)
> LN059143 Corp Banco Bilbao Vizcaya Argentaria Lender(s)
> LN059143 Corp Banco Espanol de Credito Lender(s)
> LN059143 Corp Bank One NA Lender(s)
> LN059143 Corp Barclays Bank PLC (US) Lender(s)
> LN059143 Corp Deutsche Bank AG Lender(s)
> LN059143 Corp Dexia Banque SA Lender(s)
> LN059143 Corp ING Bank NV/United States Lender(s)
> LN059143 Corp Rabobank International Lender(s)
> LN059143 Corp Raiffeisen Zentralbank Oesterreich AG Lender(s)
> LN059143 Corp Sanpaolo IMI SpA Lender(s)
> LN059143 Corp Unicredito Italiano Lender(s)
> LN059143 Corp Deutsche Bank AG Mandated Arranger
> LN059143 Corp IntesaBci SpA Mandated Arranger
>
> I need to do 2 things (not sure in what order to make things less
> complicated / more efficient)
>
> 1. i need to harmonize the names of the banks
> as you can see in the data above, for instance i have sometimes ING
> Bank NV and sometimes ING Bank NV/United States
> I want to have always let say ING Bank NV or ING or whatever but one
> name for any possible "variation" of a company which belongs to the
> ING Group
> I tried some proc transpose then some compged and so on but honestly i
> have the impression im going nowhere...
>
> let say for instance that at the end of that step i would like to get
> something like that (with unique and simple names of banks):
>
> loanid Bank_Name Type
> BF486261 Corp Credit Lyonnais Mandated Arranger
> BF486265 Corp JP Morgan Bookrunner
> BF486265 Corp JP Morgan Mandated Arranger
> BF486265 Corp KFW Kreditanstalt Mandated Arranger
> BF486269 Corp Societe Generale Agent(s)
> BF486269 Corp KFW Kreditanstalt Arranger(s)
> BF486269 Corp Societe Generale Bookrunner
> BF486269 Corp Societe Generale Mandated Arranger
> BF486273 Corp Bank of Tokyo-Mitsubishi Agent(s)
> BF486273 Corp Landesbank Baden-Wuerttemberg Arranger(s)
> BF486273
>
> ...
>
> plus de détails »
That is a smart "trick"
thanks again !
CG
|