```
Date: Wed, 8 Feb 2006 16:12:02 +0200
Reply-To: BoraYavuz@HSBC.COM.TR
Sender: "SAS(r) Discussion"
From: Bora Yavuz
Subject: Re: Is There a Procedure Like "PROC SURVEYCORR"?
Comments: cc: mmz1@cdc.gov
Content-Type: text/plain; charset=Windows-1254

Matthew,

Maybe I wasn't clear enough on the "feature extraction" part. What I did was, in essence, two-fold:

--> First step: Apply PROC VARCLUS and choose only one variable (the one with the lowest "1 - R^2" ratio) from each cluster of variables.

--> Second step: Obtain the p-values of the Spearman correlations of the remaining variables with the dependent variable and throw away the ones with p-values that are "sufficiently high" (say above 0.50).

As for the third step:

--> Third step: Rank the variables (that remain after Step 2) by their Spearman correlations and, separately, by their Hoeffding's D values, and check whether a variable's ranks on these two measures differ significantly, since this may indicate a non-linear relationship between the variable and the dependent variable (response). In that case we should "straighten" the variable, possibly using Tukey's transformation ("ladder of powers", etc.).

To my knowledge, the empirical "rule of thumb" mentioned above has been published in SI's course notes on logistic regression. Maybe you can find more detailed references and pointers in there. Can you let me know if you come across any theoretical or practical justification of this?

Cheers,

Bora Y.


"Zack, Matthew M."
07/02/2006 18:19

Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?

For Hoeffding's D, I think you would have to calculate a weighted version in a DATA step, not in PROC CORR. You can use the formula for Hoeffding's D described in the PROC CORR documentation, modified in the same way that the weighted Pearson correlation differs from the unweighted Pearson correlation. I don't have the exact formula for a weighted Hoeffding's D statistic.

However, my problem is with your argument that the comparison between Spearman's correlation and Hoeffding's D is a check on the linearity of a variable. Has this argument been studied empirically? If so, have the results been published?

I also don't understand your statements below that

    Feature extraction: I had used PROC VARCLUS in the first stage of reducing the number of numeric variables. In this step, I look at the p-values of the Spearman correlations and throw away the variables with "sufficiently high" p-values (e.g., 0.50).

PROC VARCLUS groups variables that are highly correlated with one another. Variables in different groups, by definition, are less highly correlated with one another and would therefore have low Spearman correlations and "sufficiently high" p-values. Why would you throw variables in different groups away? Instead, you should select one or a few variables from each group of highly correlated variables to represent that group of variables in further analyses. This variable selection process can be based on subject-matter knowledge, ease of obtaining information about the variable(s), or statistical considerations (for example

Matthew Zack


-----Original Message-----
From: BoraYavuz@hsbc.com.tr [mailto:BoraYavuz@hsbc.com.tr]
Sent: Tuesday, February 07, 2006 3:36 AM
To: Zack, Matthew M.; SAS-L@LISTSERV.UGA.EDU
Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?

Matthew,

--> Thanks a lot for the useful info.
Though I managed to take a look at the on-line documentation (first things first), I wasn't able to come across the notes you mentioned.

--> And in the case of Hoeffding's D, what statistic option for PROC CORR (instead of Hoeffding's D) do we request after employing PROC RANK? [By the way, I'm not interested in the variance of these measures -- just want to obtain the point estimates.]

David,

I'll try to clarify the points you mentioned:

--> The reason I conducted stratified random sampling (i.e., why I ended up with sample weights) is that I'm trying to build a response model on the responses of the customers from a previous direct marketing campaign, who were selected on the basis of their propensity scores, and that the propensity scores of the targeted customers are way different from those in the population (on which I built the original propensity model). This is not surprising since, apart from a small proportion of "random" groups (namely, "random control" and "random treatment" groups), all targeted customers possessed high propensity scores. Hence, I used the following variables as strata in order to form the development and validation samples for response modelling:

    Propensity score brackets
    Whether the customer was contacted or not (binary)
    Customer response (binary)

--> The reason I end up using PROC CORR for obtaining the Spearman correlations and Hoeffding's D's of quite a few variables with the response variable is two-fold:

Feature extraction: I had used PROC VARCLUS in the first stage of reducing the number of numeric variables. In this step, I look at the p-values of the Spearman correlations and throw away the variables with "sufficiently high" p-values (e.g., 0.50).

Checking for non-linearity: It is argued that one should be suspicious of a non-linear relationship between a variable X and the response variable if X ranks "much differently" with respect to the Spearman correlations than with respect to the Hoeffding's D's. In other words, after obtaining the Spearman and Hoeffding's D statistics for all variables using PROC CORR, I rank all the variables first with respect to Spearman, then with respect to Hoeffding's D. If X ranks 1st w.r.t. Spearman but 50th w.r.t. Hoeffding's D, then we should treat X "cautiously" (and possibly "straighten" the variable using a suitable Tukey transform before putting it in the logistic regression).

I hope this helps clarify things.

Bora Y.


"Zack, Matthew M."
06/02/2006 17:51

Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?

Since PROC CORR calculates Spearman correlations by ranking numerical variables and performing Pearson correlations on these ranked variables, you can rank the numerical variables using PROC RANK and then run them through PROC CORR using pearson as an option rather than spearman (cf. the PROC CORR documentation). This same documentation shows that weighted Hoeffding measures might be calculated in the same way. However, the variance of these variables would be incorrect.

When you calculate correlations of ranked variables against a dichotomous variable, the results are equivalent to a Mann-Whitney-Wilcoxon test, just as Pearson correlations of cardinal variables against a dichotomous variable are equivalent to a t-test.

Matthew Zack


-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Bora Yavuz
Sent: Monday, February 06, 2006 7:35 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Is There a Procedure Like "PROC SURVEYCORR"?

Hi,

I'm trying to obtain Hoeffding and Spearman correlations of a set of numeric variables with a dummy response variable as follows:

ods output spearmancorr= spearman
           hoeffdingcorr= hoeffding;

proc corr data= diff3.dvlpmnt_small_by_is_trtmnt spearman hoeffding rank;
   var &all_numeric_vars;
   with overall_afa_activity_as_of_oct1;
   weight final_weight;
run;

Here "&all_numeric_vars" is a global macro variable, which is simply a list of selected numeric variables, whereas "overall_afa_activity_as_of_oct1" is my 0-1 response variable. However, I also have a weight variable "final_weight" since I conducted stratified random sampling before attempting the analysis.

My problem is that I get the following error, which says I am not allowed to use the "WEIGHT" statement in this case since I have stipulated "non-parametric options".
The log is as follows:

20   ods output spearmancorr= spearman
21              hoeffdingcorr= hoeffding;
22
23   proc corr data= diff3.dvlpmnt_small_by_is_trtmnt spearman hoeffding rank;
SYMBOLGEN:  Macro variable ALL_NUMERIC_VARS resolves to ATM_ISLEM_YILDA_KAC_DONEM AutoToSavingsFlag BIREYSEL BS_DK_AKTIFLIK_STATUC1 BS_DK_AKTIFLIK_STATUC2 BS_KK_AKTIFLIK_STATUC2 B_DEK_TASIT B_TASIT B_Tipi_Likit_Ort_Adat3 B_Tipi_Likit_Ort_Adat6 CC_flag DEMFON1 DaysSinceLastCCActivity DebitCardFlag DemandFCFlag IB_STATUSUC1 IB_STATUSUC2 INTERNETB_ISLEM_SON_KAC_DONEM ISEMAIL IVP_flag_l3m IVP_flag_l6m KK_ADV_MRC_ISLEM_ADET1 KK_ADV_ONLY_ISLEM1 KK_EKSTRE_NAKIT_USD1 KK_EK_GRANT KK_EK_USER KK_EK_USER_CLASSIC KK_EK_USER_PREMIER KK_NEW_ACQUIRED KK_RE_ACQUIRED KK_RE_ACQUIRED2 KMH_KREDI_ADET KOS KR_KREDI_BAKIYE_TOP1 KR_VADELI_DVBAKIYE_DOLAR1 LoansFXFlag MAAS_ODEME_SISTEMI MAAS_USD_ADJ MESLEK_TIPIC1 MESLEK_TIPIC2 MV_DVBAKIYE_ORT1 OTOM SavingsTLFlag TELEFONB_AGENT_ISLEM_ADET1 TELEFONB_ISLEM_YILDA_KAC_DONEM afa_activity_L3M afa_activity_flag6 afa_ownership_flag3 afa_ownership_flag6 avg_lim_util_l3m avg_limit_l3m avg_ratio_kk_ekstre_faiz_l3m avg_ratio_kk_ekstre_nakit_l3m avg_ratio_kk_ekstre_satis_l3m cust_age_mnth maks_revolving_balance_l3m max_liquid_fund_31 max_ratio_kk_ekstre_faiz_l6m max_ratio_kk_ekstre_taksit_l6m min_ratio_kk_ekstre_gelir_l3m min_ratio_kk_ekstre_nakit_l6m min_ratio_kk_ekstre_satis_l3m min_ratio_kk_ekstre_taksit_l3m mt_evtel num_of_rev_l3m num_of_tel ort_kk_ekstre_satis_l6m ort_oo_tutar_l3m ort_vdszbak_TL_l3m perc_sl_liquid_fund_31C1 perc_sl_liquid_fund_31C2 perc_sl_oo_tutar_21C1 perc_sl_oo_tutar_21C2 perc_sl_tot_assets_31C1 perc_sl_tot_assets_31C2 perc_sl_vdszbak_TL_21C1 perc_sl_vdszbak_TL_21C2 ratio_kk_ekstre_gelir1 std_KK_TKSTDHL_ISL_ADET_l3m std_kk_ekstre_odeme_l6m std_limit_l6m sum_atm_islem_adet_l3m sum_kk_nakit_l3m sum_mv_vdli_islem_l3m tot_assets1 u10 u11 u12 u13 x14 x9 z10_l3m
24   var &all_numeric_vars;
25   with overall_afa_activity_as_of_oct1;
26   weight final_weight;
27   run;

ERROR: Nonparametric options not allowed with WEIGHT statement.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE CORR used (Total process time):
      real time           0.04 seconds
      cpu time            0.01 seconds

WARNING: Output 'hoeffdingcorr' was not created. Make sure that the output object name, label, or path is spelled correctly. Also, verify that the appropriate procedure options are used to produce the requested output object. For example, verify that the NOPRINT option is not used.
WARNING: Output 'spearmancorr' was not created. Make sure that the output object name, label, or path is spelled correctly. Also, verify that the appropriate procedure options are used to produce the requested output object. For example, verify that the NOPRINT option is not used.

--> As I asked in the title of this message, is there a procedure like "PROC SURVEYCORR" whereby I can use the "WEIGHT" statement?

--> Or simply put, how can I derive the desired non-parametric statistics in this case where I have sampling weights?

Thank you very much in advance,

Bora Y.
```
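
The workaround Matthew Zack describes in the thread (rank the variables with PROC RANK, then request weighted Pearson correlations on the ranks) could be sketched roughly as follows. This is only an illustrative sketch, not code posted by either author. The data set, macro variable, response and weight names are taken from Bora's original step; `work.ranked` and `weighted_rank_corr` are hypothetical names introduced here. Note the assumptions: the ranks themselves are computed without the sampling weights, so the result is an approximation to a weighted Spearman correlation, and, as noted in the thread, the variances (and hence p-values) reported by PROC CORR would not be correct for this design.

```sas
/* Step 1: replace each numeric variable by its rank.                     */
/* TIES=MEAN (the default) assigns tied values their average rank.        */
/* With no RANKS statement, the ranks overwrite the original values in    */
/* the output data set. WORK.RANKED is a hypothetical data set name.      */
proc rank data=diff3.dvlpmnt_small_by_is_trtmnt out=work.ranked ties=mean;
   var &all_numeric_vars;
run;

/* Step 2: weighted Pearson correlations of the ranks with the response.  */
/* The 0-1 response is left unranked: ranking a two-valued variable is a  */
/* linear rescaling, which does not change a Pearson correlation.         */
/* WEIGHTED_RANK_CORR is a hypothetical output data set name.             */
ods output pearsoncorr=weighted_rank_corr;

proc corr data=work.ranked pearson;
   var &all_numeric_vars;
   with overall_afa_activity_as_of_oct1;
   weight final_weight;
run;
```

Only the point estimates in the output table should be read from this sketch. It does not address Hoeffding's D, for which no weighted formula is given in the thread.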
