Date: Thu, 9 Feb 2006 08:55:46 +0200
Reply-To: BoraYavuz@HSBC.COM.TR
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Bora Yavuz <BoraYavuz@HSBC.COM.TR>
Subject: Re: Is There a Procedure Like "PROC SURVEYCORR"?
Content-Type: text/plain; charset=Windows-1254
Surely, "the pattern of residuals vs. predicted values and vs. the
independent variables" is supposed to pick up non-linearities. Honestly, I
don't know the motivation for the "rule-of-thumb" SI suggests.
Thank you very much for your comments,
Bora Y.
-----Original Message-----
From: "Zack, Matthew M." <mmz1@cdc.gov>
Sent: 08/02/2006 16:45
To: <BoraYavuz@hsbc.com.tr>
Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?
Your feature extraction sounds plausible, though the P-value cutoff in your
second step is higher than usually recommended for screening variables:
variables with P-values above a cutoff in the range 0.15-0.25 are usually
removed (the reference for this is based on linear regression, though it
should also be applicable to nonlinear regression like logistic regression).
I've heard of Tukey's ladder of powers but not the "rule-of-thumb" you
cite, and I haven't taken SI's logistic regression course. I wonder why
the pattern of residuals vs. predicted values and vs. the independent
variables would not pick up nonlinearities.
Matthew Zack
-----Original Message-----
From: BoraYavuz@hsbc.com.tr [mailto:BoraYavuz@hsbc.com.tr]
Sent: Wednesday, February 08, 2006 9:12 AM
To: SAS-L@LISTSERV.UGA.EDU
Cc: Zack, Matthew M.
Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?
Matthew,
Maybe I wasn't clear enough on the "feature extraction" part. What I did
was, in essence, two-fold:
--> First step: Apply PROC VARCLUS and choose only one variable (the one
with the lowest "1 - R^2" ratio) from each cluster of variables.
--> Second step: Obtain the p-values of the Spearman correlations of the
remaining variables with the dependent variable and throw away the ones
with p-values that are "sufficiently high" (say above 0.50).
As for the third step:
--> Third step: Rank the variables (that remain after Step 2) first with
respect to their Spearman correlations and then with respect to their
Hoeffding's D's, and check whether a variable's ranks on these two measures
differ markedly, since this may be an indication of non-linearity between
the variable and the dependent variable (response), in which case we should
"straighten" the variable, possibly using Tukey's transformation ("ladder
of powers", etc.)
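The rank comparison in this third step could be sketched roughly as follows.
This is only an illustration under several assumptions: it is unweighted
(PROC CORR rejects WEIGHT with these options), the work.* data set names are
placeholders, ranking Spearman by absolute value is my own choice, and it
assumes the ODS output tables hold one column per VAR variable, named after
that variable.

```sas
/* Unweighted Spearman and Hoeffding's D against the response. */
ods output spearmancorr=work.sp hoeffdingcorr=work.hd;
proc corr data=diff3.dvlpmnt_small_by_is_trtmnt spearman hoeffding;
   var &all_numeric_vars;
   with overall_afa_activity_as_of_oct1;
run;

/* One row per predictor: _NAME_ holds the variable name. */
proc transpose data=work.sp out=work.sp_long(rename=(col1=spearman));
   var &all_numeric_vars;
run;
proc transpose data=work.hd out=work.hd_long(rename=(col1=hoeffding));
   var &all_numeric_vars;
run;

proc sort data=work.sp_long; by _name_; run;
proc sort data=work.hd_long; by _name_; run;

data work.both;
   merge work.sp_long work.hd_long;
   by _name_;
   abs_spearman = abs(spearman);   /* assumption: strength, not sign */
run;

/* Rank 1 = strongest association on each measure. */
proc rank data=work.both out=work.rank_compare descending;
   var abs_spearman hoeffding;
   ranks rank_spearman rank_hoeffding;
run;
```

Variables whose rank_spearman and rank_hoeffding are far apart in
work.rank_compare would be the candidates for a closer look.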
To my knowledge, the empirical "rule-of-thumb" mentioned above has been
published in SI's course notes on logistic regression. Maybe you can find
more detailed references and pointers in there. Can you let me know if you
come across any theoretical or practical justification of this?
Cheers,
Bora Y.
-----Original Message-----
From: "Zack, Matthew M." <mmz1@cdc.gov>
Sent: 07/02/2006 18:19
To: <BoraYavuz@hsbc.com.tr>
Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?
For Hoeffding's D, I think you would have to calculate a weighted version
in a DATA step, not in PROC CORR. You can use the formula for Hoeffding's
D described in the PROC CORR documentation but modified as the weighted
Pearson's correlation is compared to the unweighted Pearson's correlation.
I don't have the exact formula for a weighted Hoeffding's D statistic.
However, my problem is with your argument that the comparison between
Spearman's correlation and Hoeffding's D is a check on the linearity of a
variable. Has this argument been studied empirically? If so, have the
results been published?
I also don't understand your statement below:
Feature extraction: I had used PROC VARCLUS in the first stage of
reducing the number of numeric variables. In this step, I look at
the p-values of the Spearman correlations and throw away the
variables with "sufficiently high" p-values (e.g., 0.50).
PROC VARCLUS groups variables that are highly correlated with one another.
Variables in different groups, by definition, are less highly correlated
with one another and would therefore have low Spearman correlations and
"sufficiently high" P-values. Why would you throw variables in different
groups away?
Instead, you should select one or a few variables from each group of highly
correlated variables to represent that group of variables in further
analyses. This variable selection process can be based on subject-matter
knowledge, ease of obtaining information about the variable(s), or
statistical considerations (for example
Matthew Zack
-----Original Message-----
From: BoraYavuz@hsbc.com.tr [mailto:BoraYavuz@hsbc.com.tr]
Sent: Tuesday, February 07, 2006 3:36 AM
To: Zack, Matthew M.; SAS-L@LISTSERV.UGA.EDU
Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?
Matthew,
--> Thanks a lot for the useful info. Though I managed to take a look at
the on-line documentation (first things first) I wasn't able to come across
the notes you mentioned.
--> And in the case of Hoeffding's D, what statistic option for PROC CORR
(instead of Hoeffding's D) do we request after employing PROC RANK? [By
the way, I'm not interested in the variance of these measures -- just want
to obtain the point estimates.]
David, I'll try to clarify the points you mentioned:
--> The reason I conducted stratified random sampling (i.e., why I ended up
with sample weights) is that I'm trying to build a response
model on the responses of the customers from a previous direct marketing
campaign, who were selected on the basis of their propensity scores, and
that the propensity scores of the targeted customers are way different from
those in the population (on which I built the original propensity model).
This is not surprising since, apart from a small proportion of "random"
groups (namely, "random control" and "random treatment" groups), all
targeted customers possessed high propensity scores. Hence, I used the
following variables as strata in order to form the development and
validation samples for response modelling:
Propensity score brackets
Whether the customer was contacted or not (binary)
Customer response (binary)
--> The reason I end up using PROC CORR for obtaining the Spearman
correlations and Hoeffding's D's of quite a few variables with the response
variable is two-fold:
Feature extraction: I had used PROC VARCLUS in the first stage of
reducing the number of numeric variables. In this step, I look at
the p-values of the Spearman correlations and throw away the
variables with "sufficiently high" p-values (e.g., 0.50).
Checking for non-linearity: It is argued that one should be
suspicious of a non-linear relationship between a variable X and the
response variable, should the variable X rank "much differently" with
respect to the Spearman correlations than with respect to the
Hoeffding's D's. In other words, after obtaining for all variables
the Spearman and Hoeffding's D stats using PROC CORR, I rank all the
variables first with respect to Spearman, then Hoeffding's D. If X
ranks 1st w.r.t. Spearman but 50th w.r.t. Hoeffding's D then we
should treat X "cautiously" (and possibly we should "straighten" the
variable using a suitable Tukey transform before putting it in the
logistic regression).
I hope this helps clarify things.
Bora Y.
-----Original Message-----
From: "Zack, Matthew M." <mmz1@cdc.gov>
Sent: 06/02/2006 17:51
To: <BoraYavuz@HSBC.COM.TR>
Subject: RE: Is There a Procedure Like "PROC SURVEYCORR"?
Since PROC CORR calculates Spearman correlations by ranking numerical
variables and performing Pearson correlations on these ranked variables,
you can rank the numerical variables using PROC RANK and then run them
through PROC CORR using pearson as an option rather than spearman (cf., the
PROC CORR documentation). This same documentation shows that weighted
Hoeffding measures might be calculated in the same way.
However, the variance estimates for these correlations would be incorrect.
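A minimal sketch of the rank-then-Pearson approach, assuming the data set,
macro variable and weight names from the original post (work.ranked and
work.wtd_spearman are placeholder names). Note that PROC RANK itself ignores
the sampling weights, so this gives a weighted Pearson correlation of
unweighted ranks, which is not necessarily a design-consistent estimate, and
missing values may be handled differently than with the SPEARMAN option.

```sas
/* Replace each variable by its rank; tied values get the mean rank. */
/* With no RANKS statement, the ranks overwrite the original values  */
/* in the OUT= data set, and final_weight is carried along as is.    */
proc rank data=diff3.dvlpmnt_small_by_is_trtmnt out=work.ranked ties=mean;
   var &all_numeric_vars overall_afa_activity_as_of_oct1;
run;

/* Weighted Pearson correlation of the ranks; WEIGHT is allowed here */
/* because no nonparametric option is requested.                     */
ods output pearsoncorr=work.wtd_spearman;
proc corr data=work.ranked pearson;
   var &all_numeric_vars;
   with overall_afa_activity_as_of_oct1;
   weight final_weight;
run;
```

Only the point estimates from this step should be used; the p-values
printed alongside them rest on the variance that is flagged above as
incorrect.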
When you calculate correlations of ranked variables against a dichotomous
variable, the results are equivalent to a Mann-Whitney-Wilcoxon test, just
as Pearson correlations of cardinal variables against a dichotomous
variable are equivalent to a t-test.
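One way to check this equivalence on a single predictor is sketched below
(cust_age_mnth is picked from the original variable list purely for
illustration, and weights are ignored). The two p-values should be close,
though not necessarily identical, since the procedures may use different
approximations.

```sas
/* Wilcoxon rank-sum (Mann-Whitney) test of one predictor by response. */
proc npar1way data=diff3.dvlpmnt_small_by_is_trtmnt wilcoxon;
   class overall_afa_activity_as_of_oct1;
   var cust_age_mnth;
run;

/* Spearman correlation of the same predictor with the 0-1 response. */
proc corr data=diff3.dvlpmnt_small_by_is_trtmnt spearman;
   var cust_age_mnth;
   with overall_afa_activity_as_of_oct1;
run;
```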
Matthew Zack
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Bora
Yavuz
Sent: Monday, February 06, 2006 7:35 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Is There a Procedure Like "PROC SURVEYCORR"?
Hi,
I'm trying to obtain Hoeffding and Spearman correlations of a set of numeric
variables with a dummy response variable as follows:
ods output spearmancorr= spearman
hoeffdingcorr= hoeffding;
proc corr data= diff3.dvlpmnt_small_by_is_trtmnt spearman hoeffding rank;
var &all_numeric_vars;
with overall_afa_activity_as_of_oct1;
weight final_weight;
run;
Here "&all_numeric_vars" is a global macro variable, which is simply a list
of selected numeric variables whereas "overall_afa_activity_as_of_oct1" is
my 0-1 response variable.
However, I also have a weight variable "final_weight" since I conducted
stratified random sampling before attempting the analysis.
My problem is that I get the following error that says I am not allowed to
use the "WEIGHT" statement in this case since I have stipulated
"non-parametric options". The log is as follows:
20 ods output spearmancorr= spearman
21 hoeffdingcorr= hoeffding;
22
23 proc corr data= diff3.dvlpmnt_small_by_is_trtmnt spearman
hoeffding rank;
SYMBOLGEN: Macro variable ALL_NUMERIC_VARS resolves to
ATM_ISLEM_YILDA_KAC_DONEM AutoToSavingsFlag
BIREYSEL BS_DK_AKTIFLIK_STATUC1 BS_DK_AKTIFLIK_STATUC2
BS_KK_AKTIFLIK_STATUC2
B_DEK_TASIT B_TASIT B_Tipi_Likit_Ort_Adat3
B_Tipi_Likit_Ort_Adat6 CC_flag DEMFON1
DaysSinceLastCCActivity DebitCardFlag DemandFCFlag IB_STATUSUC1
IB_STATUSUC2
INTERNETB_ISLEM_SON_KAC_DONEM ISEMAIL IVP_flag_l3m IVP_flag_l6m
KK_ADV_MRC_ISLEM_ADET1
KK_ADV_ONLY_ISLEM1 KK_EKSTRE_NAKIT_USD1 KK_EK_GRANT KK_EK_USER
KK_EK_USER_CLASSIC
KK_EK_USER_PREMIER KK_NEW_ACQUIRED KK_RE_ACQUIRED
KK_RE_ACQUIRED2 KMH_KREDI_ADET KOS
KR_KREDI_BAKIYE_TOP1 KR_VADELI_DVBAKIYE_DOLAR1 LoansFXFlag
MAAS_ODEME_SISTEMI
MAAS_USD_ADJ MESLEK_TIPIC1 MESLEK_TIPIC2 MV_DVBAKIYE_ORT1 OTOM
SavingsTLFlag
TELEFONB_AGENT_ISLEM_ADET1 TELEFONB_ISLEM_YILDA_KAC_DONEM
afa_activity_L3M
afa_activity_flag6 afa_ownership_flag3 afa_ownership_flag6
avg_lim_util_l3m
avg_limit_l3m avg_ratio_kk_ekstre_faiz_l3m
avg_ratio_kk_ekstre_nakit_l3m
avg_ratio_kk_ekstre_satis_l3m cust_age_mnth
maks_revolving_balance_l3m
max_liquid_fund_31 max_ratio_kk_ekstre_faiz_l6m
max_ratio_kk_ekstre_taksit_l6m
min_ratio_kk_ekstre_gelir_l3m min_ratio_kk_ekstre_nakit_l6m
min_ratio_kk_ekstre_satis_l3m min_ratio_kk_ekstre_taksit_l3m
mt_evtel num_of_rev_l3m
num_of_tel ort_kk_ekstre_satis_l6m ort_oo_tutar_l3m
ort_vdszbak_TL_l3m
perc_sl_liquid_fund_31C1 perc_sl_liquid_fund_31C2
perc_sl_oo_tutar_21C1
perc_sl_oo_tutar_21C2 perc_sl_tot_assets_31C1
perc_sl_tot_assets_31C2
perc_sl_vdszbak_TL_21C1 perc_sl_vdszbak_TL_21C2
ratio_kk_ekstre_gelir1
std_KK_TKSTDHL_ISL_ADET_l3m std_kk_ekstre_odeme_l6m
std_limit_l6m
sum_atm_islem_adet_l3m sum_kk_nakit_l3m sum_mv_vdli_islem_l3m
tot_assets1 u10 u11 u12
u13 x14 x9 z10_l3m
24 var &all_numeric_vars;
25 with overall_afa_activity_as_of_oct1;
26 weight final_weight;
27 run;
ERROR: Nonparametric options not allowed with WEIGHT statement.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE CORR used (Total process time):
real time 0.04 seconds
cpu time 0.01 seconds
WARNING: Output 'hoeffdingcorr' was not created. Make sure that the output
object name, label, or
path is spelled correctly. Also, verify that the appropriate
procedure options are used
to produce the requested output object. For example, verify that
the NOPRINT option is
not used.
WARNING: Output 'spearmancorr' was not created. Make sure that the output
object name, label, or
path is spelled correctly. Also, verify that the appropriate
procedure options are used
to produce the requested output object. For example, verify that
the NOPRINT option is
not used.
--> As I asked in the title of this message, "Is There a Procedure Like
"PROC SURVEYCORR"?" whereby I can use the "WEIGHT" statement?
--> Or simply put, how can I derive the desired non-parametric statistics
in this case where I have sampling weights?
Thank you very much in advance,
Bora Y.