
Business Analytics

Logistic regression

Pristine www.edupristine.com
Logistic Regression
I. Problem with fitting linear regression on data having Binary Response variable

II. Introduction to Generalized Linear Modeling (GLMs)

III. Logistic Regression Theory

IV. Logistic Regression Case


i. Fitting the regression using R
ii. Scoring equation
iii. Model diagnostics
iv. Gains curve and Gini
v. K-S stat
V. Score Card Development

Case- Multivariate Logistic Regression
Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data containing
credit and personal information for a group of customers, some of whom had defaulted on the
payment of their balance due. He asked Romanov to identify and quantify, in a multivariate
fashion, the factors responsible for the defaults, and to find the probability of default
for each customer. Romanov has no knowledge of running a multivariate regression which
predicts the probability of occurrence of an event (default, in this case).

Now suppose he approaches you and requests your help to complete the assignment. Let's help
Romanov solve the problem.

Case- Multivariate Logistic Regression
In due course of helping Romanov complete his task, we will walk him through the following steps:
Variable identification
Identifying the dependent (response) variable.
Identifying the independent (explanatory) variables.
Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
Creation of Data Dictionary
Response variable exploration
Distribution analysis
Outlier treatment
Not required for Binary variables
Independent variables analyses
Identify the prospective independent variables (that can explain response variable)
Bivariate analysis of response variable against independent variables
Variable treatment /transformation
Creation of dummy variables corresponding to the categorical variables
Grouping of distinct values/levels
Mathematical transformation e.g. log, splines etc.
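The dummy-variable step above can be sketched in Python (used here purely for illustration; the case itself is run in R later). The level names and the helper function are hypothetical:

```python
def make_dummies(values, base_level):
    """One 0/1 dummy per observed level, except the chosen base level."""
    levels = sorted(set(values) - {base_level})
    return [{f"dummy_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical checking-account levels; A14 is treated as the base level.
rows = make_dummies(["A11", "A14", "A12", "A14"], base_level="A14")
# An A14 record is encoded implicitly: all of its dummies are 0.
```

The base level needs no dummy of its own; it is represented by all dummies being zero, which is what makes the remaining coefficients interpretable relative to it.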

Case- Multivariate Logistic Regression
Fitting the regression
Run the appropriate regression
Exclude the insignificant variables
Check for correlation between independent variables
This is to take care of Multicollinearity
Re-fit the regression after fixing for Multicollinearity
Construction of scoring equation
Analysis of results
Check for reduction in Deviance/AIC
Model performance check
Actual vs Predicted comparison
Lift/Gains chart and Gini coefficient
K-S stat

Understanding the Data
Response variable
Default_on_Payment
Binary in nature
Takes (0,1) as values corresponding to (no default, default)
Independent variables (also called as attributes in credit industry)
20 attributes
3 numerical and 17 categorical
For details refer to
Data dictionary
Data Dictionary- Analysis_of_Default.docx
Data file
Analysis_of_Default.xlsx

Response variable analysis
Frequency distribution
Default_On_Payment # Observations
0 3505
1 1495
Total 5000

Event rate (for a Binary response variable)


Proportion of 1s in the total number of observations
Over here Default_On_Payment Rate = Event Rate
Event Rate = 1495/5000 = 29.9%
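The event-rate arithmetic, as a quick Python check (counts taken from the frequency table above):

```python
# Counts from the frequency distribution: 1495 defaults out of 5000 records.
defaults, non_defaults = 1495, 3505
event_rate = defaults / (defaults + non_defaults)   # proportion of 1s
```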

Bivariate Analysis (of Independent variables)
1 Status_Checking_Acc # Observations Default_On_Payment Default Rate
A11 1370 675 49.27%
A12 1345 520 38.66%
A13 315 70 22.22%
A14 1970 230 11.68%

2 Duration_in_Months # Observations Default_On_Payment Default Rate
12 1795 380 21.2%
24 2055 610 29.7%
36 715 285 39.9%
48 355 180 50.7%
60 80 40 50.0%
72 5 5 100.0%
(These are transformed values. For original values refer to Analysis_of_Default.xlsx)

3 Credit_History # Observations Default_On_Payment Default Rate
A30 200 125 62.50%
A31 245 140 57.14%
A32 2650 840 31.70%
A33 440 140 31.82%
A34 1465 250 17.06%

Bivariate Analysis (of Independent variables)
4 Purposre_Credit_Taken # Observations Default_On_Payment Default Rate
A40 1170 445 38.03%
A41 515 85 16.50%
A42 905 290 32.04%
A43 1400 305 21.79%
A44 60 20 33.33%
A45 110 40 36.36%
A46 250 110 44.00%
A48 45 5 11.11%
A49 485 170 35.05%
A410 60 25 41.67%
Legend: A40: car (new); A41: car (used); A42: furniture/equipment; A43: radio/television; A44: domestic appliances; A45: repairs; A46: education; A47: (vacation - does not exist?); A48: retraining; A49: business; A410: others
5 Credit_Amount # Observations Default_On_Payment Default Rate
4000 3770 975 25.86%
11000 1085 420 38.71%
12000 145 100 68.97%
(These are transformed values. For original values refer to Analysis_of_Default.xlsx)

6 Savings_Acc # Observations Default_On_Payment Default Rate


A61 3015 1080 35.82%
A62 515 170 33.01%
A63 315 55 17.46%
A64 240 30 12.50%
A65 915 160 17.49%

Bivariate Analysis (of Independent variables)
Years_At_Present_Employment # Observations Default_On_Payment Default Rate
7 A71 310 115 37.10%
A72 860 350 40.70%
A73 1695 515 30.38%
A74 870 195 22.41%
A75 1265 320 25.30%

8 Inst_Rt_Income # Observations Default_On_Payment Default Rate


1 680 170 25.00%
2 1155 305 26.41%
3 785 225 28.66%
4 2380 795 33.40%

9 Marital_Status_Gender # Observations Default_On_Payment Default Rate


A91 250 100 40.00%
A92 1550 540 34.84%
A93 2740 730 26.64%
A94 460 125 27.17%

10 Other_Debtors_Guarantors # Observations Default_On_Payment Default Rate


A101 4535 1355 29.88%
A102 205 90 43.90%
A103 260 50 19.23%

Bivariate Analysis (of Independent variables)
11 Current_Address_Yrs # Observations Default_On_Payment Default Rate
1 650 180 27.69%
2 1540 480 31.17%
3 745 215 28.86%
4 2065 620 30.02%

12 Property # Observations Default_On_Payment Default Rate


A121 1410 295 20.92%
A122 1160 355 30.60%
A123 1660 510 30.72%
A124 770 335 43.51%

13 Age:
Refer to Analysis_of_Default.xlsx
14 Other_Inst_Plans # Observations Default_On_Payment Default Rate
A141 695 285 41.01%
A142 235 95 40.43%
A143 4070 1115 27.40%

15 Housing # Observations Default_On_Payment Default Rate
A151 895 350 39.11%
A152 3565 925 25.95%
A153 540 220 40.74%

Bivariate Analysis (of Independent variables)
16 Num_CC # Observations Default_On_Payment Default Rate
1 3165 995 31.44%
2 1665 460 27.63%
3 140 30 21.43%
4 30 10 33.33%

17 Job # Observations Default_On_Payment Default Rate


A171 110 35 31.82%
A172 1000 280 28.00%
A173 3150 925 29.37%
A174 740 255 34.46%

18 Dependents # Observations Default_On_Payment Default Rate


1 4225 1265 29.94%
2 775 230 29.68%

19 Telephone # Observations Default_On_Payment Default Rate


A191 2980 930 31.21%
A192 2020 565 27.97%

20 Foreign_Worker # Observations Default_On_Payment Default Rate


A201 4815 1475 30.63%
A202 185 20 10.81%

Dummy variable (0,1) creation for Categorical variables
1 Status_Checking_Acc # Observations Default_On_Payment Default Rate
A11 1370 675 49.27%
A12 1345 520 38.66%
A13 315 70 22.22%
A14 1970 230 11.68%
Three dummies for A11, A12 and A13. A14 will be the base level.

2 Duration_in_Months # Observations Default_On_Payment Default Rate
12 1795 380 21.2%
24 2055 610 29.7%
36 715 285 39.9%
48 355 180 50.7%
60 80 40 50.0%
72 5 5 100.0%
Will be used as a continuous variable.

3 Credit_History # Observations Default_On_Payment Default Rate
A30 200 125 62.50%
A31 245 140 57.14%
A32 2650 840 31.70%
A33 440 140 31.82%
A34 1465 250 17.06%
Four dummies for A30, A31, A33 and A34. A32 will be the base level.

Dummy variable (0,1) creation for Categorical variables
4 Purposre_Credit_Taken # Observations Default_On_Payment Default Rate
A40 1170 445 38.03%
A41 515 85 16.50%
A42 905 290 32.04%
A43 1400 305 21.79%
A44 60 20 33.33%
A45 110 40 36.36%
A46 250 110 44.00%
A48 45 5 11.11%
A49 485 170 35.05%
A410 60 25 41.67%
Six dummies for A41, A42, A43 & A44, A45, A46 and A49. A40, A48 and A410 will be the base levels.

5 Credit_Amount # Observations Default_On_Payment Default Rate
4000 3770 975 25.86%
11000 1085 420 38.71%
12000 145 100 68.97%
Two dummies for 11000 and 12000. 4000 will be the base level.

6 Savings_Acc # Observations Default_On_Payment Default Rate
A61 3015 1080 35.82%
A62 515 170 33.01%
A63 315 55 17.46%
A64 240 30 12.50%
A65 915 160 17.49%
Four dummies for A62-A65. A61 will be the base level.

Dummy variable (0,1) creation for Categorical variables
7 Years_At_Present_Employment # Observations Default_On_Payment Default Rate
A71 310 115 37.10%
A72 860 350 40.70%
A73 1695 515 30.38%
A74 870 195 22.41%
A75 1265 320 25.30%
Four dummies for the rest of the levels. A73 will be the base level.

8 Inst_Rt_Income # Observations Default_On_Payment Default Rate
1 680 170 25.00%
2 1155 305 26.41%
3 785 225 28.66%
4 2380 795 33.40%
Continuous variable.

9 Marital_Status_Gender # Observations Default_On_Payment Default Rate
A91 250 100 40.00%
A92 1550 540 34.84%
A93 2740 730 26.64%
A94 460 125 27.17%
Three dummies for the rest. A93 will be the base level.

10 Other_Debtors_Guarantors # Observations Default_On_Payment Default Rate
A101 4535 1355 29.88%
A102 205 90 43.90%
A103 260 50 19.23%
Two dummies for the rest. A101 will be the base level.

Dummy variable (0,1) creation for Categorical variables
11 Current_Address_Yrs # Observations Default_On_Payment Default Rate
1 650 180 27.69%
2 1540 480 31.17%
3 745 215 28.86%
4 2065 620 30.02%
Continuous variable.

12 Property # Observations Default_On_Payment Default Rate
A121 1410 295 20.92%
A122 1160 355 30.60%
A123 1660 510 30.72%
A124 770 335 43.51%
Three dummies for the rest. A123 will be the base level.

13 Age:
Continuous variable.
Refer to Analysis_of_Default.xlsx
14 Other_Inst_Plans # Observations Default_On_Payment Default Rate
A141 695 285 41.01%
A142 235 95 40.43%
A143 4070 1115 27.40%
Two dummies for the rest. A143 will be the base level.

15 Housing # Observations Default_On_Payment Default Rate
A151 895 350 39.11%
A152 3565 925 25.95%
A153 540 220 40.74%
Two dummies for the rest. A152 will be the base level.

Dummy variable (0,1) creation for Categorical variables
16 Num_CC # Observations Default_On_Payment Default Rate
1 3165 995 31.44%
2 1665 460 27.63%
3 140 30 21.43%
4 30 10 33.33%
Continuous variable.

17 Job # Observations Default_On_Payment Default Rate
A171 110 35 31.82%
A172 1000 280 28.00%
A173 3150 925 29.37%
A174 740 255 34.46%
Three dummies for the rest. A173 will be the base level.

18 Dependents # Observations Default_On_Payment Default Rate
1 4225 1265 29.94%
2 775 230 29.68%
Continuous variable.

19 Telephone # Observations Default_On_Payment Default Rate
A191 2980 930 31.21%
A192 2020 565 27.97%
One dummy for A192. A191 is the base level.

20 Foreign_Worker # Observations Default_On_Payment Default Rate
A201 4815 1475 30.63%
A202 185 20 10.81%
96.3% of the population lies in level A201, so we should not use this variable.

Fitting a Linear Regression on Binary Data
Standardize the variable name
Replace each variable name by Attr<i>_Dummy<j>
where i = 1, ..., 19 (we will use 19 attributes for our analysis)
and j is the respective number of dummies for the individual attribute
Create a *.csv file and save it on the local disk
We will use R to run the regression
Read the *.csv file into the R environment and store it in the object DefaultData.
Fit the regression having formula
lm(Default_On_Payment ~ Attr1_Dummy1 + Attr1_Dummy2 + Attr1_Dummy3 + Attr2_Trans +
Attr3_dummy1 + Attr3_dummy2 + Attr3_dummy3 + Attr3_dummy4 + Attr4_Dummy1 + Attr4_Dummy2 +
Attr4_Dummy3 + Attr4_Dummy4 + Attr4_Dummy5 + Attr4_Dummy6 + Attr5_Dummy1 + Attr5_Dummy2 +
Attr6_Dummy1 + Attr6_Dummy2 + Attr6_Dummy3 + Attr6_Dummy4 + Attr7_Dummy1 + Attr7_Dummy2 +
Attr7_Dummy3 + Attr7_Dummy4 + Attr8 + Attr9_Dummy1 + Attr9_Dummy2 + Attr9_Dummy3 +
Attr10_Dummy1 + Attr10_Dummy2 + Attr11 + Attr12_Dummy1 + Attr12_Dummy2 + Attr12_Dummy3 +
Attr13 + Attr14_Dummy + Attr15_Dummy + Attr16 + Attr17_Dummy1 + Attr17_Dummy2 + Attr17_Dummy3
+ Attr18 + Attr19_Dummy, data = DefaultData)

Fitting a Linear Regression on Binary Data
Regression output:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0283 0.0438 0.6470 0.517609
Attr1_Dummy1 0.2725 0.0151 17.9900 < 2e-16
Attr1_Dummy2 0.1823 0.0148 12.2820 < 2e-16
Attr1_Dummy3 0.0776 0.0245 3.1740 0.001515
Attr2_Trans 0.0032 0.0006 5.2480 1.60E-07
Attr3_dummy1 0.1183 0.0312 3.7880 0.000153
Attr3_dummy2 0.1456 0.0278 5.2430 1.65E-07
Attr3_dummy3 -0.0430 0.0222 -1.9350 0.053051
Attr3_dummy4 -0.1180 0.0160 -7.3540 2.24E-13
Attr4_Dummy1 -0.2047 0.0219 -9.3400 < 2e-16
Attr4_Dummy2 -0.0920 0.0179 -5.1340 2.94E-07
Attr4_Dummy3 -0.1021 0.0159 -6.4330 1.37E-10
Attr4_Dummy4 -0.0239 0.0397 -0.6010 0.547956
Attr4_Dummy5 0.0436 0.0279 1.5660 0.117491
Attr4_Dummy6 -0.0791 0.0224 -3.5390 0.000406
Attr5_Dummy1 0.1297 0.0176 7.3820 1.82E-13
Attr5_Dummy2 0.3394 0.0385 8.8240 < 2e-16
Attr6_Dummy1 -0.0391 0.0195 -2.0090 0.044631
Attr6_Dummy2 -0.0734 0.0241 -3.0420 0.002363
Attr6_Dummy3 -0.1409 0.0271 -5.2010 2.07E-07
Attr6_Dummy4 -0.1231 0.0156 -7.8700 4.32E-15
Attr7_Dummy1 0.0272 0.0291 0.9350 0.350063
Attr7_Dummy2 0.0409 0.0170 2.4010 0.016408
Attr7_Dummy3 -0.0772 0.0167 -4.6200 3.93E-06
Attr7_Dummy4 -0.0118 0.0165 -0.7130 0.475975
Attr8 0.0503 0.0055 9.1390 < 2e-16
Attr9_Dummy1 0.1411 0.0270 5.2210 1.85E-07
Attr9_Dummy2 0.0772 0.0140 5.5100 3.77E-08
Attr9_Dummy3 0.0373 0.0213 1.7530 0.079599
Attr10_Dummy1 0.0575 0.0287 2.0040 0.045086
Attr10_Dummy2 -0.1694 0.0265 -6.3970 1.73E-10
Attr11 -0.0030 0.0058 -0.5100 0.610103
Attr12_Dummy1 -0.0438 0.0156 -2.8080 0.005007
Attr12_Dummy2 -0.0028 0.0157 -0.1780 0.859092
Attr12_Dummy3 0.0235 0.0207 1.1350 0.256506
Attr13 -0.0021 0.0006 -3.3600 0.000786
Attr14_Dummy 0.0776 0.0152 5.0910 3.69E-07
Attr15_Dummy 0.0561 0.0153 3.6700 0.000245
Attr16 0.0343 0.0124 2.7740 0.005551
Attr17_Dummy1 -0.0860 0.0442 -1.9470 0.051566
Attr17_Dummy2 -0.0208 0.0154 -1.3490 0.177527
Attr17_Dummy3 -0.0411 0.0192 -2.1400 0.03239
Attr18 0.0321 0.0167 1.9190 0.055088
Attr19_Dummy -0.0405 0.0130 -3.1250 0.001786
Fitting a Linear Regression on Binary Data

The predicted Default_On_Payment takes values outside the range [0, 1]
We intend to predict the probability of default, which must lie in the range (0, 1)
We cannot achieve this by using Linear Regression
What is the solution?
Use Logistic Regression
o A subset of Generalized Linear Models (GLMs)
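The reason logistic regression solves the range problem can be sketched in Python (illustrative only): the inverse-logit function squashes any linear-predictor value into the interval (0, 1).

```python
import math

def inverse_logit(eta):
    """Map an unbounded linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# However extreme the linear predictor, the output is a valid probability.
probs = [inverse_logit(eta) for eta in (-30, -2, 0, 2, 30)]
```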

Generalized Linear Models (GLMs)
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that
allows for response variables that have other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response
variable via a link function and by allowing the magnitude of the variance of each measurement
to be a function of its predicted value.
It uses an iteratively reweighted least squares method for maximum likelihood estimation of the
model parameters.
The GLM consists of three elements:
A probability distribution from the exponential family.
A linear predictor η = Xβ.
A link function g such that E(Y) = μ = g⁻¹(η).

Generalized Linear Models (GLMs)- Exponential family

A distribution for a random variable Y belongs to an exponential family if its density has the
following form:

f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }

where a, b and c are functions.

The parameter θ is called the natural parameter; it is the one relevant to the model for
relating the response (Y) to the covariates.
φ is known as the scale parameter or dispersion parameter.
For example, for the Normal distribution μ is the natural parameter and σ² is the dispersion parameter.

Generalized Linear Models (GLMs)- Binomial Distribution

Suppose Z ~ Binomial(n, π). Let Y = Z/n, so that Z = nY.

The distribution of Z is
P(Z = z) = (n choose z) π^z (1 − π)^(n − z)
Substituting Z = nY, the distribution of Y can be written as
P(Y = y) = exp{ n[y log(π/(1 − π)) + log(1 − π)] + log (n choose ny) }
This matches the generic form of the exponential family, with natural parameter
θ = log(π/(1 − π)), b(θ) = −log(1 − π) and dispersion a(φ) = 1/n.
GLMs Linear Predictor and Link Function

The linear predictor is the quantity which incorporates the information about the independent
variables into the model. The symbol η (Greek "eta") denotes a linear predictor.
It is related to the expected value of the data (thus, "predictor") through the link function.
η is expressed as a linear combination (thus, "linear") of unknown parameters β:
η = Σᵢ xᵢβᵢ
If g is the link function, then g(μ) = η
Some common link functions: identity (g(μ) = μ), log (g(μ) = log μ) and logit (g(μ) = log[μ/(1 − μ)]).
Logistic Regression - Introduction
Logistic regression is used to analyze relationships between a dichotomous dependent variable
and metric or dichotomous independent variables.
Logistic regression combines the independent variables to estimate the probability that a
particular event will occur, i.e. a subject will be a member of one of the groups defined by the
dichotomous dependent variable.
The value produced by logistic regression is a probability value between 0.0 and 1.0.
If the probability for group membership in the modeled category is above some cut point (the
default is 0.50), the subject is predicted to be a member of the modeled group. If the probability
is below the cut point, the subject is predicted to be a member of the other group.
For any given case, logistic regression computes the probability that a case with a particular set of
values for the independent variable is a member of the modeled category.

Logistic Regression Variable Requirements
Logistic regression analysis requires that the dependent variable be dichotomous.
Logistic regression analysis requires that the independent variables be numerical or dichotomous.
If an independent variable is nominal level and not dichotomous, we need to dummy code the
variable.
Logistic regression does not make any assumptions of normality, linearity, and homogeneity of
variance for the independent variables.
The regression equation is
log(p/(1 − p)) = a + b₁x₁ + b₂x₂ + ... + bₙxₙ
log(p/(1 − p)) = η, the linear predictor

Logistic Regression Methods for including variables
There are three methods available for including variables in the regression equation:
1. The simultaneous method in which all independents are included at the same time
2. The hierarchical method in which control variables are entered in the analysis before the predictors whose
effects we are primarily concerned with.
3. The stepwise method (in-built functionality in SAS and SPSS) in which variables are selected in the order in
which they maximize the statistically significant contribution to the model.
For all methods, the contribution to the model is measured by the model Deviance or AIC.
A better model will have a lower Deviance/AIC
Deviance is calculated from Maximum-likelihood estimation (MLE)
MLE is an iterative procedure that successively works to get closer and closer to the
correct answer.
A perfect model has log-likelihood 0, and hence Deviance 0
Deviance = 2 × (log-likelihood of perfect model − log-likelihood of current model)
AIC = Deviance + 2 × (number of parameters in the model)
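The Deviance and AIC formulas above, as a small Python sketch (the log-likelihood value and parameter count are made up for illustration):

```python
def deviance(log_lik_model, log_lik_perfect=0.0):
    """Deviance = 2 * (logL of perfect model - logL of current model)."""
    return 2.0 * (log_lik_perfect - log_lik_model)

def aic(log_lik_model, n_params):
    """AIC = Deviance + 2 * number of parameters (i.e. -2*logL + 2k)."""
    return deviance(log_lik_model) + 2 * n_params

# A hypothetical fitted model with log-likelihood -250 and 10 parameters.
dev = deviance(-250.0)
```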

Logistic Regression Overall Test of Relationship
The overall test of relationship among the independent variables and groups defined by the
dependent is based on the reduction in the likelihood values for a model which does not contain
any independent variables and the model that contains the independent variables.

This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square.

The significance test for the model chi-square is our statistical evidence of the presence of a
relationship between the dependent variable and the combination of the independent variables.

Running Logistic Regression Using R
Steps:
1. Save the *.csv file on hard disk
2. Read the *.csv file and store in a R object
<RDataObject> <- read.csv(file = "<path>/<file>.csv")
3. Run the regression
<RegressionObject> <- glm(response ~ x1 + x2 + ... + xn, family = binomial(link = "logit"), data = <RDataObject>)
4. Identify the insignificant variables and remove them
5. Re-run the regression

Running Logistic Regression Using R
Steps:
Save the *.csv file on hard disk
DefaultData <- read.csv(file = "D:\\Logistic Regression Using R\\Analysis_of_Default.csv")
Run the regression
lgOut = glm(Default_On_Payment ~ Attr1_Dummy1 + Attr1_Dummy2 + Attr1_Dummy3 + Attr2_Trans +
Attr3_dummy1 + Attr3_dummy2 + Attr3_dummy3 + Attr3_dummy4 + Attr4_Dummy1 + Attr4_Dummy2 +
Attr4_Dummy3 + Attr4_Dummy4 + Attr4_Dummy5 + Attr4_Dummy6 + Attr5_Dummy1 + Attr5_Dummy2 +
Attr6_Dummy1 + Attr6_Dummy2 + Attr6_Dummy3 + Attr6_Dummy4 + Attr7_Dummy1 + Attr7_Dummy2 +
Attr7_Dummy3 + Attr7_Dummy4 + Attr8 + Attr9_Dummy1 + Attr9_Dummy2 + Attr9_Dummy3 +
Attr10_Dummy1 + Attr10_Dummy2 + Attr11 + Attr12_Dummy1 + Attr12_Dummy2 + Attr12_Dummy3 + Attr13 +
Attr14_Dummy + Attr15_Dummy + Attr16 + Attr17_Dummy1 + Attr17_Dummy2 + Attr17_Dummy3 + Attr18 +
Attr19_Dummy , family=binomial (logit), data= DefaultData)

Running Logistic Regression Using R
Insignificant variables
1. Attr4_Dummy4 2. Attr4_Dummy5
3. Attr6_Dummy1 4. Attr7_Dummy1
5. Attr7_Dummy2 6. Attr7_Dummy4
7. Attr10_Dummy1 8. Attr11
9. Attr12_Dummy2 10. Attr12_Dummy3
11.Attr17_Dummy1 12. Attr17_Dummy2
13.Attr17_Dummy3
Exclude these variables and re-run regression

Running Logistic Regression Using R
Model coefficients

Running Logistic Regression Using R- Checking for
Multicollinearity
The square roots of the VIFs are all < 2 for the retained variables;
hence there is no multicollinearity.

Scoring Equation
log {Probability of Default / (1 - Probability of Default)} =
-3.050495
+ 1.725226 * Attr1_Dummy1 + 1.265737 * Attr1_Dummy2 + 0.712851 * Attr1_Dummy3
+ 0.02124 * Attr2_Trans + 0.404285 * Attr3_dummy1 + 0.649632 * Attr3_dummy2
- 0.284722 * Attr3_dummy3 - 0.897908 * Attr3_dummy4 - 1.623227 * Attr4_Dummy1
- 0.545947 * Attr4_Dummy2 - 0.678789 * Attr4_Dummy3 - 0.553012 * Attr4_Dummy6
+ 0.897737 * Attr5_Dummy1 + 2.031098 * Attr5_Dummy2 - 0.365587 * Attr6_Dummy2
- 1.133627 * Attr6_Dummy3 - 0.905696 * Attr6_Dummy4 - 0.594617 * Attr7_Dummy3
+ 0.345461 * Attr8 + 0.870902 * Attr9_Dummy1 + 0.582205 * Attr9_Dummy2
+ 0.38614 * Attr9_Dummy3 - 1.044407 * Attr10_Dummy2 - 0.269308 * Attr12_Dummy1
- 0.016944 * Attr13 + 0.577225 * Attr14_Dummy + 0.412508 * Attr15_Dummy
+ 0.247984 * Attr16 + 0.250946 * Attr18 - 0.293946 * Attr19_Dummy
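Plugging a hypothetical customer into the scoring equation (Python for illustration; only two regressors are set non-zero, all others assumed zero):

```python
import math

# Hypothetical customer: checking-account status A11 (Attr1_Dummy1 = 1),
# age 30 (Attr13 = 30), every other regressor set to zero for illustration.
log_odds = -3.050495 + 1.725226 * 1 + (-0.016944) * 30

# Invert the log-odds to get the probability of default.
prob_default = 1.0 / (1.0 + math.exp(-log_odds))
```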

Logistic Regression Gains curve and Gini
Bins # Observations Actual Default Predicted Default Cumulative Actual Lift Random Area
0 0 0 0 0 0.00% 0.00% 0.00000
1 500 370 394 370 24.75% 10.00% 0.01237
2 500 340 304 710 47.49% 20.00% 0.03612
3 500 235 240 945 63.21% 30.00% 0.05535
4 500 160 186 1105 73.91% 40.00% 0.06856
5 500 140 132 1245 83.28% 50.00% 0.07860
6 500 90 93 1335 89.30% 60.00% 0.08629
7 500 80 65 1415 94.65% 70.00% 0.09197
8 500 60 44 1475 98.66% 80.00% 0.09666
9 500 15 27 1490 99.67% 90.00% 0.09916
10 500 5 11 1495 100.00% 100.00% 0.09983
Gini = 44.98%
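The Gini figure can be reproduced from the per-decile default counts in the table above (Python sketch; trapezoidal area under the gains curve, Gini = 2 × AUC − 1):

```python
# Per-decile actual defaults from the gains table above.
defaults_per_bin = [370, 340, 235, 160, 140, 90, 80, 60, 15, 5]
total = sum(defaults_per_bin)

cum, auc = 0, 0.0
for d in defaults_per_bin:
    prev = cum / total
    cum += d
    auc += (prev + cum / total) / 2 * 0.1   # trapezoid over each 10% slice

gini = 2 * auc - 1   # about 0.4498, matching the 44.98% in the table
```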

Logistic Regression KS Stat
Bins # Observations Actual Default Actual Non Default Predicted Default Cumulative Default Cumulative Non Default Lift- Default Lift- Non Default Random Area Difference: Cumulative Default & Non Default
0 0 0 0 0 0 0 0.00% 0.00% 0% 0.00000 0.00%
1 500 370 130 394 370 130 24.75% 3.71% 10% 0.01237 21.04%
2 500 340 160 304 710 290 47.49% 8.27% 20% 0.03612 39.22%
3 500 235 265 240 945 555 63.21% 15.83% 30% 0.05535 47.38%
4 500 160 340 186 1105 895 73.91% 25.53% 40% 0.06856 48.38%
5 500 140 360 132 1245 1255 83.28% 35.81% 50% 0.07860 47.47%
6 500 90 410 93 1335 1665 89.30% 47.50% 60% 0.08629 41.79%
7 500 80 420 65 1415 2085 94.65% 59.49% 70% 0.09197 35.16%
8 500 60 440 44 1475 2525 98.66% 72.04% 80% 0.09666 26.62%
9 500 15 485 27 1490 3010 99.67% 85.88% 90% 0.09916 13.79%
10 500 5 495 11 1495 3505 100.00% 100.00% 100% 0.09983 0.00%
Gini = 44.98%

The difference between the cumulative % of Defaults (Bad) and Non Defaults (Good) is 48.38%,
occurring at the 4th decile
At the 4th decile, the model is able to capture 73.91% of total bads
i.e. if we follow the model, in 40% of the records we can capture 73.91% of total bads
The K-S of the model is 48.38%, at the 4th decile
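The K-S computation from the table above, sketched in Python (maximum gap between the cumulative shares of defaults and non-defaults):

```python
# Per-decile defaults (bads) and non-defaults (goods) from the K-S table.
bads = [370, 340, 235, 160, 140, 90, 80, 60, 15, 5]
goods = [130, 160, 265, 340, 360, 410, 420, 440, 485, 495]

cum_bad = cum_good = 0
gaps = []
for b, g in zip(bads, goods):
    cum_bad += b
    cum_good += g
    gaps.append(cum_bad / sum(bads) - cum_good / sum(goods))

ks = max(gaps)                  # maximum separation between bads and goods
ks_decile = gaps.index(ks) + 1  # the decile at which it occurs
```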

Scorecard Development
I. Need for a scorecard

II. Conversion of Logistic Regression Output to a Score

III. Performance Measure of Score

Why Scorecard?
Output of logistic regression is probability values in the range of (0,1)
Model results are used by
Risk managers to select/reject customers
Management to take strategic decisions
End customers to assess their creditworthiness
As an input to some other model
Using the raw output will be cumbersome and will pose technical and mathematical challenges to
deal with
Conversion of raw probabilities to a more readable format (without compromising on the
effectiveness) will solve the above mentioned problems
The transformed probabilities are the scores and the set is the Scorecard

Transformation of Probabilities to Scores
Decide upon the range of the score, i.e. its upper and lower limits
Say, Score_Lower_Bound (A) and Score_Upper_Bound (B)
Case 1: If we have modeled an outcome that has an adverse impact on business, say Default on
Payment, then
o The higher the probability of default, the lower the score
Case 2: If the modeled outcome is favorable to business, say new customer/business
acquisition, then
o The higher the probability of conversion, the higher the score
Finally, transform the probability in a linear fashion to get the scorecard
o Case 1: Score = B - probability * (B - A)
o Case 2: Score = A + probability * (B - A)
A scorecard for the Default on Payment model in the range (100, 200) can be created using
the transformation
o Score = 200 - probability * (200 - 100)
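The two score transformations, as a Python sketch (range (100, 200) as in the slide; function names are ours):

```python
def score_adverse(p, lower=100, upper=200):
    """Case 1 (adverse outcome modeled): higher probability -> lower score."""
    return upper - p * (upper - lower)

def score_favorable(p, lower=100, upper=200):
    """Case 2 (favorable outcome modeled): higher probability -> higher score."""
    return lower + p * (upper - lower)
```

For example, a customer with a 29.9% probability of default scores about 170 on the (100, 200) card.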

Scorecard Performance Checks
Basic checks are:
1. How well the scorecard rank orders the Target Variable
2. Gini coefficient
3. K-S stat

Steps involved are:


1. Sort the data/records by Score
2. Create bins of equal size (an equal number of observations in each bin)
3. Calculate average of Target Variable in each bin
4. Calculate Gini coefficient (as described in earlier section)
5. Calculate K-S (as described in earlier section)

Scorecard Performance Checks- Rank Ordering and Gini
Records sorted by Score (Ascending)
Equal Obs Bin # Obs # Default on Payment Cumulative # Default on Payment Lift Random Area Under Curve Gini
0 0 0 0 0.00% 0% 0.000 44.98%
1 500 370 370 24.75% 10% 0.012
2 500 340 710 47.49% 20% 0.036
3 500 235 945 63.21% 30% 0.055
4 500 160 1105 73.91% 40% 0.069
5 500 140 1245 83.28% 50% 0.079
6 500 90 1335 89.30% 60% 0.086
7 500 80 1415 94.65% 70% 0.092
8 500 60 1475 98.66% 80% 0.097
9 500 15 1490 99.67% 90% 0.099
10 500 5 1495 100.00% 100% 0.100

Scorecard Performance Checks- KS Stat
KS - Maximum separation between good and bad
Equal Obs Bin # Obs # Default on Payment # Non Default on Payment Cumulative Default on Payment Cumulative Non Default on Payment Lift- Default on Payment Lift- Non Default on Payment Random Difference- Cumulative Default vs Non Default
0 0 0 0 0 0 0.00% 0.00% 0% 0.00%
1 500 370 130 370 130 24.75% 3.71% 10% 21.04%
2 500 340 160 710 290 47.49% 8.27% 20% 39.22%
3 500 235 265 945 555 63.21% 15.83% 30% 47.38%
4 500 160 340 1105 895 73.91% 25.53% 40% 48.38%
5 500 140 360 1245 1255 83.28% 35.81% 50% 47.47%
6 500 90 410 1335 1665 89.30% 47.50% 60% 41.79%
7 500 80 420 1415 2085 94.65% 59.49% 70% 35.16%
8 500 60 440 1475 2525 98.66% 72.04% 80% 26.62%
9 500 15 485 1490 3010 99.67% 85.88% 90% 13.79%
10 500 5 495 1495 3505 100.00% 100.00% 100% 0.00%

The difference between the cumulative % of Defaults (Bad) and Non Defaults (Good) is 48.38%,
occurring at the 4th decile
At the 4th decile, the model is able to capture 73.91% of total bads
i.e. if we follow the model, in 40% of the records we can capture 73.91% of total bads
The K-S of the model is 48.38%, at the 4th decile

Scorecard Performance Checks- Comparison with Model results

Comparing the performance of Scorecard with original model results

Model Score Card

Gini Coeff 44.98% 44.98%

KS 48.38% 48.38%
A properly transformed scorecard produces the same results as the underlying model output

Logistic Regression Demonstration in SAS Language

Start WPS Session
Start WPS session from Start Menu and select a suitable workspace
From the File Menu create a new Project


Start WPS Session

Understanding the Data
Response variable
Default_on_Payment
Binary in nature
Takes (0,1) as values corresponding to (no default, default)
Independent variables (also called as attributes in credit industry)
20 attributes
3 numerical and 17 categorical
For details refer to
Data dictionary
Data Dictionary- Analysis_of_Default.docx
Data file
Analysis_of_Default.xlsx

Data Exploration- Create SAS Script


Data Exploration- Assigning library and Import Data
/*Assign library where data will be stored*/

libname logit 'D:\Analytics\Logistic Regression';

Data Exploration- Assigning library and Import Data


Data Exploration- Assigning library and Import Data


Data Exploration- Assigning library and Import Data


*Using "proc import" to import csv file;


proc import
datafile = "D:\Analytics\Logistic Regression\Default_On_Payment.csv"
out = logit.Default_On_Payment dbms = csv replace;
delimiter = ',';
getnames = yes;
run;

Data Exploration- Contents of Data
*Content of data;
proc contents data = logit.Default_On_Payment; run;

Response variable analysis
Frequency distribution
SAS code
*Response variable- "Default_On_Payment";
proc freq
data= logit.Default_On_Payment;
tables Default_On_Payment/list missing;
run;

The WPS System

The FREQ Procedure

Default_On_Payment Frequency Percent Cumulative Frequency Cumulative Percent
0 28040 70.10 28040 70.10
1 11960 29.90 40000 100.00

Event rate (for a Binary response variable)


Proportion of 1s in the total number of observations
Over here Default_On_Payment Rate = Event Rate
Event Rate = 11960/40000 = 29.9%

Bivariate Analysis (of Independent variables)
SAS code *Bivariate Analysis of independent variable;
%macro mBivariate(var);
proc sql;
Create table &var._tab as
select &var, count(*) as freq,
sum(Default_On_Payment) as Default_On_Payment
from logit.Default_On_Payment
group by &var;
quit;

data &var._tab;
set &var._tab;
Default_Rate = Default_On_Payment/freq;
run;

proc print data = &var._tab; run;


%mend mBivariate;
%mBivariate(Status_Checking_Acc);
%mBivariate(Duration_in_Months);
%mBivariate(Credit_History);
%mBivariate(Purposre_Credit_Taken);
%mBivariate(Credit_Amount);
%mBivariate(Savings_Acc);
%mBivariate(Years_At_Present_Employment);
%mBivariate(Inst_Rt_Income);
%mBivariate(Marital_Status_Gender);
%mBivariate(Other_Debtors_Guarantors);
%mBivariate(Current_Address_Yrs);
%mBivariate(Property);
%mBivariate(Age);
%mBivariate(Other_Inst_Plans);
%mBivariate(Housing);
%mBivariate(Num_CC);
%mBivariate(Job);
%mBivariate(Dependents);
%mBivariate(Telephone);
%mBivariate(Foreign_Worker);
Bivariate Analysis (of Independent variables)
1 Status_Checking_Acc # Observations Default_On_Payment Default Rate
A11 10960 5400 49.27%
A12 10760 4160 38.66%
A13 2520 560 22.22%
A14 15760 1840 11.68%

2 Duration_in_Months # Observations Default_On_Payment Default Rate
12 14360 3040 21.17%
24 16440 4880 29.68%
36 5720 2280 39.86%
48 2840 1440 50.70%
60+ 3440 1720 50.00%
(These are transformed values. For original values refer to Analysis_of_Default.xlsx)

3 Credit_History # Observations Default_On_Payment Default Rate
A30 1600 1000 62.50%
A31 1960 1120 57.14%
A32 21200 6720 31.70%
A33 3520 1120 31.82%
A34 11720 2000 17.06%

Bivariate Analysis (of Independent variables)
4 Purposre_Credit_Taken # Observations Default_On_Payment Default Rate
A40 9360 3560 38.03%
A41 4120 680 16.50%
A42 7240 2320 32.04%
A43 11200 2440 21.79%
A44 480 160 33.33%
A45 880 320 36.36%
A46 2000 880 44.00%
A48 360 40 11.11%
A49 3880 1360 35.05%
A410 480 200 41.67%
Legend: A40: car (new); A41: car (used); A42: furniture/equipment; A43: radio/television; A44: domestic appliances; A45: repairs; A46: education; A47: (vacation - does not exist?); A48: retraining; A49: business; A410: others
5 Credit_Amount # Observations Default_On_Payment Default Rate
4000 30160 7800 25.86%
11000 8680 3360 38.71%
11000+ 1160 800 68.97%
(These are transformed values. For original values refer to Analysis_of_Default.xlsx)

6
Savings_Acc # Observations Default_On_Payment Default Rate
A61 24120 8640 35.82%
A62 4120 1360 33.01%
A63 2520 440 17.46%
A64 1920 240 12.50%
A65 7320 1280 17.49%

Bivariate Analysis (of Independent variables)
7 Years_At_Present_Employment # Observations Default_On_Payment Default Rate
A71 2480 920 37.10%
A72 6880 2800 40.70%
A73 13560 4120 30.38%
A74 6960 1560 22.41%
A75 10120 2560 25.30%

8 Inst_Rt_Income # Observations Default_On_Payment Default Rate


1 5440 1360 25.00%
2 9240 2440 26.41%
3 6280 1800 28.66%
4 19040 6360 33.40%

9 Marital_Status_Gender # Observations Default_On_Payment Default Rate


A91 2000 800 40.00%
A92 12400 4320 34.84%
A93 21920 5840 26.64%
A94 3680 1000 27.17%

10
Other_Debtors_Guarantors # Observations Default_On_Payment Default Rate
A101 36280 10840 29.88%
A102 1640 720 43.90%
A103 2080 400 19.23%

Bivariate Analysis (of Independent variables)
11 Current_Address_Yrs # Observations Default_On_Payment Default Rate
1 5200 1440 27.69%
2 12320 3840 31.17%
3 5960 1720 28.86%
4 16520 4960 30.02%

12 Property # Observations Default_On_Payment Default Rate


A121 11280 2360 20.92%
A122 9280 2840 30.60%
A123 13280 4080 30.72%
A124 6160 2680 43.51%

13 Age:
Refer to Analysis_of_Default.xlsx
14
Other_Inst_Plans # Observations Default_On_Payment Default Rate
A141 5560 2280 41.01%
A142 1880 760 40.43%
A143 32560 8920 27.40%

15
Housing # Observations Default_On_Payment Default Rate
A151 7160 2800 39.11%
A152 28520 7400 25.95%
A153 4320 1760 40.74%

Bivariate Analysis (of Independent variables)
16 Num_CC # Observations Default_On_Payment Default Rate
1 25320 7960 31.44%
2 13320 3680 27.63%
3 1120 240 21.43%
4 240 80 33.33%

17 Job # Observations Default_On_Payment Default Rate


A171 880 280 31.82%
A172 8000 2240 28.00%
A173 25200 7400 29.37%
A174 5920 2040 34.46%

18 Dependents # Observations Default_On_Payment Default Rate


1 33800 10120 29.94%
2 6200 1840 29.68%

19 Telephone # Observations Default_On_Payment Default Rate


A191 23840 7440 31.21%
A192 16160 4520 27.97%

20 Foreign_Worker # Observations Default_On_Payment Default Rate


A201 38520 11800 30.63%
A202 1480 160 10.81%

Dummy variable (0,1) creation for Categorical variables
1  Status_Checking_Acc   # Observations   Default_On_Payment   Default Rate
A11   10960   5400   49.27%
A12   10760   4160   38.66%
A13    2520    560   22.22%
A14   15760   1840   11.68%
Three dummies for A11, A12 and A13. A14 will be base level.

2  Duration_in_Months   # Observations   Default_On_Payment   Default Rate
12     14360   3040   21.17%
24     16440   4880   29.68%
36      5720   2280   39.86%
48      2840   1440   50.70%
60+     3440   1720   50.00%
Will be used as a continuous variable.

3  Credit_History   # Observations   Default_On_Payment   Default Rate
A30    1600   1000   62.50%
A31    1960   1120   57.14%
A32   21200   6720   31.70%
A33    3520   1120   31.82%
A34   11720   2000   17.06%
Four dummies for A30, A31, A33 and A34. A32 will be base level.

Dummy variable (0,1) creation for Categorical variables
4  Purposre_Credit_Taken   # Observations   Default_On_Payment   Default Rate
A40       9360   3560   38.03%
A41       4120    680   16.50%
A42       7240   2320   32.04%
A43      11200   2440   21.79%
A44        480    160   33.33%
A45        880    320   36.36%
A46       2000    880   44.00%
A48        360     40   11.11%
A49       3880   1360   35.05%
A410       480    200   41.67%
Six dummies for A40 & A410, A41, A42, A45, A46 and A49. A43, A44 & A48 will be base levels.

5  Credit_Amount   # Observations   Default_On_Payment   Default Rate
4000        30160   7800   25.86%
11000        8680   3360   38.71%
11000+       1160    800   68.97%
Two dummies for 11000 and 11000+. 4000 will be base level.

6  Savings_Acc   # Observations   Default_On_Payment   Default Rate
A61   24120   8640   35.82%
A62    4120   1360   33.01%
A63    2520    440   17.46%
A64    1920    240   12.50%
A65    7320   1280   17.49%
Four dummies for A62-A65. A61 will be base level.

Dummy variable (0,1) creation for Categorical variables
7  Years_At_Present_Employment   # Observations   Default_On_Payment   Default Rate
A71    2480    920   37.10%
A72    6880   2800   40.70%
A73   13560   4120   30.38%
A74    6960   1560   22.41%
A75   10120   2560   25.30%
Four dummies for the rest of the levels. A73 will be base level.

8  Inst_Rt_Income   # Observations   Default_On_Payment   Default Rate
1     5440   1360   25.00%
2     9240   2440   26.41%
3     6280   1800   28.66%
4    19040   6360   33.40%
Continuous variable.

9  Marital_Status_Gender   # Observations   Default_On_Payment   Default Rate
A91    2000    800   40.00%
A92   12400   4320   34.84%
A93   21920   5840   26.64%
A94    3680   1000   27.17%
Three dummies for the rest. A93 will be base level.

10  Other_Debtors_Guarantors   # Observations   Default_On_Payment   Default Rate
A101   36280   10840   29.88%
A102    1640     720   43.90%
A103    2080     400   19.23%
Two dummies for the rest. A101 will be base level.

Dummy variable (0,1) creation for Categorical variables
11  Current_Address_Yrs   # Observations   Default_On_Payment   Default Rate
1     5200   1440   27.69%
2    12320   3840   31.17%
3     5960   1720   28.86%
4    16520   4960   30.02%
Continuous variable.

12  Property   # Observations   Default_On_Payment   Default Rate
A121   11280   2360   20.92%
A122    9280   2840   30.60%
A123   13280   4080   30.72%
A124    6160   2680   43.51%
Three dummies for the rest. A123 will be base level.

13 Age:
Continuous variable.
Refer to Analysis_of_Default.xlsx
14  Other_Inst_Plans   # Observations   Default_On_Payment   Default Rate
A141    5560   2280   41.01%
A142    1880    760   40.43%
A143   32560   8920   27.40%
Two dummies for the rest. A143 will be base level.

15  Housing   # Observations   Default_On_Payment   Default Rate
A151    7160   2800   39.11%
A152   28520   7400   25.95%
A153    4320   1760   40.74%
Two dummies for the rest. A152 will be base level.

Dummy variable (0,1) creation for Categorical variables
16  Num_CC   # Observations   Default_On_Payment   Default Rate
1   25320   7960   31.44%
2   13320   3680   27.63%
3    1120    240   21.43%
4     240     80   33.33%
Continuous variable.

17  Job   # Observations   Default_On_Payment   Default Rate
A171     880    280   31.82%
A172    8000   2240   28.00%
A173   25200   7400   29.37%
A174    5920   2040   34.46%
Three dummies for the rest. A173 will be base level.

18  Dependents   # Observations   Default_On_Payment   Default Rate
1   33800   10120   29.94%
2    6200    1840   29.68%
Continuous variable.

19  Telephone   # Observations   Default_On_Payment   Default Rate
A191   23840   7440   31.21%
A192   16160   4520   27.97%
One dummy for A192. A191 is base level.

20  Foreign_Worker   # Observations   Default_On_Payment   Default Rate
A201   38520   11800   30.63%
A202    1480     160   10.81%
One dummy for A202. A201 is base level.

Dummy variable (0,1) creation for Categorical variables

Refer to the SAS code: 02.Creation of Dummy Vars for Categorical Vars.sas
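The dummy-coding rule used above (one 0/1 indicator per non-base level; the base level maps to all zeros and is absorbed into the intercept) can be sketched outside SAS as well. A minimal Python illustration — the helper name `make_dummies` is mine, not from the SAS code:

```python
def make_dummies(value, levels, base):
    """One 0/1 dummy per non-base level; the base level maps to all zeros."""
    return {f"{lvl}_dummy": int(value == lvl) for lvl in levels if lvl != base}

# Status_Checking_Acc: dummies for A11, A12, A13; A14 is the base level
levels = ["A11", "A12", "A13", "A14"]
print(make_dummies("A12", levels, base="A14"))  # {'A11_dummy': 0, 'A12_dummy': 1, 'A13_dummy': 0}
print(make_dummies("A14", levels, base="A14"))  # all zeros -> A14 is the base level
```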
Variable List: To be used in modeling
Sl #   Variable Name
1      Duration_in_Months
2      Inst_Rt_Income
3      Current_Address_Yrs
4      Age
5      Num_CC
6      Dependents
7      Status_Checking_Acc_dummy_1
8      Status_Checking_Acc_dummy_2
9      Status_Checking_Acc_dummy_3
10     Duration_in_Months_trans
11     Credit_History_dummy_1
12     Credit_History_dummy_2
13     Credit_History_dummy_3
14     Credit_History_dummy_4
15     Purposre_Credit_Taken_dummy_1
16     Purposre_Credit_Taken_dummy_2
17     Purposre_Credit_Taken_dummy_3
18     Purposre_Credit_Taken_dummy_4
19     Purposre_Credit_Taken_dummy_5
20     Purposre_Credit_Taken_dummy_6
21     Credit_Amount_dummy_1
22     Credit_Amount_dummy_2
23     Savings_Acc_dummy_1
24     Savings_Acc_dummy_2
25     Savings_Acc_dummy_3
26     Savings_Acc_dummy_4
27     Yrs_At_Present_Emp_dummy_1
28     Yrs_At_Present_Emp_dummy_2
29     Yrs_At_Present_Emp_dummy_3
30     Yrs_At_Present_Emp_dummy_4
31     Marital_Status_Gender_dummy_1
32     Marital_Status_Gender_dummy_2
33     Marital_Status_Gender_dummy_3
34     Other_Debtors_Guarantors_dummy_1
35     Other_Debtors_Guarantors_dummy_2
36     Property_dummy_1
37     Property_dummy_2
38     Property_dummy_3
39     Other_Inst_Plans_dummy_1
40     Other_Inst_Plans_dummy_2
41     Housing_dummy_1
42     Housing_dummy_2
43     Job_dummy_1
44     Job_dummy_2
45     Job_dummy_3
46     Telephone_dummy_1
47     Foreign_Worker_dummy_1

Random Split of the data into Training and Validation

SAS code:

*Splitting the data into Training and Validation (60:40);


data logit.Default_On_Payment_Train_v1 logit.Default_On_Payment_Test_v1;
set logit.Default_On_Payment_v1;
if ranuni(100) le 0.60 then output logit.Default_On_Payment_Train_v1;
else output logit.Default_On_Payment_Test_v1;
run;
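The same 60:40 split can be mimicked in Python. This is only a sketch: Python's Mersenne Twister is not SAS's `ranuni`, so even with the same seed the two streams will select different records — only the split proportion carries over.

```python
import random

random.seed(100)  # plays the role of the ranuni(100) seed, but is a different generator
train, test = [], []
for customer_id in range(40000):
    if random.random() <= 0.60:
        train.append(customer_id)
    else:
        test.append(customer_id)

print(len(train), len(test))  # roughly a 60:40 split of the 40,000 records
```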

Fitting a Linear Regression on Binary Response Variable
SAS code for fitting Linear Regression
/*Testing the feasibility of running linear regression on data
having binary response variable*/
proc reg
data = logit.Default_On_Payment_v1
outest = LinRegOut;
model Default_On_Payment =
Duration_in_Months Inst_Rt_Income Current_Address_Yrs
Age Num_CC Dependents
Status_Checking_Acc_dummy_1 Status_Checking_Acc_dummy_2
Status_Checking_Acc_dummy_3
Duration_in_Months_trans Credit_History_dummy_1
Credit_History_dummy_2 Credit_History_dummy_3 Credit_History_dummy_4
Purposre_Credit_Taken_dummy_1 Purposre_Credit_Taken_dummy_2
Purposre_Credit_Taken_dummy_3 Purposre_Credit_Taken_dummy_4
Purposre_Credit_Taken_dummy_5 Purposre_Credit_Taken_dummy_6
Credit_Amount_dummy_1 Credit_Amount_dummy_2
Savings_Acc_dummy_1 Savings_Acc_dummy_2
Savings_Acc_dummy_3 Savings_Acc_dummy_4
Yrs_At_Present_Emp_dummy_1 Yrs_At_Present_Emp_dummy_2
Yrs_At_Present_Emp_dummy_3 Yrs_At_Present_Emp_dummy_4
Marital_Status_Gender_dummy_1 Marital_Status_Gender_dummy_2
Marital_Status_Gender_dummy_3
Other_Debtors_Guarantors_dummy_1 Other_Debtors_Guarantors_dummy_2
Property_dummy_1 Property_dummy_2 Property_dummy_3
Other_Inst_Plans_dummy_1 Other_Inst_Plans_dummy_2
Housing_dummy_1 Housing_dummy_2 Job_dummy_1 Job_dummy_2 Job_dummy_3
Telephone_dummy_1 Foreign_Worker_dummy_1;
run;

Fitting a Linear Regression on Binary Response Variable
SAS code for Scoring Linear Regression output
/*Scoring the records with Linear Regression Output*/
proc score
data = logit.Default_On_Payment_v1
Score = LinRegOut
out = ScoredLinReg type = parms;
Var Duration_in_Months Inst_Rt_Income Current_Address_Yrs
Age Num_CC Dependents
Status_Checking_Acc_dummy_1 Status_Checking_Acc_dummy_2
Status_Checking_Acc_dummy_3
Duration_in_Months_trans Credit_History_dummy_1
Credit_History_dummy_2 Credit_History_dummy_3 Credit_History_dummy_4
Purposre_Credit_Taken_dummy_1 Purposre_Credit_Taken_dummy_2
Purposre_Credit_Taken_dummy_3 Purposre_Credit_Taken_dummy_4
Purposre_Credit_Taken_dummy_5 Purposre_Credit_Taken_dummy_6
Credit_Amount_dummy_1 Credit_Amount_dummy_2
Savings_Acc_dummy_1 Savings_Acc_dummy_2
Savings_Acc_dummy_3 Savings_Acc_dummy_4
Yrs_At_Present_Emp_dummy_1 Yrs_At_Present_Emp_dummy_2
Yrs_At_Present_Emp_dummy_3 Yrs_At_Present_Emp_dummy_4
Marital_Status_Gender_dummy_1 Marital_Status_Gender_dummy_2
Marital_Status_Gender_dummy_3
Other_Debtors_Guarantors_dummy_1 Other_Debtors_Guarantors_dummy_2
Property_dummy_1 Property_dummy_2 Property_dummy_3
Other_Inst_Plans_dummy_1 Other_Inst_Plans_dummy_2
Housing_dummy_1 Housing_dummy_2 Job_dummy_1 Job_dummy_2 Job_dummy_3
Telephone_dummy_1 Foreign_Worker_dummy_1;
run;

*Plotting the predicted values;


proc gplot data = ScoredLinReg;
   plot Customer_ID * Model1;
run;
Fitting a Linear Regression on Binary Response Variable
The predicted Default_On_Payment is taking values outside the range [0, 1]
We intend to predict probability of default which must take value in the range (0, 1)
We cannot achieve this by using Linear Regression
What is the solution???
Use Logistic Regression
A subset of Generalized Linear Models (GLMs)
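The out-of-range problem is easy to reproduce: an ordinary least-squares line fitted to a 0/1 response keeps rising past 1 and falling below 0. A tiny self-contained Python check with made-up data:

```python
# Made-up data: response is 0 for small x and 1 for large x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def predict(x):
    return intercept + slope * x

print(predict(12))  # > 1: not a valid probability
print(predict(0))   # < 0: not a valid probability either
```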

Generalized Linear Models (GLMs)
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that
allows for response variables that have other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response
variable via a link function and by allowing the magnitude of the variance of each measurement
to be a function of its predicted value.
It uses an iteratively reweighted least squares method for maximum likelihood estimation of the
model parameters.
The GLM consists of three elements:
1. A probability distribution from the exponential family.
2. A linear predictor η = Xβ.
3. A link function g such that E(Y) = μ = g⁻¹(η).

Generalized Linear Models (GLMs)- Exponential family
A distribution for a random variable Y belongs to an exponential family if its density has the
following form:

f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }

where a, b and c are functions.

Parameter θ is called the natural parameter; it is the one which is relevant to the model for
relating the response (Y) to the covariates.
φ is known as the scale parameter or dispersion parameter.
For example, for the Normal distribution μ is the natural parameter and σ² is the dispersion
parameter.
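The Normal example can be written out explicitly. With σ² treated as the dispersion parameter, the standard algebra is:

```latex
f(y;\mu,\sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)
  = \exp\!\left(\frac{y\mu - \mu^2/2}{\sigma^2}
        - \frac{y^2}{2\sigma^2}
        - \frac{1}{2}\log(2\pi\sigma^2)\right),
```

so θ = μ, b(θ) = θ²/2, a(φ) = φ = σ², and c(y, φ) collects the last two terms — matching the exponential-family form with functions a, b and c.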

Generalized Linear Models (GLMs)- Binomial Distribution
Suppose Z ~ Binomial(n, π). Let Y = Z/n so that Z = nY.
Distribution of Z is

P(Z = z) = C(n, z) π^z (1 − π)^(n − z)

Substituting Z = nY, we get the distribution of Y as

P(Y = y) = C(n, ny) π^(ny) (1 − π)^(n − ny)
         = exp{ n [y log(π/(1−π)) + log(1 − π)] + log C(n, ny) }

which is of the exponential family form with natural parameter θ = log(π/(1−π)) (the log-odds),
b(θ) = log(1 + e^θ) = −log(1 − π), a(φ) = 1/n and c(y, φ) = log C(n, ny).

Generic form of exponential family of distribution (just for reference):
f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }
GLMs Linear Predictor and Link Function
The linear predictor is the quantity which incorporates the information about the independent
variables into the model. The symbol η (Greek "eta") denotes a linear predictor.
It is related to the expected value of the data (thus, "predictor") through the link function.
η is expressed as a linear combination (thus, "linear") of unknown parameters β:
η = Σᵢ Xᵢ βᵢ, i.e. η = Xβ
If g(·) is the link function, then g(μ) = η.
Some common link functions: identity, g(μ) = μ (Normal response); log, g(μ) = log(μ)
(Poisson counts); and logit, g(μ) = log(μ/(1 − μ)) (binomial/Bernoulli response).
Logistic Regression - Introduction
Logistic regression is used to analyze relationships between a dichotomous dependent variable
and metric or dichotomous independent variables.
Logistic regression combines the independent variables to estimate the probability that a
particular event will occur, i.e. a subject will be a member of one of the groups defined by the
dichotomous dependent variable.
The value produced by logistic regression is a probability value between 0.0 and 1.0.
If the probability for group membership in the modeled category is above some cut point (the
default is 0.50), the subject is predicted to be a member of the modeled group. If the probability
is below the cut point, the subject is predicted to be a member of the other group.
For any given case, logistic regression computes the probability that a case with a particular set of
values for the independent variable is a member of the modeled category.

Logistic Regression Variable Requirements
Logistic regression analysis requires that the dependent variable be dichotomous.
Logistic regression analysis requires that the independent variables be numerical or dichotomous.
If an independent variable is nominal level and not dichotomous, we need to dummy code the
variable.
Logistic regression does not make any assumptions of normality, linearity, and homogeneity of
variance for the independent variables.
The regression equation is
Log(p/(1-p)) = a + b1 x1 + b2 x2 + ... + bn xn + e
Log(p/(1-p)) = η (the linear predictor)
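Inverting Log(p/(1-p)) = η gives p = e^η / (1 + e^η). Combined with the 0.50 cut point described earlier, classification can be sketched in Python (an illustration, not the SAS implementation):

```python
import math

def predicted_probability(lp):
    """p = e^lp / (1 + e^lp), the inverse of log(p/(1-p)) = lp."""
    return math.exp(lp) / (1 + math.exp(lp))

def predicted_group(lp, cutoff=0.5):
    """Classify at a 0.50 cut point (the common default)."""
    return 1 if predicted_probability(lp) >= cutoff else 0

print(predicted_group(1.1))   # p ~ 0.75 -> modeled group (1)
print(predicted_group(-2.0))  # p ~ 0.12 -> other group (0)
```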

Logistic Regression Methods for including variables
There are three methods available for including variables in the regression equation:
1. The simultaneous method in which all independents are included at the same time
2. The hierarchical method in which control variables are entered in the analysis before the predictors whose
effects we are primarily concerned with.
3. The stepwise method (in-built functionality in SAS and SPSS) in which variables are selected in the order in
which they maximize the statistically significant contribution to the model.
For all methods, the contribution to the model is measured by the model Deviance or AIC.
A better model will have a lower Deviance/AIC.
Deviance is calculated from Maximum-likelihood estimation (MLE).
MLE is an iterative procedure that works successively to get closer and closer to the
correct answer.
A perfect model has likelihood 1, and hence log-likelihood 0.
Deviance = 2 × (log-likelihood of perfect model − log-likelihood of current model)
AIC = Deviance + 2 × (number of parameters in model)
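These two formulas are easy to check numerically. A small Python sketch with hypothetical fitted probabilities and a hypothetical parameter count k (both made up for illustration); the "perfect model" has likelihood 1, i.e. log-likelihood 0:

```python
import math

def log_likelihood(y, p):
    """Binomial log-likelihood of observed 0/1 outcomes y under fitted probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1, 0]
p = [0.8, 0.3, 0.6, 0.9, 0.2]              # hypothetical fitted probabilities
k = 3                                       # hypothetical number of model parameters

deviance = 2 * (0 - log_likelihood(y, p))   # perfect model's log-likelihood is 0
aic = deviance + 2 * k
print(round(deviance, 4), round(aic, 4))
```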

Logistic Regression Overall Test of Relationship
The overall test of relationship among the independent variables and groups defined by the
dependent is based on the reduction in the likelihood values for a model which does not contain
any independent variables and the model that contains the independent variables.

This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-
square.

The significance test for the model chi-square is our statistical evidence of the presence of a
relationship between the dependent variable and the combination of the independent variables.

Running Logistic Regression Using SAS- Proc Logistic
SAS code for running regression:
*Running proc logistic to predict probability of default;
proc logistic data=logit.Default_On_Payment_Train_v1 descending outest=betas covout;
model Default_On_Payment= &VarList
/ selection=stepwise
slentry=0.01
slstay=0.005
details
lackfit;
output out= Pred_Default_On_Payment_Train_v1 p=phat lower=lcl upper=ucl
predprobs=(individual crossvalidate);
run;

*List of variables to be used in Logistic Regression;


%Let Varlist = Inst_Rt_Income Current_Address_Yrs
Age Num_CC Dependents
Status_Checking_Acc_dummy_1 Status_Checking_Acc_dummy_2 Status_Checking_Acc_dummy_3
Duration_in_Months_trans Credit_History_dummy_1
Credit_History_dummy_2 Credit_History_dummy_3 Credit_History_dummy_4
Purposre_Credit_Taken_dummy_1 Purposre_Credit_Taken_dummy_2
Purposre_Credit_Taken_dummy_3 Purposre_Credit_Taken_dummy_4
Purposre_Credit_Taken_dummy_5 Purposre_Credit_Taken_dummy_6
Credit_Amount_dummy_1 Credit_Amount_dummy_2 Savings_Acc_dummy_1 Savings_Acc_dummy_2
Savings_Acc_dummy_3 Savings_Acc_dummy_4 Yrs_At_Present_Emp_dummy_1 Yrs_At_Present_Emp_dummy_2
Yrs_At_Present_Emp_dummy_3 Yrs_At_Present_Emp_dummy_4
Marital_Status_Gender_dummy_1 Marital_Status_Gender_dummy_2
Marital_Status_Gender_dummy_3 Other_Debtors_Guarantors_dummy_1 Other_Debtors_Guarantors_dummy_2
Property_dummy_1 Property_dummy_2 Property_dummy_3 Other_Inst_Plans_dummy_1 Other_Inst_Plans_dummy_2
Housing_dummy_1 Housing_dummy_2 Job_dummy_1 Job_dummy_2 Job_dummy_3
Telephone_dummy_1 Foreign_Worker_dummy_1;

Running Logistic Regression Using SAS- Proc Logistic
SAS code for running regression:
*Running proc logistic to predict probability of default;
proc logistic data=logit.Default_On_Payment_Train_v1 descending outest=betas covout;
model Default_On_Payment= &VarList
/ selection=stepwise
slentry=0.01
slstay=0.005
details
lackfit;
output out= Pred_Default_On_Payment_Train_v1 p=phat lower=lcl upper=ucl
predprobs=(individual crossvalidate);
run;

Input dataset: data=logit.Default_On_Payment_Train_v1


Model coefficients dataset: outest=betas

Output dataset having predicted probabilities:


output out= Pred_Default_On_Payment_Train_v1

Running Logistic Regression Using SAS- Predicted Probabilities
Plotting the predicted output
Predicted values are in the range of (0,1)

*Plotting the predicted values;


proc gplot data = Pred_Default_On_Payment_Train_v1;
   plot Customer_ID * phat;
run;

Running Logistic Regression Using SAS- Multicollinearity (VIF) Test
SAS code
*Checking for multicollinearity among predictors;
proc reg data= logit.Default_On_Payment_Train_v1;
model Default_On_Payment = &VarList1/ vif tol collin;
quit;

*List of variables tested for VIF;


%Let Varlist1 = Inst_Rt_Income Age Num_CC Dependents Status_Checking_Acc_dummy_1
Status_Checking_Acc_dummy_2 Status_Checking_Acc_dummy_3
Duration_in_Months_trans Credit_History_dummy_1 Credit_History_dummy_2
Credit_History_dummy_3 Credit_History_dummy_4 Purposre_Credit_Taken_dummy_1
Purposre_Credit_Taken_dummy_2 Purposre_Credit_Taken_dummy_3
Purposre_Credit_Taken_dummy_5 Purposre_Credit_Taken_dummy_6
Credit_Amount_dummy_1 Credit_Amount_dummy_2 Savings_Acc_dummy_1
Savings_Acc_dummy_2 Savings_Acc_dummy_3 Savings_Acc_dummy_4
Yrs_At_Present_Emp_dummy_2 Yrs_At_Present_Emp_dummy_3
Marital_Status_Gender_dummy_1 Marital_Status_Gender_dummy_2
Marital_Status_Gender_dummy_3 Other_Debtors_Guarantors_dummy_1
Other_Debtors_Guarantors_dummy_2 Property_dummy_1 Property_dummy_3
Other_Inst_Plans_dummy_1 Other_Inst_Plans_dummy_2 Housing_dummy_1
Telephone_dummy_1 Foreign_Worker_dummy_1;

If the Variance Inflation Factor (VIF) of a predictor is greater than 2, multicollinearity is
suspected.

There is no multicollinearity in the model (refer to the table on the next slide).
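For intuition: with just two predictors, the VIF reduces to 1/(1 − r²), where r is their Pearson correlation. A minimal Python sketch with made-up data (the helper name `vif_two_predictors` is mine):

```python
import math

def vif_two_predictors(x1, x2):
    """For two predictors, VIF = 1 / (1 - r^2), r = Pearson correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r = cov / math.sqrt(sum((a - m1) ** 2 for a in x1)
                        * sum((b - m2) ** 2 for b in x2))
    return 1 / (1 - r * r)

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]    # nearly collinear with x1
print(vif_two_predictors(x1, x2))  # large VIF -> strong multicollinearity
```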

Running Logistic Regression Using SAS- Multicollinearity (VIF) Test
Parameter Estimates
Variable Label DF Parameter Estimate Standard Error t Value Pr > |t| Tolerance Variance Inflation
Intercept Intercept 1 -0.03249 0.01878 -1.73 0.0836 . 0
Inst_Rt_Income Inst_Rt_Income 1 0.04394 0.00248 17.74 <.0001 0.83105 1.2033
Age Age 1 -0.00215 0.0002514 -8.55 <.0001 0.7836 1.27616
Num_CC Num_CC 1 0.03613 0.00556 6.5 <.0001 0.62058 1.61141
Dependents Dependents 1 0.03621 0.00751 4.82 <.0001 0.86072 1.16182
Status_Checking_Acc_dummy_1 1 0.2876 0.00681 42.24 <.0001 0.69032 1.4486
Status_Checking_Acc_dummy_2 1 0.18146 0.00675 26.88 <.0001 0.71578 1.39707
Status_Checking_Acc_dummy_3 1 0.05798 0.01114 5.2 <.0001 0.88382 1.13146
Duration_in_Months_trans 1 0.00332 0.0002801 11.86 <.0001 0.60698 1.6475
Credit_History_dummy_1 1 0.13422 0.01426 9.41 <.0001 0.82217 1.21629
Credit_History_dummy_2 1 0.1437 0.0125 11.5 <.0001 0.86004 1.16273
Credit_History_dummy_3 1 -0.03815 0.01014 -3.76 0.0002 0.7832 1.27681
Credit_History_dummy_4 1 -0.11035 0.00723 -15.27 <.0001 0.58721 1.70296
Purposre_Credit_Taken_dummy_1 1 -0.23617 0.02493 -9.47 <.0001 0.86203 1.16005
Purposre_Credit_Taken_dummy_2 1 -0.16687 0.00926 -18.02 <.0001 0.80155 1.24758
Purposre_Credit_Taken_dummy_3 1 -0.04183 0.00718 -5.83 <.0001 0.83029 1.20439
Purposre_Credit_Taken_dummy_5 1 0.08792 0.01217 7.23 <.0001 0.90977 1.09918
Purposre_Credit_Taken_dummy_6 1 -0.03568 0.00956 -3.73 0.0002 0.80843 1.23697
Credit_Amount_dummy_1 1 0.12201 0.00794 15.37 <.0001 0.59442 1.68232
Credit_Amount_dummy_2 1 0.37167 0.01762 21.1 <.0001 0.74568 1.34106
Savings_Acc_dummy_1 1 -0.0335 0.00886 -3.78 0.0002 0.88349 1.13187
Savings_Acc_dummy_2 1 -0.08557 0.01101 -7.77 <.0001 0.90279 1.10767
Savings_Acc_dummy_3 1 -0.13818 0.01223 -11.29 <.0001 0.92921 1.07618
Savings_Acc_dummy_4 1 -0.1254 0.00707 -17.74 <.0001 0.85835 1.16502
Yrs_At_Present_Emp_dummy_2 1 0.03824 0.00727 5.26 <.0001 0.84204 1.1876
Yrs_At_Present_Emp_dummy_3 1 -0.07284 0.00702 -10.38 <.0001 0.90591 1.10386
Marital_Status_Gender_dummy_1 1 0.12894 0.01215 10.61 <.0001 0.89098 1.12237
Marital_Status_Gender_dummy_2 1 0.0718 0.0064 11.23 <.0001 0.72954 1.37072
Marital_Status_Gender_dummy_3 1 0.01797 0.00972 1.85 0.0647 0.81835 1.22197
Other_Debtors_Guarantors_dummy_1 1 0.07867 0.01314 5.99 <.0001 0.926 1.07992
Other_Debtors_Guarantors_dummy_2 1 -0.16428 0.01201 -13.68 <.0001 0.90409 1.10609
Property_dummy_1 1 -0.03909 0.00625 -6.26 <.0001 0.80541 1.2416
Property_dummy_3 1 0.05901 0.00792 7.46 <.0001 0.78536 1.2733
Other_Inst_Plans_dummy_1 1 0.08615 0.00775 11.11 <.0001 0.88298 1.13252
Other_Inst_Plans_dummy_2 1 0.0741 0.01249 5.93 <.0001 0.92184 1.08478
Housing_dummy_1 1 0.07793 0.00713 10.93 <.0001 0.86372 1.15778
Telephone_dummy_1 1 -0.03855 0.00554 -6.96 <.0001 0.86516 1.15585
Foreign_Worker_dummy_1 1 -0.11644 0.01401 -8.31 <.0001 0.91884 1.08832

Running Logistic Regression Using SAS- Output
Association of Predicted Probabilities and Observed Responses
Percent Concordant    82.8      Somers' D   0.658
Percent Discordant    17.0      Gamma       0.659
Percent Tied           0.2      Tau-a       0.276
Pairs             1.20E+08      c           0.829

Concordance:
The predicted probabilities for those with the event (cases, 1) are compared to the predicted
probabilities of those without the event (noncases, 0).
Each case is compared to every noncase. If there are m cases and n noncases, then there are
m*n paired comparisons.
If the predicted probability of a case is larger than the predicted probability of a noncase,
the pair (comparison) is considered concordant.
The percent concordant is the number of concordant pairs divided by m*n.
If a noncase has the higher probability, the comparison is said to be discordant.
When the probabilities are equal, the comparisons are declared ties.

Somers' D:
It is used to determine the strength and direction of relation between pairs of variables.
Its values range from -1.0 (all pairs disagree) to 1.0 (all pairs agree).
It is defined as (%Concordant - %Discordant)/100.
In our model, Somers' D = (82.8 - 17)/100 = 0.658.

Gamma:
The Goodman-Kruskal Gamma method does not penalize for ties on either variable.
Its values range from -1.0 to 1.0, with 0 indicating no association.
Because it does not penalize for ties, its value will generally be greater than the value of
Somers' D.
In our model, Gamma = (82.8 - 17)/(100 - 0.2) = 0.659.

Tau-a:
It is defined to be the ratio of the difference between the number of concordant pairs and the
number of discordant pairs to the number of possible pairs: (nc - nd)/(N(N-1)/2).

Area under the Receiver Operating Characteristic (ROC) Curve (c):
The ROC curve is a diagnostic tool for assessing the ability of a logistic model to
discriminate between events and nonevents.
If the model discriminates perfectly, the ROC curve passes through the point (0,1) and the
area below the curve is one.
If the model has no discriminating ability, the curve is the line from (0,0) to (1,1).
Each point on the ROC curve provides the sensitivity and specificity measures associated with
a cutoff on the probability scale, which allows classification of each observation as either a
predicted event or a predicted nonevent.
c ranges from 0.5 to 1, where 0.5 corresponds to the model randomly predicting the response
and 1 corresponds to the model perfectly discriminating the response.
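The relationships between these rank statistics can be verified directly from the reported percentages:

```python
# Values from the association table: 82.8% concordant, 17% discordant, 0.2% tied
pct_concordant, pct_discordant, pct_tied = 82.8, 17.0, 0.2

somers_d = (pct_concordant - pct_discordant) / 100
gamma = (pct_concordant - pct_discordant) / (100 - pct_tied)
c = (pct_concordant + 0.5 * pct_tied) / 100   # equivalently (somers_d + 1) / 2

print(round(somers_d, 3), round(gamma, 3), round(c, 3))  # 0.658 0.659 0.829
```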
Running Logistic Regression Using SAS- Output (Coefficients)
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -3.3692 0.1308 663.9035 <.0001
Inst_Rt_Income 1 0.3024 0.0168 323.2719 <.0001
Age 1 -0.0152 0.00175 74.7986 <.0001
Num_CC 1 0.2694 0.0372 52.4981 <.0001
Dependents 1 0.2899 0.0499 33.7694 <.0001
Status_Checking_Acc_dummy_1 1 1.7893 0.0463 1491.0422 <.0001
Status_Checking_Acc_dummy_2 1 1.2806 0.0467 752.2189 <.0001
Status_Checking_Acc_dummy_3 1 0.5209 0.0794 43.0764 <.0001
Duration_in_Months_trans 1 0.0198 0.00177 124.4905 <.0001
Credit_History_dummy_1 1 0.5512 0.0869 40.2216 <.0001
Credit_History_dummy_2 1 0.6816 0.0762 79.9364 <.0001
Credit_History_dummy_3 1 -0.2567 0.065 15.5812 <.0001
Credit_History_dummy_4 1 -0.786 0.0513 234.664 <.0001
Purposre_Credit_Taken_dummy_1 1 -1.3713 0.1607 72.7997 <.0001
Purposre_Credit_Taken_dummy_2 1 -1.3503 0.0745 328.5343 <.0001
Purposre_Credit_Taken_dummy_3 1 -0.2608 0.0464 31.6319 <.0001
Purposre_Credit_Taken_dummy_5 1 0.4131 0.0784 27.7333 <.0001
Purposre_Credit_Taken_dummy_6 1 -0.2536 0.064 15.6887 <.0001
Credit_Amount_dummy_1 1 0.8517 0.0513 275.6672 <.0001
Credit_Amount_dummy_2 1 2.2031 0.1128 381.5283 <.0001
Savings_Acc_dummy_1 1 -0.2248 0.0572 15.4159 <.0001
Savings_Acc_dummy_2 1 -0.4838 0.0807 35.96 <.0001
Savings_Acc_dummy_3 1 -1.2521 0.1071 136.6858 <.0001
Savings_Acc_dummy_4 1 -0.9065 0.0525 298.1245 <.0001
Yrs_At_Present_Emp_dummy_2 1 0.1769 0.0455 15.1108 0.0001
Yrs_At_Present_Emp_dummy_3 1 -0.5525 0.0502 120.9904 <.0001
Marital_Status_Gender_dummy_1 1 0.8177 0.0776 111.0017 <.0001
Marital_Status_Gender_dummy_2 1 0.5105 0.0422 146.1068 <.0001
Marital_Status_Gender_dummy_3 1 0.2111 0.0636 11.0151 0.0009
Other_Debtors_Guarantors_dummy_1 1 0.4304 0.0824 27.2748 <.0001
Other_Debtors_Guarantors_dummy_2 1 -0.9967 0.0845 139.1031 <.0001
Property_dummy_1 1 -0.245 0.0427 32.9165 <.0001
Property_dummy_3 1 0.3948 0.052 57.6029 <.0001
Other_Inst_Plans_dummy_1 1 0.6512 0.0483 181.474 <.0001
Other_Inst_Plans_dummy_2 1 0.4642 0.0767 36.6074 <.0001
Housing_dummy_1 1 0.5017 0.0459 119.4719 <.0001
Telephone_dummy_1 1 -0.2523 0.0375 45.2208 <.0001
Foreign_Worker_dummy_1 1 -1.1664 0.1251 86.9253 <.0001

Running Logistic Regression Using SAS- Scoring Equation
Linear_Predictor (lp) = - 3.3692 + 0.3024*Inst_Rt_Income - 0.0152*Age + 0.2694*Num_CC + 0.2899*Dependents +
1.7893*Status_Checking_Acc_dummy_1 + 1.2806*Status_Checking_Acc_dummy_2 +
0.5209*Status_Checking_Acc_dummy_3 + 0.0198*Duration_in_Months_trans + 0.5512*Credit_History_dummy_1 +
0.6816*Credit_History_dummy_2 - 0.2567*Credit_History_dummy_3 - 0.786*Credit_History_dummy_4 -
1.3713*Purposre_Credit_Taken_dummy_1 - 1.3503*Purposre_Credit_Taken_dummy_2 -
0.2608*Purposre_Credit_Taken_dummy_3 + 0.4131*Purposre_Credit_Taken_dummy_5 -
0.2536*Purposre_Credit_Taken_dummy_6 + 0.8517*Credit_Amount_dummy_1 + 2.2031*Credit_Amount_dummy_2 -
0.2248*Savings_Acc_dummy_1 - 0.4838*Savings_Acc_dummy_2 - 1.2521*Savings_Acc_dummy_3 -
0.9065*Savings_Acc_dummy_4 + 0.1769*Yrs_At_Present_Emp_dummy_2 - 0.5525*Yrs_At_Present_Emp_dummy_3 +
0.8177*Marital_Status_Gender_dummy_1 + 0.5105*Marital_Status_Gender_dummy_2 +
0.2111*Marital_Status_Gender_dummy_3 + 0.4304*Other_Debtors_Guarantors_dummy_1 -
0.9967*Other_Debtors_Guarantors_dummy_2 - 0.245*Property_dummy_1 + 0.3948*Property_dummy_3 +
0.6512*Other_Inst_Plans_dummy_1 + 0.4642*Other_Inst_Plans_dummy_2 + 0.5017*Housing_dummy_1 -
0.2523*Telephone_dummy_1 - 1.1664*Foreign_Worker_dummy_1;

Probability_Default (Pred_Prob) = exp(lp)/(1 + exp(lp));
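As a worked example of the scoring arithmetic — purely illustrative, since a "customer" with every model variable 0 except Status_Checking_Acc_dummy_1 is not realistic (e.g. Age = 0):

```python
import math

# Hypothetical record: all model variables 0 except Status_Checking_Acc_dummy_1 = 1
lp = -3.3692 + 1.7893 * 1                      # intercept + dummy coefficient
pred_prob = math.exp(lp) / (1 + math.exp(lp))
print(round(pred_prob, 4))                     # ~0.17 probability of default
```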

Logistic Regression Gains curve and Gini (Training)
SAS code
*Generating Gains Curve and Calculating Gini Coeff & KS on Training records;

*Sort the records by predicted probabilities in descending order;


proc sort data = Pred_Default_On_Payment_Train_v1; by descending phat;
run;

*Create a variable that will store cumulative number of observations;


*Divide the data in 10 equal observation bins;
%Let NoOfRecords = 23938;
%Let NoOfBins = 10;
data Pred_Default_On_Payment_Train_v2;
set Pred_Default_On_Payment_Train_v1;
retain Cumulative_Count;
Count = 1;
Cumulative_Count = sum(Cumulative_Count, Count);
Bin = round(Cumulative_Count/(&NoOfRecords/&NoOfBins) - 0.5) + 1;
if Bin GT &NoOfBins then Bin = &NoOfBins;
run;

proc sql;
create table Gains_v1 as
select Bin, count(*) as freq, sum(Default_On_Payment) as Actual,
sum(phat) as Predicted from Pred_Default_On_Payment_Train_v2
group by Bin;
quit;
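The Bin formula relies on SAS `round` (round half away from zero). A Python sketch of the same decile assignment, checking that the 23,938 training records split into near-equal bins:

```python
import math

def sas_round(x):
    """SAS round() is half-away-from-zero; for the positive values used here,
    that is floor(x + 0.5)."""
    return math.floor(x + 0.5)

n_records, n_bins = 23938, 10
sizes = {}
for cum_count in range(1, n_records + 1):
    b = min(sas_round(cum_count / (n_records / n_bins) - 0.5) + 1, n_bins)
    sizes[b] = sizes.get(b, 0) + 1
print(sizes)  # ten bins of ~2394 records each
```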

Logistic Regression Gains curve and Gini (Training)
Bins   # Observations   Actual Default   Predicted Default   Cumulative Actual    Lift      Random     Area
 0            0                0                 0.00                  0           0.00%     0.00%    0.00000
 1         2393             1812              1918.78               1812          25.25%    10.00%    0.01263
 2         2394             1692              1452.82               3504          48.84%    20.00%    0.03705
 3         2394             1002              1146.30               4506          62.80%    30.00%    0.05582
 4         2394              938               861.02               5444          75.87%    40.00%    0.06934
 5         2393              529               631.90               5973          83.25%    50.00%    0.07956
 6         2394              466               455.50               6439          89.74%    60.00%    0.08649
 7         2394              299               316.74               6738          93.91%    70.00%    0.09183
 8         2394              316               218.17               7054          98.31%    80.00%    0.09611
 9         2394               96               121.68               7150          99.65%    90.00%    0.09898
10         2394               25                52.15               7175         100.00%   100.00%    0.09983

Gini = 45.53%
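The per-bin Area values are trapezoids under the gains curve, and Gini = 2 × (total area) − 1. A Python cross-check using the cumulative % of actual defaults captured per decile:

```python
# Cumulative % of actual defaults captured per decile (from the gains table)
cum_actual = [0.0, 0.2525, 0.4884, 0.6280, 0.7587, 0.8325,
              0.8974, 0.9391, 0.9831, 0.9965, 1.0000]

# Trapezoidal area under the gains curve; each decile has width 0.10
area = sum(0.10 * (a + b) / 2 for a, b in zip(cum_actual, cum_actual[1:]))
gini = 2 * area - 1   # a random model has area 0.5, i.e. Gini 0
print(round(area, 4), round(gini, 4))  # gini ~ 0.455, matching the 45.53% above
```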

Logistic Regression KS Stat (Training)
Bins   # Obs   Actual Default   Actual Non-Default   Predicted Default   Cum. Default   Cum. Non-Default   Lift-Default   Lift-Non-Default    Random      Area     Difference (K-S)
 0        0          0                   0                   0.00               0                0             0.00%           0.00%           0.00%    0.000000        0.00%
 1     2393       1812                 581                1918.78            1812              581            25.25%           3.47%          10.00%    0.012627       21.79%
 2     2394       1692                 702                1452.82            3504             1283            48.84%           7.65%          20.00%    0.037045       41.18%
 3     2394       1002                1392                1146.30            4506             2675            62.80%          15.96%          30.00%    0.055819       46.84%
 4     2394        938                1456                 861.02            5444             4131            75.87%          24.64%          40.00%    0.069338       51.23%
 5     2393        529                1864                 631.90            5973             5995            83.25%          35.76%          50.00%    0.079561       47.48%
 6     2394        466                1928                 455.50            6439             7923            89.74%          47.26%          60.00%    0.086495       42.48%
 7     2394        299                2095                 316.74            6738            10018            93.91%          59.76%          70.00%    0.091826       34.15%
 8     2394        316                2078                 218.17            7054            12096            98.31%          72.16%          80.00%    0.096111       26.15%
 9     2394         96                2298                 121.68            7150            14394            99.65%          85.87%          90.00%    0.098983       13.78%
10     2394         25                2369                  52.15            7175            16763           100.00%         100.00%         100.00%    0.099826        0.00%

Gini = 45.53%

The maximum difference between %Cumulative Default (Bad) and %Cumulative Non-Default (Good) is 51.23%, occurring at the 4th decile
At the 4th decile, the model captures 75.87% of total bads, i.e. by following the model we can capture 75.87% of total bads within 40% of the records
The K-S of the model is therefore 51.23%, at the 4th decile
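The K-S statistic is simply the maximum gap between the two cumulative distributions. A minimal Python sketch, using the cumulative lift columns from the table above:

```python
# Cumulative % of defaults (bads) and non-defaults (goods) per decile,
# copied from the training K-S table above.
cum_default     = [0.2525, 0.4884, 0.6280, 0.7587, 0.8325,
                   0.8974, 0.9391, 0.9831, 0.9965, 1.0]
cum_non_default = [0.0347, 0.0765, 0.1596, 0.2464, 0.3576,
                   0.4726, 0.5976, 0.7216, 0.8587, 1.0]

# K-S = maximum separation between the two cumulative curves.
gaps = [d - n for d, n in zip(cum_default, cum_non_default)]
ks = max(gaps)
decile = gaps.index(ks) + 1
print(ks, decile)  # 0.5123 at the 4th decile
```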

Scoring Holdout Validation Records
SAS code
/*Scoring the hold out records*/
data logit.Default_On_Payment_Test_v2;
set logit.Default_On_Payment_Test_v1;
Predicted_Default_LP =
-3.3692
+0.3024*Inst_Rt_Income -0.0152*Age +0.2694*Num_CC +0.2899*Dependents
+1.7893*Status_Checking_Acc_dummy_1 +1.2806*Status_Checking_Acc_dummy_2
+0.5209*Status_Checking_Acc_dummy_3 +0.0198*Duration_in_Months_trans
+0.5512*Credit_History_dummy_1 +0.6816*Credit_History_dummy_2
-0.2567*Credit_History_dummy_3 -0.786*Credit_History_dummy_4
-1.3713*Purposre_Credit_Taken_dummy_1 -1.3503*Purposre_Credit_Taken_dummy_2
-0.2608*Purposre_Credit_Taken_dummy_3 +0.4131*Purposre_Credit_Taken_dummy_5
-0.2536*Purposre_Credit_Taken_dummy_6 +0.8517*Credit_Amount_dummy_1
+2.2031*Credit_Amount_dummy_2 -0.2248*Savings_Acc_dummy_1
-0.4838*Savings_Acc_dummy_2 -1.2521*Savings_Acc_dummy_3
-0.9065*Savings_Acc_dummy_4 +0.1769*Yrs_At_Present_Emp_dummy_2
-0.5525*Yrs_At_Present_Emp_dummy_3 +0.8177*Marital_Status_Gender_dummy_1
+0.5105*Marital_Status_Gender_dummy_2 +0.2111*Marital_Status_Gender_dummy_3
+0.4304*Other_Debtors_Guarantors_dummy_1 -0.9967*Other_Debtors_Guarantors_dummy_2
-0.245*Property_dummy_1 +0.3948*Property_dummy_3
+0.6512*Other_Inst_Plans_dummy_1 +0.4642*Other_Inst_Plans_dummy_2
+0.5017*Housing_dummy_1 -0.2523*Telephone_dummy_1
-1.1664*Foreign_Worker_dummy_1;
Predicted_Default_Probability = 1/(1+exp(-Predicted_Default_LP));
run;
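The DATA step above hard-codes the fitted coefficients: it first builds the linear predictor (XB) and then converts it to a probability through the logistic (inverse-logit) function. The same two-step logic, sketched in Python with a hypothetical two-variable model (the coefficients and inputs here are made up for illustration, not taken from the fitted model):

```python
import math

def predict_default_probability(intercept, coefs, x):
    """Score one record: linear predictor, then inverse logit."""
    lp = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1 / (1 + math.exp(-lp))

# Hypothetical two-variable model, for illustration only.
p = predict_default_probability(-3.37, [0.30, -0.015], [2, 35])
print(round(p, 4))

# Sanity check: a linear predictor of 0 must give probability 0.5.
assert predict_default_probability(0.0, [], []) == 0.5
```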

Scoring Holdout Validation Records- Calculating Gini and KS
SAS code
*Generating Gains Curve and Calculating Gini Coeff & KS on Validation records;

*Sort the records by predicted probabilities in descending order;


proc sort data= logit.Default_On_Payment_Test_v2; by descending Predicted_Default_Probability;
run;

*Create a variable that will store cumulative number of observations;


*Divide the data into 10 equal observation bins;
%Let NoOfRecords = 16062;
%Let NoOfBins = 10;
data Default_On_Payment_Test_v3;
set logit.Default_On_Payment_Test_v2;
retain Cumulative_Count;
Count = 1;
Cumulative_Count = sum(Cumulative_Count, Count);
Bin = round(Cumulative_Count/(&NoOfRecords/&NoOfBins) - 0.5) + 1;
if Bin GT &NoOfBins then Bin = &NoOfBins;
run;

proc sql;
create table Gains_v1 as
select Bin, count(*) as freq, sum(Default_On_Payment) as Actual,
sum(Predicted_Default_Probability) as Predicted from Default_On_Payment_Test_v3
group by Bin;
quit;

proc print data = Gains_V1; run;
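The binning DATA step derives each record's decile from its running row count: Bin = round(cum/(N/bins) - 0.5) + 1, capped at the last bin. A small Python sketch of the same arithmetic, here with 25 records in 5 bins so the cap is visible (note that Python's built-in round uses banker's rounding, so SAS's round-half-away-from-zero is emulated with floor(x + 0.5)):

```python
import math

def assign_bins(n_records, n_bins):
    """Replicate the SAS logic: Bin = round(cum/(N/bins) - 0.5) + 1, capped."""
    bin_size = n_records / n_bins
    bins = []
    for cum in range(1, n_records + 1):           # running Cumulative_Count
        b = math.floor(cum / bin_size - 0.5 + 0.5) + 1  # SAS-style ROUND(x) + 1
        bins.append(min(b, n_bins))               # cap at the last bin
    return bins

bins = assign_bins(25, 5)
print([bins.count(b) for b in range(1, 6)])  # records per bin
```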

Logistic Regression - Gains Curve and Gini (Validation)

Bin | # Observations | Actual Default | Predicted Default | Cumulative Actual Default | Lift | Random | Area
0  | 0    | 0    | 0.0000   | 0    | 0.00%   | 0.00%   | 0.0000
1  | 1606 | 1227 | 1284.957 | 1227 | 25.64%  | 10.00%  | 0.0128
2  | 1606 | 1101 | 970.2386 | 2328 | 48.65%  | 20.00%  | 0.0371
3  | 1606 | 678  | 765.3812 | 3006 | 62.82%  | 30.00%  | 0.0557
4  | 1606 | 620  | 574.6954 | 3626 | 75.78%  | 40.00%  | 0.0693
5  | 1606 | 401  | 425.5729 | 4027 | 84.16%  | 50.00%  | 0.0800
6  | 1607 | 278  | 308.8064 | 4305 | 89.97%  | 60.00%  | 0.0871
7  | 1606 | 197  | 216.2890 | 4502 | 94.09%  | 70.00%  | 0.0920
8  | 1606 | 189  | 149.3579 | 4691 | 98.04%  | 80.00%  | 0.0961
9  | 1606 | 79   | 82.53703 | 4770 | 99.69%  | 90.00%  | 0.0989
10 | 1607 | 15   | 34.83957 | 4785 | 100.00% | 100.00% | 0.0998

Gini coefficient = 45.77%

Logistic Regression - K-S Stat (Validation)

Bin | # Obs | Actual Default | Actual Non-Default | Predicted Default | Cum. Default | Cum. Non-Default | Lift: Default | Lift: Non-Default | Random | Area | Difference: Cum. Default vs Non-Default
0  | 0    | 0    | 0    | 0    | 0    | 0     | 0.00%   | 0.00%   | 0.00%   | 0.0000 | 0.00%
1  | 1606 | 1227 | 379  | 1285 | 1227 | 379   | 25.64%  | 3.36%   | 10.00%  | 0.0128 | 22.28%
2  | 1606 | 1101 | 505  | 970  | 2328 | 884   | 48.65%  | 7.84%   | 20.00%  | 0.0371 | 40.81%
3  | 1606 | 678  | 928  | 765  | 3006 | 1812  | 62.82%  | 16.07%  | 30.00%  | 0.0557 | 46.75%
4  | 1606 | 620  | 986  | 575  | 3626 | 2798  | 75.78%  | 24.81%  | 40.00%  | 0.0693 | 50.97%
5  | 1606 | 401  | 1205 | 426  | 4027 | 4003  | 84.16%  | 35.50%  | 50.00%  | 0.0800 | 48.66%
6  | 1607 | 278  | 1329 | 309  | 4305 | 5332  | 89.97%  | 47.28%  | 60.00%  | 0.0871 | 42.69%
7  | 1606 | 197  | 1409 | 216  | 4502 | 6741  | 94.09%  | 59.78%  | 70.00%  | 0.0920 | 34.31%
8  | 1606 | 189  | 1417 | 149  | 4691 | 8158  | 98.04%  | 72.34%  | 80.00%  | 0.0961 | 25.69%
9  | 1606 | 79   | 1527 | 83   | 4770 | 9685  | 99.69%  | 85.88%  | 90.00%  | 0.0989 | 13.80%
10 | 1607 | 15   | 1592 | 35   | 4785 | 11277 | 100.00% | 100.00% | 100.00% | 0.0998 | 0.00%

Gini coefficient = 45.77%

The maximum difference between %Cumulative Default (Bad) and %Cumulative Non-Default (Good) is 50.97%, occurring at the 4th decile
At the 4th decile, the model captures 75.78% of total bads, i.e. by following the model we can capture 75.78% of total bads within 40% of the records
The K-S of the model is therefore 50.97%, at the 4th decile

Model Validation- Comparison of Training and Testing Results

Comparing the performance of the model on Training and Testing/Validation records:

Metric     | Training | Testing/Validation
Gini Coeff | 45.53%   | 45.77%
K-S        | 51.23%   | 50.97%

The numbers are comparable

Hence, we can conclude that the model is validated
Transformation of Probabilities to Scores
Decide upon the range of the score, i.e. its upper and lower limits
Say, Score_Lower_Bound (A) and Score_Upper_Bound (B)
Case 1: If we have modeled an outcome that has an adverse impact on business, say Default on
Payment, then
o The higher the probability of default, the lower the score
Case 2: If the modeled outcome is favorable to business, say new customer/business
acquisition, then
o The higher the probability of conversion, the higher the score
Finally, transform the probability in a linear fashion to get the scorecard
o Case 1: Score = B - probability * (B - A)
o Case 2: Score = A + probability * (B - A)
A scorecard for the Default on Payment model in the range (100, 200) can be created using
the transformation
o Score = 200 - probability * (200 - 100)
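The Case 1 mapping can be sketched as a one-line Python function, using the (100, 200) range from the example:

```python
def probability_to_score(p, lower=100, upper=200):
    """Case 1: adverse outcome, so a higher default probability
    maps linearly to a lower score."""
    return round(upper - p * (upper - lower))

print(probability_to_score(0.0))   # 200: no chance of default -> best score
print(probability_to_score(1.0))   # 100: certain default -> worst score
print(probability_to_score(0.25))  # 175
```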

Transformation of Probabilities to Scores
SAS code
*Scoring the training records in order to generate probability of default;
data logit.Default_On_Payment_Score_v1;
set logit.Default_On_Payment_Train_v1;
Predicted_Default_LP =
-3.3692
+0.3024*Inst_Rt_Income -0.0152*Age +0.2694*Num_CC +0.2899*Dependents
+1.7893*Status_Checking_Acc_dummy_1 +1.2806*Status_Checking_Acc_dummy_2
+0.5209*Status_Checking_Acc_dummy_3 +0.0198*Duration_in_Months_trans
+0.5512*Credit_History_dummy_1 +0.6816*Credit_History_dummy_2
-0.2567*Credit_History_dummy_3 -0.786*Credit_History_dummy_4
-1.3713*Purposre_Credit_Taken_dummy_1 -1.3503*Purposre_Credit_Taken_dummy_2
-0.2608*Purposre_Credit_Taken_dummy_3 +0.4131*Purposre_Credit_Taken_dummy_5
-0.2536*Purposre_Credit_Taken_dummy_6 +0.8517*Credit_Amount_dummy_1
+2.2031*Credit_Amount_dummy_2 -0.2248*Savings_Acc_dummy_1
-0.4838*Savings_Acc_dummy_2 -1.2521*Savings_Acc_dummy_3
-0.9065*Savings_Acc_dummy_4 +0.1769*Yrs_At_Present_Emp_dummy_2
-0.5525*Yrs_At_Present_Emp_dummy_3 +0.8177*Marital_Status_Gender_dummy_1
+0.5105*Marital_Status_Gender_dummy_2 +0.2111*Marital_Status_Gender_dummy_3
+0.4304*Other_Debtors_Guarantors_dummy_1 -0.9967*Other_Debtors_Guarantors_dummy_2
-0.245*Property_dummy_1 +0.3948*Property_dummy_3
+0.6512*Other_Inst_Plans_dummy_1 +0.4642*Other_Inst_Plans_dummy_2
+0.5017*Housing_dummy_1 -0.2523*Telephone_dummy_1
-1.1664*Foreign_Worker_dummy_1;
Predicted_Default_Probability = 1/(1+exp(-Predicted_Default_LP));

run;

*Transforming the probabilities to score in the range of 100-200;


data logit.Default_On_Payment_Score_v1;
set logit.Default_On_Payment_Score_v1;
Score = Round(200- Predicted_Default_Probability * (200-100));
Run;

Scorecard Performance Checks
Basic checks are:
1. How well the scorecard rank orders the Target Variable
2. Gini coefficient
3. K-S stat

Steps involved are:

1. Sort the data/records by Score
2. Create bins of equal weight (equal observation counts)
3. Calculate the average of the Target Variable in each bin
4. Calculate the Gini coefficient (as described in the earlier section)
5. Calculate K-S (as described in the earlier section)
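The first three steps amount to a small group-by pipeline. A compact Python sketch on a toy dataset (hypothetical scores and outcomes, for illustration only): sort by score, cut into equal-observation bins, and check that the average default rate falls as the score rises, i.e. that the scorecard rank-orders the target:

```python
# Toy records: (score, default_flag) -- hypothetical data for illustration.
records = [(110, 1), (125, 1), (138, 1), (142, 0), (155, 1),
           (161, 0), (170, 0), (178, 0), (185, 0), (192, 0)]

# Step 1: sort by score (ascending: low score = high default probability).
records.sort(key=lambda r: r[0])

# Step 2: equal-observation bins (here 5 bins of 2 records each).
n_bins = 5
size = len(records) // n_bins
bins = [records[i * size:(i + 1) * size] for i in range(n_bins)]

# Step 3: average of the target variable in each bin.
default_rate = [sum(d for _, d in b) / len(b) for b in bins]
print(default_rate)  # [1.0, 0.5, 0.5, 0.0, 0.0]

# Rank-ordering check: the default rate should be non-increasing.
assert all(a >= b for a, b in zip(default_rate, default_rate[1:]))
```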

Scorecard Performance Checks
SAS code
*Sorting the records by Score;
proc sort data = logit.Default_On_Payment_Score_v1; by Score; run;

*Create a variable that will store cumulative number of observations;


*Divide the data into 10 equal observation bins;
%Let NoOfRecords = 23938;
%Let NoOfBins = 10;
data logit.Default_On_Payment_Score_v2;
set logit.Default_On_Payment_Score_v1;
retain Cumulative_Count;
Count = 1;
Cumulative_Count = sum(Cumulative_Count, Count);
Bin = round(Cumulative_Count/(&NoOfRecords/&NoOfBins) - 0.5) + 1;
if Bin GT &NoOfBins then Bin = &NoOfBins;
run;

*Summarize the records by equal-observation Bins;


proc sql;
create table Gains_Score_v1 as
select Bin, count(*) as freq, sum(Default_On_Payment) as Actual
from logit.Default_On_Payment_Score_v2
group by Bin;
quit;

*Print the summarized table;


proc print data = Gains_Score_v1; run;

Scorecard Performance Checks- Rank Ordering and Gini
Records sorted by Score (ascending)

Equal Obs Bin | # Obs | # Default on Payment | Cumulative Default on Payment | Lift | Random | Area Under Curve
0  | 0    | 0    | 0    | 0.00%   | 0%   | 0.000
1  | 2393 | 1820 | 1820 | 25.37%  | 10%  | 0.01268293
2  | 2394 | 1676 | 3496 | 48.72%  | 20%  | 0.03704530
3  | 2394 | 1018 | 4514 | 62.91%  | 30%  | 0.05581882
4  | 2394 | 917  | 5431 | 75.69%  | 40%  | 0.06930314
5  | 2393 | 521  | 5952 | 82.95%  | 50%  | 0.07932404
6  | 2394 | 471  | 6423 | 89.52%  | 60%  | 0.08623693
7  | 2394 | 297  | 6720 | 93.66%  | 70%  | 0.09158885
8  | 2394 | 327  | 7047 | 98.22%  | 80%  | 0.09593728
9  | 2394 | 103  | 7150 | 99.65%  | 90%  | 0.09893380
10 | 2394 | 25   | 7175 | 100.00% | 100% | 0.09982578

Gini coefficient = 45.34%

Scorecard Performance Checks- KS Stat
K-S: the maximum separation between good and bad

Equal Obs Bin | # Obs | # Default on Payment | # Non-Default on Payment | Cum. Default on Payment | Cum. Non-Default on Payment | Lift: Default | Lift: Non-Default | Random | Difference: Cum. Default vs Non-Default
0  | 0    | 0    | 0    | 0    | 0     | 0.00%   | 0.00%   | 0%   | 0.00%
1  | 2393 | 1820 | 573  | 1820 | 573   | 25.37%  | 3.42%   | 10%  | 21.95%
2  | 2394 | 1676 | 718  | 3496 | 1291  | 48.72%  | 7.70%   | 20%  | 41.02%
3  | 2394 | 1018 | 1376 | 4514 | 2667  | 62.91%  | 15.91%  | 30%  | 47.00%
4  | 2394 | 917  | 1477 | 5431 | 4144  | 75.69%  | 24.72%  | 40%  | 50.97%
5  | 2393 | 521  | 1872 | 5952 | 6016  | 82.95%  | 35.89%  | 50%  | 47.07%
6  | 2394 | 471  | 1923 | 6423 | 7939  | 89.52%  | 47.36%  | 60%  | 42.16%
7  | 2394 | 297  | 2097 | 6720 | 10036 | 93.66%  | 59.87%  | 70%  | 33.79%
8  | 2394 | 327  | 2067 | 7047 | 12103 | 98.22%  | 72.20%  | 80%  | 26.02%
9  | 2394 | 103  | 2291 | 7150 | 14394 | 99.65%  | 85.87%  | 90%  | 13.78%
10 | 2394 | 25   | 2369 | 7175 | 16763 | 100.00% | 100.00% | 100% | 0.00%

The maximum difference between %Cumulative Default (Bad) and %Cumulative Non-Default (Good) is 50.97%, occurring at the 4th decile
At the 4th decile, the scorecard captures 75.69% of total bads, i.e. by following the scorecard we can capture 75.69% of total bads within 40% of the records
The K-S of the scorecard is therefore 50.97%, at the 4th decile

Scorecard Performance Checks- Comparison with Model Results

Comparing the performance of the Scorecard with the original model results:

Metric     | Model  | Scorecard
Gini Coeff | 45.53% | 45.34%
K-S        | 51.23% | 50.97%

A properly transformed scorecard produces results similar to those obtained from
the underlying model output
