The_Analytics_Edge_edX_MIT15.071x_June2015_2
Course Notes by Nishant Upadhyay Unit 3
26 June 2015
Unit 3: Logistic Regression
Preliminaries
1. Modeling the expert_An introduction to Logistic Regression: VIDEOS 1 to 3
1.1 VIDEO 4: LOGISTIC REGRESSION IN R (R script reproduced here)
1.2 VIDEO 5: THRESHOLDING (R script reproduced here)
1.3 VIDEO 6: ROC CURVES (R script reproduced here)
1.4 VIDEO 7: INTERPRETING THE MODEL
2. THE FRAMINGHAM HEART STUDY_Evaluating Risk Factors to Save Lives
2.1 VIDEO 1: THE FRAMINGHAM HEART STUDY
2.2 VIDEO 2: RISK FACTORS
2.3 VIDEO 3: A LOGISTIC REGRESSION MODEL (R script reproduced here)
2.4 VIDEO 4: VALIDATING THE MODEL
3. ELECTION FORECASTING_Predicting the Winner Before any Votes are Cast (recitation)
3.1 VIDEO 3: A SOPHISTICATED BASELINE METHOD (R script reproduced here)
3.2 VIDEO 4: LOGISTIC REGRESSION MODELS (R script reproduced here)
3.3 VIDEO 5: TEST SET PREDICTIONS (R script reproduced here)
Assignment 3: POPULARITY OF MUSIC RECORDS
Assignment 3: PREDICTING PAROLE VIOLATORS
Assignment 3: PREDICTING LOAN REPAYMENT
Assignment 3: PREDICTING THE BASEBALL WORLD SERIES CHAMPION (OPTIONAL)


Source file: The_Analytics_Edge_edX_MIT062015_2.Rmd

Unit 3: Logistic Regression


Preliminaries
These are my notes for the lectures of The_Analytics_Edge_edX_MIT15.071x_June2015 (https://courses.edx.org/courses/course-
v1:MITx+15.071x_2a+2T2015) by Professor Dimitris Bertsimas. The goal of these notes is to provide reproducible R code for all the
lectures.

I am also pursuing the course offered by Coursera: Applied Logistic Regression (https://class.coursera.org/logisticregression-001) by
Professor Stanley Lemeshow.

A good list of resources on using R for logistic regression is given below:

R Data Analysis Examples: Logit Regression_UCLA (http://www.ats.ucla.edu/stat/r/dae/logit.htm)


Quick-R: Logistic-regression by Robert Kabacoff (http://www.r-tutor.com/elementary-statistics/logistic-regression)
R-Tutorials: LOGISTIC REGRESSION by William King (http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html)
Win-Vector Blog (http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/)
GerardNico website (http://gerardnico.com/wiki/data_mining/simple_logistic_regression)
rstatistics.net: LOGISTIC REGRESSION: A DETAILED DISCUSSION (http://rstatistics.net/logistic-regression/)
Logistic Regression and Generalised Linear Models:CRAN (http://cran.r-
project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf)
Generalized Linear Models_princeton (http://data.princeton.edu/R/glms.html)
logistic regression using R(nlp.stanford.edu) (http://nlp.stanford.edu/manning/courses/ling289/logistic.pdf)
Logistic regression_Cookbook for R (http://www.cookbook-r.com/Statistical_analysis/Logistic_regression/)
Binary logistic regression(www.uni-kiel.de) (http://www.uni-kiel.de/psychologie/rexrepos/posts/regressionLogistic.html)
Logistic Regression (R):Statistical Consulting Group (http://scg.sdsu.edu/logit_r/)
An introduction to logistic regression in r_ Rohan Palmer (http://psych.colorado.edu/wiki/lib/exe/fetch.php?
media=labs:learnr:rohan_-_logistical_regression_in_r:an_introduction_to_logistic_regression_in_r.pdf)
Logistic Regression Using R_Joseph M Hilbe (http://works.bepress.com/cgi/viewcontent.cgi?article=1031&context=joseph_hilbe)
RPubs - Classification_ Logistic Regression_Ryan Kelly (http://rpubs.com/ryankelly/21379)
Logistic regression_Michael Friendly (http://www.psych.yorku.ca/lab/psy6140/lectures/LogisticRegression2x2.pdf)
Logistic regression_ complete notes_hadley (http://had.co.nz/notes/modelling/logistic-regression.html)
Section 10: Logistic Regression(www.cs.tcd.ie) (https://www.cs.tcd.ie/~houldinb/Index/MVA_les/Lab10-R.pdf)
Visualizing Categorical Data with SAS and R_Michael friendly (http://www.datavis.ca/courses/VCD/vcd4-handout-2x2.pdf)
Logistic regression (http://www.stat.wisc.edu/~ane/st572/notes/lec06.pdf)
Using Logistic Regression in R: Emma Gertlowitz (http://gertlowitz.blogspot.in/2013/06/using-logistic-regression-in-r.html)
Applied Linear Regression (http://homepage.stat.uiowa.edu/~rdecook/stat3200/notes/LogReg_part2_4pp.pdf)
Notes on Logistic Regression with Examples in R (http://jocelynchi.com/notes-on-logistic-regression-with-examples-in-r/)
Model Performance metrics: How well does my model perform? - Part 2: Analytics Vidhya
(http://www.analyticsvidhya.com/blog/2015/01/model-perform-part-2/)
Simple guide to confusion matrix terminology (http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
ROC curves and Area Under the Curve explained (video) (http://www.dataschool.io/roc-curves-and-auc-explained/)
An introduction to ROC analysis_Tom Fawcett_pdf (http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf)

How to assess quality and correctness of classification models? Part 4 - ROC Curve
(http://www.datasciencecentral.com/proles/blogs/how-to-assess-quality-and-correctness-of-classication-models)

1. Modeling the expert_An introduction to Logistic Regression: VIDEOS 1 to 3


QUICK QUESTION

Q:Which of the following dependent variables are categorical? (Select all that apply.)

Ans: Deciding whether to buy, sell, or hold a stock; The winner of an election with two candidates; The day of the week with the highest revenue; Whether or not revenue will exceed $50,000

EXPLANATION: The weekly revenue of a company is not categorical, since it has a large number of possible values on a continuous range. The
number of daily car thefts in New York City is also not categorical, because the number of car thefts could range from 0 to hundreds. On the other
hand, the other options each have a limited number of possible outcomes.

Q:Which of the following dependent variables are binary? (Select all that apply.)

Ans:The winner of an election with two candidates, Whether or not revenue will exceed $50,000

EXPLANATION:The only variables with two possible outcomes are the winner of an election with two candidates, and whether or not revenue will
exceed $50,000.

QUICK QUESTION

Q: Suppose the coefficients of a logistic regression model with two independent variables are as follows:

β0 = -1.5, β1 = 3, β2 = -0.5

And we have an observation with the following values for the independent variables:

x1 = 1, x2 = 5

So the logit (i.e. log(Odds)) equation becomes:

log(Odds) = β0 + β1*x1 + β2*x2

What is the value of the Logit for this observation? Recall that the Logit is log(Odds).

#substituting the values of betas and x's in the logit eqn, we get
logit<- -1.5 + 3*1 + -0.5*5
logit

## [1] -1

Ans:-1

EXPLANATION: The Logit is just log(Odds), and looks like the linear regression equation. So the Logit is -1.5 + 3(1) - 0.5(5) = -1.

Q:What is the value of the Odds for this observation? Note that you can compute e^x, for some number x, in your R console by typing exp(x).
The function exp() computes the exponential of its argument.

We know that Odds = e^(β0 + β1*x1 + β2*x2) = e^logit

Odds<-exp(logit)
Odds

## [1] 0.3678794

Ans:0.3678794

EXPLANATION:Using the value of the Logit from the previous question, we have that Odds = e^(-1) = 0.3678794.

Q:What is the value of P(y = 1) for this observation?


We know that the Odds are

Odds = P(y = 1) / P(y = 0)

and the Logistic Response Function is:

P(y = 1) = 1 / (1 + e^-(β0 + β1*x1 + β2*x2)) = 1 / (1 + e^-logit)

#From the logistic function we get


1/(1 + exp(-(-1)))

## [1] 0.2689414

Ans:0.2689414

EXPLANATION:Using the Logistic Response Function, we can compute that P(y = 1) = 1/(1 + e^(-Logit)) = 1/(1 + e^(1)) = 0.2689414.
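
The three quick questions above chain together: logit, then odds, then probability. A minimal R sketch that recomputes them with the same hypothetical coefficients and observation (the helper name logistic_prob is mine, not from the course script):

# Sketch only: logit -> odds -> probability for the quick-question values
logistic_prob <- function(logit) 1 / (1 + exp(-logit))   # logistic response function

logit <- -1.5 + 3*1 + (-0.5)*5   # beta0 + beta1*x1 + beta2*x2
odds  <- exp(logit)              # Odds = e^logit
prob  <- logistic_prob(logit)    # P(y = 1)
c(logit = logit, odds = odds, probability = prob)

## logit: -1, odds: 0.3678794, probability: 0.2689414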

1.1 VIDEO 4: LOGISTIC REGRESSION IN R (R script reproduced here)


The variables in the dataset quality.csv are as follows:

MemberID numbers the patients from 1 to 131, and is just an identifying number.

InpatientDays is the number of inpatient visits, or number of days the person spent in the hospital.

ERVisits is the number of times the patient visited the emergency room.

OfficeVisits is the number of times the patient visited any doctor's office.

Narcotics is the number of prescriptions the patient had for narcotics.

DaysSinceLastERVisit is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).

Pain is the number of visits for which the patient complained about pain.

TotalVisits is the total number of times the patient visited any healthcare provider.

ProviderCount is the number of providers that served the patient.

MedicalClaims is the number of days on which the patient had a medical claim.

ClaimLines is the total number of medical claims.

StartedOnCombination is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).

AcuteDrugGapSmall is the fraction of acute drugs that were refilled quickly after the prescription ran out.

PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

# Unit 3, Modeling the Expert

# Video 4

# Read in dataset
quality = read.csv("quality.csv")

# Look at structure
str(quality)

## 'data.frame': 131 obs. of 14 variables:


## $ MemberID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ InpatientDays : int 0 1 0 0 8 2 16 2 2 4 ...
## $ ERVisits : int 0 1 0 1 2 0 1 0 1 2 ...
## $ OfficeVisits : int 18 6 5 19 19 9 8 8 4 0 ...
## $ Narcotics : int 1 1 3 0 3 2 1 0 3 2 ...
## $ DaysSinceLastERVisit: num 731 411 731 158 449 ...
## $ Pain : int 10 0 10 34 10 6 4 5 5 2 ...
## $ TotalVisits : int 18 8 5 20 29 11 25 10 7 6 ...
## $ ProviderCount : int 21 27 16 14 24 40 19 11 28 21 ...
## $ MedicalClaims : int 93 19 27 59 51 53 40 28 20 17 ...
## $ ClaimLines : int 222 115 148 242 204 156 261 87 98 66 ...
## $ StartedOnCombination: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ AcuteDrugGapSmall : int 0 1 5 0 0 4 0 0 0 0 ...
## $ PoorCare : int 0 0 0 0 0 1 0 0 1 0 ...

# Table outcome for DV


table(quality$PoorCare)

##
## 0 1
## 98 33

# Baseline accuracy (in a classification problem, the standard baseline method is to just predict the most frequent outcome for all observations)
98/131

## [1] 0.7480916

#This is the baseline model which we will try to beat with our logistic model

# Install and load caTools package


#install.packages("caTools")
library(caTools)

##
## Attaching package: 'caTools'


## The following objects are masked from 'package:base64enc':


##
## base64decode, base64encode

# Randomly split data


set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75) # 75% of the data goes into the training set; sample.split keeps the ratio of the DV the same in both sets
split # TRUE means that observation goes into the training set, FALSE means it goes into the testing set

## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [34] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [45] FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [67] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [111] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [122] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE

# Create training and testing sets


qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)

nrow(qualityTrain)

## [1] 99

nrow(qualityTest)

## [1] 32

# Logistic Regression Model


QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)

##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4

# Make predictions on training set


predictTrain = predict(QualityLog, type="response") #type="response" gives us probabilities

# Analyze predictions
summary(predictTrain) # since we are dealing with probabilities, all the nos will be between 0 & 1

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460


#lets see if we are predicting higher probabilities for the actual poorcare cases as we expect
tapply(predictTrain, qualityTrain$PoorCare, mean) # average prediction for each of the actual outcomes

## 0 1
## 0.1894512 0.4392246

QUICK QUESTION

Q:In R, create a logistic regression model to predict PoorCare using the independent variables StartedOnCombination and ProviderCount.
Use the training set we created in the previous video to build the model.

# Logistic Regression Model on the training set


QualityLog1 = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial)
summary(QualityLog1)

##
## Call:
## glm(formula = PoorCare ~ StartedOnCombination + ProviderCount,
## family = binomial, data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.61826 -0.72782 -0.64555 -0.08407 1.94662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.00097 0.55097 -3.632 0.000282 ***
## StartedOnCombinationTRUE 1.95230 1.22342 1.596 0.110541
## ProviderCount 0.03366 0.01983 1.697 0.089706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.89 on 98 degrees of freedom
## Residual deviance: 104.37 on 96 degrees of freedom
## AIC: 110.37
##
## Number of Fisher Scoring iterations: 4

#Ans:1.95230

QUICK QUESTION

Q:StartedOnCombination is a binary variable, which equals 1 if the patient is started on a combination of drugs to treat their diabetes, and equals
0 if the patient is not started on a combination of drugs. All else being equal, does this model imply that starting a patient on a combination of
drugs is indicative of poor care, or good care?

Ans:Poor Care

EXPLANATION: The coefficient value is positive, meaning that positive values of the variable make the outcome of 1 more likely. This corresponds
to Poor Care.
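
As a small aside (not part of the course script), exponentiating the coefficient turns it into an odds ratio, which makes the "poor care" interpretation concrete:

# Sketch: multiplicative effect on the odds of PoorCare when StartedOnCombination is TRUE
exp(coef(QualityLog1)["StartedOnCombinationTRUE"])
## roughly 7: starting a patient on a combination of drugs multiplies the odds of poor care
## by about 7, holding ProviderCount fixed (note the coefficient is not significant at the 0.05 level)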

1.2 VIDEO 5: THRESHOLDING (R script reproduced here)

# Video 5

# Confusion matrix for threshold of 0.5


table(qualityTrain$PoorCare, predictTrain > 0.5) # first arg gives the rows (true outcomes), second arg gives the column labels (predicted outcomes)

##
## FALSE TRUE
## 0 70 4
## 1 15 10

# Sensitivity and specificity


10/25

## [1] 0.4

70/74

## [1] 0.9459459


# Confusion matrix for increased threshold of 0.7


table(qualityTrain$PoorCare, predictTrain > 0.7)

##
## FALSE TRUE
## 0 73 1
## 1 17 8

# Sensitivity and specificity


8/25 #Sensitivity=TP/total no of positive cases

## [1] 0.32

73/74 #specificity=TN/total no of negative cases

## [1] 0.9864865

# Confusion matrix for a decreased threshold of 0.2


table(qualityTrain$PoorCare, predictTrain > 0.2)

##
## FALSE TRUE
## 0 54 20
## 1 9 16

# Sensitivity and specificity


16/25

## [1] 0.64

54/74

## [1] 0.7297297

#from the above we can see that a higher threshold value gives LOWER sensitivity & HIGHER specificity, and a lower threshold value gives HIGHER sensitivity & LOWER specificity
#So the big question is how to select the threshold value. This is done in the next section
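
Before moving on, a small helper (not from the course script; the function name sens_spec is mine) packages the two hand computations above. It assumes a 2x2 confusion matrix laid out as table(actual, predicted), with outcome 0 in the first row and outcome 1 in the second:

sens_spec <- function(cm) {
  c(sensitivity = cm[2, 2] / sum(cm[2, ]),   # TP / (TP + FN)
    specificity = cm[1, 1] / sum(cm[1, ]))   # TN / (TN + FP)
}

sens_spec(table(qualityTrain$PoorCare, predictTrain > 0.5))
## sensitivity 0.40, specificity ~0.946, matching the values computed above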

CONFUSION MATRIX OR CLASSIFICATION MATRIX

(Figure: confusion matrix layout from the quick question, recreated in R below.)

#lets recreate the confusion matrices given in the Qs

#Confusion Matrix #1:

conf1<-matrix(c(15,10,5,20),ncol=2,byrow=TRUE)
colnames(conf1)<-c("Predicted = 0","Predicted = 1")
rownames(conf1)<-c("Actual = 0","Actual = 1")
conf1<-as.table(conf1)
conf1

## Predicted = 0 Predicted = 1
## Actual = 0 15 10
## Actual = 1 5 20


#Confusion Matrix #2:


conf2<-matrix(c(20,5,10,15),ncol=2,byrow=TRUE)
colnames(conf2)<-c("Predicted = 0","Predicted = 1")
rownames(conf2)<-c("Actual = 0","Actual = 1")
conf2<-as.table(conf2)
conf2

## Predicted = 0 Predicted = 1
## Actual = 0 20 5
## Actual = 1 10 15

#QUICK QUESTION

#What is the sensitivity of Confusion Matrix #1?


20/(5+20)

## [1] 0.8

#Ans:0.8
#EXPLANATION:The sensitivity of a confusion matrix is the true positives, divided by the true positives plus the false negatives. In this case, it is 20/(20+5) = 0.8

#What is the specificity of Confusion Matrix #1?


15/(15+10)

## [1] 0.6

#Ans:0.6
#EXPLANATION:The specificity of a confusion matrix is the true negatives, divided by the true negatives plus the false positives. In this case, it is 15/(15+10) = 0.6

#########################

#QUICK QUESTION

#To go from Confusion Matrix #1 to Confusion Matrix #2, did we increase or decrease the threshold value?

#lets calc sensitivity for Confusion Matrix #2


15/(10+15)

## [1] 0.6

#Ans:We increased the threshold value.


#EXPLANATION:We predict the outcome 1 less often in Confusion Matrix #2. This means we must have increased the threshold. (Also, we see that the sensitivity has dropped from the first matrix to the second, implying that the threshold value is higher for the second confusion matrix.)

1.3 VIDEO 6: ROC CURVES (R script reproduced here)

# Install and load ROCR package


#install.packages("ROCR")
library(ROCR)

## Loading required package: gplots

##
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':


##
## lowess

# Prediction function
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)

# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr") #This defines what we want to plot on x & y axis

# Plot ROC curve


plot(ROCRperf)


# Add colors
plot(ROCRperf, colorize=TRUE)

# Add threshold labels


plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

This question will ask about the following ROC curve:


ROC-Threshold Curve

QUICK QUESTION

Q:Given this ROC curve, which threshold would you pick if you wanted to correctly identify a small group of patients who are receiving the worst
care with high confidence?

Ans:t=0.7

EXPLANATION:The threshold 0.7 is best to identify a small group of patients who are receiving the worst care with high confidence, since at this
threshold we make very few false positive mistakes, and identify about 35% of the true positives. The threshold t = 0.8 is not a good choice,
since it makes about the same number of false positives, but only identifies 10% of the true positives. The thresholds 0.2 and 0.3 both identify
more of the true positives, but they make more false positive mistakes, so our confidence decreases.

Q:Which threshold would you pick if you wanted to correctly identify half of the patients receiving poor care, while making as few errors as
possible?

Ans:t=0.3

EXPLANATION:The threshold 0.3 is the best choice in this scenario. The threshold 0.2 also identifies over half of the patients receiving poor care,
but it makes many more false positive mistakes. The thresholds 0.7 and 0.8 don't identify at least half of the patients receiving poor care.
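
The static ROC figure above only labels a handful of cutoffs. If you want the exact numbers behind the curve, the ROCR performance object built in Video 6 stores each cutoff with its false and true positive rates. A minimal sketch (assuming the ROCRperf object created earlier; the cutoffs data frame name is mine):

# Sketch: pull the threshold/FPR/TPR triples that the ROC plot is drawn from
cutoffs <- data.frame(threshold = ROCRperf@alpha.values[[1]],
                      fpr       = ROCRperf@x.values[[1]],
                      tpr       = ROCRperf@y.values[[1]])
subset(cutoffs, threshold >= 0.25 & threshold <= 0.35)  # e.g. inspect cutoffs near t = 0.3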

1.4 VIDEO 7: INTERPRETING THE MODEL


QUICK QUESTION

IMPORTANT NOTE: This question uses the original model with the independent variables OfficeVisits and Narcotics. Be sure to use this
model, instead of the model you built in Quick Question 4.

# Logistic Regression Model


QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)


##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4

#Compute the test set predictions in R by running the command:


predictTest = predict(QualityLog, type="response", newdata=qualityTest)

#You can compute the test set AUC by running the following two commands in R:
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)

auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)


auc

## [1] 0.7994792

#What is the AUC of this model on the test set?


#Ans:0.7994792
#The AUC of a model has the following nice interpretation: given a random patient from the dataset who actually received poor care, and a random patient from the dataset who actually received good care, the AUC is the percentage of time that our model will classify which is which correctly.
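
# A rough illustration of that interpretation (not part of the course script; the poor/good names are mine):
# repeatedly pair a random actual poor-care patient with a random actual good-care patient from the test set
# and check how often the model gives the poor-care patient the higher predicted probability.
set.seed(1)
poor <- predictTest[qualityTest$PoorCare == 1]
good <- predictTest[qualityTest$PoorCare == 0]
mean(sample(poor, 10000, replace = TRUE) > sample(good, 10000, replace = TRUE))
## close to the test-set AUC of ~0.80 computed above (ties are ignored in this rough check)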

2. THE FRAMINGHAM HEART STUDY_Evaluating Risk Factors to Save Lives


2.1 VIDEO 1: THE FRAMINGHAM HEART STUDY
QUICK QUESTION

Q:Why was the city of Framingham, Massachusetts selected for this study? Select all that apply.

Ans:It had an appropriate size, It had a stable population to observe over time, The doctors and residents in Framingham were willing to
participate

EXPLANATION:The reasons for Framingham being selected for this study are listed on Slide 4 of the previous video: it had an appropriate size, it
had a stable population, and the doctors and residents in the town were willing to participate. However, the city did not represent all types of
people in the United States (we'll see later in the lecture how to extend the model to different populations) and there were not an abnormally large
number of people with heart disease.

2.2 VIDEO 2: RISK FACTORS


QUICK QUESTION

Q:Are risk factors the independent variables or the dependent variables in our model?

Ans:Independent Variables

EXPLANATION:Risk factors are the independent variables in our model, and are what we will use to predict the dependent variable.

Q:In many situations, a dataset is handed to you and you are tasked with discovering which variables are important. But for the Framingham
Heart Study, the researchers had to collect data from patients. In a situation like this one, where data needs to be collected by the researchers,
should the potential risk factors be defined before or after the data is collected?

Ans:Before

EXPLANATION:The researchers should first hypothesize potential risk factors, and then collect data corresponding to those risk factors. Of
course, they could always define more risk factors later and collect more data, but this data would take longer to collect.

2.3 VIDEO 3: A LOGISTIC REGRESSION MODEL (R script reproduced here)


# Unit 3, The Framingham Heart Study

# Video 3

# Read in the dataset


framingham = read.csv("framingham.csv")

# Look at structure
str(framingham)

## 'data.frame': 4240 obs. of 16 variables:


## $ male : int 1 0 1 0 0 0 0 0 1 1 ...
## $ age : int 39 46 48 61 46 43 63 45 52 43 ...
## $ education : int 4 2 1 3 3 2 1 2 1 1 ...
## $ currentSmoker : int 0 0 1 1 1 0 0 1 0 1 ...
## $ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
## $ BPMeds : int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentHyp : int 0 0 0 1 0 1 0 0 1 1 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
## $ sysBP : num 106 121 128 150 130 ...
## $ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
## $ BMI : num 27 28.7 25.3 28.6 23.1 ...
## $ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
## $ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
## $ TenYearCHD : int 0 0 0 1 0 0 1 0 0 0 ...

# Load the library caTools


library(caTools)

# Randomly split the data into training and testing sets


set.seed(1000)
split = sample.split(framingham$TenYearCHD, SplitRatio = 0.65)

# Split up the data using subset


train = subset(framingham, split==TRUE)
test = subset(framingham, split==FALSE)

# Logistic Regression Model


framinghamLog = glm(TenYearCHD ~ ., data = train, family=binomial)
summary(framinghamLog)


##
## Call:
## glm(formula = TenYearCHD ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8487 -0.6007 -0.4257 -0.2842 2.8369
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.886574 0.890729 -8.854 < 2e-16 ***
## male 0.528457 0.135443 3.902 9.55e-05 ***
## age 0.062055 0.008343 7.438 1.02e-13 ***
## education -0.058923 0.062430 -0.944 0.34525
## currentSmoker 0.093240 0.194008 0.481 0.63080
## cigsPerDay 0.015008 0.007826 1.918 0.05514 .
## BPMeds 0.311221 0.287408 1.083 0.27887
## prevalentStroke 1.165794 0.571215 2.041 0.04126 *
## prevalentHyp 0.315818 0.171765 1.839 0.06596 .
## diabetes -0.421494 0.407990 -1.033 0.30156
## totChol 0.003835 0.001377 2.786 0.00533 **
## sysBP 0.011344 0.004566 2.485 0.01297 *
## diaBP -0.004740 0.008001 -0.592 0.55353
## BMI 0.010723 0.016157 0.664 0.50689
## heartRate -0.008099 0.005313 -1.524 0.12739
## glucose 0.008935 0.002836 3.150 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2020.7 on 2384 degrees of freedom
## Residual deviance: 1792.3 on 2369 degrees of freedom
## (371 observations deleted due to missingness)
## AIC: 1824.3
##
## Number of Fisher Scoring iterations: 5

# Predictions on the test set


predictTest = predict(framinghamLog, type="response", newdata=test)

# Confusion matrix with threshold of 0.5


table(test$TenYearCHD, predictTest > 0.5)

##
## FALSE TRUE
## 0 1069 6
## 1 187 11

# Overall Accuracy of the model


(1069+11)/(1069+6+187+11)

## [1] 0.8483896

#this needs to be compared with Baseline model

# Baseline accuracy/model
(1069+6)/(1069+6+187+11)

## [1] 0.8444619

#Our model barely beats the baseline model.Is our model worthy?

# Test set AUC


library(ROCR)
ROCRpred = prediction(predictTest, test$TenYearCHD)
as.numeric(performance(ROCRpred, "auc")@y.values)

## [1] 0.7421095

#The model can differentiate low-risk from high-risk patients out of sample, i.e. the test-set AUC is 0.74

QUICK QUESTION

In the previous video, we computed the following confusion matrix for our logistic regression model on our test set with a threshold of 0.5:

    FALSE  TRUE
  0  1069     6
  1   187    11

Using this confusion matrix, answer the following questions.

Q:What is the sensitivity of our logistic regression model on the test set, using a threshold of 0.5?

11/(187+11)

## [1] 0.05555556

Ans:0.05555556

Q:What is the specificity of our logistic regression model on the test set, using a threshold of 0.5?

1069/(1069+6)

## [1] 0.9944186

Ans:0.9944186

EXPLANATION:Using this confusion matrix, we can compute that the sensitivity is 11/(11+187) and the specificity is 1069/(1069+6).

2.4 VIDEO 4: VALIDATING THE MODEL


QUICK QUESTION

Q:For which of the following models should external validation be used? Consider both the population used to train the model, and the
population that the model will be used on. (Select all that apply.)

Ans:

A model to predict obesity risk. Data from a random sample of California residents was used to build the model, and we want to use the model to
predict the obesity risk of all United States residents.

A model to predict the probability of a runner winning a marathon. Data from all runners in the Boston Marathon was used to build the model, and
we want to use the model to predict the probability of winning for all people who run marathons.

EXPLANATION:In the first and third models, we are using a special sub-population to build the model. While we can use the model for that sub-
population, we should use external validation to test the model on other populations. The second model uses data from a special sub-population,
but the model is only intended for that sub-population, so external validation is not necessary.

VIDEO 5: INTERVENTIONS

QUICK QUESTION

Q:In Video 3, we built a logistic regression model and found that the following variables were significant (or almost significant) for predicting ten-year risk of CHD: male, age, number of cigarettes per day, whether or not the patient previously had a stroke, whether or not the patient is currently hypertensive, total cholesterol level, systolic blood pressure, and blood glucose level. Which one of the following variables would be the most dramatically affected by a behavioral intervention? HINT: Think about how much control the patient has over each of the variables.

Ans:Number of Cigarettes per day

EXPLANATION:The number of cigarettes smoked per day would be the most dramatically affected by a behavioral intervention. This is the variable that the patient has the most ability to control.

3. ELECTION FORECASTING_Predicting the Winner Before any Votes are Cast (recitation)


VIDEO 2: DEALING WITH MISSING DATA (R script reproduced here)

# Unit 3, Recitation

# Video 2

# Read in data
polling = read.csv("PollingData.csv")
str(polling)

## 'data.frame': 145 obs. of 7 variables:


## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
## $ Rasmussen : int 11 21 NA 16 5 5 8 7 10 NA ...
## $ SurveyUSA : int 18 25 NA NA 15 NA NA 5 NA NA ...
## $ DiffCount : int 5 5 1 6 8 9 4 8 5 2 ...
## $ PropR : num 1 1 1 1 1 ...
## $ Republican: int 1 1 1 1 1 1 1 1 1 1 ...

table(polling$Year)


##
## 2004 2008 2012
## 50 50 45

summary(polling) # to see the number of NAs

## State Year Rasmussen SurveyUSA


## Arizona : 3 Min. :2004 Min. :-41.0000 Min. :-33.0000
## Arkansas : 3 1st Qu.:2004 1st Qu.: -8.0000 1st Qu.:-11.7500
## California : 3 Median :2008 Median : 1.0000 Median : -2.0000
## Colorado : 3 Mean :2008 Mean : 0.0404 Mean : -0.8243
## Connecticut: 3 3rd Qu.:2012 3rd Qu.: 8.5000 3rd Qu.: 8.0000
## Florida : 3 Max. :2012 Max. : 39.0000 Max. : 30.0000
## (Other) :127 NA's :46 NA's :71
## DiffCount PropR Republican
## Min. :-19.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 1.000 Median :0.6250 Median :1.0000
## Mean : -1.269 Mean :0.5259 Mean :0.5103
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. : 11.000 Max. :1.0000 Max. :1.0000
##

# Install and load mice package


#install.packages("mice") #Multiple Imputation by Chained Equations(mice) package
library(mice)

## Loading required package: Rcpp

## mice 2.25 2015-11-09

##
## Attaching package: 'mice'

## The following object is masked from 'package:tidyr':


##
## complete

# Multiple imputation
simple = polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")] # creating a smaller dataset with fewer variables
summary(simple)

## Rasmussen SurveyUSA PropR DiffCount


## Min. :-41.0000 Min. :-33.0000 Min. :0.0000 Min. :-19.000
## 1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.:0.0000 1st Qu.: -6.000
## Median : 1.0000 Median : -2.0000 Median :0.6250 Median : 1.000
## Mean : 0.0404 Mean : -0.8243 Mean :0.5259 Mean : -1.269
## 3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.:1.0000 3rd Qu.: 4.000
## Max. : 39.0000 Max. : 30.0000 Max. :1.0000 Max. : 11.000
## NA's :46 NA's :71

set.seed(144)
imputed = complete(mice(simple)) #dataset with the imputed values for the NAs


##
## iter imp variable
## 1 1 Rasmussen SurveyUSA
## 1 2 Rasmussen SurveyUSA
## 1 3 Rasmussen SurveyUSA
## 1 4 Rasmussen SurveyUSA
## 1 5 Rasmussen SurveyUSA
## 2 1 Rasmussen SurveyUSA
## 2 2 Rasmussen SurveyUSA
## 2 3 Rasmussen SurveyUSA
## 2 4 Rasmussen SurveyUSA
## 2 5 Rasmussen SurveyUSA
## 3 1 Rasmussen SurveyUSA
## 3 2 Rasmussen SurveyUSA
## 3 3 Rasmussen SurveyUSA
## 3 4 Rasmussen SurveyUSA
## 3 5 Rasmussen SurveyUSA
## 4 1 Rasmussen SurveyUSA
## 4 2 Rasmussen SurveyUSA
## 4 3 Rasmussen SurveyUSA
## 4 4 Rasmussen SurveyUSA
## 4 5 Rasmussen SurveyUSA
## 5 1 Rasmussen SurveyUSA
## 5 2 Rasmussen SurveyUSA
## 5 3 Rasmussen SurveyUSA
## 5 4 Rasmussen SurveyUSA
## 5 5 Rasmussen SurveyUSA

summary(imputed)

## Rasmussen SurveyUSA PropR DiffCount


## Min. :-41.000 Min. :-33.000 Min. :0.0000 Min. :-19.000
## 1st Qu.: -8.000 1st Qu.:-11.000 1st Qu.:0.0000 1st Qu.: -6.000
## Median : 3.000 Median : 1.000 Median :0.6250 Median : 1.000
## Mean : 1.731 Mean : 1.517 Mean :0.5259 Mean : -1.269
## 3rd Qu.: 11.000 3rd Qu.: 18.000 3rd Qu.:1.0000 3rd Qu.: 4.000
## Max. : 39.000 Max. : 30.000 Max. :1.0000 Max. : 11.000

#last step is to copy the variables with imputed values back into the main dataset
polling$Rasmussen = imputed$Rasmussen
polling$SurveyUSA = imputed$SurveyUSA
summary(polling)

## State Year Rasmussen SurveyUSA


## Arizona : 3 Min. :2004 Min. :-41.000 Min. :-33.000
## Arkansas : 3 1st Qu.:2004 1st Qu.: -8.000 1st Qu.:-11.000
## California : 3 Median :2008 Median : 3.000 Median : 1.000
## Colorado : 3 Mean :2008 Mean : 1.731 Mean : 1.517
## Connecticut: 3 3rd Qu.:2012 3rd Qu.: 11.000 3rd Qu.: 18.000
## Florida : 3 Max. :2012 Max. : 39.000 Max. : 30.000
## (Other) :127
## DiffCount PropR Republican
## Min. :-19.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 1.000 Median :0.6250 Median :1.0000
## Mean : -1.269 Mean :0.5259 Mean :0.5103
## 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. : 11.000 Max. :1.0000 Max. :1.0000
##
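
As a quick sanity check (not in the original script), we can confirm that no missing values remain after copying the imputed columns back:

colSums(is.na(polling))  # Rasmussen and SurveyUSA should now show 0 NAs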

3.1 VIDEO 3: A SOPHISTICATED BASELINE METHOD (R script reproduced here)

# Video 3

#Lets start building the models.First we split the data into training & test sets

# Subset data into training set and test set


Train = subset(polling, Year == 2004 | Year == 2008)
Test = subset(polling, Year == 2012)

# Smart Baseline
table(Train$Republican) #lets see the breakdown of the dependent/outcome var

##
## 0 1
## 47 53


#The baseline model predicts the most common outcome; here it predicts Republican and is right 53% of the time. But this is a weak baseline, since even where Democrats are ahead by 15-20% in the polls it still predicts a Republican win. Hence we want a smarter/better baseline model than this naive one.

#sign() returns 1 when a positive number is passed to it, -1 when a negative number is passed, and 0 if 0 is passed.
sign(20)

## [1] 1

sign(-10)

## [1] -1

sign(0)

## [1] 0

table(sign(Train$Rasmussen))

##
## -1 0 1
## 42 3 55

table(Train$Republican, sign(Train$Rasmussen)) # this smart base model just makes 4 mistakes out of 100 obs

##
## -1 0 1
## 0 42 2 3
## 1 0 1 52
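
As a small follow-up (not in the original script), the training-set accuracy of this smart baseline, counting only the decisive correct calls:

(42 + 52) / nrow(Train)  # ~0.94, versus 0.53 for the naive most-frequent-outcome baseline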

3.2 VIDEO 4: LOGISTIC REGRESSION MODELS (R script reproduced here)

# Video 4

#Before building the model it is reasonable to suspect that multicollinearity exists among the independent variables, as all of them measure the same thing, i.e. how the Republican candidate is performing in each state

# Multicollinearity
#cor(Train) gives an error here, since Train has a factor variable (non-numeric)
str(Train)

## 'data.frame': 100 obs. of 7 variables:


## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
## $ Rasmussen : int 11 21 16 16 5 5 7 10 -11 -27 ...
## $ SurveyUSA : int 18 25 21 21 15 8 5 9 -11 -24 ...
## $ DiffCount : int 5 5 1 6 8 9 8 5 -8 -5 ...
## $ PropR : num 1 1 1 1 1 1 1 1 0 0 ...
## $ Republican: int 1 1 1 1 1 1 1 1 0 0 ...

cor(Train[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")]) # so we select only numeric Independe


nt var and also the dependent var.

## Rasmussen SurveyUSA PropR DiffCount Republican


## Rasmussen 1.0000000 0.9194508 0.8404803 0.5124098 0.8021191
## SurveyUSA 0.9194508 1.0000000 0.8756581 0.5541816 0.8205806
## PropR 0.8404803 0.8756581 1.0000000 0.8273785 0.9484204
## DiffCount 0.5124098 0.5541816 0.8273785 1.0000000 0.8092777
## Republican 0.8021191 0.8205806 0.9484204 0.8092777 1.0000000

#We are seeing large correlation values which are a concern

# Logistic Regression Model building with just one var as a starting point.
mod1 = glm(Republican~PropR, data=Train, family="binomial") # we have selected the variable most correlated with the outcome variable 'Republican'
summary(mod1)


##
## Call:
## glm(formula = Republican ~ PropR, family = "binomial", data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22880 -0.06541 0.10260 0.10260 1.37392
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.146 1.977 -3.108 0.001882 **
## PropR 11.390 3.153 3.613 0.000303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.269 on 99 degrees of freedom
## Residual deviance: 15.772 on 98 degrees of freedom
## AIC: 19.772
##
## Number of Fisher Scoring iterations: 8

# Training set predictions using the above model


pred1 = predict(mod1, type="response")
table(Train$Republican, pred1 >= 0.5) # this model makes 4 mistakes similar to our smart base model

##
## FALSE TRUE
## 0 45 2
## 1 2 51

#Let's now see whether we can improve upon this model by adding another variable

# Two-variable model
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial") # here we select the 2 independent variables that are least correlated with each other, since together they will have more predictive power
pred2 = predict(mod2, type="response") # this is a training set prediction
table(Train$Republican, pred2 >= 0.5)

##
## FALSE TRUE
## 0 45 2
## 1 1 52

# This model makes 3 mistakes, i.e. one fewer than the model before, but not a big improvement over the baseline model
summary(mod2)

##
## Call:
## glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial",
## data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01196 -0.00698 0.01005 0.05074 1.54975
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1405 1.2456 -0.916 0.3599
## SurveyUSA 0.2976 0.1949 1.527 0.1267
## DiffCount 0.7673 0.4188 1.832 0.0669 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.269 on 99 degrees of freedom
## Residual deviance: 12.439 on 97 degrees of freedom
## AIC: 18.439
##
## Number of Fisher Scoring iterations: 10


#Positives of this model:


#1. This model has a lower AIC value, indicating a better/stronger model than the previous one.
#2. The coefficients are positive, which is intuitively correct.

#Negatives:
#The independent variable coefficients are not individually significant

#We will adopt the two-variable model

3.3 VIDEO 5: TEST SET PREDICTIONS (R script reproduced here)

# Video 5

#Time to evaluate the model on the testing set

# Smart baseline accuracy comparison with Test outcome values


table(Test$Republican, sign(Test$Rasmussen)) # here the smart baseline makes 4 mistakes and gives 2 inconclusive results

##
## -1 0 1
## 0 18 2 4
## 1 0 0 21

# Test set predictions


TestPrediction = predict(mod2, newdata=Test, type="response")
table(Test$Republican, TestPrediction >= 0.5) # we make just one mistake

##
## FALSE TRUE
## 0 23 1
## 1 0 21

# Analyze mistake by taking out that particular obs from the test dataset
subset(Test, TestPrediction >= 0.5 & Republican == 0)

## State Year Rasmussen SurveyUSA DiffCount PropR Republican


## 24 Florida 2012 2 0 6 0.6666667 0

# Overall these predictions are better than what the smart baseline model is predicting
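
# A short follow-up (not in the original script): test-set accuracy of the two-variable logistic
# model versus the smart baseline, counting the baseline's inconclusive calls as misses.
(23 + 21) / nrow(Test)   # logistic model: ~0.978 (1 mistake in 45 states)
(18 + 21) / nrow(Test)   # smart baseline: ~0.867 (4 wrong calls and 2 inconclusive)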

Assignment 3: POPULARITY OF MUSIC RECORDS


The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and
is dominated by three big production companies which make up nearly 82% of the total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A
record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album
sales, singles and concert tickets.

Unfortunately, the success of an artist's release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and
digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.

Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to
support to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a
spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song's properties to predict its popularity. The dataset songs.csv consists of all
songs which made it to the Top 10 of the Billboard Hot 100 Chart from 1990-2010, plus a sample of additional songs that didn't make the Top 10.
This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time
signature, loudness, key, pitch, tempo, and timbre.

Here's a detailed description of the variables:

year = the year the song was released

songtitle = the title of the song

artistname = the name of the artist of the song

songID and artistID = identifying variables for the song and artist

timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate

loudness = a continuous variable indicating the average amplitude of the audio in decibels

tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate

key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the
estimate

energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness

pitch = a continuous variable that indicates the pitch of the song

timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the
minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)

Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10,
and 0 if it was not)

Use the read.csv function to load the dataset songs.csv into R.

#Lets import the dataset


songs<-read.csv("songs.csv")
str(songs)

## 'data.frame': 7574 obs. of 39 variables:


## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ songtitle : Factor w/ 7141 levels "'03 Bonnie & Clyde",..: 6204 5522 241 3098 47 607 254 4419
2887 6756 ...
## $ artistname : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
## $ songID : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831
3964 6904 2473 ...
## $ artistID : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 67
1 507 ...
## $ timesignature : int 3 4 4 4 4 4 4 4 4 4 ...
## $ timesignature_confidence: num 0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
## $ loudness : num -4.26 -4.05 -3.57 -3.81 -4.71 ...
## $ tempo : num 91.5 140 160.5 97.5 140.1 ...
## $ tempo_confidence : num 0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
## $ key : int 11 10 2 1 6 4 10 5 9 11 ...
## $ key_confidence : num 0.453 0.469 0.209 0.632 0.483 0.627 0.715 0.423 0.751 0.602 ...
## $ energy : num 0.967 0.985 0.99 0.939 0.988 ...
## $ pitch : num 0.024 0.025 0.026 0.013 0.063 0.038 0.026 0.033 0.027 0.004 ...
## $ timbre_0_min : num 0.002 0 0.003 0 0 ...
## $ timbre_0_max : num 57.3 57.4 57.4 57.8 56.9 ...
## $ timbre_1_min : num -6.5 -37.4 -17.2 -32.1 -223.9 ...
## $ timbre_1_max : num 171 171 171 221 171 ...
## $ timbre_2_min : num -81.7 -149.6 -72.9 -138.6 -147.2 ...
## $ timbre_2_max : num 95.1 180.3 157.9 173.4 166 ...
## $ timbre_3_min : num -285 -380.1 -204 -73.5 -128.1 ...
## $ timbre_3_max : num 259 384 251 373 389 ...
## $ timbre_4_min : num -40.4 -48.7 -66 -55.6 -43.9 ...
## $ timbre_4_max : num 73.6 100.4 152.1 119.2 99.3 ...
## $ timbre_5_min : num -104.7 -87.3 -98.7 -77.5 -96.1 ...
## $ timbre_5_max : num 183.1 42.8 141.4 141.2 38.3 ...
## $ timbre_6_min : num -88.8 -86.9 -88.9 -70.8 -110.8 ...
## $ timbre_6_max : num 73.5 75.5 66.5 64.5 72.4 ...
## $ timbre_7_min : num -71.1 -65.8 -67.4 -63.7 -55.9 ...
## $ timbre_7_max : num 82.5 106.9 80.6 96.7 110.3 ...
## $ timbre_8_min : num -52 -61.3 -59.8 -78.7 -56.5 ...
## $ timbre_8_max : num 39.1 35.4 46 41.1 37.6 ...
## $ timbre_9_min : num -35.4 -81.9 -46.3 -49.2 -48.6 ...
## $ timbre_9_max : num 71.6 74.6 59.9 95.4 67.6 ...
## $ timbre_10_min : num -126.4 -103.8 -108.3 -102.7 -52.8 ...
## $ timbre_10_max : num 18.7 121.9 33.3 46.4 22.9 ...
## $ timbre_11_min : num -44.8 -38.9 -43.7 -59.4 -50.4 ...
## $ timbre_11_max : num 26 22.5 25.7 37.1 32.8 ...
## $ Top10 : int 0 0 0 0 0 0 0 0 0 1 ...

#PROBLEM 1.1 - UNDERSTANDING THE DATA

#How many observations (songs) are from the year 2010?


table(songs$year) #or

##
## 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## 328 196 186 324 198 258 178 329 380 357 363 282 518 434 479
## 2005 2006 2007 2008 2009 2010
## 392 479 622 415 483 373

nrow(songs[songs$year>=2010,])

## [1] 373


#Ans:373

############################

#PROBLEM 1.2 - UNDERSTANDING THE DATA

#How many songs does the dataset include for which the artist name is "Michael Jackson"?
nrow(songs[songs$artistname=="Michael Jackson",])

## [1] 18

#or
MichaelJackson = subset(songs, artistname == "Michael Jackson")
nrow(MichaelJackson) #or str(MichaelJackson)

## [1] 18

#Ans:18

########################################

#PROBLEM 1.3 - UNDERSTANDING THE DATA

#Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.
MichaelJackson$songtitle[MichaelJackson$Top10=="1"]

## [1] You Rock My World You Are Not Alone Black or White Remember the Time
## [5] In The Closet
## 7141 Levels: '03 Bonnie & Clyde '69 ... Zumbi

MichaelJackson[c("songtitle","Top10")]

## songtitle Top10
## 4329 You Rock My World 1
## 6205 She's Out of My Life 0
## 6206 Wanna Be Startin' Somethin' 0
## 6207 You Are Not Alone 1
## 6208 Billie Jean 0
## 6209 The Way You Make Me Feel 0
## 6210 Black or White 1
## 6211 Rock with You 0
## 6212 Bad 0
## 6213 I Just Can't Stop Loving You 0
## 6214 Man in the Mirror 0
## 6215 Thriller 0
## 6216 Beat It 0
## 6217 The Girl Is Mine 0
## 6218 Remember the Time 1
## 6219 Don't Stop 'Til You Get Enough 0
## 6220 Heal the World 0
## 6915 In The Closet 1

#Ans:You Rock My World ,You Are Not Alone

#################################

#PROBLEM 1.4 - UNDERSTANDING THE DATA

#The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, . . . ). What are the values of this variable that occur in our dataset? Select all that apply.
table(songs$timesignature)

##
## 0 1 3 4 5 7
## 10 143 503 6787 112 19


#Ans:0,1,3,4,5,7

#Which timesignature value is the most frequent among songs in our dataset?
#Ans:4
#EXPLANATION:We can also read from the table that 6787 songs have a value of 4 for the timesignature, which is the highest count out of all of the possible timesignature values.

##################################

#PROBLEM 1.5 - UNDERSTANDING THE DATA

#Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?
songs$songtitle[songs$tempo==max(songs$tempo)]

## [1] Wanna Be Startin' Somethin'


## 7141 Levels: '03 Bonnie & Clyde '69 ... Zumbi

#or
which.max(songs$tempo)

## [1] 6206

songs$songtitle[6206]

## [1] Wanna Be Startin' Somethin'


## 7141 Levels: '03 Bonnie & Clyde '69 ... Zumbi

#Ans:Wanna Be Startin' Somethin'

#######################################

#PROBLEM 2.1 - CREATING OUR PREDICTION MODEL

#We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set "SongsTrain" consisting of all the observations up to and including 2009 song releases, and a testing set "SongsTest", consisting of the 2010 song releases.

# Subset data into training set and test set


SongsTrain= subset(songs, year<=2009)
SongsTest= subset(songs, year==2010)

#How many observations (songs) are in the training set?


nrow(SongsTrain) #or str(SongsTrain)

## [1] 7201


#Ans:7201

##################################

#PROBLEM 2.2 - CREATING OUR PREDICTION MODEL

#In this problem, our outcome variable is "Top10" - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. Since the outcome variable is binary, we will build a logistic regression model. We'll start by using all song attributes as our independent variables, which we'll call Model 1.

#We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won't use the variables "year", "songtitle", "artistname", "songID" or "artistID".

#We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. However, in this case, this is a tedious amount of work since we have a large number of independent variables. There is a nice trick to avoid doing so. Let's suppose that, except for the outcome variable Top10, all other variables in the training set are inputs to Model 1. Then, we can use the formula

#SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)

#to build our model. Notice that the "." (called a period) is used in place of enumerating all the independent variables. (Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.)

#However, in our case, we want to exclude some of the variables in our dataset from being used as independent variables ("year", "songtitle", "artistname", "songID", and "artistID"). To do this, we can use the following trick.

#First define a vector of variable names called nonvars - these are the variables that we won't use in our model.
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

#To remove these variables from your training and testing sets, type the following commands in your R console:

SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]

SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]

#Now, use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.

# Logistic Regression Model building


SongsLog1 = glm(Top10~., data=SongsTrain, family="binomial")
summary(SongsLog1)


##
## Call:
## glm(formula = Top10 ~ ., family = "binomial", data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9220 -0.5399 -0.3459 -0.1845 3.0770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.470e+01 1.806e+00 8.138 4.03e-16 ***
## timesignature 1.264e-01 8.674e-02 1.457 0.145050
## timesignature_confidence 7.450e-01 1.953e-01 3.815 0.000136 ***
## loudness 2.999e-01 2.917e-02 10.282 < 2e-16 ***
## tempo 3.634e-04 1.691e-03 0.215 0.829889
## tempo_confidence 4.732e-01 1.422e-01 3.329 0.000873 ***
## key 1.588e-02 1.039e-02 1.529 0.126349
## key_confidence 3.087e-01 1.412e-01 2.187 0.028760 *
## energy -1.502e+00 3.099e-01 -4.847 1.25e-06 ***
## pitch -4.491e+01 6.835e+00 -6.570 5.02e-11 ***
## timbre_0_min 2.316e-02 4.256e-03 5.441 5.29e-08 ***
## timbre_0_max -3.310e-01 2.569e-02 -12.882 < 2e-16 ***
## timbre_1_min 5.881e-03 7.798e-04 7.542 4.64e-14 ***
## timbre_1_max -2.449e-04 7.152e-04 -0.342 0.732087
## timbre_2_min -2.127e-03 1.126e-03 -1.889 0.058843 .
## timbre_2_max 6.586e-04 9.066e-04 0.726 0.467571
## timbre_3_min 6.920e-04 5.985e-04 1.156 0.247583
## timbre_3_max -2.967e-03 5.815e-04 -5.103 3.34e-07 ***
## timbre_4_min 1.040e-02 1.985e-03 5.237 1.63e-07 ***
## timbre_4_max 6.110e-03 1.550e-03 3.942 8.10e-05 ***
## timbre_5_min -5.598e-03 1.277e-03 -4.385 1.16e-05 ***
## timbre_5_max 7.736e-05 7.935e-04 0.097 0.922337
## timbre_6_min -1.686e-02 2.264e-03 -7.445 9.66e-14 ***
## timbre_6_max 3.668e-03 2.190e-03 1.675 0.093875 .
## timbre_7_min -4.549e-03 1.781e-03 -2.554 0.010661 *
## timbre_7_max -3.774e-03 1.832e-03 -2.060 0.039408 *
## timbre_8_min 3.911e-03 2.851e-03 1.372 0.170123
## timbre_8_max 4.011e-03 3.003e-03 1.336 0.181620
## timbre_9_min 1.367e-03 2.998e-03 0.456 0.648356
## timbre_9_max 1.603e-03 2.434e-03 0.659 0.510188
## timbre_10_min 4.126e-03 1.839e-03 2.244 0.024852 *
## timbre_10_max 5.825e-03 1.769e-03 3.292 0.000995 ***
## timbre_11_min -2.625e-02 3.693e-03 -7.108 1.18e-12 ***
## timbre_11_max 1.967e-02 3.385e-03 5.811 6.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4759.2 on 7167 degrees of freedom
## AIC: 4827.2
##
## Number of Fisher Scoring iterations: 6


#Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?
#Ans:4827.2

################################

#PROBLEM 2.3 - CREATING OUR PREDICTION MODEL

#Let's now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?
#Ans:The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
#EXPLANATION:If you look at the output summary(model), where model is the name of your logistic regression model, you can see that the coefficient estimates for the confidence variables (timesignature_confidence, key_confidence, and tempo_confidence) are positive. This means that higher confidence leads to a higher predicted probability of a Top 10 hit.
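
#As a quick check of this sign pattern (a minimal sketch, assuming SongsLog1 is still in memory), the three confidence coefficients can be pulled out directly:

coef(SongsLog1)[c("timesignature_confidence", "key_confidence", "tempo_confidence")]  #all three estimates are positive, matching the answer above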

###########################################

#PROBLEM 2.4 - CREATING OUR PREDICTION MODEL

#In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?
#Ans:Mainstream listeners tend to prefer less complex songs
#EXPLANATION:Since the coefficient values for timesignature_confidence, tempo_confidence, and key_confidence are all positive, lower confidence leads to a lower predicted probability of a song being a hit. So mainstream listeners tend to prefer less complex songs.

######################################

#PROBLEM 2.5 - CREATING OUR PREDICTION MODEL

#Songs with heavier instrumentation tend to be louder (have higher values in the variable "loudness") and more energetic (have higher values in the variable "energy").

#By inspecting the coefficient of the variable "loudness", what does Model 1 suggest?
#Ans:Mainstream listeners prefer songs with heavy instrumentation

#By inspecting the coefficient of the variable "energy", do we draw the same conclusions as above?
#Ans:No

#EXPLANATION:The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation. However, the coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those with light instrumentation. These coefficients lead us to different conclusions!

#######################################################

#PROBLEM 3.1 - BEWARE OF MULTICOLLINEARITY ISSUES!

#What is the correlation between the variables "loudness" and "energy" in the training set?
cor(SongsTrain$loudness,SongsTrain$energy)

## [1] 0.7399067

#Ans:0.7399067

#Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".
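
#Beyond the loudness/energy pair, a broader collinearity check (an optional sketch, not part of the assignment; corTrain and highPairs are illustrative names) could scan the whole correlation matrix of the training columns, which are all numeric after dropping nonvars:

corTrain = cor(SongsTrain)
highPairs = which(abs(corTrain) > 0.7 & abs(corTrain) < 1, arr.ind = TRUE)  #strongly correlated pairs, excluding the diagonal
highPairs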

###################################################

#PROBLEM 3.2 - BEWARE OF MULTICOLLINEARITY ISSUES!

#Create Model 2, which is Model 1 without the independent variable "loudness". This can be done with the following command:

SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)


summary(SongsLog2)


##
## Call:
## glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0983 -0.5607 -0.3602 -0.1902 3.3107
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.241e+00 7.465e-01 -3.002 0.002686 **
## timesignature 1.625e-01 8.734e-02 1.860 0.062873 .
## timesignature_confidence 6.885e-01 1.924e-01 3.578 0.000346 ***
## tempo 5.521e-04 1.665e-03 0.332 0.740226
## tempo_confidence 5.497e-01 1.407e-01 3.906 9.40e-05 ***
## key 1.740e-02 1.026e-02 1.697 0.089740 .
## key_confidence 2.954e-01 1.394e-01 2.118 0.034163 *
## energy 1.813e-01 2.608e-01 0.695 0.486991
## pitch -5.150e+01 6.857e+00 -7.511 5.87e-14 ***
## timbre_0_min 2.479e-02 4.240e-03 5.847 5.01e-09 ***
## timbre_0_max -1.007e-01 1.178e-02 -8.551 < 2e-16 ***
## timbre_1_min 7.143e-03 7.710e-04 9.265 < 2e-16 ***
## timbre_1_max -7.830e-04 7.064e-04 -1.108 0.267650
## timbre_2_min -1.579e-03 1.109e-03 -1.424 0.154531
## timbre_2_max 3.889e-04 8.964e-04 0.434 0.664427
## timbre_3_min 6.500e-04 5.949e-04 1.093 0.274524
## timbre_3_max -2.462e-03 5.674e-04 -4.339 1.43e-05 ***
## timbre_4_min 9.115e-03 1.952e-03 4.670 3.02e-06 ***
## timbre_4_max 6.306e-03 1.532e-03 4.115 3.87e-05 ***
## timbre_5_min -5.641e-03 1.255e-03 -4.495 6.95e-06 ***
## timbre_5_max 6.937e-04 7.807e-04 0.889 0.374256
## timbre_6_min -1.612e-02 2.235e-03 -7.214 5.45e-13 ***
## timbre_6_max 3.814e-03 2.157e-03 1.768 0.076982 .
## timbre_7_min -5.102e-03 1.755e-03 -2.907 0.003644 **
## timbre_7_max -3.158e-03 1.811e-03 -1.744 0.081090 .
## timbre_8_min 4.488e-03 2.810e-03 1.597 0.110254
## timbre_8_max 6.423e-03 2.950e-03 2.177 0.029497 *
## timbre_9_min -4.282e-04 2.955e-03 -0.145 0.884792
## timbre_9_max 3.525e-03 2.377e-03 1.483 0.138017
## timbre_10_min 2.993e-03 1.804e-03 1.660 0.097004 .
## timbre_10_max 7.367e-03 1.731e-03 4.255 2.09e-05 ***
## timbre_11_min -2.837e-02 3.630e-03 -7.815 5.48e-15 ***
## timbre_11_max 1.829e-02 3.341e-03 5.476 4.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4871.8 on 7168 degrees of freedom
## AIC: 4937.8
##
## Number of Fisher Scoring iterations: 6

#We just subtracted the variable loudness. We couldn't do this with the variables "songtitle" and "artistname", because they are not numeric variables, and we might get different values in the test set that the training set has never seen. But this approach (subtracting the variable from the model formula) will always work when you want to remove numeric variables.

#Look at the summary of SongsLog2, and inspect the coefficient of the variable "energy". What do you observe?
#Ans:Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1
#EXPLANATION:The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, note that the variable energy is not significant in this model.

#################################################

#PROBLEM 3.3 - BEWARE OF MULTICOLLINEARITY ISSUES!

#Now, create Model 3, which should be exactly like Model 1, but without the variable "energy".
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
summary(SongsLog3)


##
## Call:
## glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9182 -0.5417 -0.3481 -0.1874 3.4171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.196e+01 1.714e+00 6.977 3.01e-12 ***
## timesignature 1.151e-01 8.726e-02 1.319 0.187183
## timesignature_confidence 7.143e-01 1.946e-01 3.670 0.000242 ***
## loudness 2.306e-01 2.528e-02 9.120 < 2e-16 ***
## tempo -6.460e-04 1.665e-03 -0.388 0.698107
## tempo_confidence 3.841e-01 1.398e-01 2.747 0.006019 **
## key 1.649e-02 1.035e-02 1.593 0.111056
## key_confidence 3.394e-01 1.409e-01 2.409 0.015984 *
## pitch -5.328e+01 6.733e+00 -7.914 2.49e-15 ***
## timbre_0_min 2.205e-02 4.239e-03 5.200 1.99e-07 ***
## timbre_0_max -3.105e-01 2.537e-02 -12.240 < 2e-16 ***
## timbre_1_min 5.416e-03 7.643e-04 7.086 1.38e-12 ***
## timbre_1_max -5.115e-04 7.110e-04 -0.719 0.471928
## timbre_2_min -2.254e-03 1.120e-03 -2.012 0.044190 *
## timbre_2_max 4.119e-04 9.020e-04 0.457 0.647915
## timbre_3_min 3.179e-04 5.869e-04 0.542 0.588083
## timbre_3_max -2.964e-03 5.758e-04 -5.147 2.64e-07 ***
## timbre_4_min 1.105e-02 1.978e-03 5.585 2.34e-08 ***
## timbre_4_max 6.467e-03 1.541e-03 4.196 2.72e-05 ***
## timbre_5_min -5.135e-03 1.269e-03 -4.046 5.21e-05 ***
## timbre_5_max 2.979e-04 7.855e-04 0.379 0.704526
## timbre_6_min -1.784e-02 2.246e-03 -7.945 1.94e-15 ***
## timbre_6_max 3.447e-03 2.182e-03 1.580 0.114203
## timbre_7_min -5.128e-03 1.768e-03 -2.900 0.003733 **
## timbre_7_max -3.394e-03 1.820e-03 -1.865 0.062208 .
## timbre_8_min 3.686e-03 2.833e-03 1.301 0.193229
## timbre_8_max 4.658e-03 2.988e-03 1.559 0.119022
## timbre_9_min -9.318e-05 2.957e-03 -0.032 0.974859
## timbre_9_max 1.342e-03 2.424e-03 0.554 0.579900
## timbre_10_min 4.050e-03 1.827e-03 2.217 0.026637 *
## timbre_10_max 5.793e-03 1.759e-03 3.294 0.000988 ***
## timbre_11_min -2.638e-02 3.683e-03 -7.162 7.96e-13 ***
## timbre_11_max 1.984e-02 3.365e-03 5.896 3.74e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6017.5 on 7200 degrees of freedom
## Residual deviance: 4782.7 on 7168 degrees of freedom
## AIC: 4848.7
##
## Number of Fisher Scoring iterations: 6

#Look at the summary of Model 3 and inspect the coefficient of the variable "loudness". Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
#Ans:Yes
#EXPLANATION:Looking at the output of summary(SongsLog3), we can see that loudness has a positive coefficient estimate, meaning that our model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion we got from Model 2.

#########################################

#PROBLEM 4.1 - VALIDATING OUR MODEL

#Make predictions on the test set using Model 3. What is the accuracy of Model 3 on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)

# Predictions on the test set


testPredict = predict(SongsLog3, type="response", newdata=SongsTest)

# Confusion matrix with threshold of 0.45


table(SongsTest$Top10, testPredict > 0.45)

##
## FALSE TRUE
## 0 309 5
## 1 40 19


#therefore the Overall Accuracy is


(309+19)/(309+5+40+19)

## [1] 0.8793566

#0.8793566

######################################

#PROBLEM 4.2 - VALIDATING OUR MODEL

#Let's check if there's any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs. What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)

# Baseline accuracy/model
(309+5)/(309+5+40+19)  #From the Confusion matrix: an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs

## [1] 0.8418231

#or
#You can compute the baseline accuracy by tabling the outcome variable in the test set
table(SongsTest$Top10)

##
## 0 1
## 314 59

#The baseline model would get 314 observations correct, and 59 wrong, for an accuracy of
314/(314+59)

## [1] 0.8418231

#Ans:0.8418231

#####################################

#PROBLEM 4.3 - VALIDATING OUR MODEL

#It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?

#Let's view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company's objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.

#A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.

#How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?
# Predictions on the test set
testPredict = predict(SongsLog3, type="response", newdata=SongsTest)

# Confusion matrix with threshold of 0.45


table(SongsTest$Top10, testPredict > 0.45)

##
## FALSE TRUE
## 0 309 5
## 1 40 19


#Ans:19

#How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?
#Ans:5 # from the above confusion matrix

#EXPLANATION:According to our model's confusion matrix: table(SongsTest$Top10, testPredict >= 0.45), we have 19 true positives (Top 10 hits that we predict correctly), and 5 false positives (songs that we predict will be Top 10 hits, but end up not being Top 10 hits).

##################################################

#PROBLEM 4.4 - VALIDATING OUR MODEL

#Using the confusion matrix:

table(SongsTest$Top10, testPredict >= 0.45)

##
## FALSE TRUE
## 0 309 5
## 1 40 19

#What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?
19/(19+40)

## [1] 0.3220339

#Ans:0.3220339

#What is the specificity of Model 3 on the test set, using a threshold of 0.45?
309/(309+5)

## [1] 0.9840764

#Ans:0.9840764
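
#The same three quantities can be computed in one step with a small helper (a sketch with a hypothetical function name, not part of the graded answer); it assumes the confusion matrix is laid out as table(actual, predicted):

confusionStats = function(cm) {
  c(accuracy = sum(diag(cm)) / sum(cm),
    sensitivity = cm[2, 2] / sum(cm[2, ]),   #true positives / all actual positives
    specificity = cm[1, 1] / sum(cm[1, ]))   #true negatives / all actual negatives
}
confusionStats(table(SongsTest$Top10, testPredict >= 0.45))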

####################################

#PROBLEM 4.5 - VALIDATING OUR MODEL

#What conclusions can you make about our model?


#Ans:Model 3 favors specificity over sensitivity. & Model 3 provides conservative predictions, and predicts that
a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be
very confident in the songs that it does predict to be Top 10 hits.

#EXPLANATION:Model 3 has a very high specificity, meaning that it favors specificity over sensitivity. While Model 3 only captures less than half of the Top 10 songs, it still can offer a competitive edge, since it is very conservative in its predictions.

Assignment 3: PREDICTING PAROLE VIOLATORS


In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole
system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to
prison if they violate the terms of their parole.

Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not
commit additional crimes after release. In this problem, we will build and validate a model that predicts if an inmate will violate the terms of his or
her parole. Such a model could be useful to a parole board when deciding to approve or deny an application for parole.

For this prediction task, we will use data from the United States 2004 National Corrections Reporting Program, a nationwide census of parole
releases that occurred during 2004. We limited our focus to parolees who served no more than 6 months in prison and whose maximum sentence
for all charges did not exceed 18 months. The dataset contains all such parolees who either successfully completed their term of parole during
2004 or those who violated the terms of their parole during that year. The dataset contains the following variables:

male: 1 if the parolee is male, 0 if female

race: 1 if the parolee is white, 2 otherwise

age: the parolee's age (in years) when he or she was released from prison

state: a code for the parolee's state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the dataset.

time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).

max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).

multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.

crime: a code for the parolee's main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.

violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.

#PROBLEM 1.1 - LOADING THE DATASET

#lets load the dataframe and see the details of this dataset
parole<-read.csv("parole.csv")
str(parole)

## 'data.frame': 675 obs. of 9 variables:


## $ male : int 1 0 1 1 1 1 1 0 0 1 ...
## $ race : int 1 1 2 1 2 2 1 1 1 2 ...
## $ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
## $ state : int 1 1 1 1 1 1 1 1 1 1 ...
## $ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
## $ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
## $ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crime : int 4 3 3 1 1 4 3 1 3 2 ...
## $ violator : int 0 0 0 0 0 0 0 0 0 0 ...

summary(parole)

## male race age state


## Min. :0.0000 Min. :1.000 Min. :18.40 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 1st Qu.:2.000
## Median :1.0000 Median :1.000 Median :33.70 Median :3.000
## Mean :0.8074 Mean :1.424 Mean :34.51 Mean :2.887
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:4.000
## Max. :1.0000 Max. :2.000 Max. :67.00 Max. :4.000
## time.served max.sentence multiple.offenses crime
## Min. :0.000 Min. : 1.00 Min. :0.0000 Min. :1.000
## 1st Qu.:3.250 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:1.000
## Median :4.400 Median :12.00 Median :1.0000 Median :2.000
## Mean :4.198 Mean :13.06 Mean :0.5363 Mean :2.059
## 3rd Qu.:5.200 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :6.000 Max. :18.00 Max. :1.0000 Max. :4.000
## violator
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1156
## 3rd Qu.:0.0000
## Max. :1.0000

#How many parolees are contained in the dataset?


nrow(parole) #or str(parole)

## [1] 675

#Ans:675

##########################################

#PROBLEM 1.2 - LOADING THE DATASET

#How many of the parolees in the dataset violated the terms of their parole?
table(parole$violator)

##
## 0 1
## 597 78

#Ans:78

###########################################

#PROBLEM 2.1 - PREPARING THE DATASET

#You should be familiar with unordered factors (if not, review the Week 2 homework problem "Reading Test Scores"). Which variables in this dataset are unordered factors with at least three levels? Select all that apply.
sapply(parole,class) #to get classes of all var


## male race age state


## "integer" "integer" "numeric" "integer"
## time.served max.sentence multiple.offenses crime
## "numeric" "integer" "integer" "integer"
## violator
## "integer"

#Ans:state , crime
#EXPLANATION:While the variables male, race, state, crime, and violator are all unordered factors, only state and crime have at least 3 levels in this dataset.

#############################

#PROBLEM 2.2 - PREPARING THE DATASET

#In the last subproblem, we identified variables that are unordered factors with at least 3 levels, so we need to convert them to factors for our prediction problem (we introduced this idea in the "Reading Test Scores" problem last week). Using the as.factor() function, convert these variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the values are still numbers).
parole$state<-as.factor(parole$state)
parole$crime<-as.factor(parole$crime)
summary(parole)

## male race age state time.served


## Min. :0.0000 Min. :1.000 Min. :18.40 1:143 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 2:120 1st Qu.:3.250
## Median :1.0000 Median :1.000 Median :33.70 3: 82 Median :4.400
## Mean :0.8074 Mean :1.424 Mean :34.51 4:330 Mean :4.198
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:5.200
## Max. :1.0000 Max. :2.000 Max. :67.00 Max. :6.000
## max.sentence multiple.offenses crime violator
## Min. : 1.00 Min. :0.0000 1:315 Min. :0.0000
## 1st Qu.:12.00 1st Qu.:0.0000 2:106 1st Qu.:0.0000
## Median :12.00 Median :1.0000 3:153 Median :0.0000
## Mean :13.06 Mean :0.5363 4:101 Mean :0.1156
## 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :18.00 Max. :1.0000 Max. :1.0000


#How does the output of summary() change for a factor variable as compared to a numerical variable?
#Ans:The output becomes similar to that of the table() function applied to that variable
#EXPLANATION:The output of summary(parole$state) or summary(parole$crime) now shows a breakdown of the number of parolees with each level of the factor, which is most similar to the output of the table() function.

##########################################

#PROBLEM 3.1 - SPLITTING INTO A TRAINING AND TESTING SET

set.seed(144)
library(caTools)

split = sample.split(parole$violator, SplitRatio = 0.7)

train = subset(parole, split == TRUE)


test = subset(parole, split == FALSE)

#Roughly what proportion of parolees have been allocated to the training and testing sets?
#Ans:70% to the training set, 30% to the testing set
#EXPLANATION:SplitRatio=0.7 causes split to take the value TRUE roughly 70% of the time, so train should contain
roughly 70% of the values in the dataset. You can verify this by running nrow(train) and nrow(test).

###############################

#PROBLEM 3.2 - SPLITTING INTO A TRAINING AND TESTING SET

#Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect?
#Ans:The exact same training/testing set split as the first execution of [1]-[5]

#If you instead ONLY re-ran lines [3]-[5], what would you expect?
#Ans:A different training/testing set split from the first execution of [1]-[5]

#If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
#Ans:A different training/testing set split from the first execution of [1]-[5]

#EXPLANATION:If you set a random seed, split, set the seed again to the same value, and then split again, you will get the same split. However, if you set the seed and then split twice, you will get different splits. If you set the seed to different values, you will get different splits. You can also verify this by running the specified code in R. If you have training sets train1 and train2, the function sum(train1 != train2) will count the number of values in those two data frames that are different.
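
#To see this concretely (a short sketch; splitA, splitB and splitC are illustrative names), re-seeding before each call reproduces the split, while splitting again without re-seeding does not:

set.seed(144); splitA = sample.split(parole$violator, SplitRatio = 0.7)
set.seed(144); splitB = sample.split(parole$violator, SplitRatio = 0.7)
splitC = sample.split(parole$violator, SplitRatio = 0.7)   #no re-seeding before this call
identical(splitA, splitB)   #TRUE
identical(splitA, splitC)   #almost certainly FALSE
#Remember to re-run the original set.seed(144) and sample.split lines afterwards, as the next problem asks.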

##############################

#PROBLEM 4.1 - BUILDING A LOGISTIC REGRESSION MODEL

#If you tested other training/testing set splits in the previous section, please re-run the original 5 lines of code to obtain the original split.
#Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.
mod<-glm(violator~.,data=train,family = binomial)
summary(mod)


##
## Call:
## glm(formula = violator ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7041 -0.4236 -0.2719 -0.1690 2.8375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2411574 1.2938852 -3.278 0.00105 **
## male 0.3869904 0.4379613 0.884 0.37690
## race 0.8867192 0.3950660 2.244 0.02480 *
## age -0.0001756 0.0160852 -0.011 0.99129
## state2 0.4433007 0.4816619 0.920 0.35739
## state3 0.8349797 0.5562704 1.501 0.13335
## state4 -3.3967878 0.6115860 -5.554 2.79e-08 ***
## time.served -0.1238867 0.1204230 -1.029 0.30359
## max.sentence 0.0802954 0.0553747 1.450 0.14705
## multiple.offenses 1.6119919 0.3853050 4.184 2.87e-05 ***
## crime2 0.6837143 0.5003550 1.366 0.17180
## crime3 -0.2781054 0.4328356 -0.643 0.52054
## crime4 -0.0117627 0.5713035 -0.021 0.98357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 340.04 on 472 degrees of freedom
## Residual deviance: 251.48 on 460 degrees of freedom
## AIC: 277.48
##
## Number of Fisher Scoring iterations: 6

#What variables are significant in this model? Significant variables should have at least one star, or should have a probability less than 0.05 (the column Pr(>|z|) in the summary output). Select all that apply.
#Ans:race,state4,multiple.offenses

#############################################

#PROBLEM 4.2 - BUILDING A LOGISTIC REGRESSION MODEL

#What can we say based on the coefficient of the multiple.offenses variable?

#The following two properties might be useful to you when answering this question:
#1) If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.
#2) If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.
exp(1.6119919)

## [1] 5.012786


#Ans:Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.

#EXPLANATION:For parolees A and B who are identical other than A having committed multiple offenses, the predicted log odds of A is 1.61 more than the predicted log odds of B. Then we have:

#ln(odds of A) = ln(odds of B) + 1.61

#exp(ln(odds of A)) = exp(ln(odds of B) + 1.61)

#exp(ln(odds of A)) = exp(ln(odds of B)) * exp(1.61)

#odds of A = exp(1.61) * odds of B

#odds of A= 5.01 * odds of B

#In the second step we raised e to the power of both sides. In the third step we used the exponentiation rule that e^(a+b) = e^a * e^b. In the fourth step we used the rule that e^(ln(x)) = x.
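
#The same odds ratio can also be read off directly from the fitted model (a one-line sketch, assuming the model object is still called mod):

exp(coef(mod))["multiple.offenses"]   #about 5.01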

###############################################

#PROBLEM 4.3 - BUILDING A LOGISTIC REGRESSION MODEL

#Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. Answer the following questions based on the model's predictions for this individual. (HINT: You should use the coefficients of your model, the Logistic Response Function, and the Odds equation to solve this problem.)

#According to the model, what are the odds this individual is a violator?
Logodds<--4.2411574+0.3869904*1+0.8867192*1+-0.0001756*50+0+-0.1238867 *3+0.0802954*12+0+ 0.6837143*1
Logodds

## [1] -1.700629

exp(Logodds)

## [1] 0.1825687

#Ans:0.1825687

#According to the model, what is the probability this individual is a violator?


0.1825687/(1+0.1825687) #or

## [1] 0.1543832

1/(1+exp(-(-1.700629)))

## [1] 0.1543831

#Ans:0.1543832

#EXPLANATION:

#From the logistic regression equation,


#we have log(odds) = -4.2411574 + 0.3869904*male + 0.8867192*race - 0.0001756*age + 0.4433007*state2 + 0.8349797*state3 - 3.3967878*state4 - 0.1238867*time.served + 0.0802954*max.sentence + 1.6119919*multiple.offenses + 0.6837143*crime2 - 0.2781054*crime3 - 0.0117627*crime4. This parolee has male=1, race=1, age=50, state2=0, state3=0, state4=0, time.served=3, max.sentence=12, multiple.offenses=0, crime2=1, crime3=0, crime4=0. We conclude that log(odds) = -1.700629.

#Therefore, the odds ratio is exp(-1.700629) = 0.183, and the predicted probability of violation is 1/(1+exp(1.700629)) = 0.154.
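
#The hand calculation can be cross-checked with predict() on a one-row data frame for this parolee (a sketch; newParolee is an illustrative name, and the factor columns must use the levels created earlier). Maryland is not Kentucky, Louisiana or Virginia, so state = 1, and larceny corresponds to crime = 2:

newParolee = data.frame(male = 1, race = 1, age = 50,
                        state = factor(1, levels = levels(parole$state)),
                        time.served = 3, max.sentence = 12, multiple.offenses = 0,
                        crime = factor(2, levels = levels(parole$crime)))
predict(mod, newdata = newParolee, type = "response")   #roughly 0.154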

###############################################

#PROBLEM 5.1 - EVALUATING THE MODEL ON THE TESTING SET

#Use the predict() function to obtain the model's predicted probabilities for parolees in the testing set, remembering to pass type="response".

predictions<-predict(mod,type="response", newdata=test)

#What is the maximum predicted probability of a violation?


summary(predictions) #or max(predictions)


## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.002334 0.023780 0.057910 0.146600 0.147500 0.907300

#Ans:0.9072791

#######################################################

#PROBLEM 5.2 - EVALUATING THE MODEL ON THE TESTING SET

#In the following questions, evaluate the model's predictions on the test set using a threshold of 0.5.
# Confusion matrix with threshold of 0.5
table(test$violator, as.numeric(predictions >= 0.5))

##
## 0 1
## 0 167 12
## 1 11 12

#What is the model's sensitivity?


12/(11+12) # The sensitivity (proportion of the actual violators we got correct) is 12/(11+12) = 0.522

## [1] 0.5217391

#Ans:0.5217391

#What is the model's specificity?


167/(167+12) #the specificity (proportion of the actual non-violators we got correct) is 167/(167+12) = 0.933

## [1] 0.9329609

#Ans:0.9329609

#What is the model's accuracy?


(167+12)/(167+12+11+12) #There are 202 observations in the test set. The accuracy (percentage of values on the diagonal) is (167+12)/202 = 0.886

## [1] 0.8861386

#Ans:0.8861386

##################################################

#PROBLEM 5.3 - EVALUATING THE MODEL ON THE TESTING SET

#What is the accuracy of a simple model that predicts that every parolee is a non-violator?
(167+12)/(167+12+11+12) #From the Confusion matrix: an easier model would be to pick the most frequent outcome, i.e. corresponding to 0s (non-violators)

## [1] 0.8861386

#or If you table the outcome variable using the following command:
table(test$violator)

##
## 0 1
## 179 23

179/(179+23)

## [1] 0.8861386


#you can see that there are 179 negative examples, which are the ones that the baseline model would get correct.
Thus the baseline model would have an accuracy of 179/202 = 0.886.

####################################################

#PROBLEM 5.4 - EVALUATING THE MODEL ON THE TESTING SET

#Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned with releasing prisoners who will violate their parole. Which of the following most likely describes their preferences and best course of action?
#Ans:The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
#EXPLANATION:If the board used the model for parole decisions, a negative prediction would lead to a prisoner being granted parole, while a positive prediction would lead to a prisoner being denied parole. The parole board would experience more regret for releasing a prisoner who then violates parole (a negative prediction that is actually positive, or false negative) than it would experience for denying parole to a prisoner who would not have violated parole (a positive prediction that is actually negative, or false positive).

#Decreasing the cutoff leads to more positive predictions, which increases false positives and decreases false negatives. Meanwhile, increasing the cutoff leads to more negative predictions, which increases false negatives and decreases false positives. The parole board assigns high cost to false negatives, and therefore should decrease the cutoff.
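
#For example (a sketch, not a graded answer), re-tabulating at a lower cutoff such as 0.3 shows the trade-off: it typically catches more violators (fewer false negatives) at the price of more false positives:

table(test$violator, as.numeric(predictions >= 0.3))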

##################################################

#PROBLEM 5.5 - EVALUATING THE MODEL ON THE TESTING SET

#Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
#Ans:The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value.
#EXPLANATION:The model at cutoff 0.5 has 12 false positives and 11 false negatives, while the baseline model has 0 false positives and 23 false negatives. Because a parole board is likely to assign more cost to a false negative, the model at cutoff 0.5 is likely of value to the board.

#From the previous question, the parole board would likely benefit from decreasing the logistic regression cutoff, which decreases the false negative rate while increasing the false positive rate.

######################################

#PROBLEM 5.6 - EVALUATING THE MODEL ON THE TESTING SET

#Using the ROCR package, what is the AUC value for the model?

# Test set AUC


library(ROCR)
ROCRpred = prediction(predictions, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)

## [1] 0.8945834


#Ans:0.8945834

################################

#PROBLEM 5.7 - EVALUATING THE MODEL ON THE TESTING SET

#Describe the meaning of AUC in this context.


#Ans:The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
#EXPLANATION:The AUC deals with differentiating between a randomly selected positive and negative example. It is independent of the regression cutoff selected.
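
#To visualise the trade-off summarised by this AUC (an optional sketch using the ROCR objects created above):

ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))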

##########################################

#PROBLEM 6.1 - IDENTIFYING BIAS IN OBSERVATIONAL DATA

#Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions. In this final problem, we'll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.

#The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented. This is called "selection bias" or "selecting on the dependent variable," because only a subset of all relevant parolees were included in our analysis, based on our dependent variable in this analysis (parole violation). How could we improve our dataset to best address selection bias?
#Ans:We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
#EXPLANATION:While expanding the dataset to include the missing parolees and labeling each as violator=0 would improve the representation of non-violators, it does not capture the true outcome, since the parolee might become a violator after 2004. Though labeling these new examples with violator=NA correctly identifies that we don't know their true outcome, we cannot train or test a prediction model with a missing dependent variable.

#As a result, a prospective dataset that tracks a cohort of parolees and observes the true outcome of each is more desirable. Unfortunately, such datasets are often more challenging to obtain (for instance, if a parolee had a 10-year term, it might require tracking that individual for 10 years before building the model). Such a prospective analysis would not be possible using the 2004 National Corrections Reporting Program dataset.

######################################################

Assignment 3:PREDICTING LOAN REPAYMENT


In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the
loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore,
lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the
Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February
2010. The binary dependent variable not_fully_paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was
charged off, meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values credit_card, debt_consolidation, educational, major_purchase, small_business,
and all_other).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be
more risky are assigned higher interest rates.
installment: The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

#PROBLEM 1.1 - PREPARING THE DATASET

#Lets load the dataset


loans<-read.csv("loans.csv")
str(loans)


## 'data.frame': 9578 obs. of 14 variables:


## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...

summary(loans)

## credit.policy purpose int.rate


## Min. :0.000 all_other :2331 Min. :0.0600
## 1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
## Median :1.000 debt_consolidation:3957 Median :0.1221
## Mean :0.805 educational : 343 Mean :0.1226
## 3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
## Max. :1.000 major_purchase : 437 Max. :0.2164
## small_business : 619
## installment log.annual.inc dti fico
## Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
## 1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
## Median :268.95 Median :10.928 Median :12.665 Median :707.0
## Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
## 3rd Qu.:432.76 3rd Qu.:11.290 3rd Qu.:17.950 3rd Qu.:737.0
## Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
## NA's :4
## days.with.cr.line revol.bal revol.util inq.last.6mths
## Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.70 1st Qu.: 0.000
## Median : 4140 Median : 8596 Median : 46.40 Median : 1.000
## Mean : 4562 Mean : 16914 Mean : 46.87 Mean : 1.572
## 3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 71.00 3rd Qu.: 2.000
## Max. :17640 Max. :1207359 Max. :119.00 Max. :33.000
## NA's :29 NA's :62 NA's :29
## delinq.2yrs pub.rec not.fully.paid
## Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.1638 Mean :0.0621 Mean :0.1601
## 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :13.0000 Max. :5.0000 Max. :1.0000
## NA's :29 NA's :29

#What proportion of the loans in the dataset were not paid in full? Please input a number between 0 and 1.
prop.table(table(loans$not.fully.paid))

##
## 0 1
## 0.8399457 0.1600543

#Ans:0.1600543

############################################

#PROBLEM 1.2 - PREPARING THE DATASET

#Which of the following variables has at least one missing observation? Select all that apply.
colnames(loans)[apply(is.na(loans), 2, any)] #or

## [1] "log.annual.inc" "days.with.cr.line" "revol.util"


## [4] "inq.last.6mths" "delinq.2yrs" "pub.rec"

which(apply(is.na(loans), 2, sum)>0) # gives col names as well as col no


## log.annual.inc days.with.cr.line revol.util inq.last.6mths


## 5 8 10 11
## delinq.2yrs pub.rec
## 12 13

#or
colnames(loans)[apply(is.na(loans), 2, any)] #or very crude way summary(loans)

## [1] "log.annual.inc" "days.with.cr.line" "revol.util"


## [4] "inq.last.6mths" "delinq.2yrs" "pub.rec"

#Ans:log.annual.inc, days.with.cr.line, revol.util, inq.last.6mths, delinq.2yrs and pub.rec

####################################

#PROBLEM 1.3 - PREPARING THE DATASET

#Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data? (Hint: you can use the subset() function to build a data frame with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).)
#Ans:We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.
#EXPLANATION:Answering this question requires analyzing the loans with missing data. We can build a data frame limited to observations with some missing data with the following command:

missing = subset(loans, is.na(log.annual.inc) | is.na(days.with.cr.line) | is.na(revol.util) | is.na(inq.last.6mths) | is.na(delinq.2yrs) | is.na(pub.rec))

nrow(missing)

## [1] 62

table(missing$not.fully.paid)

##
## 0 1
## 50 12

#From nrow(missing), we see that only 62 of 9578 loans have missing data; removing this small number of observations would not lead to overfitting. From table(missing$not.fully.paid), we see that 12 of 62 loans with missing data were not fully paid, or 19.35%. This rate is similar to the 16.01% across all loans, so the form of biasing described is not an issue. However, to predict risk for loans with missing data we need to fill in the missing values instead of removing the observations.

##########################################################

#PROBLEM 1.4 - PREPARING THE DATASET

#For the rest of this problem, we'll be using a revised version of the dataset that has the missing values filled in with multiple imputation (which was discussed in the Recitation of this Unit). To ensure everybody has the same data frame going forward, you can either run the commands below in your R console (if you haven't already, run the command install.packages("mice") first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.

#IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.

library(mice)

set.seed(144)

vars.for.imputation = setdiff(names(loans), "not.fully.paid")
#Note that to do this imputation, we set vars.for.imputation to all variables in the data frame except for not.fully.paid, to impute the values using all of the other independent variables.

imputed = complete(mice(loans[vars.for.imputation]))


##
## iter imp variable
## 1 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 1 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 2 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 3 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 4 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 1 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 2 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 3 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 4 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec
## 5 5 log.annual.inc days.with.cr.line revol.util inq.last.6mths delinq.2yrs pub.rec

loans[vars.for.imputation] = imputed

#What best describes the process we just used to handle missing values?
#Ans:We predicted missing variable values using the available independent variables for each observation
#EXPLANATION:Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent variable not.fully.paid removed, so we predicted the missing values using only other independent variables.
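
#As the IMPORTANT NOTE above recommends, you can load the course-provided imputed file and compare it to your own imputation (a sketch, assuming loans_imputed.csv is in the working directory; loansImputed is an illustrative name):

loansImputed = read.csv("loans_imputed.csv")
summary(loansImputed$revol.util)   #compare with summary(loans$revol.util); if they differ, use loansImputed going forward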

####################################

#PROBLEM 2.1 - PREDICTION MODELS

#Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select the 70% of observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the data frames train and test.

# - SPLITTING INTO A TRAINING AND TESTING SET

set.seed(144)
library(caTools)

split = sample.split(loans$not.fully.paid, SplitRatio = 0.7)

train = subset(loans, split == TRUE)


test = subset(loans, split == FALSE)

#Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using
all the independent variables.

mod<-glm(not.fully.paid~.,data=train,family = binomial)
summary(mod)


##
## Call:
## glm(formula = not.fully.paid ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2097 -0.6214 -0.4950 -0.3601 2.6414
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.260e+00 1.554e+00 5.958 2.55e-09 ***
## credit.policy -3.327e-01 1.011e-01 -3.292 0.000995 ***
## purpose2 -6.100e-01 1.344e-01 -4.538 5.67e-06 ***
## purpose3 -3.181e-01 9.185e-02 -3.463 0.000534 ***
## purpose4 1.386e-01 1.753e-01 0.791 0.429074
## purpose5 1.774e-01 1.479e-01 1.199 0.230496
## purpose6 -4.783e-01 2.009e-01 -2.381 0.017260 *
## purpose7 4.159e-01 1.419e-01 2.932 0.003373 **
## int.rate 6.202e-01 2.085e+00 0.297 0.766111
## installment 1.279e-03 2.093e-04 6.110 9.96e-10 ***
## log.annual.inc -4.357e-01 7.151e-02 -6.093 1.11e-09 ***
## dti 4.733e-03 5.501e-03 0.861 0.389508
## fico -9.406e-03 1.707e-03 -5.510 3.60e-08 ***
## days.with.cr.line 3.174e-06 1.587e-05 0.200 0.841463
## revol.bal 3.103e-06 1.169e-06 2.653 0.007966 **
## revol.util 1.796e-03 1.532e-03 1.172 0.241022
## inq.last.6mths 8.386e-02 1.577e-02 5.317 1.06e-07 ***
## delinq.2yrs -7.794e-02 6.532e-02 -1.193 0.232814
## pub.rec 3.191e-01 1.134e-01 2.814 0.004899 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5485.7 on 6686 degrees of freedom
## AIC: 5523.7
##
## Number of Fisher Scoring iterations: 5

#Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.) Select all that apply.
#Ans:credit.policy, purpose2, purpose3, purpose6, purpose7, installment, log.annual.inc, fico, revol.bal, inq.last.6mths, pub.rec
# Note that some have a positive coefficient (meaning that higher values of the variable lead to an increased risk of defaulting) and some have a negative coefficient (meaning that higher values of the variable lead to a decreased risk of defaulting).

###########################################

#PROBLEM 2.2 - PREDICTION MODELS

#Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.

#Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. What is the value of Logit(A) - Logit(B)?
-9.406e-03*-10 # coeff of FICO * (diff in FICO values=700-710=-10)

## [1] 0.09406

#Ans:0.09406
#EXPLANATION:Because Application A is identical to Application B other than having a FICO score 10 lower, its predicted log odds differ by -0.009317 * -10 = 0.09317 from the predicted log odds of Application B. (With the fico coefficient shown in the summary above, -9.406e-03, the difference is 0.09406; the small discrepancy arises because imputation results can vary by system, as noted earlier.)

#Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A)*exp(B)*exp(C). Also, remember that exp() is the exponential function in R.)
exp(0.09406) #logA-logB=log(A/B) hence oddsA/oddsb=exp(log(oddsA)-log(oddsB))=exp(log(oddsA/oddsB))

## [1] 1.098626


#Ans:1.0976

#EXPLANATION:Using the answer from the previous question, the predicted odds of loan A not being paid back in full are exp(0.09317) = 1.0976 times larger than the predicted odds for loan B. Intuitively, it makes sense that loan A should have higher odds of non-payment than loan B, since the borrower has a worse credit score.
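
#Both quantities can also be computed straight from the fitted coefficients (a sketch, assuming the full model is still called mod; the exact value depends on which imputed dataset was used):

logitDiff = coef(mod)["fico"] * (700 - 710)
logitDiff        #Logit(A) - Logit(B)
exp(logitDiff)   #O(A)/O(B)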

###################################################

#PROBLEM 2.3 - PREDICTION MODELS

#Predict the probability of the test set loans not being paid back in full (remember type="response" for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.

predicted.risk<-predict(mod,type="response", newdata=test)

## Warning: contrasts dropped from factor purpose

test$predicted.risk<-predicted.risk # Adding the predicted probabilities as a variable in the test dataset. This can be achieved in a single command: test$predicted.risk = predict(mod, newdata=test, type="response")

# Confusion matrix with threshold of 0.5


table(test$not.fully.paid, as.numeric(predicted.risk >= 0.5))

##
## 0 1
## 0 2400 13
## 1 457 3

#What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.
(2400+3)/(2400+13+457+3)

## [1] 0.8364079

#Ans: 0.8364079
#2403 predictions are correct (accuracy 2403/2873=0.8364)

#What is the accuracy of the baseline model? Input the accuracy as a number between 0 and 1.
(2400+13)/(2400+13+457+3) #From the Confusion matrix:an easier model would be to pick the most frequent outcome

## [1] 0.8398886

#or from
table(test$not.fully.paid)

##
## 0 1
## 2413 460

#Ans:0.8398886
#2413 predictions would be correct in the baseline model of guessing every loan would be paid back in full (accuracy 2413/2873=0.8399)

#######################################

#PROBLEM 2.4 - PREDICTION MODELS (2 points possible)

#Use the ROCR package to compute the test set AUC.

# Test set AUC


library(ROCR)
ROCRpred = prediction(predicted.risk, test$not.fully.paid)
as.numeric(performance(ROCRpred, "auc")@y.values)

## [1] 0.6721337


#Ans:0.6721337

#The model has poor accuracy at the threshold 0.5. But despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.

############################################

#PROBLEM 3.1 - A "SMART BASELINE"

#In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC
of 0.5 that would be obtained by randomly ordering observations.

#However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan's risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan's interest rate as a "smart baseline" to order the loans according to risk.

#Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.

bivariate<-glm(not.fully.paid~ int.rate,data=train,family = binomial)


summary(bivariate)

##
## Call:
## glm(formula = not.fully.paid ~ int.rate, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0547 -0.6271 -0.5442 -0.4361 2.2914
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6726 0.1688 -21.76 <2e-16 ***
## int.rate 15.9214 1.2702 12.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5896.6 on 6704 degrees of freedom
## Residual deviance: 5734.8 on 6703 degrees of freedom
## AIC: 5738.8
##
## Number of Fisher Scoring iterations: 4

#The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?
#Ans:int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.
#EXPLANATION:Decreased significance between a bivariate and multivariate model is typically due to correlation. From cor(train$int.rate, train$fico), we can see that the interest rate is moderately well correlated with a borrower's credit score. Training/testing set split rarely has a large effect on the significance of variables (this can be verified in this case by trying out a few other training/testing splits), and the models were trained on the same observations.

##############################################################

#PROBLEM 3.2 - A "SMART BASELINE"

#Make test set predictions for the bivariate model. What is the highest predicted probability of a loan not being
paid in full on the testing set?
pred.bivariate<-predict(bivariate,type="response", newdata=test)
max(pred.bivariate) #or

## [1] 0.426624

summary(pred.bivariate)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06196 0.11550 0.15080 0.15960 0.18930 0.42660


#Ans:0.426624

#With a logistic regression cutoff of 0.5, how many loans would be predicted as not being paid in full on the testing set?
#Ans:0
#EXPLANATION:According to the summary function, the maximum predicted probability of the loan not being paid back is 0.4266, which means no loans would be flagged at a logistic regression cutoff of 0.5.
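
#Optional sanity check (a sketch using the pred.bivariate vector computed above):
sum(pred.bivariate >= 0.5)  # should be 0, since the maximum predicted probability is about 0.4266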

#####################################################

#PROBLEM 3.3 - A "SMART BASELINE"

#What is the test set AUC of the bivariate model?

# Test set AUC


library(ROCR)
ROCRpred = prediction(pred.bivariate, test$not.fully.paid)
as.numeric(performance(ROCRpred, "auc")@y.values)

## [1] 0.6239081

################################################################

#PROBLEM 4.1 - COMPUTING THE PROFITABILITY OF AN INVESTMENT

#While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable. If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.

#To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c * exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.

#How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math. Enter the number of dollars, without the $ sign.
10*exp(0.06*3)

## [1] 11.97217


#Ans:11.97217
#EXPLANATION:In this problem, we have c=10, r=0.06, and t=3. Using the formula above, the final value is 10*exp(0.06*3) = 11.97.
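
#A small helper for the continuous-compounding formula (an optional sketch; the function name fv is my own, not part of the assignment):
fv <- function(c, r, t) c * exp(r * t)  # value of a $c investment at annual rate r after t years
fv(10, 0.06, 3)  # reproduces the 11.97217 computed above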

#################################################

#PROBLEM 4.2 - COMPUTING THE PROFITABILITY OF AN INVESTMENT

#While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment. What is the profit to the investor if the investment is paid back in full?
#Ans:c * exp(rt) - c
#EXPLANATION:A person's profit is what they get minus what they paid for it. In this case, the investor gets c * exp(rt) but paid c, yielding a profit of c * exp(rt) - c.

#########################################################

#PROBLEM 4.3 - COMPUTING THE PROFITABILITY OF AN INVESTMENT

#Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?
#Ans:-c
#EXPLANATION:A person's profit is what they get minus what they paid for it. In this case, the investor gets no money but paid c dollars, yielding a profit of -c dollars.
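
#Putting Problems 4.2 and 4.3 together, an optional sketch of a profit function (loanProfit is my own helper name, not part of the assignment):
loanProfit <- function(c, r, t, paid.back) {
  if (paid.back) c * exp(r * t) - c else -c  # full interest if repaid, total loss otherwise
}
loanProfit(1, 0.15, 3, TRUE)   # profit on a fully repaid 3-year loan at 15% interest
loanProfit(1, 0.15, 3, FALSE)  # pessimistic loss when the loan is not paid back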

###########################################################

#PROBLEM 5.1 - A SIMPLE INVESTMENT STRATEGY

#In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).

#In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test set. For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full. All the loans in our dataset are 3-year loans, meaning t=3 in our calculations. Enter the following commands in your R console to create this new variable:

test$profit = exp(test$int.rate*3) - 1

test$profit[test$not.fully.paid == 1] = -1

#What is the maximum profit of a $10 investment in any loan in the testing set (do not include the $ sign in your answer)?
max(test$profit[test$not.fully.paid == 0]) *10

## [1] 8.894769

#or
summary(test$profit)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.2858 0.4111 0.2094 0.4980 0.8895

0.8895 *10

## [1] 8.895


#Ans:8.894769
#EXPLANATION:From summary(test$profit), we see the maximum profit for a $1 investment in any loan is $0.8895. Therefore, the maximum profit of a $10 investment is 10 times as large, or $8.895.

#################################################################

#PROBLEM 6.1 - AN INVESTMENT STRATEGY BASED ON RISK

#A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.

#To meet this objective, we will analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.

#First, use the subset() function to build a data frame called highInterest consisting of the test set loans with an interest rate of at least 15%.
str(test)

## 'data.frame': 2873 obs. of 16 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 2 3 3 3 1 3 1 2 7 2 ...
## ..- attr(*, "contrasts")= num [1:7, 1:6] 0 1 0 0 0 0 0 0 0 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "all_other" "credit_card" "debt_consolidation" "educational" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ int.rate : num 0.107 0.136 0.122 0.132 0.08 ...
## $ installment : num 228.2 366.9 84.1 253.6 188 ...
## $ log.annual.inc : num 11.1 10.4 10.2 11.8 11.2 ...
## $ dti : num 14.29 11.63 10 9.16 16.08 ...
## $ fico : int 707 682 707 662 772 662 772 797 712 682 ...
## $ days.with.cr.line: num 2760 4710 2730 4298 4889 ...
## $ revol.bal : int 33623 3511 5630 5122 29797 4175 3660 6844 3534 43039 ...
## $ revol.util : num 76.7 25.6 23 18.2 23.2 51.5 6.8 14.4 54.4 93.4 ...
## $ inq.last.6mths : int 0 1 1 2 1 0 0 0 0 3 ...
## $ delinq.2yrs : int 0 0 0 1 0 1 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ predicted.risk : num 0.077 0.1735 0.1087 0.1023 0.0676 ...
## $ profit : num 0.379 0.502 0.442 0.488 0.271 ...

highInterest<-subset(test,int.rate>=0.15)
str(highInterest)

## 'data.frame': 437 obs. of 16 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 7 2 7 3 3 3 3 3 3 5 ...
## ..- attr(*, "contrasts")= num [1:7, 1:6] 0 1 0 0 0 0 0 0 0 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "all_other" "credit_card" "debt_consolidation" "educational" ...
## .. .. ..$ : chr "2" "3" "4" "5" ...
## $ int.rate : num 0.15 0.153 0.156 0.152 0.162 ...
## $ installment : num 225 444 700 730 881 ...
## $ log.annual.inc : num 12.3 11 11.4 11.2 11.6 ...
## $ dti : num 6.45 19.52 1.53 21.84 24.84 ...
## $ fico : int 677 667 667 677 682 662 667 662 672 662 ...
## $ days.with.cr.line: num 6240 2701 3604 2640 7619 ...
## $ revol.bal : int 56411 33074 4569 27839 51039 26255 34841 20658 28511 2076 ...
## $ revol.util : num 75.3 68.8 56.4 72.7 93.8 74.3 89.6 55.4 56.7 23.1 ...
## $ inq.last.6mths : int 0 2 1 0 0 1 0 0 1 2 ...
## $ delinq.2yrs : int 0 0 0 1 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
## $ not.fully.paid : int 1 0 0 0 1 0 0 0 0 0 ...
## $ predicted.risk : num 0.164 0.169 0.332 0.203 0.221 ...
## $ profit : num -1 0.584 0.599 0.578 -1 ...

#What is the average profit of a $1 investment in one of these high-interest loans (do not include the $ sign in your answer)?
mean(highInterest$profit) #or summary(highInterest$profit)

## [1] 0.2251015


#Ans:0.2251015

#What proportion of the high-interest loans were not paid back in full?
prop.table(table(highInterest$not.fully.paid)) #110 of the 437 loans were not paid back in full, for a proportion of 0.2517.

##
## 0 1
## 0.7482838 0.2517162

#Ans:0.2517162

####################################################

#PROBLEM 6.2 - AN INVESTMENT STRATEGY BASED ON RISK

#Next, we will determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list. Find the highest predicted risk that we will include by typing the following command into your R console:

cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]


cutoff

## [1] 0.1769593

#Use the subset() function to build a data frame called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cutoff we just computed. Check to make sure you have selected 100 loans for investment.
selectedLoans<-subset(highInterest,predicted.risk <= cutoff)
nrow(selectedLoans) # check to see if it has 100 obs

## [1] 100

summary(selectedLoans$profit)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.5830 0.5992 0.3287 0.6317 0.8508

#What is the profit of the investor, who invested $1 in each of these 100 loans (do not include the $ sign in your answer)?
sum(selectedLoans$profit)

## [1] 32.87361

#Ans:31.28 (the expected answer; the value computed here is 32.87, a slight discrepancy due to the imputation of missing data noted at the end of this problem)

#How many of 100 selected loans were not paid back in full?
table(selectedLoans$not.fully.paid)

##
## 0 1
## 82 18

#Ans:18

#We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor's $100 investment. Although the logistic regression models developed in this problem did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.

#We conclude with a note of warning. Throughout this analysis we assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.

#Note:Slight discrepancy in answers is due to imputation effect of missing data

Assignment 3:PREDICTING THE BASEBALL WORLD SERIES CHAMPION (OPTIONAL)

Last week, in the Moneyball lecture, we discussed how regular season performance is not strongly correlated with winning the World Series in baseball. In this homework question, we'll use the same data to investigate how well we can predict the World Series winner at the beginning of the playoffs.

To begin, load the dataset baseball.csv into R using the read.csv function, and call the data frame baseball. This is the same data file we used during the Moneyball lecture, and the data comes from Baseball-Reference.com.

As a reminder, this dataset contains data concerning a baseball team's performance in a given year. It has the following variables:

Team: A code for the name of the team


League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League)
Year: The year of the corresponding record
RS: The number of runs scored by the team in that year
RA: The number of runs allowed by the team in that year
W: The number of regular season wins by the team in that year
OBP: The on-base percentage of the team in that year
SLG: The slugging percentage of the team in that year
BA: The batting average of the team in that year
Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no)
RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best)
RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1.
G: The number of games a team played in that year
OOBP: The team's opponents' on-base percentage in that year
OSLG: The team's opponents' slugging percentage in that year

#Let's start by loading the relevant dataset


baseball<-read.csv("baseball.csv")
str(baseball)

## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

#PROBLEM 1.1 - LIMITING TO TEAMS MAKING THE PLAYOFFS

#Each row in the baseball dataset represents a team in a particular year.

#How many team/year pairs are there in the whole dataset?


nrow(baseball) #or str(baseball)

## [1] 1232

#Ans:1232

############################################

#PROBLEM 1.2 - LIMITING TO TEAMS MAKING THE PLAYOFFS

#Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.
length(table(baseball$Year)) #or

## [1] 47

length(unique(baseball$Year))

## [1] 47


#Ans:47

################################################

#PROBLEM 1.3 - LIMITING TO TEAMS MAKING THE PLAYOFFS

#Because we're only analyzing teams that made the playoffs, use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so your subsetted data frame should still be called "baseball"). How many team/year pairs are included in the new dataset?
baseball<-subset(baseball,Playoffs ==1)
nrow(baseball) #or str(baseball)

## [1] 244

#Ans:244

########################################

#PROBLEM 1.4 - LIMITING TO TEAMS MAKING THE PLAYOFFS

#Through the years, different numbers of teams have been invited to the playoffs. Which of the following has been the number of teams making the playoffs in some season? Select all that apply.
str(baseball)

## 'data.frame': 244 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 3 4 9 12 25 26 32 33 36 39 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 1 2 1 1 1 2 2 1 2 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 700 712 669 726 804 713 718 765 808 731 ...
## $ RA : int 600 705 588 670 668 614 649 648 707 594 ...
## $ W : int 94 93 97 88 95 94 94 88 93 98 ...
## $ OBP : num 0.32 0.311 0.315 0.335 0.337 0.31 0.327 0.338 0.334 0.322 ...
## $ SLG : num 0.389 0.417 0.411 0.422 0.453 0.404 0.397 0.421 0.446 0.428 ...
## $ BA : num 0.247 0.247 0.251 0.268 0.265 0.238 0.269 0.271 0.273 0.261 ...
## $ Playoffs : int 1 1 1 1 1 1 1 1 1 1 ...
## $ RankSeason : int 4 5 2 6 3 4 4 6 5 1 ...
## $ RankPlayoffs: int 5 4 4 2 3 4 1 3 5 4 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.306 0.315 0.305 0.314 0.311 0.306 0.313 0.313 0.309 0.303 ...
## $ OSLG : num 0.378 0.403 0.39 0.402 0.419 0.378 0.393 0.387 0.408 0.373 ...

table(baseball$Year) #or

##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4
## 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## 2011 2012
## 8 10

table(table(baseball$Year)) #fancier o/p

##
## 2 4 8 10
## 7 23 16 1

#Ans:2,4,8,10

########################################

#PROBLEM 2.1 - ADDING AN IMPORTANT PREDICTOR

#It's much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.

#We start by storing the output of the table() function that counts the number of playoff teams from each year:

PlayoffTable = table(baseball$Year)
PlayoffTable


##
## 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977
## 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4
## 1978 1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## 2011 2012
## 8 10

#We will use this stored table to look up the number of teams in the playoffs in the year of each team/year pair.

#Just as we can use the names() function to get the names of a data frame's columns, we can use it to get the names of the entries in a table. What best describes the output of names(PlayoffTable)?
names(PlayoffTable)

## [1] "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969" "1970" "1971"
## [11] "1973" "1974" "1975" "1976" "1977" "1978" "1979" "1980" "1982" "1983"
## [21] "1984" "1985" "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993"
## [31] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
## [41] "2006" "2007" "2008" "2009" "2010" "2011" "2012"

class(names(PlayoffTable)) #or

## [1] "character"

str(names(PlayoffTable))

## chr [1:47] "1962" "1963" "1964" "1965" "1966" "1967" ...

#Ans:Vector of years stored as strings (type chr)


#EXPLANATION:From the call str(names(PlayoffTable)) we see PlayoffTable has names of type chr, which are the years of the teams in the dataset.

#########################################################

#PROBLEM 2.2 - ADDING AN IMPORTANT PREDICTOR

#Given a vector of names, the table will return a vector of frequencies. Which function call returns the number of playoff teams in 1990 and 2001? (HINT: If you are not sure how these commands work, go ahead and try them out in your R console!)
PlayoffTable[c("1990", "2001")]

##
## 1990 2001
## 4 8

#Ans:PlayoffTable[c("1990", "2001")]
#EXPLANATION:Because PlayoffTable is an object and not a function, we look up elements in it with square brackets instead of parentheses. We build the vector of years to be passed with the c() function. Because the names of PlayoffTable are strings and not numbers, we need to pass "1990" and "2001".

####################################################

#PROBLEM 2.3 - ADDING AN IMPORTANT PREDICTOR

#Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame. Which of the following function calls accomplishes this? (HINT: Test out the functions if you are not sure what they do.)
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
#Ans:baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
#EXPLANATION:Because PlayoffTable is an object and not a function, we look up elements in it with square brackets instead of parentheses. as.character() is needed to convert the Year variable in the dataset to a string, which we know from the previous parts is needed to look up elements in a table. If you're not sure what a function does, remember you can look it up with the ? function. For instance, you could type ?as.character to look up information about as.character.
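
#Optional sanity check of the lookup, using the example years mentioned above:
PlayoffTable[as.character(c(1962, 1998))]  # should return 2 and 8, matching the 1962 Yankees and 1998 Red Sox examples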

##############################################

#PROBLEM 2.4 - ADDING AN IMPORTANT PREDICTOR

#Add the NumCompetitors variable to your baseball data frame. How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?
table(baseball$NumCompetitors)


##
## 2 4 8 10
## 14 92 128 10

#Ans:128

###########################################

#PROBLEM 3.1 - BIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER

#In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:

baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)

#WorldSeries takes value 1 if a team won the World Series in the indicated year and a 0 otherwise. How many observations do we have in our dataset where a team did NOT win the World Series?
table(baseball$WorldSeries)

##
## 0 1
## 197 47

#Ans:197

########################################

#PROBLEM 3.2 - BIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER

#When we're not sure which of our variables are useful in predicting a particular outcome, it's often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We'll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models. (Select all that apply.)
summary(glm(WorldSeries~Year, data=baseball, family="binomial"))

##
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0297 -0.6797 -0.5435 -0.4648 2.1504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.23602 22.64409 3.19 0.00142 **
## Year -0.03700 0.01138 -3.25 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 228.35 on 242 degrees of freedom
## AIC: 232.35
##
## Number of Fisher Scoring iterations: 4

#similarly, fit a bivariate model for each of the remaining variables one by one (only the four significant ones are reproduced here; a compact loop over all 12 candidates is sketched after the explanation below)


summary(glm(WorldSeries~RA , data=baseball, family="binomial"))


##
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9749 -0.6883 -0.6118 -0.4746 2.1577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.888174 1.483831 1.272 0.2032
## RA -0.005053 0.002273 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 233.88 on 242 degrees of freedom
## AIC: 237.88
##
## Number of Fisher Scoring iterations: 4

summary(glm(WorldSeries~RankSeason , data=baseball, family="binomial"))

##
## Call:
## glm(formula = WorldSeries ~ RankSeason, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7805 -0.7131 -0.5918 -0.4882 2.1781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8256 0.3268 -2.527 0.0115 *
## RankSeason -0.2069 0.1027 -2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 234.75 on 242 degrees of freedom
## AIC: 238.75
##
## Number of Fisher Scoring iterations: 4

summary(glm(WorldSeries~NumCompetitors , data=baseball, family="binomial"))

##
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9871 -0.8017 -0.5089 -0.5089 2.2643
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03868 0.43750 0.088 0.929559
## NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.96 on 242 degrees of freedom
## AIC: 230.96
##
## Number of Fisher Scoring iterations: 4


#Ans:Year, RA, RankSeason, NumCompetitors
#EXPLANATION:The results come from building each bivariate model and looking at its summary. For instance, the result for the variable Year can be obtained by running summary(glm(WorldSeries~Year, data=baseball, family="binomial")). You can save time on repeated model building by using the up arrow in your R terminal. The W and SLG variables were both nearly significant, with p = 0.0577 and 0.0504, respectively.
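
#Instead of fitting the 12 bivariate models one at a time, an optional sketch that loops over the candidate predictors and extracts each one's p-value (the candidates vector is my reconstruction of the 12 choices listed in the question):
candidates <- c("Year", "RS", "RA", "W", "OBP", "SLG", "BA", "RankSeason", "OOBP", "OSLG", "NumCompetitors", "League")
pvals <- sapply(candidates, function(v) {
  m <- glm(as.formula(paste("WorldSeries ~", v)), data = baseball, family = binomial)
  summary(m)$coefficients[2, 4]  # p-value of the single predictor (row 2 of the coefficient table)
})
sort(round(pvals, 4))  # predictors with p < 0.05 are the significant ones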

####################################

#PROBLEM 4.1 - MULTIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER

#In this section, we'll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?
LogModel<-glm(WorldSeries~Year+RankSeason+RA+NumCompetitors, data=baseball, family="binomial")
summary(LogModel)

##
## Call:
## glm(formula = WorldSeries ~ Year + RankSeason + RA + NumCompetitors,
## family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0336 -0.7689 -0.5139 -0.4583 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.5874376 53.6474210 0.235 0.814
## Year -0.0061425 0.0274665 -0.224 0.823
## RankSeason -0.0685046 0.1203459 -0.569 0.569
## RA -0.0008238 0.0027391 -0.301 0.764
## NumCompetitors -0.1794264 0.1815933 -0.988 0.323
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.37 on 239 degrees of freedom
## AIC: 236.37
##
## Number of Fisher Scoring iterations: 4

#Ans:0 (none of the variables is significant in the combined model)

###################################

#PROBLEM 4.2 - MULTIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER

#Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)? Select all that apply.
cor(baseball[c("Year","RankSeason","RA","NumCompetitors")])

## Year RankSeason RA NumCompetitors
## Year 1.0000000 0.3852191 0.4762422 0.9139548
## RankSeason 0.3852191 1.0000000 0.3991413 0.4247393
## RA 0.4762422 0.3991413 1.0000000 0.5136769
## NumCompetitors 0.9139548 0.4247393 0.5136769 1.0000000

#or
vars<-c("Year","RankSeason","RA","NumCompetitors")
cor(baseball[,vars])

## Year RankSeason RA NumCompetitors
## Year 1.0000000 0.3852191 0.4762422 0.9139548
## RankSeason 0.3852191 1.0000000 0.3991413 0.4247393
## RA 0.4762422 0.3991413 1.0000000 0.5136769
## NumCompetitors 0.9139548 0.4247393 0.5136769 1.0000000


#Ans:Year/NumCompetitors (0.9139548)

###########################################

#PROBLEM 4.3 - MULTIVARIATE MODELS FOR PREDICTING WORLD SERIES WINNER

#Build all six of the two variable models listed in the previous problem. Together with the four bivariate models, you should have 10 different logistic regression models. Which model has the best AIC value (the minimum AIC value)?
Model1 = glm(WorldSeries ~ Year + RA, data=baseball, family=binomial)
summary(Model1)

##
## Call:
## glm(formula = WorldSeries ~ Year + RA, family = binomial, data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0402 -0.6878 -0.5298 -0.4785 2.1370
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.610741 25.654830 2.479 0.0132 *
## Year -0.032084 0.013323 -2.408 0.0160 *
## RA -0.001766 0.002585 -0.683 0.4945
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 227.88 on 241 degrees of freedom
## AIC: 233.88
##
## Number of Fisher Scoring iterations: 4

Model2 = glm(WorldSeries ~ Year + RankSeason, data=baseball, family=binomial)


summary(Model2)

##
## Call:
## glm(formula = WorldSeries ~ Year + RankSeason, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0560 -0.6957 -0.5379 -0.4528 2.2673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.64855 24.37063 2.612 0.00901 **
## Year -0.03254 0.01231 -2.643 0.00822 **
## RankSeason -0.10064 0.11352 -0.887 0.37534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 227.55 on 241 degrees of freedom
## AIC: 233.55
##
## Number of Fisher Scoring iterations: 4

Model3 = glm(WorldSeries ~ Year + NumCompetitors, data=baseball, family=binomial)


summary(Model3)


##
## Call:
## glm(formula = WorldSeries ~ Year + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0050 -0.7823 -0.5115 -0.4970 2.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.350467 53.481896 0.250 0.803
## Year -0.006802 0.027328 -0.249 0.803
## NumCompetitors -0.212610 0.175520 -1.211 0.226
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.90 on 241 degrees of freedom
## AIC: 232.9
##
## Number of Fisher Scoring iterations: 4

Model4 = glm(WorldSeries ~ RA + RankSeason, data=baseball, family=binomial)


summary(Model4)

##
## Call:
## glm(formula = WorldSeries ~ RA + RankSeason, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9374 -0.6933 -0.5936 -0.4564 2.1979
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.487461 1.506143 0.988 0.323
## RA -0.003815 0.002441 -1.563 0.118
## RankSeason -0.140824 0.110908 -1.270 0.204
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 232.22 on 241 degrees of freedom
## AIC: 238.22
##
## Number of Fisher Scoring iterations: 4

Model5 = glm(WorldSeries ~ RA + NumCompetitors, data=baseball, family=binomial)


summary(Model5)

##
## Call:
## glm(formula = WorldSeries ~ RA + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0433 -0.7826 -0.5133 -0.4701 2.2208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.716895 1.528736 0.469 0.63911
## RA -0.001233 0.002661 -0.463 0.64313
## NumCompetitors -0.229385 0.088399 -2.595 0.00946 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.74 on 241 degrees of freedom
## AIC: 232.74
##
## Number of Fisher Scoring iterations: 4


Model6 = glm(WorldSeries ~ RankSeason + NumCompetitors, data=baseball, family=binomial)


summary(Model6)

##
## Call:
## glm(formula = WorldSeries ~ RankSeason + NumCompetitors, family = binomial,
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0090 -0.7592 -0.5204 -0.4501 2.2562
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.12277 0.45737 0.268 0.78837
## RankSeason -0.07697 0.11711 -0.657 0.51102
## NumCompetitors -0.22784 0.08201 -2.778 0.00546 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.52 on 241 degrees of freedom
## AIC: 232.52
##
## Number of Fisher Scoring iterations: 4

#Bivariate model constructed earlier


summary(glm(WorldSeries~Year, data=baseball, family="binomial"))

##
## Call:
## glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0297 -0.6797 -0.5435 -0.4648 2.1504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 72.23602 22.64409 3.19 0.00142 **
## Year -0.03700 0.01138 -3.25 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 228.35 on 242 degrees of freedom
## AIC: 232.35
##
## Number of Fisher Scoring iterations: 4

summary(glm(WorldSeries~RA , data=baseball, family="binomial"))


##
## Call:
## glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9749 -0.6883 -0.6118 -0.4746 2.1577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.888174 1.483831 1.272 0.2032
## RA -0.005053 0.002273 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 233.88 on 242 degrees of freedom
## AIC: 237.88
##
## Number of Fisher Scoring iterations: 4

summary(glm(WorldSeries~RankSeason , data=baseball, family="binomial"))

##
## Call:
## glm(formula = WorldSeries ~ RankSeason, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7805 -0.7131 -0.5918 -0.4882 2.1781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8256 0.3268 -2.527 0.0115 *
## RankSeason -0.2069 0.1027 -2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 234.75 on 242 degrees of freedom
## AIC: 238.75
##
## Number of Fisher Scoring iterations: 4

summary(glm(WorldSeries~NumCompetitors , data=baseball, family="binomial"))

##
## Call:
## glm(formula = WorldSeries ~ NumCompetitors, family = "binomial",
## data = baseball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9871 -0.8017 -0.5089 -0.5089 2.2643
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03868 0.43750 0.088 0.929559
## NumCompetitors -0.25220 0.07422 -3.398 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 239.12 on 243 degrees of freedom
## Residual deviance: 226.96 on 242 degrees of freedom
## AIC: 230.96
##
## Number of Fisher Scoring iterations: 4


#Ans:NumCompetitors
#EXPLANATION:None of the models with two independent variables had both variables significant, so none seem promising as compared to a simple bivariate model. Indeed the model with the lowest AIC value is the model with just NumCompetitors as the independent variable. This seems to confirm the claim made by Billy Beane in Moneyball that all that matters in the Playoffs is luck, since NumCompetitors has nothing to do with the quality of the teams!
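
#An optional sketch that collects the AIC of all 10 models in one place (the list names are my own labels, not part of the assignment):
models <- list(
  Year               = glm(WorldSeries ~ Year, data = baseball, family = binomial),
  RA                 = glm(WorldSeries ~ RA, data = baseball, family = binomial),
  RankSeason         = glm(WorldSeries ~ RankSeason, data = baseball, family = binomial),
  NumCompetitors     = glm(WorldSeries ~ NumCompetitors, data = baseball, family = binomial),
  Year_RA            = glm(WorldSeries ~ Year + RA, data = baseball, family = binomial),
  Year_RankSeason    = glm(WorldSeries ~ Year + RankSeason, data = baseball, family = binomial),
  Year_NumComp       = glm(WorldSeries ~ Year + NumCompetitors, data = baseball, family = binomial),
  RA_RankSeason      = glm(WorldSeries ~ RA + RankSeason, data = baseball, family = binomial),
  RA_NumComp         = glm(WorldSeries ~ RA + NumCompetitors, data = baseball, family = binomial),
  RankSeason_NumComp = glm(WorldSeries ~ RankSeason + NumCompetitors, data = baseball, family = binomial)
)
sort(sapply(models, AIC))  # the bivariate NumCompetitors model has the smallest AIC (230.96)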

######################################################

######################################################################
