Disadvantages of a very flexible (complex) versus a less flexible (simple) modelling approach, with circumstances:
• If the data being studied are noisy (because of outliers, etc.), a highly flexible model follows the training data too closely, noise included, which leads to higher variance in its predictions
• This over-fitting to the training data results in a high overall prediction error on test/unseen data
• Fitting a flexible model requires estimating a large number of parameters or non-linear functions (time and cost)
• Flexible models are not very interpretable, as the fitted relationship between each predictor and the response is curvilinear
• Circumstances in which a less flexible (simple) model is preferable to a more complex one:
o The relationship between predictors and response is non-linear but Var(error) is large: an inflexible model is preferred, otherwise the flexible model will fit the noise as well
o Many predictors but a small sample: we should reduce the model's variance by reducing its complexity. Since there is not enough data to train and validate a flexible model through cross-validation (which could otherwise have kept its variance in check), we prefer simple models to infer patterns in the data
o The goal is to infer the relationship between predictors and response, rather than to predict
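The over-fitting argument above can be illustrated with a small simulation. This is only a sketch on made-up data (a sine signal plus large-variance noise); the sample sizes, the polynomial degrees and all variable names are assumptions chosen for illustration and are not part of the assignment.

```r
# Simulated data (an assumption for illustration, not the assignment's data):
# a smooth non-linear signal buried in large-variance noise.
set.seed(1)
n <- 60
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 1)

train_idx <- sample(n, 40)
train_df <- data.frame(x = x[train_idx], y = y[train_idx])
test_df  <- data.frame(x = x[-train_idx], y = y[-train_idx])

# Mean squared error of a fitted model on a given data frame
mse <- function(fit, data) mean((data$y - predict(fit, data))^2)

simple_fit   <- lm(y ~ poly(x, 1),  data = train_df)  # inflexible: straight line
flexible_fit <- lm(y ~ poly(x, 12), data = train_df)  # flexible: degree-12 polynomial

errors <- rbind(
  simple   = c(train = mse(simple_fit, train_df),   test = mse(simple_fit, test_df)),
  flexible = c(train = mse(flexible_fit, train_df), test = mse(flexible_fit, test_df))
)
print(round(errors, 3))
```

The flexible fit can never have a larger training error than the straight line, because the linear model is nested inside it; whether its test error is worse depends on the noise level and the sample, which is exactly the trade-off described in the bullets.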
Swapnil Shashank Parkhe (UIN-660014865) Assignment 1 (All R-codes are pasted at end)
Ans-4) K-NN method
a) 3-D Euclidean distance (between any point X and Y) = sqrt{[(X1-Y1)^2]+[(X2-Y2)^2]+[(X3-Y3)^2]}
Where point X is (X1, X2, X3) and point Y is (Y1, Y2, Y3) in 3-dimensional space
For our problem, we calculate the Euclidean distance (grey column) between the test data point (X1=0, X2=0, X3=0) and
every data point in the training data, one per obs/row in the table below:
Training data with distance to the test point:
Obs  X1  X2  X3  Y      Distance
1     0   3   0  Red    3
2     2   0   0  Red    2
3     0   1   3  Red    3.16227766
4     0   1   2  Green  2.236067977
5    -1   0   1  Green  1.414213562
6     1   1   1  Red    1.732050808
Sorting on distance:
Obs  X1  X2  X3  Y      Distance
5    -1   0   1  Green  1.414213562
6     1   1   1  Red    1.732050808
2     2   0   0  Red    2
4     0   1   2  Green  2.236067977
1     0   3   0  Red    3
3     0   1   3  Red    3.16227766
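The distance column and the sorted table can be reproduced with a few lines of base R. The sketch below also takes a majority vote among the K nearest neighbours; the helper name knn_vote and the choice of K=1 and K=3 are assumptions added for illustration.

```r
# Training observations from the table above (one row per observation)
train_x <- rbind(c(0, 3, 0), c(2, 0, 0), c(0, 1, 3),
                 c(0, 1, 2), c(-1, 0, 1), c(1, 1, 1))
train_y <- c("Red", "Red", "Red", "Green", "Green", "Red")
test_point <- c(0, 0, 0)

# Euclidean distance from every training row to the test point
dists <- sqrt(rowSums(sweep(train_x, 2, test_point)^2))

# Observations sorted by distance, as in the second table
sorted <- data.frame(Obs = order(dists), Y = train_y[order(dists)],
                     Distance = sort(dists))
print(sorted)

# Hypothetical helper: majority class among the k nearest neighbours
knn_vote <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  names(which.max(table(train_y[nearest])))
}

knn_vote(1)  # nearest neighbour is observation 5
knn_vote(3)  # neighbours are observations 5, 6 and 2
```

With K=1 the single nearest neighbour (observation 5) is Green, while with K=3 two of the three nearest neighbours (observations 6 and 2) are Red, so the vote flips to Red.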
• Mixed plot using the library: PerformanceAnalytics
Note:
i. The diagonal has histograms and density plots for the individual variables
ii. To the left of the diagonal, we have scatterplots and trend lines for two variables at a time
iii. To the right of the diagonal, we have the Pearson correlation between two variables at a time (text size is proportional to the value)
• Comments on observations and findings:
o mpg seems to vary strongly and inversely with displacement, horsepower, and weight
o High values of mpg seem to be attained with 4 cylinders. Beyond 4 cylinders, mpg seems to be
negatively correlated with the number of cylinders
o displacement, weight, horsepower, and cylinders seem to be highly linearly correlated with each other
o acceleration seems to have an inverse relationship with horsepower & displacement
• Yes, as per the above plots (especially the mixed plot using the PerformanceAnalytics library), we can see:
o mpg has a very strong inverse linear relationship with cylinders, displacement, horsepower, & weight. Since
displacement, horsepower & weight are highly correlated with each other, horsepower alone could be used &
the others dropped
o mpg has a moderate to good positive relationship with year and origin
Ans-6) Auto Dataset (Chap-3)
a) Scatterplot for all numeric variables (Bi-variate visual inspection)
b) Correlation matrix using cor()
Variable       mpg    cylinders  displacement  horsepower  weight  acceleration  year   origin
mpg            1      -0.78      -0.80         NA          -0.83   0.42          0.58   0.56
cylinders      -0.78  1          0.95          NA          0.90    -0.50         -0.35  -0.56
displacement   -0.80  0.95       1             NA          0.93    -0.54         -0.37  -0.61
horsepower     NA     NA         NA            1           NA      NA            NA     NA
weight         -0.83  0.90       0.93          NA          1       -0.42         -0.31  -0.58
acceleration   0.42   -0.50      -0.54         NA          -0.42   1             0.28   0.21
year           0.58   -0.35      -0.37         NA          -0.31   0.28          1      0.18
origin         0.56   -0.56      -0.61         NA          -0.58   0.21          0.18   1
Key for above matrix:
• Correlation values are based on Pearson coefficients
• NA: 'horsepower' contains missing values, hence its correlations with the other variables are reported as NA
• The darker the shade of green, the stronger the positive linear relationship
• The darker the shade of red, the stronger the negative linear relationship
• Colors vary from red to yellow to green: negative to neutral/no to positive linear relationship, respectively
Note: We could remove the rows containing missing values (if they are very few) and re-evaluate the correlation matrix
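The NA behaviour noted above, and the suggested remedy, can be sketched on a toy data frame. The data frame toy and its columns are made up for illustration; the use = "complete.obs" argument of cor() drops any row that contains a missing value before computing the correlations.

```r
# Toy data (an assumption for illustration): column b has one missing value
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

cor_default  <- cor(toy)                        # pairs involving b come back as NA
cor_complete <- cor(toy, use = "complete.obs")  # rows with any NA are dropped first

print(round(cor_default, 2))
print(round(cor_complete, 2))
```

An alternative with the same effect is to call na.omit() on the data frame first, which is what the R code at the end does before fitting the regression.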
c) Multiple linear regression using mpg as the response variable and all others as predictors except name.
Note: We have removed the rows having missing values (which were merely 5 rows) for further analysis
i. The null hypothesis (that there is no relationship between mpg & the predictors) is rejected
by the very low p-value (<<<0.05) of the F-test, so there is a statistically significant relationship
ii. The predictors displacement, weight, year, & origin seem to have a statistically significant relationship with the
response (mpg), based on the p-values (<=0.05) of their t-tests
iii. The coefficient of the year variable (=0.75) means that if the year increases by one unit (say from one model year
to the next), then the corresponding increase in mpg is 0.75 with the other predictors held fixed, meaning that over the years cars tend to become more efficient
d) Checking the Linear Model fit with our data: Diagnostic plots
Observations/Comments along with Findings/Insight:
• Residuals vs. Fitted: (Roughly tells us about the shape of the underlying true function & the spread of the data)
Residuals follow a mildly non-linear, U-shaped pattern w.r.t. the fitted values. This suggests some underlying
non-linearity in the true function describing our training data, probably quadratic in nature, because:
o Low fitted values have slightly positive residuals
o Mid-range fitted values give residuals around 0 to slightly negative
o Higher fitted values again give positive residuals
This in turn means that our model is not capturing the non-linearity in the data. Also, the residuals
seem to grow as the fitted values increase, meaning the data spread out more at higher values.
• Normal Q-Q plot: (Roughly conveys skewness & the presence of outliers in the data)
Most of the residuals lie on the reference line of the normal Q-Q plot and hence seem to be normally distributed,
except at the right tail, where there seem to be some outliers and skewness
• Scale-Location: (Roughly tells us about the nature of the data's spread, e.g. low to high spread has an increasing slope)
Residuals seem to have an increasing variance w.r.t. fitted values: they form a fairly uniform cloud across most values
but spread out slightly for higher values, which contradicts the constant-variance-of-errors assumption
• Residuals vs Leverage: (Roughly tells us about the tightness of the data along with leverage and outliers)
There seem to be some outliers such as 325, 321, 324, etc., whose standardized residuals fall outside the range
-2 to 2, along with one high-leverage point '14' (the only point lying very far out in the domain of the predictors).
Both the outliers and the leverage point could adversely influence our model.
Note: We ought to treat the outliers and make transformations on data in order to satisfy the assumptions of
linear regression (it being a parametric supervised learning method, it has certain assumptions to be satisfied)
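The outlier and leverage checks read off the plots can also be computed directly. The sketch below uses the built-in mtcars data as a stand-in for the Auto data (an assumption, so that it runs without the assignment's Excel file); the cut-offs of 2 for standardized residuals and 2p/n for leverage are common rules of thumb, not values taken from the assignment.

```r
# mtcars ships with base R and stands in for the Auto data here
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

std_res  <- rstandard(fit)   # standardized residuals
leverage <- hatvalues(fit)   # diagonal of the hat matrix

n <- nrow(mtcars)
p <- length(coef(fit))       # number of coefficients, intercept included

# Rule-of-thumb flags (assumed thresholds)
outliers      <- names(std_res)[abs(std_res) > 2]
high_leverage <- names(leverage)[leverage > 2 * p / n]

print(outliers)
print(high_leverage)

# The same four diagnostic plots discussed above
par(mfrow = c(2, 2))
plot(fit)
```

Listing the flagged rows by name makes it easier to inspect them before deciding whether to transform variables or treat the points as outliers.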
e) Interaction effects
Which interaction terms come out statistically significant depends on the combination of variables that are allowed to
interact in the linear regression. For example:
• Allowing all variables to interact, the significant pairs are:
o Acceleration-origin
o Displacement-year
o Acceleration-year
• Allowing originally insignificant variables to interact with significant ones, the significant pair is:
o Displacement-horsepower
• Allowing originally insignificant variables to interact with each other, the significant pairs are:
o Cylinders-horsepower
o Horsepower-acceleration
o Cylinders-acceleration
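For reference, the three ways of specifying interactions in an R formula behave as sketched below. The built-in mtcars data and the chosen predictors are stand-ins (assumptions) so that the example runs without the Auto file.

```r
# a*b expands to a + b + a:b, so these two models are identical
fit_star  <- lm(mpg ~ hp * wt, data = mtcars)
fit_colon <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)

# .^2 adds every main effect plus every pairwise interaction
sub <- mtcars[, c("mpg", "hp", "wt", "disp")]
fit_all <- lm(mpg ~ .^2, data = sub)

names(coef(fit_star))
names(coef(fit_all))

# p-values of the interaction terms only
coefs <- summary(fit_all)$coefficients
print(coefs[grepl(":", rownames(coefs)), "Pr(>|t|)"])
```

Because a lone a:b term omits the main effects, a formula built only from ":" terms fits a different model from one built with "*", which is one reason the set of significant pairs changes between the specifications listed above.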
f) Transforming some variables which are non-linearly related with mpg using log(X), sqrt(X), and X^2
• Mixed plot (after and before)
Observations and Insights
• Log(X): makes right-skewed data less skewed
o For horsepower: from right-skewed to less skewed
• Sqrt(X): makes right-skewed data less skewed and, unlike log, can
be applied when values equal 0 (but it is a weaker correction)
o For horsepower: hasn't made much impact
• X^2: makes left-skewed data less skewed
o For horsepower: hasn't made much impact
Other comments:
• It's helpful to transform predictors that have a non-linear relationship with the response using the above methods
• The transformation method should be chosen based on the variable's own distribution & the requirements of the model
or analysis. For linear regression, for example, a linear relationship between predictors & response is required
• After transformation, horsepower and weight vary much more linearly with mpg than before
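The effect of log and square-root transformations on right-skewed data can be checked numerically. The sketch below uses a deterministic, artificially right-skewed sample and a hand-written moment-based skewness function; both are assumptions for illustration and do not use the Auto data.

```r
# Moment-based sample skewness (hypothetical helper, not from a package)
skew <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Deterministic right-skewed sample: exponentiated normal quantiles
z <- qnorm(ppoints(200))
x <- exp(z)

skewness <- c(raw = skew(x), sqrt = skew(sqrt(x)), log = skew(log(x)))
print(round(skewness, 3))
```

On this sample the square root reduces the skewness only partly, while the log removes it, which matches the remark that sqrt is the weaker correction.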
Top 3 positively (first row) and negatively (second row) linearly correlated variables
• Comments on observations and findings:
o Accessibility to radial highways (rad) is strongly & positively correlated with the full-value property-tax rate per 10k (tax),
which makes sense as the closer a property is to highways, the higher its value and hence its tax should be
o Nitrogen oxide concentration (nox) is highly and positively correlated with the proportion of non-retail business
(indus) & with the proportion of owner-occupied units built prior to 1940 (age), which makes sense as old
industrial areas are usually polluted with these toxic fumes
o Weighted mean of distances to five Boston employment centres (dis) is strongly & negatively correlated
with nitrogen oxide concentration (nox) and with the proportion of owner-occupied units built prior to 1940 (age),
i.e. the older, more polluted areas tend to lie closer to the employment centres
o Median value of owner-occupied homes in $1000s (medv) is negatively correlated with the lower status of the
population (lstat), meaning suburbs with a larger lower-status population have lower-valued homes, which makes sense
c) Predictors associated with per capita crime (crim) as per Pearson coefficients (red: (-) corr.; green: (+) corr.)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
crim 1 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29 -0.39 0.46 -0.39
• Comments on observations and findings:
o Per capita crime rate (crim) seems to be strongly positively correlated with the accessibility to highways
(rad) & the tax rate per 10k (tax), which may be because areas close to highways are often isolated
and thus offer more opportunity for crime. Since (tax) is highly positively
correlated with (rad), it also shows up as a highly positively correlated variable
o Per capita crime rate (crim) seems to be positively correlated with nitrogen oxide concentration (nox) &
the proportion of non-retail business (indus), which suggests that these polluted
industrial areas are less safe
o Per capita crime rate (crim) seems to be highly positively correlated with the lower status of the population (lstat), which
makes sense as areas with a larger lower-status population are usually more prone to crime, particularly need-driven crime
o Per capita crime rate (crim) seems to be negatively correlated with the weighted mean of distances to five
Boston employment centres (dis) & the median value of owner-occupied homes in $1000s (medv), i.e.
crime tends to be lower in suburbs farther from the employment centres and in suburbs with higher home values.
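The ranking of predictors by their correlation with crim can be reproduced as sketched below. It assumes the MASS package (which provides the Boston data frame used in the R code at the end) is installed; the variable names crim_cor and ranked are introduced here for illustration.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation of crim with every other column, as a named vector
others   <- Boston[, names(Boston) != "crim"]
crim_cor <- cor(Boston$crim, others)[1, ]

# Strongest associations first, regardless of sign
ranked <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value puts rad and tax at the top, in line with the first comment above, and keeps the sign visible for the negatively correlated predictors.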
d) Suburbs with high crime rate, tax rate, pupil-teacher ratio
• Comments on observations and findings:
o Per capita crime rate (crim) seems to be right-skewed, with very few suburbs having crim>40; crim is mostly
between 0 & 20, with a mode near 0 holding about 400 suburbs (it might either mean that the crimes were not
recorded or that crime is actually low in Boston because of a high number of safe suburbs)
o Tax rate per 10k (tax) is mostly distributed between 0 and 500 (probably residential areas),
but there is a second mode at around 680, which is possibly due to industrial areas (which seem to be many)
o Pupil-teacher ratio (ptratio) seems to be a bit left-skewed but otherwise roughly normally distributed, with a mode at 21
e) Number of suburbs that bound the Charles River (chas=1): 35
f) Median pupil-teacher ratio (ptratio): 19.05
g) Suburb of Boston with minimum medv
• Row nbr = 399 (with medv=5)
• Values of other predictors for this suburb (#399)
ID crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
399 38.35 0.00 18.10 0.00 0.69 5.45 100.00 1.49 24.00 666.00 20.20 396.90 30.59 5.00
• Compare the predictor values for this suburb (#399) with range of all variables/predictors
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
min 0.01 0.00 0.46 0.00 0.39 3.56 2.90 1.13 1.00 187.00 12.60 0.32 1.73 5.00
max 88.98 100.00 27.74 1.00 0.87 8.78 100.00 12.13 24.00 711.00 22.00 396.90 37.97 50.00
Observations:
o zn of #399 = min(zn)
o chas of #399 = min(chas)
o age of #399 = max(age)
o rad of #399 = max(rad)
o tax of #399 = towards the higher end
o ptratio of #399 = towards the higher end
o black of #399 = max(black)
o lstat of #399 = near the higher end
h) Observations, findings and insights for the suburbs averaging more than 8 rooms per dwelling (rm>8 in the R code):
o All have a very low crime rate (near the lowest crim=0)
o All of them are located either away from the radial highways or in less accessible locations (at a rad>8)
o Most of them have a higher black population (black>350)
o Most of them have a low proportion of lower-status population (lstat<8)
o Most of them have high-value households (medv>35)
o Most of them have old buildings (age>60), except a few
o Most of them are located near city business centers (dis<4), except a few
o Most of them are not located in heavily industrial areas (indus<7 indicates a low share of non-retail business)
R-code)
rm(list=ls())
#-------Setting up library
getwd()
setwd("/Users/swapnilparkhe/Desktop/MSBA-UIC/IDS 575 - Biz Stats/Assignments")
dir()
############(Q1)############
#------Importing data---
train<-read.table("zip.train")
test<-read.table("zip.test")
#-----High level inspection of data (similar to what desired)
dim(train)
names(train)
str(train)
aggregate(train$V1, by=list(c(train$V1)), length)
dim(test)
names(test)
str(test)
aggregate(test$V1, by=list(c(test$V1)), length)
#------Modeling with Linear regression and KNN methods
###----JUST FOR PRACTICE---LM_1 (all rows G=0,1,2,3,...9) has high Err(train)=75%, hence the training and
###----test data are restricted to digits 2 & 3 below
#LM_1 is kept active because error_train_1 and error_test_1 are needed for Errors_train_test further down
train_1<-train
test_1<-test
LM_1<-lm(V1~., data=train_1)
pred_train_1<-round(LM_1$fitted.values)
error_train_1<-mean(pred_train_1!=train_1$V1)
pred_test_1<-round(predict.lm(LM_1, test_1))
error_test_1<-mean(pred_test_1!=test_1$V1)
###LM_2 (rows with G=2,3) has Err(train)=0.5%, Err(test)=4.1%
train_2<-subset(train, train$V1==2|train$V1==3)
test_2<-subset(test,test$V1==2|test$V1==3)
LM_2<-lm(V1~., data=train_2)
pred_train_2<-round(LM_2$fitted.values)
error_train_2<-mean(pred_train_2!=train_2$V1)
pred_test_2<-round(predict.lm(LM_2,test_2))
error_test_2<-mean(pred_test_2!=test_2$V1)
##KNN (with k=1,3,5,7,15)
#install.packages("class")
library(class)
k<-c(1,3,5,7,15)
error_train_3<-rep(NA,length(k))
for(x in seq_along(k)){
#Column 1 (V1) is the class label, so it is dropped from the feature matrices
pred_train_3<-knn(train_2[,-1],train_2[,-1],cl=train_2$V1,k=k[x])
error_train_3[x]<-mean(pred_train_3!=train_2$V1)
}
error_test_3<-rep(NA,length(k))
for(x in seq_along(k)){
#Column 1 (V1) is the class label, so it is dropped from the feature matrices
pred_test_3<-knn(train_2[,-1],test_2[,-1],cl=train_2$V1,k=k[x])
error_test_3[x]<-mean(pred_test_3!=test_2$V1)
}
Errors_train_test<-matrix(c(error_train_1, error_train_2, error_train_3, error_test_1, error_test_2, error_test_3),
ncol=2, nrow=7, byrow=FALSE)
colnames(Errors_train_test)<-c("Train-Error", "Test-Error")
row.names(Errors_train_test)<-c("LM_1 (using all G values)", "LM_2 (only using G=2,3)", paste("K-NN having K=",k,
"(only using G=2,3)"))
write.csv(Errors_train_test, "Q1_Errors.csv")
############(Q5)############
#------Importing the file
library(readxl)
auto<-read_excel("Assignment1_Auto_data.xls")
dim(auto)
#-----Removing missing value rows
auto<-na.omit(auto)
dim(auto)
#-----Inspecting data at higher level
View(auto)
names(auto)
str(auto)
#-----a-----
str(auto)
sapply(auto, class)
auto$origin<-factor(auto$origin)
#-----b----
quant<-sapply(auto, is.numeric)
quant
sapply(auto[,quant] , range)
#-----c-----
sapply(auto[,quant], mean)
sapply(auto[,quant], sd)
#-----d-----
auto1<-auto[-c(10:85),]
sapply(auto1[,quant] , range)
sapply(auto1[,quant], mean)
sapply(auto1[,quant], sd)
#-------e----
#install.packages('plyr')
#install.packages('psych')
#install.packages('PerformanceAnalytics')
library(plyr)
library(psych)
library(PerformanceAnalytics)
dim(auto)
multi.hist(auto[,-9])
chart.Correlation(auto[,-9], pch=21)
############(Q6)############
#------Importing the file
library(readxl)
auto<-read_excel("Assignment1_Auto_data.xls")
dim(auto)
#-----Inspecting data at higher level
View(auto)
names(auto)
str(auto)
#install.packages("pastecs")
library(pastecs)
stat.desc(auto)
#-------a--------
pairs(auto[, -9], pch=21)
#--------b--------
cor(auto[,-9])
write.csv(cor(auto[,-9]),"Auto_cor.csv")
#-------c-------
#--Removing missing value rows
auto<-na.omit(auto)
dim(auto)
lm_auto<-lm(mpg~.-name,auto)
summary(lm_auto)
#--------d------
par(mfrow=c(2,2))
plot(lm_auto)
#-------e-----
#All predictors and predictor pairs at a time
lm_auto_1<-lm(mpg~.^2 , auto[,-9])
summary(lm_auto_1)
#Taking interactions between previously significant and previously insignificant predictors
lm_auto_2<-lm(mpg~cylinders:displacement + cylinders:weight +
horsepower*displacement + horsepower:weight +
displacement:weight, auto[,-9])
summary(lm_auto_2)
#Taking all predictors along with interactions among previously insignificant predictors
lm_auto_3<-lm(mpg~.+cylinders:horsepower + cylinders:acceleration + horsepower:acceleration , auto[,-9])
summary(lm_auto_3)
#Taking only highly correlated predictor pairs
lm_auto_4<-lm(mpg ~ cylinders*displacement+displacement*weight, data = auto[, -9])
summary(lm_auto_4)
#---------JUST FOR PRACTICE-----All predictors plus one predictor pair at a time
#lm_auto_5_a<-lm(mpg~.+cylinders:displacement,auto[,-9])
#lm_auto_5_b<-lm(mpg~.+horsepower*weight,auto[,-9])
#lm_auto_5_c<-lm(mpg~.+acceleration:year,auto[,-9])
#summary(lm_auto_5_a)
#summary(lm_auto_5_b)
#summary(lm_auto_5_c)
#--------f-------
#Transformation-Using some predictors that do not have a linear relationship with response (horsepower and weight)
mpg_pred<-data.frame(mpg=auto$mpg, hp=auto$horsepower, wt=auto$weight)
chart.Correlation(mpg_pred)
mpg_transf_pred<-data.frame(mpg=auto$mpg,
hp_log=log(auto$horsepower), hp_sqrt=sqrt(auto$horsepower), hp_sq=(auto$horsepower)^2,
wt_log=log(auto$weight), wt_sqrt=sqrt(auto$weight),wt_sq=(auto$weight)^2
)
chart.Correlation(mpg_transf_pred)
############(Extra Q)############
#--------a-------
library(MASS)
Boston
?Boston
dim(Boston)
#---------b-------
pairs(Boston)
par(mfrow=c(2,3))
plot(Boston$rad, Boston$tax)
plot(Boston$indus, Boston$nox)
plot(Boston$age, Boston$nox)
plot(Boston$nox, Boston$dis)
plot(Boston$age, Boston$dis)
plot(Boston$medv, Boston$lstat)
#---------c-------
write.csv(cor(Boston$crim, Boston), "crim_corr.csv")
#---------d--------
par(mfrow=c(1,3))
hist(Boston$crim, breaks=30)
hist(Boston$tax, breaks=30)
hist(Boston$ptratio, breaks=30)
#---------e------
sum(Boston$chas==1)
#---------f-----
median(Boston$ptratio)
#---------g-----
which.min(Boston$medv)
Boston[which.min(Boston$medv),14]
Boston[which.min(Boston$medv),]
sapply(Boston, range)
#--------h------
summary(Boston$rm>7)
summary(Boston$rm>8)
x<-which(Boston$rm>8)
Boston[which(Boston$rm>8),]
par(mfrow=c(5,3))
for (i in 1:ncol(Boston)){
hist(Boston[, i], main=colnames(Boston)[i], breaks="FD")
abline(v=Boston[x, i], col="red", lwd=1)
}