
Swapnil Shashank Parkhe (UIN-660014865) Assignment 1 (All R-codes are pasted at end)



Ans-1) Comparing the performance of classifiers on zip-code data

Classifier type                                         Train-Error   Test-Error
Linear_Model_2 (only using G=2,3)                       0.00576       0.04121
K-NN having K=1  (only using G=2,3)                     0.00000       0.02473
K-NN having K=3  (only using G=2,3)                     0.00432       0.03022
K-NN having K=5  (only using G=2,3)                     0.00576       0.03022
K-NN having K=7  (only using G=2,3)                     0.00576       0.03022
K-NN having K=15 (only using G=2,3)                     0.00936       0.03846
Linear_Model_1 (using all G values – just checking)     0.75175       0.77479




Ans-2) Self-study Question (Practiced in R)



Ans-3) Model flexibility is roughly proportional to the degrees of freedom in the model. It increases as we incorporate non-linearity and/or more predictors into the model.

Advantages of a very flexible (complex) approach versus a less flexible (simple) one (with circumstances):
• In real-world scenarios, the true relationship (f) between inputs and outputs is usually non-linear. A more flexible model can trace many of the points in the training data, resulting in low bias in the predicted values
• Circumstances (when it is better to use a more flexible/complex model instead of a simple one) –
o If the relationship between inputs and outputs is non-linear and the error variance is close to zero, we can use a flexible model: its low bias combined with the low error variance (we need not worry even if the flexible model fits these low-varying errors) can give more accurate predictions
o Large sample with few predictors – we should reduce the model’s bias by increasing its complexity (non-linearity here). Its higher variance can then be controlled by exploiting the large sample through cross-validation (if we have information about the distribution of the inputs and outputs, we can choose between flexible and inflexible more confidently)

Disadvantages of a very flexible (complex) approach versus a less flexible (simple) one (with circumstances):
• If the data being studied contains a lot of noise (because of outliers, etc.), a highly flexible model will follow the training data, noise included, too closely, leading to higher variance in predictions
• It leads to over-fitting of the training data and a high overall prediction error on test/unseen data
• Fitting a flexible model requires estimating a large number of parameters or non-linear functions (time and cost)
• Flexible models are not very interpretable, as the relationship between each predictor and the response is curvilinear
• Circumstances (when it is better to use a less flexible/simple model instead of a more complex one); a small simulated sketch after this list illustrates the trade-off –
o The relationship between predictors and response is non-linear but Var(error) is large; an inflexible model is preferred, otherwise the flexible model will fit the (noisy) errors as well
o Many predictors but a small sample – we should reduce the model’s variance by reducing its complexity. Since we do not have enough data to train and validate the model through cross-validation (which could have reduced the variance), we prefer simple models to infer patterns in the data
o When the goal is to infer the relationship between predictors and response, rather than prediction
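To make the trade-off above concrete, here is a minimal simulated R sketch (not part of the R appendix at the end; the sine curve, noise level, and 15-degree polynomial are illustrative assumptions only): a very flexible fit beats a simple one on the training data but typically loses on unseen data.

# Simulate noisy data from a non-linear true function
set.seed(1)
x_train <- runif(30, 0, 3)
y_train <- sin(x_train) + rnorm(30, sd = 0.4)
x_test  <- runif(200, 0, 3)
y_test  <- sin(x_test) + rnorm(200, sd = 0.4)

simple   <- lm(y ~ x,           data = data.frame(x = x_train, y = y_train))   # inflexible fit
flexible <- lm(y ~ poly(x, 15), data = data.frame(x = x_train, y = y_train))   # very flexible fit

mse <- function(fit, x, y) mean((y - predict(fit, data.frame(x = x)))^2)
round(c(simple_train   = mse(simple,   x_train, y_train),
        flexible_train = mse(flexible, x_train, y_train),   # flexible wins on training data
        simple_test    = mse(simple,   x_test,  y_test),
        flexible_test  = mse(flexible, x_test,  y_test)), 3) # but usually loses on unseen data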



Ans-4) K-NN method

a) 3-D Euclidean distance (between any point X and Y) = sqrt{[(X1-Y1)^2]+[(X2-Y2)^2]+[(X3-Y3)^2]}
Where point X is (X1, X2, X3) and point Y is (Y1, Y2, Y3) in 3-dimensional space

For our problem, we calculate the Euclidean distance (grey column) between the test point (X1=0, X2=0, X3=0) and every point in the training data, one per obs/row in the table below:

Training data with distances                       Sorted on distance
Obs  X1  X2  X3  Y      Distance                   Obs  X1  X2  X3  Y      Distance
1    0   3   0   Red    3                          5    -1  0   1   Green  1.414213562
2    2   0   0   Red    2                          6    1   1   1   Red    1.732050808
3    0   1   3   Red    3.16227766                 2    2   0   0   Red    2
4    0   1   2   Green  2.236067977                4    0   1   2   Green  2.236067977
5    -1  0   1   Green  1.414213562                1    0   3   0   Red    3
6    1   1   1   Red    1.732050808                3    0   1   3   Red    3.16227766
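The table above can be reproduced with a few lines of R; this is a small illustrative sketch for the toy data in the question (it is not part of the R appendix at the end):

# Toy training data from the question and the test point (0,0,0)
toy <- data.frame(X1 = c(0, 2, 0, 0, -1, 1),
                  X2 = c(3, 0, 1, 1, 0, 1),
                  X3 = c(0, 0, 3, 2, 1, 1),
                  Y  = c("Red", "Red", "Red", "Green", "Green", "Red"))
test_pt <- c(0, 0, 0)

# Euclidean distance of every training observation from the test point
toy$Distance <- sqrt((toy$X1 - test_pt[1])^2 +
                     (toy$X2 - test_pt[2])^2 +
                     (toy$X3 - test_pt[3])^2)
toy[order(toy$Distance), ]          # the sorted table above

# Majority vote among the K nearest neighbours
knn_vote <- function(k) names(which.max(table(toy$Y[order(toy$Distance)][1:k])))
knn_vote(1)   # "Green" (part b)
knn_vote(3)   # "Red"   (part c)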

b) GREEN, for the following reason:

K=1 means we find only the single nearest neighbor. From the table(s) above, Obs = 5 is the nearest one (at the minimum distance = 1.41) to the test point (X1=X2=X3=0). It is therefore the only observation from the training data compared against the test point, so the color of this nearest neighbor (GREEN) determines the predicted color of the test point (X1=X2=X3=0)


c) RED, for the following reason:
K=3 means we find the 3 nearest neighbors. From the table(s) above, Obs = 5, 6, 2 are the 3 nearest neighbors of the test point (in terms of distance), so these are the 3 observations from the training data compared against the test point. These three observations (5, 6, 2) are Green, Red, and Red respectively; since the majority are Red (2 out of 3), the test point is classified as Red, and hence the prediction is RED


d) If the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small, because as K increases the KNN boundary becomes less flexible (smoother, closer to linear). Since the data here is relatively irregular/non-linear, a lower K can fit it better (although not as low as K=1, which would lead to over-fitting)
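As an illustration of this point, the following simulated sketch (the data-generating process and the K grid are assumptions for illustration only) estimates KNN test error for several K when the true boundary is highly non-linear; small-to-moderate K tends to track such a boundary, while very large K over-smooths it:

# Simulated two-class data with a wiggly (highly non-linear) Bayes boundary
library(class)
set.seed(1)
n   <- 400
x1  <- runif(n, -2, 2); x2 <- runif(n, -2, 2)
cls <- factor(ifelse(x2 > sin(3 * x1), "A", "B"))
tr  <- sample(n, 200)                       # half for training, half for testing
X   <- cbind(x1, x2)

# Test misclassification rate for a grid of K values
sapply(c(1, 5, 25, 101, 199), function(k)
  mean(knn(X[tr, ], X[-tr, ], cls[tr], k = k) != cls[-tr]))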

Ans-5) Please find the answers to each part below:



• Quantitative predictors:
• Mpg
• Cylinders
• Displacement
• Horsepower
• Weight
• Acceleration
• Year

• Qualitative predictors:
• Name




Note:
Depending on the purpose of the model:
• Cylinders – could be treated either as a quantitative or as a factor variable
• Origin – a factor variable with 3 levels; its treatment could be refined after more analysis based on the Origin and Name variables

• Range of quantitative variables


Variable   Mpg     Cylinders  Displacement  Horsepower  Weight    Acceleration  Year
Min        9       3          68            46          1613      8             70
Max        46.6    8          455           230         5140      24.8          82
Range      37.6    5          387           184         3527      16.8          12

• Mean and standard deviation of each variable on original auto data


Variable   Mpg     Cylinders  Displacement  Horsepower  Weight    Acceleration  Year
Mean       23.44   5.47       194.41        104.47      2977.58   15.54         75.98
Std        7.8     1.7        104.64        38.49       849.4     2.76          3.68



• Summary statistics (range, mean, and standard deviation) of each variable on the subsetted auto data (10th to 85th observations removed)

Variable   Mpg     Cylinders  Displacement  Horsepower  Weight    Acceleration  Year
Min        11      3          68            46          1649      8.5           70
Max        46.6    8          455           230         4997      24.8          82
Range      35.6    5          387           184         3348      16.3          12
Mean       24.4    5.37       187.24        100.72      2935.97   15.73         77.15
Std        7.87    1.65       99.68         35.71       811.3     2.69          3.11



• Graphical analysis of variables (even though origin is a factor variable, we have used it here as a numeric one)

• Histogram, Density, & Normal Fit for all numeric variables (for univariate visual inspection)




• Mixed plot using the library: PerformanceAnalytics


Note:
i. The diagonal shows histograms and density plots for the individual variables
ii. Below the diagonal (lower-left) are scatterplots and trend lines between pairs of variables
iii. Above the diagonal (upper-right) are Pearson correlations between pairs of variables (text size scales with the magnitude of the correlation)

• Comments on observations and findings:
o Mpg seems to vary strongly and inversely with Displacement, Horsepower, and Weight
o High values of mpg seem to be attained with 4 cylinders; beyond 4 cylinders, mpg seems to be negatively correlated with the number of cylinders
o Displacement, Weight, Horsepower, and Cylinders seem to be highly linearly correlated with each other
o Acceleration seems to have an inverse relationship with horsepower and displacement


• Yes; as per the above plots (especially the mixed plot from the PerformanceAnalytics library), we can see:
• Mpg has a strong inverse linear relationship with cylinders, displacement, horsepower, and weight. Since displacement, horsepower, and weight are highly correlated with each other, horsepower alone could be used and the others dropped
• Mpg has a decent-to-good positive relationship with year and origin
















Ans-6) Auto Dataset (Chap-3)

a) Scatterplot for all numeric variables (Bi-variate visual inspection)




b) Correlation matrix using cor()

Variable       mpg    cylinders  displacement  horsepower  weight  acceleration  year   origin
mpg            1      -0.78      -0.80         NA          -0.83   0.42          0.58   0.56
cylinders      -0.78  1          0.95          NA          0.90    -0.50         -0.35  -0.56
displacement   -0.80  0.95       1             NA          0.93    -0.54         -0.37  -0.61
horsepower     NA     NA         NA            1           NA      NA            NA     NA
weight         -0.83  0.90       0.93          NA          1       -0.42         -0.31  -0.58
acceleration   0.42   -0.50      -0.54         NA          -0.42   1             0.28   0.21
year           0.58   -0.35      -0.37         NA          -0.31   0.28          1      0.18
origin         0.56   -0.56      -0.61         NA          -0.58   0.21          0.18   1

Key for the above matrix:
• Correlation values are Pearson coefficients
• NA – ‘horsepower’ contains missing values, hence its correlations with the other variables show as NA
• The darker the shade of green, the stronger the positive linear relationship
• The darker the shade of red, the stronger the negative linear relationship
• Colors vary from red to yellow to green: negative, to neutral/none, to positive linear relationship, respectively
Note: We could remove the rows containing missing values (there are very few) and re-evaluate the correlation matrix, as sketched below
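Alternatively, cor() can be told to use only complete cases; a minimal sketch, assuming the auto data frame as read in the code appendix (name in column 9, all other columns numeric):

# Pearson correlations computed only from rows with no missing values
round(cor(auto[, -9], use = "complete.obs"), 2)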


c) Multiple linear regression using mpg as the response variable and all others except name as predictors.
Note: We removed the rows containing missing values (merely 5 rows) for the further analysis

i. The null hypothesis (that there is no relationship between mpg and the predictors) is rejected by the very low p-value (<< 0.05) of the F-test, so there is a statistically significant relationship



ii. Based on the t-tests (p-value <= 0.05), there is a statistically significant relationship between the response mpg and certain predictors: displacement, weight, year, and origin

iii. The coefficient of the year variable (= 0.75) means that if year increases by one unit (say from model year 75 to 76), the predicted mpg increases by about 0.75, holding the other predictors fixed; in other words, cars tend to become more fuel-efficient over the years

d) Checking the Linear Model fit with our data: Diagnostic plots



Observations/Comments along with Findings/Insight:
• Residuals vs. Fitted: (roughly tells us about the shape of the underlying true function and the spread of the data)
The residuals show a mildly non-linear, U-shaped pattern w.r.t. the fitted values. This suggests some underlying non-linearity, probably quadratic, in the true function describing our training data, because:
o The low range of fitted values has slightly positive residuals
o The mid-range has residuals around zero to slightly negative
o The higher range of fitted values again has positive residuals
This means our model is not capturing the non-linearity in the data. The residual spread also seems to increase with the fitted values, meaning the data spreads out at higher values.
• Normal Q-Q plot: (roughly conveys the skewness and the presence of outliers in the data)
Most of the residuals lie close to the Q-Q line and hence seem approximately normally distributed, except at the right tail, where there appears to be some outliers and skewness in our data
• Scale-Location: (roughly tells us about the nature of the data’s spread – e.g., a low-to-high spread shows an increasing slope)
The residuals seem to have increasing variance w.r.t. the fitted values – fairly clustered across most of the range but spreading out at higher values – which violates the constant-error-variance assumption
• Residuals vs. Leverage: (roughly tells us about the tightness of the data, along with leverage points and outliers)
There appear to be some outliers, such as observations 325, 321, and 324, with |standardized residual| > 2, along with one high-leverage point (observation 14, the point farthest out in the predictor space). Both the outliers and the leverage point could adversely influence our model.

Note: We ought to treat the outliers and transform the data in order to satisfy the assumptions of linear regression (being a parametric supervised learning method, it has assumptions that must be satisfied)



e) Interaction effects
Depending on which combination of variables is allowed to interact in the linear regression, different statistically significant interaction terms appear. For example:
• Allow all variables to interact, below are the significant pairs:
o Acceleration-origin
o Displacement-year
o Acceleration-year
• Allow originally insignificant variables to interact with significant ones, below are the significant pairs
o Displacement-horsepower
• Allow originally insignificant variables to interact with each other, below are the significant pairs
o Cylinders-horsepower
o Horsepower-acceleration
o Cylinders-acceleration

f) Transforming some variables which are non-linearly related with mpg using log(X), sqrt(X), and X^2
• Mixed plot (after and before)




Observations and Insights

• log(X) – makes right-skewed data less skewed
o For horsepower – from right-skewed to less skewed
• sqrt(X) – makes right-skewed data less skewed and, unlike log, can be applied to zero values (but it is a weaker correction)
o For horsepower – has not made much impact
• X^2 – makes left-skewed data less skewed
o For horsepower – has not made much impact


Other comments:
• It is helpful to transform predictors that have a non-linear relationship with the response using the above methods
• The transformation method should be chosen based on the variable’s own distribution and the requirements of the model or analysis; for linear regression, a linear relationship between predictors and response is required
• After transformation, horsepower and weight vary much more linearly with mpg than before, as the quick check below also suggests
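As a quick check of that last point, a minimal sketch (assuming the cleaned auto data frame from the appendix, with the missing-value rows already removed) compares a straight-line fit of mpg on horsepower before and after a log transform:

fit_raw <- lm(mpg ~ horsepower,      data = auto)   # untransformed predictor
fit_log <- lm(mpg ~ log(horsepower), data = auto)   # log-transformed predictor
c(R2_raw = summary(fit_raw)$r.squared,
  R2_log = summary(fit_log)$r.squared)              # higher R^2 => more linear relationship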




Extra Q) Boston Data

a) Boston data is about Housing Values in Suburbs of Boston


Item      Count  Description
Rows      506    Observations, where each one corresponds to a suburb of Boston
Columns   14     Variables relating to the characteristics or attributes of the home and its surroundings


b) Pairwise scatterplot (using the ‘PerformanceAnalytics’ library – scatterplots & trends + histograms + Pearson correlations)



Top 3 positively (first row) and negatively (second row) linearly correlated variables






• Comments on observations and findings:
o Accessibility to radial highways (rad) is strongly and positively correlated with the full-value property tax rate per \$10k (tax), which makes sense as the closer a property is to highways, the higher its value and hence its tax should be
o Nitrogen oxide concentration (nox) is highly and positively correlated with the proportion of non-retail business acres (indus) and the proportion of owner-occupied units built prior to 1940 (age), respectively, which makes sense as old industrial areas are usually polluted with these toxic fumes
o The weighted mean of distances to five Boston employment centres (dis) is strongly and negatively correlated with nitrogen oxide concentration (nox) and the proportion of owner-occupied units built prior to 1940 (age), which makes sense as the older industrial areas tend to be situated away from the city centers
o The median value of owner-occupied homes in \$1000s (medv) is negatively correlated with the proportion of lower-status population (lstat), meaning lower-status populations do not tend to have high-value homes, which makes sense

c) Predictors associated with per capita crime (crim) as per Pearson coefficients (red: (-) corr.; green: (+) corr.)

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
crim 1 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29 -0.39 0.46 -0.39
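The row above can be reproduced, and ranked, directly from the MASS Boston data; a minimal sketch (the sort is added here only for readability and is not in the appendix code):

library(MASS)
# Correlation of every variable with per capita crime rate, strongest positive first
sort(cor(Boston)["crim", ], decreasing = TRUE)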

• Comments on observations and findings:
o Per capita crime rate (crim) seems strongly positively correlated with accessibility to radial highways (rad) and with the tax rate per \$10k (tax). This makes sense, as areas close to highways are often isolated and remote, and thus offer more potential for crime; and since tax is itself highly positively correlated with rad, it also shows up as highly positively correlated with crim
o Per capita crime rate (crim) seems positively correlated with nitrogen oxide concentration (nox) and the proportion of non-retail business acres (indus), which makes sense as these nitrogen-oxide-heavy industrial areas are away from the city centers and hence less safe
o Per capita crime rate (crim) seems highly positively correlated with the proportion of lower-status population (lstat), which makes sense as areas with a lower-status population are usually more prone to need-based crime
o Per capita crime rate (crim) seems negatively correlated with the weighted mean of distances to five Boston employment centres (dis) and with the median value of owner-occupied homes in \$1000s (medv), which makes sense as these are city residential areas that are usually under police patrol, and hence safer


d) Suburbs with high crime rate, tax rate, and pupil-teacher ratio



• Comments on observations and findings:
o Per capita crime rate (crim) seems right-skewed, with very few suburbs having crim > 40; crim is mostly between 0 and 20, with a mode of frequency ~400 near 0 (which might mean either that crimes were not recorded, or that crime is actually low in Boston because of a high number of safe suburbs)
o Tax rate per \$10k (tax) is mostly distributed between 0 and 500 (probably residential areas), but there is a second mode at around 680, possibly due to industrial areas (of which there seem to be many)
o The pupil-teacher ratio seems slightly left-skewed but otherwise roughly normally distributed, with a mode at 21


e) Number of suburbs bounding the Charles river = 35


f) Median pupil-teacher ratio = 19.05


g) Suburb of Boston with minimum medv
• Row nbr = 399 (with medv=5)

• Values of other predictors for this suburb (#399)

ID crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
399 38.35 0.00 18.10 0.00 0.69 5.45 100.00 1.49 24.00 666.00 20.20 396.90 30.59 5.00


• Compare the predictor values for this suburb (#399) with range of all variables/predictors

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
min 0.01 0.00 0.46 0.00 0.39 3.56 2.90 1.13 1.00 187.00 12.60 0.32 1.73 5.00
max 88.98 100.00 27.74 1.00 0.87 8.78 100.00 12.13 24.00 711.00 22.00 396.90 37.97 50.00

Observations:
o zn of #399 = min(Zn)
o chas of #399 = min(Chas)
o age of #399 = max(age)
o rad of #399 = max(rad)
o tax of #399 = towards the higher end
o ptratio of #399 = towards the higher end
o black of #399 = max(black)
o lstat of #399 = near the higher end

Findings and Insights:

This suburb (#399) does not bound the Charles river (chas = 0), has zero proportion of residential land zoned for lots over 25,000 sq.ft (the minimum in Boston), has the oldest owner-occupied units (age = 100), and has the highest index of accessibility to radial highways (rad = 24), with its tax rate and pupil-teacher ratio towards the higher end. All these factors point towards its minimum median value of owner-occupied homes (in \$1000s). Further, this area has a high value of the black variable and a high proportion of lower-status population.


h) Suburbs and rooms per dwelling
• Suburbs with more than 7 rooms per dwelling = 64
• Suburbs with more than 8 rooms per dwelling = 13
• Suburbs with more than 8 rooms per dwelling have certain patterns for the other predictors as shown below:








Observations, Findings and Insights for all these suburbs:
o All have a very low crime rate (near the minimum, crim ≈ 0)
o All of them have a high index of accessibility to radial highways (rad > 8)
o Most of them have a high value of the black variable (black > 350)
o Most of them have a low proportion of lower-status population (lstat < 8)
o Most of them have high-value homes (medv > 35)
o Most of them have old owner-occupied units (age > 60), with a few exceptions
o Most of them are located near the city’s employment centres (dis < 4), with a few exceptions
o Most of them are in areas with a low proportion of non-retail business acres (indus < 7), i.e., not heavily industrial areas


















R-code on next page










R-code)

rm(list=ls())

#-------Setting up library
getwd()
setwd("/Users/swapnilparkhe/Desktop/MSBA-UIC/IDS 575 - Biz Stats/Assignments")
dir()

############(Q1)############

#------Importing data---
train<-read.table("zip.train")
test<-read.table("zip.test")

#-----High level inspection of data (similar to what desired)
dim(train)
names(train)
str(train)
aggregate(train$V1, by=list(c(train$V1)), length)

dim(test)
names(test)
str(test)
aggregate(test$V1, by=list(c(test$V1)), length)

#------Modeling with Linear regression and KNN methods
###----JUST FOR PRACTICE---LM_1 (all rows, G=0,1,2,...,9) has high Err(train)=75%, hence
### segregating the data to G=2,3 only, separately for the training and test sets
#train_1<-train
#test_1<-test
#LM_1<-lm(V1~., data=train_1)
#pred_train_1<-round(LM_1$fitted.values)
#error_train_1<-mean(pred_train_1!=train_1$V1)
#pred_test_1<-round(predict.lm(LM_1, test_1))
#error_test_1<-mean(pred_test_1!=test_1$V1)

###LM_2 (rows with G=2,3) has Err(train)=0.5%, Err(test)=4.1%
train_2<-subset(train, train$V1==2|train$V1==3)
test_2<-subset(test,test$V1==2|test$V1==3)

LM_2<-lm(V1~., data=train_2)

pred_train_2<-round(LM_2$fitted.values)
error_train_2<-mean(pred_train_2!=train_2$V1)

pred_test_2<-round(predict.lm(LM_2,test_2))
error_test_2<-mean(pred_test_2!=test_2$V1)

##KNN (with k=1,3,5,7,15)
#install.packages("class")
library(class)

k<-c(1,3,5,7,15)
error_train_3<-rep(NA,length(k))
for(x in 1:length(k)){
pred_train_3<-knn(train_2,train_2,cl=train_2$V1,k[x])
error_train_3[x]<-mean(pred_train_3!=train_2$V1)
}

k<-c(1,3,5,7,15)


error_test_3<-rep(NA,length(k))
for(x in 1:length(k)){
pred_test_3<-knn(train_2,test_2,cl=train_2$V1,k[x])
error_test_3[x]<-mean(pred_test_3!=test_2$V1)
}

# Note: the LM_1 block above must be run (uncommented) first, since error_train_1 and
# error_test_1 are used below
Errors_train_test<-matrix(c(error_train_1, error_train_2, error_train_3,
                            error_test_1, error_test_2, error_test_3),
                          ncol=2, nrow=7, byrow=FALSE)
colnames(Errors_train_test)<-c("Train-Error", "Test-Error")
row.names(Errors_train_test)<-c("LM_1 (using all G values)", "LM_2 (only using G=2,3)",
                                paste("K-NN having K=", k, "(only using G=2,3)"))
write.csv(Errors_train_test, "Q1_Errors.csv")


############(Q5)############

#------Importing the file
library(readxl)
auto<-read_excel("Assignment1_Auto_data.xls")
dim(auto)

#-----Removing missing value rows
auto<-na.omit(auto)
dim(auto)

#-----Inspecting data at higher level
View(auto)
names(auto)
str(auto)

#-----a-----
str(auto)
sapply(auto, class)
auto$origin<-factor(auto$origin)

#-----b----
quant<-sapply(auto, is.numeric)
quant

sapply(auto[,quant] , range)

#-----c-----
sapply(auto[,quant], mean)
sapply(auto[,quant], sd)


#-----d-----
auto1<-auto[-c(10:85),]

sapply(auto1[,quant] , range)
sapply(auto1[,quant], mean)
sapply(auto1[,quant], sd)

#-------e----
#install.packages('plyr')
#install.packages('psych')
#install.packages('PerformanceAnalytics')
library(plyr)
library(psych)
library(PerformanceAnalytics)



dim(auto)
multi.hist(auto[,-9])
chart.Correlation(auto[,-9], pch=21)

############(Q6)############

#------Importing the file
library(readxl)
auto<-read_excel("Assignment1_Auto_data.xls")
dim(auto)

#-----Inspecting data at higher level
View(auto)
names(auto)
str(auto)

#install.packages("pastecs")
library(pastecs)
stat.desc(auto)

#-------a--------
pairs(auto[, -9], pch=21)

#--------b--------
cor(auto[,-9])
write.csv(cor(auto[,-9]),"Auto_cor.csv")

#-------c-------
#--Removing missing value rows
auto<-na.omit(auto)
dim(auto)

lm_auto<-lm(mpg~.-name,auto)
summary(lm_auto)


#--------d------
par(mfrow=c(2,2))
plot(lm_auto)


#-------e-----
#All predictors and predictor pairs at a time
lm_auto_1<-lm(mpg~.^2 , auto[,-9])
summary(lm_auto_1)

#Taking interactions involving previously significant and insignificant predictors
lm_auto_2<-lm(mpg~cylinders:displacement + cylinders:weight +
horsepower*displacement + horsepower:weight +
displacement:weight, auto[,-9])
summary(lm_auto_2)

#Taking all predictors along with interactions among previously insignificant predictors
lm_auto_3<-lm(mpg~.+cylinders:horsepower + cylinders:acceleration + horsepower:acceleration , auto[,-9])
summary(lm_auto_3)

#Taking only highly correlated predictor pairs
lm_auto_4<-lm(mpg ~ cylinders*displacement+displacement*weight, data = auto[, -9])
summary(lm_auto_4)



#---------JUST FOR PRACTICE-----Trying individual predictor pairs one at a time
#lm_auto_5_a<-lm(mpg~.+cylinders:displacement,auto[,-9])
#lm_auto_5_b<-lm(mpg~.+horsepower*weight,auto[,-9])
#lm_auto_5_c<-lm(mpg~.+acceleration:year,auto[,-9])
#summary(lm_auto_5_a)
#summary(lm_auto_5_b)
#summary(lm_auto_5_c)

#--------f-------
#Transformation-Using some predictors that do not have a linear relationship with response (horsepower and weight)
mpg_pred<-data.frame(mpg=auto$mpg, hp=auto$horsepower, wt=auto$weight)
chart.Correlation(mpg_pred)

mpg_transf_pred<-data.frame(mpg=auto$mpg,
hp_log=log(auto$horsepower), hp_sqrt=sqrt(auto$horsepower), hp_sq=(auto$horsepower)^2,
wt_log=log(auto$weight), wt_sqrt=sqrt(auto$weight),wt_sq=(auto$weight)^2
)
chart.Correlation(mpg_transf_pred)


############(Extra Q)############
#--------a-------
library(MASS)
Boston
?Boston

dim(Boston)

#---------b-------
pairs(Boston)

par(mfrow=c(2,3))
plot(Boston$rad, Boston$tax)
plot(Boston$indus, Boston$nox)
plot(Boston$age, Boston$nox)
plot(Boston$nox, Boston$dis)
plot(Boston$age, Boston$dis)
plot(Boston$medv, Boston$lstat)

#---------c-------
write.csv(cor(Boston$crim, Boston), "crim_corr.csv")

#---------d--------
par(mfrow=c(1,3))
hist(Boston$crim, breaks=30)
hist(Boston$tax, breaks=30)
hist(Boston$ptratio, breaks=30)

#---------e------
sum(Boston$chas==1)

#---------f-----
median(Boston$ptratio)

#---------g-----
which.min(Boston$medv)
Boston[which.min(Boston$medv),14]
Boston[which.min(Boston$medv),]
sapply(Boston, range)



#--------h------
summary(Boston$rm>7)
summary(Boston$rm>8)

x<-which(Boston$rm>8)
Boston[which(Boston$rm>8),]
par(mfrow=c(5,3))
for (i in 1:ncol(Boston)){
hist(Boston[, i], main=colnames(Boston)[i], breaks="FD")
abline(v=Boston[x, i], col="red", lwd=1)
}
