
Regression Analysis, Tools and Techniques

Muhammad Shahid Aziz 17-MS-Pt-AMD-10


Mujahid Iqbal 17-MS-Pt-AMD-15
Contents
What is Regression?
Why do we use regression?
Regression Techniques
Regression Tools
Regression
Regression is the attempt to explain the variation in a dependent
variable using the variation in independent variables.
If the independent variable(s) sufficiently explain the variation in the
dependent variable, the model can be used for prediction.
[Figure: scatter of the dependent variable (y) against the independent variable (x)]
Why do we use Regression?
It indicates the significant relationships between the dependent variable and the independent variables.
It indicates the strength of the impact of multiple independent variables on a dependent variable.
Regression is used for forecasting, time-series modelling, and finding the cause-and-effect relationship between variables.
Regression Techniques

Regression techniques differ along three dimensions:
Number of independent variables
Shape of the regression line
Type of dependent variable


Regression Techniques
Linear Regression
Polynomial Regression
Logistic Regression
Stepwise Regression
Ridge Regression
Lasso Regression
ElasticNet Regression
Simple linear regression
The output of a regression is a function that predicts the dependent
variable based upon values of the independent variables.
Simple regression fits a straight line to the data:

y = b₀ + b₁x + ε

where b₀ is the y-intercept, b₁ = Δy/Δx is the slope, and ε is the error term.

[Figure: a fitted straight line on a scatter of the dependent variable (y) against the independent variable (x)]
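To make this concrete, here is a minimal sketch (not from the original slides) of fitting such a line in Python; the data points are invented for illustration:

```python
# A minimal sketch of fitting y = b0 + b1*x by least squares; data are invented.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # dependent variable

result = stats.linregress(x, y)
print(f"slope b1 = {result.slope:.3f}, intercept b0 = {result.intercept:.3f}")
```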
Simple linear regression
The function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ.

[Figure: observed values y and predictions ŷ plotted against the independent variable (x)]
Simple linear regression
For each observation, the variation can be described as:
y = ŷ + ε
(Actual = Explained + Error)

[Figure: an observation y, its prediction ŷ, and the prediction error ε]
Least squares regression
A least squares regression selects the line with the lowest total sum of
squared prediction errors.
This value is called the Sum of Squares of Error, or SSE.

Dependent variable

Independent variable (x)


Calculating SSR
The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the observed values, ȳ.

[Figure: the gaps between each prediction ŷ and the mean ȳ on a scatter of y against x]
Regression Formulas
The Total Sum of Squares (SST) is equal to SSR + SSE.
Mathematically,
SSR = Σ(ŷ − ȳ)²  (measure of explained variation)
SSE = Σ(y − ŷ)²  (measure of unexplained variation)
SST = SSR + SSE = Σ(y − ȳ)²  (measure of total variation in y)
Coefficient of Determination
The proportion of the total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, often referred to as R².

R² = SSR / SST = SSR / (SSR + SSE)

The value of R² can range between 0 and 1; the higher its value, the more accurate the regression model is. It is often expressed as a percentage.
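As a sketch (again with invented data), the decomposition and R² can be verified numerically:

```python
# Sketch: decomposing total variation into SSR + SSE and computing R^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)        # least-squares slope and intercept
y_hat = b0 + b1 * x                 # predictions
y_bar = y.mean()                    # mean of the observations

ssr = np.sum((y_hat - y_bar) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)      # unexplained variation
sst = np.sum((y - y_bar) ** 2)      # total variation

print(f"SSR + SSE = {ssr + sse:.4f}, SST = {sst:.4f}")  # the two should match
print(f"R^2 = {ssr / sst:.4f}")
```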
Standard Error
The Standard Error of a regression is a measure of its variability. It can be used in a similar manner to standard deviation, allowing for prediction intervals.
y ± 2 standard errors gives approximately a 95% prediction interval, and y ± 3 standard errors gives approximately a 99% interval.
The Standard Error is calculated by taking the square root of the average squared prediction error:

standard error = √( SSE / (n − k) )

where n is the number of observations in the sample and k is the number of estimated parameters in the model.
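Using the same invented data as the earlier sketches, the formula translates directly to code (here k = 2, counting the intercept and the slope):

```python
# Sketch: standard error of the regression, sqrt(SSE / (n - k)); data invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)

n, k = len(x), 2                     # k = number of estimated parameters
print(f"standard error = {np.sqrt(sse / (n - k)):.4f}")
```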
The output of a simple regression is the coefficient β and the constant
A.
The equation is then:
y = A + βx + ε
where ε is the residual error.
β is the per unit change in the dependent variable for each unit change
in the independent variable.
Mathematically:

β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

A = ȳ − βx̄
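These two closed-form expressions are easy to check directly; a sketch with invented data:

```python
# Sketch: computing beta and A from the closed-form least-squares formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
A = y.mean() - beta * x.mean()
print(f"beta = {beta:.3f}, A = {A:.3f}")
```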
Multiple Linear Regression
More than one independent variable can be used to explain variance in
the dependent variable, as long as they are not linearly related.
A multiple regression takes the form:

y = A + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ + ε

where k is the number of independent variables.
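A minimal sketch of such a fit, using ordinary least squares on two invented predictors:

```python
# Sketch: multiple linear regression y = A + b1*x1 + b2*x2 via least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

X_design = np.column_stack([np.ones(len(X)), X])  # column of ones estimates A
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(f"A = {coef[0]:.3f}, beta1 = {coef[1]:.3f}, beta2 = {coef[2]:.3f}")
```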
Polynomial Regression
It is a technique for fitting a nonlinear equation by taking polynomial functions of the independent variable. A polynomial of degree k in one variable is written as:

y = β₀ + β₁X + β₂X² + ⋯ + βₖXᵏ + ε

Hence, in situations where the relationship between the dependent and independent variable appears to be non-linear, we can deploy polynomial regression models.
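For instance, a degree-2 polynomial can be fitted with NumPy's polyfit; the curved data here are invented:

```python
# Sketch: polynomial regression of degree 2, y = b0 + b1*x + b2*x^2.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=30)

coeffs = np.polyfit(x, y, deg=2)   # returned highest degree first: [b2, b1, b0]
print("fitted coefficients (b2, b1, b0):", np.round(coeffs, 3))
```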
Logistic Regression
Logistic regression is used to find the probability of an event's success or failure.
We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature.
The logistic regression equation is given by

y = log( p / (1 − p) )

where p is the probability of the event occurring.
Logistic Regression
It is widely used for classification problems.
Logistic regression doesn't require a linear relationship between the dependent and independent variables.
It can handle various types of relationships because it applies a non-linear log transformation to the predicted odds ratio.
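A minimal sketch with scikit-learn, on an invented binary outcome:

```python
# Sketch: logistic regression for a binary (0/1) dependent variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))                    # one independent variable
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("P(success | x = 1.5) =", model.predict_proba([[1.5]])[0, 1])
```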
Stepwise Regression
This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done by an automatic process that involves no human intervention.
Standard stepwise regression does two things at each step: it adds and removes predictors as needed.
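scikit-learn's SequentialFeatureSelector offers an automated approximation of this idea (it only adds, or only removes, one predictor at a time rather than doing both); a sketch with invented data:

```python
# Sketch: automatic predictor selection via forward sequential selection.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))                    # five candidate predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=100)

selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)
print("selected predictors:", np.flatnonzero(selector.get_support()))
```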
Ridge Regression
Ridge regression is a technique used when the data suffer from multicollinearity (independent variables are highly correlated).
Under multicollinearity, even though the least squares estimates are unbiased, their variances are large, which pushes the estimates far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
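A sketch of this on two deliberately collinear (invented) predictors, where alpha sets the penalty strength:

```python
# Sketch: ridge regression stabilizing estimates under multicollinearity.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)       # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

print("ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_.round(3))
```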
Lasso Regression
Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficients, but it penalizes their absolute values rather than their squares.
In addition, it is capable of reducing the variability and improving the accuracy of linear regression models.
Elastic Net Regression
ElasticNet is a hybrid of the Lasso and ridge regression techniques.
Elastic-net is useful when there are multiple features which are correlated: Lasso is likely to pick one of them at random, while elastic-net is likely to pick both, as the sketch below illustrates.
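A sketch of that contrast on an invented pair of strongly correlated predictors:

```python
# Sketch: Lasso tends to zero out one of a correlated pair of predictors,
# while ElasticNet tends to keep both. Data are invented.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)       # strongly correlated pair
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=200)

print("lasso:      ", Lasso(alpha=0.1).fit(X, y).coef_.round(3))
print("elastic net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_.round(3))
```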
Regression Analysis Tools
Excel, Minitab, SPSS
NCSS
MacAnova
RegressIt
EViews
MATLAB
JMP
Stata
Example: Use the data in the table below to obtain the regression line relating income and alcohol consumption.
Province Income Alcohol
Newfoundland 26.8 8.7
Prince Edward Island 27.1 8.4
Nova Scotia 29.5 8.8
New Brunswick 28.4 7.6
Quebec 30.8 8.9
Ontario 36.4 10.0
Manitoba 30.4 9.7
Saskatchewan 29.8 8.9
Alberta 35.1 11.1
British Columbia 32.5 10.9
X      Y     X²       Y²      XY
26.8   8.7   718.24   75.69   233.16
27.1   8.4   734.41   70.56   227.64
29.5   8.8   870.25   77.44   259.60
28.4   7.6   806.56   57.76   215.84
30.8   8.9   948.64   79.21   274.12
36.4   10.0  1324.96  100.00  364.00
30.4   9.7   924.16   94.09   294.88
29.8   8.9   888.04   79.21   265.22
35.1   11.1  1232.01  123.21  389.61
32.5   10.9  1056.25  118.81  354.25
ΣX = 306.8   ΣY = 93.0   ΣX² = 9503.52   ΣY² = 875.98   ΣXY = 2878.32
These can be used to determine the following:
S_XY = ΣXY − (ΣX)(ΣY)/n = 2878.32 − (306.8)(93.0)/10 = 2878.32 − 2853.24 = 25.08

S_XX = ΣX² − (ΣX)²/n = 9503.52 − (306.8)²/10 = 9503.52 − 9412.624 = 90.896

S_YY = ΣY² − (ΣY)²/n = 875.98 − (93.0)²/10 = 875.98 − 864.90 = 11.08
The value of the slope of the regression line is

b = S_XY / S_XX = 25.08 / 90.896 = 0.276

The intercept is

a = Ȳ − bX̄ = 9.30 − (0.276)(30.68) = 9.30 − 8.468 = 0.832

The least squares regression line is

Ŷ = 0.832 + 0.276 X

While the intercept a = 0.832 has little real meaning, the slope of the line can be interpreted meaningfully. The slope b = 0.276 is positive, indicating that as income increases, alcohol consumption also increases.
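The whole calculation can be checked in a few lines of Python (a small difference in the intercept arises because the slides round b to 0.276 before computing a):

```python
# Sketch: reproducing the income/alcohol regression line with NumPy.
import numpy as np

income = np.array([26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5])
alcohol = np.array([8.7, 8.4, 8.8, 7.6, 8.9, 10.0, 9.7, 8.9, 11.1, 10.9])

n = len(income)
sxy = np.sum(income * alcohol) - income.sum() * alcohol.sum() / n  # 25.08
sxx = np.sum(income ** 2) - income.sum() ** 2 / n                  # 90.896

b = sxy / sxx                             # slope, approx. 0.276
a = alcohol.mean() - b * income.mean()    # intercept, approx. 0.83
print(f"Y_hat = {a:.3f} + {b:.3f} * X")
```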
Thanks
