
Statistics is the science of collecting, organising, presenting,
analyzing and interpreting data to assist in making more
effective decisions.
Types of Statistics:
The study of statistics is usually divided into two categories.
(i) Descriptive statistics: methods of organising,
summarizing, and presenting data in an informative way.
Ex: frequency distributions and charts (histogram, pie chart,
etc.). It also includes specific measures of central location,
such as the mean, which describe the central values of a
group of numerical data.
(ii) Inferential Statistics (or statistical inference/inductive
statistics): the methods used to determine something
about a population on the basis of a sample.
Population: the entire set of individuals or objects of
interest, or the measurements obtained from all
individuals or objects of interest.
Sample: A portion, or part of the population of interest.
Types of Variables:
There are two basic types of variables.
(i) Qualitative Variable: When the characteristic being
studied is non-numeric, it is called a qualitative variable or
an attribute.
Ex: gender, religion, color

(ii) Quantitative Variable: When the variable studied can be
reported numerically (numeric data). Quantitative variables
are either:
- Discrete: can take only certain values, and there are
usually ‘gaps’ between values.
- Continuous: can take any value within a range.

Levels of Measurement:

The lowest, or most primitive, measurement is the nominal
level. The highest, or the level that gives us the most
information about the observation, is the ratio level of
measurement.
There are actually four levels of measurement:
1. Nominal-Level Data: For the nominal level of
measurement, observations of a qualitative variable can
only be classified and counted. There is no particular order
to the categories.
Ex: the classification of items by color, say chocolates.
Nominal-level data have the following properties:
i) Data categories are mutually exclusive and exhaustive.
ii) Data categories have no logical order.
2. Ordinal-Level Data: Examples:
a) Poor vs rich
b) Superior vs good vs average
Here we can say superior is better than good, which is
better than average, but we cannot conclude how much
better one rating is than another. In summary, the
properties of ordinal-level data are:
i) The data classifications are mutually exclusive and
exhaustive.
ii) Data classifications are ranked or ordered
according to the particular trait they possess.
3. Interval-Level Data: Ex. measurement of heat
(temperature) in degrees Celsius or Fahrenheit.
The properties of interval-level data are:
i) Data classification are mutually exclusive and exhaustive
ii) Data classifications are ordered according to the
amount the characteristic they possess.
iii) Equal differences in the characteristics are represented
by equal differences in the measurements.
4. Ratio-Level Data: The ratio-level is the “highest” level of
measurement. In summary, the properties of the ratio-
level data are:
i) Data classifications are mutually exclusive and
exhaustive.
ii) Data classifications are ordered according to the
characteristics they possess.
iii) Equal differences in the characteristics are represented
by equal differences in the numbers assigned to the
classifications.
iv) The zero point is the absence of the characteristic.

Class Intervals and Class Midpoints:


 Class midpoint: The midpoint is halfway between the
lower limits of two consecutive classes.
 Class interval: the difference between the lower limits of
two consecutive classes. One can also determine the class
interval by finding the difference between consecutive
midpoints.

 Histogram: A graph in which classes are on the horizontal


(x-axis) and frequencies on the vertical axis (y-axis). The
class frequencies are represented by the height of the
bars, and the bars are drawn adjacent to each other.
 Frequency Polygon: similar to a histogram. It is constructed
by joining the class midpoints with line segments.

Describing Data
To transform a mass of raw data into a meaningful form is
important.
Descriptive Statistics: frequency distributions and graphical
representations like the histogram or frequency polygon.
Numerical Statistics: two important numerical ways to represent
data.
 Measures of location- often referred to as averages. The
purpose of a measure of location is to pinpoint the center
of a set of values. The five most common measures of
location are the arithmetic mean, the weighted mean, the
geometric mean, the median, and the mode.
 Measures of dispersion- often called the variation or spread.

Any measurable characteristic of a population is called a
parameter. The mean of a population is a parameter.

Weighted Mean, x̄w = Σ(w·x) / Σw
= (w1x1 + w2x2 + ... + wnxn) / (w1 + w2 + ... + wn)

Median: The midpoint of the values after they have been


ordered from the smallest to the largest or the largest to the
smallest.
Mode: The value of the observation that appears most
frequently.
The mode has the advantage of not being affected by extremely
high or low values. It can be used for all levels of data: nominal,
ordinal, interval, or ratio.
The mode does have disadvantages, however, that cause it to
be used less frequently than the mean or median. There may be
more than one mode, or no mode at all, in a set of data.
For a symmetric distribution, the mode, median, and mean are
located at the center and are always equal.
If a distribution is non-symmetrical, or skewed, the relationship
among the three measures changes. In a positively skewed
(more area on the right side) distribution, the arithmetic mean
is the largest of the three, since it is most affected by extreme
values. The median is the next largest measure, and the mode
is the smallest of the three.
The opposite happens in a negatively skewed distribution.
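As a quick check of these measures, here is a minimal Python sketch (the small, positively skewed data set is made up for illustration):

```python
import statistics

# Hypothetical, positively skewed data set
data = [2, 3, 3, 4, 5, 6, 15]

print(statistics.mean(data))    # 5.43..., pulled upward by the extreme value 15
print(statistics.median(data))  # 4, the midpoint of the ordered values
print(statistics.mode(data))    # 3, the most frequent value
```

Here mean > median > mode, the ordering described above for a positively skewed distribution.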
Geometric Mean, GM = (x1 · x2 · ... · xn)^(1/n),
the nth root of the product of n positive values.

Dispersion- studies the spread of data


Measures of Dispersion
(i) Range = Largest Value - Smallest Value
(ii) Mean Deviation, MD: the arithmetic mean of the
absolute values of the deviations from the arithmetic
mean

MD = Σ|x - x̄| / n

(iii) Variance: the arithmetic mean of the squared deviations
from the mean.
(iv) Standard Deviation: the square root of the variance.

Population Variance, σ² = Σ(x - μ)² / N
Population SD, σ = √[Σ(x - μ)² / N]

Sample Variance, s² = Σ(x - x̄)² / (n - 1)
Sample SD, s = √[Σ(x - x̄)² / (n - 1)]
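A minimal Python sketch of these formulas (the data values are hypothetical):

```python
import math

x = [10, 12, 9, 14, 15]          # hypothetical observations
n = len(x)
mean = sum(x) / n

# Sample statistics divide by (n - 1); population versions divide by N
sample_var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
sample_sd = math.sqrt(sample_var)
print(sample_var, sample_sd)
```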

Chebyshev’s Theorem: For any set of observations (sample or
population), the proportion of the values that lie within k
standard deviations of the mean is at least 1 - 1/k², where k is
any constant greater than 1.
The Empirical Rule (sometimes called the Normal Rule): For a
symmetrical, bell-shaped frequency distribution, approximately
68% of the observations will be within ±1 SD of the mean; about
95% will be within ±2 SDs; and practically all (99.7%) will be
within ±3 SDs of the mean.
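Chebyshev's bound is easy to verify numerically; a sketch with made-up data:

```python
import statistics

data = [4, 8, 15, 16, 23, 42, 7, 9, 12, 20]   # hypothetical observations
mean = statistics.mean(data)
sd = statistics.pstdev(data)                  # treat the set as a population

k = 2
within = sum(1 for x in data if abs(x - mean) <= k * sd) / len(data)
print(within, ">=", 1 - 1 / k**2)             # at least 1 - 1/k^2 must hold
```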

The Mean & SD of Grouped Data:

AM of grouped data, x̄ = Σ(f·M) / n

SD of grouped data, s = √[Σf(M - x̄)² / (n - 1)]

where f is the class frequency and M is the class midpoint.

Coefficient of Variation: the ratio of the SD to the AM,
expressed as a %: CV = (s / x̄) × 100%


Probability Distribution: A listing of all the outcomes of an
experiment and the probability associated with each outcome.
In any experiment of chance, the outcomes occur randomly, so
the quantity observed is often called a random variable.
Random Variable: A quantity resulting from an experiment that
by chance, can assume different values.
Discrete Random Variable: A random variable that can assume
only certain clearly separated values.

Mean & SD of a Probability Distribution


Mean of a probability distribution is also referred to as its
expected value

Mean of a PD, μ = Σ[x · P(x)]

where P(x) is the probability of a particular value x.

Variance of a PD, σ² = Σ[(x - μ)² · P(x)]

The binomial PD is a special case of a discrete probability
distribution with only two possible outcomes per trial.

Binomial PD, P(x) = nCx · π^x · (1 - π)^(n - x)

Where: nCx is the number of combinations, π is the probability
of success on each trial, n is the number of trials, and x is the
RV defined as the number of successes.

[Remember: π here denotes the probability of success; it is not
the same as 3.1416]

Mean of a BPD, μ = nπ

Variance of a BPD, σ² = nπ(1 - π)
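A minimal Python sketch of the binomial formulas (n and π below are made-up values):

```python
import math

n, p = 10, 0.3  # hypothetical number of trials and probability of success

def binom_pmf(x: int) -> float:
    # P(x) = nCx * p^x * (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

mean = n * p                  # mean of a binomial PD
variance = n * p * (1 - p)    # variance of a binomial PD
print(binom_pmf(3), mean, variance)
```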

The Poisson PD describes the number of times some event occurs
during a specified interval. The interval may be time, distance,
area or volume.
The PPD is based on three basic assumptions:
i) The RV is the number of times some event occurs during
a defined interval.
ii) The probability of the event is proportional to the
interval.
iii) The intervals which do not overlap are independent.

Poisson Distribution, P(x) = μ^x · e^(-μ) / x!

Where:
μ is the mean number of occurrences (successes) in a particular
interval; e = 2.71828; x is the number of occurrences (successes);
P(x) is the probability for a specified value of x.

Mean of a Poisson PD, μ = nπ

where n is the total number of trials and π is the probability of
success.
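A sketch of the Poisson formula in Python (μ here is an arbitrary example value):

```python
import math

mu = 0.3  # hypothetical mean number of occurrences per interval

def poisson_pmf(x: int) -> float:
    # P(x) = mu^x * e^(-mu) / x!
    return mu**x * math.exp(-mu) / math.factorial(x)

print(poisson_pmf(0), poisson_pmf(1))
```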

Normal PD, P(x) = [1 / (σ√(2π))] · e^(-(x - μ)² / (2σ²))

Where: μ is the mean; σ is the SD; π = 3.1416; e = 2.71828, the
base of the natural logarithm system.
Continuous PD
The number of normal distributions is unlimited, each having a
different mean μ, standard deviation σ, or both. While it is
possible to provide probability tables for discrete distributions
such as the binomial and the Poisson, providing tables for the
infinite number of normal distributions is impossible. Fortunately,
one member of the family can be used to determine the
probabilities for all normal distributions. It is called the standard
normal distribution, and it is unique because it has a mean of 0
and a standard deviation of 1.
Any normal distribution can be converted into a Standard
Normal Distribution by subtracting the mean from each
observation and dividing the difference by the SD. The results
are called z values. They are also referred to as z scores, z
statistics, standard normal deviates, standard normal values,
or just normal deviates.

It is observed that the binomial distribution (a discrete


distribution) can be approximated using normal distribution (a
continuous distribution) for large values of n.
The normal PD is a good approximation to the binomial
probability distribution when nπ and n(1 - π) are both at least
5.
Continuity correction factor: the value 0.5, subtracted or added
(depending on the question) to a selected value when a discrete
probability distribution is approximated by a continuous PD.
How to apply the correction factor
Only four cases may arise. These cases are:
For the probability that:
i) at least x occur, use the area above (x - 0.5)
ii) more than x occur, use the area above (x + 0.5)
iii) x or fewer occur, use the area below (x + 0.5)
iv) fewer than x occur, use the area below (x - 0.5)
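A sketch of case (i) in Python, using the normal approximation with the correction factor (the n, π, and x values are made up for illustration):

```python
import math

def normal_cdf(z: float) -> float:
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 60, 0.8                                 # hypothetical binomial parameters
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

x = 50
# P(at least x) uses the area above (x - 0.5)
p_at_least = 1 - normal_cdf((x - 0.5 - mu) / sigma)
print(p_at_least)
```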

The CLT states that for large random samples, the shape of the
sampling distribution of the sample mean is close to a normal
probability distribution. This approximation is more accurate for
large samples than for small samples. This is one of the most
useful conclusions in statistics. We can reason about the
distribution of the sample mean with absolutely no information
about the shape of the population distribution from which the
sample is taken. In other words, the CLT is true for all
distributions.
The means of samples of a specified size vary from sample to
sample.
Sampling Distribution of the Sample Mean: A probability
distribution of all possible sample means of a given sample size.
Central Limit Theorem
If all samples of a particular size are selected from any
population; the sampling distribution of the sample mean is
approximately a normal distribution. This approximation
improves with larger samples.
If the population follows a normal distribution, then for any
sample size the sampling distribution of the sample mean will
also be normal. If the population distribution is symmetrical
(but not normal), you will see the normal shape of the
distribution of the sample mean emerge with samples as small
as 10. On the other hand, if you start with a distribution that is
skewed or has thick tails, it may require samples of 30 or more
to observe the normality feature. A sample size of 30 or more is
considered large enough for the CLT to be employed.
The CLT indicates that, regardless of the shape of the
population distribution, the sampling distribution of the sample
mean will move towards the normal probability distribution.
The larger the number of observations in each sample, the
stronger the convergence.
For larger sample sizes, it is observed that the mean of the
sampling distribution is the population mean, i.e., μx̄ = μ, and if
the standard deviation of the population is σ, the standard
deviation of the sample mean is σ/√n, where n is the number
of observations in each sample. We refer to σ/√n as the
standard error of the mean. Its longer name is actually the
standard deviation of the sampling distribution of the sample
mean.

Important Conclusions from this section:


i) The mean of the distribution of sample means will be
exactly equal to the population mean if we are able to
select all possible samples of the same size from a given
population. That is: μx̄ = μ.
Even if we do not select all samples, we can expect the
mean of the distribution of sample means to be close to μ.
ii) There will be less dispersion in the sampling distribution
of the sample mean than in the population. If the SD of
the population is σ, the standard deviation of the sample
means is σ/√n. Note that when we increase the size of
the sample, the standard error of the mean decreases.


Suppose we have a population about which we have some
information. We take a sample from that population and wish
to determine whether the sampling error, that is, the difference
between the population parameter and the sample statistic, is
due to chance.
As discussed in this section, we can calculate the probability
that a sample mean will fall within a certain range. We know
that the sampling distribution of the sample mean will follow
normal probability distribution under two conditions:
i) When the samples are taken from populations known to
follow the normal distribution. In this case the size of
the sample is not a factor.
ii) When the shape of the population distribution is not
known, or the shape is known to be non-normal, but our
sample contains at least 30 observations.

We can now use z = (x - μ) / σ to convert any normal
distribution to the standard normal distribution. Using this
value z, we can find the probability (from the table of areas
under the normal curve) of selecting an observation that would
fall within a specific range.
Where: x is the value of the RV; μ is the population mean; and
σ is the population SD.

However, most business decisions refer to a sample, not just one
observation. So we are interested in the distribution of x̄, the
sample mean, instead of x, the value of one observation. That is
the first change we make in the above formula. The second is
that we use the standard error of the mean of n observations
instead of the population standard deviation. That is, we use
σ/√n in the denominator rather than σ. Therefore, to find the
likelihood of a sample mean within a specified range, we first
use the following formula to find the corresponding z value, and
then use the table for the area under the normal curve to locate
the probability.

Finding the z value of x̄ when the population SD is known:
z = (x̄ - μ) / (σ/√n)


Often we do not know the value of the population standard
deviation, σ. Provided the sample contains at least 30
observations, we estimate the population SD with the sample
SD, s. (Strictly, the resulting statistic follows Student’s t
distribution.)

Finding the z value of x̄ when the population SD is unknown:
z = (x̄ - μ) / (s/√n)
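A small Python sketch of this conversion (all the numbers are hypothetical):

```python
import math

mu, sigma, n = 100, 15, 36   # hypothetical population mean, SD, sample size
xbar = 105                   # hypothetical observed sample mean

z = (xbar - mu) / (sigma / math.sqrt(n))  # standard error in the denominator
print(z)                                  # 2.0 here
```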

POINT ESTIMATES & CONFIDENCE INTERVAL


In most business situations, information about the population is
not available. In fact, the purpose of sampling may be to
estimate some of the population parameter values, like the
population mean and population standard deviation.
We start by finding a point estimate. However, a point estimate is a
single value. A more informative approach is to present a range
of values in which we expect the population parameter to
occur.
Point Estimate: The statistic, computed from sample
information, which is used to estimate the population
parameter.
The sample mean, x̄, is a point estimate of the population
mean, μ; p̂, a sample proportion, is a point estimate of π, the
population proportion; and s, the sample SD, is a point estimate
of σ, the population SD.

A point estimate, however, tells only a part of the story while


we expect the point estimate to be close to the population
parameter, we would like to measure how close it really is. A
confidence interval serves this purpose.
Confidence Interval: A range of values constructed from sample
data so that the population parameter is likely to occur within
that range at a specified probability. The specified probability is
called the level of confidence.
The SD of the sampling distribution of the sample mean is
usually called the “standard error”.

We know the standard error of the mean is σx̄ = σ/√n.

In most applied situations the population SD is not available, so
we estimate it as follows: sx̄ = s/√n.
The size of the standard error (SE) is affected by two values. The
first is the SD: if the SD is large, then the SE will also be large.
However, the SE is also affected by the sample size, n. As the
sample size is increased, the SE decreases, indicating that there
is less variability in the sampling distribution of the sample
mean. This conclusion is logical, because an estimate made
with a large sample should be more precise than one made
from a small sample.

Confidence Interval for the Population Mean: x̄ ± z·(σ/√n)
(or x̄ ± z·(s/√n) when σ is estimated by s)

Where: z depends on the level of confidence.
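A minimal sketch of a 95 percent confidence interval in Python (the sample figures are made up):

```python
import math

xbar, s, n = 48.5, 9.9, 36   # hypothetical sample mean, sample SD, sample size
z = 1.96                     # z value for a 95% level of confidence

se = s / math.sqrt(n)        # estimated standard error of the mean
print(xbar - z * se, xbar + z * se)
```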

Unknown Population SD and a Small Sample:


In the previous section we used the standard normal
distribution to express the level of confidence. We assumed
either:
i) The population followed the normal distribution and the
population SD was known, or
ii) The shape of the population was not known, but the
number of observations in the sample was at least 30.
What do we do if the sample size is less than 30 and we do not
know the population SD? This situation is not covered by the
CLT but exists in many cases. Under these conditions, the
correct statistical procedure is to replace the standard normal
distribution with the t distribution.

Here s is an estimate of σ.

The t distribution has a greater spread than the normal
distribution.

Confidence interval for the population mean, σ unknown:
x̄ ± t·(s/√n)
To develop a confidence interval for the population mean with


an unknown population SD we:
i) Assume the sample is from a normal distribution.
ii) Estimate the population SD, σ, with the sample SD, s.
iii) Use t distribution rather than the z distribution.
We should be clear at this point. We usually employ the
standard normal distribution when the sample size is at least
30. Strictly speaking, we should base the decision whether to
use z or t on whether σ is known or not. When σ is known, we
use z; when it is not, we use t. The rule of using z when the
sample size is 30 or more is based on the fact that the t
distribution approaches the normal distribution as the sample
size increases. When the sample size reaches 30, there is little
difference between the z and t values, so we ignore the
difference and use z.
Figure: Determining when to use the z distribution or the t
distribution
- Is the population normal?
  - No: Is n 30 or more?
    - No: use a non-parametric test.
    - Yes: use the z distribution.
  - Yes: Is the population SD known?
    - No: use the t distribution.
    - Yes: use the z distribution.
Choosing an Appropriate Sample Size:
Sample size depends on three factors:
(i) The confidence interval defined
(ii) The margin of error the researcher will tolerate
(iii) The variability in the population being studied
Or
(i) Level of confidence
(ii) Allowable error
(iii) Population SD (if the population is widely dispersed, a
large sample is required; if the population is
homogeneous, with a low SD, we need a small sample).
However, it may be necessary to use an estimate for the
population SD. Here are three suggestions for finding that
estimate:
a) Use a comparable study.
b) Use a range-based approach (virtually all observations
lie within ±3 SDs of the mean, so σ can be estimated as
range/6).
c) Conduct a pilot study.
We can express the interaction among these three
factors and the sample size in the following formula:

n = (z·σ / E)²

where E is the maximum allowable error, z is the z value for
the chosen level of confidence, and σ is the (estimated)
population SD.
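A sketch of the sample-size formula in Python (the inputs are illustrative):

```python
import math

z = 1.96      # 95% level of confidence
sigma = 5.0   # estimated population SD (e.g., from a pilot study or range/6)
E = 1.0       # maximum allowable error

n = math.ceil((z * sigma / E) ** 2)  # round up to the next whole observation
print(n)                             # 97 here
```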

Hypothesis:
A hypothesis is a statement about a population. Data are then
used to check the reasonableness of the statement.

Generally, H0 is rejected if the confidence interval does not
include the hypothesized value. If the confidence interval
includes the hypothesized value, then H0 is not rejected.
What is Correlation Analysis?
 A technique to determine whether there is any
relationship between two variables:
i. dependent variable, Y, (variable that is
predicted / estimated), and,
ii. Independent variable, X, ( a variable that
provides the basis for estimation)
 Ex. Scatter Analysis

The Coefficient of Correlation


- Requires interval-level data (equal differences in the
characteristic are represented by equal differences in the
measurement) or ratio-level data (interval-level data with a
meaningful zero point).
o It is mostly not applied to: i. nominal-level data (data
categories are mutually exclusive and exhaustive but have
no logical order, hence allow no manipulation, say
addition, of the data) or ii. ordinal-level data (data
categories are mutually exclusive and exhaustive and are
classified according to rank, i.e., the categories have a
ranked order; Ex. rich, middle-class, poor).
Coefficient of Correlation, r: a measure of the strength of the
relationship.
 The value of r ranges between ( -1) and ( +1)
The value of r denotes the strength of the
association as illustrated by the following
diagram
A case for Linear Regression
Mr Bush, the marketing manager, would like specific information about the
relationship between the number of sales calls and the number of servers sold. Use the least
squares method to determine a linear equation to express the relationship between the two
variables.

Sales Representative  Sales Calls (X)  Servers Sold (Y)
Hari 20 30
Rama 40 60
Shivani 20 40
Ravi 30 60
Gautam 10 30
Manish 10 40
Pandu 20 40
Harish 20 50
Venktesh 20 30
Binny 30 70
Total 220 450

a. What is the expected number of servers sold by a
representative who makes 25 calls?
b. Determine a 95 percent confidence interval for all sales
representatives who make 25 calls, and for Mr Venu, a
Northern Region sales representative who made 25 calls.

Is there any relationship between X and Y?


Plot the data – scatter plot

Exhibits a near linear relationship

Determine the strength of this relationship: Compute


Coefficient of Correlation, r
Use a normalized scatter plot (through the means of X and Y):
draw a vertical axis through x̄ and a horizontal axis through Ȳ.
These axes pass through the center of the data.

x̄ = ∑X / n = 220 / 10 = 22
Ȳ = ∑Y / n = 450 / 10 = 45

As most of the data points (except the 8th, Harish) lie in the 1st
or 3rd quadrant, we may assume a positive relationship, because
in both these quadrants (X - x̄)(Y - Ȳ) is positive: (X - x̄) and
(Y - Ȳ) have the same sign, either both positive (1st quadrant)
or both negative (3rd quadrant), as observed in the table below:
Calculate the deviations from the mean data

Sales Representative  Calls (X)  Sold (Y)  X - x̄  Y - Ȳ  (X - x̄)(Y - Ȳ)
Hari 20 30 -2 -15 30
Rama 40 60 18 15 270
Shivani 20 40 -2 -5 10
Ravi 30 60 8 15 120
Gautam 10 30 -12 -15 180
Manish 10 40 -12 -5 60
Pandu 20 40 -2 -5 10
Harish 20 50 -2 5 -10
Venktesh 20 30 -2 -15 30
Binny 30 70 8 25 200
Total 220 450 0 0 900

Correlation Coefficient, r = ∑(X - x̄)(Y - Ȳ) / [(n - 1)·sx·sy]
= 900 / [(10 - 1) × 9.189 × 14.337]
= 0.759
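The same computation as a runnable Python sketch, using the table's data:

```python
import statistics

X = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # sales calls
Y = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]   # servers sold

n = len(X)
xbar, ybar = statistics.mean(X), statistics.mean(Y)
sx, sy = statistics.stdev(X), statistics.stdev(Y)   # sample SDs: ~9.189, ~14.337

r = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ((n - 1) * sx * sy)
print(round(r, 3))  # 0.759
```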

Be cautious of spurious correlations.


Ex. No. of donkeys / horses and PhDs awarded
No of Trees in the institute campus and PhDs awarded
Testing of Coefficient of Correlation
In the previous example, the data for 10 salespeople exhibited a
strong relationship between the number of sales calls and the
number of servers sold. Could it be that the correlation in the
population is 0 (zero)? This would mean that the correlation of
0.759 is due to chance.

Resolving this dilemma requires a test to answer the obvious
question: could the sample have been selected from a
population with zero correlation? To put it a different way, did
the computed r come from a population of paired observations
with zero correlation?

We do hypothesis testing.
Hypothesis: a statement about a population parameter
developed for the purpose of testing.
Null Hypothesis: A statement about the value of the
population parameter
Alternate Hypothesis: A statement that is accepted if the
sample data provide sufficient evidence that the Null
Hypothesis is false.
Steps in Hypothesis Testing
1. Establish the null hypothesis (H0) and the alternate
hypothesis (H1),
2. Select the level of significance, that is α,
(Rejecting the null hypothesis when it is in fact true is called a Type I error)
Errors in Making Decisions
 Type I Error (H0 rejected when true)
 When a true null hypothesis is rejected
 The probability of a Type I Error is α
 Called the level of significance of the test
 Set by researcher in advance
 Type II Error [(Failure to reject H0 when it is false) or H1
accepted when false]
 Failure to reject a false null hypothesis
 The probability of a Type II Error is β

3. Select the test statistic. Examples: z, t, F, χ²


4. Formulate the decision rule (based on 1,2, and 3 above)
5. Make a decision regarding the null hypothesis based on
the sample information. Interpret the results of the test.
For our example:
H0: ρ = 0 (the correlation of the population is 0)
H1: ρ ≠ 0 (the correlation of the population is different from 0)
From the way H1 is stated, we know that the test is two-tailed
We use a t test (as we do not know the distribution of the
population, but may assume a flatter symmetrical distribution,
given the low n, here 10), with n - 2 df.

Using the 0.05 level of significance, the decision rule states that
if the computed t falls between -2.306 and +2.306, the null
hypothesis is not rejected.
Compute the test statistic (here t):
t = r·√(n - 2) / √(1 - r²) = 0.759·√8 / √(1 - 0.759²) = 3.297

The computed t is in the rejection region.


Thus, H0 is rejected at the 0.05 significance level.
This means the correlation in the population is not 0; i.e., there
is a relationship between the number of sales calls made and
the number of servers sold.

To report how extreme the result is, we use the p-value. A
p-value is the likelihood of finding a value of the test statistic
more extreme than the one computed earlier, assuming H0 is
true. We search the two-tail t values for our df in the table, and
observe that 3.297 lies between 2.896 and 3.355, which
correspond to the 0.02 and 0.01 significance levels; hence
0.01 < p-value < 0.02.
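The t statistic for testing r is a one-liner; a quick sketch:

```python
import math

r, n = 0.759, 10

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 3))  # 3.297, beyond the critical value of ±2.306 for 8 df
```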
Regression Analysis
Correlation analysis is used to measure the strength and the
direction of the relationship between two variables.
Is it meaningful to know whether a correlation exists?
If so why?
Or
What is the application (use) of this relationship among
variables (between independent /dependent variables)?
The technique used to develop an equation and provide the
estimate is called regression analysis.
Regression Equation: An equation that expresses the linear
relationship between two variables.
Curve fitting; behavior of distribution of data around a line /
curve
Least Squares Principle: determining a regression equation by
minimizing the sum of the squares of the vertical (shortest/
minimum) distances between the actual Y values and the
estimated/predicted values of Y, Y’.

General form of the linear regression Equation


Y’ = a + bX
where:
Y’ is the predicted value of Y for a selected X value;
a is the Y-intercept. It is the estimated value of Y when X = 0;
b is the slope of the line, or the average change in Y’ for each one-unit change (increase or
decrease) in the value of X;
X is any value of the independent variable that is selected.

Slope of the regression line: b = r·(sy / sx)

where:
r is the correlation coefficient;
Sy is the SD of Y (the dependent variable)
Sx is the SD of X (the independent variable)

Y-intercept: a = Ȳ - b·x̄, the value of Y when X is zero.

where:
Ȳ is the mean of Y (the dependent variable)
x̄ is the mean of X (the independent variable)

Referring to the example above of number of sales calls (X) and


number of servers sold (Y), what is the expected number of
servers sold by the organization (any representative(s)) with 25
calls?
The calculations necessary to determine the regression
equation are:

b = r·(sy / sx) = 0.759 × (14.337 / 9.189) = 1.1842
a = Ȳ - b·x̄ = 45 - (1.1842 × 22) = 18.9476

Thus the regression equation is


Y’ = a + bX; Y’ = 18.9476 + 1.1842X
For 25 calls (X =25) made by the sales representative(s), the
expected number of servers s/he can sell is:
Y’ = 18.9476 + 1.1842* 25
= 18.9476 + 23.684
= 48.5526
What does it mean? It means that, with 25 calls made, the
organization expects to sell on average around 49 servers (a
point estimate; a 95 percent interval around it is developed
below).
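A Python sketch of the fit and the prediction, reusing the quantities computed earlier:

```python
import statistics

X = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
Y = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

r = 0.759                                   # correlation from the earlier step
sx, sy = statistics.stdev(X), statistics.stdev(Y)
xbar, ybar = statistics.mean(X), statistics.mean(Y)

b = r * sy / sx        # slope, ~1.1842
a = ybar - b * xbar    # intercept, ~18.9476
print(a + b * 25)      # predicted servers sold for 25 calls, ~48.55
```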
Drawing the Regression Line

Sales Representative  Sales Calls (X)  Servers Sold (Y)  Estimated Sales (Y’)
Hari 20 30 42.6316
Rama 40 60 66.3156
Shivani 20 40 42.6316
Ravi 30 60 54.4736
Gautam 10 30 30.7896
Manish 10 40 30.7896
Pandu 20 40 42.6316
Harish 20 50 42.6316
Venktesh 20 30 42.6316
Binny 30 70 54.4736
Total 220 450 450
This line has some interesting features. As we have discussed,
there is no other line through the data for which the sum of the
squared deviations is smaller. In addition, this line passes
through the point given by the mean of the X values and the
mean of the Y values, that is, (x̄, Ȳ). In this example x̄ = 22.0
and Ȳ = 45.0.

The Standard Error of Estimate


As we observe, the predicted values differ from the actual
values.

What is the error of the regression estimates?


Standard Error of Estimate, SEE, SY.X: A measure of the scatter,
or dispersion, of the observed values around the line of
regression.

Standard Error of Estimate, SY.X = √[∑(Y - Y’)² / (n - 2)]

Remember that the regression line represents all the values of


Y’. If SY.X is small, this means that the data are relatively close to
the regression line and the regression equation can be used to
predict Y with little error. Conversely, if SY.X is large, then this
means that the data are widely scattered around the regression
line and the regression equation will not provide a precise
estimate of Y.
Sales Sales Servers Estimated Deviation Deviation
Representative Calls (X) Sold (Y) Sales (Y’) (Y-Y’) squared (Y-Y’)2
Hari 20 30 42.6316 -12.6316 159.5573
Rama 40 60 66.3156 -6.3156 39.8868
Shivani 20 40 42.6316 -2.6316 6.925319
Ravi 30 60 54.4736 5.5264 30.5411
Gautam 10 30 30.7896 -0.7896 0.623468
Manish 10 40 30.7896 9.2104 84.83147
Pandu 20 40 42.6316 -2.6316 6.925319
Harish 20 50 42.6316 7.3684 54.29332
Venktesh 20 30 42.6316 -12.6316 159.5573
Binny 30 70 54.4736 15.5264 241.0691
Total 220 450 450 0.0000 784.2105
Standard Error of Estimate, SY.X = √(784.2105 / (10 - 2)) = 9.901

The deviations are the vertical deviations from the regression


line. The sum of the signed deviations, ∑(Y-Y’), is zero. This
indicates that the positive deviations (above the regression line)
are offset by the negative deviations (below the regression line).
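The SEE computation as a short Python sketch, with the predicted values from the table:

```python
import math

Y  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]        # actual servers sold
Yp = [42.6316, 66.3156, 42.6316, 54.4736, 30.7896,   # predicted values Y'
      30.7896, 42.6316, 42.6316, 42.6316, 54.4736]

sse = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))     # ~784.2105
see = math.sqrt(sse / (len(Y) - 2))                  # ~9.901
print(see)
```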

Thus far we have presented linear regression only as a
descriptive tool; in other words, a simple summary (Y’ = a + bX)
of the relationship between the dependent variable Y and the
independent variable X. When our data are a sample taken
from a population, we are doing inferential statistics, so we
need to recall the distinction between population parameters
and sample statistics. In this case, we “model” the linear
relationship in the population by the equation:
Y = α + βX
where:
Y is any value of the dependent variable
α is the Y-intercept (the value of Y when X = 0) in the population,
β is the slope (the amount by which Y changes when X increases by one unit) of the population
line,
X is any value of the independent variable

Now α and β are population parameters, and a and b,
respectively, are estimates of those parameters. They (a and b)
are calculated from a particular sample taken from the
population.
calculated from a particular sample taken from the population.
Thus, the values of a and b in the regression equation are
usually referred to as the estimated regression coefficients or
simply the regression coefficients.

Assumptions underlying Linear Regression


To properly apply linear regression, several assumptions are
necessary:
i. For each value of X, there is a group of Y values.
These values follow the normal distribution.
ii. The means of these normal distributions lie on the
regression line (in other words, the regression line
connects the means of the normal distributions of the
Yi’s for the respective Xi’s).
iii. The standard deviations of these normal distributions are
the same. The best estimate we have of this SD is the
Standard Error of the Estimate, SEE (SY.X).

iv. The Y values are statistically independent. This means
that in selecting a sample, the Y value for a particular X
does not depend on the Y value for any other X. This is
particularly important when data are collected over a
period of time; in such situations, the errors for a
particular time period are often correlated with those of
other time periods.
If the respective Y values follow a normal distribution with a
constant standard deviation equal to the Standard Error of
Estimate, SEE, (SY.X), around the mean respective Y value (the
respective mean values being on the regression line), then the
same relationship exists between the predicted values, Y’, and
the Standard Error of Estimate, SEE, SY.X;
i. Y’ ± SY.X will include the middle 68 % of the observations
ii. Y’ ±2SY.X will include the middle 95 % of the observations
iii. Y’ ±3SY.X will include all the observations
We can now relate these assumptions to the sales of servers by
ABC Ltd. Assume that we took a sample of at least 30 (number
of observations in the sample, n ≥ 30), but the standard error of
estimate, SEE (SY.X), was still 9.900824 (say 9.901). If we draw
parallel lines 9.901 units above and below the regression line
(see figure below) then about 68 percent of the points would lie
within these lines as limits. Similarly, a line 19.802 [2 SY.X =
2(9.901)] above the regression line and another 19.802 units
below the regression line should include about 95 percent of
the data values.
Confidence and Prediction Interval
The standard error of estimate is also used to establish
confidence intervals when the sample size is large (n ≥30) and
the scatter around the regression line approximates the normal
distribution.
 If the population standard deviation σ is unknown, we
can substitute the sample standard deviation, s, as an
estimate
 This introduces extra uncertainty, since s is different from
sample to sample
 In these circumstances the t distribution is used instead of
the normal distribution [for a small sample size (n < 30)
with an unknown population SD (σ), we use t statistics
instead of the z statistics of the normal distribution]

In our example, as n = 10 (< 30), the sample size is small, so we
need a correction factor to account for the size of the sample.
In addition, when we move away from the mean of the
independent variable, our estimates are subject to more
variation, and we also need to adjust for this.

We are interested in providing interval estimates of two types:


i. A confidence interval, which reports the mean value of
Y’ for a given X
ii. A prediction interval, which reports the range of values
of Y’ for a particular value of X
The confidence interval for a mean value of Y, given X:

Y’ ± t·SY.X·√[1/n + (X - x̄)² / ∑(X - x̄)²]

where:
t is the value of t from the t-table with n - 2 df
William Gosset (1908) noticed that x̄ ± z·s was not precisely
correct for small samples. He noticed that for small samples
(n < 30), the variation around x̄ is more than ± z·s, and we need
to compensate for this. It is observed that the statistic follows a
t distribution (a flatter distribution than the z distribution).
But with sample sizes n ≥ 30, the t values and z values are
almost equal.

Referring to our server sales example:


let us find the confidence interval for X = 25

Sales Representative  Calls (X)  Sold (Y)  X - x̄  Y - Ȳ  (X - x̄)(Y - Ȳ)
Hari 20 30 -2 -15 30
Rama 40 60 18 15 270
Shivani 20 40 -2 -5 10
Ravi 30 60 8 15 120
Gautam 10 30 -12 -15 180
Manish 10 40 -12 -5 60
Pandu 20 40 -2 -5 10
Harish 20 50 -2 5 -10
Venktesh 20 30 -2 -15 30
Binny 30 70 8 25 200
Total 220 450 0 0 900

where:
Y’= (18.9476 + 1.1842* 25) = 48.5526
t for n - 2 = 8 df at the 95 percent confidence level is 2.306

Y’ ± t·SY.X·√[1/n + (X - x̄)²/∑(X - x̄)²]
= 48.5526 ± 2.306 × 9.901 × √(1/10 + (25 - 22)²/760)
= 48.5526 ± 2.306 × 9.901 × 0.334428
= 48.5526 ± 7.6356

Thus, the 95 percent confidence interval for the mean sales of
all sales representatives who make 25 calls runs from 40.9170
to 56.1882. To interpret, let’s round the values: if a sales
representative makes 25 calls, s/he expects to sell 48.6 servers
on average, and it is likely those mean sales will range from 41
to 56.
To determine the prediction interval for a particular value of X
(a particular sales representative in our example), the
confidence interval formula is modified slightly: a 1 is added
under the radical. The formula becomes:

Y’ ± t·SY.X·√[1 + 1/n + (X - x̄)²/∑(X - x̄)²]

Suppose we want to estimate the number of servers sold by a
particular individual, say Mr Venu, who made 25 calls:

= 48.5526 ± 2.306 × 9.901 × √(1 + 0.111842)
= 48.5526 ± 2.306 × 9.901 × 1.054439
= 48.5526 ± 24.0746
Thus the interval is from 24.478 to 72.627. (≈ 24 to 73)
We conclude the number of servers sold by a particular sales
representative will be between 24 and 73. This interval is quite
large. It is much larger than the confidence interval for all sales
representatives who made 25 calls. It is logical, however, that
there should be more variation in the sales estimates for an
individual than for a group.
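Both intervals can be reproduced with a few lines of Python (values carried over from the example):

```python
import math

X = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
n, xbar = len(X), sum(X) / len(X)               # n = 10, xbar = 22
ss_x = sum((x - xbar) ** 2 for x in X)          # sum of (X - xbar)^2 = 760

x0, y_hat = 25, 48.5526                         # X of interest and its Y'
t, see = 2.306, 9.901                           # t for 8 df at 95%, and SEE

core = 1 / n + (x0 - xbar) ** 2 / ss_x
print(t * see * math.sqrt(core))       # confidence half-width, ~7.64
print(t * see * math.sqrt(1 + core))   # prediction half-width, ~24.07
```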
More on Coefficient of Determination
Sales Representative  Calls (X)  Sold (Y)  Y - Ȳ  (Y - Ȳ)²
Hari 20 30 -15 225
Rama 40 60 15 225
Shivani 20 40 -5 25
Ravi 30 60 15 225
Gautam 10 30 -15 225
Manish 10 40 -5 25
Pandu 20 40 -5 25
Harish 20 50 5 25
Venktesh 20 30 -15 225
Binny 30 70 25 625
Total 220 450 0 1850

The sum of the squared deviations from the arithmetic mean
for a set of numbers is smaller than the sum of squared
deviations from any other value, such as the median. Here the
sum of squared deviations is 1850. This is shown
diagrammatically below; the line in red shows the difference
Y - Ȳ for each point.
Here,
for Rama: X = 40, Y = 60, Ȳ = 45, and hence Y - Ȳ = 15
for Ravi: X = 30, Y = 60, Ȳ = 45, and hence Y - Ȳ = 15
for Binny: X = 30, Y = 70, Ȳ = 45, and hence Y - Ȳ = 25

But for the regression, the standard error of estimate, SEE
(SY.X), was 9.900824. Thus, logically, the total variation in Y can
be subdivided into explained variation and unexplained
variation.
Measures of Variation:
i. Explained variation (sum of squares due to regression)
ii. Unexplained variation (error sum of squares)
iii. Total variation (total sum of squares)
Explained Variation = Total Variation - Unexplained Variation
In our example: SSR = 1850 - 784.2105 = 1065.7895
Coefficient of Determination, r²
 The coefficient of determination is the portion of the
total variation in the dependent variable that is explained
by variation in the independent variable
 The coefficient of determination is also called r-squared
and is denoted as r²

r² = SSR / SST = regression sum of squares / total sum of squares

Note:
0 ≤ r² ≤ 1
The Relationship among the Coefficient of Correlation, the
Coefficient of Determination, and the Standard Error of
Estimate
The standard error of estimate measures how close the actual
values are to the regression line. When the SEE is small, it
indicates that the two variables are closely related. In the
calculation of the SEE, the key term is ∑(Y - Y’)². If the value of
this term is small, then the SEE will also be small.

The correlation coefficient measures the strength of the linear
association between two variables. When the points (data)
are close to the line, the correlation coefficient tends to be
large. Thus, the SEE and r report the same information but use
different scales to express the strength of the association.
However, both measures involve the term ∑(Y - Y’)².

The coefficient of determination measures the percent of the


variation in Y that is explained by the variation in X.
A convenient way of showing the relationship among these
three measures is an ANOVA table. In ANOVA, the total
variation is divided into two components: that is due to the
treatments and that is due to random error. This concept is
similar in regression analysis. The total variation ∑(Y-Ȳ)2 is
divided into two components: 1. that explained by the
regression (explained by the independent variable), and 2. the
error, or unexplained variation.
The total number of df is n - 1. The number of df for the
regression is 1, since there is only one independent variable.
The number of df associated with the error term is n - 2. The
“SS” located in the middle of the ANOVA table refers to the
sum of squares, i.e., the variation. The terms are computed as
follows:
Regression variation = SSR = ∑(Y’ - Ȳ)²
Error variation = SSE = ∑(Y - Y’)²
Total variation = SS total = ∑(Y - Ȳ)²
The format of the ANOVA table is
Source df SS MS
(sum of squares –
explains the variation)
Regression 1 SSR SSR /1
Error n-2 SSE SSE / (n-2)
Total n-1 SSTotal*
*SS Total = SSR + SSE

The coefficient of determination can be obtained directly from
the ANOVA table:

r² = SSR / SS total = 1 - (SSE / SS total)

The term SSR/SS total is the portion of the variation in Y
explained by the independent variable, X. Note the effect of
the SSE term on r²: as SSE decreases, r² increases; and as SSE
grows, r² falls.

The standard error of estimate, SEE, can also be obtained from
the ANOVA table using the following equation:

SY.X = √[SSE / (n - 2)] = √MSE
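These ANOVA quantities follow directly from the example's data; a short sketch:

```python
Y  = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]
Yp = [42.6316, 66.3156, 42.6316, 54.4736, 30.7896,
      30.7896, 42.6316, 42.6316, 42.6316, 54.4736]

ybar = sum(Y) / len(Y)
sse = sum((y - yp) ** 2 for y, yp in zip(Y, Yp))   # unexplained variation
sst = sum((y - ybar) ** 2 for y in Y)              # total variation, 1850
ssr = sst - sse                                    # explained variation

print(ssr / sst)   # r^2, ~0.576 = 0.759^2
```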
Multiple Linear Regression
Four assumptions similar to those of the SLR model must
also hold for the MLR model:
1. There exist many values of Y for a given value of Xi; hence
there can be many possible εi for a given Xi. The
distribution of model errors for any level of X is normally
distributed.
2. The errors, εi, are independent of one another.
3. The distributions of possible ε values have equal variances
at each level of X.
4. The means of the dependent variable, Y, for all specified
values of X can be connected with a line (or plane) called
the population regression model.

The (Y, X1, X2) points of three-dimensional space form a slice
(hyperplane) through the data such that ∑(Y - Y’)² is minimized.
This is the same least squares criterion that is used in simple
linear regression.
Basic Model-Building Concepts:
A. Model specification or model identification is the process
of identifying
 Dependent variable.

 Independent variables, and,


 obtaining sample data for all variables (a larger
sample size is better).
B. Model Building:
Development of the mathematical model in which some or
all of the independent variables are used to explain the
variation in the dependent variable.
- Include independent variables for which you have
complete data.
- There is no way to determine whether an independent
variable will be a good predictor variable by analyzing
the individual variable's descriptive statistics, such as the
mean and SD. Instead, we need to look at the
correlation between the independent variable(s) and
the dependent variable, which is measured by the
correlation coefficient.
Is the relationship (measured through the correlation
coefficient) spurious?

We will conduct the test at a chosen significance level, with
df = n - (k + 1), where k is the number of independent
variables.
Individual tSTAT values are calculated (one for each
independent variable against the dependent variable)
for a two-tail test.
Compute the regression equation:

Y’ = b0 + b1X1 + b2X2 + ... + bkXk

This helps in getting a point estimate for Y (estimated sales
price/sales units). It provides an estimated average (mean)
value of Y for the respective values of X1, X2, etc.

Multiple coefficient of Determination (R2)


R² explains the % of variation of Y explained through the linear
relationship with the selected independent variable(s).
However, as we may see later, not all the independent variables
are equally important to the model's ability to explain this
variation.

Model Diagnosis:
Before use of the model (for predictive etc.) to estimate the
sales units (y), there are several questions that should be
answered.
1. Is the model significant?
2. Are the individual variable(s) significant?
3. Is the SD of the model error too large to provide
meaningful results?
4. Is Multicollinearity a problem?
5. Have the regression analysis assumptions been satisfied?

Is the Model Significant?


If the null hypothesis is true and all the slope coefficients are
simultaneously equal to zero, the overall regression model is
not useful for predictive or descriptive purposes.
So it is a question of the overall fit of the MLR model.
The F test is a method for testing whether the regression
model explains a significant proportion of the variation in the
dependent variable (and whether the overall model is
significant). The F statistic for an MLR model is:

F = (SSR / k) / (SSE / (n - k - 1))
Where:
SSR = sum of squares regression = ∑(Y’ - Ȳ)²
SSE = sum of squares error = ∑(Y - Y’)²
n = sample size
k = number of independent variables
df of regression = number of independent variables, k
df of errors = n - k - 1
total df of MLR = n - 1

If F > Fcritical, or if the p-value < α, reject H0.
If we reject H0, we conclude that the regression model does
explain a significant portion of the variation in sales price. Thus,
the overall model is statistically significant. This means at least
one of the regression slope coefficients is significant (not equal
to zero).
Addition of new independent variables always increases R²,
even if these variables have no relationship to the dependent
variable. Therefore, as the number of independent variables is
increased (regardless of the quality of the variables), R² will
always increase. However, each added variable results in
the loss of one df. This is viewed as part of the cost of adding
the specified variable. The addition to R² may not justify the loss
of the df. The adjusted R² (R²A) takes this cost into account and
adjusts the value accordingly; R²A will always be less than R².
When a variable is added that does not contribute its fair share
to the explanation of the variation in the dependent variable,
the R²A value may actually decline, even though R² will always
increase. The adjusted R² is a particularly important measure
when the number of independent variables is large relative to
the sample size.
Are the independent variables significant?
The overall model being significant means at least one
independent variable explains a significant proportion of the
variation in sales volume/price. This does not mean that all the
variables are significant, however. To determine which variables
are significant, we test the following hypotheses:

H0: βj = 0; H1: βj ≠ 0, for each j

We test the significance of each independent variable using
significance level α = 0.05; the calculated t values should be
compared to the critical t value with n - k - 1 df, which is
approximately t0.025 = 1.97 for a large sample.

t = bj / s_bj

where:
bj is the jth sample slope coefficient, and
s_bj is the estimate of the standard error of the jth sample
slope coefficient.

When a regression is to be used for prediction, the model
should contain no insignificant variables. If insignificant
variables are present, they should be dropped and a new
regression equation obtained before the model is used for
prediction purposes.
Is the SD of the Regression Model too large?
The SD of the regression model (also called the standard error
of the estimate) measures the dispersion of observed sales
values, Y, around the values predicted by the regression model.

Standard Error of the Estimate, Se = √[SSE / (n - k - 1)]

where SSE is the sum of squares error (residual).

Sometimes, even though a model has a high R², the standard
error of the estimate will be too large to provide adequate
precision for the confidence and prediction intervals. A rule of
thumb is to examine the range ±2Se around the mean
predicted value. If this range is acceptable from a practical
viewpoint, the standard error of the estimate might be
considered acceptable.

Is multicollinearity a problem?
Multicollinearity - a high correlation between the independent
variables such that the two variables contribute redundant
information to the model.
Some of the obvious problems and indications of severe
multicollinearity:
i) Unexpected/ incorrect signs on the coefficients.
ii) A sizeable change in the value of the previously
estimated coefficients when a new variable is added to
the model
iii) The estimate of the SD of the model error increases
when a variable is added to the model.
iv) Low t-values for significant variables.
One measure of multicollinearity is the Variance Inflation
Factor (VIF):

VIFj = 1 / (1 - Rj²)

where Rj² is the coefficient of determination when the jth
independent variable is regressed against the remaining k - 1
independent variables.
VIFj > 5 implies that the correlation between the independent
variables is too extreme and should be dealt with by dropping
the variable from the model.
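A one-line Python sketch of the VIF computation (the Rj² input is illustrative):

```python
def vif(r2_j: float) -> float:
    # r2_j: R^2 from regressing the j-th independent variable
    # on the remaining k - 1 independent variables
    return 1 / (1 - r2_j)

print(vif(0.85))  # ~6.67 > 5, so this variable would be a collinearity concern
```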
Confidence Interval Estimation for the Regression Coefficients.
The regression coefficients, being point estimates, are subject
to sampling error.
Confidence interval estimate for the regression slope:

bj ± t·s_bj (with n - k - 1 df)
Important Checks of Health of Regression


A. Overall fit of the MLR model.
The F test is a method for testing whether the regression
model explains a significant proportion of the variation in the
dependent variable (and whether the overall model is
significant). The F statistic for an MLR model is:

F = (SSR / k) / (SSE / (n - k - 1))

Where:
SSR = sum of squares regression = ∑(Y’ - Ȳ)²
SSE = sum of squares error = ∑(Y - Y’)²
n = sample size
k = number of independent variables
df of regression = number of independent variables, k
df of errors = n - k - 1
total df of MLR = n - 1

If F > Fcritical, or if the p-value < α, reject H0.
 Multiple Coefficient of Determination (R²): reports the
proportion of total variation in Y explained by all X
variables taken together.
 The coefficient of determination is the portion of the
total variation in the dependent variable that is explained
by variation in the independent variables; it is also called
R-squared and is denoted as R².

R² = SSR / SST = regression sum of squares / total sum of squares

Note: 0 ≤ R² ≤ 1

B. We test the significance of each independent variable using
significance level α = 0.05; the calculated t values should
be compared to the critical t value with n - k - 1 df, which
is approximately t0.025 = 1.97.

C. One measure of multicollinearity is the Variance Inflation
Factor (VIF):

VIFj = 1 / (1 - Rj²)

where Rj² is the coefficient of determination when the jth
independent variable is regressed against the remaining k - 1
independent variables.

VIFj > 5 implies that the correlation between the independent
variables is too extreme and should be dealt with by dropping
the variable from the model.
D. The errors should be Random / pure white noise
There should be no correlation in the errors.
Autocorrelation is correlation of the errors (residuals) over time.
The Durbin-Watson statistic is used to test for autocorrelation.
The Durbin-Watson Statistic
H0: ρ = 0 (no autocorrelation)
H1: autocorrelation is present

The Durbin-Watson test statistic:

d = ∑(t=2 to n) (e_t - e_(t-1))² / ∑(t=1 to n) e_t²

 The possible range is 0 ≤ d ≤ 4

 d should be close to 2 if H0 is true


 d less than 2 may signal positive autocorrelation,
 d greater than 2 may signal negative autocorrelation
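A sketch of the statistic on a made-up residual series:

```python
def durbin_watson(e):
    # d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

residuals = [1.2, -0.8, 0.5, -1.1, 0.9, -0.4]  # hypothetical residuals over time
print(durbin_watson(residuals))  # ~3.1; above 2 hints at negative autocorrelation
```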

1. The Durbin-Watson statistic d should be close to 2.
2. The Variance Inflation Factor (VIF), the measure of
multicollinearity, should be < 5.
For a 95% confidence level (5% significance level):
3. For significance of an independent variable, |t| ≥ 2.
4. For overall fit of the model (with n ≥ 10), F should be
greater than 5.
