
ERRORS AND LIMITATIONS ASSOCIATED WITH REGRESSION AND CORRELATION ANALYSIS
Aivaz Kamer-Ainur
Mirea Marioara
Ovidius University of Constanta, Faculty of Economics Sciences, Dumbrava Rosie St. 5, code 900613,
E-mail: elenacondrea2003@yahoo.com
Abstract
This paper describes the main errors and limitations associated with the methods of regression and correlation
analysis. These methods have been developed specifically to study statistical relationships in data series.
Key Words: assumptions, linear regression, linear correlation, multiple regression, multiple correlation.

Introduction
Regression analysis is concerned with developing the linear regression equation by which the value of a
dependent variable Y can be estimated given a value of an independent variable X.
If simple regression analysis is used, the assumptions for this technique should be satisfied. The assumptions
required to develop the linear regression equation and to estimate the value of the dependent variable by point
estimation are:
1. The relationship between the two variables is linear.
2. The values of the independent variable are set at various fixed levels, while the dependent variable is a
random variable.
3. The conditional distributions of the dependent variable have equal variances.
If any interval estimation or hypothesis testing is done, additional required assumptions are:
1. Successive observations of the dependent variable are uncorrelated.
2. The conditional distributions of the dependent variable are normal distributions.
The scatter diagram is a graph that portrays the relationship between the two variables and can be used to
observe whether there is general compliance with the assumptions underlying regression analysis. An alternative
graph to determine such compliance is the residual plot, which is a plot of the residuals $e = (Y - \hat{Y})$ with
respect to the fitted values $\hat{Y}$. The mathematical criterion generally used to determine the linear regression
equation is the least squares criterion by which the sum of the squared deviations between the actual and
estimated values of the dependent variable is minimized. The standard error of estimate $s_{y,x}$ is the measure of
variability, or scatter, with respect to the regression line. It is used to establish prediction intervals for the
dependent variable.
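The least squares criterion and the standard error of estimate can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names and the five data points are invented for the example, and the divisor n - 2 is the usual textbook choice for simple regression.

```python
import math

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def standard_error_of_estimate(x, y, b0, b1):
    """Return (s_yx, residuals), where s_yx = sqrt(sum(e^2) / (n - 2))."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(x) - 2)), residuals

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
s_yx, residuals = standard_error_of_estimate(x, y, b0, b1)
print(f"Y-hat = {b0:.2f} + {b1:.2f} X, s_yx = {s_yx:.3f}")
```

Plotting `residuals` against the fitted values gives the residual plot described above; a funnel shape or a curve in that plot would suggest unequal variances or non-linearity.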
Interval estimation of the conditional mean of the dependent variable is based on use of the standard error of the
conditional mean $s_{\bar{y},x}$. The standard error of forecast $s_{y(next)}$ is used to construct a complete
prediction interval for an individual value of the dependent variable. By complete, we mean that the uncertainty
regarding the value of the conditional mean is considered in addition to the uncertainty represented by the scatter
with respect to the regression line. When the sample is relatively large, approximate prediction intervals based on
use of only the standard error of estimate are considered acceptable. A final area of inference is interval
estimation and hypothesis testing concerning the slope $\beta_1$ of the linear regression model.
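For reference, the three standard errors discussed above are usually written as follows in textbook treatments of simple regression. The paper does not state these formulas, so the notation is an assumption: $n$ is the sample size, $\bar{X}$ the sample mean of the independent variable, and $X_0$ the value at which the estimate is made.

```latex
s_{y,x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}
\qquad
s_{\bar{y},x} = s_{y,x}\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}
\qquad
s_{y(next)} = \sqrt{s_{y,x}^2 + s_{\bar{y},x}^2}
```

As $n$ grows, the standard error of the conditional mean shrinks toward zero, so the standard error of forecast approaches the standard error of estimate. This is why the approximate interval is considered acceptable for large samples.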
The most frequently used measure of relationship for sample data is the sample correlation coefficient r. The
sign of the correlation coefficient indicates the nature (direct or inverse) of the relationship between the two
variables, while the absolute value of the correlation coefficient indicates the extent of the relationship. The
coefficient of determination $r^2$ indicates the proportion of variance in the dependent variable that is explained
statistically by knowledge of the independent variable (and vice versa). The null hypothesis most frequently
tested in correlation analysis is that the population correlation is zero, represented by $\rho = 0$. Rejection of this
hypothesis leads to the conclusion that $\rho \neq 0$ and that the two variables are related.
The value of the dependent variable cannot be legitimately estimated if the value of the independent variable is
outside the range of values in the sample data that served as the basis for determining the linear regression
equation. There is no statistical basis to assume that the linear regression model applies outside of the range of
the sample data.
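One way to enforce this limitation in practice is to refuse estimates outside the sampled range. The helper below is a hypothetical sketch (the function name and the coefficients in the usage line are invented), not part of any statistical library.

```python
def estimate_within_range(x_new, x_sample, b0, b1):
    """Return b0 + b1*x_new only if x_new lies inside the sampled range of X."""
    lo, hi = min(x_sample), max(x_sample)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"X = {x_new} is outside the sampled range [{lo}, {hi}]; "
            "the fitted line has no statistical basis there"
        )
    return b0 + b1 * x_new

# Hypothetical sample and coefficients, for illustration only.
x_sample = [1, 2, 3, 4, 5]
print(estimate_within_range(3, x_sample, b0=2.2, b1=0.6))
```

Calling the function with a value such as `x_new=10` raises `ValueError` instead of silently extrapolating.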

If the estimate of the dependent variable in fact concerns prediction, the historical data used to determine the
regression equation might not be appropriate to represent future relationships. Unfortunately, one can only
sample past data, not future data.
The standard error of estimate is by itself not a complete basis for constructing prediction intervals, because the
uncertainty concerning the accuracy of the regression equation, and specifically of the conditional mean, is not
considered. The standard error of forecast is the complete measure of variability. However, when the sample is
large, use of the standard error of estimate is generally considered acceptable.
If correlation analysis is used, all the assumptions for this technique should be satisfied. These assumptions are:
1. The relationship is linear.
2. Both variables are random variables.
3. For each variable, the conditional distributions have equal variances.
4. Successive observations are uncorrelated for both variables.
5. The joint distribution is a bivariate normal distribution.
A significant correlation does not necessarily indicate causation, but rather may indicate a common linkage in a
sequence of events. One type of significant correlation situation is when both variables are influenced by a
common cause and therefore are correlated with one another. For example, individuals with a higher level of
income have both a higher level of savings and a higher level of spending. We might therefore find that there is
a positive relationship between level of savings and level of spending, but this does not mean that one of these
variables causes the other. Another type of situation is one in which two related variables are separated by several
steps in a cause-effect chain of events. An interesting example in the medical field is the following sequence of
events: warm winter, appearance of viruses, and outbreak of the flu.
The warm climatic conditions are not themselves the cause of the flu; rather, these conditions are several
steps removed in the cause-effect sequence. For many years, the climatic conditions themselves were thought to
be the cause, and so this disease was called flu.
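The common-cause situation can be illustrated with a small simulation. Everything in this sketch is an assumption made for the illustration: the income distribution, the savings and spending rates, and the noise levels are invented, and savings and spending are generated so that neither influences the other.

```python
import math
import random

def pearson_r(a, b):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
income = [rng.gauss(50, 10) for _ in range(5000)]
# Savings and spending each depend on income only, never on each other.
savings = [0.2 * inc + rng.gauss(0, 1) for inc in income]
spending = [0.7 * inc + rng.gauss(0, 3) for inc in income]

r_sav_spend = pearson_r(savings, spending)
r_partial = partial_r(
    r_sav_spend,
    pearson_r(savings, income),
    pearson_r(spending, income),
)
print(f"r(savings, spending) = {r_sav_spend:.3f}")
print(f"partial r given income = {r_partial:.3f}")
```

The simple correlation between savings and spending should come out strongly positive, while the partial correlation that holds income constant should be close to zero, which is the pattern expected when a common cause produces the relationship.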
A significant correlation is not necessarily an important correlation. There is much confusion regarding the
meaning of significant in the popular press. It is usually implied that a relationship that is significant is also
thereby important. However, from the statistical point of view, a significant correlation simply indicates that a
true relationship exists and that the correlation coefficient for the population is different from 0. Significance
is necessary to conclude that a relationship exists, but the coefficient of determination $r^2$ is more useful in
judging the importance of the relationship.
Given a very large sample, a correlation of, say, $r = 0.10$ can be significantly different from 0 at $\alpha = 0.05$. Yet
the coefficient of determination of $r^2 = 0.01$ from this example indicates that only 1 percent of the variance of
the dependent variable is statistically explained by knowledge of the independent variable.
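The gap between significance and importance can be checked numerically. The sketch below uses the standard test statistic for the hypothesis that the population correlation is zero, t = r·sqrt(n - 2)/sqrt(1 - r²). Two things are assumptions of the sketch rather than statements from the paper: the sample size n = 1000, and the use of the standard normal quantile as a large-sample approximation to the t critical value.

```python
import math
from statistics import NormalDist

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n, alpha = 0.10, 1000, 0.05  # n = 1000 is a hypothetical sample size
t = correlation_t_statistic(r, n)

# For large n the t distribution is close to the standard normal,
# so the normal quantile is used here as an approximation.
critical = NormalDist().inv_cdf(1 - alpha / 2)
significant = abs(t) > critical
r_squared = r ** 2

print(f"t = {t:.2f}, critical value = {critical:.2f}, significant: {significant}")
print(f"proportion of variance explained: {r_squared:.2%}")
```

With these inputs the test statistic exceeds the critical value, so the correlation is statistically significant, yet the explained share of variance is only 1 percent.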
In a comparison of simple linear regression analysis and correlation analysis, the principal difference in the
assumptions is that in regression analysis there is one random variable, while in correlation analysis, both
variables have to be random. The sampling design for a study therefore should consider the analysis to be
performed. Regression analysis is used when the main objective is to estimate values of the dependent variable,
whereas correlation analysis is used when the main objective is to measure and express the degree of
relationship between the two variables. When both variables are random variables, either regression analysis or
correlation analysis, or both, can usually be applied to the data.
In multiple regression analysis the value of the dependent variable is estimated on the basis of known values of
two or more independent variables, while the extent of the relationship between the independent variables taken
as a group and the dependent variable is measured in multiple correlation analysis. For multiple regression
analysis the principal assumptions are:
1. The relationship can be represented by a linear model.
2. The dependent variable is a continuous random variable.
3. The variances of the conditional distributions of the dependent variable are all equal (homoscedasticity).
4. Successive observed values of the dependent variable are uncorrelated.
5. The conditional distributions of the dependent variable are all normal distributions.
On the general level, the assumptions associated with multiple regression and multiple correlation analysis have
to be satisfied if the results are to be meaningful. As in simple analysis involving one independent variable, the
assumptions of linearity and equality of conditional variances can be investigated by obtaining a residual plot
based on the multiple regression model.
A specific area of concern when there are several independent variables is the possible existence of
multicollinearity. This term describes the situation in which two or more independent variables are highly
correlated with one another. Under such conditions, the meaning of the partial regression coefficients in the
multiple regression equation is unclear. Similarly, the meaning of the coefficients of partial correlation is
unclear: the partial correlation with a given independent variable may be highly negative even though the simple
correlation is highly positive. One procedure sometimes used to handle the problem of multicollinearity is to
eliminate one of two highly correlated independent variables from the analysis, recognizing that the two
variables essentially are measuring the same factor. When correlated variables must be included in the analysis,
care must be taken in ascribing practical meaning to the partial regression coefficients and to the coefficients of
partial correlation. However, multicollinearity causes no special problem for inferences associated with the
overall regression model, such as the F test for the significance of the regression effect, confidence intervals for
the mean of the dependent variable, and prediction intervals for individual values of the dependent variable. Of
course, for any interval estimates the values of the independent variables should be within the ranges of values
included in the sample data.
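A simple screening step for multicollinearity is to compute the correlation between every pair of independent variables and flag the pairs that are highly correlated. The sketch below is illustrative only: the function names, the example data, and the cutoff of 0.9 are assumptions, since the paper does not give a numerical threshold for "highly correlated".

```python
import math
from itertools import combinations

def pearson_r(a, b):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def highly_correlated_pairs(predictors, threshold=0.9):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical predictors: x2 is nearly a multiple of x1, x3 is unrelated.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "x3": [5, 1, 4, 2, 6, 3],
}
flagged = highly_correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.3f})")
```

A flagged pair is a candidate for dropping one of its two variables, in line with the procedure described above. Pairwise screening cannot detect a variable that is nearly a combination of several others, so it should be read as a first check only.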
Another area of specific concern in multiple regression and multiple correlation analysis is the possibility that
successive observed values of the dependent variable are correlated rather than uncorrelated. The existence of
such a correlation is called autocorrelation. The assumption that the successive values of the dependent variable
are uncorrelated has already been identified as a principal assumption in simple regression and simple
correlation analysis. However, in simple analysis the existence of such a correlation is easier to observe than in
multiple analysis. Typically, autocorrelation occurs when values of the dependent variable are collected as time
series values, that is, when they are collected in a series of time periods. When successive values of the
dependent variable are correlated, the point estimate of the dependent variable based on the multiple
regression equation is not affected. However, the standard error associated with each partial regression
coefficient $b_k$ is understated, and the value of the standard error of estimate is understated. The result is that the
prediction and confidence intervals are narrower (apparently more precise) than they should be, and null hypotheses
concerning the absence of relationship are rejected too frequently. In terms of correlation analysis, the
coefficients of multiple determination and multiple correlation are both overstated in value.
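A rough check for autocorrelation is to correlate each residual of the fitted model with the residual that follows it in time. The sketch below uses a common lag-1 estimator; the paper does not prescribe this estimator, and the function name and both residual series are invented for the illustration.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation: each residual against the one that follows it."""
    n = len(residuals)
    mean_e = sum(residuals) / n
    num = sum(
        (residuals[t] - mean_e) * (residuals[t + 1] - mean_e)
        for t in range(n - 1)
    )
    den = sum((e - mean_e) ** 2 for e in residuals)
    return num / den

# Hypothetical residuals in time order: the first series drifts slowly
# from negative to positive, the second alternates in sign.
drifting = [-2.0, -1.5, -1.0, -0.2, 0.4, 1.1, 1.6, 1.6]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(f"drifting:    {lag1_autocorrelation(drifting):.3f}")
print(f"alternating: {lag1_autocorrelation(alternating):.3f}")
```

A value near zero is consistent with uncorrelated successive observations. A clearly positive value, as in the drifting series, is the typical time-series case in which the standard errors described above are understated.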

