
GEOG 303: Notes on Correlation and Regression

Levels of measurement

In statistics and quantitative research methodology, various attempts have been made to classify
variables (or types of data) and thereby develop a taxonomy of levels of measurement or scales
of measure. The various levels of measurement are as shown below:
Nominal Scale/Data
This scale categorizes and differentiates items based only on their names or other qualitative
classifications they belong to. The items cannot be ranked. E.g. names of towns in Ghana, days
of the week, religion. Variables measured on this scale are called categorical variables.
Ordinal Scale/Data
The data on this scale can be ranked, but the intervals between ranks are not necessarily equal.
E.g. positions in a beauty contest; level of education.
Interval Scale/Data
The items can be ranked and the intervals are equal, but the position of zero is arbitrary.
Examples include temperature on the Celsius scale, which has an arbitrarily defined zero point
(the freezing point of a particular substance under particular conditions), date when measured
from an arbitrary epoch (such as AD) and direction measured in degrees from true or magnetic
north.
Ratio Scale/Data
Items on this scale can be ranked, their intervals are equal, and there is a true zero.
E.g. income, length, number of years of schooling, age.
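The difference between an interval scale and a ratio scale can be checked numerically. The sketch below is illustrative only: the two temperature readings are made-up values and the helper name `celsius_to_kelvin` is an assumption, not something taken from these notes. Because 0 on the Celsius scale is an arbitrary zero, 20 degrees is not "twice as hot" as 10 degrees, even though the difference of 10 degrees is meaningful.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return celsius + 273.15

# Hypothetical readings, chosen only for illustration.
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are only meaningful on a scale with a true zero.
ratio_c = t2_c / t1_c                                        # misleading "2.0"
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)  # about 1.035

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The ratio computed on the Kelvin scale is only about 1.035, which is why ratios should be interpreted only for ratio-scale data.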

TYPES OF STATISTICAL ANALYSIS


DESCRIPTIVE STATISTICS: This involves the use of rates, percentages and graphs to
represent data collected on a sample. E.g. the use of frequency tables; computation of mean,
mode and median.
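
As a small illustration of these descriptive measures, the sketch below applies Python's standard `statistics` module to the eight hospital counts from the trial question later in these notes. The variable names are assumptions made for the example.

```python
import statistics

# Number of hospitals in regions A to H (from the trial question).
hospitals = [3, 3, 4, 5, 7, 8, 9, 11]

mean_value = statistics.mean(hospitals)      # arithmetic mean
median_value = statistics.median(hospitals)  # middle value of the sorted data
mode_value = statistics.mode(hospitals)      # most frequent value

print(mean_value, median_value, mode_value)  # 6.25 6.0 3
```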

INFERENTIAL STATISTICS: This involves the application of statistical techniques (e.g. chi-square tests) to sample data in order to draw conclusions about population parameters.

DEPENDENT AND INDEPENDENT VARIABLES: Dependent variables are those that are
only measured or registered, whereas independent variables are those that are manipulated
(or allowed to vary) to influence the outcome of the dependent variables. For instance, if we
want to measure the effect of years of schooling on salaries, then salary is the dependent
variable while years of schooling is the independent variable.

CORRELATION AND REGRESSION


Correlation is a measure of the strength and direction of the relationship between two
quantifiable variables. Positive correlation denotes a relationship in which the two variables
move in the same direction: an increase in one variable is associated with an increase in the
other, and the coefficient of correlation is positive. Negative correlation denotes an inverse
relationship: an increase in one variable is associated with a decline in the other, and the
coefficient of correlation is negative.
Explanation of correlation coefficient
The coefficient of correlation (r) lies between -1 and 1. A positive sign denotes a positive
correlation while a negative sign denotes a negative relationship between the variables. The
strength is explained in terms of the absolute size of the coefficient using the guideline below:
0.0-0.2: Negligible/zero correlation, meaning there is little or no relationship between the two variables
0.21-0.4: Weak correlation, meaning a relationship exists but it is weak
0.41-0.7: Moderate correlation
0.71-0.99: Very strong (high) correlation
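
The guideline above can be written as a small lookup. This is a minimal sketch, not a standard routine: the function name `describe_strength` is hypothetical, values that fall in the gaps between the printed bands (for example 0.205) are assigned to the next band up, and a coefficient of exactly 1 or -1 is grouped with "very strong". Both of those choices are assumptions.

```python
def describe_strength(r):
    """Label a correlation coefficient using the guideline in these notes."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength depends on the magnitude, not the sign
    if size <= 0.2:
        strength = "negligible"
    elif size <= 0.4:
        strength = "weak"
    elif size <= 0.7:
        strength = "moderate"
    else:
        strength = "very strong"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "no direction"
    return f"{strength}, {direction}"

print(describe_strength(0.96))   # very strong, positive
print(describe_strength(-0.35))  # weak, negative
```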

Coefficient of determination
This is the proportion of the variability in the dependent variable that is explained by
variability in the independent variable.
Coefficient of determination = $r^2 \times 100\%$

Regression Analysis

Regression is a technique used to analyse the relationship between two or more variables and how
one variable affects another. It is used to establish an equation linking the variables. There
are various types of regression.
1. Simple Linear Regression: This examines the relationship between two variables (one
dependent and one independent variable), usually measured on the ratio scale. E.g. age
and weight.
2. Multiple Linear Regression: This measures the relationship between one dependent
variable and several independent variables. For instance, one can examine the
relationship between output of maize (dependent variable) and several independent
variables, including soil quality, amount of fertilizer applied and rainfall amount.
3. Logistic Regression: This is used when we are interested in analysing the relationship
between a categorical (typically binary) dependent variable and one or more independent
variables, which may themselves be categorical. E.g. we can analyse the relationship
between modern contraceptive use (dependent or outcome variable) and variables such as
location, marital status and religion.

Trial Question
A. A social scientist is interested in establishing the degree of the relationship between
number of districts and number of hospitals in eight randomly selected administrative
regions in the Republic of Nsutapong. The table below summarizes the data he obtained
from the field.
Region   Number of Districts   Number of Hospitals
A        2                     3
B        4                     3
C        5                     4
D        5                     5
E        6                     7
F        7                     8
G        9                     9
H        10                    11

(a) Calculate the Pearson's Product-Moment Correlation Coefficient (r) between number of
districts and number of hospitals and interpret your answer.
(b) Compute the coefficient of determination and interpret your answer.
(c) Fit a linear regression model for estimating the number of hospitals (y) from a given
number of districts (x).
(d) Using your model or otherwise, find the number of districts in a region with 15 hospitals.

Solution
(a) Let x represent the number of districts while y represents the number of hospitals. To
calculate the correlation we construct the table below:
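x       y       xy      x²      y²
2       3       6       4       9
4       3       12      16      9
5       4       20      25      16
5       5       25      25      25
6       7       42      36      49
7       8       56      49      64
9       9       81      81      81
10      11      110     100     121
Total   48      50      352     336     374

Given that $n = 8$, $\sum x = 48$, $\sum y = 50$, $\sum xy = 352$, $\sum x^2 = 336$ and $\sum y^2 = 374$:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

$$r = \frac{8(352) - (48 \times 50)}{\sqrt{\left[8(336) - (48)^2\right]\left[8(374) - (50)^2\right]}}$$

$$r = \frac{2816 - 2400}{\sqrt{(2688 - 2304)(2992 - 2500)}}$$

$$r = \frac{416}{\sqrt{(384)(492)}}$$

$$r = \frac{416}{434.7}$$

$$r = 0.96$$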

Since r is 0.96, there is a very strong positive correlation between the number of districts and
the number of hospitals. This means that regions with more districts tend to have more
hospitals.
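
The hand calculation in (a) can be cross-checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are assumptions made for the example, and the data are the eight (x, y) pairs from the trial question.

```python
import math

districts = [2, 4, 5, 5, 6, 7, 9, 10]   # x
hospitals = [3, 3, 4, 5, 7, 8, 9, 11]   # y
n = len(districts)

# Column totals of the working table.
sum_x = sum(districts)
sum_y = sum(hospitals)
sum_xy = sum(x * y for x, y in zip(districts, hospitals))
sum_x2 = sum(x * x for x in districts)
sum_y2 = sum(y * y for y in hospitals)

# r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 48 50 352 336 374
print(round(r, 2))                           # 0.96
```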

(b) Coefficient of determination for the data = $r^2 \times 100\%$

Where r = 0.96,
$$r^2 \times 100\% = (0.96)^2 \times 100\%$$
$$= 0.9216 \times 100\%$$
$$= 92.16\% \text{ or } 92.2\%$$
This means that 92.2% of the variations/variability in the number of hospitals is accounted for or
explained by the variability in the number of districts, leaving the remaining 7.8% of the
variations in the number of hospitals to be accounted for by factors other than the number of
districts.
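
The figure in (b) depends slightly on when r is rounded. The sketch below, which assumes the column totals from the table in (a), computes the coefficient of determination both ways: once with r rounded to 0.96 first (as in the hand calculation) and once carrying full precision. The variable names are assumptions made for the example.

```python
import math

# Totals taken from the working table in (a).
n, sum_x, sum_y = 8, 48, 50
sum_xy, sum_x2, sum_y2 = 352, 336, 374

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Rounding r to two decimals first, as in the hand calculation.
rounded_percent = round(r, 2) ** 2 * 100
# Carrying full precision through to the end.
exact_percent = r ** 2 * 100

print(round(rounded_percent, 2))  # 92.16
print(round(exact_percent, 1))    # 91.6
```

Either way the interpretation is the same: a little over nine-tenths of the variability in the number of hospitals is accounted for by the number of districts.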

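(c) Regression Analysis
The equation of a simple linear regression line is given as $y = a + bx$, where

$$b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

From the table in (a), we know that $n = 8$, $\sum xy = 352$ and $\sum x^2 = 336$.

$\bar{x} = 48/8 = 6$

$\bar{y} = 50/8 = 6.25$

By substitution,

$$b = \frac{352 - 8(6 \times 6.25)}{336 - 8(6)^2}$$

$$b = \frac{352 - 8(37.5)}{336 - 8(36)}$$

$$b = \frac{352 - 300}{336 - 288}$$

$$b = \frac{52}{48} = 1.08$$

Given b as 1.08,

$$a = 6.25 - (1.08 \times 6)$$

$$a = 6.25 - 6.48$$

$$a = -0.23$$

Since the equation of a simple linear regression line is given as $y = a + bx$, substituting the
derived values into the equation gives

$$y = -0.23 + 1.08x$$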
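
The slope and intercept in (c) can be verified with a short script. This is a sketch under the assumption that the totals from the table in (a) are correct; the variable names are chosen for the example. It also shows how the intercept changes depending on whether b is rounded before a is computed.

```python
# Totals from the working table in (a).
n = 8
sum_x, sum_y = 48, 50
sum_xy, sum_x2 = 352, 336

mean_x = sum_x / n  # 6.0
mean_y = sum_y / n  # 6.25

# Slope and intercept of the least-squares line y = a + b*x.
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
a = mean_y - b * mean_x

# Intercept obtained when b is rounded to 2 decimals first (hand calculation).
a_rounded_b = mean_y - round(b, 2) * mean_x

print(round(b, 4), round(a, 4))  # 1.0833 -0.25
print(round(a_rounded_b, 2))     # -0.23
```

With full precision the intercept is -0.25; the value -0.23 arises from rounding b to 1.08 before substituting. The difference is small, and either form rounds to the same predictions for this data.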

(d) Using the model, find the number of districts in a region with 15 hospitals.
We substitute y = 15 into the equation:
15 = -0.23 + 1.08x
15.23 = 1.08x
x = 15.23/1.08
x = 14.1, so a region with 15 hospitals is estimated to have about 14 districts.
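
The rearrangement in (d) amounts to solving y = a + bx for x. A minimal sketch follows; the function name `districts_for_hospitals` is hypothetical, and its default coefficients are the rounded values from (c).

```python
def districts_for_hospitals(y, a=-0.23, b=1.08):
    """Solve y = a + b*x for x using the fitted coefficients from (c)."""
    return (y - a) / b

x_estimate = districts_for_hospitals(15)
print(round(x_estimate, 1))  # 14.1
print(round(x_estimate))     # 14
```

Using the unrounded coefficients (a = -0.25, b = 52/48) gives about 14.08, which also rounds to 14 districts.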
