
SPSS: Statistical Methods in Research
13 – 14 January 2011

Facilitator: Mr. Chang Yun Fah
Department of Mathematical and Actuarial Sciences,
Universiti Tunku Abdul Rahman
Self Introduction
• Name
• Department and nature of work
• Experience in using SPSS or any statistical tool
• Expectations from this program

6/9/2016 2
Contact Information
Objective
Method
Machinery
Agenda
Norm
Development of SPSS

6/9/2016 3
Contact Information
• Mr. Chang Yun Fah
• E-mail: changyf@utar.edu.my
• Tel: 03-41079802
• Department of Mathematical and Actuarial
Sciences, Faculty of Engineering and
Science

6/9/2016 4
Objective
• Use the analytical functions of SPSS.
• Process data and generate statistics for ANOVA and MANOVA tests.
• Process data and generate statistics for linear regression analysis and logit analysis.
• Process data and generate statistics for principal component analysis and factor analysis.
• Process data and generate statistics for clustering analysis.
• Process data and generate statistics for discriminant analysis.

6/9/2016 5
Method
Method of Learning:
• Presentation
• Case study
• Exercise
• Discussion

6/9/2016 6
Machine & Software
• SPSS 14.0 for Windows
• Student package
• Release 14.0.0
• Needed a downloadable hotfix to be installed in order to be compatible with Windows Vista.

6/9/2016 7
Agenda
Day 1:
• One-Way, Two-Way and Three-Way ANOVA
• Linear Regression Analysis and Logit Analysis
Day 2:
• Principal Component Analysis and Factor Analysis
• Clustering Analysis
• Discriminant Analysis

6/9/2016 8
Navigate
Steps, from first to last:
1. Problem analysis
2. Data collection
3. Tools selection
4. Statistical analysis
5. Interpretation
6. Writing report

6/9/2016 9
Development
Release history
• SPSS 15.0.1 - November 2006
• SPSS 16.0.2 - April 2008
• SPSS Statistics 17.0.1 - December 2008
• PASW Statistics 17.0.3 - September 2009
• PASW Statistics 18.0 - August 2009
• PASW Statistics 18.0.1 - December 2009
• PASW Statistics 18.0.2 - April 2010
6/9/2016 10
One-Way, Two-Way and Three-Way ANOVA

6/9/2016 11
One-Way ANOVA
• Analysis of variance (ANOVA) is an extension of the independent-samples t-test (one independent variable/factor with 2 levels/groups).
• It deals with an experiment that involves one dependent variable and one factor/independent variable.
• It compares the means of 2 or more levels/treatments of the factor.

6/9/2016 12
Completely Randomized Design
• In general, there will be a levels of the factor (a treatments) and n replicates of the experiment, run in random order.
• The objective is to test hypotheses about the equality of the a treatment means.
• N = a × n total runs.

Fixed Effect Model:

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, n$$

where μ is the overall mean, τ_i is the i-th treatment effect, and ε_ij is the experimental error, NID(0, σ²).

H0: τ1 = τ2 = … = τa = 0
H1: τi ≠ 0 for at least one i.
6/9/2016 13
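ANOVA table:

Source of variation         Sum of Squares   DF      MS                  F0
Between treatments          SST              a − 1   MST = SST/(a − 1)   F0 = MST/MSE
Error (within treatments)   SSE              N − a   MSE = SSE/(N − a)
Total                       SSTO             N − 1

where

$$SS_T = n \sum_{i=1}^{a} (\bar{y}_{i.} - \bar{y}_{..})^2 = \frac{1}{n} \sum_{i=1}^{a} y_{i.}^2 - \frac{y_{..}^2}{N}$$

$$SS_{TO} = \sum_{i=1}^{a} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{a} \sum_{j=1}^{n} y_{ij}^2 - \frac{y_{..}^2}{N}$$

$$SS_E = SS_{TO} - SS_T$$

Here y_{i.} and y_{..} are the treatment and grand totals, and the bars denote the corresponding means.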

6/9/2016 14
Exercise:

• An engineer is interested in investigating the relationship between the RF power setting and the etch rate for this tool. The objective of an experiment like this is to model the relationship between etch rate and RF power, and to specify the power setting that will give a desired target etch rate. She is interested in a particular gas (C2F6) and gap (0.80 cm), and wants to test four levels of RF power: 160W, 180W, 200W, and 220W. She decided to test five wafers at each level of RF power.

6/9/2016 15
Observations
Power (W) 1 2 3 4 5 Total Average
160 575 542 530 539 570 2756 551.2
180 565 593 590 579 610 2937 587.4
200 600 651 610 637 629 3127 625.4
220 725 700 715 685 710 3535 707.0

1. Identify the dependent variable and the factor (independent variable).
   Y = etch rate, X = RF power

2. How many levels/treatments/groups does the factor have?
   4 levels: 160, 180, 200, 220

6/9/2016 16

3. Find the number of replicates.
   n = 5

4. Create an SPSS data file from this table (‘Data14’).


Power: 160 160 160 160 160 180 180 180 180 180 200 200 200 200 200 220 220 220 220 220
EtchR: 575 542 530 539 570 565 593 590 579 610 600 651 610 637 629 725 700 715 685 710
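
The sums of squares in the ANOVA table can be checked by hand for the etch-rate data. The following is a minimal Python sketch, not SPSS output; the function name `one_way_anova` and the dictionary layout are illustrative choices, and it applies the computing formulas for SST, SSTO and SSE given in the ANOVA table.

```python
# Etch-rate observations from the table above, keyed by RF power (W).
etch_rate = {
    160: [575, 542, 530, 539, 570],
    180: [565, 593, 590, 579, 610],
    200: [600, 651, 610, 637, 629],
    220: [725, 700, 715, 685, 710],
}


def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a dict of treatment -> observations."""
    all_obs = [y for obs in groups.values() for y in obs]
    big_n = len(all_obs)                     # N, total number of runs
    a = len(groups)                          # number of treatments
    correction = sum(all_obs) ** 2 / big_n   # y..^2 / N
    ss_total = sum(y * y for y in all_obs) - correction
    ss_treat = sum(sum(obs) ** 2 / len(obs) for obs in groups.values()) - correction
    ss_error = ss_total - ss_treat
    ms_treat = ss_treat / (a - 1)
    ms_error = ss_error / (big_n - a)
    return {
        "ss_treat": ss_treat,
        "ss_error": ss_error,
        "ss_total": ss_total,
        "ms_treat": ms_treat,
        "ms_error": ms_error,
        "f0": ms_treat / ms_error,
    }


result = one_way_anova(etch_rate)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A large F0 relative to the F distribution with (a − 1, N − a) degrees of freedom is evidence against H0; the p-value itself is left to SPSS here.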

6/9/2016 17
Open SPSS file ‘Data1’: Summaries for personal savings and personal income based on ethnic group. What is your ‘conclusion’? Are they statistically different?

Descriptives: Personal savings (thousand)

                                                   95% Confidence Interval for Mean
            N    Mean    Std. Deviation  Std. Error  Lower Bound  Upper Bound  Minimum  Maximum
Malay       11   22.55   15.076          4.545       12.42        32.67        11       51
Chinese     15   27.73   8.514           2.198       23.02        32.45        21       56
Indian      7    37.14   8.783           3.320       29.02        45.27        32       56
Foreigner   6    40.83   14.386          5.873       25.74        55.93        12       49
Total       39   29.97   13.114          2.100       25.72        34.23        11       56

Descriptives: Personal income

                                                          95% Confidence Interval for Mean
            N    Mean         Std. Deviation  Std. Error  Lower Bound  Upper Bound  Minimum  Maximum
Malay       11   $1,011.8182  $438.01411      $132.066    $717.5563    $1,306.0801  $650.00  $2100.00
Chinese     15   $1,127.3333  $340.95384      $88.03390   $938.5194    $1,316.1473  $600.00  $1600.00
Indian      7    $1,138.5714  $367.48826      $138.898    $798.7015    $1,478.4414  $600.00  $1670.00
Foreigner   6    $1,238.3333  $406.91113      $166.121    $811.3063    $1,665.3604  $590.00  $1700.00
Total       39   $1,113.8462  $376.92394      $60.35614   $991.6615    $1,236.0308  $590.00  $2100.00

6/9/2016 18
Rearrange the personal savings and ethnic group from Questionnaire 1 (Data1) into the completely randomized design format:

Ethnic  Psavings
3       32
4       45
2       34
2       56
3       32
2       26
2       23
2       27
3       38
2       21
4       48
4       49
4       12
2       27
1       11
1       43
1       18
1       19
1       11
2       24

            Observations
Treatment   1    2    3    4    5    6    7    8
1           11   43   18   19   11
2           34   56   26   23   27   21   27   24
3           32   32   38
4           45   48   49   12
6/9/2016 19
Analyze menu → Compare Means → One-Way ANOVA
1. Move the dependent variable to ‘Dependent List’.
2. Move the independent variable to ‘Factor’.
3. Click ‘Options’ and select the Statistics and Means plot.
4. Click Continue, then OK.

6/9/2016 20
Open SPSS file ‘Data1’: Conduct a One-Way ANOVA using Personal savings as the dependent variable and Ethnic as the factor. Use the LSD test for multiple comparisons.

H0: there is no difference in the amounts of savings among ethnic groups.
H1: at least one ethnic group has different amounts of savings than the other ethnic groups.

ANOVA: Personal savings (thousand)

                 Sum of Squares   df   Mean Square   F       Sig.
Between Groups   1749.623         3    583.208       4.266   .011
Within Groups    4785.351         35   136.724
Total            6534.974         38

• Between Groups: treatment sum of squares (SST)
• Within Groups: error sum of squares (SSE)
• Total: total sum of squares (SSTO)
• p-value < 0.05, reject H0
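
The Mean Square and F columns follow directly from the sums of squares and degrees of freedom in the table. A small Python sketch of that arithmetic, using only the figures printed above (the helper name `f_ratio` is an illustrative choice):

```python
def f_ratio(ss_between, df_between, ss_within, df_within):
    """Return (MST, MSE, F0) from the sums of squares and degrees of freedom."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Figures copied from the ANOVA table for personal savings.
ms_b, ms_w, f0 = f_ratio(1749.623, 3, 4785.351, 35)
print(f"MST = {ms_b:.3f}, MSE = {ms_w:.3f}, F0 = {f0:.3f}")
```

The significance value (.011) comes from the F distribution with 3 and 35 degrees of freedom and is not recomputed in this sketch.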
6/9/2016 21
Writing it in report 1!

Preliminary analysis of the data revealed that an average worker had savings totaling RM29970, but the average savings for each of the four ethnic groups and countries of origin seemed to be different. The means plot shows that foreign workers had the highest savings and the Malay workers the lowest. Therefore, one of the appropriate hypotheses in this study is that there is a significant difference in the amount of savings of workers from different ethnic groups and countries of origin. To test this hypothesis, one-way ANOVA was used. The analysis yielded an F-ratio of 4.266, which was significant at the 0.05 level of significance (p=0.011). Therefore it can be concluded that workers from different ethnic groups and countries of origin had different amounts of savings.

6/9/2016 22
Exercise
Get the One-Way ANOVA for Personal income as the dependent variable and Ethnic as the factor.

H0: there is no difference in the amounts of income among ethnic groups.
H1: at least one ethnic group has different amounts of income than the other ethnic groups.

Do you get the same result?

6/9/2016 23
Multiple Comparison Test (Post Hoc Test)

• Besides determining whether the treatment means differ, a researcher may want to know which means differ.
• E.g. the ANOVA test showed that the amounts of savings differed among ethnic groups, but the analysis does not tell us which ethnic groups differed in their amounts of savings.
• To detect the difference in the means of each pair of ethnic groups, a post hoc test (multiple comparison test) can be used.
6/9/2016 24
Multiple Comparison Test (Post Hoc Test)
Comparing treatment means simultaneously without a control group:
H0: Contrast = 0 vs H1: Contrast ≠ 0
1. Bonferroni t-test
2. Scheffe’s method

Comparing pairs of treatment means without a control group:
H0: μi-μi’ = 0 vs H1: μi-μi’ ≠ 0
1. Tukey’s test
2. LSD (Fisher Least Significant Difference) test

6/9/2016 25
Dunnett’s test: Comparing treatment means with a control group:
H0: μi-μa = 0 vs H1: μi-μa ≠ 0
Example 1: the Malay group is the benchmark for personal savings.
Example 2: the ASEAN countries’ GDP for the last 4 consecutive years is compared, with the GDP obtained in 2006 used as the base year.
Example 3: A teacher wants to compare the effectiveness of 3 new teaching approaches. A class of 100 students was randomly divided into 4 groups, A, B, C and D. Classes for groups A, B and C were conducted using Method 1, Method 2 and Method 3 respectively, and the control group D used the existing method.
6/9/2016 26
Analyze menu → Compare Means → One-Way ANOVA
1. Move the dependent variable to ‘Dependent List’.
2. Move the independent variable to ‘Factor’.
3. Click ‘Post Hoc’ and select the multiple comparison methods.
4. Click Continue, then OK.

6/9/2016 27
Multiple Comparisons
Dependent Variable: Personal savings (thousand)
LSD

                                      Mean Difference                       95% Confidence Interval
(I) Ethnic group   (J) Ethnic group   (I-J)             Std. Error   Sig.   Lower Bound   Upper Bound
Malay              Chinese            -5.188            4.642        .271   -14.61        4.24
                   Indian             -14.597*          5.653        .014   -26.07        -3.12
                   Foreigner          -18.288*          5.934        .004   -30.34        -6.24
Chinese            Malay              5.188             4.642        .271   -4.24         14.61
                   Indian             -9.410            5.352        .087   -20.28        1.46
                   Foreigner          -13.100*          5.648        .026   -24.57        -1.63
Indian             Malay              14.597*           5.653        .014   3.12          26.07
                   Chinese            9.410             5.352        .087   -1.46         20.28
                   Foreigner          -3.690            6.505        .574   -16.90        9.52
Foreigner          Malay              18.288*           5.934        .004   6.24          30.34
                   Chinese            13.100*           5.648        .026   1.63          24.57
                   Indian             3.690             6.505        .574   -9.52         16.90
*. The mean difference is significant at the .05 level.
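
The Std. Error column of the LSD table can be reproduced from the within-groups mean square (136.724) and the group sizes reported in the Descriptives table, using SE = sqrt(MSE × (1/n_i + 1/n_j)). A minimal Python sketch; the names `group_sizes` and `lsd_std_error` are illustrative, not SPSS terms:

```python
from math import sqrt

MSE = 136.724  # within-groups mean square from the ANOVA table
group_sizes = {"Malay": 11, "Chinese": 15, "Indian": 7, "Foreigner": 6}


def lsd_std_error(mse, n_i, n_j):
    """Standard error of the difference between two group means (Fisher LSD)."""
    return sqrt(mse * (1 / n_i + 1 / n_j))


names = list(group_sizes)
std_errors = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        se = lsd_std_error(MSE, group_sizes[first], group_sizes[second])
        std_errors[(first, second)] = se
        print(f"{first} vs {second}: SE = {se:.3f}")
```

Dividing each mean difference by its standard error gives the t statistic that the LSD test refers to the t distribution with N − a = 35 degrees of freedom.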

6/9/2016 28
Writing it in report 2!
Analysis of the data showed that the amount of savings differed among the ethnic groups, with the foreign workers having the highest savings at RM40830, followed by the Indians, Chinese and lastly Malays. Further analysis using the one-way ANOVA technique revealed that the difference was significant at least at the 0.05 level. The LSD test was conducted to detect which ethnic group differed from the other ethnic groups. The result showed that the mean difference in savings between the Malays and the Chinese was RM5188, and this was not significant at 0.05 (p=0.271). But the mean differences in savings between the Malays and the Indians and the foreigners were RM14597 and RM18288 respectively, and they were both significant at 0.05, with the probabilities of error being p=0.014 and p=0.004 respectively. At the 0.05 level, the Chinese workers had significant savings differences from the foreigners but no significant differences from the Malays and the Indians. The Indians had a significant savings difference from the Malays but not from the other ethnic groups. Lastly, the foreigners had significant savings differences from the Malays and the Chinese but not from the Indians.

6/9/2016 29
Perform the multiple comparison by assuming Malay is the control group.

Multiple Comparisons
Dependent Variable: Personal savings (thousand)
Dunnett t (2-sided) (a)

                                      Mean Difference                       95% Confidence Interval
(I) Ethnic group   (J) Ethnic group   (I-J)             Std. Error   Sig.   Lower Bound   Upper Bound
Chinese            Malay              5.188             4.642        .563   -6.27         16.64
Indian             Malay              14.597*           5.653        .038   .65           28.55
Foreigner          Malay              18.288*           5.934        .011   3.64          32.93
*. The mean difference is significant at the .05 level.
a. Dunnett t-tests treat one group as a control, and compare all other groups against it.

6/9/2016 30
Kruskal-Wallis H Test (nonparametric test for comparing several groups)
Open SPSS file ‘Data11’.
Analyze menu → Nonparametric Tests → K Independent Samples
1. Move the dependent variable to ‘Test Variable List’.
2. Move the independent variable to ‘Grouping Variable’.
3. Click ‘Kruskal-Wallis H’ in Test Type.
4. Click ‘Define Range’ and select minimum = 1, maximum = 3 (3 groups).
5. Click Continue, then OK.
6/9/2016 31
Test Statistics (a, b)

              (RM)
Chi-Square    9.632
df            2
Asymp. Sig.   .008
a. Kruskal Wallis Test
b. Grouping Variable: Country

Writing it in report 3!

This study examines the amount of remittances sent by foreign workers in Malaysia to their families in their countries of origin. Initial analysis revealed that the mean remittances for Indonesian, Bangladeshi and Myanmar workers were RM940, RM735 and RM497.5 respectively. Examination of the means suggested that there was a possibility that the amounts of remittances were different among the three groups of foreign workers. However, with the small sample taken in this study, a test of normality found that the data were not normally distributed. This suggested that the use of one-way ANOVA was not appropriate to test the difference of mean remittances among the three groups. Instead the Kruskal-Wallis H test, a non-parametric test, was used. The test resulted in a fairly large Chi-square value of 9.632, which was significant at 0.05 (p=0.008). Therefore, this study concludes that the amounts of remittances to the countries of origin were different among the Indonesian, Bangladeshi and Myanmar foreign workers.
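
To make the Chi-Square figure less of a black box, the H statistic can be sketched in a few lines of Python. This is an illustration under stated assumptions: the remittance values below are hypothetical (they are not the contents of ‘Data11’), the function names are invented for this sketch, and no correction for ties is applied, so a result can differ slightly from SPSS when tied values are present.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), without tie correction."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    big_n = len(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (big_n * (big_n + 1)) * total - 3 * (big_n + 1)


# Hypothetical monthly remittances (RM) for three groups of workers.
hypothetical = [
    [900, 1000, 950, 880],
    [700, 760, 720, 800],
    [450, 520, 500, 610],
]
print(f"H = {kruskal_wallis_h(hypothetical):.3f}")
```

H is compared with the chi-square distribution with (number of groups − 1) degrees of freedom, which is why SPSS labels it Chi-Square with df = 2 for three groups.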
6/9/2016 32
Two-Way ANOVA
• Applied to a two-factor factorial design.
• Detecting interaction in data involves at least 3 variables.
• One dependent variable on an interval or numerical scale.
• Two independent variables/factors measured on a nominal or categorical scale.

$$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}, \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b; \; k = 1, 2, \ldots, n$$

6/9/2016 33
Interaction Effects
• In the study on job satisfaction by ownership of company and workers’ country of origin, 3 possible hypotheses can be constructed:
1. Whether the perceptions of workers towards their employers differed significantly for the 3 groups of companies (main effect), i.e. foreign-owned, joint venture and local.
2. Whether the mean perceptions of workers differed significantly for the 2 groups of workers (main effect), i.e. locals and foreigners.

6/9/2016 34
Interaction Effects
3. Whether the mean perceptions were significantly influenced by both the types of companies and the citizenships of workers (interaction effects).

$$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}, \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b; \; k = 1, 2, \ldots, n$$

where y_ijk is the response variable, μ is the overall mean, τ_i is the mean effect of Factor 1, β_j is the mean effect of Factor 2, (τβ)_ij is the interaction effect, and ε_ijk is the error term.

6/9/2016 35
Open SPSS file ‘Data15’:
The data suggested that foreign workers responded favorably to the question of whether they received good treatment from their employers, but the situation was reversed for companies owned by local investors.

H0: the perceptions of workers towards their employers are not different for the 3 groups of companies.
H0: there is no difference in the perceptions of workers between locals and foreigners.
H0: there is no interaction between the types of companies (owner) and the citizenships of workers.

6/9/2016 36
Interaction Plot
Analyze menu → General Linear Model → Univariate
1. Move the variable to ‘Dependent Variable’.
2. Move the factors to ‘Fixed Factors’.
3. Click ‘Plot’, then move one factor to ‘Horizontal Axis’ and one to ‘Separate Lines’.
4. Click ‘Add’, then Continue.
6/9/2016 37
Multiple Comparison
1. Click ‘Post Hoc’, move the factors (can be more than 2 factors) to ‘Post Hoc Tests for’ and select the tests.
2. Click Continue.
3. Click ‘Options’ and select ‘Descriptive statistics’.
4. Click Continue, then OK.

6/9/2016 38
Between-Subjects Factors

         Value   Label           N
OWNER    1       foreign         20
         2       joint-venture   20
         3       local           20
WORKER   1       local           30
         2       Vietnamese      30

Descriptive Statistics
Dependent Variable: SATIS

OWNER           WORKER       Mean   Std. Deviation   N
foreign         local        5.30   .949             10
                Vietnamese   7.30   1.494            10
                Total        6.30   1.593            20
joint-venture   local        6.50   1.269            10
                Vietnamese   7.10   1.370            10
                Total        6.80   1.322            20
local           local        6.00   1.563            10
                Vietnamese   4.00   1.491            10
                Total        5.00   1.806            20
Total           local        5.93   1.337            30
                Vietnamese   6.13   2.080            30
                Total        6.03   1.737            60

Tests of Between-Subjects Effects
Dependent Variable: SATIS

Source               Type III Sum of Squares   df   Mean Square   F          Sig.
Corrected Model      76.333a                   5    15.267        8.114      .000
Intercept            2184.067                  1    2184.067      1160.823   .000
Ownership            34.533                    2    17.267        9.177      .000
Worker               .600                      1    .600          .319       .575
Ownership * Worker   41.200                    2    20.600        10.949     .000
Error                101.600                   54   1.881
Total                2362.000                  60
Corrected Total      177.933                   59
a. R Squared = .429 (Adjusted R Squared = .376)

1. There is a significant difference in the level of satisfaction towards employers between workers in different types of companies.
2. There is no significant difference in the level of satisfaction towards employers between workers of different citizenships.
3. There is an interaction effect between company ownership and citizenship.
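
Because the design is balanced (10 workers per cell), the main-effect and interaction sums of squares can be rebuilt from the cell means in the Descriptive Statistics table. The Python sketch below does this; the function name `balanced_two_way_ss` is illustrative, and the error sum of squares and its degrees of freedom are taken from the SPSS table rather than recomputed, since the raw data are not shown.

```python
# Cell means (OWNER, WORKER) -> mean satisfaction, from the Descriptive Statistics table.
cell_means = {
    ("foreign", "local"): 5.30, ("foreign", "Vietnamese"): 7.30,
    ("joint-venture", "local"): 6.50, ("joint-venture", "Vietnamese"): 7.10,
    ("local", "local"): 6.00, ("local", "Vietnamese"): 4.00,
}
N_PER_CELL = 10
SS_ERROR, DF_ERROR = 101.600, 54  # copied from the SPSS output


def balanced_two_way_ss(means, n):
    """Return (SS_A, SS_B, SS_AB) for a balanced two-factor design."""
    levels_a = sorted({a for a, _ in means})
    levels_b = sorted({b for _, b in means})
    grand = sum(means.values()) / len(means)
    ss_a = sum(
        n * len(levels_b)
        * (sum(means[(a, b)] for b in levels_b) / len(levels_b) - grand) ** 2
        for a in levels_a
    )
    ss_b = sum(
        n * len(levels_a)
        * (sum(means[(a, b)] for a in levels_a) / len(levels_a) - grand) ** 2
        for b in levels_b
    )
    ss_cells = n * sum((m - grand) ** 2 for m in means.values())
    return ss_a, ss_b, ss_cells - ss_a - ss_b


ss_a, ss_b, ss_ab = balanced_two_way_ss(cell_means, N_PER_CELL)
ms_error = SS_ERROR / DF_ERROR
f_a = (ss_a / 2) / ms_error    # Ownership, df = 3 - 1
f_b = (ss_b / 1) / ms_error    # Worker, df = 2 - 1
f_ab = (ss_ab / 2) / ms_error  # Ownership * Worker, df = (3 - 1)(2 - 1)
print(f"SS: {ss_a:.3f}, {ss_b:.3f}, {ss_ab:.3f}")
print(f"F:  {f_a:.3f}, {f_b:.3f}, {f_ab:.3f}")
```

The interaction sum of squares is what remains of the between-cell variation after both main effects are removed, which is why a large Ownership * Worker term can coexist with a negligible Worker main effect.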
6/9/2016 39
Multiple Comparisons

Dependent Variable: SATIS


Tukey HSD

Mean
Difference 95% Confidence Interval
(I) OWNER (J) OWNER (I-J) Std. Error Sig. Lower Bound Upper Bound
foreign joint-venture -.50 .434 .486 -1.55 .55
local 1.30* .434 .011 .25 2.35
joint-venture foreign .50 .434 .486 -.55 1.55
local 1.80* .434 .000 .75 2.85
local foreign -1.30* .434 .011 -2.35 -.25
joint-venture -1.80* .434 .000 -2.85 -.75
Based on observed means.
*. The mean difference is significant at the .05 level.

The lines are not parallel, which implies that citizenship and ownership interact with each other.
• Vietnamese workers achieved a higher level of satisfaction in foreign and joint-venture companies.
• Local workers achieved a higher level of satisfaction in local companies.
6/9/2016 40
Writing it in report 4!
This study interviewed 60 local and foreign (Vietnamese) workers in local, joint-venture and foreign-owned companies. Using 1-9 Likert scale ratings, they were asked to indicate their level of satisfaction with the treatment they received from their respective employers. In this study, the dependent variable is the level of satisfaction, while the ownership of the companies and the citizenship of the workers acted as the explanatory variables (factors). An initial look at the data suggested that the satisfaction levels of the workers towards their employers were influenced by the types of ownership of the companies and their citizenships. Further analysis using the two-way ANOVA method yielded three results. First, there was a significant difference in the means of satisfaction towards employers among the workers in the three groups of companies, with F=9.177 (p<0.001). This means that ownership of company exerted influence on workers’ level of satisfaction.

6/9/2016 41
Second, there was no significant difference in the levels of satisfaction between local workers and Vietnamese workers, with F=0.319 (p=0.575). This means that citizenship did not have a significant influence on satisfaction. Third, there was an interaction effect in which ownership of company and workers’ citizenship together exerted a significant influence on workers’ satisfaction, with F=10.949, significant at the 0.05 level (p<0.001).
A closer look at the interaction effects (figure) revealed that there was a big gap between the levels of satisfaction of local and Vietnamese workers in foreign-owned companies, with the latter showing higher satisfaction than the former. In joint-venture companies, Vietnamese workers continued to show higher satisfaction than local workers but the gap narrowed, whereas in local companies the situation was reversed, with local workers indicating a higher level of satisfaction than Vietnamese workers.

6/9/2016 42
Case of No Interaction Effects

Tests of Between-Subjects Effects
Dependent Variable: SATIS

Source              Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model     6.222a                    5    1.244         .391      .853
Intercept           1725.853                  1    1725.853      542.749   .000
Worker              2.391                     1    2.391         .752      .390
Activity            5.307                     2    2.654         .835      .440
Worker * Activity   .120                      2    .060          .019      .981
Error               171.711                   54   3.180
Total               2362.000                  60
Corrected Total     177.933                   59
a. R Squared = .035 (Adjusted R Squared = -.054)

p > 0.05 for Worker * Activity: do not reject H0 and conclude that there is no interaction between worker and activity. The interaction plot shows parallel lines.

6/9/2016 43
Tests of Between-Subjects Effects
Dependent Variable: SATIS

Source                 Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model        49.931a                   8    6.241         2.487     .023
Intercept              1931.094                  1    1931.094      769.406   .000
Activity               4.162                     2    2.081         .829      .442
Ownership              33.183                    2    16.592        6.611     .003
Activity * Ownership   11.332                    4    2.833         1.129     .353
Error                  128.002                   51   2.510
Total                  2362.000                  60
Corrected Total        177.933                   59
a. R Squared = .281 (Adjusted R Squared = .168)

p > 0.05 for Activity * Ownership: do not reject H0 and conclude that there is no interaction between ownership and activity.

It is hard to determine the interaction effects based on a plot with 3 or more lines.

6/9/2016 44
Three-Way ANOVA
• It is a 3-factor factorial design.
• One dependent variable in numeric scale and 3 explanatory variables (factors), either all in numeric scale or a combination of numeric and categorical scales.

$$y_{ijkl} = \mu + \tau_i + \beta_j + \gamma_k + (\tau\beta)_{ij} + (\tau\gamma)_{ik} + (\beta\gamma)_{jk} + (\tau\beta\gamma)_{ijk} + \varepsilon_{ijkl}$$

with i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., c; l = 1, 2, ..., n.

6/9/2016 45
Open SPSS file ‘Data16’.
Analyze menu → General Linear Model → Univariate
1. Move the variable to ‘Dependent Variable’.
2. Move categorical factors to ‘Fixed Factors’, or variables in numerical, interval or Likert scale to ‘Covariate’.
3. Click ‘Plot’, then move one factor to ‘Horizontal Axis’, one to ‘Separate Lines’ and one to ‘Separate Plots’.
4. Click ‘Add’, then Continue.
5. Click OK.
6/9/2016 46
Tests of Between-Subjects Effects
Dependent Variable: PERCEPTION

Source                         Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model                172.717a                  11   15.702        8.416     .000
Intercept                      544.027                   1    544.027       291.588   .000
Ethnicity                      27.432                    2    13.716        7.352     .005
Education                      5.767                     1    5.767         3.091     .096
City                           10.607                    1    10.607        5.685     .028
Ethnicity * Education          65.617                    2    32.809        17.585    .000
Ethnicity * City               3.832                     2    1.916         1.027     .378
Education * City               1.179                     1    1.179         .632      .437
Ethnicity * Education * City   23.096                    2    11.548        6.189     .009
Error                          33.583                    18   1.866
Total                          869.000                   30
Corrected Total                206.300                   29
a. R Squared = .837 (Adjusted R Squared = .738)

• Ethnicity: there is a significant difference in perception towards the efficiency of the police among respondents of different ethnicities at the 0.05 level (p=0.005).
• Education: there is no significant difference in perception towards the efficiency of the police among respondents of different educational backgrounds at 0.05 (p=0.096).
• City: there is a significant difference in perception towards the efficiency of the police among respondents of different locations at 0.05 (p=0.028).
• Ethnicity * Education: there is a significant interaction effect between ethnic background and educational attainment of the respondents.
• Ethnicity * City: no interaction effect between ethnic and location factors.
• Education * City: no interaction effect between education and location factors.
• Ethnicity * Education * City: there is a significant interaction effect between ethnic, education and location factors.
6/9/2016 47
6/9/2016 48
Writing it in report 5!
This study looks at the perception of members of the public from different ethnic groups, educational backgrounds and locations towards the efficiency of the police, with the hypothesis that their perceptions are different and the differences can be explained by their personal backgrounds. Three-way ANOVA was employed in this study.
The results of the analysis showed that there was a significant difference in perception towards the efficiency of the police among respondents of different ethnicities at 0.05 (p=0.005) and different locations (p=0.028), but no significant difference in perception among respondents of different educational backgrounds (p=0.096). Similarly, there was a significant interaction effect between ethnic background and educational attainment of the respondents at 0.05 (p<0.001), and a significant interaction effect between the ethnic, education and location backgrounds of the respondents (p=0.009), but no significant interaction effect between ethnic and location factors (p=0.378) and no significant interaction effect between education and location factors (p=0.437).

6/9/2016 49
Controlling for the location factor, this study revealed that there was a significant interaction effect in perception towards the efficiency of the police in terms of ethnicity among respondents living in small cities. In small cities,
lowly educated Malay respondents demonstrated a relatively high regard for
the police compared with the highly educated Malays, a similar pattern
observable among the Chinese respondents but with a wider gap between
the lowly and highly educated. However, this phenomenon was not
noticeable among the Indian respondents; in fact this ethnic group showed a
reverse situation in which the highly educated Indians tended to have a
relatively more favorable perception towards the police compared with the
lowly educated Indians. Moreover, the gap between the two groups of
educational attainment among the Indians was much wider compared with
the other two ethnic groups, namely the Malays and the Chinese.

6/9/2016 50
A similar interaction effect can be seen among respondents living in big
cities. Generally the Malays demonstrated the highest regard for the police,
followed by the Chinese and the Indians. The lowly educated Malays
demonstrated a relatively high regard for the police; a similar pattern which
was observable among the Chinese respondents but at relatively low levels
and a narrower gap between the two groups. However, the situation is
reversed among the Indians where perception towards efficiency of the
police among the highly educated of this ethnic group was relatively higher
than that of the lowly educated. In general, the Indians were found to have
very low regard for the police compared with the other two ethnic groups,
and there was not much difference between the lowly and highly educated
respondents among members of this ethnic group in terms of the level of
respect for the police.

6/9/2016 51
Exercise using ‘Data15’: construct the three-way ANOVA using three factors: Ownership, Worker, and Activity.

Tests of Between-Subjects Effects

Dependent Variable: SATIS


Type III Sum
Source of Squares df Mean Square F Sig.
Corrected Model 100.690a 16 6.293 3.503 .001
Intercept 1287.435 1 1287.435 716.697 .000
Ownership 38.985 2 19.492 10.851 .000
Worker 1.310 1 1.310 .729 .398
Activity 2.170 2 1.085 .604 .551
Ownership * Worker 35.736 2 17.868 9.947 .000
Ownership * Activity 7.209 4 1.802 1.003 .416
Worker * Activity 1.284 2 .642 .357 .702
Ownership * Worker * Activity 14.089 3 4.696 2.614 .063
Error 77.243 43 1.796
Total 2362.000 60
Corrected Total 177.933 59
a. R Squared = .566 (Adjusted R Squared = .404)

6/9/2016 52
Difficult to
interpret 3 lines

6/9/2016 53
6/9/2016 54
Linear Regression Analysis and
Logit Analysis

6/9/2016 55
• Simple linear regression
• Fitted line
• Multiple linear regression and model checking
• Significance test
• Model selection
• Nonlinear regression and curve estimation
• Dummy variable in regression
• Logit analysis

6/9/2016 56
Simple Linear Regression

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where:
• The intercept β0 and the slope β1 are unknown constants (parameters).
• The regressor (independent or predictor variable) x_i is a known constant (fixed).
• ε_i is the random error.
• y_i is the value of the response (dependent) variable in the i-th trial/observation.

6/9/2016 57
Assumptions:
1) The error terms ε_i are normally (needed in inferences and MLE) and independently distributed with mean E(ε_i) = 0 and constant variance Var(ε_i) = σ²:

$$\varepsilon_i \sim NID(0, \sigma^2)$$

2) The errors (and thus the y_i also) are uncorrelated with each other in successive observations:

$$Cov(\varepsilon_i, \varepsilon_j) = 0, \quad \forall i \neq j$$

3) The relationship between the variables y and x should be linear.

6/9/2016 58
Open ‘Data12[phone bill]’.
Analyze menu → Regression → Linear
1. Move one variable to ‘Dependent’ and one variable to ‘Independent’.
2. Click ‘Statistics’ and select the types of statistics: Estimates, Model fit, R2.
3. Click Continue, then OK.

6/9/2016 59
Coefficients (a)

                     Unstandardized Coefficients   Standardized Coefficients
Model                B          Std. Error         Beta                        t        Sig.
1  (Constant)        296.813    53.519                                         5.546    .000
   CGPA (max 4.0)    -48.650    20.532             -.324                       -2.369   .022
a. Dependent Variable: Handphone bills

• Both p-values < 0.05, which means that the intercept and the slope are significantly different from zero.
• The standardized intercept is always zero; the standardized slope (= -0.324) is now comparable to other slopes, if any.

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} = -48.650$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 296.813$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

Estimated mobile phone bill = 296.813 - 48.65 × CGPA
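
The estimators above translate directly into code. The sketch below is a minimal Python illustration: `fit_simple_regression` applies the S_xy / S_xx formula to a small made-up data set (not the phone-bill data), and `predicted_bill` simply evaluates the fitted equation with the coefficients reported in the Coefficients table. Both function names are invented for this example.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) using b1 = Sxy / Sxx and b0 = ybar - b1 * xbar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predicted_bill(cgpa, b0=296.813, b1=-48.650):
    """Fitted model from the SPSS output: bill = 296.813 - 48.65 * CGPA."""
    return b0 + b1 * cgpa


# Made-up data lying exactly on y = 1 + 2x, to check the estimator.
b0, b1 = fit_simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

# Predicted monthly bill for a student with CGPA 3.0.
print(f"Predicted bill at CGPA 3.0: {predicted_bill(3.0):.3f}")
```

With only one regressor, the standardized slope equals the correlation between the two variables, which is why Beta (-.324) has the same magnitude as R in the Model Summary.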

6/9/2016 60
ANOVA (b)

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   41792.297        1    41792.297     5.615   .022a
   Residual     357292.9         48   7443.603
   Total        399085.2         49
a. Predictors: (Constant), CGPA (max 4.0)
b. Dependent Variable: Handphone bills

p-value < 0.05 means the regression model is significant.

Model Summary

                                                                  Change Statistics
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .324a   .105       .086                86.276                       .105              5.615      1     48    .022
a. Predictors: (Constant), CGPA (max 4.0)

• Coefficient of correlation (R): measures the strength of the relationship between the dependent and independent variables.
• Coefficient of determination (R2): the variation of the dependent variable explained by the model (%).
• Std. error of the estimate: the standard deviation of the residuals, i.e. the typical error of the predicted values.
• When the difference between R2 and Adjusted R2 is large, the linear model is not appropriate or some ‘important’ independent variables are missing.
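
Every entry in the Model Summary can be derived from the two sums of squares in the ANOVA table. A short Python sketch of those relationships, using only figures printed above (n = 50 observations follows from the total df of 49; k = 1 regressor):

```python
from math import sqrt

ss_regression = 41792.297
ss_residual = 357292.9
n, k = 50, 1  # observations and regressors

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
std_error_estimate = sqrt(ss_residual / (n - k - 1))  # square root of the residual mean square
f0 = (ss_regression / k) / (ss_residual / (n - k - 1))
r = sqrt(r_squared)  # magnitude only; the sign comes from the slope

print(f"R = {r:.3f}, R^2 = {r_squared:.3f}, adjusted R^2 = {adjusted_r_squared:.3f}")
print(f"Std. error of the estimate = {std_error_estimate:.3f}, F = {f0:.3f}")
```

Adjusted R² penalizes R² for the number of regressors, so the two values stay close when n is large relative to k, as they do here.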
6/9/2016 61
Writing it in report 6!
One of the hypotheses in this study is that several personal characteristics of the youths interviewed can predict the monthly amount they spent on the mobile phone. Simple linear regression analysis using CGPA as the explanatory variable produced a significant result, with F=5.615, significant at the 0.05 level (p=0.022). There was a weak and inverse relationship between CGPA and expenditure on mobile phone, with a correlation coefficient of 0.324 in magnitude. The derived R2 was rather small at 0.105, indicating that only 10.5% of the variation in expenditure on mobile phone was explained by academic performance measured in CGPA. The resultant model from the analysis is Y = 296.813 - 48.65*X, where Y is expenditure on mobile phone and X is CGPA. It showed that the higher the CGPA of the respondent, the lower the amount spent on mobile phone bills.
Note: Although the ANOVA test shows that the model is significant, the small R and R2 values indicate that CGPA is not a good predictor of expenditure on mobile phone.

Exercise: Repeat the linear regression analysis using Age as predictor.
6/9/2016 62
Fitted Line
Construct the scatter plot for Age and Handphone bills. Double-click the plot obtained to get the following graph.
• Click ‘Add fit line at total’.
• Select the ‘Linear’ model.
• [Figure: scatter plot with fitted line; one point is marked as a possible outlier.]

6/9/2016 63
Multiple Linear Regression
• It involves one dependent variable and a set of several independent variables.
• The assumptions of simple linear regression still hold, e.g. linearity, normality, uncorrelated errors and equal variance.
• IMPORTANT: At the end of MLR, we try to select a model that has a minimum number of independent variables with acceptable model performance (e.g. a large coefficient of determination).

6/9/2016 64
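Matrix form of multiple linear regression model

It is more convenient to deal with the multiple linear regression model in matrix form:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where

$$\mathbf{y} = [y_1 \; y_2 \; \cdots \; y_n]', \quad \boldsymbol{\beta} = [\beta_0 \; \beta_1 \; \cdots \; \beta_k]', \quad \boldsymbol{\varepsilon} = [\varepsilon_1 \; \varepsilon_2 \; \cdots \; \varepsilon_n]'$$

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$$

y and ε are vectors of size n×1, β is a vector of size p×1, and X is a matrix of size n×p, where p = k + 1.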
6/9/2016 65
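\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \]

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y} \]

To estimate the parameters, the following conditions must be satisfied (WHY?):
1) The number of observations (n) must be at least the number of
parameters to be estimated (k + 1: the k regressors plus the intercept).
2) The matrix X’X must be non-singular.
3) All regressors must be linearly independent.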
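
As a rough illustration of the estimator above, the sketch below solves the normal equations (X’X)β = X’y in plain Python. The function names (`ols`, `solve`, `matmul`, `transpose`) and the tiny data set are made up for this example; they are not part of SPSS or of the workshop data files. The singular-matrix check corresponds to conditions 2) and 3).

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # X'X is singular: the regressors are linearly dependent
            raise ValueError("X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares estimates [b0, b1, ..., bk] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]   # prepend the column of ones
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a + 3 * b for a, b in rows]
beta = ols(rows, y)
print(beta)
```

With fewer observations than parameters, or with one regressor an exact multiple of another, `solve` raises the error instead of returning estimates.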

Open SPSS file ‘Data17’
1) Analyze menu
2) Regression
3) Linear
4) Move one variable to ‘Dependent’ and the regressors to ‘Independent’
5) Statistics
6) Select types of statistics: Estimates, Model fit, Descriptives
7) Continue
8) Plots

9) Move Dependent to ‘Y’ and ZRESID (or others) to ‘X’
10) Click ‘Histogram’ and ‘Normal Probability Plot’
11) Continue
12) OK

Variables available for the residual plot (used to check assumptions):
• Dependent variable
• Standardized predicted values
• Standardized residuals
• Deleted residuals
• Adjusted predicted values
• Studentized residuals
• Studentized deleted residuals

Do the data meet the conditions and assumptions?
1) Do the explanatory variables have a relatively strong linear
relationship with the dependent variable?
YES, except variable AGE

Correlations
                                                  JobSat  WorkExp  Income    Age     Sex
Pearson Correlation  Job Satisfaction              1.000    .660    .719   -.045   -.407
                     Working Experience (Years)     .660   1.000    .482    .135   -.232
                     INCOME (RM)                    .719    .482   1.000   -.093   -.183
                     AGE (Years)                   -.045    .135   -.093   1.000    .166
                     SEX                           -.407   -.232   -.183    .166   1.000
Sig. (1-tailed)      Job Satisfaction                 .     .000    .000    .406    .013
                     Working Experience (Years)     .000      .     .003    .239    .109
                     INCOME (RM)                    .000    .003      .     .313    .166
                     AGE (Years)                    .406    .239    .313      .     .191
                     SEX                            .013    .109    .166    .191      .
N                    Job Satisfaction                 30      30      30      30      30
                     Working Experience (Years)       30      30      30      30      30
                     INCOME (RM)                      30      30      30      30      30
                     AGE (Years)                      30      30      30      30      30
                     SEX                              30      30      30      30      30

[Figure: Scatter Plot Matrix]

Residual Analysis
The residual is defined as

\[ e_i = y_i - \hat{y}_i \]

It is the deviation between the data and the fit, i.e. the realized or observed value of the error.

Standardized residuals (zero mean and approximately unit variance; \sqrt{MS_E} is the average standard deviation):

\[ d_i = \frac{e_i}{\sqrt{MS_E}} \]

Studentized residuals (the denominator is the exact standard error of e_i; useful in regression diagnosis):

\[ r_i = \frac{e_i}{\sqrt{MS_E\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]}} \]

In small data sets, the studentized residuals are more appropriate than the standardized residuals.

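
The three residual definitions can be sketched for a simple linear regression as follows. This is only an illustration in plain Python: the function name `residual_diagnostics` and the six data points are hypothetical, and MS_E is taken as SS_E / (n − 2), the usual choice when one slope and one intercept are estimated.

```python
import math


def residual_diagnostics(x, y):
    """Return raw (e), standardized (d) and studentized (r) residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx                      # slope
    b0 = y_bar - b1 * x_bar               # intercept
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ms_e = sum(ei ** 2 for ei in e) / (n - 2)
    d = [ei / math.sqrt(ms_e) for ei in e]
    # leverage term 1/n + (x_i - x_bar)^2 / S_xx from the studentized formula
    h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    r = [ei / math.sqrt(ms_e * (1 - hi)) for ei, hi in zip(e, h)]
    return e, d, r


# Hypothetical data, not taken from the workshop files
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 13.0]
e, d, r = residual_diagnostics(x, y)
for ei, di, ri in zip(e, d, r):
    print(f"{ei:8.3f} {di:8.3f} {ri:8.3f}")
```

Each studentized residual is at least as large in absolute value as the corresponding standardized residual, and the gap is widest for points far from the mean of x.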
2) Does the normality assumption hold?
YES, since the histogram is approximately bell-shaped or the points
lie approximately along a straight line in the Normal P-P plot.

Departures from normality in the normal probability plot:
• Heavy-tailed/long-tailed distribution: the points show a sharp upward and
downward curve at both extremes.
• Light-tailed/short-tailed distribution: flattening at the extremes.
• Negative/left skewed
• Positive/right skewed

3) Is the error variance constant?
Yes, because the residuals are contained in a horizontal band.
Residual-plot patterns that indicate a problem: outward-opening funnel,
inward-opening funnel, double bow, nonlinear/curvilinear.

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .834(a)   .695       .646                1.492
a. Predictors: (Constant), SEX, AGE (Years), INCOME (RM), Working Experience (Years)
b. Dependent Variable: Job Satisfaction

• The response variable and the regressors have a strong relationship.
• 69.5% of the variation in the response is explained by the regressors.
• No important regressor is missing.

Significance test for the model using ANOVA

ANOVA(b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   126.678          4    31.670        14.234   .000(a)
   Residual     55.622           25   2.225
   Total        182.300          29
a. Predictors: (Constant), SEX, AGE (Years), INCOME (RM), Working Experience (Years)
b. Dependent Variable: Job Satisfaction

The overall regression model was significant at the 0.05 level (p < 0.001).

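
The figures in the Model Summary and ANOVA tables are linked by simple arithmetic, which the following sketch reproduces from the sums of squares and degrees of freedom reported above. The variable names are chosen for this example only.

```python
import math

# Values copied from the ANOVA table above
ss_regression, ss_residual = 126.678, 55.622
df_regression, df_residual = 4, 25

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * df_total / df_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_value = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(round(math.sqrt(r_square), 3))   # R
print(round(r_square, 3))              # R Square
print(round(adjusted_r_square, 3))     # Adjusted R Square
print(round(std_error_estimate, 3))    # Std. Error of the Estimate
print(round(f_value, 3))               # F
```

The adjusted R Square penalizes R Square for the number of regressors, which is why it is lower (.646 against .695) with four predictors and only 30 cases.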
JobStat = 1.406 + 0.413Exp + 0.001Income – 0.003Age – 1.134Sex

Significance test for individual parameters using the one-sample t-test

Coefficients(a)
Model                           B        Std. Error   Beta     t        Sig.
1  (Constant)                   1.406    1.225                 1.148    .262
   Working Experience (Years)   .413     .148         .368     2.798    .010
   INCOME (RM)                  .001     .000         .499     3.884    .001
   AGE (Years)                  -.003    .030         -.011    -.097    .923
   SEX                          -1.134   .578         -.228    -1.963   .061
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Job Satisfaction

• Intercept and Age are not significant at the 0.05 level; they should be
removed from further analysis. Sex (p = 0.061) is marginal.
• Income is the most significant predictor of job satisfaction.
• Holding the other variables constant, an increase of 1 year in working
experience leads to an increase of about 0.41 level of job satisfaction.
A change in sex from female (0) to male (1) leads to a decrease of about
1.13 levels of job satisfaction.

Writing it in report 7!
This study examines the possibility of several personal characteristics of
workers in explaining job satisfaction in the firm. Job satisfaction was
measured on a 9-point Likert scale in which 1 denotes extreme dissatisfaction
and 9 denotes extreme satisfaction, working experience in years, monthly
income in RM, age in years, and sex as a categorical variable in which 0 is
for female and 1 for male. Initial analysis using the Pearson correlation
method found that the dependent variable had relatively strong
correlations with the independent variables; the normality
assumption held but the error variance was not constant. (Assuming that a
transformation method was applied to deal with the error variance.) In
general, the analysis yielded a significant regression model with an F value
of 14.234, significant at the 0.05 level. The derived model is
JobStat = 1.406 + 0.413Exp + 0.001Income – 0.003Age – 1.134Sex

Job satisfaction was found to be positively correlated with working
experience and income but had an inverse relationship with age and
sex. High job satisfaction was associated with high income, longer
working experience and young workers. This study also found that
female workers tended to be more satisfied than their male
counterparts. Taking the regression model as a whole, it was found that
the four independent variables were able to explain 69.5% of the
variance in levels of job satisfaction among the studied workers. At the
individual level, income was the most significant variable in explaining
variations in levels of job satisfaction, followed by working experience
and sex, while age did not have a significant impact.

Model Selection
Choose the ‘Stepwise’ method, then repeat using the ‘Remove’, ‘Backward’ and
‘Forward’ methods.
• Enter (Regression): all variables in a block are entered in a single step.
• Remove: all variables in a block are removed in a single step.
• Stepwise: at each step, the independent variable not in the model that has
the smallest probability of F is entered, if that probability is sufficiently
small. Variables already in the regression equation are removed if their
probability of F becomes sufficiently large.
• Backward Elimination: all variables are entered into the equation and then
sequentially removed. The variable with the smallest partial correlation with
the dependent variable that meets the elimination criterion is removed first.
The procedure is repeated until no variables in the model satisfy the removal
criteria.
• Forward Selection: variables are sequentially entered into the model. The
variable with the largest positive or negative correlation with the dependent
variable that satisfies the entry criterion is entered first. The procedure is
repeated until no variables meet the entry criterion.

Variables Entered/Removed(a)
Model   Variables Entered            Variables Removed   Method
1       INCOME (RM)                  .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
2       Working Experience (Years)   .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
3       SEX                          .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
a. Dependent Variable: Job Satisfaction

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .719(a)   .517       .500                1.773
2       .803(b)   .645       .619                1.548
3       .834(c)   .695       .660                1.463
a. Predictors: (Constant), INCOME (RM)
b. Predictors: (Constant), INCOME (RM), Working Experience (Years)
c. Predictors: (Constant), INCOME (RM), Working Experience (Years), SEX

Models 1, 2 and 3 are all significant, but Model 3 is preferred because there
is a large increase in R2 values from Model 2 to Model 3.

ANOVA(d)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   94.261           1    94.261        29.979   .000(a)
   Residual     88.039           28   3.144
   Total        182.300          29
2  Regression   117.581          2    58.790        24.527   .000(b)
   Residual     64.719           27   2.397
   Total        182.300          29
3  Regression   126.657          3    42.219        19.727   .000(c)
   Residual     55.643           26   2.140
   Total        182.300          29
a. Predictors: (Constant), INCOME (RM)
b. Predictors: (Constant), INCOME (RM), Working Experience (Years)
c. Predictors: (Constant), INCOME (RM), Working Experience (Years), SEX
d. Dependent Variable: Job Satisfaction

Coefficients(a)
Model                           B        Std. Error   Beta    t        Sig.
1  (Constant)                   1.875    .609                 3.078    .005
   INCOME (RM)                  .001     .000         .719    5.475    .000
2  (Constant)                   .344     .724                 .475     .638
   INCOME (RM)                  .001     .000         .522    3.990    .000
   Working Experience (Years)   .458     .147         .408    3.119    .004
3  (Constant)                   1.320    .832                 1.586    .125
   INCOME (RM)                  .001     .000         .501    4.034    .000
   Working Experience (Years)   .410     .141         .365    2.913    .007
   SEX                          -1.145   .556         -.230   -2.059   .050
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Job Satisfaction

Best subset of independent variables after the Stepwise method (remove constant):

JobStat = 0.41Exp + 0.001Income – 1.145Sex

Nonlinear Regression and Curve Estimation
Open SPSS file ‘Data18’ and create the scatter plot with
fitted line
The straight line does not fit the data well.
There are two ways to solve this problem:
1. Transform the data
using an appropriate
transformation (say
log)
2. Fit the data using curve
estimation

Transform the y and x values using logarithm: y => ln y and x => ln x

Other common transformations:
• Power transformation: Y = aX^b
• Compound transformation: Y = ab^X
• Logarithmic transformation: Y = a + b*ln X
• Exponential transformation: Y = a*exp(bX)
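
To show why taking logarithms of both variables helps, the sketch below fits the power model Y = aX^b by ordinary least squares on ln Y = ln a + b ln X. The function name `fit_power_model` and the four data points are invented for this illustration (they are generated exactly from Y = 5X^-2), so this is not the ‘Data18’ analysis.

```python
import math


def fit_power_model(x, y):
    """Fit Y = a * X**b by least squares on ln Y = ln a + b * ln X."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_lx = sum(lx) / n
    mean_ly = sum(ly) / n
    s_xy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(lx, ly))
    s_xx = sum((u - mean_lx) ** 2 for u in lx)
    b = s_xy / s_xx                         # slope on the log-log scale
    a = math.exp(mean_ly - b * mean_lx)     # back-transform the intercept
    return a, b


# Hypothetical data generated exactly from Y = 5 * X**-2
distance = [1.0, 2.0, 4.0, 8.0]
value = [5.0 * d ** -2 for d in distance]
a, b = fit_power_model(distance, value)
print(a, b)
```

The slope on the log-log scale is the exponent b, and the intercept must be exponentiated to recover a.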
Curve Estimation
1) Analyze menu
2) Regression
3) Curve Estimation
4) Move one variable to ‘Dependent’ and the regressor to ‘Independent’
5) Select types of models: Logarithmic, Inverse, Quadratic, Cubic, Power, Growth etc.
6) OK

Model Summary and Parameter Estimates

Dependent Variable: Value (RM)

              Model Summary                         Parameter Estimates
Equation      R Square   F        df1   df2   Sig.   Constant   b1         b2          b3
Logarithmic   .759       25.137   1     8     .001   799276.8   -431306
Inverse       .894       67.131   1     8     .000   -206770    1165690
Quadratic     .768       11.598   2     7     .006   1150714    -379446    29109.793
Cubic         .851       11.460   3     6     .007   1811317    -1089058   199266.6    -11279.3
Power         .895       68.545   1     8     .000   1990114    -4.253
Growth        .918       89.780   1     8     .000   14.831     -1.227
Exponential   .918       89.780   1     8     .000   2761657    -1.227
The independent variable is Distance.

Power, Growth and Exponential methods are the most appropriate due to their
large R2 values and model simplicity.
Power method:
Value = 1990114*Distance^(-4.253)

Writing it in report 8!
This study examines the relationship between property value and
distance with the hypothesis that property values are the highest in
areas near the city centre but decrease with the distance from the city
centre. Initial analysis using simple linear regression yielded a
significant but weak relationship, with only about 60% of the variation in
property values explained by distance. Double-log or power
transformation (growth and exponential as well) using natural logarithm
for both variables resulted in a better R2 of 0.895, indicating that
approximately 90% of the variation is explained. This study concludes that
there is a non-linear relationship between property values and distance
from the city centre with the following equation:
VALUE = 1990114 / DISTANCE^4.253
where VALUE is average property values in RM per 1000m3 and
DISTANCE is distance in kilometres from the city centre.

Dummy Variable

 It is a variable in categorical or nominal scale.
 It may consist of 2 groups (dichotomy),
e.g. sex or > 2 groups.
 A dummy variable acts as an independent
variable.

Open SPSS file ‘Data10b’: the scatter plot indicates that the
perception of the public is influenced by age, with a strong
negative relationship. Careful investigation reveals that the
dummy variable, family background, could also influence
public opinion.

Method 1: split the variables into two groups manually,
based on non-drug addict family and drug-addict family

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 9.043 .597 15.158 .000
Age -.104 .014 -.713 -7.686 .000
Family -2.647 .453 -.542 -5.842 .000
a. Dependent Variable: Perception

Perception=9.043 – 0.104AGE – 2.647FAMILY


Non-drug addict family (AGE0, PERCEP0)

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .854(a)   .730       .713                1.290
a. Predictors: (Constant), AGE0

ANOVA(b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   71.889           1    71.889        43.224   .000(a)
   Residual     26.611           16   1.663
   Total        98.500           17
a. Predictors: (Constant), AGE0
b. Dependent Variable: PERCEP0

Coefficients(a)
Model           B       Std. Error   Beta    t        Sig.
1  (Constant)   9.903   .782                 12.665   .000
   AGE0         -.127   .019         -.854   -6.575   .000
a. Dependent Variable: PERCEP0

Drug-addict family (AGE1, PERCEP1)

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .760(a)   .578       .552                1.356
a. Predictors: (Constant), AGE1

ANOVA(b)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   40.355           1    40.355        21.945   .000(a)
   Residual     29.423           16   1.839
   Total        69.778           17
a. Predictors: (Constant), AGE1
b. Dependent Variable: PERCEP1

Coefficients(a)
Model           B       Std. Error   Beta    t        Sig.
1  (Constant)   5.769   .693                 8.326    .000
   AGE1         -.085   .018         -.760   -4.685   .000
a. Dependent Variable: PERCEP1

Method 2: Use split file command of the SPSS (see module 1)
Model Summary

Change Statistics
Adjusted Std. Error of R Square
Family Model R R Square R Square the Estimate Change F Change df1 df2 Sig. F Change
non-drug addict family 1 .854a .730 .713 1.290 .730 43.224 1 16 .000
drug-addict family 1 .760a .578 .552 1.356 .578 21.945 1 16 .000
a. Predictors: (Constant), Age
ANOVAb

Sum of
Family Model Squares df Mean Square F Sig.
non-drug addict family 1 Regression 71.889 1 71.889 43.224 .000a
Residual 26.611 16 1.663
Total 98.500 17
drug-addict family 1 Regression 40.355 1 40.355 21.945 .000a
Residual 29.423 16 1.839
Total 69.778 17
a. Predictors: (Constant), Age
b. Dependent Variable: Perception

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Family Model B Std. Error Beta t Sig.
non-drug addict family 1 (Constant) 9.903 .782 12.665 .000
Age -.127 .019 -.854 -6.575 .000
drug-addict family 1 (Constant) 5.769 .693 8.326 .000
Age -.085 .018 -.760 -4.685 .000
a. Dependent Variable: Perception

Non-drug addict: Perception = 9.903 – 0.127AGE
Drug addict: Perception = 5.769 – 0.085AGE

Dummy Variable with 3 or more categories
 Need special treatment in the SPSS.
 The # of dummy variables created = #
categories in the variable – 1.
1) Sex (# categories=2) => 1 dummy variable (0=female, 1=male)
2) Agreement (3 categories) => 2 dummy variables
3) Ethnic (4 categories) => 3 dummy variables

Agreement   Variable 1   Variable 2
Agree       1            0
Disagree    0            1
Neutral     0            0

Ethnic      Var 1   Var 2   Var 3
Malay       1       0       0
Chinese     0       1       0
Indian      0       0       1
Others      0       0       0

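
The coding rule (number of dummy variables = number of categories – 1) can be sketched in a few lines of Python. The helper `dummy_code` is a hypothetical name used only for this illustration; in SPSS the dummy variables would be created with recode or compute steps instead.

```python
def dummy_code(value, categories, reference):
    """Return the 0/1 indicators for `value`, omitting the reference category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1 if value == c else 0 for c in categories if c != reference]


# Four ethnic categories give three dummy variables; 'Others' is the reference
ETHNIC = ["Malay", "Chinese", "Indian", "Others"]
codes = {group: dummy_code(group, ETHNIC, reference="Others") for group in ETHNIC}
for group, row in codes.items():
    print(group, row)
```

The reference category is the one coded 0 on every dummy variable, so its effect is absorbed into the constant of the regression model.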
Open SPSS file ‘Data10d’

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .895(a)   .800       .774                1.177
a. Predictors: (Constant), Ethnic2, Age, Ethnic1, Family

ANOVAb

Sum of
Model Squares df Mean Square F Sig.
1 Regression 172.013 4 43.003 31.031 .000a
Residual 42.960 31 1.386
Total 214.972 35
a. Predictors: (Constant), Ethnic2, Age, Ethnic1, Family
b. Dependent Variable: Perception

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 8.139 .631 12.907 .000
Age -.084 .013 -.579 -6.445 .000
Family -1.705 .483 -.349 -3.528 .001
Ethnic1 .663 .445 .136 1.492 .146
Ethnic2 -1.378 .543 -.278 -2.539 .016
a. Dependent Variable: Perception

Perception = 8.139 – 0.084AGE – 1.705FAMILY + 0.663ETH1 – 1.378ETH2


Malay: Perception = 8.139 – 0.084AGE – 1.705FAMILY + 0.663(1) – 1.378(0)
= 8.802 – 0.084AGE – 1.705FAMILY
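
The substitution shown for the Malay group works the same way for any combination of dummy values. The sketch below wraps the fitted equation in a function (the name `predict_perception` is made up here) and recovers the group-specific constants by setting AGE and FAMILY to zero.

```python
def predict_perception(age, family, eth1, eth2):
    """Fitted equation from the Coefficients table above."""
    return 8.139 - 0.084 * age - 1.705 * family + 0.663 * eth1 - 1.378 * eth2


# Constant term for each setting of the two ethnic dummy variables
constant_eth1 = round(predict_perception(0, 0, 1, 0), 3)       # ETH1 = 1
constant_eth2 = round(predict_perception(0, 0, 0, 1), 3)       # ETH2 = 1
constant_reference = round(predict_perception(0, 0, 0, 0), 3)  # both dummies 0

print(constant_eth1, constant_eth2, constant_reference)

# Hypothetical respondent: aged 30, FAMILY = 0, ETH1 = 1
print(round(predict_perception(30, 0, 1, 0), 3))
```

Because the dummy variables only shift the constant, the three groups share the same slopes for AGE and FAMILY and differ only in the height of the fitted line.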

Writing it in report 9!
One of the objectives of this study is to examine whether several personal
background characteristics of the surveyed public influence the shaping of
their opinion towards the proposal that drug addicts be sent to an
uninhabited island as a measure to rehabilitate them. There are three
independent variables, namely age, family background and ethnicity,
where the last two are dummy variables. Analysis using multiple linear
regression taking perception as the dependent variable yielded a
significant model (F=31.03, p=0.000) with R2=0.80, implying that 80% of
the variation in perception is explained by the independent variables. This
implies that the independent variables exerted strong and significant
influences in shaping public opinion. The derived model is as follows:
PERCEPTION=8.139 – 0.084*AGE – 1.705*FAMILY + 0.663*ETHNIC1 –
1.378*ETHNIC2

Regardless of family background and ethnicity, age had an inverse but
strong (WHY?) and significant influence on perception in that older
respondents tended to be less favorable towards the proposal. In terms
of ethnic influence, the regression model indicated that the Malays
generally showed more favorable response to the proposal followed by
the Indians and the Chinese in that order. In terms of family experience
in drug addiction, respondents from non-drug addict families were more
favorable to the proposal of sending drug addicts to the island compared
with those from drug addict families.

Logit Analysis/ Logistic Regression
 It is used when the dependent variable is in
categorical scale.
 It is a linear probability method of predicting the
category of outcome for individual cases or
observations.
 Advantage: not requiring assumptions of
multivariate normality and equal variance-
covariance

Open ‘Data19[dengue fine]’

Fine=whether the house was fined by the local authority (1=fined, 0=not fined)
Size=size of the house indicated by the floor area
Age=average age of family members
Pot=number of outdoor flower pots in the house compound
Family=total number of persons in the family
Helper=whether the house employs a housemaid (1=yes, 0=no)
House1=house location (1=squatter, 0=other types of houses)
House2=house location (1=flat, 0=other types of houses)
1) Analyze menu
2) Regression
3) Binary Logistic
4) Move one variable to ‘Dependent’ and the regressors to ‘Covariates’
5) Options
6) Select ‘Casewise listing of residuals’, ‘All cases’ and ‘Include constant in model’
7) Continue
8) OK

The prediction equation for the probability of a household being fined by the
local authority is:

\[ \mathrm{Prob}(y = 1) = \frac{\exp(z)}{1 + \exp(z)} \]

where

\[ z = 6.141 - 0.001\,Size - 0.269\,Age - 0.049\,Pot + 0.324\,Family - 0.741\,Help + 4.773\,H1 + 3.879\,H2 \]
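
As a small illustration of how the prediction equation is used, the sketch below evaluates it for one household. The coefficients are the ones in the equation above; the function name `prob_fined` and the household values (including the unit of Size) are hypothetical and are not taken from ‘Data19’.

```python
import math

INTERCEPT = 6.141
COEF = {
    "Size": -0.001, "Age": -0.269, "Pot": -0.049, "Family": 0.324,
    "Help": -0.741, "H1": 4.773, "H2": 3.879,
}


def prob_fined(household):
    """Prob(y = 1) = exp(z) / (1 + exp(z)), written as 1 / (1 + exp(-z))."""
    z = INTERCEPT + sum(COEF[name] * household[name] for name in COEF)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical household: values invented for illustration only
house = {"Size": 1000, "Age": 30, "Pot": 5, "Family": 4, "Help": 0, "H1": 0, "H2": 0}
squatter = dict(house, H1=1)   # same household, but located in a squatter area

p_other = prob_fined(house)
p_squatter = prob_fined(squatter)
odds_ratio_h1 = math.exp(COEF["H1"])   # odds ratio for H1 is exp(coefficient)

print(p_other, p_squatter, odds_ratio_h1)
```

With the cut value of .500 used by SPSS, a household is classified as fined when the computed probability is at least 0.5.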

Odds Ratio (OR) quantifies how strongly the presence or absence of
outcome A is associated with the presence or absence of factor B in a
given population.
None of the factors is found to be significant (all p > 0.05); equivalently,
exp(0)=1 lies in all the 95% CIs.
The model is significant.

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      23.908       8    .002

The Hosmer & Lemeshow test tests the null hypothesis that predictions made
by the model fit perfectly with observed group memberships. Since
p=0.002 < 0.05, we reject the null hypothesis and conclude that predictions
made by the model do not fit the observed group memberships.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      14.583(a)           .498                   .665
a. Estimation terminated at iteration number 7 because
parameter estimates changed by less than .001.

50% of the variation in y is explained by the model (Cox & Snell R Square = .498).

-2 Log likelihood measures how poorly the model predicts the decisions. The
smaller the statistic, the better the model.

Cox & Snell R square and Nagelkerke R square can be interpreted like R
square in multiple linear regression. Cox & Snell R2 cannot reach the
maximum of 1, whereas Nagelkerke R2 can reach 1.

Classification Table(a)
                                 Predicted FINE
Observed                         not fined   fined   Percentage Correct
Step 1   FINE   not fined        10          1       90.9
                fined            1           9       90.0
         Overall Percentage                          90.5
a. The cut value is .500

Out of 11 households that were actually not fined, 10 were correctly predicted
not fined (90.9% accuracy).

Casewise List
Case   Selected Status(a)   Observed FINE   Predicted   Predicted Group   Resid    ZResid
1      S                    f               .804        f                 .196     .494
2      S                    f               .909        f                 .091     .317
3      S                    f               .666        f                 .334     .707
4      S                    f               .890        f                 .110     .351
5      S                    f               .954        f                 .046     .220
6      S                    f               .999        f                 .001     .028
7      S                    f               .999        f                 .001     .028
8      S                    f**             .033        n                 .967     5.444
9      S                    f               .717        f                 .283     .628
10     S                    f               .935        f                 .065     .264
11     S                    n               .387        n                 -.387    -.794
12     S                    n               .196        n                 -.196    -.493
13     S                    n               .185        n                 -.185    -.477
14     S                    n               .001        n                 -.001    -.030
15     S                    n               .007        n                 -.007    -.084
16     S                    n**             .550        f                 -.550    -1.105
17     S                    n               .016        n                 -.016    -.126
18     S                    n               .302        n                 -.302    -.658
19     S                    n               .116        n                 -.116    -.362
20     S                    n               .196        n                 -.196    -.494
21     S                    n               .139        n                 -.139    -.401
Resid and ZResid are temporary variables.
a. S = Selected, U = Unselected cases, and ** = Misclassified cases.

Household no. 8 was misclassified. It was actually fined, but was predicted
not fined (predicted probability of being fined = 0.033).
Household no. 16 was misclassified. It was actually not fined, but was
predicted fined (predicted probability of being fined = 0.55).

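
The percentages in the Classification Table follow directly from the four cell counts. The sketch below recomputes them; the dictionary layout and variable names are chosen for this example only.

```python
# Cell counts copied from the Classification Table: table[observed][predicted]
table = {
    "not fined": {"not fined": 10, "fined": 1},
    "fined": {"not fined": 1, "fined": 9},
}

# Percentage correct within each observed group
percent_correct = {
    observed: round(100 * row[observed] / sum(row.values()), 1)
    for observed, row in table.items()
}

# Overall percentage correct: correctly classified cases over all cases
n_correct = sum(table[group][group] for group in table)
n_total = sum(sum(row.values()) for row in table.values())
overall = round(100 * n_correct / n_total, 1)

print(percent_correct, overall)
```

The two misclassified cases counted here are households no. 8 and no. 16 in the Casewise List.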
Writing it in report 10!
This study is concerned with the dengue fever epidemics and how the
households in a local community responded to measures adopted by the
local health authority in imposing fines on households found to allow their
house compounds to become the breeding ground of the aedes mosquito
which spreads the viruses. The dependent variable took the form of a
dummy with value 1 denoting the households that were fined by the local
health authority and value 0 for those which were not fined. The objective
was to examine whether a set of predictor variables comprising
household characteristics such as house and family sizes, average age of
family members, whether the households employed housemaids, and
locations of the houses could be used to predict the probability of being
fined. The study found that of the six predictor variables entered in the
analysis, only house type was found to be significant at least at the 0.05
level. Other variables were not significant owing to the small sample size
used in this study.

The parameter estimates showed that houses of bigger sizes, with more
members, occupied by family members dominated by the young, which
did not employ housemaids, and were located in squatter and flat areas
had greater probability of being negligent and thus fined by the local
authority. The Hosmer and Lemeshow goodness-of-fit statistic (chi-square
estimate of 23.908) was significant at the 0.05 level (p=0.002), indicating
that the predictions made by the model did not fit the observed group
memberships well. Nevertheless, out of the total 11 households that were
actually not fined by the local authority, 10 households were predicted not
fined by the logit model resulting in a 90.9% success, and out of the 10
households that were actually fined, 9 households were predicted fined
resulting in a 90% success, giving the overall model predictability of
90.5%. The study concludes that the household characteristics selected
were able to predict the probability of the households being fined by the
local health authority for failing to prevent the spread of dengue fever in
the community under study.

Principal Component Analysis and Factor Analysis
Comparing PCA and Factor Analysis
• PCA: to identify a relatively small number of factors that can be used to
represent relationships among sets of many interrelated variables (reduce
dimension).
  FA: same.
• PCA: identify the underlying, not directly observable, constructs (hidden
variables).
  FA: same.
• PCA: all the variations in a given population are contained within the
variables used to define that population.
  FA: only part of the variation in a given population is contained within
the variables used to define that population.

• PCA: used in deterministic approach studies.
  FA: used in studies that use a more flexible experimental approach.
• PCA objective: to select a number of components that explain as much of
the total variance as possible.
  FA objective: the factors (components) are selected mainly to explain the
interrelationship among the original variables.
• PCA: its value for a given individual is relatively simple to compute and
interpret.
  FA: emphasis on obtaining easily understandable factors that convey the
essential information contained in the original set of variables.

Open SPSS file ‘Data20’
1) Analyze menu
2) Data Reduction
3) Factor
4) Move all selected variables to ‘Variables’
5) Descriptives
6) Select ‘Initial solution’, ‘Coefficients’ and ‘Univariate descriptives’
7) Continue
8) Extraction
9) Choose the ‘Principal Components’ method
10) Select ‘Correlation matrix’, ‘Unrotated factor solution’, ‘Scree plot’ and ‘Eigenvalues over: 1’
11) Continue
12) Rotation
13) Choose the ‘Varimax’ method
14) Select ‘Rotated solution’ and ‘Loading plots’
15) Continue
16) Scores
17) Click ‘Save as variables: Regression’
18) Select ‘Display factor score coefficient matrix’
19) Continue
20) Options
21) Select ‘Exclude cases listwise’
22) Click ‘Sorted by size’
23) Continue
24) OK

For Factor Analysis, choose ‘Principal axis factoring’, ‘Alpha factoring’ or
‘Image factoring’ as the extraction method.

Economic Sector Variable Definition
Employment Sector Variable
Agriculture, hunting and forestry AGRIC
Fishing FISH
Mining and quarrying MINE
Manufacturing MANU
Electricity, gas and water supply ELECT
Construction CONST
Wholesale, retail, repair of motor vehicle, personnel WSALE
Hotels and restaurants HOTEL
Transport, storage and communication TRANS
Financial intermediation FINANCE
Real estate, renting and business activities ESTATE
Public admin, defence, social security ADMIN
Education EDU
Health and social work HEALTH
Other community, social and personal services OTHERS
Housemaid MAID
Step #17 (saving the factor scores as variables) yields the table below.
There are only 4 optimum factors (FAC1_1 to FAC4_1). These factor
scores are listed after the last variable in the Data View window.

Determine the number of components using Scree Plot

An ‘elbow’ occurs at about the 4th component. It appears that 3 (or
perhaps 4) sample components effectively summarize the total
sample variance.

Determine the number of components using % of variance
The first 4 components contributed 82.685% cumulative % of variance,
whereas the contribution of the 5th component is small (5.231% of variance),
which is not significant.

Step #10 Extract:
• If you put the ‘eigenvalues over’ = 1, the optimum # of components is obtained.
• If the eigenvalues over = 0, the # of components = the number of variables.
• You can also set the number of components desired.

Total Variance Explained
            Initial Eigenvalues                      Extraction Sums of Squared Loadings    Rotation Sums of Squared Loadings
Component   Total       % of Variance   Cumulative %   Total   % of Variance   Cumulative %   Total   % of Variance   Cumulative %
1           6.738       42.112          42.112         6.738   42.112          42.112         5.790   36.185          36.185
2           3.023       18.892          61.004         3.023   18.892          61.004         3.147   19.672          55.857
3           2.432       15.199          76.203         2.432   15.199          76.203         2.994   18.712          74.568
4           1.037       6.482           82.685         1.037   6.482           82.685         1.299   8.117           82.685
5           .837        5.231           87.916
6           .767        4.796           92.711
7           .414        2.585           95.297
8           .297        1.854           97.151
9           .208        1.301           98.452
10          .131        .821            99.273
11          .064        .399            99.672
12          .042        .260            99.933
13          .011        .067            100.000
14          1.10E-016   6.85E-016       100.000
15          -6.6E-017   -4.11E-016      100.000
16          -1.9E-016   -1.17E-015      100.000
Extraction Method: Principal Component Analysis.
The ‘Rotation Sums of Squared Loadings’ columns give the contributions of the
first 4 components after rotation.
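
The ‘% of Variance’ column and the ‘eigenvalues over 1’ rule can be checked by hand: with 16 standardized variables the total variance is 16, so each eigenvalue is divided by 16. The sketch below does this with the rounded eigenvalues from the table; treating the last three (of order 1E-016) as zero is an assumption made for this illustration.

```python
# Initial eigenvalues copied (rounded) from the Total Variance Explained table;
# components 14-16 are numerically zero and are entered as 0.0 here
eigenvalues = [6.738, 3.023, 2.432, 1.037, 0.837, 0.767, 0.414, 0.297,
               0.208, 0.131, 0.064, 0.042, 0.011, 0.0, 0.0, 0.0]

n_variables = len(eigenvalues)   # 16 employment variables

# Each component's share of the total variance, in percent
percent_of_variance = [100 * ev / n_variables for ev in eigenvalues]

# Running total of the shares
cumulative = []
running_total = 0.0
for pct in percent_of_variance:
    running_total += pct
    cumulative.append(running_total)

# 'Eigenvalues over 1' rule: keep components with eigenvalue greater than 1
n_retained = sum(1 for ev in eigenvalues if ev > 1)

print(n_retained)
print([round(p, 1) for p in percent_of_variance[:5]])
print(round(cumulative[n_retained - 1], 1))
```

Small differences in the third decimal place from the SPSS output are expected, because the eigenvalues in the table are already rounded.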
Component matrix before rotation

Component Matrix(a)
           Component
           1       2       3       4
FINANCE    .932    .286    .011    -.148
ESTATE     .929    .242    .056    .084
MAID       .910    .180    -.327   -.027
TRANS      .839    .053    -.241   .198
AGRIC      -.797   .164    -.435   -.277
OTHERS     .761    .136    -.385   -.121
WSALE      .687    .425    .269    -.329
EDU        -.636   .362    .429    -.144
MANU       .330    -.815   .203    .364
CONST      -.268   .778    .349    .100
ELECT      .599    .677    -.032   .188
ADMIN      -.528   .565    .283    -.072
HOTEL      .167    -.293   .770    .063
HEALTH     .410    -.188   .656    -.382
FISH       -.563   .204    -.583   .040
MINE       -.233   .529    .251    .643
Extraction Method: Principal Component Analysis.
a. 4 components extracted.

Component matrix after rotation

Rotated Component Matrix(a)
           Component
           1       2       3       4
FINANCE    .939    -.038   .271    -.121
MAID       .939    -.274   -.020   -.105
ESTATE     .901    -.139   .304    .091
ELECT      .826    .255    -.015   .325
OTHERS     .805    -.245   -.105   -.205
TRANS      .795    -.401   .049    .089
WSALE      .743    .310    .392    -.179
MANU       -.135   -.823   .488    .112
CONST      .044    .797    .002    .413
ADMIN      -.259   .756    -.078   .197
EDU        -.472   .707    .075    .106
HOTEL      -.149   -.049   .821    .111
HEALTH     .146    .070    .810    -.313
FISH       -.277   .143    -.776   .025
AGRIC      -.519   .355    -.691   -.235
MINE       -.034   .360    -.047   .823
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 7 iterations.

Component 1 consists of variables ‘Finance’, ‘Maid’, ‘Estate’, ‘Elect’, ‘Others’,
‘Trans’ and ‘Wsale’ because they have the largest coefficient values as
compared to components 2, 3 and 4.
Component 2 consists of variables Manu, Const, Admin and Edu.
Component 3 consists of variables Hotel, Health, Fish and Agric.
Component 4 consists of variable Mine only.

Hidden Variables:
• Component 1: Dominated by Tertiary Employment (Y1)
• Component 2: Lack in Manufacturing (Y2)
• Component 3: Dominant in Tourist-Related Activities (Y3)
• Component 4: Dependent on Mining (Y4)

Instead of using 16 variables, the researcher may use these 4 components to
study the problem concerned.

Rotated Component Matrix(a)
           Component
           1       2       3       4
FINANCE    .939    -.038   .271    -.121
MAID       .939    -.274   -.020   -.105
ESTATE     .901    -.139   .304    .091
ELECT      .826    .255    -.015   .325
OTHERS     .805    -.245   -.105   -.205
TRANS      .795    -.401   .049    .089
WSALE      .743    .310    .392    -.179
MANU       -.135   -.823   .488    .112
CONST      .044    .797    .002    .413
ADMIN      -.259   .756    -.078   .197
EDU        -.472   .707    .075    .106
HOTEL      -.149   -.049   .821    .111
HEALTH     .146    .070    .810    -.313
FISH       -.277   .143    -.776   .025
AGRIC      -.519   .355    -.691   -.235
MINE       -.034   .360    -.047   .823
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 7 iterations.

Y1 = 0.939FINANCE + 0.939MAID + 0.901ESTATE + 0.826ELECT
     + 0.805OTHERS + 0.795TRANS + 0.743WSALE
Y2 = –0.823MANU + 0.797CONST + 0.756ADMIN + 0.707EDU
Y3 = 0.821HOTEL + 0.810HEALTH – 0.776FISH – 0.691AGRIC
Y4 = 0.823MINE

The component plot shows the first 3 components from the
rotated component matrix, e.g. the coordinate of FINANCE is
(0.939, -0.038, 0.271).

Scatter plot: the relative positions of states in terms of dominance of
employment in the tertiary sector (component 1) and problem of lack in
manufacturing (component 2). You may plot the scatter diagrams for any
combination of the components.

Plot using the factor scores obtained from Step #17.
Writing it in report 11!
In this study, principal component analysis was used to portray the
basic structure of economic development of states in Malaysia
based on the percentage of employment in sixteen economic sub-
sectors, namely agriculture (including hunting & forestry), fishing,
mining (including quarrying), manufacturing, utility (including
electricity, gas & water supply), construction, wholesale (including
retail, repair of motor vehicle & personal trade), hotels and
restaurants, transport (including storage & communication),
financial intermediation, real estate (including renting and business
activities), administration, housemaid services, education, health
and other community, social and personal services. Initial analysis
showed variations in the values of the variables and correlations
among several variables, both indicating the possibility of using this
technique. The initial solution resulted in four components (factors) that
had eigenvalues of more than one, with a total of 82.7% of variance
explained.

The first component accounted for 42.1% of the variance explained,
the second component contributed 18.9%, the third component 15.2%
and the fourth component 6.5%. Rotation using the varimax method
did not result in an increase in total variance explained, but it
managed to raise slightly the percentage of variance explained for the
first, second and third components so that the contributions of these
components to the underlying economic structure of the country
studied were made clear and more significant.
Seven employment sub-sector variables, namely in finance,
housemaid services, real estate, utility, transport and communication,
wholesale and others loaded high on Component 1. Employment in
manufacturing, construction, administration and educational services
loaded high on Component 2. Employment in hotel, health services,
fishing and agriculture loaded high on Component 3. The only
variable contributing to the formation of Component 4 was
employment in mining. In this study, the first component was labeled
‘Dominated by tertiary employment’, the second component ‘Lack in
manufacturing’, the third ‘Dominant in tourist-related activities’ and the
fourth ‘Dependent on mining’.
Figure 1 maps out the position of each state in the employment
structure based on factor scores. Kuala Lumpur was positioned far
ahead of other states in the first component of employment
dimension, followed by Selangor in the middle, while other states
flocked together with generally low scores. On the other extreme,
Perlis and Kelantan, and to a lesser extent, Terengganu, showed not
only low scores for tertiary activities but also lack of employment in
manufacturing. Seen from the second figure which was defined as
the dominance of employment in tertiary activities and tourism,
Kuala Lumpur did quite well, positioned positively far from other
states. States such as Melaka, Penang and Kelantan did quite well
in tourism but greatly lacked in tertiary activities.

From the third figure where employment in tertiary activities was
seen together with mining, the position of Terengganu clearly
eclipsed that of other states. But because several other sub-sectors
such as utility, construction and agriculture too loaded (either
positively or negatively) quite heavily on this component, Melaka
was also positioned near Terengganu. When scores for component 2
depicting lack in manufacturing and component 3 for significant
contribution to tourism in employment were put together, the position
of Kelantan, Perlis, Terengganu and Kuala Lumpur as a group
separated from other states was very clear. Relative to other states,
Penang did well both in manufacturing and tourism, as shown by the
negative and low scores. When components 2 and 4 were plotted
together, Terengganu was found to be positioned far from other
states, indicating a problem of lack in manufacturing and dependence
on mining, while Kelantan and, to a lesser extent, Perlis were behind
other states in both sub-sectors. Finally, when components 3 and 4
were plotted on a two-dimensional map, two states were positioned
at the two extremes: Melaka lacked in mining, fishing and
agriculture but did well in tourism, and Sabah did well in agriculture
but not quite as well in mining.
[Figure: scatter plots of factor scores. Axis annotations: Better tertiary employment; Lack of employment; More tourist-related activities; More dependent on mining]


Clustering
Techniques

• This procedure attempts to identify relatively
homogeneous groups of cases (or variables)
based on selected characteristics, using an
algorithm that starts with each case (or variable)
in a separate cluster and combines clusters until
only one is left.
• It can be used to reduce dimension or reduce
cases. Since the variables or cases in the same
group are homogeneous, we can choose only
one of them to represent the whole group.

Measure
Allows you to specify the distance or similarity measure to be used
in clustering. Select the type of data and the appropriate
distance or similarity measure:
1. Interval: Available alternatives are Euclidean distance,
squared Euclidean distance, cosine, Pearson correlation,
Chebychev, block, Minkowski, and customized
2. Counts: Available alternatives are chi-square measure and phi-
square measure
3. Binary: Available alternatives are Euclidean distance, squared
Euclidean distance, size difference, pattern difference,
variance, dispersion, shape, simple matching, phi 4-point
correlation, lambda, Anderberg's D, dice, Hamann, Jaccard,
Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers
and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and
Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and
Sneath 5, Yule's Y, and Yule's Q
Euclidean Distance
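Let the data on m variables and n observations be recorded as follows, with one row per observation (Obs 1, ..., Obs n) and one column per variable (Var 1, ..., Var m):

\[
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & x_{ij} & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix}
\]

The Euclidean distance between 2 observations:

\[
d_{hk} = \sqrt{(x_{h1} - x_{k1})^2 + (x_{h2} - x_{k2})^2 + \cdots + (x_{hm} - x_{km})^2}, \qquad h, k = 1, 2, \ldots, n
\]

The Euclidean distance between 2 variables:

\[
d_{hk} = \sqrt{(x_{1h} - x_{1k})^2 + (x_{2h} - x_{2k})^2 + \cdots + (x_{nh} - x_{nk})^2}, \qquad h, k = 1, 2, \ldots, m
\]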
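
A minimal sketch of the two distances, assuming the data are held as a list of rows (observations) with one entry per variable. The data values and function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical data: n = 3 observations (rows) on m = 2 variables (columns).
X = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 5.0],
]


def distance_between_observations(data, h, k):
    """Compare two rows of the data matrix (0-based indices)."""
    return euclidean(data[h], data[k])


def distance_between_variables(data, h, k):
    """Compare two columns of the data matrix (0-based indices)."""
    col_h = [row[h] for row in data]
    col_k = [row[k] for row in data]
    return euclidean(col_h, col_k)


# Rows (1, 2) and (4, 6): sqrt(3^2 + 4^2) = 5
print(distance_between_observations(X, 0, 1))
# Columns (1, 4, 1) and (2, 6, 5): sqrt(1 + 4 + 16) = sqrt(21)
print(distance_between_variables(X, 0, 1))
```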

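Distance Matrix

\[
D = \begin{pmatrix}
d_{11} = 0 & d_{12} & \cdots & d_{1n} \\
d_{21} & 0 & \cdots & d_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
d_{n1} & d_{n2} & \cdots & d_{nn} = 0
\end{pmatrix}
\]

The diagonal entries are 0, which means there is no distance between a variable/case and itself. The matrix is n × n when comparing n cases (or m × m when comparing m variables).

Pearson Similarity Matrix

\[
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1q} \\
r_{21} & 1 & \cdots & r_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
r_{q1} & r_{q2} & \cdots & 1
\end{pmatrix}
\]

\[
r_{jk} = \frac{s_{jk}}{s_j s_k}, \qquad
s_{jk} = \frac{1}{n} \sum_{h=1}^{n} (x_{hj} - \bar{x}_j)(x_{hk} - \bar{x}_k), \qquad
s_j = \sqrt{s_{jj}}
\]

Note: The distance matrix and similarity matrix can be obtained by
using the command Analyze => Correlate => Distances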
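
The formulas for s_jk, s_j and r_jk translate directly into code. The sketch below assumes each variable is stored as a list of n values; the three example variables are hypothetical.

```python
import math


def covariance(x, y):
    """s_jk = (1/n) * sum((x_h - mean_x) * (y_h - mean_y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def pearson(x, y):
    """r_jk = s_jk / (s_j * s_k), with s_j = sqrt(s_jj)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def similarity_matrix(variables):
    """Pearson similarity matrix R for a list of variables (columns)."""
    return [[pearson(a, b) for b in variables] for a in variables]


# Hypothetical variables: the second rises with the first, the third falls.
variables = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
R = similarity_matrix(variables)
for row in R:
    print([round(r, 3) for r in row])
```

The diagonal of R is 1 because every variable is perfectly correlated with itself, which mirrors the zero diagonal of the distance matrix.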
Single (Nearest) Linkage
Single link method defines the similarity between the (j,k)-
group and the remaining variables l as follows:

\[
r_{(jk)l} = \max\{ r_{jl}, r_{kl} \}, \qquad l \neq j, k; \quad 1 \le l \le n \ (\text{or } m)
\]

Complete (Furthest) Linkage

Complete link method defines the similarity between the
(j,k)-group and the remaining variables l as follows:

\[
r_{(jk)l} = \min\{ r_{jl}, r_{kl} \}, \qquad l \neq j, k; \quad 1 \le l \le n \ (\text{or } m)
\]
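
The two update rules differ only in whether the merged group keeps the larger or the smaller of the two similarities. A small sketch, with a hypothetical function name and made-up similarity values:

```python
def merged_similarity(r_jl, r_kl, method):
    """Similarity between the merged (j,k)-group and item l."""
    if method == "single":      # nearest neighbour: keep the larger similarity
        return max(r_jl, r_kl)
    if method == "complete":    # furthest neighbour: keep the smaller similarity
        return min(r_jl, r_kl)
    raise ValueError(f"unknown method: {method}")


# Hypothetical similarities of items j and k to a third item l.
print(merged_similarity(0.9, 0.6, "single"))    # 0.9
print(merged_similarity(0.9, 0.6, "complete"))  # 0.6
```

Note that the rules are stated for similarities; with a distance measure the roles of max and min are swapped.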

Between (Average) Groups Linkage

Within (Average) Groups Linkage

Cluster cases. Open SPSS file ‘Data20’, then:

1. Analyze menu
2. Classify
3. Hierarchical Cluster
4. Move all selected variables to ‘Variables’
5. Move the categorical variable to ‘Label Cases by’
6. Select ‘Cases’ under Cluster
7. Statistics
8. Agglomeration schedule
9. Continue
10. Plots
11. Dendrogram
12. Continue
13. Method
14. Select cluster method ‘Furthest neighbor’
15. Select measure ‘Pearson Correlation’
16. Continue
17. OK

Types of clustering method:

1. Between (average) groups linkage
2. Within (average) groups linkage
3. Nearest neighbor/Single linkage
4. Furthest neighbor/Complete linkage
5. Centroid clustering
6. Median clustering
7. Ward’s method


Proximity Matrix

Correlation between Vectors of Values (columns 1 to 14 are the cases in the same order as the rows)

Case                 1      2      3      4      5      6      7      8      9     10     11     12     13     14
1:Johor          1.000   .949   .637   .973   .951   .680   .927   .698   .967   .526   .528   .930   .728   .618
2:Kedah           .949  1.000   .817   .894   .985   .869   .975   .867   .842   .749   .759   .818   .831   .523
3:Kelantan        .637   .817  1.000   .601   .805   .965   .856   .979   .448   .914   .927   .531   .933   .486
4:Melaka          .973   .894   .601  1.000   .929   .611   .905   .663   .973   .416   .424   .968   .735   .736
5:N.Sembilan      .951   .985   .805   .929  1.000   .840   .981   .859   .863   .692   .710   .876   .843   .626
6:Pahang          .680   .869   .965   .611   .840  1.000   .874   .973   .485   .953   .966   .520   .866   .393
7:Perak           .927   .975   .856   .905   .981   .874  1.000   .902   .825   .752   .755   .847   .891   .647
8:Perlis          .698   .867   .979   .663   .859   .973   .902  1.000   .515   .903   .924   .583   .944   .477
9:Penang          .967   .842   .448   .973   .863   .485   .825   .515  1.000   .303   .301   .950   .585   .655
10:Sabah          .526   .749   .914   .416   .692   .953   .752   .903   .303  1.000   .989   .339   .764   .235
11:Sarawak        .528   .759   .927   .424   .710   .966   .755   .924   .301   .989  1.000   .344   .783   .225
12:Selangor       .930   .818   .531   .968   .876   .520   .847   .583   .950   .339   .344  1.000   .691   .825
13:Terengganu     .728   .831   .933   .735   .843   .866   .891   .944   .585   .764   .783   .691  1.000   .631
14:Kuala Lumpur   .618   .523   .486   .736   .626   .393   .647   .477   .655   .235   .225   .825   .631  1.000

This is a similarity matrix
Agglomeration Schedule (complete linkage)

        Cluster Combined                     Stage Cluster First Appears
Stage   Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1          10          11          .989            0           0           9
2           2           5          .985            0           0           4
3           3           8          .979            0           0           7
4           2           7          .975            2           0          10
5           1           4          .973            0           0           6
6           1           9          .967            5           0           8
7           3           6          .965            3           0           9
8           1          12          .930            6           0          12
9           3          10          .903            7           1          11
10          2          13          .831            4           0          11
11          2           3          .692           10           9          13
12          1          14          .618            8           0          13
13          1           2          .225           12          11           0

Start with 14 clusters (states). First combine cluster 10 and cluster 11
because they have the largest similarity value (0.989). This is followed by
combining clusters 2 and 5, and so on.
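
The schedule can be reproduced from the proximity matrix with a naive complete-linkage sketch. This is not SPSS's implementation: it works from the three-decimal similarities printed on the slide, breaks ties by taking the first pair found, and uses illustrative names (`SIM`, `complete_linkage`).

```python
# Similarity matrix copied from the Proximity Matrix slide (cases 1..14).
SIM = [
    [1.000, .949, .637, .973, .951, .680, .927, .698, .967, .526, .528, .930, .728, .618],
    [.949, 1.000, .817, .894, .985, .869, .975, .867, .842, .749, .759, .818, .831, .523],
    [.637, .817, 1.000, .601, .805, .965, .856, .979, .448, .914, .927, .531, .933, .486],
    [.973, .894, .601, 1.000, .929, .611, .905, .663, .973, .416, .424, .968, .735, .736],
    [.951, .985, .805, .929, 1.000, .840, .981, .859, .863, .692, .710, .876, .843, .626],
    [.680, .869, .965, .611, .840, 1.000, .874, .973, .485, .953, .966, .520, .866, .393],
    [.927, .975, .856, .905, .981, .874, 1.000, .902, .825, .752, .755, .847, .891, .647],
    [.698, .867, .979, .663, .859, .973, .902, 1.000, .515, .903, .924, .583, .944, .477],
    [.967, .842, .448, .973, .863, .485, .825, .515, 1.000, .303, .301, .950, .585, .655],
    [.526, .749, .914, .416, .692, .953, .752, .903, .303, 1.000, .989, .339, .764, .235],
    [.528, .759, .927, .424, .710, .966, .755, .924, .301, .989, 1.000, .344, .783, .225],
    [.930, .818, .531, .968, .876, .520, .847, .583, .950, .339, .344, 1.000, .691, .825],
    [.728, .831, .933, .735, .843, .866, .891, .944, .585, .764, .783, .691, 1.000, .631],
    [.618, .523, .486, .736, .626, .393, .647, .477, .655, .235, .225, .825, .631, 1.000],
]


def complete_linkage(sim, stop_at=1):
    """Merge clusters until `stop_at` remain; return (schedule, clusters)."""
    clusters = [[i] for i in range(len(sim))]
    schedule = []
    while len(clusters) > stop_at:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: similarity of two clusters is the
                # smallest similarity between any pair of their members.
                s = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        # Label each cluster by its lowest case number (1-based).
        schedule.append((min(clusters[a]) + 1, min(clusters[b]) + 1, s))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return schedule, clusters


schedule, _ = complete_linkage(SIM)
for stage, (c1, c2, coef) in enumerate(schedule, start=1):
    print(stage, c1, c2, coef)

_, four = complete_linkage(SIM, stop_at=4)
print([sorted(i + 1 for i in c) for c in four])
```

Stopping at four clusters leaves the groups {1, 4, 9, 12}, {2, 5, 7, 13}, {3, 6, 8, 10, 11} and {14}, which can be compared with the cluster list read off the dendrogram.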
Horizontal Icicle

Number of clusters
Case 1 2 3 4 5 6 7 8 9 10 11 12 13
11:Sarawak X X X X X X X X X X X X X
X X X X X X X X X X X X X
10:Sabah X X X X X X X X X X X X X
X X X X X
6:Pahang X X X X X X X X X X X X X
X X X X X X X
8:Perlis X X X X X X X X X X X X X
X X X X X X X X X X X
3:Kelantan X X X X X X X X X X X X X
X X X
13:Terengganu X X X X X X X X X X X X X
X X X X
7:Perak X X X X X X X X X X X X X
X X X X X X X X X X
5:N.Sembilan X X X X X X X X X X X X X
X X X X X X X X X X X X
2:Kedah X X X X X X X X X X X X X
X
14:Kuala Lumpur X X X X X X X X X X X X X
X X
12:Selangor X X X X X X X X X X X X X
X X X X X X
9:Penang X X X X X X X X X X X X X
X X X X X X X X
4:Melaka X X X X X X X X X X X X X
X X X X X X X X X
1:Johor X X X X X X X X X X X X X

1 cluster 3 clusters
2 clusters 4 clusters

Dendrogram

Cluster 1: Sabah, Sarawak, Kelantan, Perlis, Pahang
Cluster 2: Kedah, N.Sembilan, Perak, Terengganu
Cluster 3: Johor, Melaka, Penang, Selangor
Cluster 4: Kuala Lumpur

The number of clusters can be adjusted according to the
desired degrees of similarity

Clustering variable

K-Means Clustering
• This procedure attempts to identify relatively
homogeneous groups of cases based on selected
characteristics, using an algorithm that can handle large
numbers of cases. However, the algorithm requires you
to specify the number of clusters.
• You can specify initial cluster centers if you know this
information.
• You can select one of two methods for classifying cases,
either updating cluster centers (centers will change)
iteratively or classifying only.
• The k-means cluster analysis command is efficient
primarily because it does not compute the distances
between all pairs of cases, as do many clustering
algorithms.
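
The two classification methods can be illustrated with a textbook k-means loop. This is a generic sketch, not SPSS's K-Means Cluster procedure; the data, the initial centers and the function names are hypothetical.

```python
import math


def classify(points, centers):
    """'Classify only': assign each point to its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]


def kmeans(points, initial_centers, max_iter=10):
    """'Iterate and classify': alternate assignment and center updates."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = classify(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New center = mean of each variable over the cluster members.
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # no change: converged
            break
        centers = new_centers
    return classify(points, centers), centers


# Hypothetical data: two well-separated groups, two user-supplied centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, initial_centers=[(1, 1), (8, 8)])
print(labels)
print(centers)
```

Each pass only measures the distance from every case to the k centers, never between pairs of cases, which is why the method scales to large numbers of cases.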
• It is a tool designed to assign cases to a fixed
number of groups (clusters) whose
characteristics are not yet known but are based
on a set of specified variables. It is most useful
when you want to classify a large number
(thousands) of cases.
• A good cluster analysis is:
  • Efficient. Uses as few clusters as possible.
  • Effective. Captures all statistically and commercially
    important clusters. For example, a cluster with five
    customers may be statistically different but not very
    profitable.

Open SPSS file ‘Data20’

1. Analyze menu
2. Classify
3. K-Means Cluster
4. Move all selected variables to ‘Variables’
5. Select the number of clusters needed
6. Create a new file to store results
7. Iterate and classify
8. Define the maximum number of iterations
9. Continue
10. Save
11. Click the boxes to save and continue
12. Options
13. Select ‘Statistics’
14. Continue
15. OK
Initial Cluster Centers

            Cluster
            1       2       3
AGRIC      1.4      .3    29.5
FISH        .8      .0     4.7
MINE        .2      .3      .3
MANU      40.1    13.6    12.2
ELECT       .5     1.0      .6
CONST      6.2    10.6     7.6
WSALE     15.9    22.3    15.3
HOTEL      7.5     7.4     4.7
TRANS      4.8     6.4     4.7
FINANCE    2.4     7.6     1.3
ESTATE     4.3     8.9     1.6
ADMIN      4.3     6.9     6.5
EDU        4.2     4.5     4.7
HEALTH     2.9     2.4     1.3
OTHERs     2.0     3.2     2.3
MAID       2.3     3.7     2.5

Iteration History (a)

            Change in Cluster Centers
Iteration   1         2         3
1         10.000     7.484    10.664
2          2.283      .000     1.753
3          1.743      .000     1.944
4           .000      .000      .000
a. Convergence achieved due to no or small change in cluster centers. The maximum absolute coordinate change for any center is .000. The current iteration is 4. The minimum distance between initial centers is 28.739.

In early iterations, the cluster centres shift quite a lot.

By the 4th iteration, they have settled down to the general area of
their final location, and the last iteration is minor adjustments.

If the algorithm stops because the maximum number of iterations (10 in
this example) is reached, you may want to increase the maximum
because the solution may otherwise be unstable.

Cluster Membership

Case Number   Cluster   Distance
1                1        5.680
2                1        7.869
3                3        4.522
4                1        6.084
5                1        7.462
6                3        3.860
7                1        8.716
8                3        4.437
9                1       13.639
10               3        8.516
11               3        7.837
12               2        7.484
13               3       10.846
14               2        7.484
Final Cluster Centers

            Cluster
            1       2       3
AGRIC     10.0     1.3    23.2
FISH        .9      .2     2.2
MINE        .3      .3      .5
MANU      30.2    20.1    13.8
ELECT       .6      .9      .6
CONST      7.8    10.0    11.0
WSALE     15.4    19.6    15.3
HOTEL      7.2     6.3     6.1
TRANS      4.7     7.1     3.9
FINANCE    1.9     6.0     1.2
ESTATE     3.0     8.0     2.1
ADMIN      7.1     7.0     8.5
EDU        5.5     4.7     6.7
HEALTH     2.0     2.2     1.7
OTHERs     1.7     2.9     1.7
MAID       1.9     3.4     1.6

The final cluster centres are computed as the mean for each variable
within each final cluster. The final cluster centres reflect the
characteristics of the typical case for each cluster.

Number of Cases in each Cluster

Cluster   1     6.000
          2     2.000
          3     6.000
Valid          14.000
Missing          .000

Distances between Final Cluster Centers

Cluster      1         2         3
1                    15.922    21.514
2         15.922               24.916
3         21.514    24.916

This table shows the Euclidean distances between the final cluster
centres. Greater distances between clusters correspond to greater
dissimilarities.

• Clusters 2 and 3 are most different.
• Cluster 3 is approximately equally
similar to clusters 1 and 2.
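
The distances table can be approximated from the final cluster centres. The sketch below uses the one-decimal centres printed on the slide, so its results agree with the SPSS output only to about the first decimal place; the variable order and names are taken from the table, while `FINAL_CENTERS` and `center_distance` are illustrative names.

```python
import math

# Final cluster centres from the slide, in the order AGRIC, FISH, MINE, MANU,
# ELECT, CONST, WSALE, HOTEL, TRANS, FINANCE, ESTATE, ADMIN, EDU, HEALTH,
# OTHERs, MAID (values are rounded to one decimal in the output).
FINAL_CENTERS = {
    1: [10.0, 0.9, 0.3, 30.2, 0.6, 7.8, 15.4, 7.2, 4.7, 1.9, 3.0, 7.1, 5.5, 2.0, 1.7, 1.9],
    2: [1.3, 0.2, 0.3, 20.1, 0.9, 10.0, 19.6, 6.3, 7.1, 6.0, 8.0, 7.0, 4.7, 2.2, 2.9, 3.4],
    3: [23.2, 2.2, 0.5, 13.8, 0.6, 11.0, 15.3, 6.1, 3.9, 1.2, 2.1, 8.5, 6.7, 1.7, 1.7, 1.6],
}


def center_distance(a, b):
    """Euclidean distance between the centres of clusters a and b."""
    return math.dist(FINAL_CENTERS[a], FINAL_CENTERS[b])


pairs = [(1, 2), (1, 3), (2, 3)]
for a, b in pairs:
    print(f"clusters {a} and {b}: {center_distance(a, b):.3f}")

# The pair with the largest distance is the most dissimilar pair.
most_different = max(pairs, key=lambda p: center_distance(*p))
print("most different:", most_different)
```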

ANOVA

            Cluster               Error
            Mean Square    df     Mean Square    df       F       Sig.
AGRIC        462.014        2       33.735       11     13.696    .001
FISH           4.296        2        1.139       11      3.771    .057
MINE            .064        2         .070       11       .916    .429
MANU         407.604        2       27.601       11     14.768    .001
ELECT           .073        2         .023       11      3.184    .081
CONST         14.986        2        5.299       11      2.828    .102
WSALE         15.184        2        3.097       11      4.903    .030
HOTEL          1.764        2        1.229       11      1.435    .279
TRANS          7.688        2         .426       11     18.037    .000
FINANCE       18.092        2         .579       11     31.242    .000
ESTATE        26.271        2         .764       11     34.375    .000
ADMIN          3.908        2        2.818       11      1.387    .290
EDU            3.922        2        1.390       11      2.821    .103
HEALTH          .167        2         .282       11       .590    .571
OTHERs         1.202        2         .214       11      5.618    .021
MAID           2.535        2         .198       11     12.774    .001

The ANOVA table indicates which variables contribute the most to your
cluster solution. Variables with large F values (or small p-values) provide
the greatest separation between clusters.

The F tests should be used only for descriptive purposes because the clusters have been
chosen to maximize the differences among cases in different clusters. The observed
significance levels are not corrected for this and thus cannot be interpreted as tests of the
hypothesis that the cluster means are equal.
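
Each F value in the table is the cluster mean square divided by the error mean square, so the ranking of variables can be recomputed from the two mean-square columns. The sketch below does this with the rounded values from the slide; small differences from the printed F values are rounding effects, and the names `ANOVA` and `f_ratios` are illustrative.

```python
# (cluster mean square, error mean square) per variable, copied from the table.
ANOVA = {
    "AGRIC": (462.014, 33.735), "FISH": (4.296, 1.139), "MINE": (0.064, 0.070),
    "MANU": (407.604, 27.601), "ELECT": (0.073, 0.023), "CONST": (14.986, 5.299),
    "WSALE": (15.184, 3.097), "HOTEL": (1.764, 1.229), "TRANS": (7.688, 0.426),
    "FINANCE": (18.092, 0.579), "ESTATE": (26.271, 0.764), "ADMIN": (3.908, 2.818),
    "EDU": (3.922, 1.390), "HEALTH": (0.167, 0.282), "OTHERs": (1.202, 0.214),
    "MAID": (2.535, 0.198),
}

# F = between-cluster mean square / within-cluster (error) mean square.
f_ratios = {var: cluster_ms / error_ms for var, (cluster_ms, error_ms) in ANOVA.items()}

# Variables sorted from greatest to least separation between clusters.
ranking = sorted(f_ratios, key=f_ratios.get, reverse=True)
for var in ranking:
    print(f"{var:8s} F = {f_ratios[var]:.3f}")
```

On these figures ESTATE, FINANCE and TRANS separate the clusters most strongly, while HEALTH and MINE contribute least.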

A new SPSS file was created to store the results

Transpose the row data to column data:
Data → Transpose → select all variables needed → OK

Plot of Distances from Cluster Center by Cluster Membership
This is a diagnostic plot that helps you to find outliers within clusters.
► Click the Graphs menu and select Chart Builder.
► Click the Gallery tab, select Boxplot from the list of chart types, and drag and drop the Simple Boxplot icon onto the canvas.
► Drag and drop Distance of Case from its Classification Cluster Center onto the y axis.
► Drag and drop Cluster Number of Case onto the x axis.
► Click OK to create the boxplot.

Discriminant Analysis

• Discriminant analysis is used to model the value
of a dependent categorical variable based on its
relationship to one or more predictors.
• Given a set of independent variables,
discriminant analysis attempts to find linear
combinations of those variables that best
separate the groups of cases. These
combinations are called discriminant functions
and have the form displayed in the equation

\[
d_{ik} = b_{0k} + b_{1k} x_{i1} + \cdots + b_{pk} x_{ip}
\]

where d_{ik} is the value of the k-th discriminant function for case i, x_{ij} is the value of predictor j for case i, p is the number of predictors, and b_{jk} are the coefficients of the k-th function.

The number of functions equals min(#groups-1, #predictors).

The procedure automatically chooses a first function that


will separate the groups as much as possible. It then
chooses a second function that is both uncorrelated with
the first function and provides as much further separation
as possible. The procedure continues adding functions in
this way until reaching the maximum number of functions
as determined by the number of predictors and categories
in the dependent variable.

The discriminant model has the following assumptions:
1. The predictors are not highly correlated with each other
2. The mean and variance of a given predictor are not
correlated
3. The correlation between two predictors is constant across
groups
4. The values of each predictor have a normal distribution
Exercise:
Suppose you are a political analyst and want to identify characteristics
that indicate which voters are likely to vote for a given USA presidential
candidate, and you want to use those characteristics to identify supporters
and opponents.

Suppose information on 1569 past voters is contained in SPSS file
‘Data21’. Use a random sample of 1102 of these voters (about 70%) to
create a discriminant analysis model, setting the remaining voters aside to
validate the analysis. Then use the model to classify the 466 held-out voters
as supporters or opponents.
Generate random selection of cases

1. Transform menu
2. Random Number Generators
3. Select ‘Set Starting Point’
4. Select ‘Fixed Value’ and type ‘9191972’
5. OK
6. Transform menu
7. Compute Variable
8. Type validate in the Target Variable text box.
9. Type rv.bernoulli(0.7) in the Numeric Expression text box.
10. OK

This sets the values of validate to be randomly generated Bernoulli
variates with probability parameter 0.7. Approximately 70% of the
past voters will have a validate value of 1. These voters will be used to
create the model. The remaining past voters will be used to validate the
model results.
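
The same idea can be sketched outside SPSS. The example below draws a 0/1 flag with probability 0.7 for each of the 1569 voters using Python's standard `random` module. Python's generator is not the SPSS generator, so reusing the seed 9191972 will not reproduce the slide's 1102/466 split; it only makes this sketch repeatable.

```python
import random

N_VOTERS = 1569           # number of past voters in the exercise
rng = random.Random(9191972)  # fixed seed so the split is repeatable

# Bernoulli(0.7) draw per voter: 1 = use to build the model, 0 = hold out.
validate = [1 if rng.random() < 0.7 else 0 for _ in range(N_VOTERS)]

n_model = sum(validate)
n_holdout = N_VOTERS - n_model
print(f"model-building cases: {n_model} ({n_model / N_VOTERS:.1%})")
print(f"held-out cases:       {n_holdout}")
```

Because each case is drawn independently, the share of cases flagged 1 is only approximately 70%, which is why the exercise speaks of "about 70%".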

1. Analyze menu
2. Classify
3. Discriminant
4. Select ‘candidate’ as the grouping variable
5. Define range: minimum=1, maximum=2
6. Continue
7. Select the other variables as the independents
8. Select ‘validate’ as the selection variable
9. Value = 1
10. Continue
11. Statistics
12. Select Means, Univariate ANOVA, Box’s M
13. Select ‘Fisher’s’ and ‘Unstandardized’
14. Select ‘Within-groups correlation’
15. Continue
16. Classify
17. Select ‘Summary table’ and ‘Leave-one-out classification’
18. Continue
19. Save
20. Select ‘Predicted group membership’, ‘Probabilities of group membership’
21. Continue
22. OK
Classification Statistics
The classification functions are used to assign cases to
groups.

Prior Probabilities for Groups

VOTE FOR                     Cases Used in Analysis
CLINTON, BUSH     Prior      Unweighted    Weighted
Bush               .500         466         466.000
Clinton            .500         636         636.000
Total             1.000        1102        1102.000

Classification Function Coefficients

                                      VOTE FOR CLINTON, BUSH
                                      Bush        Clinton
AGE OF RESPONDENT                      .589         .563
age categories                       -4.613       -4.213
HIGHEST YEAR OF SCHOOL COMPLETED      5.761        5.740
RS HIGHEST DEGREE                    -9.436       -9.411
RESPONDENTS SEX                       6.604        7.126
(Constant)                          -46.455      -46.770
Fisher's linear discriminant functions

The coefficients for Age and Highest year of school completed are smaller
for the Clinton classification function, which means that older voters and
voters with more years of schooling are less likely to be classified as
Clinton supporters.

Similarly, voters with larger age categories and higher degrees are more
likely to be classified as Clinton supporters.
There is a separate function for each group. For each case, a
classification score is computed for each function. The discriminant model
assigns the case to the group whose classification function obtained the
highest score.

Classification function for Bush:
Y1 = −46.455 + 0.589 Age − 4.613 AgeCategory + 5.761 School − 9.436 Degree + 6.604 Sex

Classification function for Clinton:
Y2 = −46.770 + 0.563 Age − 4.213 AgeCategory + 5.740 School − 9.411 Degree + 7.126 Sex

Example: a voter with Age=53, AgeCategory=3, School=12, Degree=1, Sex=1 yields
Y1=37.223 and Y2=37.025. Thus, this voter will be classified as a Bush supporter.

The discriminant model predicts that there is only about a 44.86%
chance that this voter will vote for Clinton, so the voter is more likely to vote for Bush
(but is not a strong supporter).
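
The worked example can be reproduced from the two classification functions. The sketch below also converts the scores to a group probability using exp(Y_g) / sum(exp(Y_h)), which is the usual rule for linear classification functions and is assumed here rather than taken from the slide. With the rounded coefficients it gives roughly 45% for Clinton, close to the 44.86% reported by SPSS.

```python
import math

# Coefficients copied from the Classification Function Coefficients table.
FUNCTIONS = {
    "Bush":    {"const": -46.455, "age": 0.589, "agecat": -4.613,
                "school": 5.761, "degree": -9.436, "sex": 6.604},
    "Clinton": {"const": -46.770, "age": 0.563, "agecat": -4.213,
                "school": 5.740, "degree": -9.411, "sex": 7.126},
}


def classification_scores(voter):
    """Evaluate every group's classification function for one voter."""
    return {
        group: coef["const"] + sum(coef[name] * value for name, value in voter.items())
        for group, coef in FUNCTIONS.items()
    }


def posterior(scores):
    """Group probabilities from scores (assumed softmax rule, see lead-in)."""
    top = max(scores.values())
    weights = {g: math.exp(s - top) for g, s in scores.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


voter = {"age": 53, "agecat": 3, "school": 12, "degree": 1, "sex": 1}
scores = classification_scores(voter)
predicted = max(scores, key=scores.get)   # group with the highest score
probs = posterior(scores)

print({g: round(s, 3) for g, s in scores.items()})
print("predicted group:", predicted)
print({g: round(p, 4) for g, p in probs.items()})
```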
The eigenvalues table provides information about the relative efficacy
of each discriminant function. When there are two groups, the
canonical correlation is the most useful measure in the table, and it is
equivalent to Pearson's correlation between the discriminant scores
and the groups.

Eigenvalues

                                                       Canonical
Function   Eigenvalue   % of Variance   Cumulative %   Correlation
1            .020(a)        100.0           100.0          .141
a. First 1 canonical discriminant functions were used in the analysis.

Wilks' lambda is a measure of how well each function separates cases into
groups. It is equal to the proportion of the total variance in the discriminant
scores not explained by differences among the groups. Smaller values of Wilks'
lambda indicate greater discriminatory ability of the function.

Wilks' Lambda

                       Wilks'
Test of Function(s)    Lambda    Chi-square    df    Sig.
1                       .980       22.038       5    .001

The associated chi-square statistic tests the hypothesis that the means
of the functions listed are equal across groups. The small significance
value indicates that the discriminant function does better than chance at
separating the groups.
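
With a single discriminant function the three statistics are tied together by standard identities: Wilks' lambda = 1 / (1 + eigenvalue), canonical correlation = sqrt(eigenvalue / (1 + eigenvalue)), and Bartlett's approximation chi-square = −(N − 1 − (p + g) / 2) · ln(lambda). The sketch below applies them to the rounded values on the slide; these formulas are textbook relations assumed here, not something the slide states, and the results match the SPSS output only up to rounding.

```python
import math

eigenvalue = 0.020      # from the Eigenvalues table (rounded)
reported_lambda = 0.980  # from the Wilks' Lambda table (rounded)
n_cases = 1102           # cases used in the analysis
n_predictors = 5         # p
n_groups = 2             # g

# One function: lambda and the canonical correlation follow from the eigenvalue.
wilks_lambda = 1 / (1 + eigenvalue)
canonical_corr = math.sqrt(eigenvalue / (1 + eigenvalue))

# Bartlett's chi-square approximation, using the reported lambda.
chi_square = -(n_cases - 1 - (n_predictors + n_groups) / 2) * math.log(reported_lambda)
df = n_predictors * (n_groups - 1)

print(f"Wilks' lambda         ~ {wilks_lambda:.3f}")
print(f"canonical correlation ~ {canonical_corr:.3f}")
print(f"chi-square            ~ {chi_square:.2f} on {df} df")
```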
Checking Collinearity of Predictors: The within-groups correlation matrix
shows the correlations between the predictors.

Pooled Within-Groups Matrices (Correlation; columns follow the same order as the rows)

                                           (1)      (2)      (3)      (4)      (5)
(1) AGE OF RESPONDENT                     1.000     .943    -.306    -.246     .037
(2) age categories                         .943    1.000    -.247    -.189     .021
(3) HIGHEST YEAR OF SCHOOL COMPLETED      -.306    -.247    1.000     .870    -.068
(4) RS HIGHEST DEGREE                     -.246    -.189     .870    1.000    -.066
(5) RESPONDENTS SEX                        .037     .021    -.068    -.066    1.000

Tests of Equality of Group Means: The tests of equality of group means
measure each independent variable's potential before the model is created.

Tests of Equality of Group Means

                                     Wilks'
                                     Lambda       F       df1     df2     Sig.
AGE OF RESPONDENT                     1.000      .069      1     1100     .794
age categories                        1.000      .194      1     1100     .660
HIGHEST YEAR OF SCHOOL COMPLETED      1.000      .147      1     1100     .701
RS HIGHEST DEGREE                     1.000      .042      1     1100     .837
RESPONDENTS SEX                        .985    16.822      1     1100     .000

Each test displays the results of a one-way ANOVA for the independent
variable using the grouping variable as the factor. If the significance value is
greater than 0.10, the variable probably does not contribute to the model.

Wilks' lambda is another measure of a variable's potential. Smaller values
indicate the variable is better at discriminating between groups.

Only the variable Sex is significant in the discriminant model.
Standardized Canonical Discriminant Function Coefficients

                                     Function
                                        1
AGE OF RESPONDENT                    -1.497
age categories                        1.453
HIGHEST YEAR OF SCHOOL COMPLETED      -.210
RS HIGHEST DEGREE                      .104
RESPONDENTS SEX                        .886

The standardized coefficients allow you to compare variables measured on
different scales. Coefficients with large absolute values correspond to
variables with greater discriminating ability. It downgrades the importance
of Sex.

Structure Matrix

                                     Function
                                        1
RESPONDENTS SEX                        .868
age categories                         .093
HIGHEST YEAR OF SCHOOL COMPLETED      -.081
AGE OF RESPONDENT                     -.055
RS HIGHEST DEGREE                     -.043
Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions.
Variables ordered by absolute size of correlation within function.

The structure matrix shows the correlation of each predictor variable with
the discriminant function. The ordering in the structure matrix is the same
as that suggested by the tests of equality of group means and is different
from that in the standardized coefficients table. This disagreement is likely
due to the collinearity noted in the correlation matrix.

Since the structure matrix is unaffected by collinearity, it's safe to say that this
collinearity has inflated the importance of Age, Age Categories, Highest year of
school completed and Highest degree in the standardized coefficients table.
Thus, voter’s sex best discriminates between supporters and opponents.
Checking for Correlation of Group Means and Variances: The group
statistics table reveals a potentially more serious problem. For all five
predictors, larger group means tend to associate with larger group
standard deviations.

Group Statistics

VOTE FOR                                                                       Valid N (listwise)
CLINTON, BUSH                                         Mean    Std. Deviation   Unweighted   Weighted
Bush      AGE OF RESPONDENT                         48.9893      16.53325         466       466.000
          age categories                             2.5300       1.05757         466       466.000
          HIGHEST YEAR OF SCHOOL COMPLETED          13.9614       2.65156         466       466.000
          RS HIGHEST DEGREE                          1.7103       1.18402         466       466.000
          RESPONDENTS SEX                            1.5193        .50016         466       466.000
Clinton   AGE OF RESPONDENT                         48.7264      16.41338         636       636.000
          age categories                             2.5582       1.04002         636       636.000
          HIGHEST YEAR OF SCHOOL COMPLETED          13.8931       3.10213         636       636.000
          RS HIGHEST DEGREE                          1.6950       1.25292         636       636.000
          RESPONDENTS SEX                            1.6415        .47993         636       636.000
Total     AGE OF RESPONDENT                         48.8376      16.45719        1102      1102.000
          age categories                             2.5463       1.04709        1102      1102.000
          HIGHEST YEAR OF SCHOOL COMPLETED          13.9220       2.91902        1102      1102.000
          RS HIGHEST DEGREE                          1.7015       1.22374        1102      1102.000
          RESPONDENTS SEX                            1.5898        .49209        1102      1102.000

In particular, look at Highest year of school completed, for which the
Clinton group has a lower mean but a considerably higher standard
deviation. In further analysis, you may want to consider
using transformed values of this predictor.
Checking Homogeneity of Covariance Matrices: Box’s Test
H0: There is equality of covariance matrices
H1: The covariance matrices are not equal

Log Determinants

VOTE FOR CLINTON, BUSH    Rank    Log Determinant
Bush                        5         2.865
Clinton                     5         3.166
Pooled within-groups        5         3.075
The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Log determinants are a measure of the variability of the groups. Larger log
determinants correspond to more variable groups. Large differences in log
determinants indicate groups that have different covariance matrices.

Test Results

Box's M            39.887
F    Approx.        2.646
     df1               15
     df2          4016676
     Sig.            .001
Tests null hypothesis of equal population covariance matrices.

Since Box's M is significant, you should request separate matrices to see if
it gives radically different classification results. See the section on
specifying separate-groups covariance matrices for more information.
Model Validation
The classification table shows the practical results of using the discriminant model.

Classification Results (b,c,d)

                                                      Predicted Group Membership
                                      VOTE FOR
                                      CLINTON, BUSH     Bush     Clinton     Total
Cases Selected       Original          Count   Bush      240       226        466
                                               Clinton   250       386        636
                                       %       Bush      51.5      48.5      100.0
                                               Clinton   39.3      60.7      100.0
                     Cross-validated   Count   Bush      238       228        466
                     (a)                       Clinton   255       381        636
                                       %       Bush      51.1      48.9      100.0
                                               Clinton   40.1      59.9      100.0
Cases Not Selected   Original          Count   Bush      105        90        195
                                               Clinton   120       151        271
                                       %       Bush      53.8      46.2      100.0
                                               Clinton   44.3      55.7      100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 56.8% of selected original grouped cases correctly classified.
c. 54.9% of unselected original grouped cases correctly classified.
d. 56.2% of selected cross-validated grouped cases correctly classified.

Of the cases used to create the model, 386 of the 636 voters who previously
voted for Clinton are classified correctly, and 240 of the 466 who voted for
Bush are classified correctly. Overall, 56.8% of the cases are classified correctly.

Classification based on the same cases used to create the model tends to be
too optimistic. The cross-validated section of the table attempts to correct
this by classifying each case while leaving it out of the model calculations.

Subset validation is obtained by classifying past voters who were not
used to create the model. These results are shown in the Cases Not
Selected section of the table.
54.9% of these cases were correctly classified by the model. This suggests
that, overall, your model is in fact correct about half of the time.
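
The three percentages in the table footnotes are overall hit rates: correctly classified cases divided by all cases in that section. A short sketch using the counts from the classification table (the dictionary layout and the name `hit_rate` are illustrative):

```python
# Counts from the Classification Results table:
# section -> {actual group: (predicted Bush, predicted Clinton)}
TABLE = {
    "selected, original":        {"Bush": (240, 226), "Clinton": (250, 386)},
    "selected, cross-validated": {"Bush": (238, 228), "Clinton": (255, 381)},
    "not selected, original":    {"Bush": (105, 90),  "Clinton": (120, 151)},
}


def hit_rate(section):
    """Share of cases whose predicted group equals the actual group."""
    correct = section["Bush"][0] + section["Clinton"][1]
    total = sum(section["Bush"]) + sum(section["Clinton"])
    return correct / total


rates = {name: hit_rate(section) for name, section in TABLE.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```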
Separate-groups: Creates
separate-group scatterplots of
the first two discriminant
function values. If there is only
one function, histograms are
displayed instead.

• Combined-groups. Creates an all-groups
scatterplot of the first two discriminant function
values. If there is only one function, a histogram
is displayed instead.
• Territorial map. A plot of the boundaries used to
classify cases into groups based on function
values. The numbers correspond to groups into
which cases are classified. The mean for each
group is indicated by an asterisk within its
boundaries. The map is not displayed if there is
only one discriminant function.

SPSS also produces an ASCII territorial map plot which shows the relative
location of the boundaries of the different categories.

[Figure: territorial map with regions labelled Territory for Group 1, Territory for Group 2 and Territory for Group 3]

• The Discriminant Analysis procedure is useful
for modeling the relationship between a
categorical dependent variable and one or more
scale independent variables.
• If your dependent variable is scale, use the
Linear Regression procedure.
• Alternatively, if your dependent variable is scale,
try the GLM Univariate procedure.
• If your predictors are multicollinear and you want
to reduce their number, use the Factor Analysis
procedure.

Differences between DA and CA
• In clustering, the category of the object is unknown.
However, we know the rule to classify (usually based on
distance) and we also know the features (independent
variables) that can describe the classification of the
object. There is no training example to examine whether
the classification is correct or not. Thus, the objects are
assigned into groups merely based on the given rule.
• In discriminant analysis, object groups and several
training examples of objects that have been grouped are
known. The model of classification is also given (e.g.
linear or quadratic) and we want to know the best fit
parameters of the model that can best separate the
objects based on the training samples.

Neural
Network
Method

Day 3 – Data Science 168


ANNs – The basics
• ANNs incorporate the two fundamental
components of biological neural nets:

1. Neurones (nodes)
2. Synapses (weights)



The Key Elements of Neural Networks
• Neural computing requires a number of neurons to be
connected together into a neural network. Neurons are
arranged in layers.

[Figure: a single neuron with inputs p1, p2, p3, weights w1, w2, w3, a bias input fixed at 1, activation function f and output a]

\[
a = f(p_1 w_1 + p_2 w_2 + p_3 w_3 + b) = f\left(\sum_i p_i w_i + b\right)
\]

• Each neuron within the network is usually a simple
processing unit which takes one or more inputs and
produces an output. At each neuron, every input has an
associated weight which modifies the strength of each
input. The neuron simply adds together all the weighted inputs and
calculates an output to be passed on.
Activation functions
 The activation function is generally non-linear. Linear
functions are limited because the output is simply
proportional to the input.
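To illustrate the limitation, the following sketch (hypothetical weights, scalar case) feeds one hidden neuron into one linear output neuron. With a linear activation the whole network is still a straight line in the input; with a logistic activation it is not.

```python
import math


def linear(n):
    return n


def logistic(n):
    return 1.0 / (1.0 + math.exp(-n))


def two_layer(x, f, w1=2.0, b1=-1.0, w2=3.0, b2=0.5):
    """One hidden neuron with activation f, then a linear output neuron.
    The weight and bias values are hypothetical."""
    hidden = f(w1 * x + b1)
    return w2 * hidden + b2


# Equal input steps give equal output steps only in the linear case.
lin_steps = [two_layer(x + 1, linear) - two_layer(x, linear) for x in (0, 1)]
log_steps = [two_layer(x + 1, logistic) - two_layer(x, logistic) for x in (0, 1)]
```

Here `lin_steps` contains two equal increments, while the increments in `log_steps` differ, which is what allows a non-linear network to model curved relationships.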



Perceptrons
Neuron Model
The perceptron neuron produces a 1 if the net input into the transfer function is equal to or greater than 0; otherwise it produces a 0.
[Figures: Architecture; Decision boundaries]
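The perceptron rule above can be sketched as follows. The weights and bias are hypothetical values chosen so that the neuron behaves like a logical AND; its decision boundary is then the line p1 + p2 = 1.5.

```python
def hard_limit(n):
    """Transfer function: 1 if the net input is >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def perceptron(inputs, weights, bias):
    """Output of a single perceptron neuron."""
    net = sum(p * w for p, w in zip(inputs, weights)) + bias
    return hard_limit(net)


# Hypothetical weights and bias implementing a logical AND.
weights, bias = [1.0, 1.0], -1.5
outputs = [perceptron([p1, p2], weights, bias)
           for p1 in (0, 1) for p2 in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```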



Feed-forward nets
Information flow is unidirectional
Data is presented to Input layer
Passed on to Hidden Layer
Passed on to Output layer

Information is distributed

Information processing is parallel

Internal representation (interpretation) of data



 Feeding data through the net:

$$(1 \times 0.25) + (0.5 \times (-1.5)) = 0.25 + (-0.75) = -0.5$$

Squashing:

$$\frac{1}{1 + e^{0.5}} = 0.3775$$
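The same calculation in Python, as a quick check. Reading 1 and 0.5 as the inputs and 0.25 and -1.5 as the weights is an assumption (the slide does not say which factor is which), and the squashing function is taken to be the logistic function 1 / (1 + e^(-n)).

```python
import math


def logistic(n):
    """Squashing function 1 / (1 + e^(-n))."""
    return 1.0 / (1.0 + math.exp(-n))


inputs = [1.0, 0.5]      # assumed to be the inputs
weights = [0.25, -1.5]   # assumed to be the weights
net = sum(p * w for p, w in zip(inputs, weights))
output = logistic(net)
print(net, round(output, 4))  # -0.5 0.3775
```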



Backpropagation
1. A set of examples for training the
network is assembled. Each
case consists of a problem
statement (which represents the
input into the network) and the
corresponding solution (which
represents the desired output
from the network).
2. The input data is entered into the
network via the input layer.



Backpropagation
3. Each neuron in the network processes
the input data with the resultant values
steadily "percolating" through the
network, layer by layer, until a result is
generated by the output layer.
4. The actual output of the network is
compared to expected output for that
particular input. This results in an error
value. The connection weights in the
network are gradually adjusted,
working backwards from the output
layer, through the hidden layer, and to
the input layer, until the correct output
is produced.
5. Fine tuning the weights in this way
has the effect of teaching the
network how to produce the correct
output for a particular input, i.e.
the network learns.
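The steps above can be sketched for a very small network: two inputs, two hidden neurons and one output neuron, all with logistic activation and no biases. This is a simplified illustration, not the SPSS implementation; the training case, starting weights and learning rate are hypothetical.

```python
import math


def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))


def forward(x, w_hidden, w_out):
    """Steps 2-3: feed x through the hidden layer to the output neuron."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, ws))) for ws in w_hidden]
    output = sigmoid(sum(h * w for h, w in zip(hidden, w_out)))
    return hidden, output


def train_step(x, target, w_hidden, w_out, rate):
    """Step 4: compare with the target and adjust weights backwards."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output: (target - output) * derivative of sigmoid.
    delta_out = (target - output) * output * (1.0 - output)
    # Error signal passed back to each hidden neuron.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w + rate * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + rate * d * xi for w, xi in zip(ws, x)]
                    for ws, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out


# Step 1: one hypothetical training case (input and desired output).
x, target = [1.0, 0.5], 1.0
w_hidden = [[0.25, -1.5], [0.4, 0.3]]  # hypothetical starting weights
w_out = [0.5, -0.5]

_, before = forward(x, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = train_step(x, target, w_hidden, w_out, rate=0.5)
_, after = forward(x, w_hidden, w_out)  # step 5: output is now closer to target
```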



The Learning Rule
 The delta rule is often utilized by the most common class
of ANNs called backpropagational neural networks.

[Figure: network diagram labelled Input and Desired Output]

 When a neural network is initially presented with a
pattern it makes a random guess as to what it might be.
It then sees how far its answer was from the actual one
and makes an appropriate adjustment to its connection
weights.
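A minimal sketch of the delta rule for a single linear neuron: each weight is moved by rate × error × input, where error = desired output minus actual output. The pattern, desired output and learning rate are hypothetical, and the starting weights are fixed at zero (rather than random) so that the run is reproducible.

```python
def delta_rule_step(weights, inputs, target, rate):
    """One delta-rule update; returns the new weights and the error seen."""
    output = sum(w * p for w, p in zip(weights, inputs))
    error = target - output
    new_weights = [w + rate * error * p for w, p in zip(weights, inputs)]
    return new_weights, error


inputs, target = [1.0, 0.5], 1.0   # hypothetical pattern and desired output
weights = [0.0, 0.0]               # fixed start instead of a random guess
errors = []
for _ in range(50):
    weights, error = delta_rule_step(weights, inputs, target, rate=0.1)
    errors.append(abs(error))
```

The list `errors` shrinks from one presentation to the next, which is the "appropriate adjustment" described above.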
Recurrent Networks
 Feed forward networks:
 Information only flows one way
 One input pattern produces one output
 No sense of time (or memory of previous state)

 Recurrency
 Nodes connect back to other nodes or themselves
 Information flow is multidirectional
 Sense of time and memory of previous state(s)
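The difference can be sketched with a single neuron whose output is fed back to itself. The weights and the tanh activation are hypothetical choices; setting the recurrent weight to 0 turns the unit back into a feed-forward neuron.

```python
import math


def run_sequence(xs, w_in, w_rec):
    """Feed a sequence through one neuron with a self-connection.
    state_t = tanh(w_in * x_t + w_rec * state_(t-1))"""
    state = 0.0
    outputs = []
    for x in xs:
        state = math.tanh(w_in * x + w_rec * state)
        outputs.append(state)
    return outputs


same_input_twice = [1.0, 1.0]
recurrent = run_sequence(same_input_twice, w_in=0.5, w_rec=0.8)
feed_forward = run_sequence(same_input_twice, w_in=0.5, w_rec=0.0)
```

With the recurrent connection the same input produces a different output the second time, because the previous state is remembered; without it the two outputs are identical.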



1. Open ‘Data19[dengue fine]’.
Analyze > Neural Networks: choose MLP or RBF.
- The MLP procedure can find more complex relationships, while the RBF procedure is faster.
Variables: specify the dependent variable (scale, categorical or both), specify the covariates, and select the rescaling method.
[Screenshot: SPSS dialogs with numbered callouts 2, 3, 6 and 7]

Partitions: use 70% training : 30% testing.
Architecture: specify the number of hidden layers, the number of units and the activation function; specify the output activation function.
[Screenshot: SPSS dialogs with numbered callouts 8 to 12]
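For readers who want to see what a 70% : 30% partition means outside the dialog, here is a minimal Python sketch. It is not SPSS's own partitioning algorithm; the function name, the fixed seed and the 100 dummy cases are all assumptions for illustration.

```python
import random


def partition(cases, train_fraction=0.7, seed=0):
    """Shuffle the cases and split them into training and testing sets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


training, testing = partition(range(100))  # 100 hypothetical case numbers
```

The network is fitted on `training` only; `testing` is held back to check how well it performs on cases it has not seen.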

Output: specify the network structure and the network performance methods; check Independent variable importance analysis.
Click OK.
[Screenshot: SPSS dialogs with numbered callouts 13 to 17]

Main Reference

Abd Rahim Md Nor (2009), Statistical Methods in Research, Petaling Jaya: Prentice Hall.
