MULTIPLE DISCRIMINANT
ANALYSIS
In partial fulfilment of covering the course of Business Research
Methods
Submitted To:
Submitted By:
This report is intended to be submitted as an end term project for the course titled Multiple
Discriminant Analysis. We are thankful to our course professor Alok Kumar, Professor, Fore
School of Management, New Delhi, for his guidance and assistance in the completion of this
course and project.
Danish Sharma
Deeksha Dixit
Parth Dhingra
Safal Tyagi
Samik Mahotra
Tanmay Mathur
FMG-25A
Fore School of Management
New Delhi
Introduction
Multiple Discriminant Analysis (MDA) is a method for compressing a multivariate signal to
yield a lower-dimensional signal amenable to classification.
MDA is useful because most classifiers are strongly affected by the curse of dimensionality.
In other words, when signals are represented in very-high-dimensional spaces, the classifier's
performance is catastrophically impaired by the overfitting problem. This problem is reduced
by compressing the signal down to a lower-dimensional space as MDA does.
Multiple discriminant analysis (MDA), also known as canonical variates analysis (CVA) or
canonical discriminant analysis (CDA), constructs functions to maximally discriminate
between n groups of objects. This is an extension of linear discriminant analysis
(LDA) which - in its original form - is used to construct discriminant functions for objects
assigned to two groups.
Key assumptions
- The distribution of the original variables is assumed to be (close to) multivariate normal in each group.
- Explanatory variables are continuous. Categorical explanatory variables should be evaluated by, e.g., discriminant correspondence analysis.
- The covariance matrices of each group should be (near) equal.
- It is assumed that multivariate linear functions can be used to discriminate between groups.
- The number of samples (objects) must be greater than the number of variables in the analysis.
- There should be at least two objects per group.
- Variables should be homoscedastic. If the mean of a variable is correlated with its variance, significance tests may be invalid.
- There should be no linear dependency between explanatory variables.
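The two sample-size assumptions above (more objects than variables, and at least two objects per group) can be checked mechanically before running the analysis. The following is a minimal sketch; the function name and the group sizes are hypothetical and are not taken from any of the datasets in this report.

```python
def check_sample_requirements(group_sizes, n_variables):
    """Return a list of violated sample-size assumptions (empty if none)."""
    problems = []
    total_objects = sum(group_sizes.values())
    # Assumption: number of objects must be greater than number of variables.
    if total_objects <= n_variables:
        problems.append("number of objects must exceed number of variables")
    # Assumption: at least two objects in every group.
    for name, size in group_sizes.items():
        if size < 2:
            problems.append(f"group '{name}' has fewer than two objects")
    return problems


# Hypothetical group sizes, for illustration only.
ok_design = {"A": 30, "B": 25, "C": 28}
bad_design = {"A": 30, "B": 1, "C": 28}

print(check_sample_requirements(ok_design, n_variables=3))
print(check_sample_requirements(bad_design, n_variables=3))
```

An empty list means neither of the two conditions is violated; the remaining assumptions (normality, equal covariance matrices, no linear dependency) need separate checks.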
Warnings
- MDA is sensitive to outliers. These should be identified and treated accordingly.
- MDA is only suitable when evaluating the variables' ability to linearly discriminate between any grouping.
- Highly correlated variables will contribute very similarly to an MDA solution and may be redundant. Thus, variables that are uncorrelated are preferable.
- While unequal group sizes can be tolerated, very large differences in group sizes can distort results, particularly if there are very few (< 20) objects per group.
- If MANOVA tests on a given set of explanatory variables are insignificant, MDA is unlikely to be useful.
- When interpreting the coefficients of a discriminant function, carefully distinguish between standardized and unstandardized coefficients.
- Heteroscedasticity is likely to lead to invalid significance tests.
- Across implementations, the absolute values of discriminant weights may vary due to different scaling and standardization approaches, but their relative proportions should be the same.
Chapter 1
In this study, a large international air carrier has collected data on employees in three
different job classifications:
1) customer service personnel,
2) mechanics,
3) dispatchers.
The director of Human Resources wants to know whether these three job classifications appeal
to different personality types. Each employee is administered a battery of psychological
tests which includes measures of interest in outdoor activity, sociability and
conservativeness.
The dataset has 244 observations on four variables: the job classification and three
psychological variables (Outdoor Interests, Social and Conservative Nature).
The following table summarizes a sample of the data:
Here Outdoor, Social and Conservative are independent variables on a continuous scale.
The job categories are evaluated on the basis of these independent variables.
As there are 3 job categories in this multiple discriminant analysis, 2 discriminant
functions will be created.
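The count of discriminant functions follows a general rule: it is the smaller of (number of groups minus 1) and the number of predictor variables. A minimal sketch of that rule, with a hypothetical function name:

```python
def n_discriminant_functions(n_groups, n_predictors):
    """Maximum number of discriminant functions that can be extracted."""
    return min(n_groups - 1, n_predictors)


# Three job categories, three psychological variables.
print(n_discriminant_functions(n_groups=3, n_predictors=3))
```

With three groups the limit is two functions as long as there are at least two predictors, which is why every three-category example in this report yields D1 and D2.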
The above table shows descriptive statistics of the data set, including the mean and standard
deviation of each variable in the different categories. For example, the mean outdoor score
for the customer service job is 12.52 and its standard deviation is 4.649.
The above table shows the correlations between the independent variables; from the table we
can interpret that the variables are only weakly correlated with each other.
The above table shows the eigenvalues of both discriminant functions of our model. An
eigenvalue is the ratio of between-group variation to within-group variation.
Function D1 accounts for 77% of the discriminating ability of the model and function D2
accounts for 22.9%.
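To make the definition of an eigenvalue concrete, the sketch below computes the ratio of between-group to within-group sum of squares for one set of discriminant scores. The scores are hypothetical (they are not the airline data) and the function name is an assumption used only for illustration.

```python
def eigenvalue_from_scores(scores_by_group):
    """Between-group SS divided by within-group SS of discriminant scores."""
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    between = 0.0
    within = 0.0
    for scores in scores_by_group.values():
        group_mean = sum(scores) / len(scores)
        # Variation of the group mean around the grand mean.
        between += len(scores) * (group_mean - grand_mean) ** 2
        # Variation of individual scores around their own group mean.
        within += sum((s - group_mean) ** 2 for s in scores)
    return between / within


# Hypothetical discriminant scores for three groups.
scores = {
    "group_1": [1.0, 2.0, 3.0],
    "group_2": [4.0, 5.0, 6.0],
    "group_3": [7.0, 8.0, 9.0],
}
eigenvalue = eigenvalue_from_scores(scores)
print(eigenvalue)
```

A larger ratio means the group means are far apart relative to the spread inside each group, i.e. the function separates the groups well.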
Wilks' Lambda signifies unexplained variation; hence, the lower the better. As the p value is
less than 0.05, the null hypothesis is rejected.
75% of the original grouped cases are correctly classified by the model which SPSS
created.
Chapter 2
The type of iris for a flower depends upon the length and width of petal and sepal.
Data Analysis
The input variables and data are as follows:
The output is as follows:
Multiple discriminant analysis of 3 categories gives 2 discriminant functions, i.e. D1 and
D2. Tests of equality of group means check whether each variable discriminates significantly
between the groups.
H0: Discriminant function is insignificant.
H1: Discriminant function is significant.
As the significance value of every variable is less than 0.05, the null hypothesis is
rejected and the discriminant function is significant. The pooled within-groups matrices show
the correlations between the independent variables.
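For a single variable, the test of equality of group means is a one-way ANOVA: the univariate Wilks' Lambda is the within-group sum of squares divided by the total sum of squares, and the F statistic follows from it. The sketch below illustrates this on hypothetical values (not the iris measurements); the function name is an assumption.

```python
def wilks_lambda_and_f(values_by_group):
    """Univariate Wilks' Lambda (SS within / SS total) and the matching F."""
    all_values = [v for values in values_by_group.values() for v in values]
    n = len(all_values)
    g = len(values_by_group)
    grand_mean = sum(all_values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = 0.0
    for values in values_by_group.values():
        group_mean = sum(values) / len(values)
        ss_within += sum((v - group_mean) ** 2 for v in values)
    wilks = ss_within / ss_total
    # F with (g - 1, n - g) degrees of freedom.
    f_stat = ((1 - wilks) / wilks) * ((n - g) / (g - 1))
    return wilks, f_stat


# Hypothetical measurements of one variable in three groups.
sample = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, 9.0]}
wilks, f_stat = wilks_lambda_and_f(sample)
print(wilks, f_stat)
```

The significance value reported by SPSS is the p value of this F statistic; a Lambda close to 0 and a large F indicate that the group means differ.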
Eigenvalue = variation between groups / variation within groups.
In this analysis, the first function accounts for 99.7% of the discriminating ability of the
discriminating variables and the second function accounts for 0.3%.
We can verify this by noting that the sum of the eigenvalues is 107.901 + 0.320 = 108.221.
Then (107.901/108.221) = 0.997, i.e. 99.7%,
and (0.320/108.221) = 0.003, i.e. 0.3%.
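The same verification can be done in a few lines of code. This is a small sketch using the two eigenvalues quoted above; each share is the eigenvalue divided by the sum of the eigenvalues, so the second share is about 0.003 as a proportion, or 0.3% as a percentage.

```python
# Eigenvalues of D1 and D2 as quoted in the text.
eigenvalues = [107.901, 0.320]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

for i, p in enumerate(proportions, start=1):
    # Show each share both as a proportion and as a percentage.
    print(f"D{i}: {p:.3f} ({p:.1%})")
```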
Wilks' Lambda signifies unexplained variation; hence, the lower the better.
As the p value is less than 0.05, the null hypothesis is rejected.
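Wilks' Lambda and the canonical correlation can both be derived from the eigenvalues: Lambda for a set of functions is the product of 1/(1 + eigenvalue) over those functions, and the canonical correlation of one function is the square root of eigenvalue/(1 + eigenvalue). The sketch below applies these standard relations to the two eigenvalues quoted for the iris analysis; the function names are assumptions, and the printed values are computed here rather than copied from the SPSS output.

```python
import math

# Eigenvalues of D1 and D2 as quoted in the text.
eigenvalues = [107.901, 0.320]


def wilks_lambda(eigs):
    """Wilks' Lambda for the given set of discriminant functions."""
    result = 1.0
    for ev in eigs:
        result *= 1.0 / (1.0 + ev)
    return result


def canonical_correlation(ev):
    """Canonical correlation associated with one eigenvalue."""
    return math.sqrt(ev / (1.0 + ev))


lambda_1_through_2 = wilks_lambda(eigenvalues)   # test of functions 1 through 2
lambda_2 = wilks_lambda(eigenvalues[1:])         # test of function 2 alone

print(round(lambda_1_through_2, 4), round(lambda_2, 4))
print([round(canonical_correlation(ev), 3) for ev in eigenvalues])
```

Because a large eigenvalue makes 1/(1 + eigenvalue) small, a strongly discriminating function drives Lambda toward 0, which is why lower values are better.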
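Standardized function coefficients indicate the relative contribution of each variable to a
discriminant function; the correlations between the variables and the discriminant functions
are reported separately in the structure matrix.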
D1 = 2.102 * length of sepal [sign lost] 0.057 * width of sepal + 5.059 * length of petal [sign lost] 1.703 * width of petal - 30.017
D2 = 0.645 * length of sepal + 1.743 * width of sepal - 1.228 * length of petal + 2.7 * width of petal - 7.978
The magnitudes of these coefficients indicate how strongly the discriminating variables affect
the score.
From the above output, it can be inferred that the first function carries almost all of the
discriminating ability (99.7%), so the classification is driven mainly by D1.
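A discriminant score is obtained by plugging a flower's measurements into the function. The sketch below evaluates D2 exactly as written above (D1 is not used because two of its operators are missing in the text). The dictionary keys, the function name and the sample measurements are hypothetical and serve only to illustrate the arithmetic.

```python
# Coefficients and constant of D2 as given in the text.
D2_COEFFICIENTS = {
    "sepal_length": 0.645,
    "sepal_width": 1.743,
    "petal_length": -1.228,
    "petal_width": 2.7,
}
D2_CONSTANT = -7.978


def d2_score(flower):
    """Weighted sum of the four measurements plus the constant."""
    weighted = sum(D2_COEFFICIENTS[name] * flower[name] for name in D2_COEFFICIENTS)
    return weighted + D2_CONSTANT


# Hypothetical flower measurements, for illustration only.
flower = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}
score = d2_score(flower)
print(round(score, 4))
```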
Chapter 3
Data Analysis
BP values are categorized into: Low (<120), Normal (120 to 140), High (>140).
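The grouping variable can be derived from the raw blood pressure reading with a simple rule. This is a minimal sketch of the three categories stated above, assuming that the boundaries 120 and 140 both belong to the Normal group; the function name is hypothetical.

```python
def bp_category(bp):
    """Map a blood pressure value to Low, Normal or High."""
    if bp < 120:
        return "Low"
    if bp <= 140:
        return "Normal"
    return "High"


# Hypothetical readings, for illustration only.
for reading in (110, 120, 140, 155):
    print(reading, bp_category(reading))
```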
Discriminant
Cases                                                                                N    Percent
Valid                                                                               42        8.9
Excluded: Missing or out-of-range group codes                                        0         .0
Excluded: At least one missing discriminating variable                               0         .0
Excluded: Both missing or out-of-range group codes and at least one
          missing discriminating variable                                          432       91.1
Excluded: Total                                                                    432       91.1
Total                                                                              474      100.0
[SPSS output tables whose values were not preserved:]
Group Statistics (Valid N, listwise)
Eigenvalues (Function, Eigenvalue, % of Variance, Cumulative %, Canonical Correlation)
Wilks' Lambda
Standardized Canonical Discriminant Function Coefficients (Functions 1 and 2)
Pooled within-groups correlations between discriminating variables and standardized canonical
discriminant functions (Functions 1 and 2; variables ordered by absolute size of correlation
within function)
Unstandardized coefficients (Functions 1 and 2; variable bldprssr)
Unstandardized canonical discriminant functions evaluated at group means
Classification Statistics
Classification Processing Summary
Processed                                                        474
Excluded: Missing or out-of-range group codes                      0
Excluded: At least one missing discriminating variable           432
Used in Output                                                    42
Classification Results
Actual group     Low   Normal   High   Total
Normal             0       18      0      18
High               0        0     11      11

Normal             0       18      0      18
High               0        0     11      11
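The share of correctly classified cases (the hit ratio) is the sum of the diagonal of the classification table divided by the total number of cases. The sketch below uses only the Normal and High rows that survive above; the row for the Low group is missing, and reading the three count columns as Low, Normal and High is an assumption, so the result describes these two rows only.

```python
# Surviving rows of the classification table: actual group -> predicted counts.
# Column order Low, Normal, High is assumed; the Low row is missing in the text.
classification = {
    "Normal": {"Low": 0, "Normal": 18, "High": 0},
    "High": {"Low": 0, "Normal": 0, "High": 11},
}


def hit_ratio(table):
    """Correctly classified cases divided by all cases in the table."""
    correct = sum(row.get(actual, 0) for actual, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return correct / total


ratio = hit_ratio(classification)
print(f"{ratio:.0%} of the listed cases are correctly classified")
```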
Chapter 4
A nutritionist wants to establish whether there is a relationship between the variety of milk
and the composition of the milk.
To determine this, suitable research was conducted. Initially the variables that describe the
composition of the milk were identified; these variables were:
A. Water Content
B. Fat Content
C. Carbs Content
Data was collected and the milk samples were classified into three groups as follows:
A. Skimmed (code = 0)
B. Toned (code = 1)
C. Double Toned (code = 2)
Objective of the Problem
1. To understand the working of multiple discriminant analysis using SPSS.
2. To understand whether the composition of elements in the milk is related to its
variety.
Data Analysis
The sample taken for multiple discriminant analysis is as follows. Here 0 refers to skimmed
milk, 1 refers to toned milk and 2 refers to double toned milk.
Here Water, Fats and Carbs are independent variables on a continuous scale. On the basis of
these independent variables, the milk variety is evaluated.
As there are 3 milk varieties in this multiple discriminant analysis, 2 discriminant
functions will be created.
By tests of equality of group means we check the significant discriminating ability of the
different variables in the discriminant model.
H0: Discriminant function is insignificant.
H1: Discriminant function is significant.
As the significance value of every variable is less than 0.05, the null hypothesis is
rejected: the discriminant function is significant and all the variables in the model are
significant.
The above table shows the correlations between the independent variables; from the table we
can interpret that the variables are only weakly correlated with each other.
The above table shows the eigenvalues of both discriminant functions of our model. An
eigenvalue is the ratio of between-group variation to within-group variation.
Function D1 accounts for 90.4% of the discriminating ability of the model and function D2
accounts for 9.6%.
Wilks' Lambda signifies unexplained variation; hence, the lower the better. For the first
function the p value is less than 0.05, so the null hypothesis is rejected. The second
function is insignificant, as its p value is more than 0.05.
The distribution of the scores from each function is standardized to have a mean of zero
and standard deviation of one.
The magnitudes of these coefficients indicate how strongly the discriminating variables
affect the score.
Recommendation and Suggestions
The model classifies 90% of the data correctly; hence it is a model with decent accuracy for
categorization.
Chapter 5
A credit card bank has been in business for the last 14 years; during the last 2 years its
repayment defaults have shot up considerably. Even though the bank charges a penalty interest
on all late payments, this high default rate is putting a lot of pressure on the bank's recovery
mechanism and has now begun to impact its profitability in this activity. The problem appears
to be the credit appraisal mechanism used by the bank to evaluate credit card applicants at the
time of credit card allotment. Hence the bank desires to revamp its appraisal system using its
past experience.
To determine this, the bank conducted suitable research. Initially the variables that have an
impact on consumers' creditworthiness were identified; these variables were:
A. Consumer's age.
B. Monthly household income.
C. Number of years married.
Historical data was collected from the bank's own records and consumers were classified into
three groups as follows:
A. High risk (code = 1)
B. Medium risk (code = 2)
C. Low risk (code = 3)
This was done based on the bank's experience with the customers during the last two years.
Objective of the Problem
1. To understand the working of multiple discriminant analysis using SPSS.
2. To understand whether the risk allocated to each customer is related to age, income and
number of years married.
Data Analysis
The sample taken for multiple discriminant analysis is as follows. Here 1 refers to high
risk, 2 refers to medium risk and 3 refers to low risk.
Here Age, Income and Years of marriage are independent variables on a continuous scale.
On the basis of these independent variables, risk is evaluated.
As there are 3 risk categories in this multiple discriminant analysis, 2 discriminant
functions will be created.
The table below shows descriptive statistics of the data set, including the mean and standard
deviation of each variable in the different categories. For example, the mean age of the
high risk group is 26.70 and its standard deviation is 4.785.
By tests of equality of group means we check the significant discriminating ability of the
different variables in the discriminant model.
H0: Discriminant function is insignificant.
H1: Discriminant function is significant.
As the significance value of every variable is less than 0.05, the null hypothesis is
rejected: the discriminant function is significant and all the variables in the model are
significant.
The above table shows the correlations between the independent variables; from the table we
can interpret that the variables are only weakly correlated with each other.
The above table shows the eigenvalues of both discriminant functions of our model. An
eigenvalue is the ratio of between-group variation to within-group variation.
Function D1 accounts for 98% of the discriminating ability of the model and function D2
accounts for 2%.
Wilks' Lambda signifies unexplained variation; hence, the lower the better. For the first
function the p value is less than 0.05, so the null hypothesis is rejected. The second
function is insignificant, as its p value is more than 0.05.
Data Analysis
Multiple discriminant analysis was done on the basis of this sample.
Here Sugar, Fat and Milk are independent variables on a continuous scale. On the basis of
these independent variables, the categories were evaluated.
As there are 3 categories in this multiple discriminant analysis, 2 discriminant functions
will be created.
The above table shows descriptive statistics of the data set, including the mean and standard
deviation of each variable in the different categories. For example, the mean fat of Nestle
is 5.93 and its standard deviation is 0.829.
The above table shows the correlations between the independent variables; from the table we
can interpret that the variables are only weakly correlated with each other.
The above table shows the eigenvalues of both discriminant functions of our model. An
eigenvalue is the ratio of between-group variation to within-group variation.
Function D1 accounts for 99.3% of the discriminating ability of the model and function D2
accounts for 0.7%.
Wilks' Lambda signifies unexplained variation; hence, the lower the better. As the p value is
less than 0.05 for D1, the null hypothesis is rejected for it, whereas the p value is greater
than 0.05 for D2, hence the null hypothesis cannot be rejected for it.
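Standardized function coefficients indicate the relative contribution of each variable to a
discriminant function; the correlations between the variables and the discriminant functions
are reported separately in the structure matrix.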