You are on page 1of 30

Multivariate Data Analysis

Overview of Methods
8 6

-2 -2 0 2 4 6 8

Sophisticated multivariate statistical methods are becoming standard practice in the physical, natural and social sciences, as well as in business
Variations of existing methods are being developed, existing techniques are being applied to new applications, and new methods continue to be designed

1.0
Talking to co-workers Reading a book

.5
Talking to friends

30-34 35-39

Radio Cable television 40-44

Dimension 2

0.0
25-29

Newspaper 45-49 Broadcast television

-.5

Internet 50-54

-1.0
Magazines

Most important information source Age group

-1.5 -1.0 -.8 -.6 -.4 -.2 .0 .2 .4 .6 .8

Dimension 1

The accelerated use of advanced multivariate techniques is being driven by


Growing complexity in the topics being addressed Ever-larger data sets Ability to apply computationally intensive methods through powerful computer tools Academic training

-2 -2 0 2 4 6 8

Scale Nominal

Definition Non-ordered categories

Examples Race, gender marital status

Descriptive Statistics Percentages, mode

Ordinal

Ordered relation between categories


Ordered relation, equality of differences Ordered relation, equality of differences, absolute zero

Attitudes, social class


Economic indices Sales,costs, frequency

Percentiles, median
Range, mean, standard deviation All of above, coefficient of variation

Interval Metric

Ratio

Response vs explanatory
Response or dependent variable
Variable to be modeled or predicted

Explanatory or independent variable


Variables used to predict or model dependent variable

Importance of identifying data and variable types


Critical in determining analysis objectives and appropriate analysis method Avoid inappropriate variable operations

Dependence techniques
One or a set of variables are regarded as dependent variables Objective is to predict or explain the value of the dependent variable(s) based on the values of a set of independent variables Examples
What is the probability that a loan applicant will default? What factors best differentiate people whose primary news source is the Internet?

Dependence techniques
Multiple regression Logistic regression Discriminant analysis Canonical correlation Structural equation modeling Analysis of variance Decision trees

Interdependence techniques
No single group of variables defined as dependent or independent Objective is to identify and characterize underlying structure between the variables Examples
What are the underlying factors that define a customers perception of a brand? Which signal returns arise from the same object and how many objects are present?

Interdependence techniques
Factor analysis Multidimensional scaling Correspondence analysis Cluster analysis

Interdependence techniques are valuable data reduction methods


Data reduction attempts to manage and interpret the large amounts of data gathered One goal is combine groups of cases measured over multiple variables into a relatively small number of understandable segments Or to group variables together into categories of latent traits and then characterize cases with respect to this smaller number of traits

The reduced data variables are then often used as variables in dependence techniques

Multiple regression is a dependence technique used to model the relationship between the value of a single metric dependent variable and a set of metric independent variables
Categorical variables can be included as dummy variables

Model can be applied to predict changes in the dependent variables response to changes in the independent variables Regression also indicates the relative importance of independent variables on the response of the dependent variable

For example, a client may be interested in understanding the effect of price and promotional activity on a products market share among both loyal and not loyal customers Technical result is a linear model of the form

Y = a0 + a1X1 + a2X2 + +anXn

Best visualizations of the results control all but one (or two) of the independent variables and examine how the value of dependent variable changes with respect to the free independent variables

Market share for loyal customers

Market share for not-loyal customers

60

60

50

50

40

40

30

30

20

20

Market Share

Market Share

10

10

0 20

30

40

50

60

70

80

0 20

30

40

50

60

70

80

Promotion Index

Promotion Index

Properties
Single interval scale dependent variable Multiple independent variables, preferably on interval scale Familiar and useful technique

Issues
Assumes linear relationship between dependent and independent variables Overused and often assumptions not fully checked Often misapplied to classification problems

Logistic Regression is a dependence techniques used to model the relationship between a single categorical dependent variable and a set of metric independent variables
Typically dependent variable takes one of two values success/failure, buy/do not buy Multinomial formulations

A logistic model gives the probability that the dependent variable takes a target value given the values of the independent variable

For example, which credit and demographic factors best predict whether a customer will keep a loan current
Dependent variable taken as 60 days past due or worse Independent variables are credit and employment history, and demographic descriptors

Properties
Powerful technique for predicting group membership and identifying important independent variables Becoming more widely used Procedures and results similar to linear regression

Issues
Adequate data Model validation Communicating probabilistic concepts

Decision trees are a dependence technique used to develop a model to classify the value of a single dependent variable based on a set of independent variables
Dependent and independent variables can be any data type

The typical product of CART is a straightforward, easily interpretable set of segmentation rules
For example, classify existing customers as high or low likelihood buyers of a new product based on demographics and historical purchasing behavior. Classification could be used to focus advertising campaign

Decision trees can be also used to examine profiles of different market segments with respect to underlying demographic and psychographic variables
For example, what are the most significant demographic variables determining whether the Internet is a persons most important information source?

Properties
Single dependent variable of any scale Multiple independent variables of any scale Free of model assumptions typical in other dependence techniques Powerful statistical learning algorithm able to identify complex variable interactions Not as familiar Standard inferential statistics not applicable Often leads to asymmetric relationships

Issues

Factor analysis is an interdependence technique used to identify a set of underlying latent traits (factors) that explain the correlations between a large number of variables
Data summarizing
Derive a set of underlying concepts that summarize a larger set of variables

Data reduction
Develop a set of factor variables that serves as a more parsimonious description of the data

Interested in defining underlying dimensions influencing the perception of online destinations


Survey respondents are asked to rate a set of destinations (including clients) with respect to a number of traits Factor analysis can be applied to develop a succinct set of perception dimensions This manageable set of dimensions can be used to characterize a clients site and to develop a focused plan to reposition it


4.5

Factors can then be used to provide visual summary of data


A C B E

On a scale of 1 to 5 where "1" means "not at all descriptive" and "5" means "extremely descriptive," how well do each of the following words or phrases describe the +website?

4.0

3.5 3.0 G Client 2.5 H 2.0 1.5 D 1.0 .5 1.5 2.0 2.5 3.0 3.5 4.0 4.5 F

Competence

Sophistication
Trustworthy Exciting

Trustworthy

15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

Down-to-earth Daring Intelligent Confusing Friendly Up-to-date Clumsy Slick Genuine Imaginative Pretentious Upper class Honest Spirited Dependable Reliable Informative Silly Efficient Sassy

Properties
Very useful in identifying structure and relationships in data Provides tractable set of concepts for both managerial and analytical uses Provides opportunities for visualizations

Issues
Questionnaire design Variable selection Factor interpretation and validity

Cluster analysis is an interdependence technique used to segment cases into homogeneous groups based on a specified set of variables
Data reduction
Develop a more parsimonious description of cases which can then be used in analytical classification methods

Identify similarities between cases with respect to clustering variables Characterize clusters with respect to other sets of variables

Want to identify and then characterize similar groups of TV pilot shows based on survey responses rating shows on various traits
For one or two traits it may be possible to do this subjectively. Cluster analysis provides an objective method for multiple traits Clusters can be characterized with respect to variables not used in the analysis, such as show success, and cluster membership can be used as a dependent variable in classification method

60

50

1
The Grub National

2
The Pitt

3
Oliver B

Cedric Wanda at Live Gir Ground2 Normal O More Pat Becoming Bernie M

40

Beat Cop Andy Ric College Nathan's

Greg Ruling the C 30 Msgr. Ma Normal P Tick2

HUMOR

20 20 30 40 50 60

CLEVER

Cluster 1: Low likelihood of success Cluster 2: Moderate likelihood of success Cluster 3: High likelihood of success

Properties
Many cluster techniques are available for data of all scales Can identify structure in large data sets that may be difficult to discover in any other way Provides objective segmentation method

Issues
Selecting appropriate clustering method Determining appropriate number of clusters Validating clusters

You might also like