Overview of Methods
Sophisticated multivariate statistical methods are becoming standard practice in the physical, natural and social sciences, as well as in business.
Variations of existing methods are being developed, existing techniques are being applied to new problems, and new methods continue to be designed.
[Figure: two-dimensional map (Dimension 1 by Dimension 2) positioning information sources (Talking to co-workers, Reading a book, Talking to friends, Internet, Magazines) relative to age groups (25-29, 30-34, 35-39, 50-54)]
Scale
Nominal
Ordinal: percentiles, median
Interval (metric): range, mean, standard deviation
Ratio (metric): all of above, coefficient of variation
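As a rough illustration of how the usable statistics build up with the measurement scale, the sketch below computes them for a small ratio-scaled variable using only Python's standard library. The variable name and the values are hypothetical and are not taken from the original material.

```python
import statistics

# Hypothetical ratio-scaled data (e.g., monthly spend in dollars); illustrative values only.
spend = [20.0, 35.0, 35.0, 50.0, 60.0]

# Order-based statistics need only an ordinal scale.
median = statistics.median(spend)

# Differences are meaningful from the interval scale upward.
value_range = max(spend) - min(spend)
mean = statistics.mean(spend)
std_dev = statistics.stdev(spend)  # sample standard deviation

# A ratio of two quantities requires a true zero, hence a ratio scale.
coeff_var = std_dev / mean

print(median, value_range, mean, round(std_dev, 3), round(coeff_var, 3))
```

The coefficient of variation is left for last because dividing by the mean only makes sense when zero means "none of the quantity".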
Response vs explanatory
Response or dependent variable
Variable to be modeled or predicted
Dependence techniques
One or a set of variables are regarded as dependent variables
Objective is to predict or explain the value of the dependent variable(s) based on the values of a set of independent variables
Examples
What is the probability that a loan applicant will default?
What factors best differentiate people whose primary news source is the Internet?
Dependence techniques
Multiple regression
Logistic regression
Discriminant analysis
Canonical correlation
Structural equation modeling
Analysis of variance
Decision trees
Interdependence techniques
No single group of variables defined as dependent or independent
Objective is to identify and characterize underlying structure between the variables
Examples
What are the underlying factors that define a customer's perception of a brand?
Which signal returns arise from the same object and how many objects are present?
Interdependence techniques
Factor analysis
Multidimensional scaling
Correspondence analysis
Cluster analysis
The reduced set of variables is then often used as input to dependence techniques.
Multiple regression is a dependence technique used to model the relationship between a single metric dependent variable and a set of metric independent variables.
Categorical variables can be included as dummy variables.
The model can be applied to predict changes in the dependent variable's response to changes in the independent variables.
Regression also indicates the relative importance of the independent variables to the response of the dependent variable.
For example, a client may be interested in understanding the effect of price and promotional activity on a product's market share among both loyal and not loyal customers.
The technical result is a linear model relating the dependent variable to the independent variables.
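Such a linear model is conventionally written in the general form below. The symbols are a standard notation assumed here for illustration; they are not taken from the original slide.

```latex
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
\]
```

Here $y$ is the dependent variable (market share in the example), $x_1, \dots, x_k$ are the independent variables (price, promotional activity, and a dummy variable for loyalty), $\beta_0, \dots, \beta_k$ are coefficients estimated from the data, and $\varepsilon$ is the error term.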
The best visualizations of the results hold all but one (or two) of the independent variables fixed and show how the value of the dependent variable changes with respect to the free independent variables.
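The market share example can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the observations are hypothetical, the market share values are synthetic (generated from an assumed model so the fit can be checked), and the helper names `solve`, `fit_ols` and `predict` are illustrative only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Predicted response for one observation."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical observations: (price, promotion index, loyal dummy: 1 = loyal, 0 = not loyal).
rows = [
    (5.0, 20.0, 0), (5.0, 60.0, 1), (6.0, 40.0, 0), (6.0, 80.0, 1),
    (7.0, 30.0, 1), (7.0, 70.0, 0), (8.0, 50.0, 1), (8.0, 20.0, 0),
]
# Synthetic market share from an assumed model (not real data).
share = [30.0 - 2.0 * p + 0.3 * m + 10.0 * l for p, m, l in rows]

coefs = fit_ols(rows, share)  # [intercept, price, promotion, loyal]

# Hold price and loyalty fixed and vary only the promotion index.
curve = [(promo, predict(coefs, (6.0, promo, 1))) for promo in (20.0, 40.0, 60.0, 80.0)]
for promo, predicted in curve:
    print(promo, round(predicted, 2))
```

The list `curve` is the kind of series one would plot: predicted market share against the promotion index for one fixed price and one customer group.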
[Figure: two panels plotting Market Share (vertical axis) against Promotion Index (horizontal axis)]
Properties
Single interval scale dependent variable
Multiple independent variables, preferably on interval scale
Familiar and useful technique
Issues
Assumes linear relationship between dependent and independent variables
Overused and often assumptions not fully checked
Often misapplied to classification problems
Logistic regression is a dependence technique used to model the relationship between a single categorical dependent variable and a set of metric independent variables.
Typically the dependent variable takes one of two values: success/failure, buy/do not buy
Multinomial formulations handle dependent variables with more than two categories
A logistic model gives the probability that the dependent variable takes a target value given the values of the independent variables.
For example, which credit and demographic factors best predict whether a customer will keep a loan current?
Dependent variable taken as 60 days past due or worse
Independent variables are credit and employment history, and demographic descriptors
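A minimal sketch of the loan example follows, fitting a two-class logistic model by plain gradient ascent on the log-likelihood. Everything specific here is an assumption for illustration: the two independent variables (late payments in the past year, years at current job), the data values, the learning rate and the step count. Statistical packages typically use a faster fitting procedure.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, rate=0.05, steps=5000):
    """Fit weights by batch gradient ascent on the log-likelihood."""
    w = [0.0] * (len(rows[0]) + 1)  # intercept plus one weight per variable
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, y in zip(rows, labels):
            x = [1.0] + list(row)
            err = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi + rate * g / len(rows) for wi, g in zip(w, grad)]
    return w


def probability(w, row):
    """Modeled probability that the dependent variable takes the target value."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))


# Hypothetical applicants: (late payments in past year, years at current job).
rows = [
    (0, 10), (0, 6), (1, 8), (2, 5), (3, 7), (0, 1), (2, 3),  # loans kept current
    (1, 2), (2, 1), (3, 4), (4, 2), (5, 1), (4, 6), (1, 4),   # 60 days past due or worse
]
labels = [0] * 7 + [1] * 7  # 1 = 60 days past due or worse

weights = fit_logistic(rows, labels)
p_high = probability(weights, (5, 1))   # many late payments, short job tenure
p_low = probability(weights, (0, 10))   # no late payments, long job tenure
print(round(p_high, 3), round(p_low, 3))
```

The output of the model is a probability rather than a class label, which is why communicating probabilistic concepts matters when presenting results.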
Properties
Powerful technique for predicting group membership and identifying important independent variables
Becoming more widely used
Procedures and results similar to linear regression
Issues
Adequate data
Model validation
Communicating probabilistic concepts
Decision trees are a dependence technique used to develop a model that classifies the value of a single dependent variable based on a set of independent variables.
Dependent and independent variables can be any data type.
The typical product of CART (classification and regression trees) is a straightforward, easily interpretable set of segmentation rules.
For example, classify existing customers as high or low likelihood buyers of a new product based on demographics and historical purchasing behavior. The classification could be used to focus an advertising campaign.
Decision trees can also be used to examine profiles of different market segments with respect to underlying demographic and psychographic variables.
For example, what are the most significant demographic variables determining whether the Internet is a person's most important information source?
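To show where segmentation rules come from, the sketch below finds the single best binary split of a small data set using Gini impurity, a splitting criterion commonly associated with CART. It is only the first step of tree growing (a full tree repeats the search within each resulting segment), and the customer data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (variable index, threshold, weighted Gini) of the best binary split."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # a split must put cases on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical customers: (age, number of prior purchases).
rows = [(25, 0), (32, 1), (41, 4), (38, 5), (52, 0), (47, 6), (29, 3), (60, 1)]
labels = ["low", "low", "high", "high", "low", "high", "high", "low"]

variable, threshold, impurity = best_split(rows, labels)
names = ["age", "prior purchases"]
print(f"if {names[variable]} <= {threshold}: low likelihood, else: high likelihood")
```

The printed rule is the kind of segmentation rule the text describes: readable without any knowledge of the fitting procedure.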
Properties
Single dependent variable of any scale
Multiple independent variables of any scale
Free of model assumptions typical in other dependence techniques
Powerful statistical learning algorithm able to identify complex variable interactions
Issues
Not as familiar
Standard inferential statistics not applicable
Often leads to asymmetric relationships
Factor analysis is an interdependence technique used to identify a set of underlying latent traits (factors) that explain the correlations between a large number of variables.
Data summarizing
Derive a set of underlying concepts that summarize a larger set of variables
Data reduction
Develop a set of factor variables that serves as a more parsimonious description of the data
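The idea of summarizing correlated variables with one underlying dimension can be illustrated with a simpler stand-in for factor analysis: extracting the first principal component of the correlation matrix by power iteration. This is not a full factor analysis (no multiple factors, no rotation), and the respondent ratings below are hypothetical.

```python
import math


def correlation_matrix(cols):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(cols[0])
    z = []
    for c in cols:
        mean = sum(c) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in c) / (n - 1))
        z.append([(v - mean) / sd for v in c])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Leading eigenvector and eigenvalue of a symmetric matrix by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, matrix))
    return v, eigenvalue


# Hypothetical 1-to-5 ratings from six respondents on three related traits.
honest = [5, 4, 2, 1, 4, 2]
dependable = [5, 5, 2, 2, 4, 1]
reliable = [4, 5, 1, 2, 5, 2]

corr = correlation_matrix([honest, dependable, reliable])
loadings, eigenvalue = first_component(corr)
share_explained = eigenvalue / len(corr)  # fraction of total variance on one dimension
print([round(x, 3) for x in loadings], round(share_explained, 3))
```

When one component carries most of the variance, as it does for these three made-up traits, the three ratings can reasonably be replaced by a single score, which is the data reduction described above.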
Survey question: On a scale of 1 to 5 where "1" means "not at all descriptive" and "5" means "extremely descriptive," how well do each of the following words or phrases describe the website?
Items 15 to 34: Down-to-earth, Daring, Intelligent, Confusing, Friendly, Up-to-date, Clumsy, Slick, Genuine, Imaginative, Pretentious, Upper class, Honest, Spirited, Dependable, Reliable, Informative, Silly, Efficient, Sassy
[Figure: plot of points labeled Client, D, F, G and H on axes scaled from .5 to 4.5; labels shown include Competence, Sophistication, Trustworthy and Exciting]
Properties
Very useful in identifying structure and relationships in data
Provides tractable set of concepts for both managerial and analytical uses
Provides opportunities for visualizations
Issues
Questionnaire design
Variable selection
Factor interpretation and validity
Cluster analysis is an interdependence technique used to segment cases into homogeneous groups based on a specified set of variables.
Data reduction
Develop a more parsimonious description of cases which can then be used in analytical classification methods
Identify similarities between cases with respect to clustering variables
Characterize clusters with respect to other sets of variables
For example, we want to identify and then characterize similar groups of TV pilot shows based on survey responses rating the shows on various traits.
For one or two traits it may be possible to do this subjectively. Cluster analysis provides an objective method for multiple traits.
Clusters can be characterized with respect to variables not used in the analysis, such as show success, and cluster membership can be used as a dependent variable in a classification method.
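One widely used clustering method is k-means, sketched below in plain Python; the source does not say which method was used for the TV pilot example, so this is an assumption. The show ratings on two traits, the choice of three clusters and the starting centers are all hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, iterations=20):
    """Alternate between assigning points to centers and recomputing the centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c  # keep empty clusters in place
            for g, c in zip(groups, centers)
        ]
    return centers, [nearest(p, centers) for p in points]


# Hypothetical mean ratings of nine pilot shows on two traits: (clever, humor).
shows = [
    (25, 22), (28, 26), (24, 27),
    (40, 41), (43, 38), (41, 44),
    (56, 55), (58, 52), (54, 57),
]
# Starting centers picked by hand for reproducibility (an assumption of this sketch).
start = [shows[0], shows[3], shows[6]]

centers, membership = kmeans(shows, start)
print(membership)
```

The resulting membership list is what would then be cross-tabulated against variables outside the analysis, such as show success, or passed to a classification method as a dependent variable.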
[Figure: scatter plot of TV pilot shows on the HUMOR and CLEVER rating scales, with the shows grouped into clusters 1, 2 and 3]
Cluster 1: Low likelihood of success
Cluster 2: Moderate likelihood of success
Cluster 3: High likelihood of success
Properties
Many cluster techniques are available for data of all scales
Can identify structure in large data sets that may be difficult to discover in any other way
Provides objective segmentation method
Issues
Selecting appropriate clustering method
Determining appropriate number of clusters
Validating clusters