
MULTIVARIATE DATA ANALYSIS WITH EXCEL
E. OMEY
HUB, Stormstraat 2, 1000 Brussels, Belgium
edward.omey@hubrussel.be
www.edwardomey.com

1. An example
1.1. The data set
1.2. Qualitative - qualitative
1.3. Qualitative - quantitative
1.4. Quantitative - quantitative
1.4.1. RAW DATA
1.4.2. GROUPED DATA
2. Covariance and Correlation
2.1. Definitions
2.2. Calculations
2.2.1. Raw data
2.2.2. Grouped data
3. The regression line
3.1. Formulas
3.2. Grouped data
3.3. Raw data
3.3.1. Using the XY-scatter (BBA 1)
3.3.2. Using DATA → DATA-ANALYSIS (BBA3)

Multivariate Data Analysis E. Omey 2011

1. An example
1.1. The data set
We have 50 data lines with the following variables:
A = age (quantitative)
B = Male or Female (qualitative)
C = number of children (quantitative)
D = Unemployed or employed (qualitative)
E = IQ (quantitative)

N    A    B   C   D   E
1    20   M   0   E   35
2    37   F   3   E   91
3    41   F   1   E   93
4    31   M   1   E   156
5    49   F   2   E   69
6    22   M   0   U   145
7    48   M   1   U   112
8    49   F   2   U   160
9    28   M   0   U   109
10   45   F   1   E   134
11   38   M   2   E   85
12   20   F   2   E   154
13   24   F   3   E   87
14   23   F   0   E   75
15   47   M   1   E   140
16   37   F   1   E   149
17   48   F   0   E   80
18   48   F   2   E   123
19   48   M   4   E   141
20   42   F   1   E   74
21   34   F   2   U   147
22   22   M   1   E   130
23   20   F   0   E   90
24   23   F   1   U   109
25   31   F   1   E   160
26   36   M   1   E   79
27   45   M   3   E   134
28   47   F   2   U   81
29   44   M   2   U   171
30   35   F   2   E   116
31   30   F   0   E   91
32   30   F   0   E   141
33   29   F   2   E   85
34   23   F   0   E   185
35   22   M   0   U   48
36   40   M   3   E   90
37   33   F   0   E   117
38   42   F   2   E   112
39   37   F   3   U   72
40   47   F   3   E   106
41   43   M   2   E   127
42   43   F   3   E   113
43   34   F   2   U   200
44   28   M   1   U   82
45   40   M   2   E   153
46   46   M   0   U   145
47   31   F   2   U   115
48   23   M   1   U   110
49   45   M   3   U   145
50   27   M   1   E   143

As before, we can study each of the variables separately. Now we can also study two variables at the same time. There are several possibilities, depending on whether each variable is qualitative or quantitative.

1.2. Qualitative - qualitative


We study B and D and make a frequency table. In EXCEL this can be done by using a PIVOT table, but we do not go into detail here. We find the following table:
TABLE 1
B       D    AF    RF
F       E    22    0,44
F       U    7     0,14
M       E    12    0,24
M       U    9     0,18
Total        50    1

Note that it does not make sense to calculate cumulative relative frequencies. (Why?) Another way to present the table is as follows:
TABLE 2

AF           D
B        E     U     Total
F        22    7     29
M        12    9     21
Total    34    16    50

RF           D
B        E       U       Total
F        0,44    0,14    0,58
M        0,24    0,18    0,42
Total    0,68    0,32    1
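The step from absolute to relative frequencies in Table 2 can be sketched in Python as follows. This is only an illustration of the arithmetic and not part of the EXCEL workflow; the variable names (`counts`, `rel`, `row_totals`, `col_totals`) are chosen here for the example.

```python
# Absolute frequencies of Table 2: keys are (B, D) with B in {F, M} and D in {E, U}.
counts = {
    ("F", "E"): 22, ("F", "U"): 7,
    ("M", "E"): 12, ("M", "U"): 9,
}

n = sum(counts.values())  # total number of data lines

# Relative frequency of each cell = absolute frequency / n.
rel = {cell: af / n for cell, af in counts.items()}

# Marginal relative frequencies (the "Total" row and column of the RF table).
row_totals = {b: sum(rf for (bb, _), rf in rel.items() if bb == b) for b in ("F", "M")}
col_totals = {d: sum(rf for (_, dd), rf in rel.items() if dd == d) for d in ("E", "U")}

for cell, rf in rel.items():
    print(cell, round(rf, 2))
print("rows:", {b: round(v, 2) for b, v in row_totals.items()})
print("columns:", {d: round(v, 2) for d, v in col_totals.items()})
```

The marginal totals are obtained by summing the cell frequencies, exactly as in the last row and last column of the table.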

There are several graphs we can make. Starting from Table 2, we select the relative frequencies and then choose INSERT → COLUMN:

After some make-up (change the horizontal axis and remove what is unnecessary) we get:

If we choose the second option in INSERT → COLUMN of the Chart Wizard:

The first bar represents employed people and the bar is divided into 2 parts: F and M.

1.3. Qualitative - quantitative


Now we study the combination of C (children) and B (F or M). We find the following frequency tables:
TABLE 3

AF           B
C        F     M     Total
0        7     5     12
1        6     8     14
2        11    4     15
3        5     3     8
4        0     1     1
Total    29    21    50

RF           B
C        F       M       Total
0        0,14    0,1     0,24
1        0,12    0,16    0,28
2        0,22    0,08    0,3
3        0,1     0,06    0,16
4        0       0,02    0,02
Total    0,58    0,42    1

Choosing the relative frequencies in Table 3, we again take INSERT → COLUMN. We get:

The second option in COLUMN gives:

Each bar represents the RF of a number of children. The bar is divided into 2 pieces: one piece for F and one piece for M. Now we can also study M and F separately. We construct 3 frequency tables: one for F only, one for M only and one for the total population. We get:


only F
C        AF    RF      CRF
0        7     0,24    0,24
1        6     0,21    0,45
2        11    0,38    0,83
3        5     0,17    1,00
4        0     0,00    1,00
Total    29    1

only M
C        AF    RF      CRF
0        5     0,24    0,24
1        8     0,38    0,62
2        4     0,19    0,81
3        3     0,14    0,95
4        1     0,05    1,00
Total    21    1

Total
C        AF    RF      CRF
0        12    0,24    0,24
1        14    0,28    0,52
2        15    0,30    0,82
3        8     0,16    0,98
4        1     0,02    1,00
Total    50    1
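The RF and CRF columns of these three tables follow from the absolute frequencies by a division and a running sum. A minimal Python sketch of that calculation is given below; the function name `rf_and_crf` and the dictionary layout are illustrative choices, not something taken from EXCEL.

```python
from itertools import accumulate

# Absolute frequencies of C = 0, 1, 2, 3, 4 for the three groups.
af = {
    "only F": [7, 6, 11, 5, 0],
    "only M": [5, 8, 4, 3, 1],
    "Total": [12, 14, 15, 8, 1],
}

def rf_and_crf(freqs):
    """Return (relative frequencies, cumulative relative frequencies)."""
    n = sum(freqs)
    rf = [f / n for f in freqs]
    crf = list(accumulate(rf))  # running sum of the relative frequencies
    return rf, crf

tables = {name: rf_and_crf(freqs) for name, freqs in af.items()}

for name, (rf, crf) in tables.items():
    print(name)
    print("  RF :", [round(x, 2) for x in rf])
    print("  CRF:", [round(x, 2) for x in crf])
```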

To compare the different pieces, we can make a new table and a new graph. By copy-paste we get the following table:

TABLE 4
         only F    only M    Total
C        RF        RF        RF
0        0,24      0,24      0,24
1        0,21      0,38      0,28
2        0,38      0,19      0,30
3        0,17      0,14      0,16
4        0,00      0,05      0,02
Total    1         1         1

We see that for F the mode is 2, and for M the mode is 1.
The total mean number of children is 1,44.
When we restrict to F, we get a mean equal to 1,48.
When we restrict to M, we get a mean equal to 1,38.
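These modes and means can be checked directly from the absolute frequencies of Table 3. The following Python sketch does so; the helper names `mean_from_freq` and `mode_from_freq` are introduced here for illustration only.

```python
children = [0, 1, 2, 3, 4]

# Absolute frequencies from Table 3.
af = {"F": [7, 6, 11, 5, 0], "M": [5, 8, 4, 3, 1]}
af["Total"] = [f + m for f, m in zip(af["F"], af["M"])]

def mean_from_freq(values, freqs):
    """Mean of grouped data: sum of (outcome * frequency) / n."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

def mode_from_freq(values, freqs):
    """Outcome with the highest frequency (first one in case of ties)."""
    return values[freqs.index(max(freqs))]

means = {group: mean_from_freq(children, freqs) for group, freqs in af.items()}
modes = {group: mode_from_freq(children, freqs) for group, freqs in af.items()}

for group in af:
    print(group, "mode:", modes[group], "mean:", round(means[group], 2))
```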


Selecting all relative frequencies in Table 4, and then choosing the first option in COLUMNS, we get

Using the cumulative relative frequencies, we get


         only F    only M    Total
C        CRF       CRF       CRF
0        0,24      0,24      0,24
1        0,45      0,62      0,52
2        0,83      0,81      0,82
3        1,00      0,95      0,98
4        1,00      1,00      1,00

And then we get 3 EFD in one graph:


or
[Figure: the three cumulative relative frequency series (Series1, Series2, Series3) plotted against C, vertical axis from 0,00 to 1,00]


1.4. Quantitative - quantitative


We study A (age) and C (children).

1.4.1. RAW DATA


We can use the raw data to make a SCATTER plot. To this end we copy-paste the data, first A and then C: (for convenience I only copy part of the table)
A     C
20    0
37    3
41    1
31    1
49    2
22    0
40    2
46    0
31    2
23    1
45    3
27    1

Now we select INSERT → SCATTER:

In the next step we can give a title to the graph and to the axes:


1.4.2. GROUPED DATA


When we get the data in the form of a frequency table, we have to start from the following table.

                        A
Class       ]15, 25]   ]25, 35]   ]35, 45]   ]45, 55]
Centre      20         30         40         50         Total
C = 0       6          4          0          2          12
C = 1       3          4          5          2          14
C = 2       1          5          5          4          15
C = 3       1          0          6          1          8
C = 4       0          0          0          1          1
Total       11         13         16         10         50

For A, we used 4 classes of length 10, with class centres 20, 30, 40 and 50. The relative frequencies are in the following table:

              A (class centres)
C        20      30      40      50      Total
0        0,12    0,08    0       0,04    0,24
1        0,06    0,08    0,1     0,04    0,28
2        0,02    0,1     0,1     0,08    0,3
3        0,02    0       0,12    0,02    0,16
4        0       0       0       0,02    0,02
Total    0,22    0,26    0,32    0,2     1

To make further calculations and make nice graphs, we present the table in another form:


C    A     RF
0    20    0,12
1    20    0,06
2    20    0,02
3    20    0,02
4    20    0
0    30    0,08
1    30    0,08
2    30    0,1
3    30    0
4    30    0
0    40    0
1    40    0,1
2    40    0,1
3    40    0,12
4    40    0
0    50    0,04
1    50    0,04
2    50    0,08
3    50    0,02
4    50    0,02

The first and second columns contain all values of C and A; the third column contains the relative frequencies. To make a BUBBLE graph, we select the data and then choose INSERT → OTHER CHARTS → BUBBLE. We get:
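Rearranging the two-way frequency table into this three-column form is a mechanical step. A short Python sketch of it is shown below, assuming the table is stored as a list of rows (one row per value of C, one column per class centre of A); the names `counts` and `long_table` are illustrative.

```python
ages = [20, 30, 40, 50]        # class centres of A
children = [0, 1, 2, 3, 4]     # values of C

# counts[i][j] = absolute frequency of (C = children[i], A = ages[j]).
counts = [
    [6, 4, 0, 2],
    [3, 4, 5, 2],
    [1, 5, 5, 4],
    [1, 0, 6, 1],
    [0, 0, 0, 1],
]

n = sum(sum(row) for row in counts)

# One line (C, A, RF) per cell, listed class by class as in the table above.
long_table = [
    (c, a, counts[i][j] / n)
    for j, a in enumerate(ages)
    for i, c in enumerate(children)
]

for c, a, rf in long_table:
    print(c, a, rf)
```

Each line of `long_table` corresponds to one bubble: two coordinates and a size.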


After some make-up, we get:


2. Covariance and Correlation


2.1. Definitions
Recall that the covariance and the correlation coefficient are given by:

$$\mathrm{Cov}(x, y) = \overline{xy} - \bar{x}\,\bar{y}$$

$$r(x, y) = \frac{\mathrm{Cov}(x, y)}{s(x)\,s(y)}$$

Note that these definitions only make sense for quantitative data! The covariance is an indication of the presence or the absence of a linear relationship. The correlation coefficient is a measure of the strength of such a relationship.

2.2. Calculations
2.2.1. Raw data
When we start from RAW data, we select DATA → DATA ANALYSIS.

We study again A and C and we copy-paste the data: (for convenience I only copy part of the table)
A     C
20    0
37    3
41    1
31    1
49    2
22    0
40    2
46    0
31    2
23    1
45    3
27    1

Now we activate the DATA ANALYSIS tools:


On the top we see Correlation and Covariance. For the covariance, we get:

For the Input range, we select the data together with the titles (and then click on LABELS IN FIRST ROW). For the Output range we select an empty cell. We get (for our example):


After OK, we get the following table:

      A        C
A     90,09
C     4,74     1,1664

The table gives
Cov(A, A) = s²(A) = 90,09
Cov(C, C) = s²(C) = 1,1664
Cov(A, C) = Cov(C, A) = 4,74 > 0
For the correlation, we proceed in a similar way. We get the following screenshots:


And then:
      A           C
A     1
C     0,462398    1

As expected, the correlations r(A, A) = r(C, C) = 1. For A and C, we get r(A, C) = 0,46.

Remark. Note that this approach can be used for 2 or more variables at the same time!
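As a cross-check outside EXCEL, the same numbers can be recomputed from the definitions of Section 2.1. The Python sketch below uses the columns A and C of the data set in Section 1.1 and divides by n (population formulas), which is the convention that reproduces the values reported above; the function names `mean`, `cov` and `corr` are defined here and are not EXCEL functions.

```python
from math import sqrt

# Columns A (age) and C (number of children) of the 50 data lines in Section 1.1.
A = [20, 37, 41, 31, 49, 22, 48, 49, 28, 45, 38, 20, 24, 23, 47, 37, 48, 48,
     48, 42, 34, 22, 20, 23, 31, 36, 45, 47, 44, 35, 30, 30, 29, 23, 22, 40,
     33, 42, 37, 47, 43, 43, 34, 28, 40, 46, 31, 23, 45, 27]
C = [0, 3, 1, 1, 2, 0, 1, 2, 0, 1, 2, 2, 3, 0, 1, 1, 0, 2, 4, 1, 2, 1, 0, 1,
     1, 1, 3, 2, 2, 2, 0, 0, 2, 0, 0, 3, 0, 2, 3, 3, 2, 3, 2, 1, 2, 0, 2, 1,
     3, 1]

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Cov(x, y) = mean(xy) - mean(x) * mean(y), dividing by n."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def corr(xs, ys):
    """r(x, y) = Cov(x, y) / (s(x) * s(y))."""
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))

print("Cov(A, A) =", round(cov(A, A), 4))
print("Cov(C, C) =", round(cov(C, C), 4))
print("Cov(A, C) =", round(cov(A, C), 4))
print("r(A, C)   =", round(corr(A, C), 6))
```

Because `cov` and `corr` take any two columns, the same functions can be applied to every pair of quantitative variables, in the spirit of the remark above.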


2.2.2. Grouped data


For grouped data we have a little more work. We have to use the EXCEL function SUMPRODUCT. When we have grouped data, we present the table (or make a table) in the following form:

C        A     AF    RF
0        20    6     0,12
1        20    3     0,06
2        20    1     0,02
3        20    1     0,02
4        20    0     0
0        30    4     0,08
1        30    4     0,08
2        30    5     0,1
3        30    0     0
4        30    0     0
0        40    0     0
1        40    5     0,1
2        40    5     0,1
3        40    6     0,12
4        40    0     0
0        50    2     0,04
1        50    2     0,04
2        50    4     0,08
3        50    1     0,02
4        50    1     0,02
Total          50    1

To calculate the mean of C, for example, we have to calculate:
Mean(C) = sum of (outcomes * frequencies)/n = [(0 * 6) + (1 * 3) + ... + (4 * 1)]/50
or
Mean(C) = sum of (outcomes * relative frequencies) = (0 * 0,12) + (1 * 0,06) + ... + (4 * 0,02)
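Both versions of this calculation are sums of element-wise products. The Python sketch below mimics that idea with a small helper called `sumproduct`; this helper is written here for illustration and only imitates the behaviour of the EXCEL function for equally long columns.

```python
from math import prod

def sumproduct(*columns):
    """Sum of the element-wise products of equally long columns."""
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all columns must have the same length")
    return sum(prod(items) for items in zip(*columns))

# Columns C and AF of the grouped table (20 lines, listed per age class).
C_col = [0, 1, 2, 3, 4] * 4
AF = [6, 3, 1, 1, 0, 4, 4, 5, 0, 0, 0, 5, 5, 6, 0, 2, 2, 4, 1, 1]

n = sum(AF)
RF = [f / n for f in AF]

mean_c_from_af = sumproduct(C_col, AF) / n   # outcomes * frequencies / n
mean_c_from_rf = sumproduct(C_col, RF)       # outcomes * relative frequencies

print(mean_c_from_af, round(mean_c_from_rf, 4))
```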

With the EXCEL function SUMPRODUCT we can do this in one step. When we activate the function wizard, we select the SUMPRODUCT function and get:


For Array 1 we choose the column of C-values, for Array 2, we choose the relative frequencies. We get:

After OK, we find mean(C) = 1,44. In a similar way, we find mean(A) = 35. To find the second moment mean(C²), we choose the C-values in ARRAY 1, the C-values again in ARRAY 2, and the relative frequencies in ARRAY 3.


After OK, we get the result: mean(C²) = 3,24. In a similar way, we find mean(A²) = 1334. If we want mean(AC), we choose the values of C in ARRAY 1, the values of A in ARRAY 2 and the relative frequencies in ARRAY 3. We get:

As a result we obtain that: mean(AC) = 55.


Using these numbers, it is now easy to find variances, standard deviations, covariance and correlation coefficient. Recall that

s²(x) = mean(x²) - (mean(x))²
s(x) = sqrt(s²(x))
cov(x, y) = mean(xy) - mean(x) * mean(y)
r(x, y) = cov(x, y) / (s(x) * s(y))

For A and C we find:

Quantity    Value       EXCEL formula
mean(C)     1,44        SUMPRODUCT(L80:L99;O80:O99)
mean(A)     35
mean(C²)    3,24        SUMPRODUCT(L80:L99;L80:L99;O80:O99)
mean(A²)    1334
mean(AC)    55          SUMPRODUCT(L80:L99;M80:M99;O80:O99)
s²(C)       1,1664      R84-R81*R81
s(C)        1,08        SQRT(R89)
s²(A)       109
s(A)        10,44031
cov(A,C)    4,6         R87-R82*R81
r(A,C)      0,407963    R95/(R90*R93)
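The whole chain, from the grouped table to r(A, C), can also be retraced in a few lines of Python. This is a sketch of the same arithmetic and not a replacement for the spreadsheet; the helper `moment` and the layout of `rows` are choices made for this example.

```python
from math import sqrt

ages = [20, 30, 40, 50]        # class centres of A
children = [0, 1, 2, 3, 4]     # values of C

# counts[i][j] = absolute frequency of (C = children[i], A = ages[j]).
counts = [
    [6, 4, 0, 2],
    [3, 4, 5, 2],
    [1, 5, 5, 4],
    [1, 0, 6, 1],
    [0, 0, 0, 1],
]
n = sum(sum(row) for row in counts)

# Long form of the table: one (C, A, RF) line per cell.
rows = [(c, a, counts[i][j] / n)
        for j, a in enumerate(ages)
        for i, c in enumerate(children)]

def moment(f):
    """Sum of f(C, A) * RF over all lines, i.e. a SUMPRODUCT with the RF column."""
    return sum(f(c, a) * rf for c, a, rf in rows)

mean_c = moment(lambda c, a: c)
mean_a = moment(lambda c, a: a)
mean_c2 = moment(lambda c, a: c * c)
mean_a2 = moment(lambda c, a: a * a)
mean_ac = moment(lambda c, a: a * c)

var_c = mean_c2 - mean_c ** 2          # s²(C)
var_a = mean_a2 - mean_a ** 2          # s²(A)
cov_ac = mean_ac - mean_a * mean_c     # cov(A, C)
r_ac = cov_ac / (sqrt(var_a) * sqrt(var_c))

print(round(var_c, 4), round(var_a, 4), round(cov_ac, 4), round(r_ac, 6))
```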


3. The regression line


3.1. Formulas
To find the regression line, we start from data $(x_i, y_i)$, $i = 1, 2, \ldots, n$. We approximate $y$ by $\hat{y} = a + bx$. For the i-th data point we have

$$\hat{y}_i = a + b x_i$$

$$e_i = y_i - \hat{y}_i = y_i - a - b x_i$$

We determine a and b by requiring that $\bar{e} = 0$ and that $s^2(e)$ is as small as possible; since $\bar{e} = 0$, this amounts to making $\mathrm{SSE} = \sum_{i=1}^{n} e_i^2$ as small as possible. We find the following solution to this mathematical problem:

$$b = \frac{\mathrm{cov}(x, y)}{s^2(x)}$$

$$a = \bar{y} - b\,\bar{x}$$

$$\hat{y} = a + bx$$
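The solution for a and b translates directly into code. The Python sketch below applies the two formulas to a small made-up data set (the five points are purely illustrative and do not come from the example of Section 1) and checks that the residuals indeed average to zero.

```python
def regression_line(xs, ys):
    """Return (a, b) of the line y = a + b*x, using b = cov(x, y) / s²(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, chosen only to illustrate the formulas.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print("a =", round(a, 4), "b =", round(b, 4))
print("mean residual =", round(sum(residuals) / len(residuals), 10))
```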


3.2. Grouped data


When we have grouped data, we proceed as before to make the calculations. Reconsider the following example:
C        A     AF    RF
0        20    6     0,12
1        20    3     0,06
2        20    1     0,02
3        20    1     0,02
4        20    0     0
0        30    4     0,08
1        30    4     0,08
2        30    5     0,1
3        30    0     0
4        30    0     0
0        40    0     0
1        40    5     0,1
2        40    5     0,1
3        40    6     0,12
4        40    0     0
0        50    2     0,04
1        50    2     0,04
2        50    4     0,08
3        50    1     0,02
4        50    1     0,02
Total          50    1

Earlier we have found the following information:


Quantity    Value
mean(C)     1,44
mean(A)     35
mean(C²)    3,24
mean(A²)    1334
mean(AC)    55
s²(C)       1,1664
s(C)        1,08
s²(A)       109
s(A)        10,44031
cov(A,C)    4,6
r(A,C)      0,407963

We determine the regression line $\hat{C} = a + bA$, and find:

$$b = \frac{\mathrm{cov}(A, C)}{s^2(A)} = \frac{4{,}6}{109} = 0{,}0422$$

$$a = \bar{C} - b\,\bar{A} = 1{,}44 - \frac{4{,}6}{109} \cdot 35 = -0{,}037$$

$$\hat{C} = a + bA = -0{,}037 + 0{,}0422\,A$$

For A = 20, 30, 40, 50 we find that:
A = 20: Ĉ = 0,80
A = 30: Ĉ = 1,23
A = 40: Ĉ = 1,65
A = 50: Ĉ = 2,07

Remark. For the age class ]15, 25], we find that the mean number of children is

$$\bar{C}_{A=20} = \frac{0 \cdot 6 + 1 \cdot 3 + 2 \cdot 1 + 3 \cdot 1 + 4 \cdot 0}{11} = 0{,}73$$

For the other age classes we find $\bar{C}_{A=30} = 1{,}07$; $\bar{C}_{A=40} = 2{,}06$; $\bar{C}_{A=50} = 1{,}7$.
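The comparison between the fitted values and the class means of the remark can be retraced with the Python sketch below. It starts from the summary numbers found earlier (means, s²(A) and cov(A, C)) and from the frequency table per age class; all variable names are illustrative.

```python
# Summary numbers found earlier for the grouped data.
mean_a, mean_c = 35.0, 1.44
var_a, cov_ac = 109.0, 4.6

b = cov_ac / var_a
a = mean_c - b * mean_a

# Fitted number of children for each class centre.
predicted = {age: a + b * age for age in (20, 30, 40, 50)}

# Absolute frequencies of C = 0..4 within each age class.
children = [0, 1, 2, 3, 4]
counts_by_age = {
    20: [6, 3, 1, 1, 0],
    30: [4, 4, 5, 0, 0],
    40: [0, 5, 5, 6, 0],
    50: [2, 2, 4, 1, 1],
}

# Observed mean number of children within each age class.
class_means = {
    age: sum(c * f for c, f in zip(children, freqs)) / sum(freqs)
    for age, freqs in counts_by_age.items()
}

for age in (20, 30, 40, 50):
    print(age, "fitted:", round(predicted[age], 3), "class mean:", round(class_means[age], 3))
```

Printing three decimals makes it easier to see how the two-decimal values in the text were rounded.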


3.3. Raw data


For raw data, we can proceed as in Section 3.1 or we can proceed as follows.

3.3.1. Using the XY-scatter (BBA 1)


We take again the following example:
A     C
20    0
37    3
41    1
31    1
49    2
22    0
40    2
46    0
31    2
23    1
45    3
27    1

First we make a SCATTER plot:

If we activate the chart (by clicking on it), we see in the chart-tools layout the option TRENDLINE and we can select the trend line options:


We choose a LINEAR trendline and also tick DISPLAY EQUATION and DISPLAY R-SQUARED:


After CLOSE, we get the following graph:


After some make-up, we find:

EXCEL has found the regression line. It is given by the equation y = 0,052x - 0,427. In our notation, this is: Ĉ = -0,427 + 0,052 A.
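The trendline equation can be verified without EXCEL by applying the formulas of Section 3.1 to the 50 data lines of Section 1.1. The Python sketch below does this; it is a cross-check of the arithmetic, with variable names chosen for the example.

```python
# Columns A (age) and C (number of children) of the 50 data lines in Section 1.1.
A = [20, 37, 41, 31, 49, 22, 48, 49, 28, 45, 38, 20, 24, 23, 47, 37, 48, 48,
     48, 42, 34, 22, 20, 23, 31, 36, 45, 47, 44, 35, 30, 30, 29, 23, 22, 40,
     33, 42, 37, 47, 43, 43, 34, 28, 40, 46, 31, 23, 45, 27]
C = [0, 3, 1, 1, 2, 0, 1, 2, 0, 1, 2, 2, 3, 0, 1, 1, 0, 2, 4, 1, 2, 1, 0, 1,
     1, 1, 3, 2, 2, 2, 0, 0, 2, 0, 0, 3, 0, 2, 3, 3, 2, 3, 2, 1, 2, 0, 2, 1,
     3, 1]

n = len(A)
mean_a = sum(A) / n
mean_c = sum(C) / n

cov_ac = sum(x * y for x, y in zip(A, C)) / n - mean_a * mean_c
var_a = sum(x * x for x in A) / n - mean_a ** 2

b = cov_ac / var_a           # slope
a = mean_c - b * mean_a      # intercept

print("C-hat =", round(a, 4), "+", round(b, 4), "* A")
```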

We also get a number R². This number is the same as R² = r²(A, C) and it is called the R-square.

In the perfect situation, we have r(A, C) = ±1 and then R² = 1. In the worst scenario, we have r(A, C) = 0, and then R² = 0.

One can prove that R² = r²(A, C) = r²(C, Ĉ).
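This identity can be checked numerically. The Python sketch below uses only the 12 data lines displayed above (so its numbers differ from those of the full data set) and computes R² as 1 - SSE/SST, a standard way to write it for a line with a constant term; the helper names are illustrative.

```python
from math import sqrt

# The 12 displayed data lines (first 6 and last 6 lines of the data set).
A = [20, 37, 41, 31, 49, 22, 40, 46, 31, 23, 45, 27]
C = [0, 3, 1, 1, 2, 0, 2, 0, 2, 1, 3, 1]

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def corr(xs, ys):
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))

# Least-squares line C-hat = a + b*A.
b = cov(A, C) / cov(A, A)
a = mean(C) - b * mean(A)
fitted = [a + b * x for x in A]

# R² computed from the sums of squares.
sse = sum((y - f) ** 2 for y, f in zip(C, fitted))
sst = sum((y - mean(C)) ** 2 for y in C)
r_square = 1 - sse / sst

print(round(r_square, 6), round(corr(A, C) ** 2, 6), round(corr(C, fitted) ** 2, 6))
```

The three printed numbers coincide, which is the statement R² = r²(A, C) = r²(C, Ĉ) for this subset.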


Remark Using this option in EXCEL, we can also consider other relationships, such as a logarithmic model, a polynomial model etc.


3.3.2. Using DATA → DATA-ANALYSIS (BBA3)

From the statistical point of view it is more interesting to use the Data-Analysis tools of EXCEL. In these tools, we activate Regression and find:

and

We fill in the form as follows:
* Input Y-range: we select the data about C, together with the title or label;
* Input X-range: we select the data about A, again with the title or label;
* Labels: we tell EXCEL that we have included the labels or titles;
* Output range: we select the output range option and then choose an empty cell.

After doing this, we get:

After OK, we get the following output:


SUMMARY OUTPUT

Regression Statistics
Multiple R           0,462398
R Square             0,213812
Adjusted R Square    0,197433
Standard Error       0,977353
Observations         50

ANOVA
              df    SS          MS          F           Significance F
Regression    1     12,46953    12,46953    13,05412    0,000723
Residual      48    45,85047    0,955218
Total         49    58,32

             Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%    Lower 95,0%    Upper 95,0%
Intercept    -0,4278         0,535118          -0,79945    0,427969    -1,50373     0,6481       -1,504         0,648128
A            0,052614        0,014562          3,613048    0,000723    0,023335     0,0819       0,0233         0,081893


The output consists of 3 parts. We start with the last part:


             Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept    -0,4278         0,5351            -0,7994    0,4280     -1,5037      0,6481
A            0,0526          0,0146            3,6130     0,0007     0,0233       0,0819

In the first column, we get the estimates for the parameters in the model $\hat{C} = a + bA$. We find

b = 0,0526
a = intercept = -0,4278
Ĉ = -0,4278 + 0,0526 A

In the second column, we get the estimated standard deviations of these estimates:

s_b = 0,0146
s_a = 0,5351

The t-values of the estimates are given in the 3rd column:

$$t_a = \frac{a}{s_a} = \frac{-0{,}4278}{0{,}5351} = -0{,}7994$$

$$t_b = \frac{b}{s_b} = \frac{0{,}0526}{0{,}0146} = 3{,}6130$$

In the next column we get the prob-values of the t-values. Note that the P-value is given by

$$P\text{-value} = 2\,P(t_{n-k} > |t_a|)$$

The last 2 columns give 95% c.i. for the parameters.

In the first part of the output, we see:
Regression Statistics
Multiple R           0,462398368
R Square             0,213812251
Adjusted R Square    0,197433339
Standard Error       0,977352605
Observations         50

* as before, R² = r²(C, Ĉ)
* Adjusted R²: this is a slightly adjusted R²-value
* Standard error s(e): this is the square root of s²(e), where s²(e) is given by

$$s^2(e) = \frac{1}{n-k} \sum_{i=1}^{n} e_i^2$$

where $e_i = y_i - \hat{y}_i$ are the errors.

In the middle part, we have the ANOVA table (ANalysis Of VAriance):

ANOVA
              df    SS         MS         F          Significance F
Regression    1     12,4695    12,4695    13,0541    0,000722823
Residual      48    45,8505    0,9552
Total         49    58,32

In the notation of statistics, we have

$$s^2(C) = s^2(Y) = \frac{1}{n-1} \sum (y_i - \bar{y})^2 = \frac{SST}{n-1}$$

where SST = TOTAL sum of squares, and n - 1 is called the degrees of freedom. In our example, we have s²(C) = 58,32/49. (Here we divide by n - 1, so this differs slightly from the value 1,1664 = 58,32/50 found in Section 2.2.)

The other entries are

$$s^2(e) = \frac{1}{n-k} \sum (e_i - \bar{e})^2 = \frac{1}{n-k} \sum e_i^2 = \frac{SSE}{n-k}$$

where SSE = the residual sum of squares or the error sum of squares; and

$$s^2(\hat{y}) = s^2(\hat{C}) = \frac{1}{k-1} \sum (\hat{y}_i - \bar{y})^2 = \frac{SSR}{k-1}$$

Note that SST = SSR + SSE, and this will hold for all linear models with a constant term. The standard error of the first part of the output is just equal to s(e). The F-value is given by

$$F = \frac{SSR/(k-1)}{SSE/(n-k)} = \frac{SSR/(k-1)}{(SST - SSR)/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$$

This F-value is related to R²: we find that R² is large if and only if the corresponding F-value is large. The advantage of F is that we can compare it with an F-distribution F(k-1, n-k). The prob-value of the F-value is given in the Significance F column. A small Significance F shows that the F-value is large, and hence also that the corresponding R²-value is large.
