MULTIVARIATE DATA ANALYSIS WITH EXCEL
E. OMEY
HUB, Stormstraat 2, 1000 Brussels, Belgium
edward.omey@hubrussel.be
www.edwardomey.com
1. An example
1.1. The data set
1.2. Qualitative - qualitative
1.3. Qualitative - quantitative
1.4. Quantitative - quantitative
1.4.1. RAW DATA
1.4.2. GROUPED DATA
2. Covariance and Correlation
2.1. Definitions
2.2. Calculations
2.2.1. Raw data
2.2.2. Grouped data
3. The regression line
3.1. Formulas
3.2. Grouped data
3.3. Raw data
3.3.1. Using the XY-scatter (BBA 1)
3.3.2. Using DATA → DATA-ANALYSIS (BBA3)
Multivariate Data Analysis E. Omey 2011
1. An example
1.1. The data set
We have 50 data lines with the following variables:
A = age (quantitative)
B = Male or Female (qualitative)
C = number of children (quantitative)
D = Unemployed or Employed (qualitative)
E = IQ (quantitative)
N 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
A 20 37 41 31 49 22 48 49 28 45 38 20 24 23 47 37 48 48 48 42 34 22 20 23 31 36 45 47 44 35 30 30 29 23
B M F F M F M M F M F M F F F M F F F M F F M F F F M M F M F F F F F
C 0 3 1 1 2 0 1 2 0 1 2 2 3 0 1 1 0 2 4 1 2 1 0 1 1 1 3 2 2 2 0 0 2 0
D E E E E E U U U U E E E E E E E E E E E U E E U E E E U U E E E E E
E 35 91 93 156 69 145 112 160 109 134 85 154 87 75 140 149 80 123 141 74 147 130 90 109 160 79 134 81 171 116 91 141 85 185
N 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
A 22 40 33 42 37 47 43 43 34 28 40 46 31 23 45 27
B M M F F F F M F F M M M F M M M
C 0 3 0 2 3 3 2 3 2 1 2 0 2 1 3 1
D U E E E U E E E U U E U U U U E
E 48 90 117 112 72 106 127 113 200 82 153 145 115 110 145 143
As before, we can study each of the variables separately. Now we can also study two variables at the same time. There are several possibilities.
1.2. Qualitative - qualitative

We study B and D and make a frequency table. In EXCEL this can be done by using the PIVOT-table, but we are not going into detail here. We find the following table:
TABLE 1
B      F      F      M      M
D      E      U      E      U      Total
AF     22     7      12     9      50
RF     0,44   0,14   0,24   0,18   1
Note that it does not make sense to calculate cumulative relative frequencies. (Why?) Another way to present the table is as follows:
TABLE 2

AF           B
             F      M
D     E      22     12     34
      U      7      9      16
             29     21     50

RF           B
             F      M
D     E      0,44   0,24   0,68
      U      0,14   0,18   0,32
             0,58   0,42   1
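The step from absolute to relative frequencies can also be checked outside EXCEL. The following is a minimal Python sketch, not part of the original notes; the variable names are ours, and decimal points are used where the text uses decimal commas.

```python
# Absolute frequencies from Table 1: (gender B, employment status D) -> count
counts = {("F", "E"): 22, ("F", "U"): 7, ("M", "E"): 12, ("M", "U"): 9}
n = sum(counts.values())

# Joint relative frequencies: each cell divided by the number of observations
rf = {cell: af / n for cell, af in counts.items()}

# Marginal relative frequencies (the last row and last column of Table 2)
rf_gender = {g: sum(v for (b, d), v in rf.items() if b == g) for g in ("F", "M")}
rf_status = {s: sum(v for (b, d), v in rf.items() if d == s) for s in ("E", "U")}

for status in ("E", "U"):
    row = [round(rf[(g, status)], 2) for g in ("F", "M")]
    print(status, row, round(rf_status[status], 2))
print("total", [round(rf_gender[g], 2) for g in ("F", "M")])
```

The printed rows should agree with the RF part of Table 2.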
There are several graphs we can make. Starting from Table 2, we select the relative frequencies and then choose INSERT → COLUMN.
After some make-up (changing the horizontal axis and removing what is unnecessary), we get:
If we choose the second option in INSERT → COLUMN of the Chart Wizard:
The first bar represents employed people and the bar is divided into 2 parts: F and M.
1.3. Qualitative - quantitative

Now we study the combination of C (children) and B (F or M). We find the following frequency tables:
TABLE 3

AF           C
             0      1      2      3      4
B     F      7      6      11     5      0      29
      M      5      8      4      3      1      21
             12     14     15     8      1      50

RF           C
             0      1      2      3      4
B     F      0,14   0,12   0,22   0,1    0      0,58
      M      0,1    0,16   0,08   0,06   0,02   0,42
      Total  0,24   0,28   0,3    0,16   0,02   1
Choosing the relative frequencies in Table 3, we again take INSERT → COLUMN. We get:
The second option in COLUMN gives:
Each bar represents the RF of the number of children. The bar is divided into 2 pieces: one piece for F and one piece for M. Now we can also study M and F separately. We construct 3 frequency tables: one for F only, one for M only and one for the total population. We get:
only F
C      0      1      2      3      4
AF     7      6      11     5      0      29
RF     0,24   0,21   0,38   0,17   0,00   1
CRF    0,24   0,45   0,83   1,00   1,00

only M
C      0      1      2      3      4
AF     5      8      4      3      1      21
RF     0,24   0,38   0,19   0,14   0,05   1
CRF    0,24   0,62   0,81   0,95   1,00

Total
C      0      1      2      3      4
AF     12     14     15     8      1      50
RF     0,24   0,28   0,30   0,16   0,02   1
CRF    0,24   0,52   0,82   0,98   1,00
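The RF and CRF rows follow mechanically from the AF rows. As an illustration only (the helper name `rf_and_crf` is ours, not an EXCEL function), a short Python sketch:

```python
from itertools import accumulate

def rf_and_crf(af):
    """Return relative and cumulative relative frequencies for a list of counts."""
    n = sum(af)
    rf = [a / n for a in af]
    crf = list(accumulate(rf))  # running sum of the relative frequencies
    return rf, crf

af_f = [7, 6, 11, 5, 0]   # only F, for C = 0, 1, 2, 3, 4
af_m = [5, 8, 4, 3, 1]    # only M
af_total = [f + m for f, m in zip(af_f, af_m)]

for label, af in (("only F", af_f), ("only M", af_m), ("Total", af_total)):
    rf, crf = rf_and_crf(af)
    print(label, [round(x, 2) for x in rf], [round(x, 2) for x in crf])
```

Rounded to two decimals, the results should match the three tables above.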
To compare the different pieces, we can make a new table and a new graph. By copy-paste we get the following table:
TABLE 4
C            0      1      2      3      4
only F RF    0,24   0,21   0,38   0,17   0,00   1
only M RF    0,24   0,38   0,19   0,14   0,05   1
Total RF     0,24   0,28   0,30   0,16   0,02   1
We see that for F the mode is 2 and for M the mode is 1. The total mean number of children is 1,44. When we restrict to F, we get a mean equal to 1,48. When we restrict to M, we get a mean equal to 1,38.
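These modes and means can be recomputed from the absolute frequencies of Table 3. The sketch below is only an illustration in Python; the function names are ours.

```python
children = [0, 1, 2, 3, 4]
af = {"F": [7, 6, 11, 5, 0], "M": [5, 8, 4, 3, 1]}
af["Total"] = [f + m for f, m in zip(af["F"], af["M"])]

def mean_from_frequencies(values, freqs):
    """Mean = sum of (outcome * frequency) / number of observations."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

def mode_from_frequencies(values, freqs):
    """Outcome with the highest frequency (first one in case of ties)."""
    return values[freqs.index(max(freqs))]

for group, freqs in af.items():
    print(group,
          "mode:", mode_from_frequencies(children, freqs),
          "mean:", round(mean_from_frequencies(children, freqs), 2))
```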
Selecting all relative frequencies in Table 4, and then choosing the first option in COLUMNS, we get
Using the cumulative relative frequencies, we get

C             0      1      2      3      4
only F CRF    0,24   0,45   0,83   1,00   1,00
only M CRF    0,24   0,62   0,81   0,95   1,00
Total CRF     0,24   0,52   0,82   0,98   1,00
And then we get 3 EFD in one graph:
or
[Figure: the three cumulative relative frequency curves (Series1, Series2, Series3); vertical axis from 0,00 to 1,00, horizontal axis from 0 to 6]
1.4. Quantitative - quantitative

We study A (age) and C (children).
1.4.1. RAW DATA

We can use the raw data to make a SCATTER plot. To this end we copy-paste the data, first A and then C: (for convenience I only copy part of the table)
A   20  37  41  31  49  22  40  46  31  23  45  27
C    0   3   1   1   2   0   2   0   2   1   3   1
Now we select INSERT → SCATTER:
In the next step we can give a title to the graph and to the axes:
1.4.2. GROUPED DATA

When we get the data under the form of a frequency table, we have to start from the following table.
A (centre)   A (class)   C = 0   1      2      3      4      Total
20           ]15, 25]    6       3      1      1      0      11
30           ]25, 35]    4       4      5      0      0      13
40           ]35, 45]    0       5      5      6      0      16
50           ]45, 55]    2       2      4      1      1      10
Total                    12      14     15     8      1      50
For A, we used 4 classes of length 10, with class centres 20, 30, 40 and 50. The relative frequencies are in the following table:
A (class centres)   C = 0   1      2      3      4      Total
20                  0,12    0,06   0,02   0,02   0      0,22
30                  0,08    0,08   0,1    0      0      0,26
40                  0       0,1    0,1    0,12   0      0,32
50                  0,04    0,04   0,08   0,02   0,02   0,2
Total               0,24    0,28   0,3    0,16   0,02   1
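The relative frequencies and their row and column totals follow directly from the grouped absolute frequencies. A minimal Python sketch, given only as an illustration (the variable names are ours):

```python
centres = [20, 30, 40, 50]          # class centres of A
children = [0, 1, 2, 3, 4]          # values of C
af = [
    [6, 3, 1, 1, 0],                # ]15, 25]
    [4, 4, 5, 0, 0],                # ]25, 35]
    [0, 5, 5, 6, 0],                # ]35, 45]
    [2, 2, 4, 1, 1],                # ]45, 55]
]
n = sum(sum(row) for row in af)

# Joint relative frequencies and both margins
rf = [[x / n for x in row] for row in af]
rf_age = [sum(row) for row in rf]               # totals per age class
rf_children = [sum(col) for col in zip(*rf)]    # totals per number of children

for centre, row, total in zip(centres, rf, rf_age):
    print(centre, [round(x, 2) for x in row], round(total, 2))
print("Total", [round(x, 2) for x in rf_children])
```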
To make further calculations and make nice graphs, we present the table in another form:
C 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4
A 20 20 20 20 20 30 30 30 30 30 40 40 40 40 40 50 50 50 50 50
RF 0,12 0,06 0,02 0,02 0 0,08 0,08 0,1 0 0 0 0,1 0,1 0,12 0 0,04 0,04 0,08 0,02 0,02
The first and second columns contain all values of A and C; the third column contains the relative frequencies. To make a BUBBLE graph, we select the data and then choose INSERT → OTHER CHARTS → BUBBLE. We get:
After some make-up, we get:
2. Covariance and Correlation

2.1. Definitions
Recall that the covariance and the correlation coefficient are given by:
$$\mathrm{Cov}(x, y) = \overline{xy} - \bar{x}\,\bar{y}, \qquad r(x, y) = \frac{\mathrm{Cov}(x, y)}{s(x)\,s(y)}$$
Note that these definitions only make sense for quantitative data! The covariance is an indication for the presence or the absence of a linear relationship. The correlation coefficient is a measure of the strength of such a relationship.
2.2. Calculations
2.2.1. Raw data
When we start from RAW data, we select DATA → DATA ANALYSIS.
We study again A and C and we copy-paste the data: (for convenience I only copy part of the table)
A 20 37 41 31 49 22 40 46 31 23 45 27 C 0 3 1 1 2 0 2 0 2 1 3 1
Now we activate the DATA ANALYSIS tools:
On the top we see Correlation and Covariance. For the covariance, we get:
For the Input range, we select the data together with the titles (and then click on LABELS IN FIRST ROW). For the Output range we select an empty cell. We get (for our example):
After OK, we get the following table:
       A        C
A      90,09
C      4,74     1,1664
The table gives
Cov(A, A) = $s^2(A)$ = 90,09
Cov(C, C) = $s^2(C)$ = 1,1664
Cov(A, C) = Cov(C, A) = 4,74 > 0
For the correlation, we proceed in a similar way. We get the following screenshots:
And then:
       A           C
A      1
C      0,462398    1
As expected, the correlations r(A, A) = r(C, C) = 1. For A and C, we get r(A, C) = 0,46.
Remark. Note that this approach can be used for 2 or more variables at the same time!
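The values reported by the Data Analysis tool can be reproduced from the definitions in Section 2.1. The Python sketch below is an illustration, not the author's method; it uses the full data set of Section 1.1 and divides by n (not n - 1), which is what the reported numbers correspond to. The function names are ours.

```python
from math import sqrt

# Full data set of Section 1.1: A = age, C = number of children (50 persons)
A = [
    20, 37, 41, 31, 49, 22, 48, 49, 28, 45,
    38, 20, 24, 23, 47, 37, 48, 48, 48, 42,
    34, 22, 20, 23, 31, 36, 45, 47, 44, 35,
    30, 30, 29, 23, 22, 40, 33, 42, 37, 47,
    43, 43, 34, 28, 40, 46, 31, 23, 45, 27,
]
C = [
    0, 3, 1, 1, 2, 0, 1, 2, 0, 1,
    2, 2, 3, 0, 1, 1, 0, 2, 4, 1,
    2, 1, 0, 1, 1, 1, 3, 2, 2, 2,
    0, 0, 2, 0, 0, 3, 0, 2, 3, 3,
    2, 3, 2, 1, 2, 0, 2, 1, 3, 1,
]

def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Cov(x, y) = mean(xy) - mean(x) * mean(y)."""
    return mean([xi * yi for xi, yi in zip(x, y)]) - mean(x) * mean(y)

def corr(x, y):
    """r(x, y) = Cov(x, y) / (s(x) * s(y)), with s(x) = sqrt(Cov(x, x))."""
    return cov(x, y) / (sqrt(cov(x, x)) * sqrt(cov(y, y)))

print("Cov(A, A) =", round(cov(A, A), 2))
print("Cov(C, C) =", round(cov(C, C), 4))
print("Cov(A, C) =", round(cov(A, C), 2))
print("r(A, C)   =", round(corr(A, C), 6))
```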
2.2.2. Grouped data

For grouped data we have a little more work. We have to use the EXCEL function SUMPRODUCT. When we have grouped data, we present the table (or make a table) of the following form:
C    A    AF   RF
0    20   6    0,12
1    20   3    0,06
2    20   1    0,02
3    20   1    0,02
4    20   0    0
0    30   4    0,08
1    30   4    0,08
2    30   5    0,1
3    30   0    0
4    30   0    0
0    40   0    0
1    40   5    0,1
2    40   5    0,1
3    40   6    0,12
4    40   0    0
0    50   2    0,04
1    50   2    0,04
2    50   4    0,08
3    50   1    0,02
4    50   1    0,02
          50   1
To calculate the mean of C, for example, we have to calculate:
Mean(C) = sum of (outcomes * frequencies)/n = [(0 * 6) + (1 * 3) + ... + (4 * 1)]/50
or
Mean(C) = sum of (outcomes * relative frequencies) = (0 * 0,12) + (1 * 0,06) + ... + (4 * 0,02)
With the excel-function SUMPRODUCT we can do this in one step. When we activate the function wizard, we select the SUMPRODUCT function and get:
For Array 1 we choose the column of C-values, for Array 2, we choose the relative frequencies. We get:
After OK, we find mean(C) = 1,44. In a similar way, we find mean(A) = 35. To find the second moment mean($C^2$), we choose in ARRAY 1 the C-values, in ARRAY 2 again the C-values, and in ARRAY 3 the relative frequencies.
After OK, we get the result mean($C^2$) = 3,24. In a similar way, we find mean($A^2$) = 1334. If we want mean(AC), we choose in ARRAY 1 the values of C, in ARRAY 2 the values of A, and in ARRAY 3 the relative frequencies. We get:
As a result we obtain that: mean(AC) = 55.
Using these numbers, it is now easy to find variances, standard deviations, covariance and correlation coefficient. Recall that
s^2(x) = mean(x^2) - (mean(x))^2
s(x) = sqrt(s^2(x))
cov(x,y) = mean(xy) - mean(x)*mean(y)
r(x,y) = cov(x,y)/(s(x)*s(y))
For A and C we find:
Quantity     Value      EXCEL formula
mean(C)      1,44       SUMPRODUCT(L80:L99;O80:O99)
mean(A)      35
mean(C^2)    3,24       SUMPRODUCT(L80:L99;L80:L99;O80:O99)
mean(A^2)    1334
mean(AC)     55         SUMPRODUCT(L80:L99;M80:M99;O80:O99)
s^2(C)       1,1664     R84-R81*R81
s(C)         1,08       SQRT(R89)
s^2(A)       109
s(A)         10,44031
cov(A,C)     4,6        R87-R82*R81
r(A,C)       0,407963   R95/(R90*R93)
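The same calculations can be sketched in Python. The helper `sumproduct` below is our own stand-in for the EXCEL function (it multiplies the arrays element by element and sums the products); it is an illustration under that assumption, not EXCEL's implementation.

```python
from math import prod, sqrt

# Grouped data in long form: one entry per (C, A) combination
C = [0, 1, 2, 3, 4] * 4
A = [20] * 5 + [30] * 5 + [40] * 5 + [50] * 5
RF = [0.12, 0.06, 0.02, 0.02, 0,
      0.08, 0.08, 0.1, 0, 0,
      0, 0.1, 0.1, 0.12, 0,
      0.04, 0.04, 0.08, 0.02, 0.02]

def sumproduct(*arrays):
    """Sum of the element-wise products of equally long arrays."""
    return sum(prod(factors) for factors in zip(*arrays))

mean_c = sumproduct(C, RF)
mean_a = sumproduct(A, RF)
mean_c2 = sumproduct(C, C, RF)
mean_a2 = sumproduct(A, A, RF)
mean_ac = sumproduct(C, A, RF)

var_c = mean_c2 - mean_c ** 2
var_a = mean_a2 - mean_a ** 2
s_c, s_a = sqrt(var_c), sqrt(var_a)
cov_ac = mean_ac - mean_a * mean_c
r_ac = cov_ac / (s_a * s_c)

print(round(mean_c, 2), round(mean_a, 2), round(mean_c2, 2),
      round(mean_a2, 2), round(mean_ac, 2))
print(round(var_c, 4), round(s_c, 2), round(var_a, 2), round(s_a, 5))
print(round(cov_ac, 2), round(r_ac, 6))
```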
3. The regression line

3.1. Formulas
To find the regression line, we start from data $(x_i, y_i)$, $i = 1, 2, \ldots, n$. We approximate $y$ by $\hat{y} = a + bx$. For the $i$-th data point we have
$$\hat{y}_i = a + b x_i, \qquad e_i = y_i - \hat{y}_i = y_i - a - b x_i.$$
We determine $a$ and $b$ by requiring that $\bar{e} = 0$ and that $s^2(e)$ is as small as possible, which amounts to making $SSE = \sum_{i=1}^{n} e_i^2$ as small as possible. We find the following solution to this mathematical problem:
$$b = \frac{\mathrm{cov}(x, y)}{s^2(x)}, \qquad a = \bar{y} - b\,\bar{x}, \qquad \hat{y} = a + bx.$$
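The solution can be checked with a short derivation. This is a sketch under the assumption that $s^2(x)$, $s^2(y)$ and $\mathrm{cov}(x, y)$ are the versions of Section 2 (division by $n$); a bar denotes a mean and $e_i = y_i - a - b x_i$ is the $i$-th error.

```latex
\begin{align*}
\frac{\partial}{\partial a}\sum_{i=1}^{n}(y_i - a - b x_i)^2
  &= -2\sum_{i=1}^{n} e_i = 0
  \;\Longrightarrow\; \bar{e} = 0
  \;\Longrightarrow\; a = \bar{y} - b\,\bar{x},\\
e_i &= (y_i - \bar{y}) - b\,(x_i - \bar{x}),\\
\frac{1}{n}\sum_{i=1}^{n} e_i^2
  &= s^2(y) - 2b\,\mathrm{cov}(x, y) + b^2 s^2(x),\\
\frac{d}{db}\Bigl[s^2(y) - 2b\,\mathrm{cov}(x, y) + b^2 s^2(x)\Bigr] = 0
  &\;\Longrightarrow\; b = \frac{\mathrm{cov}(x, y)}{s^2(x)}.
\end{align*}
```

The first line shows that the condition $\bar{e} = 0$ fixes the intercept; the remaining lines show that the slope minimising the mean squared error is the covariance divided by the variance of $x$.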
3.2. Grouped data

When we have grouped data, we proceed as before to make the calculations. Reconsider the following example:
C    A    AF   RF
0    20   6    0,12
1    20   3    0,06
2    20   1    0,02
3    20   1    0,02
4    20   0    0
0    30   4    0,08
1    30   4    0,08
2    30   5    0,1
3    30   0    0
4    30   0    0
0    40   0    0
1    40   5    0,1
2    40   5    0,1
3    40   6    0,12
4    40   0    0
0    50   2    0,04
1    50   2    0,04
2    50   4    0,08
3    50   1    0,02
4    50   1    0,02
          50   1
Earlier we have found the following information:

mean(C)      1,44
mean(A)      35
mean(C^2)    3,24
mean(A^2)    1334
mean(AC)     55
s^2(C)       1,1664
s(C)         1,08
s^2(A)       109
s(A)         10,44031
cov(A,C)     4,6
r(A,C)       0,407963
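We determine the regression line $\hat{C} = a + bA$, and find:
$$b = \frac{\mathrm{cov}(A, C)}{s^2(A)} = \frac{4{,}6}{109} = 0{,}0422$$
$$a = \bar{C} - b\,\bar{A} = 1{,}44 - \frac{4{,}6}{109}\cdot 35 = -0{,}037$$
$$\hat{C} = a + bA = -0{,}037 + 0{,}0422\,A$$
For A = 20, 30, 40, 50 we find that:
A = 20: $\hat{C}$ = 0,81
A = 30: $\hat{C}$ = 1,23
A = 40: $\hat{C}$ = 1,65
A = 50: $\hat{C}$ = 2,07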
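Remark. For the age class ]15, 25], the mean number of children is
$$\bar{C} \mid A = 20:\quad \frac{0\cdot 6 + 1\cdot 3 + 2\cdot 1 + 3\cdot 1 + 4\cdot 0}{11} = 0{,}73.$$
For the other age classes we find $\bar{C} \mid A = 30$: 1,08; $\bar{C} \mid A = 40$: 2,06; $\bar{C} \mid A = 50$: 1,7.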
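The regression line for the grouped data and the class means of the remark can be recomputed as follows. This Python sketch is an illustration only; the names `slope`, `intercept`, `fitted` and `class_mean` are ours.

```python
centres = [20, 30, 40, 50]          # class centres of A
children = [0, 1, 2, 3, 4]          # values of C
af = [
    [6, 3, 1, 1, 0],
    [4, 4, 5, 0, 0],
    [0, 5, 5, 6, 0],
    [2, 2, 4, 1, 1],
]
n = sum(sum(row) for row in af)

# One (age, children, frequency) triple per cell of the table
cells = [(age, c, f) for age, row in zip(centres, af)
         for c, f in zip(children, row)]

mean_a = sum(age * f for age, c, f in cells) / n
mean_c = sum(c * f for age, c, f in cells) / n
mean_a2 = sum(age * age * f for age, c, f in cells) / n
mean_ac = sum(age * c * f for age, c, f in cells) / n

slope = (mean_ac - mean_a * mean_c) / (mean_a2 - mean_a ** 2)   # b
intercept = mean_c - slope * mean_a                             # a

# Value on the regression line versus observed mean of C in each age class
fitted = {age: intercept + slope * age for age in centres}
class_mean = {age: sum(c * f for c, f in zip(children, row)) / sum(row)
              for age, row in zip(centres, af)}

print("b =", round(slope, 4), " a =", round(intercept, 3))
for age in centres:
    print(age, round(fitted[age], 3), round(class_mean[age], 3))
```

Comparing the two columns of the output shows how closely the straight line follows the observed class means.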
3.3. Raw data

For raw data, we can proceed as in Section 3.1 or we can proceed as follows.
3.3.1. Using the XY-scatter (BBA 1)

We take again the following example:
A   20  37  41  31  49  22  40  46  31  23  45  27
C    0   3   1   1   2   0   2   0   2   1   3   1
First we make a SCATTER plot:
If we activate the chart (by clicking on it), we see in the chart-tools layout the option TRENDLINE and we can select the trend line options:
We choose a LINEAR trendline and also ask to DISPLAY EQUATION and DISPLAY R-SQUARED:
After CLOSE, we get the following graph:
After some make-up, we find:
EXCEL has found the regression line. It is given by the equation y = 0,052x - 0,427. In our notation, this is $\hat{C} = -0{,}427 + 0{,}052\,A$.
We also get a number $R^2$. This number is the same as $R^2 = r^2(A, C)$ and it is called the R-square.
In the perfect situation, we have $r(A, C) = \pm 1$ and then $R^2 = 1$. In the worst scenario, we have $r(A, C) = 0$, and then $R^2 = 0$.
One can prove that $R^2 = r^2(A, C) = r^2(C, \hat{C})$.
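The trendline and the two expressions for the R-square can be verified numerically on the full data set of Section 1.1. The following Python sketch is an illustration, not EXCEL's algorithm; it applies the formulas of Section 3.1 with variances and covariances divided by n, and the function names are ours.

```python
from math import sqrt

# Full data set of Section 1.1: A = age, C = number of children
A = [
    20, 37, 41, 31, 49, 22, 48, 49, 28, 45,
    38, 20, 24, 23, 47, 37, 48, 48, 48, 42,
    34, 22, 20, 23, 31, 36, 45, 47, 44, 35,
    30, 30, 29, 23, 22, 40, 33, 42, 37, 47,
    43, 43, 34, 28, 40, 46, 31, 23, 45, 27,
]
C = [
    0, 3, 1, 1, 2, 0, 1, 2, 0, 1,
    2, 2, 3, 0, 1, 1, 0, 2, 4, 1,
    2, 1, 0, 1, 1, 1, 3, 2, 2, 2,
    0, 0, 2, 0, 0, 3, 0, 2, 3, 3,
    2, 3, 2, 1, 2, 0, 2, 1, 3, 1,
]

def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    return mean([xi * yi for xi, yi in zip(x, y)]) - mean(x) * mean(y)

def corr(x, y):
    return cov(x, y) / sqrt(cov(x, x) * cov(y, y))

# Regression line C_hat = a + b * A
b = cov(A, C) / cov(A, A)
a = mean(C) - b * mean(A)
fitted = [a + b * x for x in A]
residuals = [y - f for y, f in zip(C, fitted)]

# R-square computed in three ways
sse = sum(e * e for e in residuals)
sst = sum((y - mean(C)) ** 2 for y in C)
r2_from_sums = 1 - sse / sst
r2_from_ac = corr(A, C) ** 2
r2_from_fit = corr(C, fitted) ** 2

print("b =", round(b, 4), " a =", round(a, 4))
print(round(r2_from_sums, 6), round(r2_from_ac, 6), round(r2_from_fit, 6))
```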

Remark Using this option in EXCEL, we can also consider other relationships, such as a logarithmic model, a polynomial model etc.
3.3.2. Using DATA → DATA-ANALYSIS (BBA3)
From the statistical point of view it is more interesting to use the Data-Analysis tools of EXCEL. In these tools, we activate REGRESSION and obtain a form that we fill in as follows:
Input Y-range: we select the data about C, together with the title or label;
Input X-range: we select the data about A, again with the title or label;
Labels: we tell EXCEL that the selected ranges include the labels or titles;
Output range: we select the output range option and then choose an empty cell.
After OK, we get the following output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0,462398
R Square             0,213812
Adjusted R Square    0,197433
Standard Error       0,977353
Observations         50

ANOVA
             df    SS          MS          F           Significance F
Regression   1     12,46953    12,46953    13,05412    0,000723
Residual     48    45,85047    0,955218
Total        49    58,32

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95,0%   Upper 95,0%
Intercept    -0,4278        0,535118         -0,79945   0,427969   -1,50373    0,6481      -1,504        0,648128
A            0,052614       0,014562         3,613048   0,000723   0,023335    0,0819      0,0233        0,081893
The output consists of 3 parts. We start with the last part:

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -0,4278        0,5351           -0,7994   0,4280    -1,5037     0,6481
A            0,0526         0,0146           3,6130    0,0007    0,0233      0,0819
In the first column, we get the estimates for the parameters in the model $\hat{C} = a + bA$. We find
b = 0,0526
a = intercept = -0,4278
$\hat{C} = -0{,}4278 + 0{,}0526\,A$
In the second column, we get the estimated standard deviations of these estimates:
$$s_b = 0{,}0146, \qquad s_a = 0{,}5351.$$
The t-values of the estimates are given in the 3rd column:
$$t_a = \frac{a}{s_a} = \frac{-0{,}4278}{0{,}5351} = -0{,}7994, \qquad t_b = \frac{b}{s_b} = \frac{0{,}0526}{0{,}0146} = 3{,}6130.$$
In the next column we get the prob-values of the t-values. Note that the P-value is given by
$$P\text{-value} = 2\,P\bigl(t_{n-k} > |t_a|\bigr).$$
The last 2 columns give 95% confidence intervals for the parameters.
In the first part of the output, we see:
Regression Statistics
Multiple R           0,462398368
R Square             0,213812251
Adjusted R Square    0,197433339
Standard Error       0,977352605
Observations         50
* as before, $R^2 = r^2(C, \hat{C})$
* Adjusted $R^2$: this is a slightly adjusted $R^2$-value
* Standard error $s(e)$: this is the square root of $s^2(e)$, where $s^2(e)$ is given by
$$s^2(e) = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2,$$
where $e_i = y_i - \hat{y}_i$ are the errors.
In the middle part, we have the ANOVA table (ANalysis Of VAriance):
ANOVA
             df    SS         MS         F          Significance F
Regression   1     12,4695    12,4695    13,0541    0,000722823
Residual     48    45,8505    0,9552
Total        49    58,32
In the notations of statistics, we have
$$s^2(C) = s^2(Y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{SST}{n-1},$$
where SST = TOTAL sum of squares, and $n - 1$ is called the degrees of freedom. In our example, we have $s^2(C) = 58{,}32/49$.
The other entries are
$$s^2(e) = \frac{1}{n-k}\sum_{i=1}^{n}(e_i - \bar{e})^2 = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2 = \frac{SSE}{n-k},$$
where SSE = the residual sum of squares or the error sum of squares; and
$$s^2(\hat{y}) = s^2(\hat{C}) = \frac{1}{k-1}\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \frac{SSR}{k-1}.$$
Note that SST = SSR + SSE, and this will hold for all linear models with a constant term. The standard error of the first part of the output is just equal to $s(e)$. The F-value is given by
$$F = \frac{SSR/(k-1)}{SSE/(n-k)} = \frac{SSR/(k-1)}{(SST - SSR)/(n-k)} = \frac{R^2/(k-1)}{(1 - R^2)/(n-k)}.$$
This F-value is related to $R^2$: we find that $R^2$ is large if and only if the corresponding F-value is large. The advantage of F is that we can compare it with an F-distribution F(k-1, n-k). The prob-value of the F-value is given in the Significance F column. A small Significance F shows that the F-value is large, and hence also that the corresponding $R^2$-value is large.
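The entries of the ANOVA table can be recomputed from the raw data of Section 1.1. The Python sketch below is only an illustration of the formulas above, with k = 2 estimated parameters (a and b); it does not compute the Significance F, which requires the F-distribution.

```python
from math import sqrt

# Full data set of Section 1.1: A = age, C = number of children
A = [
    20, 37, 41, 31, 49, 22, 48, 49, 28, 45,
    38, 20, 24, 23, 47, 37, 48, 48, 48, 42,
    34, 22, 20, 23, 31, 36, 45, 47, 44, 35,
    30, 30, 29, 23, 22, 40, 33, 42, 37, 47,
    43, 43, 34, 28, 40, 46, 31, 23, 45, 27,
]
C = [
    0, 3, 1, 1, 2, 0, 1, 2, 0, 1,
    2, 2, 3, 0, 1, 1, 0, 2, 4, 1,
    2, 1, 0, 1, 1, 1, 3, 2, 2, 2,
    0, 0, 2, 0, 0, 3, 0, 2, 3, 3,
    2, 3, 2, 1, 2, 0, 2, 1, 3, 1,
]
n, k = len(A), 2

mean_a = sum(A) / n
mean_c = sum(C) / n

# Least-squares line C_hat = a + b * A
sxy = sum((x - mean_a) * (y - mean_c) for x, y in zip(A, C))
sxx = sum((x - mean_a) ** 2 for x in A)
b = sxy / sxx
a = mean_c - b * mean_a
fitted = [a + b * x for x in A]

# Sums of squares of the ANOVA table
sst = sum((y - mean_c) ** 2 for y in C)               # total
ssr = sum((f - mean_c) ** 2 for f in fitted)          # regression
sse = sum((y - f) ** 2 for y, f in zip(C, fitted))    # residual

ms_regression = ssr / (k - 1)
ms_residual = sse / (n - k)
standard_error = sqrt(ms_residual)                    # s(e)
f_value = ms_regression / ms_residual

r_squared = ssr / sst
f_from_r2 = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))

print("SS:", round(ssr, 4), round(sse, 4), round(sst, 2))
print("MS:", round(ms_regression, 4), round(ms_residual, 4))
print("s(e) =", round(standard_error, 6), " F =", round(f_value, 4))
```

The two ways of computing F (from the sums of squares and from the R-square) should coincide up to rounding.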