1. An example
  1.1. The data set
  1.2. Qualitative - qualitative
  1.3. Qualitative - quantitative
  1.4. Quantitative - quantitative
    1.4.1. RAW DATA
    1.4.2. GROUPED DATA
2. Covariance and Correlation
  2.1. Definitions
  2.2. Calculations
    2.2.1. Raw data
    2.2.2. Grouped data
3. The regression line
  3.1. Formulas
  3.2. Grouped data
  3.3. Raw data
    3.3.1. Using the XY-scatter (BBA 1)
    3.3.2. Using DATA → DATA-ANALYSIS (BBA3)
1. An example
1.1. The data set
We have 50 data lines (N = line number) with the following variables:
A = age (quantitative)
B = Male or Female (qualitative)
C = number of children (quantitative)
D = Unemployed or Employed (qualitative)
E = IQ (quantitative)
N 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
A 20 37 41 31 49 22 48 49 28 45 38 20 24 23 47 37 48 48 48 42 34 22 20 23 31 36 45 47 44 35 30 30 29 23
B M F F M F M M F M F M F F F M F F F M F F M F F F M M F M F F F F F
C 0 3 1 1 2 0 1 2 0 1 2 2 3 0 1 1 0 2 4 1 2 1 0 1 1 1 3 2 2 2 0 0 2 0
D E E E E E U U U U E E E E E E E E E E E U E E U E E E U U E E E E E
E 35 91 93 156 69 145 112 160 109 134 85 154 87 75 140 149 80 123 141 74 147 130 90 109 160 79 134 81 171 116 91 141 85 185
N 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
A 22 40 33 42 37 47 43 43 34 28 40 46 31 23 45 27
B M M F F F F M F F M M M F M M M
C 0 3 0 2 3 3 2 3 2 1 2 0 2 1 3 1
D U E E E U E E E U U E U U U U E
E 48 90 117 112 72 106 127 113 200 82 153 145 115 110 145 143
As before, we can study each of the variables separately. Now we can also study two variables at the same time. There are several possibilities: qualitative - qualitative, qualitative - quantitative and quantitative - quantitative.
Note that it does not make sense to calculate cumulative relative frequencies. (Why?) Another way to present the table is as follows:
Absolute frequencies (AF):

D \ B       F      M   Total
E          22     12      34
U           7      9      16
Total      29     21      50

TABLE 2. Relative frequencies (RF):

D \ B       F      M   Total
E        0,44   0,24    0,68
U        0,14   0,18    0,32
Total    0,58   0,42       1
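The two tables can also be rebuilt directly from the raw data. The following is a minimal Python sketch, not part of the Excel workflow of these notes; the strings `B` and `D` are typed in from the data set of section 1.1 (one character per data line) and the names `af` and `rf` are chosen here for illustration only.

```python
from collections import Counter

# Columns B (M/F) and D (E = employed, U = unemployed) of the 50 data lines,
# typed in from the data set (lines 1-34 followed by lines 35-50).
B = "MFFMFMMFMF" "MFFFMFFFMF" "FMFFFMMFMF" "FFFF" "MMFFFFMFFMMMFMMM"
D = "EEEEEUUUUE" "EEEEEEEEEE" "UEEUEEEUUE" "EEEE" "UEEEUEEEUUEUUUUE"

n = len(B)

# Absolute frequency of every (D, B) combination.
af = Counter(zip(D, B))

# Relative frequencies: divide each count by the number of data lines.
rf = {key: count / n for key, count in af.items()}

# Print both tables row by row, with the row totals in the last position.
for d in "EU":
    counts = [af[(d, b)] for b in "FM"]
    shares = [round(rf[(d, b)], 2) for b in "FM"]
    print(d, counts, sum(counts), shares, round(sum(counts) / n, 2))
```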
There are several graphs we can make. Starting from Table 2, we select the relative frequencies and then choose INSERT → COLUMN:
After some make-up (changing the horizontal axis and removing what is unnecessary) we get:
The first bar represents employed people and the bar is divided into 2 parts: F and M.
COLUMN. We
Each bar represents the RF of the number of children. The bar is divided into 2 pieces: one piece for F and one piece for M. Now we can also study M and F separately. We construct 3 frequency tables: one for F only, one for M only and one for the total population. We get:
C              0      1      2      3      4   Total
only F   AF    7      6     11      5      0      29
         RF 0,24   0,21   0,38   0,17   0,00       1
only M   AF    5      8      4      3      1      21
         RF 0,24   0,38   0,19   0,14   0,05       1
Total    AF   12     14     15      8      1      50
         RF 0,24   0,28   0,30   0,16   0,02       1
To compare the different pieces, we can make a new table and a new graph. By copy-paste we get the following table:
Table 4

C               0      1      2      3      4   Total
only F   RF  0,24   0,21   0,38   0,17   0,00       1
only M   RF  0,24   0,38   0,19   0,14   0,05       1
Total    RF  0,24   0,28   0,30   0,16   0,02       1
We see that for F the mode is 2 and for M the mode is 1. The total mean number of children is 1,44. When we restrict to F, we get a mean equal to 1,48. When we restrict to M, we get a mean equal to 1,38.
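The modes and means quoted here follow directly from the absolute frequencies of the three frequency tables. A small Python sketch of that arithmetic (the helper names `mean` and `mode` are chosen here for illustration; the counts are the AF values given in the text):

```python
# Absolute frequencies of C = 0, 1, 2, 3, 4 for each group (from the text).
values = [0, 1, 2, 3, 4]
freq = {
    "F": [7, 6, 11, 5, 0],
    "M": [5, 8, 4, 3, 1],
    "Total": [12, 14, 15, 8, 1],
}


def mean(counts):
    """Weighted mean: sum of (outcome * frequency) divided by n."""
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)


def mode(counts):
    """Outcome with the highest frequency."""
    return values[counts.index(max(counts))]


for group, counts in freq.items():
    print(group, "mode =", mode(counts), "mean =", round(mean(counts), 2))
```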
Selecting all relative frequencies in Table 4, and then choosing the first option in COLUMNS, we get
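[Figure: column chart of the relative frequencies in Table 4, with C = 0, 1, 2, 3, 4 on the horizontal axis]
or
[Figure: alternative chart of the same relative frequencies, with the vertical axis running from 0,00 to 1,00]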
SCATTER:
In the next step we can give a title to the graph and to the axes:
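For A, we used 4 classes of length 10, with class centres 20, 30, 40 and 50. The absolute frequencies (the relative frequencies multiplied by n = 50) are:

A (class centres)   C = 0      1      2      3      4   Total
20                      6      3      1      1      0      11
30                      4      4      5      0      0      13
40                      0      5      5      6      0      16
50                      2      2      4      1      1      10
Total                  12     14     15      8      1      50

The relative frequencies are in the following table:

A (class centres)   C = 0      1      2      3      4   Total
20                   0,12   0,06   0,02   0,02      0    0,22
30                   0,08   0,08    0,1      0      0    0,26
40                      0    0,1    0,1   0,12      0    0,32
50                   0,04   0,04   0,08   0,02   0,02    0,20
Total                0,24   0,28   0,30   0,16   0,02       1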
To make further calculations and make nice graphs, we present the table in another form:
C    A     RF
0    20    0,12
1    20    0,06
2    20    0,02
3    20    0,02
4    20    0
0    30    0,08
1    30    0,08
2    30    0,1
3    30    0
4    30    0
0    40    0
1    40    0,1
2    40    0,1
3    40    0,12
4    40    0
0    50    0,04
1    50    0,04
2    50    0,08
3    50    0,02
4    50    0,02
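The rearrangement of the two-way table into this list form can be sketched in Python as follows. This is only an illustration of the idea, not part of the Excel steps; the names `rf_table` and `long_form` are introduced here, and the numbers are the relative frequencies listed in the text.

```python
# Two-way table of relative frequencies: one row per class centre of A,
# one entry per value of C (values taken from the text).
a_centres = [20, 30, 40, 50]
c_values = [0, 1, 2, 3, 4]
rf_table = [
    [0.12, 0.06, 0.02, 0.02, 0.00],  # A = 20
    [0.08, 0.08, 0.10, 0.00, 0.00],  # A = 30
    [0.00, 0.10, 0.10, 0.12, 0.00],  # A = 40
    [0.04, 0.04, 0.08, 0.02, 0.02],  # A = 50
]

# List form: one (C, A, RF) triple per cell of the two-way table.
long_form = [
    (c, a, rf)
    for a, row in zip(a_centres, rf_table)
    for c, rf in zip(c_values, row)
]

for triple in long_form:
    print(*triple)
```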
The first and second columns contain all values of A and C; the 3rd column contains the relative frequencies. To make a BUBBLE graph, we select the data and then choose INSERT → OTHER CHARTS → BUBBLE. We get:
2.2. Calculations
2.2.1. Raw data
When we start from RAW data, we select DATA → DATA ANALYSIS.
We study again A and C and we copy-paste the data: (for convenience I only copy part of the table)
A 20 37 41 31 49 22 ... 40 46 31 23 45 27
C 0 3 1 1 2 0 ... 2 0 2 1 3 1
On the top we see Correlation and Covariance. For the covariance, we get:
For the Input range, we select the data together with the titles (and then click on LABELS IN FIRST ROW). For the Output range we select an empty cell. We get (for our example):
         A        C
A    90,09
C     4,74   1,1664
The table gives
Cov(A, A) = s²(A) = 90,09
Cov(C, C) = s²(C) = 1,1664
Cov(A, C) = Cov(C, A) = 4,74 > 0
For the correlation, we proceed in a similar way. We get the following screenshots:
And then:
            A    C
A           1
C    0,462398    1
As expected, the correlations r(A, A) = r(C, C) = 1. For A and C, we get r(A, C) = 0,46.
Remark. Note that this approach can be used for 2 or more variables at the same time!
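As a cross-check outside Excel, the same covariances and correlation can be computed from the 50 raw values of A and C. The sketch below is a plain-Python illustration, not the Data Analysis tool itself; it assumes the population form of the covariance (division by n), which is the form that reproduces the numbers in the tables above. The helper names `mean`, `cov` and `corr` are chosen here.

```python
from math import sqrt

# Raw data: A = age, C = number of children (data lines 1 to 50).
A = [20, 37, 41, 31, 49, 22, 48, 49, 28, 45,
     38, 20, 24, 23, 47, 37, 48, 48, 48, 42,
     34, 22, 20, 23, 31, 36, 45, 47, 44, 35,
     30, 30, 29, 23, 22, 40, 33, 42, 37, 47,
     43, 43, 34, 28, 40, 46, 31, 23, 45, 27]
C = [0, 3, 1, 1, 2, 0, 1, 2, 0, 1,
     2, 2, 3, 0, 1, 1, 0, 2, 4, 1,
     2, 1, 0, 1, 1, 1, 3, 2, 2, 2,
     0, 0, 2, 0, 0, 3, 0, 2, 3, 3,
     2, 3, 2, 1, 2, 0, 2, 1, 3, 1]


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance: average product of deviations (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def corr(xs, ys):
    """Correlation coefficient r = cov(x, y) / (s(x) * s(y))."""
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))


print("Cov(A, A) =", round(cov(A, A), 2))
print("Cov(C, C) =", round(cov(C, C), 4))
print("Cov(A, C) =", round(cov(A, C), 2))
print("r(A, C)   =", round(corr(A, C), 6))
```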
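To calculate the mean of C, for example, we have to calculate:
\[ \operatorname{mean}(C) = \frac{\sum (\text{outcome} \times \text{frequency})}{n} = \frac{(0 \cdot 6) + (1 \cdot 3) + \dots + (4 \cdot 1)}{50} \]
or
\[ \operatorname{mean}(C) = \sum (\text{outcome} \times \text{relative frequency}) = (0 \cdot 0{,}12) + (1 \cdot 0{,}06) + \dots + (4 \cdot 0{,}02) \]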
With the Excel function SUMPRODUCT we can do this in one step. When we activate the function wizard and select the SUMPRODUCT function, we get:
For Array 1 we choose the column of C-values, for Array 2, we choose the relative frequencies. We get:
After OK, we find mean(C) = 1,44. In a similar way, we find mean(A) = 35. To find the second moment mean(C²), we choose the C-values in ARRAY 1, the C-values again in ARRAY 2, and the relative frequencies in ARRAY 3.
After OK, we get the result: mean(C²) = 3,24. In a similar way, we find mean(A²) = 1334. If we want mean(AC), we choose the values of C in ARRAY 1, the values of A in ARRAY 2, and the relative frequencies in ARRAY 3. We get:
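Using these numbers, it is now easy to find variances, standard deviations, covariance and correlation coefficient. Recall that
\[ s^2(x) = \operatorname{mean}(x^2) - (\operatorname{mean}(x))^2, \qquad s(x) = \sqrt{s^2(x)}, \]
\[ \operatorname{cov}(x, y) = \operatorname{mean}(xy) - \operatorname{mean}(x)\,\operatorname{mean}(y), \qquad r(x, y) = \frac{\operatorname{cov}(x, y)}{s(x)\,s(y)}. \]
For A and C we find: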
mean(C)      1,44
mean(A)      35
mean(C²)     3,24
mean(A²)     1334
mean(AC)     55
s²(C)        1,1664
s(C)         1,08
s²(A)        109
s(A)         10,44031
cov(A,C)     4,6
r(A,C)       0,407963

Examples of the cell formulas used in the sheet:
first moment:    SUMPRODUCT(L80:L99;O80:O99)
second moment:   SUMPRODUCT(L80:L99;L80:L99;O80:O99)
covariance:      R87-R82*R81
correlation:     R95/(R90*R93)
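The SUMPRODUCT steps can be mirrored in a few lines of Python. This is a sketch under stated assumptions: `sumproduct` is a helper written here to imitate the Excel function for equal-length lists, and the three lists hold the list-form table of C, A (class centres) and the relative frequencies given earlier.

```python
from math import sqrt

# List-form table: values of C, class centres of A, relative frequencies.
C = [0, 1, 2, 3, 4] * 4
A = [20] * 5 + [30] * 5 + [40] * 5 + [50] * 5
RF = [0.12, 0.06, 0.02, 0.02, 0.00,
      0.08, 0.08, 0.10, 0.00, 0.00,
      0.00, 0.10, 0.10, 0.12, 0.00,
      0.04, 0.04, 0.08, 0.02, 0.02]


def sumproduct(*arrays):
    """Sum of the element-wise products of equal-length lists."""
    total = 0.0
    for items in zip(*arrays):
        product = 1.0
        for item in items:
            product *= item
        total += product
    return total


# First and second moments.
mean_c = sumproduct(C, RF)
mean_a = sumproduct(A, RF)
mean_c2 = sumproduct(C, C, RF)
mean_a2 = sumproduct(A, A, RF)
mean_ac = sumproduct(A, C, RF)

# Variances, covariance and correlation from the moments.
var_c = mean_c2 - mean_c ** 2
var_a = mean_a2 - mean_a ** 2
cov_ac = mean_ac - mean_a * mean_c
r_ac = cov_ac / sqrt(var_a * var_c)

print(round(mean_c, 2), round(mean_a, 2), round(mean_ac, 2))
print(round(var_c, 4), round(var_a, 2), round(cov_ac, 2), round(r_ac, 6))
```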
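\[ b = \frac{\operatorname{cov}(A, C)}{s^2(A)} = \frac{4{,}6}{109} = 0{,}0422 \]
\[ a = \bar{C} - b\,\bar{A} = 1{,}44 - \frac{4{,}6}{109} \cdot 35 = -0{,}037 \]
\[ \hat{C} = a + bA = -0{,}037 + 0{,}0422\,A \]
For A = 20, 30, 40, 50 we find that:
A = 20: $\hat{C}$ = 0,80
A = 30: $\hat{C}$ = 1,23
A = 40: $\hat{C}$ = 1,65
A = 50: $\hat{C}$ = 2,07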
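Remark. For the age class ]15, 25], we find that
\[ \bar{C} \mid (A = 20) = \frac{0 \cdot 6 + 1 \cdot 3 + 2 \cdot 1 + 3 \cdot 1 + 4 \cdot 0}{11} = 0{,}73 \]
For the other age classes we find $\bar{C} \mid (A = 30) = 1{,}07$; $\bar{C} \mid (A = 40) = 2{,}06$; $\bar{C} \mid (A = 50) = 1{,}7$.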
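The regression line for the grouped data and the class means of the remark can be checked with a short Python sketch. The moments are the values found with SUMPRODUCT in the text; the absolute frequencies per age class are obtained here by multiplying the relative frequencies by n = 50, and the function names `predict` and `class_mean` are chosen for illustration.

```python
# Moments of the grouped data (from the text).
mean_a, mean_c = 35.0, 1.44
var_a, cov_ac = 109.0, 4.6

b = cov_ac / var_a          # slope
a = mean_c - b * mean_a     # intercept


def predict(age):
    """Value of C on the regression line for a given class centre of A."""
    return a + b * age


# Absolute frequencies of C = 0, 1, 2, 3, 4 per class centre of A
# (relative frequencies multiplied by n = 50).
counts = {
    20: [6, 3, 1, 1, 0],
    30: [4, 4, 5, 0, 0],
    40: [0, 5, 5, 6, 0],
    50: [2, 2, 4, 1, 1],
}


def class_mean(freqs):
    """Mean number of children within one age class."""
    return sum(c * f for c, f in enumerate(freqs)) / sum(freqs)


for age, freqs in counts.items():
    print(age, "line:", round(predict(age), 3),
          "class mean:", round(class_mean(freqs), 3))
```

The printed values are rounded to three decimals, so they can differ in the last digit from the two-decimal values quoted in the text.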
If we activate the chart (by clicking on it), we see in the chart-tools layout the option TRENDLINE and we can select the trend line options:
We choose a LINEAR trendline and also tick DISPLAY EQUATION ON CHART and DISPLAY R-SQUARED VALUE ON CHART:
EXCEL has found the regression line. It is given by the equation y = 0,052x − 0,427. In our notation, this is $\hat{C} = -0{,}427 + 0{,}052\,A$.
We also get a number $R^2$. This number is the same as $R^2 = r^2(A, C)$ and it is called the R-square.
Multivariate Data Analysis, E. Omey, 2011
In the perfect situation, we have $|r(A, C)| = 1$ and then $R^2 = 1$. In the worst scenario, we have $r(A, C) = 0$, and then $R^2 = 0$.
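The trendline and its R-square can be reproduced from the raw data with the usual least-squares formulas. The following Python sketch is an illustration only (Excel is the tool used in these notes); the variable names `sxx`, `syy` and `sxy` for the sums of squared deviations and cross-products are introduced here.

```python
# Raw data: A = age, C = number of children (data lines 1 to 50).
A = [20, 37, 41, 31, 49, 22, 48, 49, 28, 45,
     38, 20, 24, 23, 47, 37, 48, 48, 48, 42,
     34, 22, 20, 23, 31, 36, 45, 47, 44, 35,
     30, 30, 29, 23, 22, 40, 33, 42, 37, 47,
     43, 43, 34, 28, 40, 46, 31, 23, 45, 27]
C = [0, 3, 1, 1, 2, 0, 1, 2, 0, 1,
     2, 2, 3, 0, 1, 1, 0, 2, 4, 1,
     2, 1, 0, 1, 1, 1, 3, 2, 2, 2,
     0, 0, 2, 0, 0, 3, 0, 2, 3, 3,
     2, 3, 2, 1, 2, 0, 2, 1, 3, 1]

n = len(A)
mean_a = sum(A) / n
mean_c = sum(C) / n

# Sums of squared deviations and of cross-products.
sxx = sum((x - mean_a) ** 2 for x in A)
syy = sum((y - mean_c) ** 2 for y in C)
sxy = sum((x - mean_a) * (y - mean_c) for x, y in zip(A, C))

b = sxy / sxx                       # slope of the regression line
a = mean_c - b * mean_a             # intercept
r_squared = sxy ** 2 / (sxx * syy)  # R-square = r(A, C) squared

print("slope b     =", round(b, 4))
print("intercept a =", round(a, 4))
print("R-square    =", round(r_squared, 4))
```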
3.3.2. Using DATA → DATA-ANALYSIS (BBA3)
From the statistical point of view it is more interesting to use the Data-Analysis tools of Excel. In these tools, we activate REGRESSION and find:
and
We fill in the form as follows:
* Input Y Range: we select the data about C, together with the title or label;
* Input X Range: we select the data about A, again with the title or label;
* Labels: we tell Excel that the selection includes the labels or titles;
* Output Range: we select this option and then choose an empty cell.
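Coefficients table (first three columns):

             Coefficients   Standard Error    t Stat
Intercept         -0,4278           0,5351   -0,7994
A                  0,0526           0,0146    3,6130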
In the first column, we get the estimates for the parameters in the model C = a + bA. We find
\[ b = 0{,}0526, \qquad a = \text{intercept} = -0{,}4278, \qquad \hat{C} = -0{,}4278 + 0{,}0526\,A. \]
In the second column, we get the estimated standard deviations of these estimates:
\[ s_b = 0{,}0146, \qquad s_a = 0{,}5351. \]
The t-values of the estimates are given in the 3rd column:
\[ t_a = \frac{a}{s_a} = \frac{-0{,}4278}{0{,}5351} = -0{,}7994, \qquad t_b = \frac{b}{s_b} = \frac{0{,}0526}{0{,}0146} = 3{,}6130. \]
In the next column we get the prob-values (P-values) of the t-values. Note that the P-value is given by
\[ P\text{-value} = 2\,P\left(t_{n-k} > |t_a|\right). \]
The last 2 columns give 95% confidence intervals for the parameters.
In the first part of the output, we see:
Regression Statistics
Multiple R            0,462398368
R Square              0,213812251
Adjusted R Square     0,197433339
Standard Error        0,977352605
Observations          50
* as before, $R^2 = r^2(C, \hat{C})$
* Adjusted $R^2$: this is a slightly adjusted $R^2$-value
* Standard error $s(e)$: this is the square root of $s^2(e)$, where $s^2(e)$ is given by
\[ s^2(e) = \frac{1}{n-k} \sum_{i=1}^{n} e_i^2 \]
where $e_i = y_i - \hat{y}_i$ are the errors.
In the middle part, we have the ANOVA table (ANalysis Of VAriance):
ANOVA
               df         SS         MS          F   Significance F
Regression      1    12,4695    12,4695    13,0541      0,000722823
Residual       48    45,8505     0,9552
Total          49      58,32
The Total row is related to the variance of C:
\[ s^2(y) = s^2(C) = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{SST}{n-1}, \]
where SST = TOTAL sum of squares, and n − 1 is called the degrees of freedom. In our example, we have $s^2(C) = 58{,}32/49$. The other entries are
\[ s^2(e) = \frac{1}{n-k} \sum_{i=1}^{n} (e_i - \bar{e})^2 = \frac{1}{n-k} \sum_{i=1}^{n} e_i^2 = \frac{SSE}{n-k}, \]
where SSE = the residual sum of squares or the error sum of squares; and
\[ s^2(\hat{y}) = s^2(\hat{C}) = \frac{1}{k-1} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \frac{SSR}{k-1}. \]
Note that SST = SSR + SSE, and this will hold for all linear models with a constant term. The standard error of the first part of the output is just equal to s(e). The F-value is given by
\[ F = \frac{SSR/(k-1)}{SSE/(n-k)} = \frac{SSR/(k-1)}{(SST - SSR)/(n-k)} = \frac{R^2/(k-1)}{(1 - R^2)/(n-k)}. \]
This F-value is related to $R^2$: we find that $R^2$ is large if and only if the corresponding F-value is large. The advantage of F is that we can compare it with an F-distribution F(k−1, n−k). The prob-value of the F-value is given in the Significance F column. A small Significance F shows that the F-value is large, and hence also that the corresponding $R^2$-value is large.
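The sums of squares, the F-value and the standard errors of the estimates can be recomputed from the raw data. The Python sketch below is illustrative only: the formulas used for $s_b$ and $s_a$ are the standard textbook expressions for a regression with one explanatory variable (they are not written out in these notes), k = 2 is the number of parameters (a and b), and all variable names are chosen here.

```python
from math import sqrt

# Raw data: A = age, C = number of children (data lines 1 to 50).
A = [20, 37, 41, 31, 49, 22, 48, 49, 28, 45,
     38, 20, 24, 23, 47, 37, 48, 48, 48, 42,
     34, 22, 20, 23, 31, 36, 45, 47, 44, 35,
     30, 30, 29, 23, 22, 40, 33, 42, 37, 47,
     43, 43, 34, 28, 40, 46, 31, 23, 45, 27]
C = [0, 3, 1, 1, 2, 0, 1, 2, 0, 1,
     2, 2, 3, 0, 1, 1, 0, 2, 4, 1,
     2, 1, 0, 1, 1, 1, 3, 2, 2, 2,
     0, 0, 2, 0, 0, 3, 0, 2, 3, 3,
     2, 3, 2, 1, 2, 0, 2, 1, 3, 1]

n, k = len(A), 2          # k = number of estimated parameters (a and b)
mean_a = sum(A) / n
mean_c = sum(C) / n
sxx = sum((x - mean_a) ** 2 for x in A)
sxy = sum((x - mean_a) * (y - mean_c) for x, y in zip(A, C))

# Least-squares estimates and fitted values.
b = sxy / sxx
a = mean_c - b * mean_a
fitted = [a + b * x for x in A]

# ANOVA decomposition: SST = SSR + SSE.
sst = sum((y - mean_c) ** 2 for y in C)
ssr = sum((f - mean_c) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(C, fitted))

s_e = sqrt(sse / (n - k))                     # standard error s(e)
f_value = (ssr / (k - 1)) / (sse / (n - k))   # F-value

# Standard errors and t-values of the estimates (simple regression).
s_b = s_e / sqrt(sxx)
s_a = s_e * sqrt(1 / n + mean_a ** 2 / sxx)
t_b = b / s_b
t_a = a / s_a

print("SSR, SSE, SST:", round(ssr, 4), round(sse, 4), round(sst, 2))
print("s(e), F:", round(s_e, 4), round(f_value, 4))
print("s_a, s_b, t_a, t_b:",
      round(s_a, 4), round(s_b, 4), round(t_a, 4), round(t_b, 4))
```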