
Manobin Sharma

073/MSMS/859
Assignment 01 (Data Science)
Anscombe's Quartet:
Anscombe's quartet is a classic example of the drawback of reporting only the correlation. In his
1973 paper in The American Statistician, Francis Anscombe showed how four different pairs of
variables can yield the same correlation coefficient even though the relationship within each
pair is completely different. The quartet was constructed to demonstrate both the importance of
graphing data before analyzing it and the effect of outliers on statistical properties.

The quartet contains four datasets that have nearly identical simple descriptive statistics, yet
appear very different when graphed.

Anscombe described the article as being intended to counter the impression among statisticians
that "numerical calculations are exact, but graphs are rough."

Data Sets:

X1    Y1       X2    Y2       X3    Y3       X4    Y4
10    8.04     10    9.14     10    7.46     8     6.58
8     6.95     8     8.14     8     6.77     8     5.76
13    7.58     13    8.74     13    12.74    8     7.71
9     8.81     9     8.77     9     7.11     8     8.84
11    8.33     11    9.26     11    7.81     8     8.47
14    9.96     14    8.10     14    8.84     8     7.04
6     7.24     6     6.13     6     6.08     8     5.25
4     4.26     4     3.10     4     5.39     19    12.50
12    10.84    12    9.13     12    8.15     8     5.56
7     4.82     7     7.26     7     6.42     8     7.91
5     5.68     5     4.74     5     5.73     8     6.89
Average:
9     7.5009   9     7.5009   9     7.5000   9     7.5009
Mean of X: 9
Mean of Y: 7.50
Linear regression line: y = 3.00 + 0.500x
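The shared summary statistics can be checked directly. The following is a minimal Python sketch, using only the standard library, that recomputes the means, the least-squares line, and the Pearson correlation for each dataset in the table. The names `QUARTET` and `summarize` are illustrative and not part of the original assignment.

```python
from math import sqrt

# Values copied from the data table; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, least-squares slope/intercept, and Pearson r for one dataset."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / sqrt(sxx * syy),
    }


for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"y={s['intercept']:.2f}+{s['slope']:.3f}x "
          f"r={s['r']:.3f} r2={s['r'] ** 2:.4f}")
```

Each row should come out close to a mean of 9 for x, a mean of 7.50 for y, and the line y = 3.00 + 0.500x, which is the sense in which the four datasets are numerically "the same".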
[Figure: X1 vs Y1 scatter plot with linear trendline; y = 0.5001x + 3.0001, R² = 0.6665]

The first scatter plot appears to show a simple linear relationship: the two variables are
correlated, and the data are consistent with the assumption of normality.

[Figure: X2 vs Y2 scatter plot with linear trendline; y = 0.5x + 3.0009, R² = 0.6662]

In the second graph, the data are not normally distributed. A relationship between the two
variables is obvious, but it is not linear, so the Pearson correlation coefficient is not a
meaningful summary. A more general regression, with its corresponding coefficient of
determination, would be more appropriate.
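One way to check the curvature numerically is to compare a straight-line fit with a quadratic fit on the second dataset. The sketch below is an illustration only: choosing a quadratic as the "more general regression" is an assumption, and `fit_polynomial` and `r_squared` are hypothetical helper names written here with the standard library.

```python
# Dataset II, copied from the data table.
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c such that y ~ c[0] + c[1]*x + c[2]*x**2 + ...
    """
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [
        [sum(xi ** (i + j) for xi in x) for j in range(m)]
        + [sum(yi * xi ** i for xi, yi in zip(x, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]


def r_squared(x, y, coeffs):
    """Coefficient of determination for a fitted polynomial."""
    mean_y = sum(y) / len(y)
    predicted = [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


linear = fit_polynomial(x2, y2, 1)
quadratic = fit_polynomial(x2, y2, 2)
print("linear    R^2:", round(r_squared(x2, y2, linear), 4))
print("quadratic R^2:", round(r_squared(x2, y2, quadratic), 4))
```

If the quadratic explains nearly all of the variance while the straight line explains only about two thirds of it, that supports reading the second plot as a curved rather than a linear relationship.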
[Figure: X3 vs Y3 scatter plot with linear trendline; y = 0.4997x + 3.0025, R² = 0.6663]

In the third graph, the relationship is linear, but the data call for a different regression line
(a robust regression would have been appropriate). The calculated regression is offset by a
single outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
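The influence of the single outlier can be illustrated by refitting the third dataset without it. This is a minimal sketch and not a robust regression: it simply drops the point with the largest residual from the ordinary least-squares line and fits again. The helper names `fit_line` and `pearson_r` are illustrative.

```python
from math import sqrt

# Dataset III, copied from the data table.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(x, y):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Locate the point that lies farthest from the full-data regression line.
slope, intercept = fit_line(x3, y3)
residuals = [abs(yi - (intercept + slope * xi)) for xi, yi in zip(x3, y3)]
outlier = residuals.index(max(residuals))

# Refit with that single point removed.
x_clean = x3[:outlier] + x3[outlier + 1:]
y_clean = y3[:outlier] + y3[outlier + 1:]

print("outlier:", (x3[outlier], y3[outlier]))
print("r with outlier:   ", round(pearson_r(x3, y3), 3))
print("r without outlier:", round(pearson_r(x_clean, y_clean), 3))
print("refit (slope, intercept):", fit_line(x_clean, y_clean))
```

Dropping a point by hand is only a diagnostic. A robust regression method would down-weight the outlier instead of requiring it to be identified and removed first.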

[Figure: X4 vs Y4 scatter plot with linear trendline; y = 0.4999x + 3.0017, R² = 0.6667]

Finally, the fourth graph shows a case in which one outlier is enough to produce a high
correlation coefficient, even though the other data points do not indicate any relationship
between the variables.
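A similar check on the fourth dataset shows why the remaining points carry no information about a relationship: once the point at x = 19 is removed, every x value equals 8, so x has zero variance and the correlation coefficient is undefined. A minimal sketch follows; the function name `pearson_r` and its `None` return value for the undefined case are choices made for this illustration.

```python
from math import sqrt

# Dataset IV, copied from the data table.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(x, y):
    """Pearson correlation, or None when either variable has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    return sxy / sqrt(sxx * syy)


# Keep only the points that are not at x = 19.
rest = [(xi, yi) for xi, yi in zip(x4, y4) if xi != 19]
x_rest = [p[0] for p in rest]
y_rest = [p[1] for p in rest]

print("r with all 11 points:", round(pearson_r(x4, y4), 3))
print("r without x = 19:    ", pearson_r(x_rest, y_rest))
```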

It is not known how Anscombe created his datasets. Since the quartet's publication, several methods
have been developed to generate similar datasets with identical statistics and dissimilar graphs.
