
A CONCISE OVERVIEW OF BASIC STATISTICS



The goal of descriptive statistics is to describe a population (a set of individuals
or entities), or some characteristics of this population, by collecting and studying data
concerning all of the set's elements, or the elements of a certain subset of the population
(to which we will refer as a sample).

The characteristic of the population that we want to study is called the variable. For
example: the number of items sold at each one of the 12 fine art auctions in Santa
Monica, one for every month in 2014. Here the population is the set of all these Santa
Monica fine art auctions in 2014; the variable is the number of items that were sold.


1. Measures of location (or: of place): central tendencies of a set of data

a. Mean, Median, Mode (the three Ms)
For a list of data [x_1, x_2, x_3, ..., x_N] (listing the values measured for each of the
individuals or entities in the population or the sample of size N) the (arithmetic)
mean or average is given by the formula $\frac{\sum_{i=1}^{N} x_i}{N}$. In case of a
population the mean usually is denoted by the Greek letter $\mu$. In case of a sample one
usually writes $\bar{X}$ for the mean.


Example: These are the numbers of items sold at the Santa Monica fine art
auctions in 2014: [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]. This is a
population: we are given the data for all the auctions in 2014.
The mean of these data is:

$\bar{X} = \frac{14 + 22 + 16 + 11 + 23 + 16 + 18 + 21 + 14 + 19 + 9 + 26}{12} = \frac{209}{12} = 17{,}42$.


The median is the midpoint of the values, after these have been put in increasing
(or decreasing) order.

Example (continued): Order the list of values. We get
9, 11, 14, 14, 16, 16, 18, 19, 21, 22, 23, 26 .
In case of an odd number of data, the median will be the central number in
the ordered sequence. Because here the number of data is even, there is no single
central number. We then take the average of the two most central ones. In this
case, these are 16 and 18. So the median of our data set is 17.




The mode or modal value is the value that appears most frequently in a set of
data.



Example (continued): Unlike the mean and the median, the mode does not have
to be unique. In our example both 14 and 16 occur twice; the other values
appear only once. In such cases one sometimes speaks of a bimodal data set.
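
As a quick check of the three worked examples above, the following minimal Python sketch recomputes the mean, the median and the modes with the standard library `statistics` module. The variable name `items_sold` is an illustrative choice and not part of the original text; `multimode` assumes Python 3.8 or later.

```python
import statistics

# Items sold at the 12 Santa Monica fine art auctions in 2014 (from the example).
items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

mean = statistics.mean(items_sold)        # arithmetic mean: sum of the values / N
median = statistics.median(items_sold)    # average of the two central values when N is even
modes = statistics.multimode(items_sold)  # every value tied for the highest count

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"modes  = {modes}")
```

For these data the sketch prints 17.42, 17.0 and [14, 16]. Note that Python writes a decimal point where this text uses a decimal comma.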




b. Quartiles, deciles, percentiles
The median is the measure of location that identifies the center of a collection of
observations. It divides the ordered sequence of our data into two equal parts. In
a similar manner we can divide the ordered (from small to large) sequence of
data into four, ten or a hundred equal parts.

Quartiles divide the sequence of data into four equal parts. The first or lower
quartile, Q1, is the value below which 25% of our data occur. The second quartile,
Q2, is the value below which 50% of our data occur; it is equal to the median. The
third or upper quartile, Q3, is the value below which 75% of our data occur.

Deciles divide the sequence of data into ten equal parts. The first decile, D1, is
the value below which 10% of our data occur. The fifth decile, D5, is equal to the
median. The ninth decile, D9, is the value below which 90% of our data occur.

Percentiles divide the sequence of data into a hundred equal parts. The fiftieth
percentile, P50, is equal to the median. To find the location of a given percentile,
we use the formula $L_p = (N + 1) \cdot \frac{p}{100}$, where N is the size of our
sequence, and p the desired percentile.


Example (continued): The location of the median in a sequence of 12 values is
equal to that of the 50th percentile, which is $L_{50} = 13 \cdot \frac{50}{100} = 6{,}5$.
This means that the median is halfway between the 6th and the 7th value in the sequence.

The location of the first quartile is that of the 25th percentile:
$L_{25} = 13 \cdot \frac{25}{100} = 3{,}25$. We find the value of Q1 between the 3rd and
the 4th value. In our sequence of data that is between 14 and 14. Therefore Q1 = 14.


The location of the third quartile is that of the 75th percentile:
$L_{75} = 13 \cdot \frac{75}{100} = 9{,}75$. We find the value of Q3 between the 9th and
the 10th value. In our sequence of data that is between 21 and 22. We use linear
interpolation to find the precise value: $Q_3 = 21 + 0{,}75 \cdot 1 = 21{,}75$.
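
The location rule and the linear interpolation used in these examples can be sketched in Python as follows. The function name `percentile` is hypothetical, and clamping the location to the range 1..N for very small or very large p is an assumption of this sketch; the text does not discuss that case.

```python
def percentile(values, p):
    """Estimate the p-th percentile using the location L_p = (N + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    # Location in the ordered sequence, counted from 1.
    # Assumption: locations outside 1..N are clamped to the nearest end.
    location = min(max((n + 1) * p / 100, 1), n)
    lower = int(location)        # position of the value just below the location
    fraction = location - lower  # how far the location lies towards the next value
    if lower == n:               # location is the last value: nothing to interpolate
        return float(ordered[-1])
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]
q1 = percentile(items_sold, 25)
median = percentile(items_sold, 50)
q3 = percentile(items_sold, 75)
print(q1, median, q3)
```

For the auction data this gives 14.0, 17.0 and 21.75, matching the values found by hand.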





2. Measures of dispersion
a. Range



The range of a collection of data is the difference between the greatest and the
least value.

Example (continued): The range of the data in our example is 26 − 9 = 17.



b. Variance and standard deviation
Measures of dispersion indicate the degree to which numerical data are spread
out. Variance and standard deviation are based on the squares of the
differences between each of the values and the data's arithmetic mean.
There is a subtle but nevertheless important difference between the formulas
used to calculate the values of these measures in the case of a population and in
the case of a sample.

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; standard deviation: $\sigma = \sqrt{\sigma^2}$.

Sample variance: $S^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}$; standard deviation: $S = \sqrt{S^2}$.


It is sometimes useful to express the standard deviation relative to the mean,
as the ratio of the standard deviation to the mean. This is called the coefficient
of variation (cv) (also named variation coefficient or relative standard deviation, rsd).
In case of a population: $cv = \frac{\sigma}{\mu}$; in case of a sample: $cv = \frac{S}{\bar{X}}$.


Example (continued): In our example the data are those of a population. The
population variance is $\sigma^2 = \frac{\sum_{i=1}^{12} (x_i - 17{,}42)^2}{12} = 23{,}41$.
The population standard deviation is $\sigma = \sqrt{23{,}41} = 4{,}84$. The variation
coefficient is $\frac{4{,}84}{17{,}42} = 0{,}28$.
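
The difference between the population and the sample formulas can be made concrete with a short Python sketch based on the standard library `statistics` module. The variable names are illustrative assumptions. The sample figures are shown only for comparison, since the auction data form a population.

```python
import statistics

items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

mean = statistics.mean(items_sold)

# Population formulas: divide the sum of squared differences by N.
pop_variance = statistics.pvariance(items_sold)
pop_std = statistics.pstdev(items_sold)

# Sample formulas: divide the sum of squared differences by N - 1.
sample_variance = statistics.variance(items_sold)
sample_std = statistics.stdev(items_sold)

# Coefficient of variation: standard deviation divided by the mean.
cv = pop_std / mean

print(f"population: variance = {pop_variance:.2f}, std = {pop_std:.2f}, cv = {cv:.2f}")
print(f"sample:     variance = {sample_variance:.2f}, std = {sample_std:.2f}")
```

The coefficient of variation is computed here as the standard deviation divided by the mean, following the definition above; for these data it comes out at about 0.28. The sample variance is somewhat larger than the population variance because of the division by N − 1 instead of N.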


c. Interquartile & interdecile range
The interquartile range (also called midspread or middle fifty) is the
difference between the upper and the lower quartile: IQR = Q3 − Q1. It contains
the most centrally placed 50% of our data. (The smaller the IQR, the smaller our
data's dispersion.)

Similarly, the interdecile range contains the most centrally placed 80% of our
data: IDR = D9 − D1. It is the width of the interval that contains the most
central 80% of our data.
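
A minimal Python sketch of the range, the interquartile range and the interdecile range for the auction data is given below. It assumes the percentile location rule L_p = (N + 1) p / 100 with linear interpolation from section 1b; the helper name `percentile` and the clamping of locations to 1..N are assumptions of the sketch.

```python
def percentile(values, p):
    """p-th percentile via the location (N + 1) * p / 100 and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = min(max((n + 1) * p / 100, 1), n)  # assumption: clamp to 1..N
    lower = int(location)
    if lower == n:
        return float(ordered[-1])
    step = ordered[lower] - ordered[lower - 1]
    return ordered[lower - 1] + (location - lower) * step


items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

data_range = max(items_sold) - min(items_sold)                  # greatest minus least value
iqr = percentile(items_sold, 75) - percentile(items_sold, 25)   # Q3 - Q1
idr = percentile(items_sold, 90) - percentile(items_sold, 10)   # D9 - D1

print(f"range = {data_range}, IQR = {iqr:.2f}, IDR = {idr:.2f}")
```

With Q1 = 14 and Q3 = 21,75 as found earlier, the interquartile range of the auction data is 7,75.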





A graphical representation of the list of data, their distribution and their
tendencies in a boxplot or box and whisker diagram:




3. Frequency distributions

A statistical analysis of large sets of data will in general start by organizing the observed
data in a certain number of classes or intervals, in many cases (but not always) of a
constant, fixed, width. This is called a frequency distribution. It counts how many of our
data fall within a certain class.

Example: The following table counts the number of documented art works depicting
Venus or Aphrodite, created in France within half-century periods from 1500 to
2000. (source: K. Bender)

Period                            1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949  1950-1999
Midpoint                          1525       1575       1625       1675       1725       1775       1825       1875       1925       1975
Frequency                         9          60         137        210        483        785        383        327        286        162
Frequency percentage              0,32%      2,11%      4,82%      7,39%      17,00%     27,62%     13,48%     11,51%     10,06%     5,70%
Cumulative frequency percentage   0,32%      2,43%      7,25%      14,64%     31,63%     59,25%     72,73%     84,24%     94,30%     100%

(You can find the data and the calculations on the Excel worksheet ExcelWorksheet_1)
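
For readers who prefer to check the table outside Excel, here is a minimal Python sketch that derives the frequency percentages and the cumulative frequency percentages from the class counts. The variable names are illustrative assumptions. The count of 9 for the first class is taken from the worked mean calculation later in this section.

```python
from itertools import accumulate

periods = ["1500-1549", "1550-1599", "1600-1649", "1650-1699", "1700-1749",
           "1750-1799", "1800-1849", "1850-1899", "1900-1949", "1950-1999"]
frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

total = sum(frequencies)

# Share of each class in the total, as a percentage.
percentages = [100 * f / total for f in frequencies]

# Running totals of the counts, again expressed as percentages of the total.
cumulative = [100 * c / total for c in accumulate(frequencies)]

for period, f, pct, cum in zip(periods, frequencies, percentages, cumulative):
    print(f"{period}  {f:4d}  {pct:6.2f}%  {cum:6.2f}%")
```

Computing the cumulative percentages from running totals of the counts, rather than by adding up rounded percentages, avoids accumulating rounding errors.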

A frequency distribution of a data set is usually visualized by means of a so-called
histogram, which can easily be generated e.g. in Excel from the listing of the classes (the
bins) and the corresponding frequencies of occurrences of values within each of these
bins. Here is the histogram visualization of the temporal distribution of French Venus
art works, as generated in Excel:



Given a frequency distribution, the only possible estimation of the range of the data
obtained for our population or sample is the difference between the lowest and the
highest possible value.
We can estimate the mean of the data by using the center (midpoint) of a class as its
value, and then using the frequencies of each of the classes as weights in a weighted
average of these midpoints: $\bar{X} = \frac{\sum_{i=1}^{c} f_i m_i}{\sum_{i=1}^{c} f_i}$.
Here c is the number of classes; f_i is the frequency of class i; m_i is the center
(midpoint) of class i.



Example (continued): In the distribution of the number of French Venus art works there
are 10 classes, with midpoints 1525, 1575, 1625, 1675, 1725, 1775, 1825, 1875, 1925, and
1975. If we are specifically interested in this estimate of the mean as an estimate of
the mean age of the documented French Venus art works, we can use age midpoints,
counting back from 2000. The midpoint 1525 will correspond to an age of 475 years,
1575 to an age of 425 years, et cetera. Then the estimated mean age will be:

$\bar{X}_{france} = \frac{9 \cdot 475 + 60 \cdot 425 + \dots + 162 \cdot 25}{2842} = \frac{592250}{2842} = 208{,}39$.
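
The weighted average of the age midpoints can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative assumptions, and the reference year 2000 is the one used in the text.

```python
midpoint_years = [1525, 1575, 1625, 1675, 1725, 1775, 1825, 1875, 1925, 1975]
frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

# Age of each class midpoint, counting back from the year 2000.
ages = [2000 - year for year in midpoint_years]

# Weighted average: each age midpoint is weighted by its class frequency.
weighted_sum = sum(f * age for f, age in zip(frequencies, ages))
total = sum(frequencies)
mean_age = weighted_sum / total

print(f"{weighted_sum} / {total} = {mean_age:.2f} years")
```

The sketch prints 592250 / 2842 = 208.39 years, in agreement with the calculation by hand.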


We use linear interpolation to determine estimates of other measures of location, like
the median and the quartiles.

Example (continued): In the frequency distribution we learn from the cumulative
frequency percentages that 31,63% of the French Venus art works date from before
1750, and that 59,25% date from before 1800. The median Me (the date such that 50%
of the documented works were created before it) therefore must be somewhere between
1750 and 1800. Applying linear interpolation we find as an estimate of this median:

$\frac{Me - 1750}{50} = \frac{50 - 31{,}63}{59{,}25 - 31{,}63} \Rightarrow Me = 1750 + 50 \cdot \frac{18{,}37}{27{,}62} \approx 1783{,}3$.
I.e. the median age will be 2000 − 1783,3 = 216,7 years.



Similarly, we can determine the first and the third quartiles, Q1 and Q3:

$\frac{Q_1 - 1700}{50} = \frac{25 - 14{,}64}{31{,}63 - 14{,}64} \Rightarrow Q_1 = 1700 + 50 \cdot \frac{10{,}36}{16{,}99} \approx 1730{,}5$.
I.e., we estimate that one quarter of the French Venus art works is more than about
270 years old.

$\frac{Q_3 - 1800}{50} = \frac{75 - 72{,}73}{84{,}24 - 72{,}73} \Rightarrow Q_3 = 1800 + 50 \cdot \frac{2{,}27}{11{,}51} \approx 1809{,}9$.
I.e., we also estimate that about a quarter of the French Venus art works is less than
190 years old.
Finally, we estimate that half of all the documented French Venus art works came into
being between 1730 and 1810 (the interquartile range, or IQR, is about 79 years).
As before, we may visualize this descriptive analysis of our data in a box-plot:

( There are many tools that help you quickly generate such boxplot images, for example, online at
http://www.imathas.com/stattools/boxplot.html )



As for the mean, we also use the midpoints of the classes to estimate the variance and
the standard deviation:

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{c} f_i (m_i - \bar{X})^2}{\sum_{i=1}^{c} f_i}$; standard deviation: $\sigma = \sqrt{\sigma^2}$.

Sample variance: $S^2 = \frac{\sum_{i=1}^{c} f_i (m_i - \bar{X})^2}{\left(\sum_{i=1}^{c} f_i\right) - 1}$; standard deviation: $S = \sqrt{S^2}$.



Example (continued): The frequency distribution was obtained from population data.
Therefore the estimated variance is
$\frac{\sum_{i=1}^{10} f_i (m_i - 208{,}39)^2}{2842} = 9046{,}83$; the estimated
standard deviation is $\sqrt{9046{,}83} = 95{,}11$ years.


The estimated coefficient of variation is $\frac{95{,}11}{208{,}39} \approx 0{,}46$, i.e. 46%:
the standard deviation of the ages of the French Venus art works amounts to 46% of the
estimated mean age of 208,4 years.
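
The grouped-data estimates of variance, standard deviation and coefficient of variation can be checked with the following minimal Python sketch. The variable names are illustrative assumptions; the age midpoints count back from the year 2000, as in the text. The sample variance is included only to show the changed denominator.

```python
import math

ages = [475, 425, 375, 325, 275, 225, 175, 125, 75, 25]   # age midpoints of the classes
frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

total = sum(frequencies)
mean_age = sum(f * a for f, a in zip(frequencies, ages)) / total

# Frequency-weighted sum of squared differences from the estimated mean.
squared_sum = sum(f * (a - mean_age) ** 2 for f, a in zip(frequencies, ages))

pop_variance = squared_sum / total           # population formula
sample_variance = squared_sum / (total - 1)  # sample formula, for comparison
pop_std = math.sqrt(pop_variance)
cv = pop_std / mean_age                      # standard deviation relative to the mean

print(f"variance = {pop_variance:.2f}, std = {pop_std:.2f} years, cv = {cv:.2f}")
```

With 2842 observations the population and the sample variance differ only marginally, because dividing by 2841 instead of 2842 changes the result by less than 0,04%.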



Exercise: On the Excel worksheet ExcelWorksheet_1 you will find the data that were used
above as well as all the calculations, detailed in Excel. The worksheet also contains similar data
sets (source: K. Bender) for the Venus iconography between 1500 and 2000 in Italy (it), in the
Low Countries (lc) and in Germany, Switzerland and Central European countries (gsce). In order
to train yourself in the use of Excel to quickly perform a basic descriptive statistical analysis of a
set of data:

a. Perform a descriptive analysis similar to the one given in this note, for each of these
three supplementary data sets. Summarize your findings in a short text report; include a
histogram and a boxplot.
b. Determine the total distribution of Venus art works in Europe between 1500 and 2000
(see the rightmost columns on the worksheet). Again perform the descriptive analysis
for this full distribution, and summarize your findings in a short text report with
histogram and boxplot.
c. Compare and interpret the results.
