\[ \text{mean} = \frac{\sum_{i=1}^{N} x_i}{N} \]
The population mean is usually denoted by the Greek letter $\mu$. In the case of a sample one usually writes $\bar{X}$ for the mean.
Example: These are the numbers of items sold at the Santa Monica fine art auctions in 2014: [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]. This is a population: we are given the data for all the auctions in 2014. The mean of these data is:
\[ \bar{X} = \frac{14 + 22 + 16 + 11 + 23 + 16 + 18 + 21 + 14 + 19 + 9 + 26}{12} = \frac{209}{12} \approx 17{,}42 \]
The median is the midpoint of the values, after these have been put in increasing (or decreasing) order.
Example (continued): Order the list of values. We get 9, 11, 14, 14, 16, 16, 18, 19, 21, 22, 23, 26. In the case of an odd number of data, the median is the central number in the ordered sequence. Because here the number of data is even, there is no single central number. We then take the average of the two most central ones. In this case, these are 16 and 18. So the median of our data set is 17.
The mode or modal value is the value that appears most frequently in a set of data.
Example (continued): Unlike the mean and the median, the mode does not have to be unique. In our example both 14 and 16 occur twice; the other values appear only once. In such cases one sometimes speaks of a bimodal data set.
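The three measures of central tendency described above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable name `sold` is chosen here for illustration and does not come from the notes.

```python
import statistics

# Items sold at the 2014 auctions (the data of the running example).
sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

mean = statistics.mean(sold)        # sum of the values divided by their count
median = statistics.median(sold)    # average of the two central values when the count is even
modes = statistics.multimode(sold)  # every value tied for the highest frequency

print(round(mean, 2), median, modes)  # 17.42 17.0 [14, 16]
```

`multimode` is used rather than `mode` because the example data set is bimodal and both modal values should be reported.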
b. Quartiles, deciles, percentiles
The median is the measure of location that identifies the center of a collection of observations. It divides the ordered sequence of our data into two equal parts. In a similar manner we can divide the ordered (from small to large) sequence of data into four, ten or a hundred equal parts.
Quartiles divide the sequence of data into four equal parts. The first or lower quartile, Q1, is the value below which 25% of our data occur. The second quartile, Q2, is the value below which 50% of our data occur; it is equal to the median. The third or upper quartile, Q3, is the value below which 75% of our data occur.
Deciles divide the sequence of data into ten equal parts. The first decile, D1, is the value below which 10% of our data occur. The fifth decile, D5, is equal to the median. The ninth decile, D9, is the value below which 90% of our data occur.
Percentiles divide the sequence of data into a hundred equal parts. The fiftieth percentile, P50, is equal to the median.
To find the location of a given percentile, we use the formula
\[ L_p = (N + 1) \cdot \frac{p}{100} \]
where $N$ is the size of our sequence, and $p$ the desired percentile.
Example (continued): The location of the median in a sequence of 12 values is equal to that of the 50th percentile, which is $L_{50} = 13 \cdot \frac{50}{100} = 6{,}5$. This means that the median is halfway between the 6th and the 7th value in the sequence.
The location of the first quartile is that of the 25th percentile: $L_{25} = 13 \cdot \frac{25}{100} = 3{,}25$. We find the value of Q1 between the 3rd and the 4th value. In our sequence of data that is between 14 and 14. Therefore Q1 = 14.
The location of the third quartile is that of the 75th percentile: $L_{75} = 13 \cdot \frac{75}{100} = 9{,}75$. We find the value of Q3 between the 9th and the 10th value. In our sequence of data that is between 21 and 22. We use linear interpolation to find the precise value: $Q_3 = 21 + 0{,}75 \cdot (22 - 21) = 21{,}75$.
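The location formula and the interpolation step can be sketched as a small Python function. The function name `percentile` is hypothetical. Returning the first or last value when the location falls outside the sequence is an assumption of this sketch; the notes do not cover that case.

```python
def percentile(values, p):
    """Value at the p-th percentile, located at L_p = (N + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100      # 1-based position, possibly fractional
    lower = int(location)             # position of the value just below the location
    if lower < 1:                     # assumption: clamp to the smallest value
        return float(ordered[0])
    if lower >= n:                    # assumption: clamp to the largest value
        return float(ordered[-1])
    below, above = ordered[lower - 1], ordered[lower]
    # Linear interpolation between the two neighbouring values.
    return below + (location - lower) * (above - below)

sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]
q1, med, q3 = (percentile(sold, p) for p in (25, 50, 75))
print(q1, med, q3)  # 14.0 17.0 21.75
```

Other conventions for locating percentiles exist and can give slightly different values for small data sets; the sketch follows the (N + 1) rule used in these notes.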
2. Measures of dispersion
a. Range
The range of a collection of data is the difference between the greatest and the least value.
Example (continued): The range of the data in our example is 26 - 9 = 17.
b. Variance and standard deviation
Measures of dispersion indicate the degree to which numerical data are spread out. Variance and standard deviation are based on the squares of the differences between each of the values and the data's arithmetic mean. There is a subtle but nevertheless important difference between the formulas used to calculate these measures in the case of a population and in the case of a sample.
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; standard deviation: $\sigma = \sqrt{\sigma^2}$.
Sample variance: $S^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}$; standard deviation: $S = \sqrt{S^2}$.
It is sometimes useful to express the standard deviation relative to the mean, as the ratio of the standard deviation to the mean. This is called the coefficient of variation (cv) (also named variation coefficient or relative standard deviation, rsd). In the case of a population: $cv = \frac{\sigma}{\mu}$; in the case of a sample: $cv = \frac{S}{\bar{X}}$.
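Example (continued): In our example the data are those of a population. The population variance is
\[ \sigma^2 = \frac{\sum_{i=1}^{12} (x_i - 17{,}42)^2}{12} \approx 23{,}41 \]
so that the standard deviation is $\sigma = \sqrt{23{,}41} \approx 4{,}84$ and the coefficient of variation is $cv = \frac{4{,}84}{17{,}42} \approx 0{,}28$. (Dividing the variance rather than the standard deviation by the mean would give $\frac{23{,}41}{17{,}42} \approx 1{,}34$, which is not the coefficient of variation as defined above.)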
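The range, the population variance, the standard deviation and the coefficient of variation of the example can be recomputed as follows. This is a minimal sketch with the standard `statistics` module, taking the coefficient of variation as the standard deviation divided by the mean, as defined above; the variable names are illustrative.

```python
import statistics

sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

data_range = max(sold) - min(sold)        # greatest minus least value
variance = statistics.pvariance(sold)     # population variance: divide by N
std_dev = statistics.pstdev(sold)         # square root of the population variance
cv = std_dev / statistics.mean(sold)      # standard deviation relative to the mean

print(data_range, round(variance, 2), round(std_dev, 2), round(cv, 2))
# 17 23.41 4.84 0.28
```

For sample data, `statistics.variance` and `statistics.stdev` divide by N - 1 instead of N.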
c. Interquartile & interdecile range
The interquartile range (also called midspread or middle fifty) is the difference between the upper and the lower quartile: IQR = Q3 - Q1. It contains the most centrally placed 50% of our data. (The smaller the IQR, the smaller the dispersion of our data.)
Similarly, the interdecile range contains the most centrally placed 80% of our data: IDR = D9 - D1. It is the width of the interval that contains the central 80% of our data.
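For the auction data of the running example, both ranges can be obtained from the percentile locations $L_p = (N + 1) \cdot p / 100$ introduced earlier. The sketch below assumes that rule with linear interpolation; the helper name `percentile` and the clamping at the ends of the sequence are assumptions made here for illustration.

```python
def percentile(values, p):
    """Value at the p-th percentile, located at L_p = (N + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    lower = int(location)
    if lower < 1:                     # assumption: clamp to the smallest value
        return float(ordered[0])
    if lower >= n:                    # assumption: clamp to the largest value
        return float(ordered[-1])
    below, above = ordered[lower - 1], ordered[lower]
    return below + (location - lower) * (above - below)

sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

iqr = percentile(sold, 75) - percentile(sold, 25)   # Q3 - Q1
idr = percentile(sold, 90) - percentile(sold, 10)   # D9 - D1

print(round(iqr, 2), round(idr, 2))  # 7.75 15.5
```

Here D1 lies at location 1,3 (between 9 and 11) and D9 at location 11,7 (between 23 and 26), which gives D1 = 9,6 and D9 = 25,1.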
The list of data, their distribution and their tendencies can be represented graphically in a boxplot or box-and-whisker diagram:
[Figure: boxplot]
3. Frequency distributions
A statistical analysis of large sets of data will in general start by organizing the observed data in a certain number of classes or intervals, in many cases (but not always) of a constant, fixed width. This is called a frequency distribution. It counts how many of our data fall within each class.
Example: The following table counts the number of documented art works depicting Venus or Aphrodite, created in France within time periods of half a century, from 1500 to 2000. (source: K. Bender)
Period      Midpoint   Frequency   Frequency percentage   Cumulative frequency percentage
1500-1549   1525               9                  0,32%                             0,32%
1550-1599   1575              60                  2,11%                             2,43%
1600-1649   1625             137                  4,82%                             7,25%
1650-1699   1675             210                  7,39%                            14,64%
1700-1749   1725             483                    17%                            31,63%
1750-1799   1775             785                 27,62%                            59,25%
1800-1849   1825             383                 13,48%                            72,73%
1850-1899   1875             327                 11,51%                            84,24%
1900-1949   1925             286                 10,06%                             94,3%
1950-1999   1975             162                   5,7%                              100%
(You can find the data and the calculations on the Excel worksheet ExcelWorksheet_1.)
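The percentage and cumulative percentage rows of the table can be reproduced from the frequencies alone. The sketch below is in Python rather than Excel. It takes the frequency of the first class as 9, the value implied by its 0,32% share and by the total of 2842 works used later in these notes; that value is an assumption of this sketch.

```python
# Half-century classes 1500-1549, 1550-1599, ..., 1950-1999.
periods = [(start, start + 49) for start in range(1500, 2000, 50)]
# Frequencies per class; the first entry (9) is an assumption inferred from the percentages.
freqs = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

total = sum(freqs)
percentages = [100 * f / total for f in freqs]

cumulative = []
running = 0
for f in freqs:
    running += f                      # number of works up to and including this class
    cumulative.append(100 * running / total)

for (first, last), f, pct, cum in zip(periods, freqs, percentages, cumulative):
    print(f"{first}-{last}  {f:4d}  {pct:6.2f}%  {cum:6.2f}%")
```

Each cumulative percentage is derived from the running count rather than from summing rounded percentages, which avoids accumulating rounding errors.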
A frequency distribution of a data set is usually visualized by means of a so-called histogram, which can easily be generated e.g. in Excel from the listing of the classes (the bins) and the corresponding frequencies of occurrences of values within each of these bins. Here is the histogram visualization of the temporal distribution of French Venus art works, as generated in Excel:
[Figure: histogram of the temporal distribution of French Venus art works]
Given only a frequency distribution, the range of the data obtained for our population or sample can only be estimated as the difference between the highest and the lowest possible value, i.e. between the upper bound of the last class and the lower bound of the first class.
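We can estimate the mean of the data by using the center (midpoint) of a class as its value, and then using the frequencies of each of the classes as weights in a weighted mean:
\[ \bar{X} = \frac{\sum_{i=1}^{c} f_i m_i}{\sum_{i=1}^{c} f_i} \]
where $c$ is the number of classes, $f_i$ the frequency and $m_i$ the midpoint of class $i$. For the French data, with the midpoints expressed as ages counted back from 2000, this gives a mean age of $\bar{X}_{\text{france}} \approx 208{,}39$ years, i.e. a mean year of creation of about 1791,6.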
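The weighted mean over class midpoints can be sketched in Python as follows. The frequencies are those of the French table, with the first class taken as 9 (an assumption inferred from its 0,32% share of 2842 works). Ages are counted back from the year 2000, as in the age calculations elsewhere in these notes.

```python
# Class midpoints 1525, 1575, ..., 1975 and their frequencies.
midpoints = [1525 + 50 * i for i in range(10)]
freqs = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]  # first entry is an assumption

# Weighted mean: each midpoint counts as often as its class frequency.
mean_year = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
mean_age = 2000 - mean_year

print(round(mean_year, 2), round(mean_age, 2))  # 1791.61 208.39
```

The result is an estimate only: it treats every work in a class as if it were created at the class midpoint.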
We use linear interpolation to determine estimations of other measures of location, like the median and the quartiles.
Example (continued): In the frequency distribution we learn from the row of cumulative frequency percentages that 31,63% of the French Venus art works date from before 1750, and that 59,25% date from before 1800. The median Me (the date such that 50% of the documented works were created before it) therefore must be somewhere between 1750 and 1800. Applying linear interpolation we find as an estimation of this median:
\[ \frac{Me - 1750}{50} = \frac{50 - 31{,}63}{59{,}25 - 31{,}63} = \frac{18{,}37}{27{,}62} \;\Rightarrow\; Me = 1750 + 50 \cdot \frac{18{,}37}{27{,}62} \approx 1783{,}3 \]
I.e. the median age will be 2000 - 1783,3 = 216,7 years.
Similarly, we can determine the first and the third quartiles, Q1 and Q3:
\[ \frac{Q_1 - 1700}{50} = \frac{25 - 14{,}64}{31{,}63 - 14{,}64} = \frac{10{,}36}{16{,}99} \;\Rightarrow\; Q_1 = 1700 + 50 \cdot \frac{10{,}36}{16{,}99} \approx 1730{,}5 \]
I.e., we estimate that one quarter of the French Venus art works is more than 269 years old.
\[ \frac{Q_3 - 1800}{50} = \frac{75 - 72{,}73}{84{,}24 - 72{,}73} = \frac{2{,}27}{11{,}51} \;\Rightarrow\; Q_3 = 1800 + 50 \cdot \frac{2{,}27}{11{,}51} \approx 1809{,}9 \]
I.e., we also estimate that about a quarter of the French Venus art works is less than 190 years old.
Finally, we estimate that half of all the documented French Venus art works came into being between about 1730 and 1810 (the interquartile range, or IQR, is about 79 years).
As before, we may visualize this descriptive analysis of our data in a boxplot. (There are many tools that help you quickly generate such boxplot images, for example, online at http://www.imathas.com/stattools/boxplot.html)
As for the mean, we also use the midpoints of the classes to estimate the variance and the standard deviation:
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{c} f_i (m_i - \bar{X})^2}{\sum_{i=1}^{c} f_i}$; standard deviation: $\sigma = \sqrt{\sigma^2}$.
Sample variance: $S^2 = \frac{\sum_{i=1}^{c} f_i (m_i - \bar{X})^2}{\left( \sum_{i=1}^{c} f_i \right) - 1}$; standard deviation: $S = \sqrt{S^2}$.
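Example (continued): The frequency distribution was obtained from population data. With the class midpoints $m_i$ expressed as ages (counted back from 2000) we find
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{10} f_i (m_i - 208{,}39)^2}{2842}} \approx 95{,}11 \]
and therefore $cv = \frac{95{,}11}{208{,}39} \approx 0{,}46$, i.e. 46%: the standard deviation of the ages amounts to 46% of the estimated mean age of 208,4 years.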
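The grouped estimates of the standard deviation and the coefficient of variation can be recomputed as follows. This is a sketch under two assumptions: the first class frequency is taken as 9 (inferred from the table's percentages and the total of 2842), and the class midpoints are converted to ages counted back from 2000.

```python
import math

# Ages of the class midpoints: 2000 - 1525 = 475, ..., 2000 - 1975 = 25.
ages = [2000 - (1525 + 50 * i) for i in range(10)]
freqs = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]  # first entry is an assumption

n = sum(freqs)
mean_age = sum(f * a for f, a in zip(freqs, ages)) / n

# Population variance: frequency-weighted squared deviations, divided by the total count.
variance = sum(f * (a - mean_age) ** 2 for f, a in zip(freqs, ages)) / n
std_dev = math.sqrt(variance)
cv = std_dev / mean_age

print(round(mean_age, 2), round(std_dev, 2), round(cv, 2))  # 208.39 95.11 0.46
```

For sample data the divisor would be `n - 1` instead of `n`, following the sample formula above.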
Exercise: On the Excel worksheet ExcelWorksheet_1 you will find the data that were used above as well as all the calculations, detailed in Excel. The worksheet also contains similar data sets (source: K. Bender) for the Venus iconography between 1500 and 2000 in Italy (it), in the Low Countries (lc) and in Germany, Switzerland and Central European countries (gsce). In order to train yourself in the use of Excel to quickly perform a basic descriptive statistical analysis of a set of data:
a. Perform a descriptive analysis similar to the one given in this note, for each of these three supplementary data sets. Summarize your findings in a short text report; include a histogram and a boxplot.
b. Determine the total distribution of Venus art works in Europe between 1500 and 2000 (see the final right columns on the worksheet). Again perform the descriptive analysis for this full distribution, and summarize your findings in a short text report with histogram and boxplot.
c. Compare and interpret the results.