Very Basic Statistics

Course Content
Data Types
Descriptive Statistics
Data Displays
Data Types
Variables
Quantitative Variable
A variable that is counted or measured on a
numerical scale
Can be continuous or discrete (discrete values are counts, so always whole numbers).
Qualitative Variable
A non-numerical variable that can be classified into
categories, but cannot be measured on a numerical
scale.
Can be nominal or ordinal
Continuous Data
Continuous data is measured on a scale.
The data can have almost any numeric value and can be recorded at many different points.
For example:
Temperature (39.25 °C)
Time (2.468 seconds)
Height (1.25 m)
Weight (66.34 kg)
Discrete Data
Discrete data is based on counts, for example:
The number of cars parked in a car park
The number of patients seen by a dentist each day
Only whole-number values are possible, e.g. a dentist could see 10, 11 or 12 people but not 12.3 people.
Nominal Data
A nominal scale is the most basic level of measurement.
The variable is divided into categories and objects are measured by assigning them to a category.
For example:
Colours of objects (red, yellow, blue, green)
Types of transport (plane, car, boat)
There is no order of magnitude to the categories, i.e. blue is no more or less of a colour than red.
Ordinal Data
Ordinal data is categorical data where the categories can be placed in a logical ascending order, e.g.:
A 1 to 5 scoring scale, where 1 = poor and 5 = excellent
Strength of a curry (mild, medium, hot)
There is some measure of magnitude: a score of 5 (excellent) is better than a score of 4 (good).
But this says nothing about the degree of difference between the categories, i.e. we cannot assume a customer who thinks a service is excellent is twice as happy as one who thinks the same service is good.
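As a rough illustration of what "order but no degree of difference" means in practice, the sketch below ranks some ordinal responses on the curry-strength scale. The responses are made-up data and the names `ORDER` and `rank` are hypothetical; the point is only that sorting and comparing by rank is meaningful, while arithmetic on the ranks is not.

```python
# Categories listed in their logical order (from the curry example).
ORDER = ["mild", "medium", "hot"]
rank = {label: position for position, label in enumerate(ORDER)}

# Hypothetical survey responses (not from the slides).
responses = ["hot", "mild", "medium", "medium", "hot"]

# Ordering by rank is valid for ordinal data.
ranked = sorted(responses, key=rank.get)
strongest = max(responses, key=rank.get)

print(ranked)     # ['mild', 'medium', 'medium', 'hot', 'hot']
print(strongest)  # hot
```

Nominal data such as colours would have no equivalent of `ORDER`, so only counting per category would make sense there.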
Task 1
Look at the following variables and decide whether each is qualitative or quantitative, and then whether it is ordinal, nominal, discrete or continuous:
Age
Year of birth
Sex
Height
Number of staff in a department
Time taken to get to work
Preferred strength of coffee
Company size
Session Content
Measures of Location
Measures of Dispersion
Common Measures
Measures of location summarise the data with a single number.
There are three common measures of location:
Mean
Mode
Median
Quartiles are another measure of location.
Mean
The mean (more precisely, the arithmetic mean) is commonly called the average.
In formulas the mean is usually represented by $\bar{x}$, read as x-bar.
The formula for calculating the mean from n individual data-points is:
$$\bar{x} = \frac{\sum x}{n}$$
X-bar equals the sum of the data divided by the number of data-points.
Pros & Cons
Advantages:
basic calculation is easily understood
all data values are used in the calculation
used in many statistical procedures
Disadvantages:
It may not be an actual meaningful value, e.g. an average of 2.4 children per family.
Can be greatly affected by extreme values in a dataset, e.g. seven students take a test and receive the following scores:
40 42 45 50 53 54 99
The average score is 54.7, but is this really representative of the group?
If the extreme value of 99 is dropped, the average falls to 47.3.
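The effect of the extreme value can be checked with a few lines of Python. This is a minimal sketch using the seven test scores from the example; the variable names are illustrative only.

```python
from statistics import mean

scores = [40, 42, 45, 50, 53, 54, 99]  # test scores from the example

# Mean = sum of the data divided by the number of data-points.
x_bar = sum(scores) / len(scores)

# Drop the extreme value of 99 and recompute.
without_extreme = [s for s in scores if s != 99]
x_bar_trimmed = mean(without_extreme)

print(round(x_bar, 1))          # 54.7
print(round(x_bar_trimmed, 1))  # 47.3
```

A single score of 99 moves the mean by more than seven marks, which is why the mean can be a poor summary when outliers are present.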
Mode
The mode represents the most commonly occurring value within a dataset.
We usually find the mode by creating a frequency distribution in which we tally how often each value occurs.
If we find that every value occurs only once, the distribution has no mode.
If we find that two or more values are tied as the most common, the distribution has more than one mode.
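The tallying procedure can be sketched in Python as follows. The function name `modes` and the small example lists are hypothetical; the function follows the rules stated here, returning no mode when every value occurs once and several modes when there is a tie.

```python
from collections import Counter

def modes(values):
    """Return the modal values; an empty list means the data has no mode."""
    counts = Counter(values)  # frequency distribution (the tally)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs only once: no mode.
        return []
    # All values tied for the highest frequency are modes.
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 2, 3]))                          # [2]
print(modes([1, 2, 3]))                             # []
print(modes([1, 1, 2, 2, 3]))                       # [1, 2]
print(modes(["silver", "red", "silver", "blue"]))   # ['silver']
```

Note that the standard library's `statistics.multimode` treats the "every value occurs once" case differently (it returns all the values), so a small helper like this is closer to the definition used in these slides.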
Pros & Cons
Advantages:
easy to understand
not affected by outliers (extreme values)
can also be obtained for qualitative data, e.g. when looking at the frequency of colours of cars we may find that silver occurs most often
Disadvantages:
not all sets of data have a modal value
some sets of data have more than one modal value
multiple modal values are often difficult to interpret
Task 2
The following values are the ages of students in their first year of a course:
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Find the mean age of the students.
Find the modal value.
In your opinion, which is the better measure of location for this data set?
Median
Median means middle, and the median is the middle of a set of data that has been put into rank order.
Specifically, it is the value that divides a set of data into two halves, with one half of the observations being larger than the median value, and one half smaller.
For example, in the ranked data 18, 24, 29, 30, 32 the median is 29: half the data is below 29 and half is above 29.
Finding the Median from Individual Data
Step 1: Arrange the observations in increasing order, i.e. rank order. The median will be the number that corresponds to the middle rank.
Step 2: Find the middle rank with the following formula:
$$\text{Middle rank} = \tfrac{1}{2}(n + 1)$$
Step 3: Identify the value of the median.
If n is an odd number, the middle rank will fall on an observation. The median is then the value of that observation.
If n is an even number, the middle rank will fall between two observations. In this case the median is equal to the arithmetic mean of the values of the two observations.
For example, for the data 40 42 45 50 53 54 70 99 (n = 8):
$$\text{Position of median} = \tfrac{1}{2}(n + 1) = 4.5$$
$$\text{Median} = \frac{\text{data-point 4} + \text{data-point 5}}{2} = \frac{50 + 53}{2} = 51.5$$
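The three steps for finding the median can be written out as a short Python function. This is a sketch only; the name `median_by_rank` is hypothetical, and the two data sets are the ones used in the examples in this section.

```python
def median_by_rank(values):
    """Median via the middle-rank rule: rank the data, then use rank (n + 1) / 2."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ranked = sorted(values)        # Step 1: put the data in rank order
    n = len(ranked)
    middle_rank = (n + 1) / 2      # Step 2: find the middle rank
    lower_index = int(middle_rank) - 1  # ranks start at 1, list indices at 0
    if n % 2 == 1:
        # Step 3 (n odd): the middle rank falls on an observation.
        return ranked[lower_index]
    # Step 3 (n even): average the two observations either side of the rank.
    return (ranked[lower_index] + ranked[lower_index + 1]) / 2

print(median_by_rank([18, 24, 29, 30, 32]))              # 29
print(median_by_rank([40, 42, 45, 50, 53, 54, 70, 99]))  # 51.5
```

The standard library function `statistics.median` gives the same results for these lists.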
Pros & Cons
Advantages:
the concept is easy to understand
the median can be determined for any type of data (with the exception of nominal)
the median is not unduly influenced by extreme values in the dataset
Disadvantages:
data must be arranged in rank order (ascending or descending)
cannot combine medians in statistical calculations as with mean values
Task 3
Using the student age data below, find the median age:
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Quartiles
Quartiles are particular percentiles: the lower quartile is the 25th percentile and the upper quartile is the 75th percentile.
Lower quartile: 25% of the data is below this.
$$\text{Position of } Q_1 = \tfrac{1}{4}(n + 1)$$
Upper quartile: 75% of the data is below this.
$$\text{Position of } Q_3 = \tfrac{3}{4}(n + 1)$$
If a quartile falls on an observation, the value of the quartile is the value of that observation.
For example, if the position of a quartile is 20, its value is the value of the 20th observation.
If a quartile lies between observations, the value of the quartile is the value of the lower observation plus the specified fraction of the difference between the two observations.
For example, for the data 40 42 45 50 53 54 70 99 (n = 8):
$$\text{Position of upper quartile} = \tfrac{3}{4}(n + 1) = 6.75$$
$$\text{Upper quartile} = \text{data-point 6} + 0.75 \times (\text{data-point 7} - \text{data-point 6})$$
$$\text{Upper quartile} = 54 + 0.75 \times (70 - 54) = 66$$
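The position-and-interpolation rule can be sketched in Python as below. The function name `quartile` and its `fraction` parameter are hypothetical; the sketch assumes the quartile position lies between rank 1 and rank n, which holds for the eight-value example.

```python
def quartile(values, fraction):
    """Quartile by the (n + 1) position rule; fraction is 0.25 for Q1, 0.75 for Q3."""
    ranked = sorted(values)
    n = len(ranked)
    position = fraction * (n + 1)
    if not 1 <= position <= n:
        raise ValueError("quartile position falls outside the data")
    lower_rank = int(position)           # rank of the lower observation
    remainder = position - lower_rank    # fraction of the way to the next one
    lower_value = ranked[lower_rank - 1]
    if remainder == 0:
        # The quartile falls exactly on an observation.
        return lower_value
    upper_value = ranked[lower_rank]
    return lower_value + remainder * (upper_value - lower_value)

data = [40, 42, 45, 50, 53, 54, 70, 99]
q1 = quartile(data, 0.25)
q3 = quartile(data, 0.75)
print(q1)  # 42.75
print(q3)  # 66.0
```

Other software may use slightly different quartile conventions, so results can differ a little from this rule on small data sets. For this data, `statistics.quantiles(data, n=4, method="exclusive")` returns the same two quartiles.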
Task 4
Using the student age data below, find the upper and lower quartiles:
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Common Measures
The dispersion in a set of data is the variation among the data values.
It measures whether they are all close together or more scattered.
[Figure: two plots of report turnaround time (days), one spanning 2 to 16 days and one spanning 2 to 12 days]
Common Measures
The four common measures of spread are:
the range
the inter-quartile range
the variance
the standard deviation
Range
The range is the difference between the largest and the smallest values in the dataset, i.e. the maximum difference between data-points in the list.
It is sensitive only to the most extreme values in the list.
The range of a list is 0 if and only if all the data-points in the list are equal.
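A short Python sketch makes the sensitivity to extreme values concrete. It reuses the seven test scores from the section on the mean; the function name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

scores = [40, 42, 45, 50, 53, 54, 99]  # test scores used earlier in the slides

print(value_range(scores))       # 59
print(value_range(scores[:-1]))  # 14 (the extreme score of 99 removed)
print(value_range([7, 7, 7]))    # 0 (all data-points equal)
```

Removing one extreme score cuts the range from 59 to 14, even though the other six scores are unchanged.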
[Figure: range marked from 4 to 16 days]
Pros & Cons
Advantages:
best for symmetric data with no outliers
easy to compute and understand
good option for ordinal data
Disadvantages:
doesn't use all of the data, only the extremes
very much affected if the extremes are outliers
only shows maximum spread, does not show shape
Task 5
Using the student age data, find the range of the data.
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Inter-quartile Range
$$\text{Inter-quartile range} = \text{upper quartile} - \text{lower quartile}$$
It essentially describes how much the middle 50% of your dataset varies.
Example: if all patients in a dentist's surgery took more-or-less the same time to be treated, with only one or two exceptionally quick or long appointments, you would expect the inter-quartile range to be very small.
But if all appointments were either very quick or very long, with few in between, then the inter-quartile range would be larger.
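As a hedged sketch, the inter-quartile range of the eight-value data set from the quartile example can be computed with Python's standard library. The `method="exclusive"` option of `statistics.quantiles` positions quartiles using (n + 1), which matches the position rule used in these slides; other tools may use different conventions and give slightly different values.

```python
from statistics import quantiles

data = [40, 42, 45, 50, 53, 54, 70, 99]  # data from the quartile example

# quantiles(..., n=4) returns the three cut points: Q1, the median, and Q3.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3)  # 42.75 66.0
print(iqr)     # 23.25
```

The extreme value of 99 does not affect the result: replacing it with any larger number leaves both quartiles unchanged.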
Pros & Cons
Advantages:
Good for ordinal data
Ignores extreme values
More stable than the range because it ignores outliers
Disadvantages:
Harder to calculate and understand
Doesn't use all the information (ignores half of the data-points, not just the outliers)
Tails almost always matter in data and these aren't included
Outliers can also sometimes matter and again these aren't included
Task 6
Using the student age data, find the inter-quartile range.
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Variance and Standard Deviation
$(\sigma^2, s^2)$ = (population notation, sample notation)
The variance ($\sigma^2$ or $s^2$) and standard deviation ($\sigma$ or $s$) are measures of the deviation or dispersion of observations ($x$) around the mean ($\mu$ for a population, $\bar{x}$ for a sample) of a distribution.
Variance is an average squared deviation from the mean.
The standard deviation (SD) is the square root of the variance.
small SD = values cluster closely around the mean
large SD = values are scattered
[Figure: two distributions of turnaround time in days, each marked with the mean and 1 SD either side of it; one spans 4 to 16 days, the other 8 to 12 days]
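The following formulae define these measures:
Population variance: $$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
Sample variance: $$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Population standard deviation: $$\sigma = \sqrt{\sigma^2}$$
Sample standard deviation: $$s = \sqrt{s^2}$$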
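The population and sample formulae can be compared side by side in Python. This is a minimal sketch: the function names are hypothetical and the eight data values are made up for illustration (they are not from the slides), chosen so that the mean is exactly 5.

```python
from math import sqrt

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

pop_var = population_variance(data)   # 32 / 8
samp_var = sample_variance(data)      # 32 / 7
pop_sd = sqrt(pop_var)                # standard deviation = square root of variance
samp_sd = sqrt(samp_var)

print(pop_var, pop_sd)     # 4.0 2.0
print(round(samp_var, 3))  # 4.571
print(round(samp_sd, 3))   # 2.138
```

Dividing by n - 1 rather than N always gives a slightly larger value, and the difference shrinks as the number of data-points grows. The standard library functions `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev` compute the same four quantities.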
Variance
Advantages:
uses all of the data values
Disadvantages:
the variance is measured in the original units squared
extreme values or outliers affect the variance considerably
hard to calculate manually
Standard Deviation
Advantages:
same units of measurement as the values
useful in theoretical work and statistical methods
and inference
Disadvantages:
hard to calculate manually
Task 7
Using the student age data, find the variance and the standard deviation.
18, 19, 18, 25, 22, 20, 21, 45, 33, 20, 18, 18
Session Summary
Mean
Mode
Median
Quartiles
Range
Interquartile Range
Variance
Standard Deviation
Data Displays
Session Content
Histograms
Run charts
Box plots
Bar charts
Pareto charts
Pie charts
Scatter plots
Contingency tables
Histograms
[Figure: Histogram of dataset 1 (normal); x-axis: dataset 1 (normal), 45.0 to 90.0; y-axis: Frequency, 0 to 30]
Run Charts
[Figure: Time Series Plot of Time Taken; x-axis: Day (mon to fri, repeated over four weeks); y-axis: Time Taken, 25.0 to 35.0]
Boxplots
[Figure: Boxplot of dataset 1 (normal), dataset 2 (exponential) and dataset 3 (uniform); y-axis: Data, 100 to 400]

Bar Charts
[Figure: Chart of Frequency; x-axis: Causes of Medication Errors (missed dose, wrong patient, wrong dose, wrong time, wrong medicine); y-axis: Frequency, 0 to 20]
Pareto Charts
[Figure: Pareto Chart of Causes of Medication Errors; left axis: Frequency, 0 to 40; right axis: Percent, 0 to 100]
Causes of Medication Errors
Frequency   18     15     4      2      1
Percent     45.0   37.5   10.0   5.0    2.5
Cum %       45.0   82.5   92.5   97.5   100.0
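The Percent and Cum % rows of a Pareto table follow directly from the frequencies. The sketch below recomputes them from the five frequencies shown; the variable names are illustrative, and the categories are assumed to be already sorted from most to least frequent, as a Pareto chart requires.

```python
from itertools import accumulate

# Frequencies from the Pareto table, in descending order.
frequencies = [18, 15, 4, 2, 1]
total = sum(frequencies)

# Each category's share of all errors.
percent = [100 * f / total for f in frequencies]

# Running total of frequencies, expressed as a share of all errors.
cumulative_percent = [100 * c / total for c in accumulate(frequencies)]

print(total)               # 40
print(percent)             # [45.0, 37.5, 10.0, 5.0, 2.5]
print(cumulative_percent)  # [45.0, 82.5, 92.5, 97.5, 100.0]
```

The cumulative row shows that the two most common causes account for 82.5% of all recorded errors.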
Pie Charts
[Figure: Pie Chart of Causes of Medication Errors; categories: missed dose, wrong patient, wrong dose, wrong time, wrong medicine; slices: 18 (45.0%), 15 (37.5%), 4 (10.0%), 2 (5.0%), 1 (2.5%)]
Scatterplots
[Figure: Scatterplot of Weight Loss vs Time on Diet; x-axis: Time on Diet, 0 to 25; y-axis: Weight Loss, 0 to 80]
Contingency Tables
                  Colour of eyes
Colour of hair    Brown   Green/grey   Blue    Total
Black             50      54           41      145
Brown             38      46           48      132
Fair              22      30           31      83
Ginger            10      10           20      40
Total             120     140          140     400 = N
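The row and column totals of a contingency table are simple sums over the cell counts. The sketch below stores the hair-colour by eye-colour counts from the table as a nested dictionary (one possible representation, chosen here for illustration) and recomputes the totals.

```python
# Cell counts from the contingency table: hair colour -> eye colour -> count.
table = {
    "Black":  {"Brown": 50, "Green/grey": 54, "Blue": 41},
    "Brown":  {"Brown": 38, "Green/grey": 46, "Blue": 48},
    "Fair":   {"Brown": 22, "Green/grey": 30, "Blue": 31},
    "Ginger": {"Brown": 10, "Green/grey": 10, "Blue": 20},
}
eye_colours = ["Brown", "Green/grey", "Blue"]

# Row totals: one per hair colour.
row_totals = {hair: sum(eyes.values()) for hair, eyes in table.items()}

# Column totals: one per eye colour.
column_totals = {eye: sum(row[eye] for row in table.values()) for eye in eye_colours}

# Grand total N: the row totals and the column totals must give the same sum.
grand_total = sum(row_totals.values())

print(row_totals)     # {'Black': 145, 'Brown': 132, 'Fair': 83, 'Ginger': 40}
print(column_totals)  # {'Brown': 120, 'Green/grey': 140, 'Blue': 140}
print(grand_total)    # 400
```

Checking that the row totals and column totals add up to the same N is a quick way to catch data-entry errors in a table like this.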
Session Summary
Histograms
Run charts
Box plots
Bar charts
Pareto charts
Pie charts
Scatter plots
Contingency tables
Course Summary
Data Types
Data Displays
