
Descriptive and Inferential Statistics

Statistics can be broken into two basic types.

I. Descriptive statistics. This is a set of methods to describe data that we have collected.

Example: Of 350 randomly selected people in the town of Luserna, Italy, 280 people had the last
name Nicolussi. An example of descriptive statistics is the following statement:

"80% of these people have the last name Nicolussi."

Example: On the last 3 Sundays, Henry D. Car salesman sold 2, 1, and 0 new cars respectively.
An example of descriptive statistics is the following statement:

"Henry averaged 1 new car sold for the last 3 Sundays."

These are both descriptive statements because they can actually be verified from the information
provided.
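
As a minimal sketch of why these statements count as descriptive, both can be recomputed directly from the numbers given in the examples. The variable names below are illustrative and not part of the original tutorial.

```python
# Data taken from the two examples above.
sampled = 350              # people sampled in Luserna
nicolussi = 280            # of those, how many are named Nicolussi
sunday_sales = [2, 1, 0]   # Henry's new car sales on the last 3 Sundays

# Both statements follow from the data alone, with no generalization.
share = nicolussi / sampled
average_sales = sum(sunday_sales) / len(sunday_sales)

print(f"{share:.0%} of these people have the last name Nicolussi.")
print(f"Henry averaged {average_sales:g} new car sold for the last 3 Sundays.")
```

Nothing in this calculation reaches beyond the 350 people or the 3 Sundays, which is what separates it from the inferential statements discussed next.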

II. Inferential statistics. This is a set of methods used to make a generalization, estimate,
prediction or decision.

Example: Of 350 randomly selected people in the town of Luserna, Italy, 280 people had the last
name Nicolussi. An example of inferential statistics is the following statement:

"80% of all people living in Italy have the last name Nicolussi."

We have no information about all people living in Italy, just about the 350 living in Luserna. We
have taken that information and generalized it to talk about all people living in Italy. The easiest
way to tell that this statement is not descriptive is by trying to verify it based upon the
information provided.

Example: On the last 3 Sundays, Henry D. Car salesman sold 2, 1, and 0 new cars respectively.
Examples of inferential statistics are the following statements:

"Henry never sells more than 2 cars on a Sunday."

Although this statement is true for the last 3 Sundays, we do not know that this is true for all
Sundays.

"Henry is selling fewer cars lately because people have caught on to his dirty tricks."

There is nothing in the information given that tells us that this statement is true.

"Henry sold 0 cars last Sunday because he fell asleep in one of the cars on the lot."
Adapted from: http://infinity.cos.edu/faculty/woodbury/Stats/Tutorial/TOC1.htm
Accessed: 4.11.2008
The major use of inferential statistics is to use information from a sample to infer something
about a population.

Questions

1) The last four semesters an instructor taught Intermediate Algebra, the following
numbers of people passed the class.

17 19 4 20

Which of the following conclusions can be obtained from purely descriptive measures and
which can be obtained by inferential methods?

a) The last four semesters the instructor taught Intermediate Algebra, an average of 15 people
passed the class.

b) The next time the instructor teaches Intermediate Algebra, we can expect approximately 15
people to pass the class.

c) This instructor will never pass more than 20 people in an Intermediate Algebra class.

d) The last four semesters the instructor taught Intermediate Algebra, no more than 20 people
passed the class.

e) Only 4 people passed one semester because the instructor was in a bad mood the entire
semester.

f) The instructor passed 20 people the last time he taught the class to keep the administration off
of his back for poor results.

g) The instructor passes so few people in his Intermediate Algebra classes because he doesn't like
teaching that class.

2) During the last week, Tony Gwynn of the San Diego Padres recorded the following
number of hits.

Sun  Mon  Tues  Wed  Thurs  Fri  Sat
2    1    4     3    0      3    1

Which of the following conclusions can be obtained from purely descriptive methods and which
can be obtained by inferential methods?

a) Tony will never have more than 4 hits in a game.

b) Tony had 0 hits on Thursday because he used a bat that belonged to another player.
c) During the last week, Tony averaged 2 hits per game.

d) Tony is a better hitter than any other baseball player.

e) Tony had the same total number of hits in the first 3 games as he did in the last 4 games.

Classify each set of data as discrete or continuous.

1) The number of suitcases lost by an airline.

2) The height of corn plants.

3) The number of ears of corn produced.

4) The number of green M&M's in a bag.

5) The time it takes for a car battery to die.

6) The production of tomatoes by weight.

Frequency Distributions
A frequency distribution is a tool for organizing data. We use it to group data into categories
and show the number of observations in each category. Here are some test scores from a math
class.

65 91 85 76 85 87 79 93
82 75 100 70 88 78 83 59
87 69 89 54 74 89 83 80
94 67 77 92 82 70 94 84
96 98 46 70 90 96 88 72

It's hard to get a feel for this data in this format because it is unorganized. To construct a
frequency distribution, you should first identify the lowest and highest values in the list. We do
this because we want to be sure that each value in the list fits into one of our categories. The low
value here is 46, and the high is 100. A set of categories that would work here is 41-50, 51-60,
61-70, 71-80, 81-90, and 91-100. Here's a finished product:

Class Frequency
41-50 1
51-60 2
61-70 6
71-80 8
81-90 14
91-100 9

We can now see that the largest number of test scores fell between 81 and 90, and most of the
scores were between 71 and 100.

The low number in each category (or class) is called the lower class limit, and the high number is
called the upper class limit.
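
The counting step can be sketched in a few lines of Python. This is only an illustration of the tallying described above; the function name `frequency_distribution` and the representation of a class as a (lower limit, upper limit) pair are assumptions made for this example.

```python
# The 40 test scores listed above.
scores = [
    65, 91, 85, 76, 85, 87, 79, 93,
    82, 75, 100, 70, 88, 78, 83, 59,
    87, 69, 89, 54, 74, 89, 83, 80,
    94, 67, 77, 92, 82, 70, 94, 84,
    96, 98, 46, 70, 90, 96, 88, 72,
]

# Each class is a (lower class limit, upper class limit) pair.
classes = [(41, 50), (51, 60), (61, 70), (71, 80), (81, 90), (91, 100)]

def frequency_distribution(values, classes):
    """Count how many values fall inside each class, limits included."""
    return {
        (lower, upper): sum(lower <= v <= upper for v in values)
        for lower, upper in classes
    }

freq = frequency_distribution(scores, classes)
for (lower, upper), count in freq.items():
    print(f"{lower}-{upper}\t{count}")
```

Because the classes do not overlap and together cover 41 through 100, every score is counted exactly once and the frequencies add up to 40.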

Now for some guidelines for constructing a frequency distribution.

• Each value should fit into a category. The classes should be collectively exhaustive.
• No value should fit into more than 1 category. The classes should be mutually exclusive;
there should be no overlapping of classes.
• Make the classes of equal size if possible. This makes it easier to compare the frequency
in one class to another.
• Avoid open-ended classes, such as "75 and over", if possible.

• Try to use between 5 and 20 classes if possible. If you have fewer than 5 classes, you're
not really breaking up the data, and if you use more than 20 classes, this will probably be
information overload.
• It is usually convenient to use class sizes of 5 or 10, in other words, to have each class
containing 5 or 10 possible values.
• It is usually convenient to make the lower limit of the first category a multiple of the
class size.

After the first two rules above, the rest are merely suggestions. Each set of data may require you
to violate some of these suggestions. The best advice is to try to follow them whenever
possible.

One further extension to the frequency distribution is to look at the percentage of values that
show up in each category. This is called a relative frequency distribution or percent
frequency distribution. Here's how the above data would be presented in this way.

Class     Frequency    Relative Frequency    Percent
41-50     1            1/40                  2.5%
51-60     2            2/40                  5%
61-70     6            6/40                  15%
71-80     8            8/40                  20%
81-90     14           14/40                 35%
91-100    9            9/40                  22.5%
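
The percent column is each frequency divided by the total number of observations. A short sketch, starting from the frequencies in the table above (the list names are illustrative):

```python
classes = ["41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]
frequencies = [1, 2, 6, 8, 14, 9]

total = sum(frequencies)  # 40 test scores in all

# Percent frequency: the share of all observations that fall in each class.
percents = [100 * f / total for f in frequencies]

for label, f, pct in zip(classes, frequencies, percents):
    print(f"{label:>6}  {f:>2}  {f}/{total}  {pct:g}%")
```

The percentages should always add up to 100, which is a quick check that no observation was missed or counted twice.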

The final frequency distribution that we will discuss is the cumulative frequency distribution.
Think about the word cumulative; it generally refers to some sort of total. A cumulative
frequency distribution is a way to list how many values fit into the first class, the first 2 classes,
the first 3 classes, etc., or the last class, the last 2 classes, etc. Here's a cumulative "less than"
frequency distribution for the above set of data.

Class     Frequency    Cumulative Frequency (Less Than)
41-50     1            1
51-60     2            3
61-70     6            9
71-80     8            17
81-90     14           31
91-100    9            40

The 1 means that there is 1 value that is 50 or less, the 3 means that there are 3 values that are 60
or less, the 9 means that there are 9 values that are 70 or less, and so on.
Now for a cumulative greater than frequency distribution.

Class     Frequency    Cumulative Frequency (Greater Than)
41-50     1            40
51-60     2            39
61-70     6            37
71-80     8            31
81-90     14           23
91-100    9            9

The 40 means that there are 40 values that are 41 or more, the 39 means that there are 39 values
that are 51 or more, the 37 means that there are 37 values that are 61 or more, and so on.
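
Both cumulative columns are running totals of the frequency column, taken from the top for "less than" and from the bottom for "greater than". One possible sketch uses `itertools.accumulate` from the Python standard library; the variable names are illustrative.

```python
from itertools import accumulate

frequencies = [1, 2, 6, 8, 14, 9]  # classes 41-50 through 91-100

# "Less than": running total from the first class downward.
less_than = list(accumulate(frequencies))

# "Greater than": running total from the last class upward,
# then flipped back so it lines up with the class order.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)
print(greater_than)
```

The last "less than" entry and the first "greater than" entry both equal the total number of values, 40.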

Histograms
A histogram could be thought of as a graph of a frequency distribution. Recall the following
example from frequency distributions.

Here are some test scores from a math class.

65 91 85 76 85 87 79 93
82 75 100 70 88 78 83 59
87 69 89 54 74 89 83 80
94 67 77 92 82 70 94 84
96 98 46 70 90 96 88 72

Here's the frequency distribution:

Class Frequency
41-50 1
51-60 2
61-70 6
71-80 8
81-90 14
91-100 9

Here's the histogram that goes with the frequency distribution. This was done using Minitab.

The horizontal axis (x-axis) corresponds to the test scores, while the vertical axis (y-axis)
represents the frequency of each class or category. The first bar corresponds to the class 41-50.
The next bar goes with the class 51-60, and so on. Notice that the first bar actually goes up to
the value 51, but the score of 51 actually goes with the second class.
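
For readers without Minitab, a rough text version of the same picture can be printed with a few lines of Python. This is only a sketch: it lays the bars out horizontally rather than vertically, and the function name `text_histogram` is made up for this example.

```python
classes = ["41-50", "51-60", "61-70", "71-80", "81-90", "91-100"]
frequencies = [1, 2, 6, 8, 14, 9]

def text_histogram(labels, counts, symbol="#"):
    """Return one line per class: the label, then one symbol per observation."""
    width = max(len(label) for label in labels)
    return [
        f"{label:>{width}} | {symbol * count} ({count})"
        for label, count in zip(labels, counts)
    ]

bars = text_histogram(classes, frequencies)
print("\n".join(bars))
```

The length of each bar is the class frequency, so the longest bar belongs to the class 81-90.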

Pie Charts
A pie chart is a useful diagram when we're dealing with percentages. The chart is a way to
visually display what percentage of the total a certain category makes up.

Example: 1200 students at the College of the Sequoias were polled and asked about the number
of parking spaces on campus. Here are the results:

Response Frequency
Too Many 300
About Right 360
Not Enough 540

To do this manually:

First determine what percent of the total each category represents. Here, 300/1200 = 25%,
360/1200 = 30%, and 540/1200 = 45%.

Next draw a circle. This circle will represent 100% of the values. Place marks every 5%, keeping
in mind that one-quarter of a circle represents 25%.

Next, starting at the top, move in the clockwise direction until you reach 25%. Then move
another 30% (until you reach 55%). The remaining 45% of the circle is the third category.

A little color wouldn't hurt either.
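
The manual steps can be sketched as a small calculation that gives, for each response, its percentage, the size of its slice in degrees, and the point on the circle (in percent, measured clockwise from the top) where the slice ends. The tuple layout used here is an assumption made for illustration.

```python
responses = {"Too Many": 300, "About Right": 360, "Not Enough": 540}
total = sum(responses.values())  # 1200 students polled

slices = []
boundary = 0.0  # running position around the circle, in percent
for label, count in responses.items():
    percent = 100 * count / total   # share of the whole circle
    degrees = 360 * count / total   # the same share expressed as an angle
    boundary += percent             # where this slice ends
    slices.append((label, percent, degrees, boundary))

for label, percent, degrees, end in slices:
    print(f"{label:<12} {percent:g}%  {degrees:g} degrees  ends at {end:g}%")
```

The slice boundaries land at 25%, 55%, and 100%, matching the marks described above.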

To do this with software: Use Excel

Measures of Central Tendency
Often it is desirable to have a single number to describe a set of data. In other words, this one
number would be representative of the data. Since a representative number should be close to the
"middle" of the data, we call these measures of central tendency. The weakest of these
measures is the mode.

Mean

The mean is the most powerful, and usually the most accurate and reliable, measure of central
tendency. When we hear the word "average", what we usually have in mind is the
mean. To find the mean for a set of data, we take the sum of all of the values, and divide the sum
by how many values there are. If we are looking for the mean of a sample, we denote that mean
by $\bar{x}$. This is read "x-bar".

The formula for the sample mean is $\bar{x} = \frac{\sum x}{n}$, where

$\bar{x}$ is the mean of the sample,
$\sum x$ is the sum of all the values, and
n is the number of values in the set.

If we are looking for the mean of a population, we denote that mean by the Greek letter $\mu$, mu.
The way to calculate this mean is the same. The difference in notation is to tell a sample statistic,
$\bar{x}$, from a population parameter, $\mu$. We will always use our own alphabet when discussing a
sample statistic, and the Greek alphabet to discuss a population parameter.

The formula for the population mean is $\mu = \frac{\sum x}{N}$, where

$\mu$ is the mean of the population,
$\sum x$ is the sum of all the values, and
N is the number of values in the set.

Example:

Joe D. Student got the following scores on his 5 statistics exams: 89, 83, 71, 95, 73. Find Joe's
mean test score.

$\mu = \frac{\sum x}{N} = \frac{89 + 83 + 71 + 95 + 73}{5} = \frac{411}{5} = 82.2$

So, why $\mu$ and not $\bar{x}$? Since these represent all of Joe's 5 tests, we treat it as a population. But
again the difference here is only in name.
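
The same calculation can be checked in Python, either straight from the formula or with `statistics.mean` from the standard library. The variable names are illustrative.

```python
from statistics import mean

scores = [89, 83, 71, 95, 73]  # Joe's 5 exam scores

# Straight from the formula: sum of the values divided by how many there are.
by_formula = sum(scores) / len(scores)

# The standard library gives the same result.
by_library = mean(scores)

print(by_formula, by_library)
```

The code does not care whether the result is called a sample mean or a population mean; as noted above, the calculation is the same and only the notation differs.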

It will be to your advantage to be able to use your calculator to compute the mean. Your
calculator has a built-in way to calculate the mean of a set of data. Most non-graphing calculators
use some or all of the following steps.

1. Put your calculator into statistics mode.

2. Make sure that your statistical registers are cleared. These are the memory locations where
your calculator stores the values.

3. Enter your numbers into the calculator by pressing the number and then hitting the key that
will "store" the number in the statistical registers. The key will either have Σ+, M+, or Data on
it.

4. Once all the numbers have been entered, push the key with $\bar{x}$ over it. Usually, you will have to
push the 2nd key or the Shift key or the Inv key.
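
To see why these steps work, here is a hypothetical model of what the statistical registers might hold: a count of the values entered and their running sum. Real calculators differ in detail, so the class `StatRegisters` and its method names are assumptions made purely for illustration.

```python
class StatRegisters:
    """Toy model of a calculator's statistical registers (an assumption)."""

    def __init__(self):
        self.clear()

    def clear(self):
        """Step 2: clear the registers before entering new data."""
        self.n = 0       # how many values have been entered
        self.sum_x = 0   # running sum of the values

    def enter(self, x):
        """Step 3: the 'store' key updates the count and the running sum."""
        self.n += 1
        self.sum_x += x

    def mean(self):
        """Step 4: the mean key divides the running sum by the count."""
        return self.sum_x / self.n


registers = StatRegisters()
for score in [89, 83, 71, 95, 73]:
    registers.enter(score)
result = registers.mean()
print(result)
```

This also shows why forgetting to clear the registers gives a wrong answer: leftover values from an earlier problem would still be included in the count and the sum.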

What makes the mean so much better than the mode?

It always exists; the mode occasionally cannot be found.

It is unique; there is sometimes more than one mode for a set of data.

It uses every value in its calculation.

When is the mean not a representative value for a set of data? Look at the following
example.

Seven houses were sold last week in Visalia. Here are the selling prices: $94,900, $97,900,
$99,900, $100,900, $102,900, $107,900, and $1,250,000. Find the average selling price.

The seven prices add up to $1,854,400, and $1,854,400 divided by 7 is approximately $264,914.29.

So, the average selling price is approximately $264,914.29. Is this a typical value for the set of
data? No. This is over $150,000 more than 6 of the values, and nearly $1,000,000 less
than the seventh value. The problem here is that the $1,250,000 home is an extreme outlier for
this set of data, and has influenced the mean. If we were to calculate the mean without the
outlier, we would come up with a value of approximately $100,733.33. This number accurately
describes the other six values. Another thing to try would be to find the median.
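
A short sketch makes the effect of the outlier easy to reproduce. The prices are the ones from the example; the variable names are illustrative.

```python
prices = [94_900, 97_900, 99_900, 100_900, 102_900, 107_900, 1_250_000]

# Mean of all seven selling prices.
mean_all = sum(prices) / len(prices)

# Mean of the six ordinary prices, leaving out the $1,250,000 outlier.
ordinary = prices[:-1]
mean_without_outlier = sum(ordinary) / len(ordinary)

print(f"With the outlier:    ${mean_all:,.2f}")
print(f"Without the outlier: ${mean_without_outlier:,.2f}")
```

A single extreme value moves the mean by more than $160,000, even though the other six prices lie within $13,000 of each other.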

Exercise: Mean

1) In trying to estimate Joe D. Bowler's mean bowling score, six of his games are selected at
random. The scores are 187, 169, 172, 209, 154, and 195. Find the mean for these six scores.

2) During the first 5 weeks of the 1995 NFL season, the San Francisco Forty Niners gained the
following number of yards rushing: 154, 158, 90, 78, and 109. George Seifert was interested in
his team's rushing performance through the first 5 weeks of the season. Find the mean rushing
yardage.

3) "I wonder how many points, on average, are scored by a typical NFL team in a game?"
wonders Joe D. Sports fan. Joe gets his local paper that shows all of this weekend's results. Here
are the points scored that week:

24, 21, 14, 17, 10, 3, 20, 23, 24, 22, 21, 6, 17, 14, 20, 23, 14, 52, 7, 17, 34, 10, 7, 27, 14, 31, 7,
22, 35, 0

Find the mean score for this data.

Median
The median for a set of data is the value that is exactly in the center. The first step is to place the
values in order from smallest to biggest. Then you want to divide the set into 2 equal halves. If
there is a single value left between the two halves, this is the median. This will always be the
case if you start with an odd number of values. If there is not a value between the two halves,
then the median is the mean of the two values closest to the center. This will always be the case
if you start with an even number of values. Maybe some examples will help.

Example: Find the median of the following numbers: 98, 86, 46, 63, 66, 94, 31, 56, 51, 75, 48.

First put them in order: 31, 46, 48, 51, 56, 63, 66, 75, 86, 94, 98.

There are 11 values, so we can get 2 groups of 5 with one left over.

31, 46, 48, 51, 56 | 63 | 66, 75, 86, 94, 98

The median is 63.

Example: Find the median of: 93, 90, 62, 44, 75, 89, 74, 100, 78, 61, 78, 81, 57, 67.

First put them in order: 44, 57, 61, 62, 67, 74, 75, 78, 78, 81, 89, 90, 93, 100.

There are 14 values, so we can get 2 groups of 7 with none left over.

44, 57, 61, 62, 67, 74, 75 | 78, 78, 81, 89, 90, 93, 100

The median is found by taking the mean of 75 and 78. The median is 76.5.
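
The procedure used in both examples can be written as a short function. This is a sketch of the steps described above, not a replacement for a library routine; Python's standard library offers `statistics.median`, which should give the same results.

```python
def median(values):
    """Sort the values, then take the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single value is left between the two halves.
        return ordered[middle]
    # Even count: average the two values closest to the center.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_example = median([98, 86, 46, 63, 66, 94, 31, 56, 51, 75, 48])
even_example = median([93, 90, 62, 44, 75, 89, 74, 100, 78, 61, 78, 81, 57, 67])
print(odd_example, even_example)
```

Sorting first is essential: taking the middle entry of the unsorted list would give a value that depends on the order in which the data happened to be written down.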

The median is not sensitive to outliers in the way that the mean is. Let's take a look at the same
example from the mean section regarding the selling price of the 7 homes sold in Visalia last
week.

Example: Seven houses were sold last week in Visalia. Here are the selling prices: $94,900,
$97,900, $99,900, $100,900, $102,900, $107,900, and $1,250,000. Find the median selling price.

First, put the values in order.

$94,900, $97,900, $99,900, $100,900, $102,900, $107,900, $1,250,000.

There are seven values, so we can get 2 groups of 3 with 1 left over.

$94,900, $97,900, $99,900 | $100,900 | $102,900, $107,900, $1,250,000

The median is $100,900. Is this value more representative of the set of values than the mean of
approximately $264,914.29? Yes. Since the median uses only the value or values in the center of the
list, the outliers are not used.
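
As a quick check, the comparison can be run with `statistics.mean` and `statistics.median` from the Python standard library. Dropping the outlier is an extra step added here for illustration; it is not part of the original example.

```python
from statistics import mean, median

prices = [94_900, 97_900, 99_900, 100_900, 102_900, 107_900, 1_250_000]

mean_all = mean(prices)
median_all = median(prices)

# Illustration only: remove the $1,250,000 house and recompute the median.
median_without_outlier = median(prices[:-1])

print(f"Mean:   ${mean_all:,.2f}")
print(f"Median: ${median_all:,}")
print(f"Median without the outlier: ${median_without_outlier:,.0f}")
```

Removing the outlier shifts the median by only $500, from $100,900 to $100,400, while the mean falls by more than $160,000.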

The median is used in cases where extreme outliers occur, like real estate prices (there are some
really expensive houses) and household income (some people make a lot of money). Can you
think of some other situations that would require the median?

Mode
The mode for a set of data is the value that occurs the greatest number of times.

Example: Find the mode for the following set of data: 4, 6, 6, 7, 11, 11, 11, 12

Ans. The mode is 11, because it occurs more times (3) than any other number.

One weakness of the mode is that sometimes a set of data can have more than one mode.

Example: Find the mode for the following set of data: 4, 6, 6, 6, 7, 11, 11, 11, 12

Ans. The modes are 6 and 11, because each occurs 3 times.

A set of data with 2 modes is sometimes called bimodal.

Sometimes a set of data doesn't have a mode. This happens when no value is repeated in the set.

Example: Find the mode for the following set of data: 4, 5, 6, 7, 10, 11, 12, 13

Ans. This set of data has no mode.

So, sometimes a set of data has more than one mode, and sometimes a set of data doesn't even
have a mode. Another weakness is that the mode occasionally is not a typical value for the set of
data. Consider the set of values: 5, 5, 73, 75, 77, 78, 79, 80, 82, 83, 84. The mode is 5, but is 5
representative of this set of values? Of course not! This set of values, with the exception of the
two outliers of 5, is made up of values in the 70's and 80's. If you were told that the mode for a
set of data was 5, and you did not see the actual values, would you guess that most of the
numbers were in the 70's and 80's? Probably not.
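
The three cases above (one mode, several modes, no mode) can be sketched with `collections.Counter`. The function `modes` is written for this illustration and follows this tutorial's convention that a set with no repeated value has no mode; be aware that some library routines, such as `statistics.multimode`, instead return every value in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # No value is repeated, so by this tutorial's convention there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

one_mode = modes([4, 6, 6, 7, 11, 11, 11, 12])
two_modes = modes([4, 6, 6, 6, 7, 11, 11, 11, 12])
no_mode = modes([4, 5, 6, 7, 10, 11, 12, 13])
untypical = modes([5, 5, 73, 75, 77, 78, 79, 80, 82, 83, 84])
print(one_mode, two_modes, no_mode, untypical)
```

The last call returns 5 even though nearly all of the values are in the 70's and 80's, which is the weakness described above.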

Standard Deviation
The standard deviation is the measure of dispersion that we will use most often. It is based on
the variance. The standard deviation, whether of a sample or population, is found by taking the
square root of the appropriate variance.

Example: A student took 5 exams in a class and had scores of 92, 75, 95, 90, and 98. Find the
variance for her test scores.

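We will treat these 5 test scores as a population, since there is no suggestion that there are more
than 5 tests.

The population mean is $\mu = \frac{92 + 75 + 95 + 90 + 98}{5} = 90$, so the population variance is

$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{4 + 225 + 25 + 0 + 64}{5} = \frac{318}{5} = 63.6$

To find the standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{63.6} \approx 7.97$.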

Example: Five students took an experimental exam and had scores of 92, 75, 95, 90, and 98.
Find the variance for their test scores.

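We will treat these 5 test scores as a sample. The sample mean is $\bar{x} = 90$, and the sample
variance divides by n - 1 rather than n:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{4 + 225 + 25 + 0 + 64}{4} = \frac{318}{4} = 79.5$

To find the standard deviation: $s = \sqrt{s^2} = \sqrt{79.5} \approx 8.92$.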
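
Both examples use the same five scores, so the only difference is the divisor in the variance. The sketch below assumes the usual definitions (divide by N for a population, by n - 1 for a sample); Python's `statistics` module provides `pvariance`, `pstdev`, `variance`, and `stdev`, which should agree with these results.

```python
from math import sqrt

scores = [92, 75, 95, 90, 98]
n = len(scores)

mean = sum(scores) / n
# Sum of squared deviations from the mean, shared by both calculations.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population: divide by the number of values.
population_variance = sum_of_squares / n
population_sd = sqrt(population_variance)

# Sample: divide by one less than the number of values.
sample_variance = sum_of_squares / (n - 1)
sample_sd = sqrt(sample_variance)

print(f"Population: variance {population_variance:g}, sd {population_sd:.2f}")
print(f"Sample:     variance {sample_variance:g}, sd {sample_sd:.2f}")
```

The sample values come out larger because the same sum of squared deviations is divided by 4 instead of 5.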

Some useful links:

http://www.stat.tamu.edu/stat30x/notes/node3.html

http://www.sdecnet.com/psychology/stathelp.htm

http://infinity.cos.edu/faculty/woodbury/Stats/Tutorial/Data_Descr_Infer.htm

http://onlinestatbook.com/chapter1/inferential.html

http://faculty.vassar.edu/lowry/webtext.html
