CHAPTER 1
BUM 2413 / BPF 3313
Contents

1.1 Overview
1.2 Statistical Problem-Solving Methodology
1.3 Review of Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
± By the end of this chapter, you should be able to:
± Define statistics, population, sample, parameter, statistic, descriptive statistics, and inferential statistics.
± Understand and explain why a knowledge of statistics is needed.
± Outline the 6 basic steps in the statistical problem-solving methodology.
± Identify various methods of obtaining samples.
± Discuss the role of computers and data analysis software in statistical work.
± Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
± Describe data using measures of variation, such as the range, variance, and standard deviation.
1.1 Overview
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
‡ Tens of thousands of parents in Malaysia have chosen StemLife as their trusted
stem cell bank.
‡ The average annual salary for a professional football player for the year 2001
was $1,100,500.
‡ The average cost of a wedding is nearly RM10,000.
‡ In the USA, the median salary for men with a bachelor's degree is $49,982, while
the median salary for women with a bachelor's degree is $35,408.
‡ Globally, an estimated 500,000 children under the age of 15 live with Type 1
diabetes.
‡ Women who eat fish once a week are 29% less likely to develop heart disease.
  
± Deal with uncertainty in repeated scientific
measurements

± Draw conclusions from data

± Design valid experiments and draw reliable conclusions

± Be a well-informed member of society

Statistics
± The science of conducting studies to collect, organize, summarize, analyze,
present, interpret, and draw conclusions from data.

Data
± Any values (observations or measurements) that have been collected.
Population
± The complete collection of measurements, outcomes, objects, or individuals under study.

Parameter
± A number that describes a population characteristic.

Tangible population
± Always finite; after an item is sampled, the population size decreases by 1.
± The total number of members is fixed and could be listed.

Conceptual population
± A population that consists of all the values that might possibly have been
observed; it has an unlimited number of members.

Sample
± A subset of a population, containing the objects or outcomes that are actually observed.

Statistic
± A number that describes a sample characteristic.


Example
± Consider a machine that makes steel rods for use in optical storage
devices. The specification for the diameter of the rods is 0.45 ± 0.02 cm.
During the last hour, the machine has made 1000 rods.
The quality engineer wants to know approximately how many of
these rods meet the specification. He does not have time to
measure all 1000 rods.
So he draws a random sample of 50 rods, measures them, and
finds that 46 of them (92%) meet the diameter specification. Now, it
is unlikely that the sample of 50 rods represents the population of
1000 perfectly.


The engineer might need to answer several questions based on the
sample data. For example:
1. How large is a typical difference for this kind of sample?
2. What interval gives a good estimate of the percentage of
acceptable rods in the population with reasonable certainty?
3. How certain can the engineer be that at least 90% of the rods are
good?

Statistics can help us to address questions like these.
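
As a rough illustration of question 2, the sketch below computes the sample proportion from the steel-rod example and an approximate 95% interval around it. The normal-approximation interval and the multiplier 1.96 are assumptions brought in for illustration; the slides do not specify a method, and the sketch ignores the fact that the population has only 1000 rods.

```python
import math

# Figures from the steel-rod example: 46 of the 50 sampled rods meet the specification.
n_sampled = 50
n_good = 46

# Sample proportion (a statistic that describes the sample).
p_hat = n_good / n_sampled

# Estimated standard error of the sample proportion.
std_error = math.sqrt(p_hat * (1 - p_hat) / n_sampled)

# Assumption: 1.96 is the normal multiplier for roughly 95% confidence.
z = 1.96
lower = p_hat - z * std_error
upper = p_hat + z * std_error

print(f"sample proportion = {p_hat:.2f}")
print(f"approximate 95% interval = ({lower:.3f}, {upper:.3f})")
```

The interval is only as trustworthy as the approximation behind it; with a proportion this close to 100% and a sample of 50, other interval methods may behave better.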


Descriptive Statistics
± Consists of the collection, organization, classification, summarization, and
presentation of data obtained from the sample.
± Used to describe the characteristics of the sample.
± Used to determine whether the sample represents the target population by
comparing the sample statistic with the population parameter.

Inferential Statistics
± Consists of generalizing from samples to populations, performing estimations
and hypothesis testing, determining relationships among variables, and making
predictions.
± Used to describe, infer, estimate, or approximate the characteristics of the
target population.
± Used when we want to draw a conclusion from the data obtained from the sample.


± Tens of thousands of parents in Malaysia have chosen StemLife as their trusted
stem cell bank. (Descriptive)

± The death rate from lung cancer was 10 times higher for smokers than for
nonsmokers. (Inferential)

± The average cost of a wedding is nearly RM10,000. (Descriptive)

± In the USA, the median salary for men with a bachelor's degree is $49,982, while
the median salary for women with a bachelor's degree is $35,408. (Descriptive)

± Globally, an estimated 500,000 children under the age of 15 live with Type 1
diabetes. (Inferential)

± A researcher claims that a new drug will reduce the number of heart attacks in
men over 70 years of age. (Inferential)
   
 
     

± A knowledge of statistics helps you to:
1. Describe and understand numerical relationships between variables
± There are a lot of data in this world, so we need to identify the right
variables.
2. Make better decisions
± Statistical methods allow people to make better decisions in the face of
uncertainty.
Examples
1. A management consultant wants to compare a client's
investment return for this year with related figures from last
year. He summarizes masses of revenue and cost data
from both periods and, based on his findings, presents his
recommendations to his client.

2. A college admission director needs to find an effective way
of selecting student applicants. He designs a statistical study
to see if there is a significant relationship between SPM
results and the GPA achieved by freshmen at his school. If
there is a strong relationship, a high SPM result will become
an important criterion for acceptance.
More examples
1. Suppose that the manager of "Big-Wig Executive Hair Stylist",
Alvin Tang, has advertised that 90% of the firm's customers
are satisfied with the company's services. If Pamela, a
consumer activist, feels that this is an exaggerated statement
that might require legal action, she can use statistical
inference techniques to decide whether or not to sue Alvin.

2. Students and professional people can also use the knowledge
gained from studying statistics to become better consumers
and citizens. For example, they can make intelligent decisions
about what products to purchase based on consumer studies,
about government spending based on utilization studies, and
so on.
1.2 Statistical Problem-Solving Methodology
6 Basic Steps
1. Identifying the problem or opportunity
2. Deciding on the method of data collection
3. Collecting the data
4. Classifying and summarizing the data
5. Presenting and analyzing the data
6. Making the decision
Step 1: Identifying the Problem or Opportunity
± Must clearly understand and correctly define the objective/goal
of the study
± If not, time and effort are wasted
± Is the goal to study some population?
± Is it to impose some treatment on the group and then test the
response?
± Can the study goal be achieved through simple counts or
measurements of the group?
± Must an experiment be performed on the group?
± If samples are needed, how large should they be, and how should they be
taken? The larger the better (more than 30)
± The larger the sample, the smaller the magnitude of
sampling errors.
± Survey studies need large samples because returning
the survey is voluntary.
± Large samples are easy to divide into subgroups.
± In mail surveys the response rate may be as
low as 20%-30%, so a larger sample is
required.
± Subject availability and cost are legitimate
considerations in determining an appropriate sample size.

Step 2: Deciding on the Method of Data Collection
± The data gathered must be accurate, as
complete as possible, and relevant to the
problem
± Data can be obtained in 3 ways
1. Data that are made available by others
(internal, external, primary or secondary data)
2. Data resulting from an experiment
(experimental study)
3. Data collected in an observational study
(observation, survey, questionnaire, interview)

Step 3: Collecting the Data
± Nonprobability data
± Is one in which the judgment of the experimenter, the
method in which the data are collected, or other factors
could affect the results of the sample
± 3 basic methods: judgment samples, voluntary samples,
and convenience samples

± Probability data
± Is one in which the chance of selection of each item in the
population is known before the sample is picked
± 4 basic methods: random, systematic, stratified, and
cluster samples.

Nonprobability samples

1. Judgment samples
± Based on the opinion of one or more experts
± Ex: A political campaign manager intuitively picks certain voting
districts as reliable places to measure public opinion of his
candidate

2. Voluntary samples
± Questions are posed to the public by publishing them over radio or
TV (responses by phone or SMS)

3. Convenience samples
± Take an 'easy sample' (the most conveniently available)
± Ex: A surveyor stands in one location and asks passersby their
questions

Probability samples

1. Random samples
± Selected using chance methods or random methods
± Example:
± A lecturer wants to study the physical fitness levels
of students at her university. There are 5,000
students enrolled at the university, and she wants to
draw a sample of size 100 to take a physical fitness
test. She obtains a list of all 5,000 students,
numbers them from 1 to 5,000, randomly selects 100
of those numbers, and invites the 100 corresponding
students to participate in the study.
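
A minimal sketch of the random-sample example, assuming the students are simply identified by their list numbers 1 to 5,000. The seed value is hypothetical and is only there to make the run repeatable.

```python
import random

population_size = 5000   # students on the lecturer's list
sample_size = 100        # students to invite

# Student numbers 1..5000, as in the numbered list.
student_numbers = range(1, population_size + 1)

# Hypothetical seed, used only so the example is reproducible.
rng = random.Random(2413)

# sample() draws without replacement, so no student is invited twice.
invited = rng.sample(student_numbers, sample_size)

print(sorted(invited)[:10])
```

Every student has the same chance of being invited, which is what makes this a probability sample.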
  

2. Systematic samples
± Number each subject of the population, then select every k-th subject.
± Example:
± A lecturer wants to study the physical fitness levels of
students at her university. There are 5,000 students
enrolled at the university, and she wants to draw a sample
of size 100 to take a physical fitness test. She obtains a list
of all 5,000 students, numbers them from 1 to 5,000, and
randomly picks one of the first 50 students (5000/100 = 50)
on the list. If the number picked is 30, the 30th student on
the list is invited first. She then invites every 50th student
on the list after this random start (the 80th student, the
130th student, etc.) to produce a sample of 100 students
to participate in the study.
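
The systematic procedure can be sketched as follows, again assuming students are identified by list numbers. The seed is hypothetical; `slide_case` reproduces the case where the random start is 30.

```python
import random

population_size = 5000
sample_size = 100

# Sampling interval k: every k-th student is taken.
k = population_size // sample_size   # 5000 / 100 = 50

# Random start among the first k students (hypothetical seed for repeatability).
rng = random.Random(30)
start = rng.randint(1, k)

# Start, then every k-th student after it.
invited = list(range(start, population_size + 1, k))

# The case described in the example: a random start of 30.
slide_case = list(range(30, population_size + 1, k))
print(slide_case[:3])   # [30, 80, 130]
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the interval.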
  

3. Stratified samples
± Divide the population into groups according to some
characteristic that is important to the study, then sample from
each group
± Example:
± A lecturer wants to study the physical fitness levels of students
at her university. There are 5,000 students enrolled at the
university, and she wants to draw a sample of size 100 to take a
physical fitness test. Assume that, because of different lifestyles,
the level of physical fitness differs between male and female
students. To account for this variation in lifestyle, the population
of students can easily be stratified into male and female students.
She can then use either a random method or a systematic method
to select the participants within each group. For example, she can
use random sampling to choose 50 male students and systematic
sampling to choose 50 female students, or the other way around.
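
A sketch of the stratified example. The split of 2,600 male and 2,400 female students is a made-up assumption (the slides do not give it), as are the ID labels and the seed; here both strata are sampled with the random method.

```python
import random

rng = random.Random(1)   # hypothetical seed for repeatability

# Hypothetical strata: the male/female split is not given in the example.
strata = {
    "male": [f"M{i}" for i in range(1, 2601)],
    "female": [f"F{i}" for i in range(1, 2401)],
}
per_stratum = 50   # 50 students from each group, 100 in total

# Draw a simple random sample inside each stratum.
sample = {name: rng.sample(members, per_stratum) for name, members in strata.items()}

invited = sample["male"] + sample["female"]
print(len(invited))   # 100
```

Sampling inside each group guarantees that both groups appear in the sample in the planned numbers.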
  

4. Cluster samples
± Divide the population into sections/clusters, randomly
select some of those clusters, and then choose all members of
the selected clusters
± Using cluster sampling can reduce cost and time.
± Example:
± A lecturer wants to study the physical fitness levels of students at
her university. There are 5,000 students enrolled at the university,
and she wants to draw a sample to take a physical fitness test.
Assume that, because of different lifestyles, the level of physical
fitness differs among freshmen, sophomores, juniors, and
seniors. To account for this variation in lifestyle, the
population of students can easily be clustered into freshmen,
sophomores, juniors, and seniors. She can then choose
any one cluster, such as the freshmen, and take all the freshmen
as participants.
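
A sketch of the cluster example. The cluster sizes below are invented so that they add up to 5,000; the labels and seed are also hypothetical.

```python
import random

rng = random.Random(7)   # hypothetical seed for repeatability

# Hypothetical cluster sizes (the example only gives the total of 5,000).
cluster_sizes = {"freshmen": 1500, "sophomores": 1300, "juniors": 1200, "seniors": 1000}

# Build the member list of each cluster.
clusters = {
    name: [f"{name}-{i}" for i in range(1, size + 1)]
    for name, size in cluster_sizes.items()
}

# Randomly select one cluster, then take every member of it.
chosen = rng.choice(sorted(clusters))
participants = clusters[chosen]

print(chosen, len(participants))
```

Unlike the stratified sketch, the sample size here is not fixed in advance: it is whatever the chosen cluster contains.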
Identify the type of sample obtained

Example 1
A physical education professor wants to study the
physical fitness levels of students at her university. There are
20,000 students enrolled at the university, and she wants to draw
a sample of size 100 to take a physical fitness test. She obtains a
list of all 20,000 students, numbered from 1 to 20,000, generates
100 random numbers in that range, and then invites the 100 students
corresponding to those numbers to participate in the study.

Example 2
A quality engineer wants to inspect rolls of wallpaper in order
to obtain information on the rate at which flaws in the printing are
occurring. She decides to draw a sample of 50 rolls of wallpaper from
a day's production. Each hour for 5 hours, she takes the 10 most
recently produced rolls and counts the number of flaws on each. Is
this a simple random sample?
Example 3
Suppose we have a list of 1000 registered voters in a community
and we want to pick a probability sample of 50. We can use a random
number table to pick one of the first 20 voters (1000/50 = 20) on our list.
If the table gave us the number of 16, the 16th voter on the list would be
the first to be selected. We would then pick every 20th name after this
random start (the 36th voter, the 56th voter, etc) to produce a sample.

Example 4
Consumer surveys of large cities often employ cluster sampling.
The usual procedure is to divide a map of the city into small blocks, each
block containing a cluster of households. A number of clusters are
selected for the sample, and all the households in a cluster are
surveyed. Using cluster sampling can reduce cost and time. Less
energy and money are expended if an interviewer stays within a specific
area rather than traveling across stretches of the city.
Example 5
Suppose our population is a university student body. We want to
estimate the average annual expenditures of a college student for
non-school items. Assume we know that, because of different lifestyles,
juniors and seniors spend more than freshmen and sophomores, but
there are fewer students in the upper classes than in the lower classes
because of some dropout factor. To account for this variation in lifestyle
and group size, the population of students can easily be stratified into
freshmen, sophomores, juniors, and seniors. A sample can be taken from
each stratum and each result weighted to provide an overall estimate of
average non-school expenditures.

Example 6
Researchers wanted to survey students in 100 homerooms in
secondary schools in a large school district. They could first randomly
select 10 schools from all the secondary schools in the district. Then,
from a list of homerooms in the 10 schools, they could randomly select
100.

Step 4: Classifying and Summarizing the Data
± Organize or group the facts/raw sample data for study
and investigation
± Classifying: identifying items with like characteristics and
arranging them into groups or classes.
± Ex: Production data (product make, location, production process,
etc.)
± Data can be classified as quantitative data and qualitative data.
± Summarization
± Graphical and descriptive statistics (tables, charts, measures of
central tendency, measures of variation, measures of position)
Data Classification
± Data are the values that variables can assume.
± A variable is a characteristic or attribute that can assume different
values.
± Variables whose values are determined by chance are called random
variables.

Variables can be classified by how they are categorized, counted, or measured
(the level of measurement of the data) as quantitative or qualitative.

Types of Data

Qualitative (categorical/attribute)
1) Data that refer only to name classification (done using numbers)
2) Can be placed into distinct categories according to some characteristic or attribute
± Nominal data (cannot be ranked)
Gender, race, citizenship, etc. Use codes (1, 2, ...)
± Ordinal data (can be ranked)
Feeling (dislike to like), color (dark to bright), etc.

Quantitative (numerical)
1) Data that represent counts or measurements (can be counted or measured)
2) Are numerical in nature and can be ordered or ranked
± Discrete variables
Assume values that can be counted and are finite
Ex: number of something
± Continuous variables
1. Can assume all values between any two specific values; obtained by measuring
2. Have boundaries and must be rounded because of the limits of the measuring device
Ex: weight, age, salary, height, temperature, etc.
Example

The Lemon Marketing Corporation has asked you for information about the car
you drive. For each question, identify each of the types of data requested as
either attribute data or numeric data. When numeric data is requested,
identify the variable as discrete or continuous.

1. What is the weight of your car?


2. In what city was your car made?
3. How many people can be seated in your car?
4. What's the distance traveled from your home to your school?
5. What's the color of your car?
6. How many cars are in your household?
7. What's the length of your car?
8. What's the normal operating temperature (in degrees Fahrenheit) of your car's
engine?
9. What gas mileage (miles per gallon) do you get in city driving?
10. Who made your car?
11. How many cylinders are there in your car's engine?
12. How many miles have you put on your car's current set of tyres?
        

Step 5: Presenting and Analyzing the Data
± Summarize and analyze the information given
by the graphical and descriptive statistics
± Identify the relationships in the information
± Make any relevant statistical inferences
(hypothesis testing, confidence intervals,
ANOVA, control charts, etc.)

 
Distribution Shapes
± Bell-shaped
± Has a single peak and tapers off at either end
± Approximately symmetric: roughly the same on both sides of a line running
through the center
± Uniform
± Basically flat/rectangular
± J-shaped
± Has a few data values on the left side and increases as one moves to the right
± Reverse J-shaped
± Opposite of J-shaped
± Has a few data values on the right side and increases as one moves to the left
± Right-skewed
± The peak is to the left
± The data values taper off to the right
± Left-skewed
± The peak is to the right
± The data values taper off to the left
± Bimodal
± Has 2 peaks of the same height
± U-shaped
± The shape is a U
"
   
Step 6: Making the Decision
± The researchers can make a list of all the
options and decisions that can achieve
the objective and goal of the research,
weigh the options, and choose the one
that represents the 'best' solution
to the problem.
± The correctness of this choice depends on
the analytical skill of the researchers and the
quality of the information.
[Figure: Statistical Problem-Solving Methodology flowchart with Yes/No decision branches]
Two software tools commonly used for data analysis
1. Spreadsheets
± Microsoft Excel & Lotus 1-2-3
2. Statistical Packages
± MINITAB, SAS, SPSS, and S-PLUS
1.3 Review of Descriptive Statistics
± Statistical methods can be used to summarize data.

± Measures of average are also called measures of central tendency and
include the mean, median, mode, and midrange.

± Measures that determine the spread of data values are called measures of
variation or measures of dispersion and include the range, variance, and
standard deviation.

± Measures of position tell where a specific data value falls within the data set
or its relative position in comparison with other data values. The most
common measures of position are percentiles, deciles, and quartiles.

± The measures of central tendency, variation, and position are part of what is
called traditional statistics. This type of statistics is typically used to confirm
conjectures about the data.
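± 1.3.1 Measures of Central Tendency

Mean
the sum of the values divided by the total number of values.

Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size

Example: 9 2 1 4 3 3 7 5 8 6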

  
± The mean is computed by using all the values of the data.
± The mean varies less than the median or mode when samples are
taken from the same population and all three measures are
computed for these samples.
± The mean is used in computing other statistics, such as variance.
± The mean for the data set is unique, and not necessarily one of the
data values.
± The mean cannot be computed for an open-ended frequency
distribution.
± The mean is affected by extremely high or low values and may not
be the appropriate average to use in these situations
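
As a small worked sketch, the mean of the example data set 9 2 1 4 3 3 7 5 8 6 (treated here as a sample) can be computed directly from the definition: sum the values, then divide by how many there are.

```python
# Example data set from this section.
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

total = sum(data)   # sum of the values: 48
n = len(data)       # number of values: 10

mean = total / n
print(mean)   # 4.8
```

Note that 4.8 is not one of the data values, which illustrates the property listed above.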
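± 1.3.1 Measures of Central Tendency

Median
the middle number of n ordered data (smallest to largest)

If $n$ is odd: $\text{Median} = x_{\frac{n+1}{2}}$
If $n$ is even: $\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$

Example (n odd): 9 2 1 3 3 7 5 8 6
Example (n even): 9 2 1 4 3 3 7 5 8 6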
  
± The median is used when one must find the center or middle
value of a data set.

± The median is used when one must determine whether the
data values fall into the upper half or lower half of the
distribution.

± The median is used to find the average of an open-ended
distribution.

± The median is affected less than the mean by extremely high
or extremely low values.
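
A minimal sketch of the median for the two example data sets of this section, one with an odd number of values and one with an even number. The function name `median` is just an illustrative choice.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th ordered value.
        return ordered[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th ordered values.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_data = [9, 2, 1, 3, 3, 7, 5, 8, 6]       # n = 9
even_data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]   # n = 10

print(median(odd_data))    # 5
print(median(even_data))   # 4.5
```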
± 1.3.1 Measures of Central Tendency

Mode
the most commonly occurring value in a data series

± The mode is used when the most typical case is desired.

± The mode is the easiest average to compute.

± The mode can be used when the data are nominal, such as
religious preference, gender, or political affiliation.

± The mode is not always unique. A data set can have more than
one mode, or the mode may not exist for a data set.

Example: 9 2 1 4 3 3 7 5 8 6
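
The sketch below finds the mode by counting how often each value occurs. Returning a list is a design choice for illustration: it lets the function report several modes, or none when every value occurs only once.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often; an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # every value occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(modes(data))              # [3]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]  (two modes)
print(modes([1, 2, 3]))         # []      (no mode)
```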
± 1.3.1 Measures of Central Tendency

Midrange
a rough estimate of the middle, and also a very rough
estimate of the average, that can be affected by one
extremely high or low value.

$MR = \frac{\text{lowest value} + \text{highest value}}{2}$

Example: 9 2 1 4 3 3 7 5 8 6
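
A short sketch of the midrange for the example data, followed by the same calculation after adding one extreme value. The extra value of 100 is invented purely to show the sensitivity mentioned in the definition.

```python
def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
print(midrange(data))   # 5.0

# Hypothetical extreme value added to show how strongly the midrange reacts.
with_extreme = data + [100]
print(midrange(with_extreme))   # 50.5
```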

[Figure: distribution shapes: symmetric; positively skewed (right-skewed); negatively skewed (left-skewed)]


± 1.3.2 Measures of Variation / Dispersion

± Used when a measure of central tendency alone is not informative
or not needed (e.g., the means of two data sets are the same)

± Measure the variability that exists in a data set

± Used to form a judgment about how well the average value
depicts the data

± Used to learn the extent of the scatter so that steps may be
taken to control the existing variation
± 1.3.2 Measures of Variation / Dispersion

Range
the difference between the highest value and the lowest value in a data set.
The symbol $R$ is used for the range.

$R = \text{highest value} - \text{lowest value}$

Example: 9 2 1 4 3 3 7 5 8 6
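
For the example data, the range works out as follows (a minimal sketch; the variable name `data_range` is chosen to avoid shadowing Python's built-in `range`).

```python
data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

highest = max(data)   # 9
lowest = min(data)    # 1

data_range = highest - lowest
print(data_range)   # 8
```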
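± 1.3.2 Measures of Variation / Dispersion

Variance
the average of the squares of the distance each value is from the mean.

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $N$ is the population size
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $n$ is the sample size

Standard Deviation
the square root of the variance

Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$

Example: 9 2 1 4 3 3 7 5 8 6
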
Uses of the Variance and Standard Deviation
± Variances and standard deviations can be used to determine the
spread of the data. If the variance or standard deviation is large, the
data are more dispersed. The information is useful in comparing two or
more data sets to determine which is more variable.
± The measures of variance and standard deviation are used to
determine the consistency of a variable.
± The variance and standard deviation are used to determine the number
of data values that fall within a specified interval in a distribution.
± The variance and standard deviation are used quite often in inferential
statistics.
± The standard deviation is used to estimate the amount of spread in the
population from which the sample was drawn.
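
As a worked sketch, the code below computes the variance and standard deviation of the example data 9 2 1 4 3 3 7 5 8 6 from the definitions, once treating the data as a whole population (divide by N) and once as a sample (divide by n - 1). The cross-check against Python's `statistics` module is an addition for illustration.

```python
import math
import statistics

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
n = len(data)
mean = sum(data) / n   # 4.8

# Squared distance of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Population formulas divide by N; sample formulas divide by n - 1.
population_variance = sum(squared_deviations) / n
sample_variance = sum(squared_deviations) / (n - 1)

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(f"population variance = {population_variance:.2f}")   # 6.36
print(f"sample variance     = {sample_variance:.4f}")       # 7.0667
print(f"sample sd           = {sample_sd:.4f}")             # 2.6583

# Cross-check against the standard library.
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(population_variance, statistics.pvariance(data))
```

The sample variance is always a little larger than the population variance for the same numbers, because it divides by n - 1 instead of n.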
Chebyshev's Theorem

If $\bar{x} = 11.27$, $n = 15$, and $s = 4.12$,
illustrate Chebyshev's theorem
for $k = 1$, $k = 2$, and $k = 3$
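
Chebyshev's theorem states that, for any distribution, at least 1 - 1/k² of the values lie within k standard deviations of the mean (the bound is only informative for k > 1). The sketch below applies it to the exercise above, under the assumption that 11.27 is the mean and 4.12 is the standard deviation; the symbols in the source are garbled, so this reading is a guess.

```python
# Assumption: 11.27 is the mean and 4.12 the standard deviation.
mean = 11.27
sd = 4.12

def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    return 1 - 1 / k**2

for k in (1, 2, 3):
    low = mean - k * sd
    high = mean + k * sd
    print(f"k = {k}: at least {chebyshev_bound(k):.1%} of values in ({low:.2f}, {high:.2f})")
```

For k = 1 the bound is 0%, so the theorem says nothing useful there; for k = 2 it guarantees at least 75%, and for k = 3 at least about 88.9%.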
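± 1.3.3 Measures of Position

Describing the position of a data value (data in increasing order)

Percentiles: split the data into 100 equal parts; position of the k-th percentile: $c = \frac{n \cdot k}{100}$
Deciles: split the data into 10 equal parts; position of the k-th decile: $c = \frac{n \cdot k}{10}$
Quartiles: split the data into 4 equal parts; position of the k-th quartile: $c = \frac{n \cdot k}{4}$

TIPS: If $c$ is not a whole number, round it up to the next whole number and use the value in that position.
If $c$ is a whole number, use the average of the values in positions $c$ and $c + 1$: $\frac{x_c + x_{c+1}}{2}$

Example: 9 2 1 4 3 3 7 5 8 6
$P_{15} = 2$, $D_3 = 3$, $Q_2 = 4.5$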
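
The rounding rule in the TIPS can be sketched as a single function that works for percentiles (100 parts), deciles (10 parts), and quartiles (4 parts). The function name and its parameters are illustrative choices; it assumes 0 < k < parts.

```python
def value_at_position(values, k, parts):
    """Value of the k-th percentile (parts=100), decile (parts=10) or quartile (parts=4).

    Uses the position c = n * k / parts with the rule from the TIPS:
    round c up if it is not whole, otherwise average positions c and c + 1.
    Assumes 0 < k < parts.
    """
    ordered = sorted(values)
    n = len(ordered)
    position, remainder = divmod(n * k, parts)
    if remainder == 0:
        # c is a whole number: average the c-th and (c + 1)-th ordered values.
        return (ordered[position - 1] + ordered[position]) / 2
    # c is not a whole number: round up and take the value in that position.
    return ordered[position]

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

print(value_at_position(data, 15, 100))   # 15th percentile: 2
print(value_at_position(data, 3, 10))     # 3rd decile: 3.0
print(value_at_position(data, 2, 4))      # 2nd quartile: 4.5
```

Statistical packages often use slightly different interpolation rules, so their percentiles may not match this hand-calculation rule exactly.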
Exercise
1. Given 9 2 1 4 3 7 5 4 6.
a) What percentile is the value of 8?
b) Find the value corresponding to the 4th decile.
c) Find the value corresponding to the 3rd quartile.

2. Given 9 22 11 14 13 3 7 15 18 16
a) Find the value corresponding to the 20th percentile.
b) What percentile is the value of 20?
c) Find the value corresponding to the 7th decile.

TIPS: The percentile corresponding to a given value $x$ is computed by:

$\text{Percentile} = \frac{(\text{number of values below } x) + 0.5}{\text{total number of values}} \times 100\%$
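
The percentile formula in the TIPS (number of values below x, plus 0.5, divided by the total number of values, times 100%) can be sketched as below. The value 18 is picked only as an illustration from the second data set; it is not one of the exercise questions.

```python
def percentile_rank(values, x):
    """Percentile of x: (number of values below x + 0.5) / total number of values * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

# Eight of the ten values are below 18.
print(percentile_rank(data, 18))   # 85.0
```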
  
Outliers
± An outlier is an extremely high or an extremely low data value when
compared with the rest of the data values.

± Outliers can be the result of measurement or observational error.

± When a distribution is normal or bell-shaped, data values that are
beyond three standard deviations of the mean can be considered
suspected outliers.

Example: 9 22 11 14 13 3 7 15 18 16 (no outliers)
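
A sketch of the three-standard-deviation check for the example data. It uses the sample standard deviation from Python's `statistics` module, which is an assumption, since the slide does not say whether the sample or population formula is intended.

```python
import statistics

data = [9, 22, 11, 14, 13, 3, 7, 15, 18, 16]

mean = statistics.mean(data)   # 12.8
sd = statistics.stdev(data)    # sample standard deviation (assumed)

# Values beyond three standard deviations of the mean are suspected outliers.
lower_limit = mean - 3 * sd
upper_limit = mean + 3 * sd
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(f"limits: ({lower_limit:.2f}, {upper_limit:.2f})")
print(outliers)   # []
```

The empty list agrees with the example's conclusion that this data set has no outliers.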
Casio fx-570MS
± Insert data
± MODE SD data M+
± Shift 1
± Shift 2
± Clear data
± Shift CLR 1

Casio fx-570W
± Insert data
± MODE SD data M+
± Shift 1
± Shift 2
± Shift 3
± Shift 4
± Clear data
± Shift AC/ON =
± 1.4.1 Stem and Leaf Plot

± A simple way to summarize a data set.


± Each item in the sample is divided into two
parts: a stem, consisting of the leftmost one or
two digits, and the leaf, which consists of the
next digit.
± It is a compact way to represent the data.
± It also gives us some indication of the shape of
our data.
Example 1
± Example: Duration of dormant periods of the geyser Old Faithful in
Minutes
± Stem-and-leaf plot:

4 259
5 0111133556678
6 067789
7 01233455556666699
8 000012223344456668
9 013

± Let's look at the first line of the stem-and-leaf plot. This represents
measurements of 42, 45, and 49 minutes.
± A good feature of these plots is that they display all the sample
values. One can reconstruct the data in its entirety from a
stem-and-leaf plot.
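
A minimal sketch of how a stem-and-leaf plot is built, using only the values that can be read off the rows for stems 4, 6, and 9 of the Old Faithful plot. It assumes two-digit values, so the stem is the tens digit and the leaf is the units digit; stems with no values are simply skipped.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for two-digit values."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # tens digit is the stem, units digit is the leaf
        rows[stem].append(leaf)
    return [
        f"{stem} {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(rows.items())
    ]

# Values read back from the rows with stems 4, 6 and 9 of the plot.
durations = [42, 45, 49, 90, 91, 93, 60, 66, 67, 67, 68, 69]

for row in stem_and_leaf(durations):
    print(row)
# 4 259
# 6 067789
# 9 013
```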
± 1.4.2 Box Plots

± A boxplot is a graphic that presents the median, the first and third
quartiles, and any outliers present in the sample.

± The interquartile range (IQR) is the difference between the third
quartile and the first quartile. This is the distance needed to span
the middle half of the data.
± STEP 1: Arrange the data in order
± STEP 2: Find the median
± STEP 3: Find Q1 and Q3
± STEP 4: Find outliers
± Points lying more than 1.5 times the interquartile
range above Q3 or below Q1:
$x < Q_1 - 1.5(Q_3 - Q_1)$ or $x > Q_3 + 1.5(Q_3 - Q_1)$

± STEP 5: Draw a scale for the data on the axis.
± STEP 6: Locate the lowest value, Q1, the median, Q3, the
highest value, and any outliers on the scale.
± STEP 7: Draw a box around Q1 and Q3, draw a vertical line
through the median, and connect the upper and lower values to the box.
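
Steps 1 to 4 can be sketched numerically before anything is drawn. The example below uses the data set 9 2 1 4 3 3 7 5 8 6 from section 1.3 and finds the quartiles with the position rule from the measures-of-position TIPS (c = n·k/4, rounded up if not whole, otherwise averaged with the next value). The helper name `quartile` is an illustrative choice; plotting itself is left out.

```python
def quartile(ordered, k):
    """k-th quartile of already-ordered data using the position c = n * k / 4."""
    n = len(ordered)
    position, remainder = divmod(n * k, 4)
    if remainder == 0:
        # Whole position: average the c-th and (c + 1)-th values.
        return (ordered[position - 1] + ordered[position]) / 2
    # Otherwise round up and take the value in that position.
    return ordered[position]

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

ordered = sorted(data)        # STEP 1: arrange the data
median = quartile(ordered, 2) # STEP 2: the median is the 2nd quartile
q1 = quartile(ordered, 1)     # STEP 3: Q1 and Q3
q3 = quartile(ordered, 3)

iqr = q3 - q1                 # STEP 4: outlier limits from the IQR
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"lowest = {ordered[0]}, Q1 = {q1}, median = {median}, Q3 = {q3}, highest = {ordered[-1]}")
print(f"IQR = {iqr}, limits = ({lower_fence}, {upper_fence}), outliers = {outliers}")
```

These five numbers plus the outlier list are exactly what STEP 6 places on the scale; other quartile conventions may give slightly different values of Q1 and Q3.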
 
1. Plot a boxplot for the following data. Then describe the data.

a) 9 22 11 14 13 3 7 15 18 16

b) 19 2 1 7 5 8 6

2. A dietician is interested in comparing the sodium content of real
cheese with the sodium content of a cheese substitute. The data
for two random samples are shown. Compare the distributions
using boxplots.

Real cheese: 310, 420, 45, 40, 220, 240, 180, 90

Cheese substitute: 270, 180, 250, 290, 130, 260, 340, 310
  



EXTRA INFO:
1. If the boxplots for two or more data sets are graphed on the same axis,
the distributions can be compared.
2. To compare the averages, use the location of the medians.
3. To compare the variability, use the location of the interquartile range.
[Figure: Anatomy of a Boxplot]
Conclusion
± The applications of statistics
are many and varied. People
encounter them in everyday
life, such as in reading
newspapers or magazines,
listening to the radio, or
watching television.
± By combining all of the
descriptive statistics
techniques discussed in this
chapter, the student
is now able to collect,
organize, summarize, and
present data.
  
± See you in Chapter 2: Commonly Used Probability Distribution
± Do your tutorial!
