
SECTION 2.4: Measures of Center There are three main statistical measures which attempt to locate a measure of center.

The main objectives of this section are to present the important measures of center and to show how to compute them.

Definition: A measure of center is a value at the center or middle of a data set. The measures of center that we will work with are the mean (arithmetic mean), the median, and the mode.

The arithmetic mean
For a sample from a larger population, the mean is denoted by $\bar{x}$. If all the values of the population are used, then the mean is denoted by $\mu$.

Notation:
$\sum$ denotes the addition of a set of values.
$x$ is the variable usually used to represent the individual data values.
$n$ represents the number of values in a sample.
$N$ represents the number of values in a population.
$\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample values.
$\mu = \frac{\sum x}{N}$ is the mean of all values in a population.

(a) Raw data
Example: Find the mean of the set of numbers: 63, 65, 67, 68, 69, 70, 71, 72, 74, 75
Solution: $n = 10$ and $\sum x = 694$, so $\bar{x} = 694/10 = 69.4$

(b) Ungrouped frequency distribution
For a frequency distribution, $\bar{x} = \frac{\sum fx}{\sum f}$
Example: The 30 members of an orchestra were asked how many instruments each could play. The results are set out in the frequency distribution. Calculate the mean number of instruments played.

Number of instruments, x     1    2    3    4    5
Frequency, f                11   10    5    3    1

Probability & Statistics Chapter 2 Describing, Exploring & Comparing Data

x      1    2    3    4    5
f     11   10    5    3    1     $\sum f = 30$
fx    11   20   15   12    5     $\sum fx = 63$

$\bar{x} = \frac{\sum fx}{\sum f} = 63/30 = 2.1$
The mean number of instruments played is 2.1.

(c) Grouped frequency distribution
When data has been grouped into intervals, the midpoint, x, of the interval is taken to represent the interval.
Example: The lengths of 40 bean pods were measured to the nearest cm and grouped as shown. Find the mean length, giving the answer to 1 d.p.

Length (cm)    Midpoint, x     f      fx
 4 - 8              6          2      12
 9 - 13            11          4      44
14 - 18            16          7     112
19 - 23            21         14     294
24 - 28            26          8     208
29 - 33            31          5     155
                         $\sum f = 40$   $\sum fx = 825$

$\bar{x} = \frac{\sum fx}{\sum f} = 825/40 = 20.6$ (1 d.p.)
The mean length of the bean pods is 20.6 cm (1 d.p.).

Weighted mean
In some situations the values vary in their degree of importance, so we may want to compute a weighted mean, which is a mean computed with the different scores assigned different weights, as shown in the formula below:
Weighted mean, $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
Example: Suppose we need a mean of three test scores (85, 90, 75), but the first test counts for 20%, the second test counts for 30%, and the third test counts for 50% of the final grade. We can assign weights of 20, 30, and 50 to the test scores, as follows:

$\bar{x} = \frac{\sum (w \cdot x)}{\sum w} = \frac{(20 \times 85) + (30 \times 90) + (50 \times 75)}{20 + 30 + 50} = \frac{8150}{100} = 81.5$
The weighted mean formula is used to calculate grade-point average.

Trimmed mean
An important advantage of the mean is that it takes every value into account, but an important disadvantage is that it is sometimes dramatically affected by a few extreme values (outliers). Because the mean is very sensitive to extreme values, we say that it is not a resistant measure of center. To overcome this disadvantage, a trimmed mean can be used. To find the 10% trimmed mean for a data set, first arrange the data in order, then delete the bottom 10% of the values and the top 10% of the values, and calculate the mean of the remaining values.

Exercise: Determine the arithmetic mean for the given set of data. Then determine the trimmed mean for the same data set, and compare both results.
Weights of anesthetized bears:
 80  344  416  348  166  220  262  360  204  144  332   34  140  180  105
166  204   26  120  436  125  132   90   40  220   46  154  116  182  150
 65  356  316   94   86  150  270  202  365   79  148  446   62  236  212
 60   64  114   76   48   29  514  140

The median
The median of a data set is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If there are n numbers, the median is the $\frac{1}{2}(n + 1)$th value.

Procedure for finding the median:
Sort the data (arrange in increasing order).
Is the number of values odd or even?
Odd: the median is the value in the exact middle.
Even: the median is the mean of the two middle numbers (add the middle numbers and divide by 2).
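The formulas for the mean of a frequency distribution, the weighted mean, the trimmed mean and the median can be sketched in Python as follows. This is only an illustrative sketch: the function names are made up for this example, the small data set used for the trimmed mean is hypothetical, and the rule of dropping int(fraction * n) values from each end is an assumption (other conventions for rounding the number of trimmed values exist).

```python
from statistics import mean, median


def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(data, fraction):
    """Mean after dropping a fraction (below 0.5) of the sorted values from EACH end.

    Assumption: the number dropped per end is int(fraction * n), i.e. rounded down.
    """
    ordered = sorted(data)
    k = int(fraction * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Examples taken from the notes above.
raw_mean = mean([63, 65, 67, 68, 69, 70, 71, 72, 74, 75])
instruments_mean = frequency_mean([1, 2, 3, 4, 5], [11, 10, 5, 3, 1])
test_score_mean = weighted_mean([85, 90, 75], [20, 30, 50])

# Hypothetical data with one extreme value at each end.
made_up = [2, 40, 41, 42, 43, 44, 45, 46, 47, 300]
plain = mean(made_up)                   # pulled upward by the value 300
trimmed = trimmed_mean(made_up, 0.10)   # drops 2 and 300 before averaging

# statistics.median sorts the data and handles the odd/even cases.
odd_median = median([7, 7, 2, 3, 4, 2, 7, 9, 31])
even_median = median([36, 41, 27, 32, 29, 38, 39, 43])

print(raw_mean, instruments_mean, test_score_mean)
print(plain, trimmed)
print(odd_median, even_median)
```

The three means reproduce the worked answers 69.4, 2.1 and 81.5. For the hypothetical data, the ordinary mean is 65 while the 10% trimmed mean is 43.5, which shows how a trimmed mean resists extreme values.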


Determining the median from
(a) Raw data:
Example: Find the median of each of the sets.
(i) 7, 7, 2, 3, 4, 2, 7, 9, 31
Solution: In order of magnitude: 2, 2, 3, 4, 7, 7, 7, 9, 31
n = 9. The median is the $\frac{1}{2}(9 + 1)$th value, i.e. the 5th value. So median = 7

(ii) 36, 41, 27, 32, 29, 38, 39, 43
Solution: In order of magnitude: 27, 29, 32, 36, 38, 39, 41, 43
n = 8 and the median is the $\frac{1}{2}(8 + 1)$th value, i.e. the 4.5th value. This does not exist, so we consider the 4th and 5th values.
Median = $\frac{1}{2}(36 + 38) = 37$

(b) Ungrouped frequency distribution:
The median can be found directly from the cumulative frequency distribution.
Example: The table below shows the number of children in the family for 35 families in a certain area. Find the median number of children per family.

Number of children    Frequency    Cumulative frequency, cf
        0                 3                  3
        1                 5                  8
        2                12                 20
        3                 9                 29
        4                 4                 33
        5                 2                 35

The median is the 18th value: $\frac{1}{2}(35 + 1) = 18$
We could have written out all the values in order from the frequency table, thus 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, ... However, we can see from the cumulative frequency table that the 18th value is 2, as the first 8 values are 0 or 1 and the first 20 values are 0 or 1 or 2.

(c) Grouped frequency distribution:
Once the information has been grouped and the raw data lost, we can only estimate a value for the median. We will consider two methods to determine an approximate value for the median:
(i) by calculation
(ii) from a cumulative frequency curve


Example: The masses, measured to the nearest kg, of 49 boys are noted and the distribution formed. Estimate the median mass.

Mass (kg)     f        Mass (kg)     cf
   - 59       0        < 59.5         0
60 - 64       2        < 64.5         2
65 - 69       6        < 69.5         8
70 - 74      12        < 74.5        20
75 - 79      14        < 79.5        34
80 - 84      10        < 84.5        44
85 - 89       5        < 89.5        49

The median is the $\frac{1}{2}(49 + 1)$th value, i.e. the 25th value.

Method (a): by calculation
The 25th value lies in the class 74.5 - 79.5. There are 14 items in this class, and 20 values lie below 74.5, so the median is the 5th of these 14 items. The median is therefore 5/14 of the way along the interval of 5 kg from 74.5 to 79.5.
Estimate of the median mass = 74.5 + (5/14)(5) = 76.3 kg (1 d.p.)

Method (b): from the cumulative frequency curve
Draw the ogive (cf curve) and read off the value corresponding to a cumulative frequency of 25.
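Method (a) above is linear interpolation inside the median class. A minimal Python sketch of that calculation follows; the function name and argument layout are assumptions for illustration, and the target position $\frac{1}{2}(n + 1)$ follows these notes (some texts use $n/2$ for grouped data instead).

```python
def grouped_median(boundaries, freqs):
    """Estimate the median of grouped data by linear interpolation.

    boundaries: class boundaries in increasing order, one more than len(freqs)
    freqs:      frequency of each class
    """
    n = sum(freqs)
    target = (n + 1) / 2           # position of the median, as in the notes
    cumulative = 0                 # number of values below the current class
    for i, f in enumerate(freqs):
        if cumulative + f >= target:
            lower = boundaries[i]
            width = boundaries[i + 1] - lower
            # fraction of the way through the median class
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("no data")


# Masses of the 49 boys, using the class boundaries from the cf table.
mass_boundaries = [59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5]
mass_freqs = [2, 6, 12, 14, 10, 5]
median_mass = grouped_median(mass_boundaries, mass_freqs)
print(round(median_mass, 1))
```

For the masses example this gives 74.5 + (5/14)(5), which rounds to the 76.3 kg found by hand.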
[Figure: Ogive showing the masses of 49 boys. Cumulative frequency (0 to 60) is plotted against mass (kg) at the class boundaries 59.5, 64.5, 69.5, 74.5, 79.5, 84.5 and 89.5.]

From the graph, the value corresponding to the cumulative frequency of 25 is 76.3 kg.

The mode
The mode is the value that occurs most often (has the highest frequency). For a given data set, more than one mode can exist.
Two modes: bimodal
More than two modes: multimodal

Determining the mode from
(a) Raw data:
Example: Find the mode(s) of each of the following sets.
(i) 4, 5, 5, 1, 2, 9, 5, 6, 4, 5, 7, 5, 5
Solution: mode = 5
(ii) 2, 2, 3, 5, 8, 2, 5, 6, 6, 5
Solution: modes = 2, 5 (The distribution is bimodal.)

(b) Grouped data:
When data has been grouped into classes, the class which has the largest frequency is called the modal class. An estimate of the mode can be obtained from the modal class.
Example: Estimate the mode of the following frequency distribution, which shows the marks of 330 candidates in an examination.

Marks        Frequency
11 - 20          20
21 - 30          40
31 - 40          80
41 - 50         100
51 - 60          50
61 - 70          20
71 - 80          10
81 - 90          10
91 - 100          0

Solution: First a histogram or bar chart is constructed.
[Figure: Histogram to show examination marks. Frequency (0 to 120) is plotted against Marks (10 to 100).]

The modal class is 41 - 50. The modal class contains 20 more than the class below and 50 more than the class above, so the mode is likely to divide the modal class in the ratio 20 : 50 = 2 : 5.
An estimate of the mode can be found from the histogram by drawing lines as shown in the diagram. This gives a value of 43 marks.
By calculation: An estimate of the mode is 20/(20 + 50) of the way along the interval of 10 marks from 40 to 50.
Estimate of mode = 40 + (2/7)(10) = 42.9
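The calculation above can be written as a short Python sketch. The function name and arguments are assumptions for illustration; following the worked example, each class is described by the lower end of its interval (40 for the class 41 - 50) and all classes are taken to have the same width.

```python
def grouped_mode(lower_ends, width, freqs):
    """Estimate the mode from the modal class of equal-width grouped data."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])      # index of the modal class
    below = freqs[i - 1] if i > 0 else 0                     # frequency of the class below
    above = freqs[i + 1] if i < len(freqs) - 1 else 0        # frequency of the class above
    d1 = freqs[i] - below      # excess over the class below
    d2 = freqs[i] - above      # excess over the class above
    if d1 + d2 == 0:
        raise ValueError("no single modal class stands out")
    # the mode divides the modal class in the ratio d1 : d2
    return lower_ends[i] + d1 / (d1 + d2) * width


# Examination marks of the 330 candidates.
mark_lower_ends = [10, 20, 30, 40, 50, 60, 70, 80, 90]
mark_freqs = [20, 40, 80, 100, 50, 20, 10, 10, 0]
mode_estimate = grouped_mode(mark_lower_ends, 10, mark_freqs)
print(round(mode_estimate, 1))
```

Here d1 = 20 and d2 = 50, so the estimate is 40 + (20/70)(10), which rounds to 42.9 as in the hand calculation.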

Exercise:
1] If the mean of the following numbers is 17, find the value of c: 12, 18, c, 13
2] The mean of 10 numbers is 8. If an eleventh number is now included in the results, the mean becomes 9. What is the value of the eleventh number?
3] The mean of 4 numbers is 5, and the mean of 3 different numbers is 12. What is the mean of the 7 numbers together?
4] A bag contained five balls each bearing one of the numbers 1, 2, 3, 4, 5. A ball was drawn from the bag, its number noted, and then replaced. This was repeated 50 times and the table below shows the resulting frequency distribution.

Number       1    2    3    4    5
Frequency    x   11    y    8    9

(i) If the mean is 2.7, determine the value of x and y.
(ii) State the mode and median of this distribution.
5] On a certain day the number of books on 40 shelves in a library was noted and grouped as shown. Find the mean number of books on a shelf. Give your answer to 2 significant figures.

Number of books      31 - 35   36 - 40   41 - 45   46 - 50   51 - 55   56 - 60
Number of shelves       4         6        10        13         5         2

Skewness
A comparison of the mean, median, and mode can reveal information about the characteristic of skewness, defined below:
A distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half.
A distribution of data is skewed if it is not symmetric and if it extends more to one side than the other.
Lopsided to the right = skewed to left = negatively skewed
Lopsided to the left = skewed to right = positively skewed
Data not lopsided = symmetric = zero skewness


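COMPARISON OF MEAN, MEDIAN, AND MODE:

Mean
Definition: $\bar{x} = \frac{\sum x}{n}$
How common? Most familiar average
Existence: Always exists
Takes every value into account? yes
Affected by extreme values? yes
Advantages and disadvantages: Works with many statistical methods

Median
Definition: Middle score
How common? Commonly used
Existence: Always exists
Takes every value into account? no
Affected by extreme values? no
Advantages and disadvantages: Often a good choice if there are some extreme values

Mode
Definition: Most frequent score
How common? Sometimes used
Existence: Might not exist; may be more than one mode
Takes every value into account? no
Affected by extreme values? no
Advantages and disadvantages: Appropriate for data at the nominal level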

General comments:
For a data collection that is approximately symmetric with one mode, the mean, median, and mode tend to be about the same.
For a data collection that is obviously asymmetric, it would be good to report both the mean and median.
The mean is relatively reliable. That is, when samples are drawn from the same population, the sample means tend to be more consistent than the other averages (consistent in the sense that the means of samples drawn from the same population don't vary as much as the other averages).

Using Technology
Exercise: Microsoft Excel can calculate all measures of central tendency.

Example 1: Cambridge Power and Light Company selected 20 residential customers. Following are the amounts, to the nearest dollar, the customers were charged for electrical services last month.
54 48 58 50 25 47 75 46 60 70
67 68 39 35 56 66 33 62 65 67
What are the mean, median, and mode of these amounts?
On a new worksheet, key your data in column A. Give A1 a title. Type the data in cells A2 to A21. To calculate:
the mean, key =AVERAGE(A2:A21) in cell A23. You may type Mean in a cell beside it.
the median, key =MEDIAN(A2:A21) in cell A24.
the mode, key =MODE(A2:A21) in cell A25. If there is no mode, Excel will display #N/A in that cell. If there is more than one mode, Excel will display the one that occurs first in the string of data.

Example 2: The weighted mean
Carter Construction Company pays its hourly employees either $6.50, $7.50, or $8.50 per hour. There are 26 hourly employees: 14 are paid at the $6.50 rate, 10 at the $7.50 rate, and 2 at the $8.50 rate. What is the weighted mean hourly rate paid to the 26 employees?
Key Employee in F1, Rate in G1, Product in H1.
Key 14, 10, 2 in F2 to F4 respectively.
Key 6.5, 7.5, 8.5 in G2 to G4 respectively.
In H2, key =F2*G2
Make H2 your active cell. Place your cursor on the bottom right handle. You will have a thick black plus sign. Click and drag to H3:H4.
Highlight F2:F4. From the toolbar choose the AutoSum button.
Highlight H2:H4. From the toolbar choose the AutoSum button.
In J7, key =H5/F5
You may type weighted mean in an adjacent cell to this result.

Example 3: To find the 10% trimmed mean of the data Weights of Anesthetized Bears (previous exercise), key in the data as usual in column form, labeling the data in the first cell. Key =TRIMMEAN( : , 20%) in a new cell. Excel's percent argument is the total fraction of values excluded from both ends combined, so 20% removes 10% from the bottom and 10% from the top.

SECTION 2.5: Measures of Variation
Because variation is so important in statistics, this is one of the most important sections. The following key concepts are discussed in detail:
(1) Variation refers to the amount that values vary among themselves, and it can be measured with specific numbers;
(2) Values that are relatively close together have lower measures of variation, and values that are spread farther apart have measures of variation that are larger;
(3) The standard deviation, which is a particularly important measure of variation, can be computed;
(4) The values of standard deviation must be interpreted correctly.
Example: Waiting times of customers (in minutes) at the JV Bank (where all customers enter a single waiting line) and the Bank of P (where customers wait in individual lines at three different teller windows):
JV: 6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7
P: 4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0
(a) Determine the mean, median, and mode for each data set.
(b) Interpret the results by determining whether there is a difference between the two data sets that is not apparent from a comparison of the measures of center. If so, how are the data sets different?

We will now develop some specific ways to measure variation.

Range
The range of a data set is the difference between the highest value and the lowest value.
Range = (highest value) - (lowest value)
For JV: Range = 7.7 - 6.5 = 1.2 min
For P: Range = 10.0 - 4.2 = 5.8 min

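Standard deviation of a sample
The standard deviation of a set of sample values is a measure of variation of values from the mean.
Formula (sample standard deviation): $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Alternative formula: $s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$ (this form divides by $n$ rather than $n - 1$, so it agrees with the first formula only approximately, more closely for large $n$)
There is also a shortcut formula for the standard deviation: $s = \sqrt{\frac{n(\sum x^2) - (\sum x)^2}{n(n - 1)}}$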

Example: Use the first standard deviation formula to find the standard deviation of the JV Bank customer waiting times. Those times (in minutes) are listed in the first column of the table below. Here $\sum x = 71.5$, so $\bar{x} = 71.5/10 = 7.15$.

$x$       $x - \bar{x}$     $(x - \bar{x})^2$
6.5         -0.65             0.4225
6.6         -0.55             0.3025
...          ...               ...
7.7          0.55             0.3025
$\sum x = 71.5$              $\sum (x - \bar{x})^2 = 2.0450$

$s = \sqrt{\frac{2.0450}{10 - 1}} = 0.48$ min

Example: Here is the same example, but the shortcut formula is used. Find $n$, $\sum x$, and $\sum x^2$.
$n = 10$ (sample size = 10)
$\sum x = 71.5$ (sum of the 10 sample values)
$\sum x^2 = 513.27$ ($= 6.5^2 + 6.6^2 + 6.7^2 + \dots + 7.7^2$)
$s = \sqrt{\frac{10(513.27) - (71.5)^2}{10(10 - 1)}} = 0.48$ min
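The two worked examples can be checked with a short Python sketch that computes the range and then the sample standard deviation both from the definition and from the shortcut formula. The function names are illustrative only; the data are the JV Bank and Bank of P waiting times given earlier.

```python
from math import sqrt


def sd_definition(data):
    """Sample standard deviation from the definition: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(data)
    m = sum(data) / n
    return sqrt(sum((x - m) ** 2 for x in data) / (n - 1))


def sd_shortcut(data):
    """Sample standard deviation from the shortcut formula."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


jv = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
p = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

for name, times in (("JV", jv), ("P", p)):
    value_range = max(times) - min(times)
    print(name, round(value_range, 1), round(sd_definition(times), 2), round(sd_shortcut(times), 2))
```

Both formulas give 0.48 min for the JV Bank. The Bank of P has the same mean but a noticeably larger standard deviation, which is the difference that the measures of center do not reveal.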


Standard deviation of a population
Here is the formula for the standard deviation for a population, denoted by $\sigma$:
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
where $\mu$ is the population mean and $N$ is the population size.

Variance of a sample and population
The variance of a set of values is a measure of variation equal to the square of the standard deviation.
Sample variance: square of the standard deviation $s$, denoted by $s^2$.
Example: From one of our previous examples (JV Bank customer waiting times), $s = 0.48$ min, more precisely $s = 0.4767$ min. So $s^2 = (0.4767 \text{ min})^2 = 0.23 \text{ min}^2$
Population variance: square of the population standard deviation $\sigma$, denoted by $\sigma^2$.
Round-off rule: Carry one more decimal place than is present in the original set of values.

Finding standard deviation from a frequency table
Sometimes it is necessary to compute the standard deviation of a data set that is summarized in the form of a frequency table. If the original list of sample values is available, use those values with the previous standard deviation formulae to get more exact results. If the original data are not available, use the formula:
$s = \sqrt{\frac{n(\sum f x^2) - (\sum f x)^2}{n(n - 1)}}$, where $x$ is the class midpoint and $n = \sum f$.
Example: Use the following table to calculate the standard deviation.

Word rating     f      x      fx      fx^2
 0 - 2         20      1      20        20
 3 - 5         14      4      56       224
 6 - 8         15      7     105       735
 9 - 11         2     10      20       200
12 - 14         1     13      13       169
$\sum f = 52$        $\sum fx = 214$   $\sum fx^2 = 1348$

$s = \sqrt{\frac{52(1348) - (214)^2}{52(52 - 1)}} = 3.0$
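The frequency-table formula above can be sketched in Python as follows. The function name is an assumption for illustration, and the class midpoints stand in for the lost raw values, so the result is an estimate.

```python
from math import sqrt


def grouped_sd(midpoints, freqs):
    """Sample standard deviation estimated from class midpoints and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


# Word rating table: classes 0-2, 3-5, 6-8, 9-11, 12-14.
rating_midpoints = [1, 4, 7, 10, 13]
rating_freqs = [20, 14, 15, 2, 1]
rating_sd = grouped_sd(rating_midpoints, rating_freqs)
print(round(rating_sd, 1))
```

With n = 52, the sums are 214 and 1348 as in the table, and the result rounds to 3.0.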


Interpreting and understanding standard deviation
The standard deviation measures the variation among values. Values close together will yield a small standard deviation, whereas values spread farther apart will yield a larger standard deviation. There are three different ways of developing a sense for values of standard deviation:

(1) Range rule of thumb
Based on the principle that for many data sets, the vast majority (such as 95%) of sample values lie within 2 standard deviations of the mean: $\bar{x} \pm 2s$
For estimation: To obtain a rough estimate of the standard deviation $s$, use $s \approx \frac{\text{range}}{4}$, where range = (highest value) - (lowest value).
Example: Previous results from the National Health Survey show that the heights of men have a mean of 69.0 inches and a standard deviation of 2.8 inches. Use the range rule of thumb to find the minimum and maximum usual heights.
$\bar{x} \pm 2s = 69.0 \pm 2(2.8)$
Based on these results, we expect that typical men will range in height between 63.4 inches and 74.6 inches.

(2) Empirical rule for data with a bell-shaped distribution
For data sets having a distribution that is approximately bell-shaped, the following properties apply:
about 68% of all values fall within 1 standard deviation of the mean ($\bar{x} \pm s$)
about 95% of all values fall within 2 standard deviations of the mean ($\bar{x} \pm 2s$)
about 99.7% of all values fall within 3 standard deviations of the mean ($\bar{x} \pm 3s$)
Example: The heights of men have a bell-shaped distribution with a mean of 69.0 inches and a standard deviation of 2.8 inches (based on data from the National Health Survey). What percentage of men have heights between 60.6 inches and 77.4 inches?
Solution: 60.6 inches and 77.4 inches are each exactly 3 standard deviations away from the mean of 69.0 inches: $69.0 \pm 3(2.8)$. According to the empirical rule, 99.7% of all men's heights are between 60.6 inches and 77.4 inches.

(3) Chebyshev's theorem
Chebyshev's theorem applies to any data set, unlike the empirical rule, but its results are very approximate. It says: The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least $1 - \frac{1}{K^2}$, where K is any positive number greater than 1. For K = 2 and K = 3, the following results are obtained:
at least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean.
at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean.
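The three rules above reduce to a few lines of arithmetic. The following Python sketch uses the heights of men quoted in the notes (mean 69.0 inches, standard deviation 2.8 inches); the function names are made up for this illustration.

```python
def interval(mean, sd, k):
    """Values lying within k standard deviations of the mean."""
    return mean - k * sd, mean + k * sd


def range_rule_sd(highest, lowest):
    """Rough estimate of the standard deviation: range / 4."""
    return (highest - lowest) / 4


def chebyshev_bound(k):
    """Minimum proportion of any data set within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem needs k > 1")
    return 1 - 1 / k ** 2


usual_low, usual_high = interval(69.0, 2.8, 2)   # range rule of thumb, about 95% if bell-shaped
wide_low, wide_high = interval(69.0, 2.8, 3)     # about 99.7% if bell-shaped

print(round(usual_low, 1), round(usual_high, 1))
print(round(wide_low, 1), round(wide_high, 1))
print(chebyshev_bound(2), round(chebyshev_bound(3), 2))
print(round(range_rule_sd(7.7, 6.5), 2))         # JV Bank waiting times, range 1.2 min
```

The intervals are 63.4 to 74.6 inches and 60.6 to 77.4 inches. Chebyshev's theorem guarantees only 75% and about 89% for those same intervals, because it makes no assumption about the shape of the distribution.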

Example: Heights of men have a mean of 69.0 inches and a standard deviation of 2.8 inches. What can we conclude from Chebyshev's theorem?
Within 2 standard deviations: At least 3/4 (or 75%) of all men have heights within 2 standard deviations of the mean (63.4 in. to 74.6 in.)
Within 3 standard deviations: At least 8/9 (or 89%) of all men have heights within 3 standard deviations of the mean (60.6 in. to 77.4 in.)

Exercise:
1] Find the range, variance, and standard deviation for each of the two samples, then compare the two sets of results.
Maximum breadth of samples of male Egyptian skulls from 4000 BC and 150 AD
4000 BC: 131 119 138 125 129 126 131 132 126 128 128 131
150 AD:  136 130 126 126 139 141 137 138 133 131 134 129
(Based on data from Ancient Races of the Thebaid by Thomson and Randall-Maciver.)

2] Find the standard deviation of the data summarized in the given frequency table. Samples of students' cars and faculty/staff cars were obtained at a certain college, and their ages (in years) are summarized in the frequency table.

Age        Students    Faculty/staff
 0 - 2        23            30
 3 - 5        33            47
 6 - 8        63            36
 9 - 11       68            30
12 - 14       19             8
15 - 17       10             0
18 - 20        1             0
21 - 23        0             1

3] Two different sections of a statistics class take the same quiz and the scores are recorded below. Do a double stem-and-leaf plot for both data sets. Discuss the variation of the data in each of the sets and compare. Find the range and standard deviation for each section. What do the range values lead you to conclude about the variation in the two sections? Why is the range misleading in this case? What do the standard deviation values lead you to conclude about the variation in the two sections?
Section 1: 1 20 20 20 20 20 20 20 20 20 20
Section 2: 2 3 4 5 6 14 15 16 17 18 19


4] Let a population consist of the values 1, 2, and 3. Assume that samples of two values are randomly selected with replacement.
(a) Find the variance $\sigma^2$ of the population {1, 2, 3}.
(b) List the nine different possible samples and find the sample variance $s^2$ for each of them. If you repeatedly select two items, what is the mean value of the sample variances $s^2$?
(c) For each of the nine samples, find the variance by treating each sample as if it is a population. (Be sure to use the formula for population variance.) If you repeatedly select two items, what is the mean value of the population variances?
(d) Which approach results in values that are better estimates of $\sigma^2$: part (b) or part (c)? Why? When computing variances of samples, should you use division by $n$ or $n - 1$?
(e) The preceding parts show that $s^2$ is an unbiased estimator of $\sigma^2$. Is $s$ an unbiased estimator of $\sigma$?

Using technology
Here are the standard Excel commands for calculating:
(i) Range: =MAX( : ) - MIN( : )
(ii) Sample variance, $s^2$: =VAR( : )
     Population variance, $\sigma^2$: =VARP( : )
(iii) Sample standard deviation, $s$: =STDEV( : )
      Population standard deviation, $\sigma$: =STDEVP( : )
Use Example 1 above to calculate the measures of variation (dispersion) using the appropriate commands from above.
NOTE: Microsoft Excel can create a summary of all calculations for descriptive statistics. Under Tools, locate Data Analysis. Locate Descriptive Statistics. Follow the instructions on the dialog box. Click on Input Range and key in the range of cells where the data has been placed (example: A2:A40). Leave the Columns button clicked. Click on Summary Statistics, then OK.

SECTION 2.6: Measures of Position
This section introduces measures that can be used to compare values from different data sets or to compare values within the same data set. The basic tools are z scores, quartiles, and percentiles.

z Scores
A standard score, or z score, is the number of standard deviations that a given value x is above or below the mean. It is found using the expressions
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$

Example: Former NBA superstar Michael Jordan is 78 inches tall, and WNBA basketball player Rebecca Lobo is 76 inches tall. Jordan is obviously taller by 2 inches, but which player is relatively taller? Does Jordan's height among men exceed Lobo's height among women? Men have heights with a mean of 69.0 inches and a standard deviation of 2.8 inches; women have heights with a mean of 63.6 inches and a standard deviation of 2.5 inches.
Solution: Using $z = \frac{x - \mu}{\sigma}$ for both data sets:
For Jordan: $x = 78$ in., $\mu = 69.0$ in., $\sigma = 2.8$ in., so $z = \frac{78 - 69.0}{2.8} = 3.21$
For Lobo: $x = 76$ in., $\mu = 63.6$ in., $\sigma = 2.5$ in., so $z = \frac{76 - 63.6}{2.5} = 4.96$
Interpretation: Jordan's height is 3.21 standard deviations above the mean, but Lobo's height is 4.96 standard deviations above the mean. Lobo's height among women is relatively greater than Jordan's height among men.
Consider another player, Muggsy Bogues, who is only 63 inches tall. When his height is converted to a z score,
$z = \frac{63 - 69.0}{2.8} = -2.14$
his height is 2.14 standard deviations below the mean (because the z score is negative). He is a relatively short person among the population of men.
NOTE:
Ordinary values fall within 2 standard deviations of the mean: $-2 \le z \le 2$
Unusual values fall more than 2 standard deviations below or above the mean: $z < -2$ or $z > 2$

Quartiles and Percentiles
Quartiles ($Q_1$, $Q_2$, $Q_3$) divide the sorted values into four equal parts.
For $Q_1$: At least 25% of the sorted values will be less than or equal to $Q_1$. $Q_1$ = $\frac{1}{4}(n + 1)$th value
For $Q_2$: At least 50% of the sorted values will be less than or equal to $Q_2$. $Q_2$ = median = $\frac{1}{2}(n + 1)$th value
For $Q_3$: At least 75% of the sorted values will be less than or equal to $Q_3$. $Q_3$ = $\frac{3}{4}(n + 1)$th value
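The z score calculations for the three basketball players above can be reproduced with a minimal Python sketch. The function names are illustrative, and the cut-off of 2 standard deviations for "unusual" is the one stated in the NOTE.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def is_unusual(z):
    """Unusual values lie more than 2 standard deviations from the mean."""
    return z < -2 or z > 2


jordan = z_score(78, 69.0, 2.8)   # compared with men
lobo = z_score(76, 63.6, 2.5)     # compared with women
bogues = z_score(63, 69.0, 2.8)   # compared with men

for name, z in (("Jordan", jordan), ("Lobo", lobo), ("Bogues", bogues)):
    print(name, round(z, 2), "unusual" if is_unusual(z) else "ordinary")
```

All three heights are unusual by this criterion: two are unusually tall and one is unusually short relative to the appropriate population.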

Percentiles (denoted by $P_k$) are the 99 values which split a distribution into 100 equal portions. For example,
$P_{10}$ = $\frac{10}{100}(n + 1)$th value
$P_{90}$ = $\frac{90}{100}(n + 1)$th value
Note: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$

Example: The table lists 36 weights (in pounds) of the contents of 36 cans of regular Coke.
(a) Find the quartiles.
(b) Find the 10th and 90th percentiles.
(c) Find the percentile corresponding to the weight of 0.8143 lbs.

0.7901  0.8044  0.8062  0.8073  0.8079  0.8110
0.8126  0.8128  0.8143  0.8150  0.8150  0.8152
0.8152  0.8161  0.8161  0.8163  0.8165  0.8170
0.8172  0.8176  0.8181  0.8189  0.8192  0.8192
0.8194  0.8194  0.8207  0.8211  0.8229  0.8244
0.8244  0.8247  0.8251  0.8264  0.8284  0.8295

Solution: The data must be sorted in order of size, if it isn't already. This set has already been sorted.

(a) Quartiles
$Q_1$ = $\frac{1}{4}(n + 1)$th value = $\frac{1}{4}(36 + 1)$th value = 9.25th value
The 9th value is 0.8143 and the 10th value is 0.8150. The difference is 0.8150 - 0.8143 = 0.0007
So $Q_1$ = 0.8143 + $\frac{1}{4}$(0.0007) = 0.814475

$Q_2$ = $\frac{1}{2}(n + 1)$th value = $\frac{1}{2}(36 + 1)$th value = 18.5th value
$Q_2$ is the average of the two middle entries. (Remember: $Q_2$ = median)
$Q_2$ = (0.8170 + 0.8172)/2 = 0.8171

$Q_3$ = $\frac{3}{4}(n + 1)$th value = $\frac{3}{4}(36 + 1)$th value = 27.75th value
The 27th value is 0.8207 and the 28th value is 0.8211. The difference is 0.8211 - 0.8207 = 0.0004
So $Q_3$ = 0.8207 + $\frac{3}{4}$(0.0004) = 0.8210

(b) $P_{10}$ and $P_{90}$
$P_{10}$ = $\frac{10}{100}(n + 1)$th value = $\frac{10}{100}(36 + 1)$th value = 3.7th value
$P_{10}$ =
$P_{90}$ = $\frac{90}{100}(n + 1)$th value = $\frac{90}{100}(36 + 1)$th value = 33.3rd value
$P_{90}$ =

(c) To determine the percentile which corresponds to a certain value from the data set, use
percentile of value x = (number of values less than x) / (total number of values) × 100
Percentile of 0.8143 lbs = (8/36) × 100 = 22 (rounded)
The weight of 0.8143 lbs is the 22nd percentile.
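The $(n + 1)$ position method used in this solution can be sketched in Python as follows. The function names are assumptions for illustration; the interpolation between neighbouring sorted values mirrors the hand calculation, and positions falling outside 1..n are simply clamped to the smallest or largest value, which is a choice made for this sketch.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    whole = int(position)            # e.g. 9 for the 9.25th value
    fraction = position - whole      # e.g. 0.25
    lower = ordered[whole - 1]       # the 9th value
    upper = ordered[whole]           # the 10th value
    return lower + fraction * (upper - lower)


def percentile(data, k):
    """k-th percentile as the (k/100)(n + 1)th value of the sorted data."""
    ordered = sorted(data)
    return value_at_position(ordered, k / 100 * (len(ordered) + 1))


def percentile_of_value(data, x):
    """Percentile of x: share of values less than x, times 100, rounded."""
    return round(100 * sum(1 for v in data if v < x) / len(data))


coke = [
    0.7901, 0.8044, 0.8062, 0.8073, 0.8079, 0.8110,
    0.8126, 0.8128, 0.8143, 0.8150, 0.8150, 0.8152,
    0.8152, 0.8161, 0.8161, 0.8163, 0.8165, 0.8170,
    0.8172, 0.8176, 0.8181, 0.8189, 0.8192, 0.8192,
    0.8194, 0.8194, 0.8207, 0.8211, 0.8229, 0.8244,
    0.8244, 0.8247, 0.8251, 0.8264, 0.8284, 0.8295,
]

q1, q2, q3 = (percentile(coke, k) for k in (25, 50, 75))
print(round(q1, 6), round(q2, 4), round(q3, 4))
print(percentile_of_value(coke, 0.8143))
```

The quartiles agree with the values found in part (a), and the weight 0.8143 lbs comes out as the 22nd percentile. Calling percentile(coke, 10) and percentile(coke, 90) applies the same method to part (b). Software packages often use slightly different position rules, so their quartiles may differ a little from these.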

Using Technology
Excel's Rank and Percentile analysis tool produces a table showing the rank order and percentile for each value in a data set. As an alternative to the analysis tool, these results could also be obtained using Excel's Data Sort and Edit Fill commands.
Exercise: Use Excel to determine the information required in the example on the 36 weights (in pounds) of 36 cans of regular Coke (previous example). Compare with your manual calculations.

Exercise: Use Microsoft Excel wherever possible. Express all z scores with two decimal places. Consider a value to be unusual if its z score is less than -2.00 or greater than 2.00.
1. Human body temperatures have a mean of 98.20° and a standard deviation of 0.62°. An emergency room patient is found to have a temperature of 101°. Convert 101° to a z score. Is that temperature unusually high? What does it suggest?
2. Scores on a history test have a mean of 80 and a standard deviation of 12. Scores on a psychology test have a mean of 30 and a standard deviation of 8. Which is relatively better: a score of 75 on the history test or a score of 27 on the psychology test?
3. Refer to the data set for the sample of 36 weights of regular Coke. Convert the weight of 0.7901 to a z score. Is 0.7901 an unusual weight for regular Coke?
4. Refer to the data set for weights of anesthetized bears and find the indicated percentile or quartile.
(i) $P_{85}$ (ii) $P_{35}$ (iii) $Q_1$ (iv) $Q_3$ (v) $P_{50}$
5. The first several terms of the famous Fibonacci sequence are 1, 1, 2, 3, 5, 8, 13.
(a) Find the mean $\bar{x}$ and standard deviation $s$, then convert each value to a z score. Don't round the z scores; carry as many places as your calculator can handle.
(b) Find the mean and standard deviation of the z scores found in part (a).
(c) If you use any other data set, will you get the same results obtained in part (b)?

SECTION 2.7: Exploratory Data Analysis (EDA)
Exploratory data analysis is the process of using statistical tools (such as graphs, measures of center, measures of variation) to investigate data sets in order to understand their important characteristics.
When exploring a data set, we usually want to calculate the mean and the standard deviation and to generate a graph, usually a histogram/bar chart. It is also important to further examine the data set to identify any notable features, especially those that could have a strong effect on results and conclusions (for example, outliers).

Outliers
An outlier is a value that is located very far away from almost all of the other values. Relative to the other data, an outlier is an extreme value.
Effects of an outlier:
(1) It can have a dramatic effect on the mean; hence, in some instances, a trimmed mean is calculated.
(2) It can have a dramatic effect on the standard deviation.
(3) It can have a dramatic effect on the scale of the histogram/bar chart, so that the true nature of the distribution is totally obscured.
An easy way to find outliers is to examine a sorted list of the data. Look at the minimum and maximum sample values and determine whether they are very far away from the other typical values. When an outlier occurs because of a nonsampling error, it should either be corrected or deleted. However, some data sets include outliers that are correct values. To study the effects of outliers, we can construct graphs and calculate statistics with and without the outliers included.

Boxplots
Boxplots are useful for revealing the center of the data, the spread of the data, the distribution of the data, and the presence of outliers. A boxplot is a graph that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile $Q_1$, the median, and the third quartile $Q_3$.
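The five values that a boxplot is drawn from (minimum, $Q_1$, median, $Q_3$, maximum) can be computed with a short Python sketch. It uses the bank waiting times from Section 2.5 as sample data. The helper name is made up for this example, and it relies on `statistics.quantiles` (Python 3.8 or later) with `method="exclusive"`, which places the quartiles at the $\frac{1}{4}(n + 1)$, $\frac{1}{2}(n + 1)$ and $\frac{3}{4}(n + 1)$ positions used in Section 2.6.

```python
from statistics import quantiles


def five_number_summary(data):
    """Minimum, Q1, median, Q3 and maximum: the values a boxplot displays."""
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return min(data), q1, q2, q3, max(data)


jv = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
p = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

for name, times in (("JV", jv), ("P", p)):
    summary = five_number_summary(times)
    box_width = summary[3] - summary[1]      # Q3 - Q1, the length of the box
    print(name, [round(v, 3) for v in summary], round(box_width, 3))
```

Both banks have the same median, but the box and the line for the Bank of P are much longer, which is exactly the difference in spread that two boxplots drawn on the same axes would show.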


Example: Comparing Ages of Oscar Winners. In Ages of Oscar-winning Best Actors and Actresses (Mathematics Teacher magazine) by Richard Brown and Gretchen Davis, stem-and-leaf plots are used to compare the ages of actors and actresses at the time they won Oscars. Here are the results for recent winners from each category:
Actors:
32 37 36 32 51 53 33 61 35 45 55 39
76 37 42 40 32 60 38 56 48 48 40 43
62 43 42 44 41 56 39 46 31 47 45 60
Actresses:
50 44 35 80 26 28 41 21 61 38 49 33
74 30 33 41 31 35 41 42 37 26 34 34
35 26 61 60 34 24 30 37 31 27 39 34

Use boxplots to compare the two data sets.

Exploring
We now have the following tools to explore data sets:
measures of center: mean, median, mode
measures of variation: standard deviation and range
measures of spread and relative location: minimum value, maximum value, and quartiles
unusual values: outliers
distribution: histograms, stem-and-leaf plots, and boxplots
Rather than simply producing statistics and graphs, try to identify those that are particularly interesting and important. As a first step, investigate outliers and consider their effects by finding measures and graphs with and without the outliers included.

Exercise: PROJECT
The traditional typewriter keyboard configuration is called a Qwerty keyboard because of the position of the letters QWERTY in the top row of letters. Developed in 1872, the Qwerty configuration was supposed to force typists to slow down so that their machines would be less likely to jam. The Dvorak keyboard, developed in 1936, positioned the keys most frequently used in the middle (or home) row, a move intended to improve efficiency. Both keyboard configurations are shown in the accompanying illustration.


An article in the magazine Discover suggests that you can measure the ease of typing by using this point rating system: count each letter on the home row as 0, each letter on the top row as 1, and each letter on the bottom row as 2 (see "Typecasting" by Scott Kim, Discover). For example, the word "statistics" would have a rating of 7 on the Qwerty keyboard and a rating of 1 on the Dvorak keyboard:

          s  t  a  t  i  s  t  i  c  s
Qwerty:   0  1  0  1  1  0  1  1  2  0   (sum = 7)
Dvorak:   0  0  0  0  0  0  0  0  1  0   (sum = 1)

This rating system was used with each of the 52 words in a certain document, and the rating values are shown below:

Table 1: Qwerty keyboard word ratings
 2  2  5  1  2  6  3  3  4  2
 4  0  5  7  7  5  6  6  8 10
 7  2  2 10  5  8  2  5  4  2
 6  2  6  1  7  2  7  2  3  8
 1  5  2  5  2 14  2  2  6  3
 1  7

Table 2: Dvorak keyboard word ratings
 2  0  3  1  0  0  0  0  2  0
 4  0  3  4  0  3  3  1  3  5
 4  2  0  5  1  4  0  3  5  0
 2  0  4  1  5  0  4  0  1  3
 0  1  0  3  0  1  2  0  0  0
 1  4

Exploratory Data Analysis

(a) Organize each data set in a frequency table, using 5 classes with the first being 0-2.

(b) Construct the appropriate histogram/bar chart for each data set from the frequency tables in part (a).

(c) Construct the frequency polygons on the same axes for the Qwerty and Dvorak keyboards, using the frequency tables in part (a).

(d) From the graphs above, describe the type of distribution (skewed negatively or positively, symmetric) displayed by each data set.

(e) Calculate all measures of center and summarize them in the table below:

          Qwerty    Dvorak
Mean
Median
Mode

(f) Calculate all measures of variation and summarize them in the table below:

                      Qwerty    Dvorak
Range
Standard deviation
Variance

(g) On the same axes, construct the boxplot (using the 5-number summary) for each of the given data sets. Identify any outliers. Which of these sets of ratings (Qwerty or Dvorak) appears to have more spread? Use information from the boxplots.

(h) Analysis: Using the graphs and measures you have determined above, comment on the level of typing difficulty using each of the keyboards. Do both keyboards appear to have the same level of difficulty, or is one easier to use than the other?
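The point rating system described before Table 1 can be sketched in a few lines of Python. The row assignments below are taken from the standard Qwerty and Dvorak layouts, since the illustration is not reproduced in these notes. The names `QWERTY_ROWS`, `DVORAK_ROWS` and `word_rating` are hypothetical and chosen only for this example.

```python
# Letter rows of each layout (letters only; punctuation keys are ignored).
QWERTY_ROWS = {"top": "qwertyuiop", "home": "asdfghjkl", "bottom": "zxcvbnm"}
DVORAK_ROWS = {"top": "pyfgcrl", "home": "aoeuidhtns", "bottom": "qjkxbmwvz"}

# Points per row, as described in the text: home = 0, top = 1, bottom = 2.
ROW_POINTS = {"home": 0, "top": 1, "bottom": 2}


def word_rating(word, rows):
    """Sum the row points of every letter in the word for the given layout."""
    points = {
        letter: ROW_POINTS[row]
        for row, letters in rows.items()
        for letter in letters
    }
    # Characters that are not letters of the layout contribute nothing.
    return sum(points[ch] for ch in word.lower() if ch in points)


for word in ["statistics", "we"]:
    print(word, word_rating(word, QWERTY_ROWS), word_rating(word, DVORAK_ROWS))
```

For the word "statistics" this gives the ratings 7 (Qwerty) and 1 (Dvorak) stated in the text, which is a quick check that the row assignments match the rating rule.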

So far we have treated the two data sets as if they were separate and independent, but the word ratings came from the same 52 words in the same document. We should therefore explore the differences between the pairs of ratings corresponding to each of the 52 words. For example, suppose the first word in the document was the word "we":

          w  e
Qwerty:   1  1   (sum = 2)
Dvorak:   2  0   (sum = 2)

The difference is 2 - 2 = 0.

(i) Find the difference of each pair of data from both data sets.

(j) Determine the mean, standard deviation, 5-number summary, boxplot, and any outliers for the set of data consisting of the 52 differences.

(k) Construct an appropriate graph to discuss the type of distribution.

(l) If both keyboards were to have the same level of difficulty, we would expect the differences between word ratings to average around 0. What do the mean difference and the graph tell you about the level of difficulty of one keyboard compared to the other? Does this support your earlier analysis in part (h)?

(m) Would removing any outliers affect the conclusion(s) you have drawn? Show any necessary calculations to support your statement.
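The paired-difference idea in parts (i) and (j) can be sketched on a small subset of the data. The example below uses only the first ten ratings of each table and assumes the tables are read row by row, so that the k-th Qwerty rating and the k-th Dvorak rating belong to the same word. That pairing is an assumption about the table layout, and the ten values are a partial illustration, not the full set of 52 differences.

```python
from statistics import mean, stdev

# First ten ratings of each table, assuming a row-by-row reading order.
qwerty_first10 = [2, 2, 5, 1, 2, 6, 3, 3, 4, 2]
dvorak_first10 = [2, 0, 3, 1, 0, 0, 0, 0, 2, 0]

# Difference for each word: Qwerty rating minus Dvorak rating.
differences = [q - d for q, d in zip(qwerty_first10, dvorak_first10)]

print("differences:", differences)
print("mean difference:", mean(differences))
print("sample standard deviation:", round(stdev(differences), 2))
```

A mean difference well above 0 would suggest that words tend to rate as harder on the Qwerty keyboard. The same steps applied to all 52 pairs give the quantities asked for in part (j).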
