
Topic 1

Descriptive Statistics

Contents
1.1 Introduction
1.2 Measures of Central Tendency
1.3 Graphing Results
    1.3.1 Histograms
    1.3.2 Other Graphs
1.4 Measures of Spread or Dispersion
1.5 Exploratory Data Analysis
1.6 Statistical Software
1.7 Introduction to Probability
1.8 Summary and Assessment

Learning Objectives

Give appropriate examples of where quantitative methods can be used successfully;
Calculate measures of central tendency, in particular mean, median and mode;
Construct histograms, bar charts and pictograms from sample data;
Give reasons for the usefulness of a stem-and-leaf diagram and construct one from sample results;
Describe why spread or dispersion is an important property of data and calculate standard deviation and inter-quartile range;
Construct a boxplot for sample data;
Explain the concepts behind Exploratory Data Analysis;
Use the basic functions of a statistical software package such as Excel or Minitab;
Describe the basic ideas of probability theory and carry out simple calculations of probability.

TOPIC 1. DESCRIPTIVE STATISTICS

1.1 Introduction

There are many occasions in everyday life when it is desirable to find out and make use of quantitative information, and in some cases it will simply be a case of looking up the details in an appropriate source. For example, if someone is going on holiday to Sydney and wishes to know the temperature there at the moment, the appropriate information can be obtained from a newspaper. If they then want to know how many Australian dollars they'll get for their money, a visit to the bank will soon give the relevant exchange rate.

However, there will be a number of situations where the quantitative information desired is simply not available and research is therefore necessary. It must be stressed, though, that great care has to be taken even in the most basic of situations if the results of any research are to be believable. It is all too easy to decide to carry out a survey on some subject and expect instantly to be provided with all the answers. To obtain reliable results there needs to be a plan for collecting information (or data) about the subject, one which takes into account any possible relationships between factors.

This course can be thought of as being made up of two main sections. The first involves looking at data and finding ways of presenting and summarising them (the word data is plural!) in a manner that will be easy for anyone to understand. Such analysis requires little prior knowledge of statistics and indeed the author would like to emphasise that there is nothing to be afraid of in the subject. Statistics conjures up many different images and also many prejudices ("you can prove anything with statistics") but an understanding of certain key points will make the reader aware of how useful the academic subject called statistics can be. Just as a matter of interest, here are a few other quotes about statistics.

"There are two kinds of statistics: the kind you look up and the kind you make up."
"Numbers are like people; torture them enough and they'll tell you anything."
"It is easy to lie with statistics, but it is easier to lie without them."
"An approximate answer to the right question is worth a good deal more than the exact answer to an approximate problem."
"A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances."

The second key area covered in the course is experimental design. This, in fact, should really be the starting point since some survey or experiment will inevitably have produced the data used in the statistical analysis. However, it is probably better to become familiar with a few techniques of working with data first before embarking on the more complicated ideas of designing a good experiment.

The importance of experimental design stems from the search for inference about causes or relationships as well as simply a description. Researchers are rarely satisfied to just describe the events they observe. They want to make inferences about what produced, contributed to, or caused events. The purpose of the design is also to rule out irrelevant factors, leaving only the factor that is the real cause.

Here are some examples of situations where appropriate quantitative methods can be used to help provide useful results.

© Heriot-Watt University 2002


Examples

1. It is thought that job satisfaction is low in the Police Force. By using a series of questions that members of the force answered, a score of 43.3 was calculated for job satisfaction. This is compared to a score of 45.1, which was obtained in a similar way for a wide collection of professions taken together. Is this proof that job satisfaction is indeed low in the police?

2. It is well documented that driving skill is affected by the level of blood alcohol. A driver is asked to estimate his skill in a simulation exercise as he takes more and more alcohol. He finds that after an initial decrease in skill ability, the situation becomes stable. Does this mean that there is not a linear relationship between skill and blood alcohol or are there other factors involved?

3. A factory is supposed to produce cartons each containing 500ml of milk. A sample of 100 is taken one day and it is found that the average volume is 499.4ml. Are customers being given misleading information?

4. A manufacturer of screws makes special screws for a customer. From every lot of 1000 screws the manufacturer wants to select screws randomly to check whether they match the customer's specifications or not. How many screws from each lot should be tested to be 98% confident that all screws in that lot meet the specifications?

5. A researcher may be interested in the effectiveness of different ways of teaching vocabulary to 6-year-old children. Three teaching methods are compared: silent reading of a story by children, story-telling by a teacher, and story-telling by a teacher which is also enhanced by pictures. If scores are calculated for each category, does the highest one always imply the best teaching method?

The answers to all of these questions are not as obvious as they may first appear. Experimental research, whether in life sciences, information science, business, or other sciences, involves experimental or observed data taken from a sample. From these data the scientist derives properties of the whole population. The problem is that because of the relatively small numbers involved in sampling, there may be many random errors creeping in.

It is very rare that research would be conducted on a whole population, as it would be too difficult or very time-consuming. In the UK the only real statistics obtained about the whole population of the country come every ten years in the census (the last one was in 2001), so in most cases taking results from a sample is the only way to gain insight into properties of the population. But this process of inference almost always involves an error. For example, one sample of 100 potential customers of a new product may contain 25 people in favour of it, whereas a second sample of 100 potential customers may contain 32 people in favour of the new product. Hence, there is always uncertainty about the actual property (here, being in favour of the new product) of the total population. Statistics provides scientific tools to allow inferences to be made with a quantified degree of certainty and so provides a method to judge the reliability of such inferences.

It should be noted also that the term population need not refer to people. It could, for example, be every carton of milk that comes off a production line one day, or all the application forms for tickets for a pop concert waiting to be processed. Even when it does refer to people, it need not carry the colloquial meaning of everyone in a country: "student population", for example, could refer to all the students of a particular university.

Statistics is about collecting, presenting, and characterising information to assist in data analysis and decision-making. It can be categorised as either descriptive or inferential. Descriptive statistics is concerned with the collection, presentation, and characterisation of data sets, whilst inferential statistics aims to make inferences about a population based on information contained in a sample. The rest of this chapter will concentrate on descriptive statistics.

1.2 Measures of Central Tendency

Scores on a logical reasoning test were taken under two different conditions on 50 people. The results are given below.

Condition A: 22, 18, 18, 13, 18, 14, 6, 20, 18, 20, 21, 11, 13, 23, 13, 5, 19, 16, 15, 17, 23, 18, 18, 14, 13, 18, 14, 16, 19, 21, 13, 15, 14, 17, 23, 19, 13, 2, 11, 17, 20, 13, 18, 4, 17, 24, 18, 23, 18, 16.

Condition B: 13, 15, 5, 14, 6, 10, 7, 16, 5, 5, 15, 12, 15, 14, 12, 10, 16, 13, 22, 8, 12, 14, 5, 9, 7, 12, 15, 9, 10, 13, 11, 15, 9, 14, 12, 17, 5, 13, 13, 9, 15, 7, 11, 7, 13, 1, 16, 14, 12, 9.

At first glance it is very difficult to see whether there is any difference between the two conditions. Probably the most obvious thing to do is to calculate the average score for each condition. In statistics the quantity usually referred to colloquially as the average is called the mean. So under condition A, the mean is (22 + 18 + 18 + ... + 16) / 50. This calculates as 16.18. Similarly, under condition B, the mean is (13 + 15 + 5 + ... + 9) / 50. This calculates as 11.24. So it seems that people are doing better on the reasoning test under condition A than under condition B.

Statisticians like to represent quantities by formulae and there is an easy one for the mean. If $\bar{x}$ is the notation for the sample mean, and every score is thought of as a different value of $x$, then

$$\bar{x} = \frac{\sum x}{n}$$

where $n$ is the number of values considered.

The mean gives information about the central tendency of the data. Two other statistics can also be used for this purpose and may be more or less useful depending on the particular situation. These are the median and the mode.

The median is the middle result when the data are arranged in ascending order. So for the data set 2, 2, 3, 4, 5, 5, 6, the median would be equal to the number 4. Note, however, that if there is an even number of results, the median will lie half way between the two middle numbers. So for the data set 8, 8, 9, 10, 13, 14, the median would be 9.5.

The mode is simply the most common result. So for the data set 2, 2, 2, 3, 3, 4, 6, 7, 7, 7, 7, 9, 11, the mode would be 7.

The median is usually employed in cases where there are very extreme values at either the top or bottom end of a data set. A situation where the manager of a factory has a much higher salary than any of his or her employees is a case in point. In calculating the mean salary, this extreme result would influence the outcome and would give an unrepresentative salary as far as the general employees are concerned. However, the median would not be influenced in this way, as it is only the position of a result in the ordered sequence that matters. Similarly, if a mistake is made in typing results into a statistics package and one very unusual result is entered, the mean will be affected whilst the median will not.

Take as an example the results of the logical reasoning test under condition A. As the results stand at the moment the median is 17 (and as was calculated earlier the mean is 16.18). Now imagine the third result was entered incorrectly into a spreadsheet and the results were as follows:

22, 18, 1800, 13, 18, 14, 6, 20, 18, 20, 21, 11, 13, 23, 13, 5, 19, 16, 15, 17, 23, 18, 18, 14, 13, 18, 14, 16, 19, 21, 13, 15, 14, 17, 23, 19, 13, 2, 11, 17, 20, 13, 18, 4, 17, 24, 18, 23, 18, 16.

This would dramatically change the mean to a value of 51.82 whilst the median is unchanged at 17. The median is said to be more data resistant than the mean. However, the mean makes more use of the information in the data than the median does, so there is a trade-off between the two.

The mode is useful in situations where it is the most common result that is required (for example, in an office a new employee might wish to find out what his or her salary is most likely to be).
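The calculations in this section can be checked with a few lines of code. The following is only a minimal sketch, assuming Python and its standard `statistics` module (any package would do equally well); the variable names `condition_a`, `condition_b` and `mistyped` are chosen here for illustration and do not come from the course material.

```python
from statistics import mean, median, mode

# Scores from the logical reasoning test, in the order given in the text.
condition_a = [22, 18, 18, 13, 18, 14, 6, 20, 18, 20, 21, 11, 13, 23, 13, 5, 19,
               16, 15, 17, 23, 18, 18, 14, 13, 18, 14, 16, 19, 21, 13, 15, 14, 17,
               23, 19, 13, 2, 11, 17, 20, 13, 18, 4, 17, 24, 18, 23, 18, 16]
condition_b = [13, 15, 5, 14, 6, 10, 7, 16, 5, 5, 15, 12, 15, 14, 12, 10, 16, 13,
               22, 8, 12, 14, 5, 9, 7, 12, 15, 9, 10, 13, 11, 15, 9, 14, 12, 17, 5,
               13, 13, 9, 15, 7, 11, 7, 13, 1, 16, 14, 12, 9]

# Mean: the sum of the scores divided by the number of scores.
mean_a = mean(condition_a)
mean_b = mean(condition_b)

# Median: the middle result once the scores are put in ascending order.
median_a = median(condition_a)

# Simulate the typing error described in the text: the third result
# (an 18) is entered as 1800.
mistyped = condition_a.copy()
mistyped[2] = 1800

print("Mean A:", round(mean_a, 2), " Mean B:", round(mean_b, 2))
print("Median A:", median_a)
print("After the typing error, mean:", round(mean(mistyped), 2),
      " median:", median(mistyped))

# The small data sets used to introduce the median and the mode.
print(median([2, 2, 3, 4, 5, 5, 6]))                     # odd number of results
print(median([8, 8, 9, 10, 13, 14]))                     # even number of results
print(mode([2, 2, 2, 3, 3, 4, 6, 7, 7, 7, 7, 9, 11]))    # most common result
```

Running the sketch shows the point made above: the single mistyped value pulls the mean a long way from the bulk of the scores while the median does not move.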


1.3 Graphing Results

As well as learning something about the middle of a data set, it can also be very revealing to look at a picture of what the information is showing. Every day on television news bulletins, for example, a wide variety of graphics is used to help explain the important facts about the day's events (unemployment figures, opinion polls, etc.). Like good writing, effective graphical displays of data communicate ideas with clarity, precision, and efficiency. But, like poor writing, bad graphs distort and obscure the data or simply spoil the message that they are trying to convey. Newspapers and magazines usually choose simplicity over detail.

1.3.1 Histograms

In the 1840s a Belgian statistician called Lambert Quetelet recorded data on the chest measurements of every member of a regiment of 5732 Scottish soldiers. (There is a good web site providing more information at http://www.maps.jcu.edu.au/hist/stats/quet/.) As discussed earlier, one of the most important facts that could be found from such a survey is the mean chest measurement of a Scottish soldier - this is obviously important in the provision of new uniforms. In fact, the mean for Quetelet's data was calculated as 39.8 inches.

However, there is much more that can be discovered from the research. The first useful step is to look at how the data are distributed. Simply looking at a list of 5732 numbers will not help, so it is desirable to group the results in class intervals. Here is a possible distribution of the measurements.


Chest Measurements in Inches

Class Interval    Frequency
33 - under 34     1
34 - under 35     9
35 - under 36     45
36 - under 37     131
37 - under 38     312
38 - under 39     591
39 - under 40     889
40 - under 41     1095
41 - under 42     1082
42 - under 43     765
43 - under 44     468
44 - under 45     231
45 - under 46     86
46 - under 47     23
47 - under 48     3
48 - under 49     1

It is useful to express class intervals in this way when data involving decimals are used, as it is clear exactly in which interval to place a given result. An alternative way would be to group the intervals as 32.5 - 33.4, 33.5 - 34.4, 34.5 - 35.4, and so on, but it is not then clear where, for example, a result of 34.46 would be placed. By constructing class intervals as given above there is no ambiguity.

A graph called a histogram is a useful way of representing such data. The horizontal axis consists of an appropriate scale to accommodate the class intervals and the vertical axis measures frequency. The size of an individual bar is proportional to how likely it would be that any soldier chosen at random from the regiment would have a chest size between the lower and upper bounds of that class interval. The graph is shown below.

Histogram showing chest measurements of Scottish soldiers


The histogram shows that the highest bars are in the middle of the distribution but as the extreme values are reached the bars become smaller and smaller. This, then, gives an idea of the overall picture of the data rather than just revealing something about middle values.
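The shape described above can be reproduced directly from the frequency table. The sketch below is one possible way of doing so, assuming Python; it draws the histogram on its side using text characters rather than a plotting package, and the names `chest_frequencies` and `text_histogram` are illustrative only.

```python
# Frequency table from the text: lower bound of each one-inch class -> frequency.
chest_frequencies = {
    33: 1, 34: 9, 35: 45, 36: 131, 37: 312, 38: 591, 39: 889, 40: 1095,
    41: 1082, 42: 765, 43: 468, 44: 231, 45: 86, 46: 23, 47: 3, 48: 1,
}


def text_histogram(frequencies, width=50):
    """Return one line per class, with a bar scaled so the largest class has `width` marks."""
    largest = max(frequencies.values())
    lines = []
    for lower, freq in frequencies.items():
        # Very small frequencies may round down to an empty bar.
        bar = "#" * round(width * freq / largest)
        lines.append(f"{lower} - under {lower + 1} | {bar} {freq}")
    return lines


histogram_lines = text_histogram(chest_frequencies)
total = sum(chest_frequencies.values())
modal_class = max(chest_frequencies, key=chest_frequencies.get)

print("\n".join(histogram_lines))
print("Total number of soldiers:", total)
print("Tallest bar:", modal_class, "- under", modal_class + 1)
```

The frequencies add up to the 5732 soldiers in the regiment, and the longest bars sit in the middle classes with the bars shrinking towards either extreme.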

1.3.2 Other Graphs

There are many other types of graphs that can be used to reveal such patterns. Bar charts and pictograms are simply a variation on the theme of histograms but allow more artistic licence as they are not mathematically defined. Also, although the histogram of chest measurements clearly showed a pattern, some of the information was lost since without the original data it is not possible to tell, for example, how many soldiers have a chest measurement of, say, 43.3 inches. However, a stem-and-leaf diagram retains every result but also gives a visual picture.

Example

The height was measured for each person in a sample of 63 college students. The results are given below with measurements given to the nearest cm.

165 158 160 164 164 167 173 171 165 161 176 178 186 159 151 165 166 168 166 174 178 182 163 165 166 147 151 169 170 175 182 160 161 164 164 158 153 158 162 158 162 166 166 166 175 178 170 174 174 170 165 167 168 165 162 156 169 169 169 173 170 168 165

The first two digits in each measurement make up the stem part and the last digit is the leaf. The stem unit here is 10cm and the leaf unit is 1cm. The class interval is 5. Notice that the shape is like a histogram on its side. The second row, for example, refers to the results 151, 151 and 153.

Dotplots simply show each result as a dot on a number line.
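The stem-and-leaf construction described in the example (stem unit 10cm, leaf unit 1cm, class interval 5) can be sketched in code as follows. This is an illustration only, assuming Python; the function name `stem_and_leaf` and the exact layout of each row are choices made here and may differ from the output of a statistical package.

```python
# Heights (cm) of the 63 college students, as listed in the example.
heights = [165, 158, 160, 164, 164, 167, 173, 171, 165, 161, 176, 178, 186,
           159, 151, 165, 166, 168, 166, 174, 178, 182, 163, 165, 166, 147,
           151, 169, 170, 175, 182, 160, 161, 164, 164, 158, 153, 158, 162,
           158, 162, 166, 166, 166, 175, 178, 170, 174, 174, 170, 165, 167,
           168, 165, 162, 156, 169, 169, 169, 173, 170, 168, 165]


def stem_and_leaf(values, class_interval=5):
    """Rows of a stem-and-leaf diagram with stem unit 10 and leaf unit 1.

    Each stem is split into rows covering `class_interval` leaf digits, so
    class_interval=5 puts leaves 0-4 on one row and leaves 5-9 on the next.
    Rows that would contain no leaves are simply omitted.
    """
    rows = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        row_key = (stem, leaf // class_interval)
        rows.setdefault(row_key, []).append(str(leaf))
    return [f"{stem:>3} | {''.join(leaves)}"
            for (stem, _), leaves in sorted(rows.items())]


diagram = stem_and_leaf(heights)
print("\n".join(diagram))
```

Every one of the 63 results can be read back from the printed rows, which is the advantage of this diagram over a histogram.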

1.4 Measures of Spread or Dispersion

It will have been gathered from the graphs of the previous section that there is more to a data set than its central tendency. The fact that in the case of the chest measurements there is a tapering off effect at either end is clearly important. This effect occurs in many day-to-day situations but not all will show exactly the same picture - some will taper off more severely than others. Thus the need for a quantity that measures spread becomes apparent. Consider dotplots for exam results of a class of school students for the subjects English and Maths.

The mean value for each set of results is likely to be around 50%; however, they show very different phenomena. In the case of the Maths results there is a large variation with some very high and some very low results. But the English results reveal that most students are scoring values very close to the mean.

It is necessary, then, to define a way of measuring this notion of spread (also called dispersion or variation). There are three main ways of doing this. The simplest, but probably least used, is the range. This is just the difference between the lowest and the highest result. (In the case of the heights of college students example it is 186 - 147 = 39.)


The second method involves using the median. This has already been defined as a way of measuring the central tendency of a data set. The same process can then be used to pin-point the results that are one quarter and three quarters of the way along the data sequence when it is arranged in order. This produces numbers that are called the first quartile (Q1) and third quartile (Q3). The difference between these numbers is called the inter-quartile range and this is another measure of spread.

Examples

1. Find the median and the inter-quartile range of the following:

2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 7, 9, 9, 10, 10, 10

The median (or Q2) is the eleventh result, 4. This splits the sequence into two. The middle result in the first half (2, 2, 2, 3, 3, 3, 3, 4, 4, 4) is between 3 and 3 and is therefore 3. This is Q1. The middle result in the second half (4, 4, 5, 5, 7, 9, 9, 10, 10, 10) is between 7 and 9 and is therefore 8. This is Q3. Thus the inter-quartile range is Q3 - Q1 = 5.

A graph called a box-plot provides a useful visual representation of the quartiles and median. Indeed in its simplest form, a box-plot produces a visual summary of 5 important statistics in a data set, namely the maximum and minimum values, the median and the lower and upper quartiles.

2. Draw a boxplot for the data set 4, 5, 5, 5, 6, 6, 8, 8, 9, 10, 12, 16, 19

The median is 8, Q1 = 5 and Q3 = 11. The boxplot is drawn as follows:

Notice that whiskers are drawn from the lower and upper quartiles to the minimum and maximum values, and the median is represented by a vertical line. There are other definitions of a boxplot but this is the simplest.
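The halving method used in the two examples above can be sketched as a short function. This is a minimal illustration assuming Python and at least two results; the name `five_number_summary` is hypothetical. Statistical packages often use slightly different quartile definitions, so their answers may not always agree with this method.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the halving method in the text.

    The ordered data are split into a lower and an upper half; with an odd
    number of results the median itself is left out of both halves.
    Assumes at least two results.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


example_1 = [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 7, 9, 9, 10, 10, 10]
example_2 = [4, 5, 5, 5, 6, 6, 8, 8, 9, 10, 12, 16, 19]

for data in (example_1, example_2):
    minimum, q1, q2, q3, maximum = five_number_summary(data)
    print("min", minimum, "Q1", q1, "median", q2, "Q3", q3, "max", maximum,
          "inter-quartile range", q3 - q1)
```

The five numbers returned are exactly the ones a simple boxplot displays: the whisker ends, the two edges of the box and the line inside it.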


Like the median, the inter-quartile range is data resistant, but both it and the range make little use of the information in the data (the range, which depends only on the two extreme results, is not resistant at all). A third measure of spread that uses all the data and is very widely employed in quantitative situations is the standard deviation. Usually represented by the symbols $s$ or $\sigma$, it is calculated by the formula

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

where $x$ represents each data value, $n$ is the number of points and $\bar{x}$ is the mean as defined earlier.

Example

Find the mean and standard deviation of the following: 15, 45, 19, 38, 27, 37

The mean is 30.17. The standard deviation is calculated in several steps.

Step 1: Calculate (15 - 30.17)^2 + (45 - 30.17)^2 + ... + (37 - 30.17)^2. This gives a value of 692.83.
Step 2: Divide by (n - 1) (in this case 5): 692.83 / 5 = 138.57
Step 3: Take the square root: Answer 11.77
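The three steps of the example can be followed in code. The sketch below assumes Python and keeps full precision throughout, rounding only when printing; the variable names are illustrative. The final line compares the result with the standard library function `statistics.stdev`, which also uses the (n - 1) divisor.

```python
from math import sqrt
from statistics import stdev

values = [15, 45, 19, 38, 27, 37]
n = len(values)
x_bar = sum(values) / n  # the mean

# Step 1: add up the squared differences between each value and the mean.
sum_of_squares = sum((x - x_bar) ** 2 for x in values)

# Step 2: divide by (n - 1).
variance = sum_of_squares / (n - 1)

# Step 3: take the square root.
s = sqrt(variance)

print("Mean:", round(x_bar, 2))
print("Step 1:", round(sum_of_squares, 2))
print("Step 2:", round(variance, 2))
print("Step 3:", round(s, 2))
print("Library result:", round(stdev(values), 2))
```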

Unlike the mean, the standard deviation is not often mentioned in everyday situations such as newspaper articles or television, so it can initially seem a confusing statistic. However, the standard deviation is simply a number that tells you how tightly all the points are clustered around the mean in a set of data. When the results are tightly bunched together the standard deviation is small, whereas data points spread far apart from each other imply a relatively large standard deviation. The units of standard deviation are the same as those of the original data (e.g. cm, seconds, litres). It can be a helpful statistic when comparing two samples that have similar means but perhaps very different ranges.

It is particularly useful when the results follow a Normal distribution, as was the case with the soldiers' chest measurements mentioned earlier. In that example the histogram showed a reasonably symmetrical pattern with the most frequent results being near the average value, whilst it also revealed a tapering off effect at either side. In situations like this, the standard deviation can even be used to give upper and lower bounds for where it is expected that, say, 95% of the results would lie (in fact approximately 95% of the measurements should lie within 2 standard deviations on either side of the mean). However, when distributions are more skewed, standard deviation may not be the most appropriate measure of spread to use - in these situations it may be that the inter-quartile range would give more meaningful results.

A final measure of spread is the variance, which is just the standard deviation squared.

Notice that the denominator in the standard deviation formula is (n - 1). The reader may have wondered why this is used in preference to simply n. The reason is that normally it is desirable to obtain an estimate for the population standard deviation from a sample. For reasons beyond the scope of this course, (n - 1) gives a better estimate of this population result than does the value n. However, the equation

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

is sometimes used for standard deviation, but only really when all results of a population are known. Note that $s$ is usually used for samples and $\sigma$ for populations.
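The effect of dividing by n rather than (n - 1) can be seen by applying both formulae to the same six values used in the earlier example. This is a sketch only, assuming Python's standard `statistics` module, in which `stdev` uses the (n - 1) divisor and `pstdev` uses n; treating the six values as a complete population is purely for illustration.

```python
from statistics import pstdev, stdev, variance

values = [15, 45, 19, 38, 27, 37]

s = stdev(values)        # sample standard deviation, divisor (n - 1)
sigma = pstdev(values)   # population standard deviation, divisor n

print("s (divisor n - 1):", round(s, 2))
print("sigma (divisor n):", round(sigma, 2))

# The variance is the standard deviation squared.
print("sample variance:", round(variance(values), 2), "=", round(s ** 2, 2))
```

Dividing by n always gives the smaller of the two answers, and the difference between them shrinks as the number of results grows.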

Drawing graphs
Q1: For the data below, construct a histogram using suitable class intervals. Also draw a stem-and-leaf diagram and a box-plot. 50.3, 31.1, 58.9, 42.2, 35.0, 34.4, 40.4, 42.2, 50.8, 25.1, 37.9, 41.4, 41.1, 38.8, 43.3, 23.7, 47.0, 20.8, 45.1, 39.9, 55.1, 29.5, 46.0, 42.6, 46.5, 40.2, 38.1, 51.7, 33.6, 33.3, 36.8, 36.5, 44.3, 35.4, 43.6, 43.9, 60.5, 41.7, 44.3, 54.5

1.5 Exploratory Data Analysis

The boxplots and stem-and-leaf diagrams plotted in previous sections are examples of a set of techniques that are described by the term Exploratory Data Analysis. These techniques allow the researcher to explore data visually both as a precursor to more formal statistical analysis and as an integral part of formal statistical modelling.

Exploratory Data Analysis (EDA) is an approach to data analysis that uses a variety of mostly graphical techniques to maximise insight into a data set and uncover underlying structures. It postpones the usual assumptions about what kind of model the data follow in favour of the more direct approach of allowing the data themselves to reveal their underlying structure and model. EDA is often considered to be not merely a collection of techniques but a philosophy. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to explore the data open-mindedly, and graphics can often provide the researcher with some new, often unsuspected, insight into what is going on. The particular graphical techniques employed in EDA are often quite simple, as the boxplots and stem-and-leaf diagrams have revealed. Other EDA techniques include probability plots, lag plots, block plots, mean plots and standard deviation plots.

Example: Mean Plot

Mean plots are used to see if the mean varies between different groups of the data. They can be used with grouped or ungrouped data to determine if the mean is changing over time. Suppose an analyst records the weights of 200 packets of cereal one day. The data are then split into an arbitrary number of equal-sized groups, for example ten groups of 20 points each. A mean plot can then be generated with these groups to see if the mean weight is increasing or decreasing over time. The resulting graph follows.


Although the mean is the most commonly used measure of location, the same concept applies to other measures of location. For example, instead of plotting the mean of each group, the median might be plotted. This might be done if there were significant outliers (unusual results) in the data and a more robust measure of location than the mean was desired. Mean plots are typically used in conjunction with standard deviation plots. The mean plot checks for shifts in location while the standard deviation plot checks for shifts in scale.

On the subject of outliers, some researchers use quantitative methods to exclude these results. For example, they may not include observations that lie more than 2 standard deviations on either side of the mean. In some areas of research this "cleaning" of the data is absolutely necessary. A typical situation might occur in cognitive psychology research on reaction times where, even if almost all scores in an experiment are in the range of 300-700 milliseconds, just a few distracted reactions of 10-15 seconds would completely change the overall picture. Defining an outlier, however, is subjective and the decisions concerning how to identify one must be made on an individual basis, taking into account general research experience in the area.
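Both ideas in this section, the group means behind a mean plot and the exclusion of results more than 2 standard deviations from the mean, can be sketched briefly. The cereal weights are not listed in the text, so the illustration below reuses the Condition A reasoning scores from Section 1.2 in the order given, purely to show the mechanics (those scores are not a time series). It assumes Python, and the function names `group_means` and `within_two_sd` are hypothetical.

```python
from statistics import mean, stdev


def group_means(values, group_size):
    """Mean of each consecutive group of `group_size` results (the points of a mean plot)."""
    return [mean(values[i:i + group_size])
            for i in range(0, len(values), group_size)]


def within_two_sd(values):
    """Keep only the results lying within 2 sample standard deviations of the mean."""
    centre = mean(values)
    spread = stdev(values)
    return [x for x in values if abs(x - centre) <= 2 * spread]


# Condition A scores from Section 1.2, used here only as stand-in data.
scores = [22, 18, 18, 13, 18, 14, 6, 20, 18, 20, 21, 11, 13, 23, 13, 5, 19,
          16, 15, 17, 23, 18, 18, 14, 13, 18, 14, 16, 19, 21, 13, 15, 14, 17,
          23, 19, 13, 2, 11, 17, 20, 13, 18, 4, 17, 24, 18, 23, 18, 16]

# Five groups of ten results each; plotting these against group number
# would give the mean plot.
plot_points = group_means(scores, 10)
print("Group means:", [round(m, 2) for m in plot_points])

# The mistyped data set from Section 1.2, with the third result entered as 1800.
mistyped = scores.copy()
mistyped[2] = 1800
cleaned = within_two_sd(mistyped)
print("Results kept:", len(cleaned), "of", len(mistyped))
```

In the mistyped data only the value 1800 falls outside the 2 standard deviation limits, so it is the only result removed. As noted above, whether such a rule is appropriate remains a matter of judgement for the researcher.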

1.6 Statistical Software

There is much software available nowadays which spares the gatherer of data the need to carry out lengthy calculations in order to obtain appropriate statistical quantities. In years gone by, courses like this one would have spent painstaking time and effort enlightening students on how to calculate means, medians, standard deviations and the like by short cut arithmetical methods or graphs. It will be assumed here that the reader has access at least to a basic spreadsheet package like Microsoft Excel and he or she should make full use of its facilities. The most important aspects of this course are the trends and patterns that the statistics show rather than the physical process of their calculation. Other statistical packages like MINITAB will also be useful on occasion during the course.

Excel is very widely used and its facilities will not be discussed here. It is helpful that the package has many intuitive features and links well with Microsoft Word, which is familiar to most people.

The benefits of using MINITAB are that it is user friendly, has good help facilities, copes with all the main statistical techniques and seems to offer a less steep initial learning curve than many statistics packages. Disadvantages are that although graphs can be obtained very quickly the quality of the diagrams produced can be poor, spreadsheets are not as easily manipulated as in other packages such as Excel, and tabular output is not as good as that of another statistical package called SPSS.

Using a statistical package


Statistical packages

An insurance company wishes to investigate if there is a difference between the claims received by their Aberdeen and Dumfries offices. One week of the year is randomly selected and all the claims to each office during that week are recorded. Use an appropriate statistical package to compare the two data sets below. Obtain means, standard deviations and appropriate graphs for each sample and write a short report on any differences between them.

Aberdeen Claims 339 268 292 297 259 412 392 222 349 345 332 223 342 353 186 335 342 205 335 160 350 201 447 267 284 191 197

Dumfries Claims 193 128 174 164 445 265 486 400 275 331 445 257 319 372 355 208 51 230 506 374 79 371 256 325 280 134 292 119 1293 374 378 320 270 319 189 422 313 307 224 168 171 219 323 307 292 403 281 246 270 560 1303 255 370 333 408 234 220 268 272 408 285 349 135 221 283 363 105 241 456 59 247 344 278 328 1381 334 277 400 173 198 253 160 371 364 245 382 476 351 256 349 318 198 398 196 191 224

451 420 51 60 201 300 334

247 458 385 137 273 343 413

310 171 1249 383 206 299 365

190 420 208 188 290 418 224

301 361 344 275 394 363 231


1.7 Introduction to Probability

It has already been stated that this course will make much use of inferential statistics. To make sense of these it is necessary to have a good understanding of the concept of probability. This topic will crop up time and time again but in this first chapter it is introduced in the simplest possible way.

Recall that in presenting the histograms earlier it was implied that the larger the bar, the more likely would be the outcome corresponding to it. Another way of expressing this is to say that the bigger the bar, the higher the probability of its outcome occurring. Probability can, in fact, be calculated in two ways:

1. Symmetry
2. Examining past events

Most people have some concept of what probability means and when questioned would probably say it is "something about chance". Weather forecasts often proclaim that there is a 70% chance of rain tomorrow. This is simply saying that the probability of rain tomorrow is 70%, or equivalently 0.7 if expressed as a decimal. In fact, during this course it will be convenient for probabilities almost always to be represented by decimals. It can be deduced, then, that probabilities are numbers between 0 and 1, with the likelihood of any event occurring becoming greater the closer the probability is to 1. A probability of 0 means the event is impossible and a probability of 1 means it is certain.

An example of calculating probabilities by symmetry is the familiar gaming situation of throwing a die. It has six sides and, assuming it is not biased, any one side is equally likely to come up. So if it is desired to find the probability of throwing a prime number, simply count the number of primes that can occur (i.e. 3; the numbers 2, 3 and 5 are prime) and divide that by the number of possible outcomes (6). So the probability of a prime number being thrown is 3 / 6 = 0.5. In terms of a formula, the probability can be expressed by:

$$\text{Probability} = \frac{\text{Number of favourable outcomes}}{\text{Number of possible outcomes}}$$

In cases where there is no direct method of calculating probabilities by reasoning as was done above, it is necessary to adopt a relative frequency approach and examine past events. If it snowed on 3 days in a particular town last January, it is reasonable to assume that the probability of snow on a given day this January may be something like 3 / 31 = 0.097. The formula is slightly adapted and can be written as:

$$\text{Probability} = \frac{\text{Number of times event A occurred}}{\text{Total number of events considered}}$$
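The two ways of calculating a probability can be sketched in code. The following is only an illustration, assuming Python; the function names `probability_by_symmetry` and `relative_frequency` are hypothetical, and the figures are the die and snow examples from the text.

```python
from fractions import Fraction


def probability_by_symmetry(favourable_outcomes, possible_outcomes):
    """Number of favourable outcomes divided by the number of equally likely possible outcomes."""
    return Fraction(len(favourable_outcomes), len(possible_outcomes))


def relative_frequency(times_event_occurred, total_events_considered):
    """Number of times the event occurred divided by the total number of events considered."""
    return times_event_occurred / total_events_considered


# Symmetry: throwing a prime number with an unbiased six-sided die.
die_faces = [1, 2, 3, 4, 5, 6]
prime_faces = [face for face in die_faces if face in (2, 3, 5)]
p_prime = probability_by_symmetry(prime_faces, die_faces)
print("P(prime) =", p_prime, "=", float(p_prime))

# Past events: snow on 3 of the 31 days of last January.
p_snow = relative_frequency(3, 31)
print("P(snow) =", round(p_snow, 3))
```

Both results lie between 0 and 1, as every probability must.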

The rules of probability will be explored in subsequent chapters.

1.8 Summary and Assessment

By the end of this topic you will be able to:


Give appropriate examples of where quantitative methods can be used successfully;
Calculate measures of central tendency, in particular mean, median and mode;
Construct histograms, bar charts and pictograms from sample data;
Give reasons for the usefulness of a stem-and-leaf diagram and construct one from sample results;
Describe why spread or dispersion is an important property of data and calculate standard deviation and inter-quartile range;
Construct a boxplot for sample data;
Explain the concepts behind Exploratory Data Analysis;
Use the basic functions of a statistical software package such as Excel or Minitab;
Describe the basic ideas of probability theory and carry out simple calculations of probability.

End of topic test


An on-line test is available at this point.
15 min


ANSWERS: TOPIC 1


Answers to questions and activities


1 Descriptive Statistics
Drawing graphs

Q1: The data can be split into class intervals as follows:

Class Interval    Freq.
20 - under 25     2
25 - under 30     2
30 - under 35     4
35 - under 40     8
40 - under 45     13
45 - under 50     4
50 - under 55     4
55 - under 60     2
60 - under 65     1

The histogram can then be drawn from this table.

The median calculates at 41.55 and the quartiles at 35.95 and 45.55. The minimum is 20.8 and the maximum 60.5. A stem-and-leaf diagram is constructed with stem unit = 10 and leaf unit = 1. Notice that the decimal part of each result has been dropped, so that the first row, for example, refers to the results 20.8 and 23.7, shown as 20 and 23.
