S1 Compilation 2

Mathematical Modelling

A model is a simplification of the real thing. It is both quicker and cheaper to produce than the real thing and helps us to understand the real world object or situation. Mathematical models in statistics make use of probability. We can vary the parameters of a model if we wish. A disadvantage is that a model does not replicate the real world situation in every detail.

A statistical experiment is a test, investigation or other process adopted for collecting data to provide evidence for or against a hypothesis. An event is a subset of the possible outcomes of an experiment.

Collecting Data

The method used to collect data is important because it must avoid bias. One source of bias is data taken from responses to questions, as people may lie about personal matters such as age and weight. Another source of bias is data that does not properly apply to the problem, e.g. using published unemployment figures to investigate the number of people looking for work: the figures do not include students or people past retirement age, but they may include people who are not looking for work.

To check that data is unbiased, ask:
Where has the data come from?
Who is supplying the data and why?
How was the data collected?
Is it all the relevant data or a sample?
If a sample is used, how was the sample chosen?
Is the data relevant to the investigation?
Does the conclusion follow from the investigation?

Types of Data

Qualitative data: non-numerical values such as attitudes, gender, colour, football shirt number.
Quantitative data: data with valid numerical values such as shoe size, number of broken eggs, height, time.

Discrete data come from variables which can only take particular values, such as shoe size. Continuous data come from variables which can take any value within a given range.
Summarising Data

A sample is taken in order to make deductions about the population. Graphical and numerical summaries are essential to help us analyse the data collected. Their purpose is to condense the data, to reveal patterns and to enable comparisons to be made. Summarising can lead to a loss of accuracy.

Ungrouped Frequency Distribution

Data must be sorted before any sense can be made of it. This is often done using a frequency distribution with a cumulative frequency column.

Stem and Leaf Diagrams

One way of ordering and presenting data is a stem and leaf diagram. Its benefit is that it retains all the original data and yet the data is 'grouped' into classes. We must arrange the leaves in numerical order and give a key. A stem and leaf diagram gives a quick visual impression of the shape of the distribution. Both integers and decimals can be represented, though the data is usually given to 2 sig fig; it may be necessary to round the data to meet this constraint. If a large number of leaves are associated with one line then it is usual to use two lines. We can also improve our diagrams by showing the number of leaves on each stem in brackets. If direct comparison of two data sets is required, a back to back stem and leaf diagram can be drawn.

Grouped Frequency Distributions

We can summarise data into grouped frequency tables. The information becomes more concise, but the original values are lost. Grouping allows summaries and estimates to be made. Both continuous data and discrete data can be grouped. The boundaries of the groups must be matched, so that the upper boundary of one class is the lower boundary of the next, even if this results in a negative starting point. Groups are usually referred to as classes. Age is a special case: the boundaries are matched to complete years, i.e. the classes 21-24 and 25-28 actually cover 21 to 25 and 25 to 29.

Cumulative Frequency Curves and Polygons for Grouped Data

When data is grouped (discrete or continuous) we take the cumulative frequency to be the total frequency up to the upper class boundary (ucb) of each interval. To draw a cumulative frequency curve, we plot the ucb of each interval against its cumulative frequency (cf) and join the points with a smooth curve. For a cumulative frequency polygon, we join the points with straight lines instead.
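The plotting points for a cumulative frequency curve or polygon can be sketched in a few lines of Python. The class boundaries and frequencies are made-up illustrative data, and starting the graph at the lowest class boundary with a cumulative frequency of 0 is the usual convention, assumed here rather than stated in these notes.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower class boundary, upper class boundary, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 70, 6)]

upper_boundaries = [upper for _, upper, _ in classes]
cumulative = list(accumulate(freq for _, _, freq in classes))

# Points to plot: (upper class boundary, cumulative frequency).
# Assumed convention: start at the lowest class boundary with cf = 0.
points = [(classes[0][0], 0)] + list(zip(upper_boundaries, cumulative))

print(points)
```

Joining these points with straight lines gives the polygon; joining them smoothly gives the curve.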
Histograms

If the data is for a continuous variable and is summarised by a grouped frequency distribution, then it can be represented by a histogram. There are no gaps between the bars of a histogram, so the class boundaries must be matched. There is an important relationship between the area of a histogram bar and the frequency that it represents: area is directly proportional to frequency, and total area is directly proportional to total frequency.

$$\text{Frequency density} = \frac{\text{frequency}}{\text{class width}}$$

$$\text{Frequency} = \text{frequency density} \times \text{class width}$$
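As a small illustration of the two formulas above, the following Python sketch computes frequency densities for some made-up classes and then recovers the frequencies from the bar areas. The data and the function name are assumptions for illustration only.

```python
# Hypothetical grouped data: (lower class boundary, upper class boundary, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 70, 6)]

def frequency_densities(classes):
    """Frequency density = frequency / class width, one value per class."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

densities = frequency_densities(classes)

# Bar area = frequency density x class width, which recovers the frequency.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

print(densities)
print(areas)
```

Note that the widest class (40 to 70) has the lowest bar even though its frequency is not the smallest, which is why bar height alone should not be read as frequency.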
It is sometimes useful to draw a histogram based on relative frequencies rather than frequencies. The relative frequencies are obtained by expressing the frequencies as a proportion of the total frequency.

Methods of Summarising Sample Data

Measures of location (averages): sometimes called measures of central tendency, these attempt to locate a typical value about which a distribution clusters.
Measures of dispersion: these represent the spread or variation within the data, since it is unlikely that all the values in a data set will be the same.
All these measures are generally numerical quantities.

Measures of Location

The Mode

The mode is the value that occurs most often. It is not always unique (a distribution can be bimodal) and there may not be a mode at all. For grouped frequencies the mode is not always useful, but there are ways to estimate it using a histogram; usually the modal class is sufficient. The mode is easy to find and is not affected by extreme values. It is useful to shops that need to know what sizes to stock.

The Median

The median is the middle value of an ordered set of data. If there are n observations arranged in order of size, the median is the $\frac{n+1}{2}$th observation. To find the median, we use the cumulative frequency. We can estimate the median of grouped data using linear interpolation:

$$Q_2 = L + \frac{\frac{n+1}{2} - f_L}{f} \times c$$

where
L = lower class boundary of the median class
n = total frequency
$f_L$ = cumulative frequency up to the lower boundary of the median class
f = frequency in the median class
c = class width of the median class

The median has similar advantages and disadvantages to the mode.

Other Quantiles

These can be found using the formula above but with (n+1) divided by 4 for quartiles, 10 for deciles and 100 for percentiles, then multiplied by the number of the quantile required. For example, the 43rd percentile uses $\frac{43(n+1)}{100}$ in place of $\frac{n+1}{2}$, with L, $f_L$, f and c taken from the class containing that percentile.

The Arithmetic Mean

The mean is the most widely used measure of location and is often used in conjunction with the standard deviation (a measure of spread). If $x_1, x_2, x_3, \ldots, x_n$ are a set of numbers then

$$\bar{x} = \frac{\sum x}{n}$$

For a frequency distribution this formula is rewritten as

$$\bar{x} = \frac{\sum fx}{\sum f} \quad \text{where } \sum f = n$$
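The interpolation formula for the median (and other quantiles) and the frequency-distribution mean can both be sketched in Python. This is a minimal sketch using made-up data; the function names `grouped_quantile` and `frequency_mean` are illustrative and not part of any syllabus or library.

```python
def grouped_quantile(classes, position):
    """Estimate the value at a given position in grouped data by linear interpolation.

    classes  -- ordered list of (lower boundary, upper boundary, frequency)
    position -- position in the ordered data, e.g. (n + 1) / 2 for the median
    """
    cumulative = 0  # cumulative frequency below the current class (f_L)
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= position:
            width = upper - lower  # class width c
            return lower + (position - cumulative) / freq * width
        cumulative += freq
    raise ValueError("position lies beyond the total frequency")


def frequency_mean(values, frequencies):
    """Return (sum of fx, sum of f, mean) for a frequency distribution."""
    sum_f = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(values, frequencies))
    return sum_fx, sum_f, sum_fx / sum_f


# Hypothetical grouped data with total frequency n = 39.
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 70, 6)]
n = sum(freq for _, _, freq in classes)

median = grouped_quantile(classes, (n + 1) / 2)
lower_quartile = grouped_quantile(classes, (n + 1) / 4)
percentile_43 = grouped_quantile(classes, 43 * (n + 1) / 100)

# Hypothetical ungrouped frequency distribution of scores.
scores = [1, 2, 3, 4, 5]
freqs = [2, 5, 8, 4, 1]
sum_fx, sum_f, mean = frequency_mean(scores, freqs)

print(median, lower_quartile, percentile_43)
print(sum_fx, sum_f, mean)
```

Returning the totals alongside the mean mirrors the advice to state the values of Σfx and Σf in a written answer.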
Always state the appropriate values in your answer, i.e. $\sum fx$, $\sum f$ and n. When given two means and their frequencies, find the two totals, add them together and divide by the total frequency to get the new (weighted) mean. For grouped data we use the class midpoints as the x values. Remember that age is special: with the groups 0-9 and 10-19 the first group is treated as 0 to 10, so its midpoint is 5.

Advantages and disadvantages

The mean is influenced by extreme values; it is sensitive to the presence of outliers.
It is not as easily calculated as the median.
All the values are used directly when calculating the mean.
The mean has important mathematical properties.
Unequal class intervals in a grouped frequency distribution make no difference to the calculation of the mean.
Remember that for grouped data, the mean is only an estimate.

Calculating the Mean Using the Method of Coding

Use this method if asked to do so. The coding

$$y = \frac{x - a}{b}$$

alters the original x values, where
a = the midpoint of the modal class
b = the class width (if the class widths are not equal, use the smallest class width)

From this we can calculate the mean of y and decode to find the mean of x:

$$\bar{x} = b\bar{y} + a$$

Weighted Mean

When we wish to place greater emphasis on some of the values we use a weighted mean.

Measures of Dispersion

Range

The simplest measure of spread.
Based entirely on the extreme values: the smallest value is subtracted from the largest value.
For grouped frequency distributions, an estimate of the range is the difference between the lower class boundary of the first group and the upper class boundary of the last group.
Does not lend itself to mathematical use.
Used only with small data sets, in conjunction with either the mode or the median.
Interquartile Range

The range of the middle 50% of the data: $IQR = Q_3 - Q_1$.
Not affected by extreme values.
If the median is the measure of location used then the IQR is the appropriate measure of dispersion.
Often used when data has extreme values, has open ended classes or is not symmetrical.
Used extensively in conjunction with box plots.
Can help us identify outliers and examine the skewness of a distribution.

Semi Interquartile Range

$$SIQR = \frac{IQR}{2}$$

Standard Deviation and Variance

Standard deviation is used in conjunction with the mean and uses all the data values.
The population variance is denoted by $\sigma^2$ and the sample variance by $s^2$.
The standard deviation is the positive square root of the variance. The population sd is denoted by $\sigma$ and the sample sd by $s$.

$$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2, \qquad \sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}, \qquad \text{where } \bar{x} = \frac{\sum x}{n}$$

For most roughly bell shaped distributions, the bulk (about 95%) of the distribution lies within 2 sd's of the mean.
The units of the sd are the same as those of the original data.
The variance can never be negative (it is a mean of squared deviations, and the sd is its square root).
For similar sets of data it is useful to compare the sd's.

When there is a frequency distribution we use the formula:

$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$$

We can code and decode as before, but when decoding there is no need to add a, as this does not alter the spread: with $y = \frac{x - a}{b}$, $\sigma_x = b\,\sigma_y$.
See purple notes for combining sets of numbers.

Skewness

Symmetrical bell shaped distribution (e.g. the normal distribution): mean = median = mode.
Positively skewed distribution: mean > median > mode. The mean is pulled in a positive direction.
Negatively skewed distribution: mean < median < mode. The mean is pulled in a negative direction.

Measures of Skewness

Pearson's Measure of Skewness

$$\text{Pearson's measure of skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$$

If this value is positive then we have positive skewness. If this value is negative then we have negative skewness. Generally the skewness takes a value between $-3$ and $3$. An alternative form, based on the approximate relationship mean − mode ≈ 3(mean − median), is:

$$\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
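The variance formula, the effect of coding on the sd, and both forms of Pearson's measure can be checked numerically. The sketch below uses a small made-up data set; the mode (4) and median (4.5) are read off by inspection, and the coding constants a = 4 and b = 2 are arbitrary choices for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance: (sum of x^2) / n - mean^2."""
    return sum(x * x for x in values) / len(values) - mean(values) ** 2

def std_dev(values):
    return math.sqrt(variance(values))

# Hypothetical data set: mode 4, median 4.5, mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sd = std_dev(data)

# The two forms of Pearson's measure are approximations of each other,
# so they need not give the same number, but they agree in sign here.
skew_from_mode = (mean(data) - 4) / sd
skew_from_median = 3 * (mean(data) - 4.5) / sd

# Coding y = (x - a) / b: decode the sd by multiplying by b only.
a, b = 4, 2
coded = [(x - a) / b for x in data]
sd_from_coded = b * std_dev(coded)

print(sd, skew_from_mode, skew_from_median, sd_from_coded)
```

Both skewness values are positive, matching mean > median > mode for this data.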
Quartile Coefficient of Skewness

Normal distribution: $Q_3 - Q_2 = Q_2 - Q_1$, quartile skewness = 0.
Positively skewed distribution: $Q_3 - Q_2 > Q_2 - Q_1$, quartile skewness > 0.
Negatively skewed distribution: $Q_3 - Q_2 < Q_2 - Q_1$, quartile skewness < 0.

Box Plots

A box plot illustrates the dispersion or spread of a distribution, as well as the average (median).
It uses the highest and lowest values of the data and the three quartiles.
The box encloses the middle 50% (the IQR).
The whiskers extend to the upper and lower values (the range).

When commenting on box plots you must:
give all the summary statistics (median, IQR, range)
comment on the skewness of the given distributions, with calculations as justification
make comparisons between the two or more distributions

Always draw box plots on graph paper and label your axis clearly. Use a suitable scale.

Symmetrical bell shaped distribution: the whiskers are of equal length and the median is in the middle of the box.
Positively skewed distribution: the right hand whisker is longer and the median is nearer to the lower quartile.
Negatively skewed distribution: the left hand whisker is longer and the median is nearer to the upper quartile.

Use of Box Plots to Identify Outliers

Extreme values are known as outliers.
There may be good reason for these results but they are often due to errors.
They may need to be highlighted.
They are often taken to be points lying more than 1.5 times the IQR above $Q_3$ or below $Q_1$.
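The 1.5 × IQR rule for outliers can be sketched as follows. The data and the quartile values are made up, and the quartiles are taken as given rather than computed, since the method for finding them is covered elsewhere in these notes.

```python
def outlier_limits(q1, q3):
    """Return the lower and upper limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values lying outside the limits are treated as outliers."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]

# Hypothetical data with quartiles assumed to be Q1 = 14 and Q3 = 20.
data = [3, 12, 14, 15, 17, 18, 20, 21, 40]
q1, q3 = 14, 20

limits = outlier_limits(q1, q3)
outliers = find_outliers(data, q1, q3)

print(limits)
print(outliers)
```

On a box plot the values 3 and 40 would be marked with crosses, and the whiskers drawn to 12 and 21, the next values towards the median.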
Procedure

Find the values of the quartiles.
Evaluate $Q_1 - 1.5(Q_3 - Q_1)$ and $Q_3 + 1.5(Q_3 - Q_1)$ and note any values that fall outside this range.
Draw a box based on the quartile values.
If there are any outliers, label them with crosses.
The whisker is usually drawn to the next value towards the median.
Only calculate these outliers if the question specifically asks you to do so.

Correlation

Correlation concerns the relationship between two variables x and y.
Bivariate data produce a bivariate distribution.
There may be a relationship, but you cannot necessarily expect to find a law or formula relating the variables.
We initially look for basic associations.
Scatter Diagrams

Bivariate data is conveniently displayed through scatter diagrams. They help to assess correlation and regression, and we can use them to help show linear correlation. Even if we find a mathematical relationship, this does not imply that there is a relationship in reality, or indeed that an increase in one variable causes an increase in the other.

Correlation measures the relationship between the two variables and the strength of this relationship.
If both variables increase together we say that they are positively correlated.
If one variable increases as the other decreases we say that they are negatively correlated.
If no relationship can be seen we say there is no correlation.

When drawing scatter diagrams to assess correlation it doesn't matter which axis is used for which variable; it does matter when finding a regression line.

If a horizontal line and a vertical line are drawn through the mean point $(\bar{x}, \bar{y})$, you can see the association between the two variables in a different way:
For a positive correlation most points lie in the first and third quadrants (top right and bottom left respectively).
For a negative correlation most points lie in the second and fourth quadrants (top left and bottom right respectively).
If there is no correlation the points are randomly distributed in all four quadrants.

Product Moment Correlation Coefficient, r (PMCC)
The PMCC r is a numerical value that indicates the degree of scatter. It measures the linear relationship between the two variables and its strength. We must calculate this value and interpret its meaning.

The value of r lies between $-1$ and $1$.
It is a useful measure because it is independent of the units and scale of the variables.
In practice, r should only be calculated after a scatter diagram has been drawn, and only if the scatter diagram reveals some degree of linear correlation. If the correlation is non-linear then the PMCC is not appropriate.
Outliers, or rogue results, should be identified as they may upset the general trend.
If r = 1 there is perfect positive linear correlation between the two variables.
If r = $-1$ there is perfect negative linear correlation between the two variables.
If r = 0 (or close to 0) there is no linear correlation; this does not, however, exclude the existence of another type of relationship.

Calculation

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

where

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}, \qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$

We must find n, $\sum x^2$, $\sum x$, $\sum y^2$, $\sum y$ and $\sum xy$, and then use the formulae above. The calculator must be in linear regression mode.

Using a Method of Coding for Correlation

The beauty of coding for the PMCC is that we do not need to decode at the end. Coding makes the values of x and y smaller. You can subtract any number from the x values, since this only moves the axis. You can divide the result by any positive number, since this only changes the scale. The correlation coefficient is unaffected by either of these operations. You can rewrite the variables x and y as:

$$X = \frac{x - a}{b}, \qquad Y = \frac{y - c}{d}$$

where a, b, c and d are suitable numbers to be chosen, with b and d positive.

Note: Just because two variables have a linear correlation does not necessarily mean that they are related. You should therefore have some reason to believe that there might be a relationship before calculating the PMCC, unless your aim is to show that they are unrelated.

Note: Data can be distorted by an outlier, so the information should be plotted on a scatter graph first.

Note: A quadratic relationship can give a PMCC of 0 (for example $y = x^2$ with x values symmetrical about 0): the variables are clearly related, but the relationship is non-linear.
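The PMCC calculation, the fact that coding leaves r unchanged, and the quadratic example can all be checked with a short Python sketch. The data values and the coding constants are made up for illustration, and the function names are not from any particular library.

```python
import math

def summary_statistics(xs, ys):
    """Return (S_xy, S_xx, S_yy) from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy, s_xx, s_yy

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xy, s_xx, s_yy = summary_statistics(xs, ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical bivariate data.
xs = [10, 20, 30, 40, 50]
ys = [105, 112, 118, 127, 131]
r = pmcc(xs, ys)

# Coding X = (x - 30) / 10 and Y = y - 100 (arbitrary constants): r is unchanged.
coded_xs = [(x - 30) / 10 for x in xs]
coded_ys = [y - 100 for y in ys]
r_coded = pmcc(coded_xs, coded_ys)

# y = x^2 with x symmetrical about 0: related, but no linear correlation.
quad_xs = [-2, -1, 0, 1, 2]
quad_ys = [x * x for x in quad_xs]
r_quadratic = pmcc(quad_xs, quad_ys)

print(round(r, 4), round(r_coded, 4), r_quadratic)
```

The coded values are much smaller to work with by hand, yet give the same r with no decoding step.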
Variables are often linked only through a third variable. This applies particularly to changes that take place over time.

Regression

Purpose: to find a law connecting two variables, so that we can make predictions about the value of y for any given value of x.

Explanatory and Response Variables

The value of x is controlled. It is known as the explanatory or independent variable, whilst y is called the response or dependent variable. The response variable will be subject to some level of error or natural variation. To see if there is a relationship, we plot a scatter diagram. The explanatory variable is always plotted horizontally and the response variable is always plotted vertically. By examining the scatter diagram, we can see if a straight line would be a good or appropriate model for the relationship between x and y.

The Straight Line Law

In statistics, instead of writing $y = mx + c$, we use $y = a + bx$. This can be rearranged to

$$y - \bar{y} = b(x - \bar{x})$$

Having assumed the linear regression model, the results are used to find a regression line. This line is known as the regression line of y on x, since y is the response variable for a given value of x. If you assume a linear regression line, each point with coordinates $(x_i, y_i)$ will have a vertical distance $r_i$ from the regression line. These distances are known as residuals. If the residuals are very small, a line may be drawn by eye; however, a much better solution is to find the line of best fit using the method of least squares. Legendre formulated this method. The resulting line is known as the least squares regression line.

The Least Squares Regression Line

This line makes the sum of the squares of the residuals as small as possible, i.e. $\sum r_i^2$ is minimised. We substitute the mean point $(\bar{x}, \bar{y})$ into the equation $y - \bar{y} = b(x - \bar{x})$ and rearrange to get $y = a + bx$, so that $a = \bar{y} - b\bar{x}$. The gradient is given by the letter b and is called the regression coefficient of y on x. We calculate b using the formulae:

$$b = \frac{S_{xy}}{S_{xx}}, \qquad \bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}$$

To draw this line, we choose three points: the mean point, one point whose x value is at the low end of the observed values and another point whose x value is at the high end of the observed values.
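The calculation of a and b, and the three points used to draw the line, can be sketched in Python. The data is made up for illustration and `least_squares_line` is an illustrative name, not a library function.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the regression line of y on x, y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx            # regression coefficient of y on x
    a = mean_y - b * mean_x    # the line passes through the mean point
    return a, b

# Hypothetical data: x is the explanatory variable, y the response.
xs = [10, 20, 30, 40, 50]
ys = [105, 112, 118, 127, 131]

a, b = least_squares_line(xs, ys)

def predict(x):
    return a + b * x

# Three points for drawing the line: low end, mean point, high end.
mean_x = sum(xs) / len(xs)
plot_points = [(x, predict(x)) for x in (min(xs), mean_x, max(xs))]

print(a, b)
print(plot_points)
```

Plotting three points rather than two gives a check: if they do not lie on a straight line, one of them has been miscalculated.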
We can use our regression line to obtain estimates of y for given values of x under appropriate conditions.

Application and Interpretation

Making estimates of the response variable within the range of the observed values of the data is known as interpolation. We do not know what happens outside the range of our experimental data. We are assuming a linear relationship within our observed values, and for all we know the relationship between the variables outside this range may be non-linear. It is therefore dangerous to make predictions or estimates for the response variable based on values outside the range of observed values. This process is known as extrapolation.

You will also be asked to interpret the values of a and b from your regression line within the context of the question.

Regression is concerned with finding a linear law between the two variables in question, with the value of the response variable depending upon that of the explanatory variable. Correlation is concerned with how strongly two variables are linearly associated (not a law).

Probability

Venn Diagrams and Probability Definitions

$\cap$ = intersection (AND)
$\cup$ = union (OR)
$A'$ = NOT A

$P(A) = 1 - P(A')$
$P(A') = 1 - P(A)$

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$P(A' \cup B) = P(A') + P(B) - P(A' \cap B)$
$P(A \cup B') = P(A) + P(B') - P(A \cap B')$
$P(A' \cup B') = P(A') + P(B') - P(A' \cap B')$

$P(A' \cap B') = 1 - P(A \cup B)$
$P(A' \cap B) = P(A \cup B) - P(A)$
$P(A \cap B') = P(A \cup B) - P(B)$

OR in maths is inclusive: $A \cup B$ includes the outcomes where both A and B occur.

Mutual Exclusivity

Two events A and B are said to be mutually exclusive (m.e.) if they cannot occur at the same time. In this case A and B do not overlap in the Venn diagram. Thus, for these events,

$P(A \cap B) = 0$
$P(A \cup B) = P(A) + P(B)$

Exhaustion

If two events A and B are such that $A \cup B$ makes up all the possible outcomes, then $P(A \cup B) = 1$ and we say that A and B are exhaustive. In that case

$P(A) + P(B) - P(A \cap B) = 1$
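The addition rule and the complement results can be checked by treating events as sets of equally likely outcomes. The sketch below uses one roll of a fair die; the events A (even score), B (score greater than 3) and C (score of 1) are made-up examples.

```python
from fractions import Fraction

# Sample space for one roll of a fair die; every outcome is equally likely.
sample_space = set(range(1, 7))

def prob(event):
    """P(event) = favourable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

def complement(event):
    return sample_space - event

A = {2, 4, 6}   # even score
B = {4, 5, 6}   # score greater than 3
C = {1}         # score of 1, mutually exclusive with A

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_union = prob(A | B)
p_by_rule = prob(A) + prob(B) - prob(A & B)

# Complement results read from the Venn diagram.
p_neither = prob(complement(A) & complement(B))
p_b_only = prob(complement(A) & B)

print(p_union, p_by_rule, p_neither, p_b_only)
print(prob(A & C), prob(A | C))
```

Using `Fraction` keeps the probabilities exact, so the two sides of each identity can be compared without rounding.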
Conditional Probability (Dependent Events)

If A and B are any two events where $P(A) \neq 0$ and $P(B) \neq 0$, then the probability of A given that B has already occurred is written as $P(A|B)$.

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B|A) = \frac{P(A \cap B)}{P(A)}$$

Conditional probability reduces the sample space.

Note: If events A and B are m.e. then we know $P(A \cap B) = 0$, so $P(A|B) = P(B|A) = 0$.

Note: We can extend this basic definition of conditional probability to things like

$$P(A'|B) = \frac{P(A' \cap B)}{P(B)}$$

Note: $P(A'|B) = 1 - P(A|B)$ and $P(A'|B') = 1 - P(A|B')$.

Selection without replacement involves conditional probability. Selection with replacement gives independent events.
Use common sense where possible. Resort to the definitions when common sense fails.
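The definition of conditional probability, and the difference between drawing with and without replacement, can be illustrated by listing every equally likely outcome. The bag of 3 red and 2 blue counters is a made-up example, and `conditional` is an illustrative helper rather than a library function.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical bag of counters: 3 red (R) and 2 blue (B).
counters = ["R", "R", "R", "B", "B"]

def probability(outcomes, event):
    """P(event) over a list of equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(outcomes, event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    p_both = probability(outcomes, lambda o: event_a(o) and event_b(o))
    return p_both / probability(outcomes, event_b)

def first_red(outcome):
    return outcome[0] == "R"

def second_red(outcome):
    return outcome[1] == "R"

# Two draws without replacement: ordered pairs of different counters.
without = list(permutations(counters, 2))
# Two draws with replacement: any ordered pair of counters.
with_repl = list(product(counters, repeat=2))

p_without = conditional(without, second_red, first_red)
p_with = conditional(with_repl, second_red, first_red)

print(p_without, probability(without, second_red))
print(p_with, probability(with_repl, second_red))
```

Without replacement the conditional probability differs from the unconditional one, so the draws are dependent; with replacement the two agree, which is the condition for independence.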
Independent Events

Two events are independent if the probability that one of them occurs is in no way influenced by whether or not the other has occurred. In this case

$P(A|B) = P(A)$
$P(B|A) = P(B)$
$P(A \cap B) = P(A) \times P(B)$

Discrete Random Variables

The following are examples of discrete random variables:
the score when a die is thrown
the value of a prize awarded
the profit in a game of chance

The set of all possible values of a random variable (r.v.) together with their probabilities is called a probability distribution. The function that describes how the probabilities are assigned is called the probability function. For an r.v. X the probability function is denoted by $P(X = x)$. Remember that $\sum P(X = x) = 1$.

Random variables are denoted by capital letters and the particular values they take are denoted by lower case letters. Whatever the question is, always define what the random variable is.

The function that allocates the probabilities $P(X = x)$ is also referred to in these notes as the probability density function (pdf).
The probability function can sometimes be expressed in tabular form or as a formula.

The cumulative distribution function (cdf) is $F(x) = P(X \leq x)$, and F(largest value) = 1.

Expectation E(X)

$$E(X) = \sum x\,P(X = x)$$

E(X) is the expected value, the mean of the probability distribution. We obtain it by multiplying each score by its corresponding probability and summing the results. This is a theoretical approach (the mean of a frequency distribution is an experimental approach).

Note: Some probability distributions are symmetrical about a central value. In this case E(X) is the middle value.

A discrete random variable with pdf $P(X = x) = k$ for all given values of x, where k is a constant, is said to follow a uniform distribution.

The Expectation of Any Function of X

The definition of expectation can be extended to any function of the r.v. X, such as $X^2$, $9X$, $X - 4$ or $3X^2 - 5X$. In general, if g(X) is a function of a discrete random variable X, then

$$E[g(X)] = \sum g(x)\,P(X = x)$$

The following results hold when X is a discrete random variable and a and b are constants:
1. $E(a) = a$
2. $E(aX) = aE(X)$
3. $E(aX + b) = aE(X) + b$

The Variance of X

$$Var(X) = E(X^2) - [E(X)]^2$$

where E(X) is the mean $\mu$.

$Var(a) = 0$
$Var(aX) = a^2\,Var(X)$
$Var(aX + b) = a^2\,Var(X)$
$Var(aX \pm bY) = a^2\,Var(X) + b^2\,Var(Y)$ for independent X and Y

The Discrete Uniform Distribution

If the discrete random variable X is defined over the set of distinct values $\{x_1, x_2, x_3, \ldots, x_n\}$ and each value is equally likely, then X has a discrete uniform distribution and

$$P(X = x_r) = \frac{1}{n}, \qquad r = 1, 2, 3, \ldots, n$$

Here X = the value of the next outcome.
If X is a discrete uniform variable and $x_n = n$ (i.e. the x values start at 1 and progress up consecutively), then

$$\mu = E(X) = \frac{n+1}{2}, \qquad \sigma^2 = Var(X) = \frac{(n+1)(n-1)}{12}$$
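The expectation and variance results can be checked on a fair six-sided die, which follows a discrete uniform distribution with n = 6. This is a minimal sketch; the dictionary representation of a distribution and the function names are choices made for illustration.

```python
from fractions import Fraction

def expectation(dist, g=lambda x: x):
    """E[g(X)] = sum of g(x) * P(X = x) over the distribution."""
    return sum(g(x) * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    return expectation(dist, lambda x: x * x) - expectation(dist) ** 2

# Discrete uniform distribution on 1, 2, ..., n (a fair die when n = 6).
n = 6
die = {x: Fraction(1, n) for x in range(1, n + 1)}

mean = expectation(die)
var = variance(die)

# Linear transformation Y = 2X + 3, built as a new distribution.
a, b = 2, 3
transformed = {a * x + b: p for x, p in die.items()}

print(sum(die.values()), mean, var)
print(expectation(transformed), variance(transformed))
```

The output agrees with the formulas (n + 1)/2 and (n + 1)(n − 1)/12, and with E(aX + b) = aE(X) + b and Var(aX + b) = a²Var(X).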
The Normal Distribution

The most important continuous distribution in statistics, seen in heights, masses, ages etc.
The probability density function of the normal random variable is very complicated.
The shape of the curve depends on two parameters, the mean and the variance: $X \sim N(\mu, \sigma^2)$.
The distribution is bell shaped and symmetrical about the mean.
Mean = median = mode.
About 95% of the distribution lies within 2 sd's of the mean.
About 99.7% lies within 3 sd's of the mean.
It is a two parameter distribution: the probabilities for X depend only on $\mu$ and $\sigma^2$.
Area under the curve = 1.
We must standardise X to get the standard normal random variable Z, where $Z \sim N(0, 1)$.

Areas under the Curve

Use the tables to find values of $\Phi(a)$ for a in the interval 0 to 4. For values between $-4$ and 0 we use the symmetry of the normal distribution to find the appropriate probabilities.

$P(Z < a) = \Phi(a)$
$P(Z > a) = 1 - P(Z < a) = 1 - \Phi(a)$
$P(Z < -a) = \Phi(-a) = 1 - \Phi(a)$ (by symmetry)
$P(Z > -a) = \Phi(a)$
$P(a < Z < b) = \Phi(b) - \Phi(a)$
$P(-a < Z < a) = P(|Z| < a) = 2\Phi(a) - 1$
$P(|Z| > a) = 1 - P(|Z| < a) = 2 - 2\Phi(a)$

Use all four decimal places from the table.

Special Probability Table

This contains z values for the normal variable $Z \sim N(0, 1)$ such that the r.v. exceeds z with probability p, i.e. $P(Z > z) = p$.
You can use both tables in reverse to find the value of z, given a probability.

Transformation of any Normal Random Variable to a Standard Normal r.v.

If $X \sim N(\mu, \sigma^2)$ then

$$Z = \frac{X - \mu}{\sigma} \quad \text{where } Z \sim N(0, 1)$$

This is called standardising X to the standard normal r.v. Z.
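Standardising and the table identities for Φ can be checked numerically. The sketch below computes Φ from the error function in Python's standard library instead of reading it from tables, and the distribution X ~ N(50, 4²) is a made-up example.

```python
import math

def phi(z):
    """Standard normal cdf, computed from the error function rather than tables."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def standardise(x, mu, sigma):
    """Z = (X - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical example: X ~ N(50, 4^2), so mu = 50 and sigma = 4.
mu, sigma = 50, 4

z = standardise(56, mu, sigma)
p_below_56 = phi(z)                    # P(X < 56) = P(Z < z)
p_above_56 = 1 - phi(z)                # P(X > 56)
p_below_44 = phi(-z)                   # P(X < 44), equal to 1 - phi(z) by symmetry
p_within_2sd = 2 * phi(2) - 1          # P(|Z| < 2)
p_within_3sd = 2 * phi(3) - 1          # P(|Z| < 3)

print(z, round(p_below_56, 4), round(p_above_56, 4), round(p_below_44, 4))
print(round(p_within_2sd, 4), round(p_within_3sd, 4))
```

Rounded to four decimal places, these values should match those read from the tables.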
