The main objective of statistical analysis is to represent the data by a single
value that shows where the data are concentrated. Such a value is called the
central value, and it makes comparison between two or more series far easier than
comparison of the raw data. Quantitative data, whether organized or unorganized, tend
to concentrate around certain values, usually somewhere near the centre of the
distribution. The measures employed to describe this tendency are called measures
of central tendency. Constructing a frequency distribution of the raw data is the first
step towards condensing a large data set into a compact form. It is then necessary to
condense the data into a single value. Such a single value is called an average. In most
data sets the average is the centre of concentration of the values in the data. Therefore,
the average is called a measure of central tendency. The values of the data cluster
around the average, and it carries the important properties of the data. In that sense, it is
representative of the distribution. Two well-known statisticians, Yule and Kendall,
laid down certain requirements for an ideal average.
1. Arithmetic mean (AM) : It is the best known and most widely used measure of
central tendency. It is the sum of all observations divided by the number of observations.
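$$\text{Mean} = \frac{\text{Sum of all observations}}{\text{No. of observations}}$$
Symbolically, if $X_1, X_2, \ldots, X_N$ are the values of a variable, the mean is
computed by the formula
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
where
$\sum$ is read as sigma
$\bar{X}$ = the mean of the values
$X_i$ = values of the variable
$N$ = no. of values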
Symbolically, if $X_1, X_2, \ldots, X_N$ are the values of a variable and $f_1, f_2, \ldots, f_N$ are
their corresponding frequencies, the mean is computed by the formula
$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_N X_N}{f_1 + f_2 + \cdots + f_N} = \frac{\sum_{i=1}^{N} f_i X_i}{\sum_{i=1}^{N} f_i}$$
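The two formulas above can be checked numerically. The following is a minimal Python sketch; the function names and the sample data are hypothetical and chosen only for illustration.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    if not values:
        raise ValueError("mean is undefined for an empty data set")
    return sum(values) / len(values)


def frequency_mean(values, freqs):
    """Mean of values X_i that occur with frequencies f_i."""
    if len(values) != len(freqs):
        raise ValueError("values and freqs must have the same length")
    total = sum(freqs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(values, freqs)) / total


# Hypothetical data: 10 occurs twice, 20 three times, 30 five times.
values, freqs = [10, 20, 30], [2, 3, 5]
print(frequency_mean(values, freqs))  # (20 + 60 + 150) / 10 = 23.0

# Expanding the frequencies into raw observations gives the same mean.
raw = [x for x, f in zip(values, freqs) for _ in range(f)]
print(arithmetic_mean(raw))  # 23.0
```

The frequency form is only a compact way of writing the plain mean of the expanded data, which is why both calls agree.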
$$\bar{X} = A + \frac{\sum_{i} f_i \, dx_i}{N}$$
Where A stands for the assumed mean
$dx_i$ = deviations of the $x_i$ values from the assumed mean
f = frequencies
N = total frequency
Under this assumption we take $X_1, X_2, \ldots, X_N$ as the mid values of the class intervals and
calculate the arithmetic mean as
$$\bar{X} = \frac{\sum_{i} f_i x_i}{N} \quad \text{where } \sum f = N$$
Computation procedure :
Step I : Write all class intervals serially in the first column and the
corresponding frequencies in the second column.
The mean is then the sum of the products $f x_i$ divided by the sum of the second column.
If the values of the variable are large in size, the calculation can be simplified by
using the short cut method.
Symbolically, $\bar{X} = A + \bar{d}$
Step – I : Choose any value from the data; this is called the assumed mean (A).
Step – II : Take the difference between each mid value and the assumed mean; this is
known as the deviation (d).
Step – III : Multiply each d by the corresponding f.
Step – IV : Calculate $\bar{d}$ by using the formula $\bar{d} = \sum f d / N$.
Step – V : Use the formula $\bar{X} = A + \bar{d}$ to find the mean of the original data.
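The short cut steps can be sketched in Python and compared with the direct mid-value calculation. The class intervals, frequencies and function names below are hypothetical, and the assumed mean of 15 is an arbitrary choice.

```python
def mid_values(intervals):
    """Mid value of each class interval (lower, upper)."""
    return [(lower + upper) / 2 for lower, upper in intervals]


def grouped_mean_direct(intervals, freqs):
    """Sum of f * mid value divided by the total frequency."""
    mids = mid_values(intervals)
    return sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)


def grouped_mean_shortcut(intervals, freqs, assumed_mean):
    """Short cut method: assumed mean plus the mean deviation from it."""
    mids = mid_values(intervals)
    deviations = [m - assumed_mean for m in mids]          # Step II
    products = [f * d for f, d in zip(freqs, deviations)]  # Step III
    d_bar = sum(products) / sum(freqs)                     # Step IV
    return assumed_mean + d_bar                            # Step V


# Hypothetical grouped data.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [2, 3, 4, 1]
print(grouped_mean_direct(intervals, freqs))        # 190 / 10 = 19.0
print(grouped_mean_shortcut(intervals, freqs, 15))  # 15 + 40 / 10 = 19.0
```

Any assumed mean gives the same result; a value near the centre of the data simply keeps the deviations small.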
Demerits of AM :-
1. It can be used only for quantitative data; the mean cannot be calculated for
qualitative data like caste, religion and sex.
2. It is unduly affected by extreme observations.
3. It cannot be calculated when the frequency distribution has open-end classes.
4. Sometimes the AM may not be an observation in the data.
5. It cannot be determined graphically.
Median:-
Definition:-
The median may be defined as the central value of a variable when the values are
arranged in order of magnitude, i.e., in either ascending or descending order.
The median divides the series into two equal parts: 50% of the observations are
smaller than the median and 50% of the observations are larger than it.
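A minimal Python sketch of this definition follows. Averaging the two middle values when the number of observations is even is a common convention assumed here; the text above does not state it. The data are hypothetical.

```python
def median(values):
    """Central value of the data after arranging it in order of magnitude."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median is undefined for an empty data set")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Assumed convention for even n: average of the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5, 3, 9]))    # 5
print(median([7, 1, 5, 3]))       # (3 + 5) / 2 = 4.0
print(median([7, 1, 5, 3, 900]))  # 5, the extreme value 900 has no effect
```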
Merits of median :-
(1) Easy to understand and easy to calculate.
(2) Can be computed for a distribution with open-end classes.
(3) Not affected by extreme observations.
(4) Applicable to quantitative as well as qualitative data, provided the values can be ranked.
(5) Can be determined graphically.
Demerits :-
(1) It is not based on all the observations, hence it is not a proper
representative.
Mode :- The mode is the value of a variable that occurs
most frequently in a series.
(1) Ungrouped data :- In this case the mode is obtained by inspection. For a
given data set, the mode may or may not exist, and even if it exists, it is not
necessarily unique.
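The "inspection" can be sketched in Python by counting occurrences. Treating a data set in which every distinct value occurs equally often as having no mode is a convention assumed here, not something stated in the text; the data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values; an empty list means no mode exists."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Assumed convention: if several distinct values all occur equally
    # often, no value stands out and the mode does not exist.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 3, 3, 5, 7]))     # [3]     one mode
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  mode exists but is not unique
print(modes([4, 5, 6]))           # []      no mode
```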
Demerits :-
i. It is not based on all the observations.
ii. It is not capable of further mathematical treatment.
iii. It is not rigidly defined.
iv. The calculation of the mode is laborious and time consuming.
Quartiles :- The values which divide the given data into four
equal parts when the observations are arranged in order of
magnitude are known as quartiles. There are three quartiles,
Q1, Q2 and Q3. Q1 is known as the lower quartile or first quartile;
25% of the observations of the distribution lie
below it and consequently 75% of the observations are
greater than it. The second quartile, Q2, is the median.
Q3 has 75% of the observations below it and 25% above it.
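Several interpolation rules for quartiles are in use and the text does not say which one it intends. The sketch below assumes the "exclusive" rule of Python's standard `statistics.quantiles`, applied to hypothetical data; other rules can give slightly different Q1 and Q3.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3); the data are sorted internally."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q1, q2, q3


data = [7, 1, 5, 3, 2, 6, 4]      # hypothetical observations
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)                 # Q1 = 2, Q2 = 4, Q3 = 6
print(q2 == statistics.median(data))  # True: the second quartile is the median
```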
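$$\log G = \frac{1}{N}\sum_{i=1}^{N} \log x_i$$
Or,
$$G = \operatorname{Antilog}\left[\frac{\sum_{i=1}^{N} \log x_i}{N}\right]$$
For a discrete series,
$$G = \operatorname{Antilog}\left[\frac{\sum_{i} f_i \log x_i}{N}\right]$$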
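The antilog formulas for G (the antilog of the mean logarithm, i.e. the geometric mean) can be sketched in Python. Base-10 logarithms are assumed here, N is taken as the total frequency in the discrete case, and the data are hypothetical.

```python
import math


def geometric_mean(values, freqs=None):
    """Antilog of the (frequency-weighted) mean of log10 of the values."""
    if freqs is None:
        freqs = [1] * len(values)   # ungrouped data: every frequency is 1
    if len(values) != len(freqs):
        raise ValueError("values and freqs must have the same length")
    if any(x <= 0 for x in values):
        raise ValueError("logarithms require strictly positive values")
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / n
    return 10 ** mean_log           # antilog to base 10


print(round(geometric_mean([1, 10, 100]), 6))          # 10.0
print(round(geometric_mean([2, 8], freqs=[3, 3]), 6))  # 4.0
```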
MEASURES OF DISPERSION
As already discussed, the whole data set is represented by a single value known as the average.
It cannot describe the data completely. There may be two or more data sets with the same
mean which are nevertheless not identical.
To account for this lack of uniformity among observations, it is necessary to study the variation.
The variation is also known as dispersion. It tells us how the individual
observations are scattered or dispersed about the mean.
Deviation=observation-Mean
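Different Measures of Dispersion :
(i) Range : $A - B$, where A is the largest and B the smallest observation
(ii) Quartile deviation : $\dfrac{Q_3 - Q_1}{2}$
(iii) Coefficient of quartile deviation : $\dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
(iv) Mean deviation : $MD = \dfrac{\sum |x - \bar{x}|}{n}$, or $\dfrac{\sum f\,|x - \bar{x}|}{N}$ with $N = \sum f$ for a frequency distribution
(v) Standard deviation
(vi) Variance
(vii) Coefficient of variation
Coefficient of mean deviation about mean $= \dfrac{\text{MD about mean}}{\text{mean}} = \dfrac{\sum |x - \bar{x}| / n}{\bar{x}}$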
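The first four measures in the list can be sketched in Python for ungrouped data. The quartiles are taken from `statistics.quantiles` with its default "exclusive" rule, which is an assumption since the text fixes no rule; the function name and data are hypothetical.

```python
import statistics


def dispersion_summary(values):
    """Range, quartile deviation, mean deviation and their coefficients."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean = sum(values) / len(values)
    mean_dev = sum(abs(x - mean) for x in values) / len(values)
    return {
        "range": max(values) - min(values),                # A - B
        "quartile_deviation": (q3 - q1) / 2,
        "coeff_quartile_deviation": (q3 - q1) / (q3 + q1),
        "mean_deviation": mean_dev,
        "coeff_mean_deviation": mean_dev / mean,
    }


summary = dispersion_summary([1, 2, 3, 4, 5, 6, 7])  # hypothetical data
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
# range 6, quartile deviation 2, coefficient 0.5, mean deviation 12/7
```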
Standard deviation : The positive square root of the arithmetic mean of the squares of the
deviations taken from the mean, denoted by δ:
$$\delta = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$
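When the population mean is not known, we can take the sample mean as an estimate of
the population mean. In this case, only (n-1) of the deviations are independent, because
the deviations from the sample mean always sum to zero. Therefore, when
there are n observations in the data, the divisor is n-1. In statistical language, n-1 is called
the degrees of freedom.
$$\delta = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$
On simplification, the form with divisor n becomes $\delta^2 = \frac{1}{n}\left(\sum x^2 - n\bar{x}^2\right)$.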
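The two divisors and the simplified form can be compared in a short Python sketch. The function names and data are hypothetical; the simplified function implements the divisor-n form.

```python
import math


def std_dev(values, sample=False):
    """Standard deviation with divisor n, or n - 1 when sample=True."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared / (n - 1 if sample else n))


def variance_simplified(values):
    """(1/n) * (sum of x^2 - n * mean^2), the simplified divisor-n form."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical observations
print(std_dev(data))                         # 2.0 (divisor n)
print(round(std_dev(data, sample=True), 4))  # sqrt(32 / 7), about 2.1381
print(variance_simplified(data))             # 4.0, the square of 2.0
```

The n - 1 divisor always gives the slightly larger value, and the difference shrinks as n grows.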
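When the observations are large in size, applying the formula for SD directly is laborious and
the short cut method may be used.
I- Decide on an assumed mean 'a'.
II- Obtain the deviations d = x - a.
III- Compute the mean of the deviations, $\bar{d}$.
IV- Apply the formula
$$\delta = \sqrt{\frac{\sum d^2 - n\bar{d}^2}{n-1}}$$
For grouped data,
$$\delta = h \times \sqrt{\frac{\sum f d^2 - n\bar{d}^2}{n-1}}$$
where d is here read as the step deviation (x - a)/h of each mid value, h is the class width, $n = \sum f$ and $\bar{d} = \sum f d / n$.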
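For ungrouped data the short cut steps can be sketched in Python and checked against the standard library's `statistics.stdev`, which also uses the n - 1 divisor. The data and the assumed mean of 100 are hypothetical.

```python
import math
import statistics


def std_dev_shortcut(values, assumed_mean):
    """Short cut method: work with deviations d = x - a instead of x."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    d = [x - assumed_mean for x in values]   # step II
    d_bar = sum(d) / n                       # step III
    # Step IV: sqrt((sum of d^2 - n * d_bar^2) / (n - 1))
    return math.sqrt((sum(v * v for v in d) - n * d_bar ** 2) / (n - 1))


data = [102, 104, 104, 104, 105, 105, 107, 109]
print(math.isclose(std_dev_shortcut(data, 100), statistics.stdev(data)))  # True
```

The method works because subtracting a constant from every observation leaves the standard deviation unchanged, while the numbers being squared become much smaller.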
Variance : The square of the standard deviation of a set of observations is called the
variance and is denoted by $\delta^2$.
Merits of Standard deviation :
(i) It is rigidly defined.
(ii) It is based upon all observations.
(iii) It does not ignore the algebraic sign of deviation.
(iv) It is capable of further mathematical treatment.
(v) It is not much affected by sampling fluctuation.
Demerits of Standard deviation :
(i) It is difficult to understand and calculate.
(ii) It cannot be calculated for qualitative data.
(iii) It is unduly affected by extreme deviations.
Coefficient of variation :
For comparing the variability of two frequency distributions, a relative measure of
dispersion is used, known as the coefficient of variation. It is always expressed as a percentage.
$$CV = \frac{\delta}{\bar{x}} \times 100$$
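A minimal Python sketch of the comparison follows. It assumes the divisor-n standard deviation (`statistics.pstdev`); the two series are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return 100 * statistics.pstdev(values) / mean


series_a = [45, 50, 55]   # hypothetical series with mean 50
series_b = [10, 20, 30]   # hypothetical series with mean 20
print(round(coefficient_of_variation(series_a), 2))  # 8.16
print(round(coefficient_of_variation(series_b), 2))  # 40.82
```

Series B has the larger coefficient of variation, so it is the more variable of the two relative to its own mean.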
SUMMARY :
1. Standard deviation or variance is never negative.
2. When all observations are equal, standard deviation is zero.
3. When all the observations in the data are increased or decreased by a constant,
the standard deviation remains the same.
4. When each of the observations is multiplied by a constant K, the standard
deviation is |K| times the standard deviation of the original data.
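The summary points can be verified numerically. The sketch below uses `statistics.pstdev` and hypothetical data; note that with a negative multiplier the factor is the absolute value of K, since a standard deviation is never negative.

```python
import math
import statistics


def check_sd_properties(values, shift, k):
    """Check the summary properties for one data set, shift and multiplier."""
    sd = statistics.pstdev(values)
    shifted = statistics.pstdev([x + shift for x in values])
    scaled = statistics.pstdev([k * x for x in values])
    return {
        "non_negative": sd >= 0,                               # point 1
        "shift_invariant": math.isclose(shifted, sd),          # point 3
        "scales_by_abs_k": math.isclose(scaled, abs(k) * sd),  # point 4
    }


result = check_sd_properties([2, 4, 4, 4, 5, 5, 7, 9], shift=10, k=-3)
print(result)                           # all three checks are True
print(statistics.pstdev([7, 7, 7, 7]))  # 0.0, all observations equal (point 2)
```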
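$$r = \frac{\frac{1}{n}\sum xy - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum x^2 - \bar{x}^2} \times \sqrt{\frac{1}{n}\sum y^2 - \bar{y}^2}}$$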
Properties of Correlation Coefficient :
(i) It always lies between -1 and +1; symbolically, -1 ≤ r ≤ +1.
(ii) r is a pure number, i.e., a unitless quantity.
(iii) Two independent variables are uncorrelated: when x and y are independent,
r = 0.
(iv) The absolute value of the correlation coefficient r is independent of change of
origin and scale.
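Properties (i) and (iv) can be illustrated in Python. The sketch assumes the usual product-moment form of r, with the mean of the products minus the product of the means in the numerator and the two divisor-n standard deviations in the denominator; the data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Product-moment correlation coefficient r of paired observations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 4))                      # 0.7746, inside [-1, +1]

# Change of origin and scale: u = (x - 10) / 2, v = 3y + 1.
u = [(a - 10) / 2 for a in x]
v = [3 * b + 1 for b in y]
print(round(correlation(u, v), 4))      # 0.7746, unchanged

# A negative scale factor flips the sign but not the absolute value.
w = [-2 * a for a in x]
print(round(correlation(w, y), 4))      # -0.7746
```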
RANK CORRELATION :
Given by the formula :
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
Where n = no. of paired observations.
d = difference between the respective ranks.
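A minimal Python sketch of the rank correlation follows. It uses the standard form of Spearman's formula, with the factor 6 in the numerator, and assumes the ranks contain no ties; the ranks shown are hypothetical.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Rank correlation 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no tied ranks."""
    n = len(rank_x)
    if n < 2 or n != len(rank_y):
        raise ValueError("need two rank lists of equal length, n >= 2")
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


judge_1 = [1, 2, 3, 4, 5]   # hypothetical ranks given by two judges
judge_2 = [2, 1, 4, 3, 5]
print(round(spearman_from_ranks(judge_1, judge_2), 4))          # 1 - 24/120 = 0.8
print(spearman_from_ranks(judge_1, judge_1))                    # 1.0, identical ranks
print(spearman_from_ranks(judge_1, [5, 4, 3, 2, 1]))            # -1.0, reversed ranks
```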
LINEAR REGRESSION :
The term regression was first used by the British biometrician Galton and literally means
stepping back towards the average. Regression analysis is a mathematical measure of the
average relationship between two or more variables in terms of the original units of the data.
In regression analysis, there are two types of variables. The variable whose value is to be
predicted is called the dependent variable, and the variable which is used for prediction is
called the independent variable. The independent variable is also known as the
regressor, predictor or explanatory variable, while the dependent variable is also known as
the regressed or explained variable.
Y= a + bx
LINE OF REGRESSION :
If the variables in a bivariate distribution are related, we will find that the points
in the scatter diagram cluster round some curve called the curve of regression. If the
curve is a straight line, it is called the line of regression and there is said to be linear
regression between the two variables. The line of regression is the line which gives the
best estimate of the value of one variable for any specific value of the other variable.
Thus the line of regression is the “line of best fit” and is obtained by the principle of least
squares.
A linear equation has the form
Y= a + bx and plots as a straight line, where a and b are constants.
Mathematically, a is the y-intercept and
b is the slope of the line.
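The constants a and b of the line of best fit can be computed with the usual least-squares solution, b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b x̄. The text names the principle but does not spell out these formulas, so the Python sketch below, with its hypothetical function name and data, should be read as an illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line Y = a + bx."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need paired observations, n >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # y-intercept
    return a, b


a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 4), round(b, 4))   # 2.2 0.6, i.e. Y = 2.2 + 0.6x

# Points lying exactly on Y = 1 + 2x recover that line.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))   # (1.0, 2.0)
```

The fitted line always passes through the point of means (x̄, ȳ), which is why the intercept follows directly from the slope.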