1. Background
2. Motivation
3. Objective
4. Description of Data
5. Intelligence extracted from Data
a. Using Scatter Plots and Null Hypothesis
b. Graphs of Correlation
6. Use of R Programming
a. R-code
b. Module wise description
7. What you have learnt from this Project?
8. Summary
9. Innovation finds in the field of Communication
1. Background
With the ever-increasing traffic of data, both on the web and in inventories,
we have reached a stage where we are dealing with Big Data.
We therefore have abundant data ready to be exploited, but it is of no
use until we analyze it and extract some meaning from it.
2. Motivation
The Anderson-Darling (AD) test is one of the statistical tests applied to data to
understand its behaviour and exploit its characteristics. The
advantages of the AD test over other tests are listed below:
Advantages of AD Test
1. Determine type of Distribution:
->
The AD test can be used to determine the distribution followed by a
given data set. It tests which distribution the data follows
from a list of candidate distributions such as Weibull, Exponential,
Log-Normal, Normal etc.
->
Once the type of distribution is known, we can describe the
characteristics of the data and comment on its behaviour.
2.
-> According to M. A. Stephens, the AD test statistic is one of the
best for detecting deviations and departures of data from
normality. [1]
3. Objective
What are you going to do with an AD test in data analysis and
communication algorithms?
->
The main objective of the AD test is to determine the type of
distribution followed by the data and to predict its behaviour accordingly.
->
Each type of distribution has specific characteristics of its own. With
these characteristics we can understand the behaviour of the data under
study, and analyzing the data according to its behaviour helps us to
draw refined conclusions and find the particular pattern being
followed.
->
The Anderson-Darling statistic is used in goodness-of-fit tests
for the Gompertz distribution, which in turn is used to model lifespans,
such as the life cycle of an electronic item or the rate at which code fails,
and is widely used to model the lifespan of living organisms. With some
modifications, the Anderson-Darling test is also used to examine the upper
and lower tails of many distributions. [4] [5]
->
The Anderson-Darling technique is used in Cognitive Radio. Cognitive radio
is a concept in which the unused part of the spectrum is supplied to a Secondary
User while catering to the requirements of the Primary User. In such a system the
distribution of the signal can be modeled by a Gaussian distribution, and we then
compare the received signal with the noise distribution. If we
have a priori information about the noise distribution, we can use the
Anderson-Darling test to check whether the received signals are drawn from
the noise distribution. This method is also called Anderson-Darling Sensing.
[6]
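The sensing idea above can be sketched in R as follows. This is only an illustration under stated assumptions, not the method of [6] as published: the noise is assumed to be Gaussian with zero mean and known standard deviation, the function name `ad_sensing` and its arguments are hypothetical, and the 5% critical value 2.492 is the commonly tabulated value for an Anderson-Darling test against a fully specified distribution.

```r
# Anderson-Darling statistic of received samples against a KNOWN noise model
# N(0, sigma^2). `ad_sensing`, `sigma` and `critical` are hypothetical names.
ad_sensing <- function(y, sigma = 1, critical = 2.492) {
  y <- sort(y)
  n <- length(y)
  i <- seq_len(n)
  # log F0(y_i) for the ordered samples
  log_f <- pnorm(y, mean = 0, sd = sigma, log.p = TRUE)
  # log(1 - F0(y_{n+1-i})), taken from the upper tail for numerical stability
  log_1f <- pnorm(rev(y), mean = 0, sd = sigma, lower.tail = FALSE, log.p = TRUE)
  a2 <- -n - mean((2 * i - 1) * (log_f + log_1f))
  # Large a2 means the samples do not look like pure noise
  list(statistic = a2, signal_present = a2 > critical)
}

noise_only  <- qnorm(ppoints(100))   # idealised noise-only samples
with_signal <- noise_only + 1        # same samples shifted by a constant signal

ad_sensing(noise_only)$signal_present
ad_sensing(with_signal)$signal_present
```

The decision is simply "signal present" when the statistic exceeds the critical value, i.e. when the hypothesis that the samples come from the noise distribution is rejected.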
4. Description of Data
The collected data monitors the weather and atmospheric conditions
in and around the James Clerk Maxwell Building, located in Edinburgh, U.K.
Not every bell-shaped curve represents the Normal distribution. The
shape of the Normal distribution does not depend on its parameters:
the mean and standard deviation only shift and scale the curve. Symmetric
data is therefore not necessarily Normal, since other distributions also
have a bell-shaped curve, as we can see from the following:
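A small numerical illustration of this point, as a sketch in R. The two samples are synthetic, and `shapiro.test()` is used here only because it ships with base R; it is a stand-in for the Anderson-Darling test used in the rest of this report, not the same test.

```r
# Two symmetric, bell-shaped samples built from theoretical quantiles:
# one Normal, one heavy-tailed Student t with 2 degrees of freedom.
p <- ppoints(500)
normal_like  <- qnorm(p)
heavy_tailed <- qt(p, df = 2)

# Stand-in normality test from base R (assumption: not the AD test itself)
p_normal <- shapiro.test(normal_like)$p.value
p_heavy  <- shapiro.test(heavy_tailed)$p.value

p_normal   # large p-value: no evidence against normality
p_heavy    # small p-value: bell-shaped, yet not Normal
```

Both histograms would look bell-shaped, but only the first sample passes a normality test.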
5.a
Relative Humidity does not follow the Normal Distribution. Thus, we reject our
null hypothesis.
As we can observe from the histogram, the value of Relative Humidity
ranges mainly from 72% to 90%.
The average yearly Relative Humidity of Edinburgh is 80.18249 %.
2. Student's t-Test
H0 = The mean of Relative Humidity is 82.91667 %
H1 = The mean of Relative Humidity is not 82.91667 %
2.a.
2.b.
2.c.
2.d.
2.e.
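A one-sample t-test of this kind can be sketched in R as below. The humidity values are synthetic stand-ins (the real data set is not reproduced here), and the variable names are hypothetical; only `t.test()` and its `mu` and `conf.level` arguments come from base R.

```r
# Hypothetical humidity readings (%), centred on 80 with a spread of about 5
humidity <- 80 + 5 * qnorm(ppoints(100))

# H0: mean = 82.91667, H1: mean != 82.91667, tested at the 95% confidence level
res <- t.test(humidity, mu = 82.91667, conf.level = 0.95)

res$p.value    # p-value of the two-sided test
res$conf.int   # 95% confidence interval for the mean

# Decision rule: reject H0 when the p-value is below 0.05
reject_h0 <- res$p.value < 0.05
reject_h0
```

With these stand-in values the sample mean is about 80, so the assumed mean of 82.91667 falls outside the confidence interval and H0 is rejected.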
Surface Temperature does not follow the Normal Distribution. Thus, we reject our
null hypothesis.
As we can observe from the histogram, the value of Surface Temperature
ranges mainly from 5 °C to 15 °C.
The average yearly Surface Temperature of Edinburgh is 9.41 °C.
2. Student's t-Test
H0 = The mean of Surface Temperature is 13 °C
H1 = The mean of Surface Temperature is not 13 °C
2.a.
2.c.
2.e.
By observing the graph and checking with ad.test(), we find that Wind Speed does
not follow the Normal Distribution. Thus, we reject our null hypothesis.
As we can observe from the histogram, the value of Wind Speed ranges
mainly from 1.042 m/s to 4.396 m/s.
It shows a linear decrease from 1 m/s to 14 m/s.
The mean Wind Speed is 2.952 m/s, indicating that on a regular day the
wind speed is most likely to be around 3 m/s.
Thus, wind speeds beyond 7.5 m/s are less likely, as they occur during
unsettled weather conditions.
The average yearly Wind Speed in Edinburgh is 2.952 m/s.
Overall Conclusion:
The null hypothesis is rejected, as both the test and the graphical observation
support the same result.
2. Student's t-Test
H0 = The mean of Wind Speed is 2.83 m/s
H1 = The mean of Wind Speed is not 2.83 m/s
2.a.
The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95%
confidence level.
The mean speed from the data is around 6.3 m/s, whereas we are testing for 2.83 m/s.
Thus, there is a large difference between the two means.
2.b.
The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95%
confidence level.
2.c.
The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95%
confidence level.
The first 10,000 samples correspond to the wind speed data from the first week of
January. Roughly, the wind speed in that period is 14 km/h or 3.9 m/s. Thus we find
that 2.83 m/s deviates considerably from the recorded mean.
2.d.
The null hypothesis that Wind Speed has a mean of 2.83 m/s is accepted at the 95%
confidence level.
The acceptance range for the mean runs from 2.819304 to 2.85, and the assumed
mean of 2.83 lies within this range, close to its middle. Hence the null hypothesis is
accepted.
2.e.
The null hypothesis that Wind Speed has a mean of 2.83 m/s is rejected at the 95%
confidence level.
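The link between the acceptance range and the decision on the null hypothesis can be sketched in R as follows. The wind-speed series is simulated (an assumption, since the original data is not reproduced here), and `inside()` is a hypothetical helper; the point is only that checking whether the assumed mean lies inside the 95% confidence interval is equivalent to comparing the p-value with 0.05.

```r
# Simulated wind speeds in m/s (hypothetical stand-in for the real column)
set.seed(1)
wind <- rgamma(5000, shape = 2, rate = 0.7)
mu0 <- 2.83

# Test the same H0 on the full series and on the first 1000 samples only
full_res  <- t.test(wind, mu = mu0, conf.level = 0.95)
first_res <- t.test(wind[1:1000], mu = mu0, conf.level = 0.95)

# Hypothetical helper: does the assumed mean lie in the confidence interval?
inside <- function(res, mu0) res$conf.int[1] <= mu0 && mu0 <= res$conf.int[2]

# The two decision rules agree: H0 is kept exactly when mu0 is inside the interval
c(interval_rule = inside(full_res, mu0),  p_value_rule = full_res$p.value >= 0.05)
c(interval_rule = inside(first_res, mu0), p_value_rule = first_res$p.value >= 0.05)
```

Different subsets of the same series can lead to different decisions, because each subset has its own sample mean and therefore its own confidence interval.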
By observing the graph and checking with ad.test(), we find that Wind Direction
does not follow the Normal Distribution. Thus, we reject our null hypothesis.
The distribution is concentrated around two peaks, one at around 225°-250° and the other at
around 301°-320°. Thus, wind direction does not follow the normal distribution.
The first peak corresponds to the Southwest and some parts of the West, and the other
peak corresponds to the Northwest.
Thus, for the majority of the time the wind blows from the west (north-west as well as
south-west).
This can also be validated by the fact that there is a huge open golf course (Craig
Millar Park) surrounding the west and the southern part of the observatory.
The range from 45° to 135° corresponds to the North-East, East and South-East.
Thus, no wind comes from that side.
The average yearly Wind Direction in Edinburgh is 159.6°.
2. Student's t-Test
H0 = The mean of Wind Direction is 238°
H1 = The mean of Wind Direction is not 238°
2.a.
The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95%
confidence level.
2.b.
The null hypothesis that Wind Direction has a mean of 238° is accepted at the 95%
confidence level.
2.c.
The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95%
confidence level.
2.d.
The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95%
confidence level.
2.e.
The null hypothesis that Wind Direction has a mean of 238° is rejected at the 95%
confidence level.
5.b. Graphs of Correlation
1. Rainfall and surface temperature
SCATTER PLOT
Observation through plot:
This is the relationship between rainfall and surface temperature: a plot for the
correlation of rainfall and surface temperature.
Rainfall is on the y axis and surface temperature on the x axis.
The data points are widely scattered, which shows that the relationship
is very weak.
There is no linear or curvilinear relationship.
There are very few influential data points in the range of -9 to 25 for surface temperature.
The correlation is low, as seen from the scattered data points in the
graph.
R CODE:
plot(data$surface.temperature..C.,data$rainfall..mm. )
WITH REGRESSION LINE
lines(lowess(data$surface.temperature..C.,data$rainfall..mm.),col="blue")
Statistical Observation:
CORRELATION
On the basis of the r value, the strength of the relationship is very
weak, tending to 0.
The statistics agree with the graph, where the data points are widely scattered:
the relationship is almost zero.
Since the regression line is horizontal, there is no correlation between the data.
The correlation function also gives a value near zero.
Since the correlation coefficient is negative but close to zero, we find that they are
not correlated.
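The statistical check described above can be sketched in R with `cor()` and `cor.test()`. The two vectors below are simulated stand-ins for the temperature and rainfall columns (an assumption; the real data frame `data` is not reproduced here), generated independently so that the true correlation is zero.

```r
# Hypothetical stand-ins for the two data columns
set.seed(42)
temperature <- runif(1000, min = -9, max = 25)   # surface temperature in C
rainfall    <- rexp(1000, rate = 2)              # rainfall in mm, independent of temperature

# Pearson correlation coefficient r, between -1 and 1
r <- cor(temperature, rainfall)

# cor.test() additionally tests H0: the true correlation is 0
ct <- cor.test(temperature, rainfall)

r
ct$p.value
```

A value of r close to zero, whether slightly positive or slightly negative, is read as "not correlated", which matches a horizontal trend line in the scatter plot.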
2. Rainfall and humidity
SCATTER PLOT
Observation through plot:
This is the relationship between relative humidity and rainfall: a plot for the
correlation of rainfall and relative humidity.
Relative humidity is on the y axis and rainfall on the x axis.
The relative humidity is mainly clustered where rainfall is between 0 and
2.
From the graph it is seen that the data points are clustered in one area only. This
type of clustering indicates no correlation or very little correlation. We
can say that the correlation is weak, but it cannot be negative.
It does not follow any linear or curvilinear relationship.
The data points in the rainfall range of 0 to 2 can be said to be somewhat
influential.
R CODE
plot(data$rainfall..mm.,data$relative.humidity....)
WITH REGRESSION LINE
lines(lowess(data$rainfall..mm.,data$relative.humidity....),col="blue")
Statistical Observation:
CORRELATION
On the basis of the r value, the relationship is very
weak but positive.
The regression line is slightly curvilinear and then constant, and thus r is near
zero.
On calculating the correlation, we get a value of nearly zero, which confirms that they are
not correlated.
3. Rainfall and wind speed
This is the relationship between rainfall and wind speed: a plot for the correlation of
rainfall and wind speed.
Rainfall is on the y axis and wind speed on the x axis.
The data is not clustered at any place.
The slope of the line is very small, which shows that there is little
correlation. The correlation is weak, but the
relationship is not negative.
It follows a somewhat linear relationship with an almost negligible slope, which also
shows that the correlation is weak.
R CODE
plot(data$wind.speed..m.s.,data$rainfall..mm.)
WITH REGRESSION LINE
lines(lowess(data$wind.speed..m.s.,data$rainfall..mm.),col="blue")
Statistical Observation:
CORRELATION
On the basis of the r value, the relationship is very
weak but positive.
The r value clearly shows a very weak correlation.
Wind speed and rainfall are not correlated, as the regression line is horizontal, yet
we can see a slight positive correlation in the data.
This is due to the scattered points above the regression line.
4. Atmospheric pressure and surface temperature
This is the relationship between atmospheric pressure and surface temperature: a plot for the
correlation of surface temperature and atmospheric pressure.
Atmospheric pressure is on the y axis and surface temperature on the x axis.
The graph is mainly clustered over the range of surface temperature values
between approximately -9 and 25.
These data points do not have a specific pattern, so we can say that
the correlation is weak.
It does not follow any linear or curvilinear relationship.
The data points in the range of -9 to 25 can be said to be somewhat influential data
points.
R CODE
plot(data$surface.temperature..C.,data$atmospheric.pressure..mBar.)
WITH REGRESSION LINE
lines(lowess(data$surface.temperature..C.,data$atmospheric.pressure..mBar.),col="blue")
CORRELATION
On the basis of the r value, the strength of the relationship is
weak.
Surface temperature and atmospheric pressure are positively correlated with
each other, as we get a regression line with a positive slope.
5. Relative humidity and atmospheric pressure
This is the relationship between relative humidity and atmospheric pressure: a
plot for the correlation of atmospheric pressure and relative humidity.
Relative humidity is on the y axis and atmospheric pressure on the x axis.
The graph is mainly clustered over the range of atmospheric pressure values
between approximately 950 and 1100.
The cluster gradually slopes downward, so
the direction is downwards and the association is negative: as the atmospheric
pressure increases, the relative humidity decreases. Thus, by observing the plot, we
can say that the correlation is negative.
The form cannot be stated clearly, as the data is clustered and does not follow any linear
or curvilinear relationship.
The data points lie closer together in the right corner, which shows that they are more closely
related there, i.e. the correlation is higher in that corner. There the
negative correlation is stronger, but overall
it can be concluded that the correlation is low.
The data points in the right corner can be said to be influential, as they follow the
trend of the major cluster of data points.
Statistical Observation:
On the basis of the r value, the relationship is
weak and negative.
The regression line has a negative slope,
which corresponds to a negative correlation of the data.
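The connection between the slope of a straight regression line and the sign of r can be sketched in R as follows. The pressure and humidity values are synthetic (an assumption, chosen only to give a weak downward trend), and `lm()` is used here for a straight-line fit, which is a different tool from the `lowess()` smoother used in the plots above.

```r
# Synthetic stand-ins: humidity falls slowly as pressure rises, plus a
# deterministic wiggle that plays the role of noise
pressure <- seq(950, 1050, length.out = 200)                 # mBar
humidity <- 90 - 0.05 * (pressure - 950) + 3 * sin(seq_along(pressure))

fit   <- lm(humidity ~ pressure)          # least-squares straight line
slope <- unname(coef(fit)["pressure"])    # change in humidity per mBar
r     <- cor(pressure, humidity)          # Pearson correlation coefficient

# For a straight-line fit: slope = r * sd(y) / sd(x), so both share the same sign
c(slope = slope, r = r, r_times_sd_ratio = r * sd(humidity) / sd(pressure))
```

A negative slope therefore always goes with a negative r, while the size of r says how tightly the points follow the line.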
6. Use of R Programming
a. Module wise description of Functions used
1. ad.test()
function (x)
{
    DNAME <- deparse(substitute(x))
    # drop missing values and sort the sample
    x <- sort(x[complete.cases(x)])
    n <- length(x)
    if (n < 8)
        stop("sample size must be greater than 7")
    # log-probabilities of the standardised sample under the normal CDF
    logp1 <- pnorm((x - mean(x))/sd(x), log.p = TRUE)
    logp2 <- pnorm(-(x - mean(x))/sd(x), log.p = TRUE)
    h <- (2 * seq(1:n) - 1) * (logp1 + rev(logp2))
    # Anderson-Darling statistic
    A <- -n - mean(h)
    # statistic adjusted for the sample size
    AA <- (1 + 0.75/n + 2.25/n^2) * A
    # p-value from a piecewise approximation in AA
    if (AA < 0.2) {
        pval <- 1 - exp(-13.436 + 101.14 * AA - 223.73 * AA^2)
    }
    else if (AA < 0.34) {
        pval <- 1 - exp(-8.318 + 42.796 * AA - 59.938 * AA^2)
    }
    else if (AA < 0.6) {
        pval <- exp(0.9177 - 4.279 * AA - 1.38 * AA^2)
    }
    else if (AA < 10) {
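The steps of the listing above can be put together in a self-contained sketch. `ad_normality` is a hypothetical name and this is not the nortest function itself. The first three p-value branches are copied from the listing; the last two branches are not present in the listing and are filled in here from memory of the nortest source, so they should be treated as an assumption and checked against `nortest::ad.test`.

```r
# Sketch of an Anderson-Darling normality test (hypothetical helper)
ad_normality <- function(x, alpha = 0.05) {
  x <- sort(x[!is.na(x)])                  # drop missing values, sort
  n <- length(x)
  if (n < 8) stop("sample size must be greater than 7")
  z <- (x - mean(x)) / sd(x)               # standardise with sample mean and sd
  logp1 <- pnorm(z, log.p = TRUE)          # log F(z_i)
  logp2 <- pnorm(-z, log.p = TRUE)         # log(1 - F(z_i))
  h <- (2 * seq_len(n) - 1) * (logp1 + rev(logp2))
  A <- -n - mean(h)                        # Anderson-Darling statistic
  AA <- (1 + 0.75 / n + 2.25 / n^2) * A    # small-sample adjustment
  pval <- if (AA < 0.2) {
    1 - exp(-13.436 + 101.14 * AA - 223.73 * AA^2)
  } else if (AA < 0.34) {
    1 - exp(-8.318 + 42.796 * AA - 59.938 * AA^2)
  } else if (AA < 0.6) {
    exp(0.9177 - 4.279 * AA - 1.38 * AA^2)
  } else if (AA < 10) {
    exp(1.2937 - 5.709 * AA + 0.0186 * AA^2)   # ASSUMED: not in the listing
  } else {
    3.7e-24                                    # ASSUMED: not in the listing
  }
  list(statistic = A, p.value = pval, reject = pval < alpha)
}

normal_sample <- qnorm(ppoints(50))    # idealised normal sample
skewed_sample <- qexp(ppoints(200))    # clearly non-normal sample

ad_normality(normal_sample)$reject
ad_normality(skewed_sample)$reject
```

The null hypothesis of normality is rejected when the p-value falls below the chosen significance level, here 0.05.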
3. plot(x,y)
Draws the scatter plot of two data columns.
4. hist(x) or plot(table(x))
Either function gives a graphical representation of the
frequency of a data column with respect to its values. The X axis shows the data values
and the Y axis the frequency of occurrence.
5. t.test(x, mu = assumed mean value, conf.level = confidence level)
It compares the assumed mean value with the actual mean
value of the data and accordingly decides on the Null Hypothesis for a given
confidence level.
6. lines(lowess(x, y))
Draws a smoothed trend line, used here as the regression line of the scatter plot, for
interpreting the correlation of data. The smoother span, passed to lowess() as the
argument f, controls how smooth the line is.
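The effect of the smoother span can be sketched in R as follows. The data is simulated (an assumption), and the variable names are hypothetical; `lowess()` and its argument `f` are part of base R, with `f = 2/3` as the default.

```r
# Simulated noisy curve as a stand-in for two data columns
set.seed(7)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

wide_fit   <- lowess(x, y, f = 2/3)   # default span: very smooth line
narrow_fit <- lowess(x, y, f = 0.1)   # small span: follows local structure

plot(x, y)
lines(wide_fit, col = "blue")
lines(narrow_fit, col = "red")

# Residual sum of squares of each smooth, as a rough measure of closeness to the data
rss_wide   <- sum((y - wide_fit$y)^2)
rss_narrow <- sum((y - narrow_fit$y)^2)
```

A larger `f` uses more neighbouring points for each fitted value and gives a flatter line; a smaller `f` follows the data more closely.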
7. What you have learnt from this Project?
Programming Skills
We got an opportunity to explore the R programming language.
We learnt in depth about many functions inside various packages of
R,
i.e., the nortest package, ad.test, t.test, cor, distfit, the MASS package and the
fitdistrplus package.
It has enabled us to determine whether data is drawn from a
specific probability distribution or not.
If the test statistic (A²) exceeds the critical value, then the Null Hypothesis is
rejected.
Alternatively, if the p-value is less than the 0.05 significance level,
then the Null Hypothesis is rejected.
How to correlate two different types of data, both graphically and through the
correlation coefficient.
We learned to interpret various types of plots, such as scatter plots, histograms
and normal plots, and are able to associate as well as correlate the data
graphically. Thus we learnt the data analysis part.
From a graph, we can approximately read off the mean, which lies around the
peak, and the standard deviation from the width of the graph, provided that it is a normal
distribution.
8. Summary
The Anderson-Darling test provided in the nortest package is used to
accept or reject the null hypothesis that the data is normally
distributed. The normal (Gaussian) distribution is determined by the mean
and standard deviation of the data. When our Null Hypothesis is based on means,
we also apply Student's t-test to individual data columns and arrive at
certain conclusions based on varying means. The two-sample Student's t-test
familiarizes us with the interrelation of data. We have used the correlation of data to
arrive at certain conclusions on the interrelation of data. Overall, the
Anderson-Darling test can also be used to detect various distributions such
9. Innovation finds in the field of Communication
Location registration provides the details of the location
of a mobile station. Frequent contact with the mobile station is needed for
higher accuracy. One of the registration types is distance-based registration.
During random movement, the mobile terminal shows a central tendency,
i.e. its position is distributed around the center. The benefit of this
tendency is that the probability density function of the random variable for the
movement of the mobile station can be approximated by a normal distribution.
Here, the Anderson-Darling test is used as a goodness-of-fit test for this
approximation by finding the p-value for each of the multiple contacts.
10. References:
a. Text References
1. http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test#cite_note-Stephens74-1
2. http://www.isixsigma.com/dictionary/anderson-darlingnormality-test/
3. https://www.scribd.com/doc/234252923/Anderson-Darling-Test
4. http://iussp.org/sites/default/files/event_call_for_papers/lenartMissov.pdf
5. http://maths.york.ac.uk/www/sites/default/files/QilinHu.pdf
6. http://maths.york.ac.uk/www/sites/default/files/QilinHu.pdf
7. http://mathworld.wolfram.com/NormalDistribution.html
8. http://www.mathwave.com/articles/distribution_fitting_faq.html#q4
b. Other references
1. http://en.wikipedia.org/wiki/Predictive_analytics
2. http://www.mathwave.com/articles/goodness_of_fit.html
3. http://www.mathwave.com/articles/distribution_fitting_faq.html#q3
4. http://www.cde.ca.gov/ta/tg/hs/documents/mathstudysec2.pdf
5. http://www.westga.edu/assetsCOE/virtualresearch/scatterplots_and_correlation_notes.pdf
6. http://math.tutorvista.com/statistics/scatter-plot.html
7. Spectrum sensing in cognitive radio using goodness of fit testing, by Haiquan Wang, Member, IEEE, En-hui Yang, Fellow, IEEE, Zhijin Zhao and Wei Zhang, Member, IEEE
8. Information Networking: Advances in Data Communications and Wireless Networks: International Conference, ICOIN 2006, Sendai, Japan, January 16-19, 2006, Revised Selected Papers (https://books.google.co.in)
i. http://www.accuweather.com/en/gb/edinburgh/eh1-3/weatherforecast/327336
ii. http://www.bbc.com/weather/2650225
iii. http://www.weatherhq.co.uk/weather-station/edinburgh-airport
iv. https://weatherspark.com/averages/28753/Edinburgh-Scotland-UnitedKingdom