You are on page 1of 77

Nonparametric Statistics

Olaf Wittich

TU/e 2009

2 This course was intended as a two block course (2h lecture + 1h instruction/week) serving as an introduction to nonparametrics. Due to personal preferences, the focus is on two basic ideas namely using invariance under group actions as a construction principle, using empirical processes as a tool for asymptotics. As paradigms, rank tests and goodness of t tests were used. Most material is stolen elsewhere, the treatment of rank statistics from [4], the functional delta method from [6] and the considerations about invariance from [5].

Contents
1 Rank Tests 1.1 Nonparametric assumptions . . . . . . . . . . . . . . . 1.2 A rst example: The one-sided sign test . . . . . . . . 1.3 The sign test in a parametric setting . . . . . . . . . . 1.3.1 The parametric test . . . . . . . . . . . . . . . . 1.3.2 The nonparametric test . . . . . . . . . . . . . . 1.3.3 Pitmans asymptotic eciency . . . . . . . . . . 1.4 Group actions and invariant tests . . . . . . . . . . . . 1.4.1 Group actions . . . . . . . . . . . . . . . . . . . 1.4.2 Example 1. Permutations and order statistics . 1.4.3 Example 2. Monotone maps and rank statistics 1.4.4 Invariant tests . . . . . . . . . . . . . . . . . . . 1.5 A testing problem on domination . . . . . . . . . . . . 1.6 A preliminary remark . . . . . . . . . . . . . . . . . . . 1.7 Construction of critical regions . . . . . . . . . . . . . . 1.8 Three two sample rank tests . . . . . . . . . . . . . . . 1.9 Two sample problems and linear rank tests . . . . . . . 1.9.1 Tests on Location . . . . . . . . . . . . . . . . . 1.9.2 Tests on Scale . . . . . . . . . . . . . . . . . . . 1.9.3 The distribution of the Wilcoxon test statistic . 1.10 Asymptotic Normality . . . . . . . . . . . . . . . . . . 2 Goodness of Fit 2.1 A functional limit theorem . . . . . 2.2 The Kolmogorov Smirnov test . . . 2.3 The Chi-square idea . . . . . . . . 2.4 A Chi square test on independence 3 5 5 6 8 9 9 11 11 12 13 14 16 18 22 23 27 32 37 37 40 41 45 45 49 54 58

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

CONTENTS

A The functional delta method 61 A.1 The Mann-Whitney statistic . . . . . . . . . . . . . . . . . . . 61 A.2 Dierentiability and asymptotic normality . . . . . . . . . . . 62 B Some Exercises 67

Chapter 1 Rank Tests


1.1 Nonparametric assumptions

In contrast to the situation in parametric statistics where we usually assume that the underlying distribution belongs to a family indexed by a few real parameters, nonparametric statistics deals with problems where hardly anything is known about the underlying distribution. In a sense that will be made precise below, nonparametric families can not be indexed by nitely many parameters. For instance, a typical parametric assumption is that the true distribution underlying our random experiment is N (, 2 ) where R and 2 0 are two real parameters. In contrast to that, we consider two examples of nonparametric assumptions. Example 1. The probability distribution of a random variable X is called subgaussian, i 2 E eaX ea for all a R. The subgaussian distributions form a nonparametric family. The characterising condition above is called an (exponential) moment condition. Example 2. The probability distribution of a random variable X is called symmetric around its mean X , i X X (X X ) are equal in distribution. The symmetric distributions also form a nonparametric family. 5

CHAPTER 1. RANK TESTS

Whereas nonparametric ideas are even required for some testing problems such as for instance for goodness-of-t tests where we want to check whether a given assumption on the underlying distribution is reasonable, we must be well aware of the fact that less information about the underlying distributions also inevitably leads to weaker results such that the more powerful parametric methods should be used whenever a reasonable assumption on the underlying distribution family is available. Remark. Since every set which has the same cardinality as the set of real numbers can be mapped one-to-one and onto R, we could in principle also parameterize the non-parametric families above by even a single real parameter. So we can not use the distinction into parametric and nonparametric families of distributions given above without an actual restriction on which kind of parameterizations are considered. It is customary to allow only parameterizations that are continuous in the weak topology on the set of probability measures. Doing so, the nonparametric families considered above can indeed not be parameterized by a nite number of real values.

1.2

A rst example: The one-sided sign test

As the basic paradigm for the construction of nonparametric tests, we consider a one-sided test on location. Note that we will observe here a basic feature that we will meet again and again in the sequel: The nonparametric problem will be translated into a parametric problem while on the way the initial information that was contained in the sample is considerably reduced and only some rough qualitative properties of the actual sample actually serve as input for the nal parametric test. Let X be a continuous random variable with cumulative distribution function F . Consider the median of X dened as follows: Let 1 m (X) := sup{t R : P (X t) < }, 2 1 + m (X) := inf{t R : P (X t) > }. 2

1.2. A FIRST EXAMPLE: THE ONE-SIDED SIGN TEST Then, the median is given by 1 med(X) := (m+ (x) + m (x)). 2

Without any further knowledge about X, we will now construct a test of the hypothesis H0 : med(X) = m0 against the alternative H1 : med(x) > m0 : Since X was assumed to be continuous, we have P (X = med(X)) = 0 and hence 1 (1.1) P (X > med(X)) = P (X < med(X)) = . 2 The basic idea to construct the test is now the following: Consider a random sample X1 , ..., Xn and derive from that the number of realisations that are larger than m0 . To be precise, let Si := 1Xi >m0 = Then, S given by
n

1 , Xi > m0 . 0 , else

S :=
i=1

Si

is the required random variable which under H0 is by formula (1.1) binomially distributed with parameter 1/2, i.e S Bin(n, 1/2). On the other hand, under the alternative med(X) > m0 , the random variable S is distributed according to a binomial distribution Bin(n, p) with p > 1/2. Thus we have reduced the non-parametric problem to a parametric one, namely a test for the parameter p of a binomial distribution Bin(n, p) where the hypothesis H0 : p = 1/2 is tested against the alternative H1 : p > 1/2. The solution of this problem is well known: Fix a level of signicance > 0. Then the corresponding rejection- or critical region is given by C := {n0 , n0 + 1, ..., n} where n0 is chosen to be the smallest integer such that (assuming H0 ) 1 P (S n0 ) = n 2
n

k=n0

n k

CHAPTER 1. RANK TESTS Observe that the information in the sample is reduced from real values to the information whether the value is larger or smaller than zero (coarsening of information). For the nal parametric test, only this information about the initial observations is used.

1.3

Performance of the sign test in a parametric setting

To underline the statement from the introduction that using more information usually yields stronger results, i.e. that parametric methods are usually more powerful in dealing with parametric problems, we compare the sign test with one of the usual parametric tests where the distribution family is assumed to be X N (, 2 ) and we assume for simplicity that the variance 2 > 0 is known. The test on location is here H0 : = 0 versus H1 : > 0 , we x the signicance level again to some > 0, the random sample is X1 , ..., Xn . We now strive for a comparison of the two tests in terms of eciency, meaning: Which sample size is necessary to achieve a given power ? You might want to argue that this comparison is not entirely fair, because we only compare the two tests exactly in the situation where the parametric test is designed for. By the following criterion, absolutely no information is provided how the two tests compare when the underlying distribution family is not the one given above. So, what we expect to be the strength of the sign-test, namely that its performance does not dramatically alter if the underlying assumptions on the distribution family are not valid, is actually not measured by this approach. Remark. We will see later (Denition 11) that we can also design nonparametric tests which perform better in comparison to the t-test with respect to the criterion which we will consider now.

1.3. THE SIGN TEST IN A PARAMETRIC SETTING

1.3.1

The parametric test


n

We consider the sample mean X= 1 n Xk


k=1

with E(X) = and Var(X) = 2 /n. Hence Z= X 0 n N (0, 1)

is standard normally distributed under H0 and and the rejection region is given by C() = [0 + z , ) n where z is given by P (Z > z ) = and H0 is rejected, when X C(). For true > 0 , the type II error is given by := P X true 0 true n n + z

and the left hand side is a standard normal variable. Hence, if we want that the test has signicance level and power 1 , we need that 0 true n + z z1 = z , or n
2

z + z 0 true

(1.2)

1.3.2

The nonparametric test

Note rst that for the normal distribution, the mean value and the median coincide (as it is the case for all symmetric distributions). We may thus use the sign test from the preceding section as our nonparametric test. The type II error for the sign test is given by
n0 1

= P (S < n0 ) =
k=0

n k

pk (1 ptrue )nk true

(1.3)

10 where

CHAPTER 1. RANK TESTS

ptrue = P (X > 0 ) = P (X true / > 0 true /) = 1

0 true

For large n, we can use the central limit theorem to approximate the type II error (1.3). We have = P (S < n0 ) = P Hence n0 nptrue nptrue (1 ptrue ) S nptrue nptrue (1 ptrue ) . < n0 nptrue nptrue (1 ptrue )

n0 nptrue nptrue (1 ptrue )

z .

(1.4)

The same approximation for the determination of n0 from the signicance level yields (here the parameter is 1/2 since we consider here the distribution assuming the hypothesis) 2n0 n z . n
1 Inserting n0 2 ( nz + n) from (1.5) back into (1.4), we obtain

(1.5)

z + 2 ptrue (1 ptrue )z 2ptrue 1

(1.6)

To nally compare the samples sizes for both tests, we have to nd a relation between ptrue and true . By power expanding the cumulative density function of the standard normal distribution, we obtain ptrue = 1 true 0 1 + + O(|true 0 |3 ) 2 2

and inserting this into (1.6) nally yields 2 z + 1 2 (true 0 )2 z 2 . n 2 true 0


(1.7)

1.4. GROUP ACTIONS AND INVARIANT TESTS

11

1.3.3

Pitmans asymptotic eciency

Now we are going to make precise what we mean by the statement that nonparametric tests yield weaker results in a parametric setup. Let H0 : = 0 versus H1 : > 0 be a testing problem in a parametric family F . Let > 0 be a xed level of signicance . We consider two dierent statistical tests A and B and denote by nA,, () and nB,, () the minimal sample size such that the power of the respective test is 1 given the true parameter is . The following parameter is hence a measure for the relative quality of the tests A and B. Denition 1 (Pitmans asymptotic eciency) The asymptotic relative eciency of the tests A,B is given by eAB (, ) := lim nA,, () . nB,, ()

We can now calculate this relative eciency for the two dierent tests from the preceding subsections. Combining formula (1.2) with (1.7), we obtain 2 r 2 = 0.64. z + 1 2 (true 0 )2 z 2 2 true 0 2
z +z 0 true 2

eAB (, ) =

true 0

lim

That means, in the sense of this asymptotic result, you need only 2/3 of the sample size of the sign test for a parametric (t-)test with the same power and signicance level.

1.4

Group actions and invariant tests

So far, we just saw an example how to construct a test without having to assume specic properties of the underlying distributions. In this section, we want to make precise what we mean by the statement that the full information contained in the observation building up the sample is reduced to a few

12

CHAPTER 1. RANK TESTS

characteristic features. Reduction to a few features means that we actually decompose the set of random samples into subclasses of observations to which we assign the same reduced information. But that means, these subclasses form equivalence classes of observations in the sense that two samples are equivalent if we extract from them the same reduced amount of information. For example, two observations (1.3, 2, 3.6, 22) and (0.1, 0.1, 0.1, 0.1) are equivalent with respect to the reduction used for the sign test. For both observations, we extract the same information (, +, , +) for the signs of the observations. The next step will be to present one of the important construction principles for those equivalence relations which is particularly useful for statistical problems that are invariant under some transformation group acting on the space of random samples.

1.4.1

Group actions

In the sequel, we will think of a random sample always as an element x = (x1 , ..., xn ) Rn and G will always denote a group. Denition 2 An (eective) group action of the group G on Rn consists of a (one-to-one) identication of group elements g G with bijective maps g : Rn Rn such that : (i) the mapping associated to the neutral element e G is the identity e = id : Rn Rn , (ii) g, h G implies gh = g h , i.e. the composition of the associated maps respects the group multiplication. Given a group action, we decompose the sample space Rn into its orbits. The set of orbits can be parameterized by a maximal invariant mapping. Denition 3 Let x0 Rn . The orbit of x0 under the operation of the group G on Rn is given by orbG (x) := {g (x0 ) : g G}. A maximal invariant map is a one-to-one assignment of points in a parameter set J and the orbits of the G-action jG : J {orbG (x) : x Rn }. where the orbit space is given by Orb(G) := {orbG (x) : x Rn }.

1.4. GROUP ACTIONS AND INVARIANT TESTS First of all, we consider two examples:

13

1.4.2

Example 1. Permutations and order statistics

Let n be the group of permutations of n elements, i.e. n consists of all one-to-one maps : {1, ..., n} {1, ..., n}. n acts on Rn by permuting the components of the random sample, i.e. (x1 , ..., xn ) := (x(1) , ..., x(n) ). To determine the orbits of this action, we rst consider the case n = 2: The group 2 has two elements, namely e and (1, 2) = (2, 1). Thus e (x1 , x2 ) = (x1 , x2 ) and (x1 , x2 ) = (x2 , x1 ) and the orbits of the operation are given by orb (x) = {x} x 2 , {x, (x)} x 2 /

where 2 := {(x, x) : x R} R2 denotes the diagonal. Every orbit contains exactly one sample x = (x1 , x2 ) with x1 x2 . Thus, we obtain a one-to-one correspondence of the set of orbits with the set J := {x R2 : x1 x2 }. For the point x J which is associated to the orbit orb (x), we will use the notation
1 x = j (orb (x)) = (x(1) , x(2) ).

For general n, there is an analogous result. Lemma 1 (signicance of order statistics) Let n operate on Rn by permuting components. Then in every orbit, there is exactly one point x with x1 ... xn . Hence, the map jn : J Orb(n ) with jn (x) := orbn (x) and J := {x Rn : x1 ... xn } is a maximal invariant map. Proof: Exercise. Denition 4 (order statistics) Given a sample x = (x1 , ..., xn ), the corresponding sample 1 o(x1 , ..., xn ) := jn (orbn (x)) is called the order statistics of x. We will often write (x(1) , ..., x(n) ) := o(x1 , ..., xn ).

14

CHAPTER 1. RANK TESTS

1.4.3

Example 2. Monotone maps and rank statistics

Let M be the group of monotone maps, i.e. the maps f : R R which are continuous, surjective and strictly monotone in the sense that x > x implies f (x) > f (x ). Then, M acts on Rn componentwise, i.e. f (x) = (f (x1 ), ..., f (xn )). The group multiplication on M is given by the composition of maps.

Lemma 2 (signicance of rank statistics) Under the action of M on Rn , a point x Rn lies in the orbit of x Rn , i.e. x orbM (x) if and only if r(x) := (r1 (x), ..., rn (x)) = r(x ) = (r1 (x ), ..., rn (x )) where ri (x) := |{xj : 1 j n, xj xi }| denotes the rank of the component xi of x, i.e. the number of components (including xi ) which are less or equal than xi . Proof: (i) Let rst f be monotonous. Hence xj < xi implies f (xj ) < f (xi ) and ri (f (x)) := |{f (xj ) : 1 j n, f (xj ) f (xi )}| = |{xj : 1 j n, xj xi }| = ri (x). Thus, two sample vectors in the same orbit have the same ranks ri . (ii) Let x and x be given with ri (x) = ri (x ) and let (x(1) , ..., x(n) ) and (x(1) , ..., x(n) ) be the corresponding order statistics. Let
x(k+1) x(k)

qk := and

x(k+1) x(k)

0
n1

, x(k+1) = x(k) , else

(1.8)

u(t) :=
k=1

qk 1[x(k) ,x(k+1) ) (t).

1.4. GROUP ACTIONS AND INVARIANT TESTS Let now

15

x + x(1) x(1) , x x(1) x x + x(1) u(t)dt , x(1) < x x(n) . f (x) = (1) x+x x , x > x(n) (n) (n)

Then f is continuous, surjective and piecewise linear. f is hence strictly monotone i for all x (x(k) , x(k+1) ], k = 0, ..., n, the (left) derivative of f is strictly positive. Here we understand x(0) = , x(n+1) = +. But that follows from (x(k) , x(k+1) ] = x(k+1) > x(k) x(k+1) > x(k) (due to the rank condition) and formula (1.8). Denition 5 (rank statistics) Given a sample x = (x1 , ..., xn ), the corresponding sample r(x) := (r1 (x), ..., rn (x)) is called the rank statistics of x. To nd the maximal invariant map in this case, we have to nd every possible rank statistics. For that, consider integers 1 a1 < a2 < ...as n, write a = (a1 , ..., as ), and construct the sample r(a) = (r1 , ..., rn ) with r1 = a1 , ..., ra1 = a1 , ra1 +1 = a2 , ..., ra2 = a2 , ra2 +1 = a3 , ...., rn = as . The vector r is the rank statistics of some sample vector x Rn (for instance, he is his own rank statistics). All possible rank statistics are permutation from vectors obtained in this way. Hence, the rank statistics are in one-to-one correspondence with the set J :=
{1a1 <a2 <...as n,1sn}

orbn (r(a))

(1.9)

and jM (r) := orbM (r) Rn . Remark. Note that the consideration of rank statistics simplies drastically if we consider random samples for continuous random variables. In that case, we have that P (1i<jn : Xi = Xj ) = 0 and therefore P(r(x) = r) = 0 for all r orbn (r0 ) with r0 = (1, 2, 3, ..., n). /

16

CHAPTER 1. RANK TESTS We made precise what we will consider in the sequel as our way to reduce information: We consider two samples as equivalent, if they belong to the same orbit of a group action on sample space. The reduced amount of information that we assign to both of them is a representative of the corresponding orbit.

Example. We consider again the sign test example. Here, we have an action of the multiplicative group of n-tuples of positive real numbers (R+ )n := { = (1 , ..., n ) : i > 0} on Rn given by (, x) (1 x1 , ..., n xn ). With that denition, we see immediately that the two samples (1.3, 2, 3.6, 22) and (0.1, 0.1, 0.1, 0.1) lie both in the orbit of (1, 1, 1, 1).

1.4.4

Invariant tests

The reason why the group approach is useful is that there are often a priori group operations which should not aect the result of your test. Consider the following example: Example. Scientists in London and in Amsterdam want to test the hypothesis that the average height of the population in both countries is the same against the alternative that Dutch people are larger in average. The British scientists measure the height of about 1000 randomly chosen people in Great Britain and provide a list with heights in inches. The Dutch scientists measure the height of about 1000 randomly chosen people in the Netherlands and provide a list with heights in centimeters. Now the British scientists convert the Dutch list into inches and compare it to their own list, whereas the Dutch scientists convert the British list into centimeters and compare it to their own list. Both groups use the same test but with the observations measured in their own height scale. If the two tests would not come to the same conclusion, we would think that there is something wrong. Where is the group ? Well, we translate the statement that the decision should not depend on the length scale used to collect the data into the statement that the test decision is invariant with respect to the action of R+ on the space of observations Rn given by (, x) (x1 , ..., xn ) where > 0 and x = (x1 , ..., xn ) Rn , since changing the length scale (for instance by switching from inches to centimeters) means nothing else but multiplying all numbers in the sample by a constant conversion factor. Thus, we are lead to the following notion.

1.4. GROUP ACTIONS AND INVARIANT TESTS

17

Denition 6 (invariant test) A test with test statistic T and critical region C is called invariant, i the test decision is invariant under the action of the group G, i.e. T (g (x)) C T (x) C for all g G. An invariant test statistic can always be reduced to a test statistic on J. And this is how we will use the consideration above all the time in the sequel. The following proposition summarizes the relevant facts. Proposition 1 (i) Let T be a G-invariant test statistic and jG a maximal invariant map for the action of G on Rn . Then there is a map : J R such that 1 T (x) = jG (orbg (x)). (ii) Let T = 1 orbG be an invariant test statistic. Then the distribution G of T depends only on the distribution of 1 orbG (x). G If we have a priori information about the invariance of our testing problem with respect to some group action, it is natural to reduce the full information in the sample by considering two samples as equivalent if they belong to the same orbit. Every invariant test can then be written as a function on the orbit space alone. Remark. Please note that we were cheating quite a bit in this section. Except for part (ii) of Proposition 1 we can get along with it. But for this last statement, we need that we can choose the map to be measurable in order to transport the measure from the orbit space. For that, we have to require that the group action GRn Rn is a measurable map with respect to some suitably chosen sigma algebras on the respective sets. And even then it is a theorem, that jG and can be chosen to be measurable. Since this theorem holds under very weak conditions on the underlying spaces which are met in every situation that we consider in the sequel, we will completely ignore these problems in the sequel but have to be well aware of the fact that they are there. A possible rigorous version of the statements above can be obtained as follows: Let (X, X ) be a measurable space and G X X be a measurable group action. We denote the orbit space by X/G and by q : X X/G the

18

CHAPTER 1. RANK TESTS

orbit map q(x) := orbG (x). Then, q is a surjective map and also measurable, if X/G is equipped with the sigma algebra X/G := {B X/G : q 1 (B) X }. X/G is the maximal sigma algebra with that property. If T : X R is a G-invariant and (X , R )-measurable map then by invariance, T 1 (R ) q 1 (X/G ). That implies by Dynkins lemma that we can nd a measurable map : X/G R with T = q. The orbit space X/G plays thus the role of the parameter set.

1.5

A testing problem on domination

We will now apply Proposition 1 to a testing problem on domination with multiple applications. It also presents a class of examples where we are facing a natural operation of the group M. Denition 7 Let X and Y be random variables with cumulative distribution functions FX and FY , respectively. We say that X is stochastically larger than Y and write X Y i FY (t) FX (t) for all t R. We say that X is strictly stochastically larger than Y and write X Y i there is in addition one t R such that FY (t) = FX (t) . Remark. If you are puzzled by the inverse relation between the cumulative distribution functions, note that X being stochastically larger than Y is equivalent to P (X t) P (Y t) for all t R. The testing problem that we consider now is a two-sample problem: Test on domination. Let X, Y be continuous random variables. Construct on the basis of the two independent random samples (X1 , ..., Xn ) and (Y1 , ..., Ym ) a test for H0 : X = Y versus H1 : Y X. Why should one be interested in such tests ? Example. Suppose you are interested in the question whether the treatment with a certain medicant is eective or, alternatively, whether the patients suer some side eects. It is a dicult problem to measure these eects

1.5. A TESTING PROBLEM ON DOMINATION

19

on the base of a numerical evaluation. In order not having to go into this, we consider an example where there is a canonical numerical evaluation. We consider as side eect of a medicant (for instance a sedative) treated patients become very sleepy. Let there thus be two groups of patients, only one of which is treated with the medical in question. Let Y be the random variable sleeping time during the day of a patient treated with the medical and X be the random variable sleeping time of a patient not treated. We assume that there are cumulative distribution functions FX and FY and that they are continuous. If this assumption is reasonable or not depends among other things on the choice of the patients and is not a priori clear. Also, the independence of the measurements Xi and Yj depends heavily on the design of the experiment. But assuming that, the linguistic question Are treated patients more sleepy ? can be translated into testing H0 : X = Y against the alternative H1 : Y X independent on any assumption on the special shape of the distribution (except continuity). To construct now a test on domination, we will try to follow the ideas developed in the preceding section. Lemma 3 Let f M. Then (i) X Y is equivalent to f (X) f (Y ),

(ii) X = Y is equivalent to f (X) = f (Y ), (iii) X Y is equivalent to f (X) f (Y ).

Proof: (i) The maps f are invertible. Denoting the inverse map by f 1 , i.e. f f 1 = id, we have hence P(f (X) t) = P(X f 1 (t)) P(Y f 1 (t)) = P(f (Y ) t). (ii), (iii) follow analogously. Lemma 3 therefore means that hypothesis and alternative of our testing problem are invariant with respect to the action of M, or The test on domination considered above is invariant under the action of M.

20

CHAPTER 1. RANK TESTS

Thus, it seems also to be appropriate to look for an invariant decision rule. To get an invariant decision rule, it is appropriate to look for an invariant test statistic. But by Proposition 1 and Lemma 2, invariant test statistics only depend on the rank statistics r(X1 , ..., Xn , Y1 , ..., Ym ). Thus, as indicated in the previous section, the assumption of invariance of the test statistic under the action of a certain group, based on the lack of knowledge on the underlying distribution, immediately leads to a specic coarsening of the information obtained by the random sample. Thus, by Proposition 1, (i) All M-invariant decision rules for the test on domination are functions of the joint rank statistics r(X1 , ..., Xn , Y1 , ..., Ym ). But we can do even a bit better. First we recall the denition of suciency. Denition 8 Let (X1 , ..., Xn ) be a random sample of the random variable X with probability density function fX . Let S : Rn Rk be a map. S = S(X1 , ..., Xn ) is called sucient statistic, i for all other statistics T : Rn Rm , the conditional probability distribution given S = s denoted by fT |s (t) does not depend on fX . We now consider the rank statistic r(X1 , ..., Xn , Y1 , ..., Ym ) and compose it with the rank statistic of the ordered individual samples, (X(1) , ..., X(n) ) and (Y(1) , ..., Y(m) ), namely Denition 9 (joint ordered rank statistics) We call the expression (X1 , ..., Ym ) = r(o(X1 , ..., Xn ), o(Y1 , ..., Ym )) the joint ordered rank statistics. The set of all joint ordered rank statistics depends on the respective sample sizes and is denoted by R(n, m). Since the underlying random variables X and Y were assumed to be continuous, we have with probability one that X(1) < ... < X(n) , Y(1) < ... < Y(m) , X(i) = Y(j) simultaneously. Now we show that the joint ordered rank statistics are sufcient for the ordered rank statistics.

1.5. A TESTING PROBLEM ON DOMINATION

21

Lemma 4 Let s = (s1 , ..., sn ) Nn and t = (t1 , ..., tm ) Nm . Under the hypothesis FX = FY we have the following conditional probability P(r(X1 , ..., Ym ) = (s, t) | (X1 , ..., Ym ) = (s , t )) =
1 n! m!

if s = r(s) and t = r(t) else

(1.10)

Proof: We have for the joint probability P(r(X1 , ..., Ym ) = (s, t), (X1 , ..., Ym ) = (s , t )) P(r(X1 , ..., Ym ) = (s, t)) , if s = r(s), t = r(t) . = 0 , else Hence, the conditional probability is either 0 in the latter case, or in the former case given by the quotient P(r(X1 , ..., Ym ) = (s, t) | (X1 , ..., Ym ) = (r(s), r(t))) P(r(X1 ,...,Ym )=(s,t)) = P((X1 ,...,Ym )=(r(s),r(t))) . (1.11)

But now, under the hypothesis FX = FY the joint sample is distributed according to a product measure and thus, the probabilities are not altered by a permutation of the random variables in the joint sample. To be precise P(r(X1 , ..., Ym ) = (s, t)) = P(r(X(1) , ..., Y(m) ) = (s, t)) for all n+m . However, if we use a general permutation, we would interchange X- and Y -values and thus leave the orbit of the xed joint rank statistic. Hence we can only use permutations of the type = (1 , 2 ) n m where 1 n , 2 m . There are n! m! of them. Thus P(r(X1 , ..., Ym ) = (s, t), (X1 , ..., Ym ) = (s , t )) = n!1 n m P(r(X(1) , ..., Y(m) ) = (s, t)) m! 1 = n! m! P((X1 , ..., Ym ) = (s, t)). and inserting this into (1.11) yields the statement. In other words, the ordered rank statistic is sucient with respect to the rank statistics of the data. That means that in fact All M-invariant decision rules for the test on domination are functions of the joint ordered rank statistics (X1 , ..., Ym ). In order to determine the distribution of a test statistic under H0 , we have to determine the distribution of the joint ordered rank statistics under H0 .

22

CHAPTER 1. RANK TESTS

By reducing the problem of constructing a test by means of suciency and invariance, we thus nally arrive at a statement that any invariant decision rule only uses the information how the two samples (X1 , ..., Xn ) and (Y1 , ..., Ym ) interlace, i.e. if for instance n = 3, m = 2 and X1 < Y2 < X3 < X2 < Y1 it uses only the information represented by the shorthand xyxxy on the order of the observations of the two samples. Thus, so far without constructing a single test, we found out that the structure on invariant decisions for a test on domination is rather restricted. Remark. In the sequel, we will use interlacing patterns xyxx...xyx of the type considered above to represent elements of R(n, m) without further mentioning it. Now we compute the distribution of the joint ordered rank statistics. In particular, having in mind that both random variables are still assumed to be continuous such that the observations are all mutually dierent with probability one, we obtain: Lemma 5 Under H0 , the joint ordered rank statistics are equidistributed, i.e. for every R(n, m) we have 1 . (1.12) P() =
n+m m

Proof: (Exercise.) A interlacing pattern = xxy...yyx R(n, m) is determined by the positions of the xs in this pattern. There are n + m possible positions and you have to choose n of them for the x-values. The fact that all patterns have the same probability follows from permutation invariance under H0 . By Proposition 1, (ii), the distribution of any invariant decision rule T (R) is given by the distribution of T under (1.12).

1.6

A preliminary remark about the construction of critical regions

We saw now that all invariant tests of the hypothesis H0 : X = Y against H1 : Y X only use the information of the ordered rank statistics of the

1.7. CONSTRUCTION OF CRITICAL REGIONS

23

sample. But this immediately implies that there can be no uniformly best invariant tests of this hypothesis in the following sense: The set of possible alternative random variables Y with X Y is so large and consists of so many dierent types of distribution functions that there is no unique choice of a critical region which would be optimal to distinguish X from all these alternatives. This will be illustrated by the following example. Example. Let X be uniformly distributed on [0, 1]. We consider two possible random variables 0 Yi 1 which are contained in the alternative Y X, namely (it is recommended to draw the graphs of these functions) 1. Y1 with density f1 = 2(1[1/4,1/2) + 1[3/4,1) ), 2. Y2 with density f2 = 4(1[1/8,1/4) + 1[3/8,1/2) + 1[5/8,3/4) + 1[7/8,1) ), where 1[a,b) denotes the indicator function of the set [a, b). Suppose now that m = 5 and n = 6 and that we want to construct a critical region to a signicance level > 0 which is so large that for the last interlacing pattern that can be added to C, we have to choose one of the two alternatives 1 = xxxyyyxxxyy, 2 = xyxxxyyxxyy. In case the true distribution of Y is given by f1 , the rst choice is better, in case it is f2 , we better take the second pattern. Remark. The situation that you have to choose between 1 and 2 is not as articial as you might think. Both patterns have the same rank sum (see Denition 10 and the subsequent paragraph) of 36. Thus you can consider a Wilcoxon test together with a signicance level such that all interlacing patterns with rank larger than 36 belong to the critical region and you can choose exactly one more pattern with rank 36.

1.7

Construction of critical regions

In this subsection, we will establish a criterion for the construction of critical regions. First of all, when we x a signicance level of > 0, the critical

24

CHAPTER 1. RANK TESTS

region for an M-invariant test must be a subset C R(n, m) such that the type I error 1 P( C) = |C|
n+m m

is less or equal than . Here |C| denotes the cardinality of C. That implies that the maximum number of interlacing patterns that we are allowed to put into the critical region is already xed by that to |C| =
n+m m

(1.13)

where u denotes the largest integer less or equal to u. However, that brings us not even close to the answer of the question which of the interlacing patterns we should put into the critical region. For that we have to cross another conceptual gap. In the preceding subsection we already argued that there can be no unique optimal choice for the whole alternative. In the sequel, we ask the question which choices of the critical region are optimal for a given parametric alternative. Depending on this restricted alternative, we construct several tests which are optimal with respect to Pitmans asymptotic eciency. For simplicity, we will assume in the sequel that the cumulative distribution functions FX and FY are given by strictly positive and smooth densities fX and fY . Proposition 2 Let X1 , ..., Xn and Y1 , ..., Ym be two random samples, the rst one distributed according to FX and the second one according to FY . Then P( = (s, t)) = 1
n+m m

fY (Y(t1 ) )...fY (Y(tm ) ) fX (Y(t1 ) )...fX (Y(tm ) )

(1.14)

where the expectation is taken with respect to the probability distribution of the hypothesis. Proof: Let U() Rn+m the subset consisting of tuples (u1 , ...un , v1 , ..., vm ) Rm+n such that u1 < ... < ut1 1 < v1 < ut1 < ... < utm 1 < vm < utm < ... < un .

1.7. CONSTRUCTION OF CRITICAL REGIONS Then, under the distribution assumption above P( = (s, t)) =
U ()

25

du1 ...dun dv1 ...dvm fX (u1 )...fX (un )fY (v1 )...fY (vn ).

Let now = (1 , 2 ) 1 2 act on by () = (s1 1 , ..., s1 n , t2 1 , ..., t2 m ). Then by Fubinis theorem P( = (s, t)) 1
n+m m 1 2 U(())

= =

du1 ...dun dv1 ...dvm fX (u1 )...fY (vn ) du1 ...dun dv1 ...dvm fX (u1 )...fY (vn ),

1
n+m m 1 2 U(())

but the union U(()) = Rm+n N


1 2

diers from Rn+m only by a Lebesgue zero set which means that actually even P( = (s, t)) = But that implies P( = (s, t)) 1
n+m m Rm+n

1
n+m m Rn+m

du1 ...dun dv1 ...dvm fX (u1 )...fY (vn ).

du1 ...dun dv1 ...dvm fY (Y(t1 ) )...fY (Y(tm ) ) fX (Y(t1 ) )...fX (Y(tm ) )

fY (v1 )...fY (vm ) fX (u1 )...fX (vm ) fX (v1 )...fX (vm )

1
n+m m

where the expectation is taken with respect to the distribution on Rn+m with density fX (u1 )...fX (vm ) which is the distribution of the sample under the hypothesis FX = FY .

26

CHAPTER 1. RANK TESTS

So far, we considered how the distribution of the ordered rank statistics changes, if the true distribution of Y is given by fY . That reminds of the computation of a type II error in parametric statistics where you also have to know the true distribution. But due to the discussion in the preceding section, we still have to close a conceptual gap before we can actually construct critical regions. The example given in the preceding section indicates that there can be no optimal choice of a critical region covering all alternatives. In order to nevertheless construct critical regions for tests based on joint ordered rank statistics, we turn this around in the sense that we consider now critical regions which are optimal choices for very restricted, special alternatives which are provided by one-dimensional location families FY (x) = FX (x ). We now compute the type II error for small values of . Assuming that the density fX is suciently smooth, we Taylor-expand it getting fY (x) = fX (x ) = fX (x) where O() denotes the Landau symbol f () = O(2 ) lim |f ()| = C < . 2 fX (x) + O(2 ) x (1.15)

Lemma 6 Let C R(n, m) be a subset of joint ordered rank statistics serving as the critical region of rank test. Then we have for the type II error if the true distribution is the one of Y
m

()|=0 =
C j=1

1
n+m m

d ln fX (Ytj ) . dx

Proof: We have by (1.14) using the shorthand f = fY () = 1


C

P() 1
C n+m m

= 1

f (Yt1 )...f (Ytm ) fX (Yt1 )...fX (Ytm )

1.8. THREE TWO SAMPLE RANK TESTS

27

and thus using (1.15) and after interchanging dierentiation and integration ()|=0 =
tC m

1
n+m m

E
j=1

fX (Ytj ) fX (Ytj ) d ln fX (Ytj ) dx

=
C j=1

1
n+m m

The function () is expected to decrease for increasing > 0 and to be equal to 1 for = 0 . Thus, (0 ) is a measure for how fast the type II error decreases with increasing > 0 . If (0 ) is small, we observe a faster increase of the power of the test as deviates from 0 . So we obtain for small values of and for xed values of > 0: A critical region leading to the smallest asymptotic type II error at signicance level > 0 is given by a subset C R(n, m) with cardinality |C| for which
m n+m m

C j=1

d ln fX (Ytj ) dx

(1.16)

is maximal. Remark. This condition may still not be sucient to enforce uniqueness of the critical region for all levels of signicance > 0. It might still happen that to ll up the critical region, we have to make a choice among several interlacing patterns with the same value of the test statistic. In that case you can choose either one of the possible patterns or randomize over all possible critical regions thus obtained.

1.8

Three two sample rank tests

Now we will for the rst time benet from our preparations in the sense that we will now construct the rst real tests for the two sample problem under consideration. We start with

28

CHAPTER 1. RANK TESTS

Denition 10 (Wilcoxon two sample test) The Wilcoxon test is the test associated to the logistic distribution with cumulative distribution function Flog (x) = (1 + ex )1 . The test statistic is given by
m

TW (X1 , ..., Ym ) :=
j=1

tj

(1.17)

where tj is the rank of Xj in the joint ordered sample. To compute the critical region for a random sample X1 , ..., Xn , Y1 , ..., Ym and signicance level > 0, we have to compute by (1.16) the derivative of the logarithmic density ln flog (x) = ln ex (1 + ex )2 = x 2 ln 1 + ex , hence 2ex 1 + ex 2ex d ln flog (x) = 1 =2 1 dx 1 + ex 1 + ex 1 + ex 2 1 = 2Flog (x) 1 = 1 + ex

Thus, maximizing a sum of expectations E( d ln flog (Y(tj ) )) dx

is equivalent to maximizing E(Flog (Y(tj ) )). Lemma 7 If Z is a continuous random variable with cumulative distribution function FZ , then the random variable U = FZ (Z) is uniformly distributed on [0, 1]. Proof: Exercise. You may assume for simplicity that F is continuous and strictly increasing. The ith order statistic Z(i) of a random sample Z1 , ..., Zn is distributed by
n

P(Z(i) t) =
s=i

n s

FZ (t)s (1 FZ (t))ns

1.8. THREE TWO SAMPLE RANK TESTS which means in the special case of uniform random variables
n

29

P(U(i) t) =
s=i

n s

ts (1 t)ns =: F(i) (t).

That implies
1

E[Flog (Y(tj ) )] = EUtj =


0 1

t dF(tj ) (t)
1 1

= t F(tj ) (t) 0 By denition of the Beta function


1

F(tj ) (t)dt = 1
0 0

F(tj ) (t)dt.

B(x, y) =
0

tx1 (1 t)y1 dt

and the connection with the Gamma function B(x, y) = (x) (y) (x + y)

we have (under H0 where X and Y are identically distributed)


n+m

E[Flog (Y(tj ) )] = 1
s=tj n+m

n+m s n+m s

B(s + 1, n + m s + 1) s! n + m s! , (k + 1) = k! n + 1!

= 1
s=tj n+m

= 1
s=tj

1 tj = . n+m+1 n+m+1
=(s,t)C

Thus, maximizing (1.16) is equivalent to maximizing the test statistic m TW () :=


j=1

TW (), where

tj

is the rank sum of the observations coming from the Y -sample in the tuples in the citical region. To maximize this for the construction of a critical region for a given signicance level, we start with collecting those tuples with

30

CHAPTER 1. RANK TESTS

maximal rank sum and add subsequently those of the remaining tuples with largest rank sum. The problem is that, since rank-sums are equal for many tuples, you might have to make choices leading to the fact that the critical regions are not always uniquely dened. The critical region for the Wilcoxon test is a collection of tuples such that there is no tuple outside the critical region with a larger rank sum than any of the tuples inside. Remark. Usually, a simplied version of the rejection region is used. The hypothesis is rejected, if the rank sum (1.17) exceeds a given value. This value is chosen to be the minimal rank sum of all tuples chosen for the rejection region following the procedure described above. Thus, for certain values of > 0, the rejection region will be larger than the one constructed above. The next example is the Fisher-Yates test: Denition 11 (Fisher - Yates test) The Fisher-Yates test is the test associated to the normal distribution, i.e. to f, (x) = In that case, we have and thus
m

1 1 exp 2 (x )2 . 2 2

x d ln f, (x) = dx 2

j=1

Y(sj ) d 1 1 ln f, (Y(sj ) ) = E = E[W(sj ) ] dx

where the W(sj ) are order statistics of a sample of standard normal variables. The test statistics is hence
m

TF Y (X1 , ..., Ym ) =
j=1

EW(sj ) .

In order to construct the rejection region, we have to compute the expectations EW(sj ) . These numbers can only be calculated numerically, but there are clearly tables available for them.

1.8. THREE TWO SAMPLE RANK TESTS The critical region for the Fisher - Yates test is chosen in a similar way as for the Wilcoxon test, where tables for the expectation values are used. Often you can also see the following simplied version: The hypothesis is rejected if the test statistic exceeds a given value (Table) depending on signicance level and sample size n.

31

Another way to get hold of the EW(sj ) is to use the following asymptotic result which we will state without a proof. Theorem 1 Let X1 , ..., XN be independent identically distributed continuous random variables with cumulative distribution function F and density f . If f has a derivative at and f () > 0 then the density of ZN := N f ()(Y(kN ) ) pq

converges to that of N (0, 1) as N . Here = F 1 (p), q = 1 p and kN = N p or = N p + 1. Proof:[1], p. 191 . Using this result, we can construct a simplied but asymptotically equivalent version of the Fisher-Yates test. If we specialize to F (t) = (t), N = n + m the cumulative distribution function of standard normal variables, and to p = kN /(N + 1), we obtain that ZN := N 1 1 (p) 2 e (X(kN ) 1 (p)) pq 2

is approximately standard normal distributed. Hence X(kN ) is approximately normally distributed with expectation value EX(kN ) = 1 and standard deviation 2pq 1 (p) 2 e 0 N as N . Thus, for large values of N = n + m, we may approximate the expectation EX(kN ) of the order statistic in the Fisher-Yates test simply with kN 1 N +1 . That yields to the following simplied version of the Fisher-Yates test: (X(kN ) ) kN N +1

32

CHAPTER 1. RANK TESTS

Denition 12 (Van der Waerden X-test) The van der Waerden test is the rank test with test statistic
m

TX (X1 , ..., Ym ) =
j=1

tj n+m+1

The construction of the three tests imply dierent behavior with respect to Pitmans asymptotic eciency: The Fisher-Yates test was constructed using the location problem for the normal distribution (where we also considered no assumption on the variance). It is thus not surprising that it competes well with the t-test. In fact, Pitmans asymptotic eciency relative to the t-test is actually one. Since the Fisher-Yates and the van der Waerden X-test are very close for large sample size, their relative eciency is one as well. The Wilcoxon test is constructed being optimal in the location problem for the logistic distribution. In comparison to the t-test which always means in the situation where the t-test is designed for Pitmans relative eciency is 0.95. That means you need about ve percent more data to achieve the same asymptotic power.

1.9

Two sample problems and linear rank tests

So far, we were considering examples of non-parametric tests and how they were constructed. Now we aim at a more systematic point of view in order to classify problems and the corresponding tests. First of all, we restrict ourselves to the two sample problem, having two independent random samples X1 , ..., Xn and Y1 , ..., Ym . Again, we will assume if not stated otherwise, that the random variables under consideration are continuous and denote their distribution functions by FX and FY , respectively. Denition 13 (location- and scale parameter) Let P R and X,P be a family of random variables with cumulative distribution functions F,P . (i) is called location parameter, if there is one cumulative distribution function F such that F (x) = F (x )

1.9. TWO SAMPLE PROBLEMS AND LINEAR RANK TESTS for all P,

33

(ii) is called scale parameter, if there is one cumulative distribution function F such that F (x) = F (x/) for all P. Now we consider several testing problems applying the knowledge about tests on domination gained so far. To be precise, we test the hypothesis H0 : FX = FY against the following alternatives: (1) Tests on location. Under the assumption that FY (x) = FX (x ) is a location family, we consider the alternatives
+,,= H1,L : >, <, = 0.

(2) Tests on scale. Under the assumption that FY (x) = FX (x/) is a scale family, we consider the alternatives
+,,= H1,S : >, <, = 1.

The problem is approached by reformulating these tests as tests on domination. This works pretty well for tests on location but not without modications for general scale families. On the basis of our knowledge about tests on domination, we will construct some tests which are well adapted to the corresponding problems. We restrict ourselves to the class of linear rank tests and constructing such a test is equivalent to choose proper regression coecients. Note that how to choose these coecients properly is subject to some intuition that we basically get from our knowledge about the tests on domination. Let R := R(n, m) be again the set of all possible joint ordered rank statistics (=interlacing patterns) of two samples of size n, m, respectively. In order to dene the notion of linear rank statistics below, we will rst introduce an alternative presentation of some R. Another representation of interlacing patterns. Instead of interlacing structures of the type = (xxyx...yx), we consider R(X1 , ..., Ym ) = (R1 , ..., Rn+m )

34

CHAPTER 1. RANK TESTS

where Ri = 0 or 1 according to whether the ith component of the ordered rank statistic comes from an observation in the X-sample or from the Y -sample. For example = (xxyxxyy) in the old notation translates to R = (0, 0, 1, 0, 0, 1, 1) in the new one. The ith component of R will in the sequel be denoted by Ri = Ri (X1 , ..., Ym ).

Denition 14 (Linear rank statistic) A statistic T is called linear rank statistic if


n+m

T (X1 , ..., Ym ) :=
i=1 (T )

Ci Ri (X1 , ..., Ym ).

(T )

(1.18)

The real numbers Ci , i = 1, ..., n + m are called the regression coecients of T . Dierent choices for the regression coecients naturally lead to dierent tests. Lemma 8 Under H0 , we have for all i, j = 1, ..., n + m ERi = where N = n + m. Proof: By Lemma 5, we have P(Ri = 1) = |{ R(n, m) : Ri = 1}| P()
n+m m , N

Var(Ri ) =

mn , N2

mn Cov(Ri , Rj ) = N 2 (N 1) ,

= |{R {0, 1} =

n+m

:
s=1

Rs = m, Ri = 1}| P() m . m+n

n+m1 m1

P() =

Hence P(Ri = 0) = 1P(Ri = 1) = n/(n+m), the variables Ri are Bernoullidistributed with parameter p = m/(m + n). Expectation and variance are hence given by ERi = p =
m , N

Var(Ri ) = pq =

mn . N2

1.9. TWO SAMPLE PROBLEMS AND LINEAR RANK TESTS For the joint moments we have E[Ri Rj ] = P(Ri = 1, Rj = 1) = and thus for the covariance Cov(Ri , Rj ) = E[Ri Rj ] ERi , ERj = n+m2 m2 P() = m(m 1) . N (N 1)

35

m(m 1) m 2 mn . = 2 N (N 1) N N (N 1)

From that, we immediately conclude Proposition 3 Let T be a linear rank statistic. Then ET =
m N N k=1

Ck , Var(T ) =

(T )

mn N 2 (N 1)

N k=1

Ck

(T ) 2

N k=1

Ck

(T )

where N = n + m. Proof: Exercise. Proposition 4 Under H0 , the distribution of T is symmetric around its mean if there is some constant C R such that Ck for all k = 1, ..., N . Proof: Let R = (R1 , ..., RN ) and R = (RN , ..., R1 ) in R(n, m). We consider the map : R(n, m) R(n, m) given by (R) = (R ). The map is bijective with 2 = id. Hence
N (T )

+ CN k+1 = C

(T )

T (R) + T ((R)) =
k=1 N

Ck (Rk + RN k+1 ) (Ck


k=1 N (T )

(T )

+ CN k+1 ) Rk

(T )

= C
k=1

Rk = C m.

36

CHAPTER 1. RANK TESTS

Since is bijective and under H0 we have an equidistribution on the set of all ordered rank statistics, that implies P(T (R) = t) = P({R : T (R) = t}) = P(({R : T (R) = t})) = P(T (R ) = Cm t). That implies P(T (R) = Cm/2 + s) = P(T (R ) = Cm/2 s) and since by the above ET (R) = ET ((R)) = Cm/2, the distribution is symmetric around its mean. Example. For the Wilcoxon statistic
m

TW (X1 , ..., Ym ) =
j=1

tj ,

we have Ck
(W )

+ CN k+1 = k + N k + 1 = N + 1
N (W ) Ck k=1 N

(W )

and hence TW is symmetrically distributed around ETW m = N m = N k=


k=1

m (N + 1). 2

The value of the Wilcoxon statistic ranges from


m

TW
k=1

k=

m (m + 1) 2

in the case where the Y -values are the smallest to


N

TW
k=N m

k=

m (N + n + 1). 2

The symmetry of the distribution is important to construct the critical region for a two-sided alternative = 0 . That means: To construct a critical region given a signicance level > 0 we choose in the case of a one-sided alternative everything as before and in the case of a two-sided alternative, we choose ordered rank statistics with very small and very large value of the test statistic, distributing them with probability /2 on both sides. As before, this choice is in general not unique.

1.9. TWO SAMPLE PROBLEMS AND LINEAR RANK TESTS

37

1.9.1

Tests on Location

The location problem can also be interpreted as a problem on domination of random variables. > 0 for example implies that FX FY . All tests discussed in the preceding section are thus examples of linear rank tests on location and there is not much more to say except listing them. Example. (i) The Wilcoxon test is the linear rank test W with regression (W ) coecients Ck = k. (ii) The Fisher-Yates test is the linear rank test with (F Y ) regression coecients Ck = E(k) where (k) is the kth order statistic of a normal population. The Fisher-Yates test is also frequently called TerryHoeding test. (iii) The van der Waerden test is the linear rank test with (X) regression coecients Ck = 1 (k/(n + m + 1)) where is the cumulative distribution function of the normal distribution.

1.9.2

Tests on Scale

Tests on scale can not be interpreted as tests on domination without further modications. Example. (Tests on scale the positive case) The basic paradigm for a test on scale is the test on the variance of a normal distribution. For the nonparametric setup, we consider a scale family F (x) = F (x/) and test H0 : = 0 against the usual alternatives. In the positive case F (0) = 0, i.e. the underlying random variable is positive, we see that H1 : > 0 implies that x/ < x/0 and thus due to monotonicity of cumulative distribution functions F (x/) F (x/0 ). Hence, Y dominates X stochastically and we can apply the same tests as for locations again, since they were all based on domination. If the random variables under consideration are not strictly positive, we have to understand that dierences in scale also cause dierences in location. As an example, we consider the eect of scale change for an N (, 1)distributed random variable X. It turns out that in this case, the scale family FY (x) = F (x/) yields random variables Y distributed according to

38

CHAPTER 1. RANK TESTS

a N (, )-distribution. Thus, a scale change even aects the location parameter. It is therefore not obvious how to distinguish scale and location alternatives. However, that is dierent if the data are normalized such that the location parameter is zero, or, equivalently, if we parameterize the underlying family of distributions in a dierent way. If the location parameter is zero, we observe that the change of scale results in a picture, where for > 1, FX (x) FY (x) for x > 0 and FX (x) < FY (x) for x < 0. For < 1, it is the other way round. This observation will in the sequel serve as an intuition for how to choose the regression coecients for associated test on scale. First we formalize the normalization approach by the notion of location/scale family below. Denition 15 (location/scale family) A location/scale family is a family X, , R, > 0 of random variables such that there is a function F : R [0, 1] and x F, (x) = F are the cumulative distribution functions for all values of , . In the case of a location/scale family we now have some indication how to construct linear rank tests for the scaling problem. In the case that the hypothesis FX = FY or equivalently = 1 holds, we expect that the average rank for the observations in the X and in the Y sample is the same. Lemma 9 Under H0 , the expected average rank for the Y - and X-variables is (N + 1)/2 where N = n + m. Proof: Using the random variables Rk from the denition of linear test statistics, we may write the average rank (a random variable) of the values of the Y -sample as N 1 RY = k Rk . m k=1 Therefore, by Lemma 8, we have 1 ERY = m 1 k ERk = (N + 1). 2 k=1
N

1.9. TWO SAMPLE PROBLEMS AND LINEAR RANK TESTS

39

The same holds for the ranks of the X-samples due to the fact that in the calculation above the sample size m cancels. We will now construct two rank statistics based on an idea associated to the observation made above: We believe that if > 1, i.e. the distribution of Y is more dispersed than the distribution of X, interlacing patterns of the type yyyxxxxxxyyy where the observations in the Y -sample take the small and the large values are more likely to appear than under the hypothesis. Denition 16 (Mood test) The Mood test is a linear rank test on dispersion with test statistic given by
N

M :=
k=1

N +1 2

Rk

By the observation above, a large value for M would lead us to the conclusion that the Y -sample is more dispersed where small values would support the conclusion that X is more dispersed than Y . The rejection regions for the Mood-test are therefore given for the dierent alternatives by: H1 critical region >1 M > m+ <1 M < m m+ , /2 =1 M < m /2

M>

For the two-sided alternative, note that the coecients of the Mood-test satisfy the assumptions of Proposition 4. Remark. A variant of the idea of the Mood-test is provided by the so called Freund-Ansari-Bradley-David-Barton test, where the test statistic is given by
N

A :=
k=1

N +1 Rk . 2

Another test that is using the same basic idea is the Siegel-Tukey test. Here the weights for the dierent Y -ranks are just the integers from 1, ..., N = n + m arranged in a suitable manner such as (for instance for N even) k (ST ) Ck 1 2 ... 1 4 ... N/2 N/2 + 1 ... N 1 N N N 1 ... 3 2

40

CHAPTER 1. RANK TESTS

The idea is clearly the same as above, however large dispersions are now weighted with low regression coecients. Therefore the construction of the critical regions is the other way round.

Definition 17 (Siegel-Tukey test) The Siegel-Tukey test is the linear rank test on dispersion given by the test statistic

    S := Σ_{k=1}^N C_k^(ST) R_k

where

    C_k^(ST) = 2k,             k even, 1 ≤ k ≤ N/2,
               2k − 1,         k odd,  1 ≤ k ≤ N/2,
               2(N − k) + 2,   k even, N/2 < k ≤ N,
               2(N − k) + 1,   k odd,  N/2 < k ≤ N.
S takes large values if Y is less dispersed than X. Hence, the rejection regions for the various alternatives are given by:

    H1        critical region
    σ > 1     S < s_α^−
    σ < 1     S > s_α^+
    σ ≠ 1     S > s_{α/2}^+ or S < s_{α/2}^−

and the critical values are again tabulated. The special point about the Siegel-Tukey test is the following:

Remark. Under H0, the distribution of S is the same as that of the Wilcoxon statistic W_{n,m}. That was also the initial reason for this kind of reordering of the ranks: we do not even have to calculate a new table.
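To make the two dispersion statistics concrete, the following short Python sketch computes the Mood statistic M from Definition 16 and the Siegel-Tukey statistic S from Definition 17 for two given samples. It is only an illustration; the function names are ours, and R_k is realized as the indicator that the observation of rank k belongs to the Y-sample.

    def siegel_tukey_score(k, N):
        """The score C_k^(ST) from Definition 17, for 1 <= k <= N with N even."""
        if k <= N // 2:
            return 2 * k if k % 2 == 0 else 2 * k - 1
        return 2 * (N - k) + 2 if k % 2 == 0 else 2 * (N - k) + 1

    def dispersion_statistics(xs, ys):
        """Mood statistic M and Siegel-Tukey statistic S for two samples."""
        N = len(xs) + len(ys)
        pooled = sorted([(x, False) for x in xs] + [(y, True) for y in ys])
        M = sum((k - (N + 1) / 2) ** 2
                for k, (_, is_y) in enumerate(pooled, start=1) if is_y)
        S = sum(siegel_tukey_score(k, N)
                for k, (_, is_y) in enumerate(pooled, start=1) if is_y)
        return M, S

    # sanity check: for N even the Siegel-Tukey scores are a permutation of
    # 1..N, which is why S has the same null distribution as the Wilcoxon
    # statistic (cf. the Remark above and Exercise 19)
    N = 10
    assert sorted(siegel_tukey_score(k, N) for k in range(1, N + 1)) == list(range(1, N + 1))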

1.9.3 The distribution of the Wilcoxon test statistic

We now start a little detour about the business of exact distributions. Asymptotic distribution results and the use of tables for these probabilities are still very useful. However, since computers became easily accessible, it is also possible to calculate the exact distribution of a test statistic for finite sample sizes. As an example of a recursion relation which can easily be implemented, we consider the exact distribution of the Wilcoxon statistic for the two sample problem with samples of sizes n and m, respectively. We denote the corresponding statistic by W_{n,m}.

Lemma 10 Let P_{n,m}(k) = P(W_{n,m} = k). Then we have the following recursive formula:

    (m + n) P_{n,m}(k) = m P_{n,m−1}(k − N) + n P_{n−1,m}(k).   (1.18)

Proof: Let L_{n,m}(k) denote the number of interlacing patterns with rank sum k, so that P_{n,m}(k) = L_{n,m}(k)/C(n+m, m), since under H0 all C(n+m, m) patterns are equally likely. Using the binomial identities C(N−1, m−1) = (m/N) C(N, m) and C(N−1, m) = (n/N) C(N, m), the statement is proved if we show that

    L_{n,m}(k) = L_{n,m−1}(k − N) + L_{n−1,m}(k).   (1.19)

This corresponds to the decomposition of the ordered rank statistics with rank sum k into
(a) those where there is some i with r(Y_i) = N (there are L_{n,m−1}(k − N) of them),
(b) those where there is some i with r(X_i) = N (there are L_{n−1,m}(k) of them).

The preceding lemma gives, in principle, all the information that is needed to write a program that calculates all probabilities P_{n,m}(k); a sketch of such a program follows below (cf. Exercise 17).
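The following Python sketch implements the count recursion (1.19) directly (Exercise 17 suggests R or C; the language is immaterial). The function names and the memoization via lru_cache are our choices, not part of the notes.

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def count(n, m, k):
        """L_{n,m}(k): number of interlacing patterns of n X's and m Y's whose
        Y-ranks sum to k; recursion on whether rank N = n + m is a Y or an X."""
        if k < 0:
            return 0
        if m == 0:
            return 1 if k == 0 else 0
        if n == 0:
            # all ranks 1..m belong to Y, so the rank sum is forced
            return 1 if k == m * (m + 1) // 2 else 0
        N = n + m
        return count(n, m - 1, k - N) + count(n - 1, m, k)

    def wilcoxon_pmf(n, m):
        """Exact null distribution P(W_{n,m} = k) of the Wilcoxon rank sum."""
        total = comb(n + m, m)
        kmin = m * (m + 1) // 2            # Y's occupy the smallest ranks
        kmax = m * (2 * n + m + 1) // 2    # Y's occupy the largest ranks
        return {k: count(n, m, k) / total for k in range(kmin, kmax + 1)}

    # example: an exact tail probability for small samples
    pmf = wilcoxon_pmf(3, 2)
    print(sum(p for k, p in pmf.items() if k >= 8))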

1.10 Asymptotic Normality

For large values of N, the distributions of linear rank statistics are all approximately normal. Expectation and variance of the rank statistics were already computed in Proposition 3. Thus, we ask whether for

    T_N = Σ_{k=1}^N C_k^N R_k


and sample sizes m_N + n_N = N, the distribution of

    Z_N := ( T_N − (m_N/N) Σ_{k=1}^N C_k^N ) / sqrt( (m_N n_N)/(N²(N−1)) · [ N Σ_{k=1}^N (C_k^N)² − (Σ_{k=1}^N C_k^N)² ] )   (1.20)
converges to a suitable limit as N → ∞. The problem is that the random variables R_k are not independent. However, they are Bernoulli variables with covariance

    Cov(R_i, R_j) = −(m_N n_N)/(N²(N−1)),   i ≠ j.

Under the assumption m_N/N → λ, 0 < λ < 1, we have that

    Cov(R_i, R_j) ≈ −λ(1−λ)/(N−1) → 0

as N tends to infinity. That implies that the variables R_i and R_j are asymptotically uncorrelated, and for Bernoulli variables uncorrelated means independent (cf. Exercise 20). This is the reason why we obtain the following result using a central limit theorem for dependent variables. To compare linear rank statistics for different values of N, we assume the existence of a function φ : [0, 1] → R such that

1. φ is either nondecreasing on [0, 1], or nonincreasing on [0, a] and nondecreasing on [a, 1] for some 0 < a < 1,

2. 0 < ∫₀¹ (φ(t) − φ̄)² dt < ∞, where φ̄ = ∫₀¹ φ(t) dt.

Furthermore, we assume that the regression coefficients C_k^N for the rank tests for different values of N are given by

    C_k^N = φ(k/(N+1)).

Under these assumptions, we can prove the following asymptotic statement.


Theorem 2 (Asymptotic normality) For every ε > 0, there exists an M = M(ε) such that for all N with min{m_N, n_N} > M we have

    sup_{t∈R} |P(Z_N ≤ t) − Φ(t)| < ε

where Φ denotes the cumulative distribution function of the standard normal distribution.

Proof: see [4], Theorem 4B, p. 15.

Example. To illustrate the use of the function φ, we consider the Wilcoxon statistic T_W (unlike above, we suppress the N-dependence in the notation), where the regression coefficients are given by C_k^(W) = k. Normalizing this to

    T'_W = T_W/(N+1) = Σ_{k=1}^N (k/(N+1)) R_k,

we can use the function φ(x) = x in the theorem above. By

    E(T_W) = m_N (N + 1)/2,   Var(T_W) = m_N n_N (N + 1)/12,

we obtain asymptotic normality of

    Z_N = (T_W − E T_W)/√Var(T_W) = ( T_W − m_N(N+1)/2 ) / sqrt( m_N n_N (N+1)/12 ).
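A quick Monte Carlo check of this normalization is easy to set up. The following Python sketch (our own illustration, with hypothetical function names) samples two groups from the same continuous distribution, standardizes the Wilcoxon rank sum as above, and compares the result with the standard normal distribution.

    import random
    import statistics

    def wilcoxon_rank_sum(xs, ys):
        """Sum of the ranks of the ys in the pooled sample (no ties, a.s.)."""
        pooled = sorted(xs + ys)
        return sum(pooled.index(y) + 1 for y in ys)

    def standardized_w(n, m, trials=20000):
        """Monte Carlo sample of Z_N under H0."""
        N = n + m
        mean = m * (N + 1) / 2
        sd = (n * m * (N + 1) / 12) ** 0.5
        zs = []
        for _ in range(trials):
            xs = [random.random() for _ in range(n)]
            ys = [random.random() for _ in range(m)]
            zs.append((wilcoxon_rank_sum(xs, ys) - mean) / sd)
        return zs

    zs = standardized_w(n=30, m=30)
    print(statistics.mean(zs), statistics.pstdev(zs))   # approximately 0 and 1
    print(sum(z <= 1.0 for z in zs) / len(zs))          # approximately Phi(1) = 0.8413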


Chapter 2 Goodness of Fit


2.1 A functional limit theorem

Another application of nonparametric statistics is to test whether a given distribution really is the underlying distribution of a given random variable. The basic fact underlying this kind of analysis is the convergence of the empirical cumulative distribution function to the true one. Let thus once again X_1, ..., X_n be independent identically distributed random variables with cumulative distribution function F. Note that we can drop the assumption of continuity of F in this whole section.

Definition 18 (empirical distribution function) The empirical distribution function of a random sample X_1, ..., X_n is given by

    F_n(x) := (1/n) Σ_{i=1}^n 1_{[X_i, ∞)}(x).

Remark. We can also write F_n in the form

    F_n(x) = (1/n) |{1 ≤ i ≤ n : X_i ≤ x}|.

In the sequel, let X_1, X_2, ... be an i.i.d. sequence of random variables with cumulative distribution function F.


That the empirical distribution function converges pointwise to the true cumulative distribution function F is a consequence of the strong law of large numbers. Let t ∈ R be fixed and

    Y_i = 1_{(−∞, t]}(X_i) = { 0, X_i > t;  1, X_i ≤ t }.

Then the random variables Y_1, Y_2, ... are i.i.d. Bernoulli-distributed with E Y_i = P(X_i ≤ t) = F(t). Hence

    F_n(t) = (1/n) |{1 ≤ i ≤ n : X_i ≤ t}| = (1/n) Σ_{i=1}^n 1_{(−∞, t]}(X_i) = (1/n) Σ_{i=1}^n Y_i

and by the strong law of large numbers

    P( lim_{n→∞} F_n(t) = F(t) ) = P( lim_{n→∞} (1/n) Σ_{i=1}^n Y_i = E Y_1 ) = 1.

The next statement shows that this is even true in a much stronger, uniform sense.

Theorem 3 (Glivenko-Cantelli) Let (X_i)_{i≥1} be a sequence of i.i.d. random variables with distribution function F. Let

    d_n := sup_{x∈R} |F_n(x) − F(x)|.

Then P( lim_{n→∞} d_n = 0 ) = 1.

Proof: See e.g. [3], 11.4.2, p. 314.

Remark. The metric

    d(F, G) := sup_{x∈R} |F(x) − G(x)|

on the space of distribution functions is called the Kolmogorov distance.
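The Glivenko-Cantelli theorem is easy to observe numerically. The following Python sketch (an illustration under the assumption F = Unif(0,1); the function name is ours) computes the Kolmogorov distance d_n for growing n; for a continuous F the supremum is attained at the jump points of F_n.

    import random

    def kolmogorov_distance(sample, cdf):
        """d_n = sup_x |F_n(x) - F(x)|, evaluated at the jumps of F_n."""
        xs = sorted(sample)
        n = len(xs)
        d = 0.0
        for i, x in enumerate(xs, start=1):
            # compare F with the empirical cdf just before and at the jump
            d = max(d, abs(i / n - cdf(x)), abs((i - 1) / n - cdf(x)))
        return d

    for n in (10, 100, 1000, 10000):
        sample = [random.random() for _ in range(n)]
        print(n, kolmogorov_distance(sample, lambda x: x))  # decreases to 0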


If there is a strong law, it is reasonable to expect that there is also a central limit theorem. Let us again consider the situation for fixed t ∈ R: we already saw that E Y_i = F(t). Since Y_i is a Bernoulli variable, we obtain for the variance Var Y_i = F(t)(1 − F(t)). Hence, by the central limit theorem, the random variable

    Z_n(t) := √n {F_n(t) − F(t)} = √n ( (1/n) Σ_{i=1}^n Y_i − E[(1/n) Σ_{i=1}^n Y_i] )

converges in distribution to a normal variable with mean zero and variance F(t)(1 − F(t)). Also for this statement, there is a much stronger uniform version. In order to make that understandable, we first calculate the covariance structure Cov(Z_n(t), Z_n(s)).

Lemma 11 Let s, t ∈ R and 1 ≤ i, j ≤ n, and denote by Y_i(s), Y_j(t) the Bernoulli variables constructed as above for the thresholds s and t, respectively. Then

    E Y_i(s) Y_j(t) = δ_ij F(min{s, t}) + (1 − δ_ij) F(s) F(t),

where the Kronecker symbol is given by δ_ij = 1 for i = j and δ_ij = 0 for i ≠ j.

Proof: We have

    E Y_i(s) Y_j(t) = E[ 1_{(−∞, s]}(X_i) 1_{(−∞, t]}(X_j) ] = P(X_i ≤ s, X_j ≤ t)
                    = P(X_i ≤ min{s, t}) = F(min{s, t})       for i = j,
                    = P(X_i ≤ s) P(X_j ≤ t) = F(s) F(t)       for i ≠ j.

That implies for the covariances Cov(Z_n(t), Z_n(s)) = E(Z_n(t) − E Z_n(t))(Z_n(s) − E Z_n(s)) the following statement:


Lemma 12 The covariance structure is given by

    Cov(Z_n(t), Z_n(s)) = F(min{s, t}) − F(s) F(t).

Proof: By Lemma 11, we have

    n E[(F_n(t) − F(t))(F_n(s) − F(s))]
    = n E[ (1/n²) Σ_{i,j=1}^n Y_i(t) Y_j(s) − (1/n) Σ_{i=1}^n Y_i(t) F(s) − (1/n) Σ_{j=1}^n Y_j(s) F(t) + F(s) F(t) ]
    = (1/n) Σ_{i,j=1}^n { δ_ij F(min{s, t}) + (1 − δ_ij) F(s) F(t) } − n F(s) F(t)
    = F(min{s, t}) − F(s) F(t).

For every value of t ∈ R we can thus determine the limit random variable X_t by the central limit theorem. But as in the case of the Glivenko-Cantelli theorem, there is a corresponding uniform statement about the weak convergence of the difference between empirical and cumulative distribution function to a Gaussian stochastic process. This is an instance of a functional limit theorem.

Definition 19 (Gaussian process) A stochastic process (X_t)_{t∈R} is called Gaussian if for every finite index set t = (t_1, ..., t_n) the vector valued random variable X_t = (X_{t_1}, ..., X_{t_n}) is Gaussian. If E X_t = 0 for all t ∈ R, the Gaussian process is called centered.

Remark. A centered Gaussian process is uniquely determined by its covariance structure C(s, t) = Cov(X_s, X_t) = E X_s X_t.

Theorem 4 (Donsker) As n → ∞, the empirical process X_t^(n) := √n {F_n(t) − F(t)} converges in distribution to a centered Gaussian process X_t with covariance structure Cov(X_s, X_t) = F(min{s, t}) − F(s) F(t).

Remark. By monotonicity of cumulative distribution functions, we have

    F(min{s, t}) = min{F(s), F(t)}.

By that, we can actually identify the limiting process. The centered Gaussian process (b_t)_{t∈[0,1]} with covariance structure C(s, t) = min{s, t} − st is called the standard Brownian bridge. Thus, we can write X_t more explicitly as

    X_t = b_{F(t)}.

Some remarks on the proof of Donsker's theorem. Convergence of X_t^(n) to X_t in finite dimensional distributions follows from the multidimensional central limit theorem for the i.i.d. random vectors Y_i := (Y_i(s_1), ..., Y_i(s_k)) for every finite s_1 < ... < s_k, with

    Y_i ∈ {(0, ..., 0), (0, ..., 0, 1), (0, ..., 0, 1, 1), ..., (1, ..., 1)}

and

    P( Y_i = (1_{{r ≥ l}})_{r=1,...,k} ) = P( X_i ∈ (s_{l−1}, s_l] ) = F(s_l) − F(s_{l−1}),

where we use the conventions s_0 = −∞ and s_{k+1} = ∞. Note that this does not provide a full proof of weak convergence of the processes. For that, we would also have to discuss tightness of the approximating sequence X_t^(n). For a full proof, see the original article [7].

2.2 The Kolmogorov-Smirnov test

The significance of Theorem 4 is more important for us than its proof. It means that as n tends to infinity, or approximately for large values of n, we have

    sup_{t∈R} |X_t^(n)| := sup_{t∈R} √n |F_n(t) − F(t)| → sup_{t∈R} |b_{F(t)}| = sup_{s∈[0,1]} |b_s|.   (2.1)


And this statement is independent of F. Thus, this result fits into our general strategy in nonparametric statistics. Even the distribution of the supremum of the modulus of a Brownian bridge can be computed more or less explicitly.

Theorem 5 (Supremum of the modulus of the Brownian bridge) We have

    P( sup_{s∈[0,1]} |b_s| ≥ a ) = 2 Σ_{n≥1} (−1)^{n+1} exp(−2n²a²).   (2.2)

Proof: See e.g. [3], 12.3.4, p. 364.

That implies the basic idea behind the so-called Kolmogorov-Smirnov goodness of fit test.

Corollary 1 (Kolmogorov-Smirnov) For every z ≥ 0

    lim_{n→∞} P( sup_{t∈R} |F_n(t) − F(t)| ≤ z/√n ) = L(z)

where

    L(z) := 1 − 2 Σ_{k≥1} (−1)^{k+1} exp(−2k²z²).
Another consequence of Theorem 4 is that we can also calculate the limiting distributions of other related statistics. For instance, let

    D_n^+ := sup_{t∈R} ( F_n(t) − F(t) )   (2.3)

without the absolute value. By Theorem 4 we have, as in (2.1), for large values of n

    √n D_n^+ → sup_{t∈R} b_{F(t)} = sup_{s∈[0,1]} b_s.

But also the distribution of the supremum of the Brownian bridge is known:

Theorem 6 (Suprema of Brownian bridges) We have

    P( sup_{s∈[0,1]} b_s ≥ a ) = e^{−2a²}.

Proof: See e.g. [3], 12.3.5, p. 365. That implies:

Corollary 2 For every z ≥ 0 we have

    lim_{n→∞} P( 4n (D_n^+)² ≤ z ) = χ²₂(z)

where χ²₂ denotes the cumulative distribution function of a χ²-distribution with two degrees of freedom.

Proof: Note first that b_0 = b_1 = 0 implies that

    sup_{s∈[0,1]} b_s ≥ 0,

which you can (almost surely) also conclude from Theorem 6 with a = 0. Hence, we also have for z ≥ 0 that

    4n (D_n^+)² ≤ z  if and only if  2√n D_n^+ ≤ √z,

since D_n^+ ≥ 0 for the same reason. That implies, again by Theorem 6,

    lim_{n→∞} P( 4n (D_n^+)² ≤ z ) = lim_{n→∞} P( √n D_n^+ ≤ √z/2 ) = 1 − e^{−2(√z/2)²} = 1 − e^{−z/2} = χ²₂(z).

From these considerations, we may now derive goodness of fit tests for all the relevant alternatives. Consider the hypothesis H0: F = F_0 that the random variables in the sample are distributed with cumulative distribution F_0. To test it against one of the alternatives H1: F ≥ F_0, F ≤ F_0 or F ≠ F_0, we use the test statistics

    D_n := sup_{t∈R} |F_n(t) − F_0(t)|,   D_n^{+/−} := sup_{t∈R} ±( F_n(t) − F_0(t) ).

Under H0, the asymptotic distributions of these test statistics are given by Corollaries 1 and 2. The asymptotic distributions of D_n^+ and D_n^− coincide, since they are given by the suprema of the Brownian bridge b_s and the process −b_s, respectively, and it is easy to see that these two processes are centered Gaussian with the same covariance structure, so their distributions are identical.


Definition 20 (Kolmogorov-Smirnov goodness of fit test) The Kolmogorov-Smirnov test to a significance level α > 0 is given by

    H1         test statistic   critical region
    F ≠ F_0    D_n              D_n > D_{n,α}
    F ≥ F_0    D_n^+            D_n^+ > D^+_{n,α}
    F ≤ F_0    D_n^−            D_n^− > D^+_{n,α}

where the critical values are given by D_{n,α} := min{z ≥ 0 : L(z) ≥ 1 − α}/√n and D^+_{n,α} := min{z ≥ 0 : e^{−2z²} ≤ α}/√n, respectively.

Remark. The Kolmogorov-Smirnov statistic can not be written as a linear rank statistic, but as the maximum of a finite number of linear rank statistics (cf. [4], p. 62).

An analogous test can be performed in the two sample case. Let X_1, ..., X_n, Y_1, ..., Y_m be independent random samples, and we want to test the hypothesis H0: F_X = F_Y against the three relevant alternatives. Denote by F_n and F_m the empirical distribution functions of the X- and the Y-sample, respectively. Under H0, we have by the preceding considerations (F_X = F_Y = F)

    √n {F_n(t) − F(t)} → b^(1)_{F(t)},   √m {F_m(t) − F(t)} → b^(2)_{F(t)}

in distribution as m and n tend to infinity by Theorem 4. Consider now the statistic

    D_{n,m} := sup_{t∈R} |F_n(t) − F_m(t)|.

Due to

    √(nm/(n+m)) ( F_n(t) − F_m(t) )
    = √(nm/(n+m)) ( F_n(t) − F(t) + F(t) − F_m(t) )
    = √(m/(n+m)) √n (F_n(t) − F(t)) − √(n/(n+m)) √m (F_m(t) − F(t)),

which converges for m, n → ∞ and m/n → c > 0 to

    ξ_{F(t)} = (1/√(1 + 1/c)) b^(1)_{F(t)} − (1/√(1 + c)) b^(2)_{F(t)},


where b^(1) and b^(2) are independent Brownian bridges. Since ξ is a sum of independent centered Gaussian processes, it is also a centered Gaussian process, with covariance structure

    E(ξ_s ξ_t) = (c/(1+c)) E(b^(1)_s b^(1)_t) + (1/(1+c)) E(b^(2)_s b^(2)_t) = min{s, t} − st,

and hence is a standard Brownian bridge, too. That implies:

Lemma 13 We have

    (i)  lim_{m,n→∞, m/n→c>0} P( √(nm/(n+m)) D_{n,m} ≤ z ) = L(z),
    (ii) lim_{m,n→∞, m/n→c>0} P( √(nm/(n+m)) D^+_{n,m} ≤ z ) = 1 − e^{−2z²},

where D^+_{n,m} := sup_{t∈R} ( F_n(t) − F_m(t) ).

Proof: Exercise.

Definition 21 (Kolmogorov-Smirnov two sample test) The Kolmogorov-Smirnov two sample test to a significance level α > 0 is given by

    H1           test statistic              critical region
    F_X ≠ F_Y    √(nm/(n+m)) D_{n,m}         √(nm/(n+m)) D_{n,m} > D_α
    F_X ≥ F_Y    √(nm/(n+m)) D^+_{n,m}       √(nm/(n+m)) D^+_{n,m} > D^+_α
    F_X ≤ F_Y    √(nm/(n+m)) D^−_{n,m}       √(nm/(n+m)) D^−_{n,m} > D^+_α

where the critical values D_α := min{z ≥ 0 : L(z) ≥ 1 − α} and D^+_α := min{z ≥ 0 : e^{−2z²} ≤ α} are the same quantiles as in Definition 20 (without the factor 1/√n, which here is already contained in the test statistic). Here D^−_{n,m} := sup_{t∈R} ( F_m(t) − F_n(t) ), for which the statement analogous to Lemma 13 (ii) holds.

Remark. Due to the invariance principle, Theorem 4, we can base a goodness of fit test just as well on other functionals of the distance between the empirical process and the limit. For instance, the Kuiper test considers the functional

    D_n^K := D_n^+ + D_n^−.

By Theorem 4, √n D_n^K is asymptotically distributed as

    D^K = sup_{s∈[0,1]} b_s − inf_{u∈[0,1]} b_u,

where the supremum and the infimum are taken along the same path of the Brownian bridge. The joint distribution of (sup b_s, inf b_u) can also be calculated explicitly, and the statement about the distribution of D^K then follows from the continuous mapping principle, which we will use more explicitly in the next section.
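In practice the two sample statistic only has to be evaluated at the pooled sample points. A minimal Python sketch (our own function name; it reuses ks_critical from the sketch after Corollary 1):

    import bisect

    def ks_two_sample(xs, ys):
        """D_{n,m} = sup_t |F_n(t) - F_m(t)|, evaluated over the pooled sample."""
        xs_s, ys_s = sorted(xs), sorted(ys)
        n, m = len(xs_s), len(ys_s)
        d = 0.0
        for t in xs_s + ys_s:
            fn = bisect.bisect_right(xs_s, t) / n
            fm = bisect.bisect_right(ys_s, t) / m
            d = max(d, abs(fn - fm))
        return d

    # reject H0: F_X = F_Y at level alpha if
    #   sqrt(n*m/(n+m)) * ks_two_sample(xs, ys) > ks_critical(alpha)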


2.3 The Chi-square idea

Another idea to construct a test on goodness of fit is the following: we have a sample x = (x_1, ..., x_n) of observations on the real line, but we decompose the real line into bins I_k := [t_k, t_{k+1}), k = 0, ..., K, where t_0 = −∞ < t_1 < ... < t_K < t_{K+1} = ∞. Then the number of observations n_k in I_k is given by

    n_k = n ( F_n(t_{k+1}) − F_n(t_k) )

and the expected number of observations in I_k is given by

    N_k = n P(X ∈ I_k) = n ( F(t_{k+1}) − F(t_k) ) = n p_k,

where p_k = P(X ∈ I_k) is the probability that X falls into the k-th bin. Consider now, assuming that the t_k are chosen in a way such that N_k > 0 for all k:

    (n_k − N_k)²/N_k = [ n(F_n(t_{k+1}) − F_n(t_k)) − n(F(t_{k+1}) − F(t_k)) ]² / ( n(F(t_{k+1}) − F(t_k)) )
                     = [ √n (F_n(t_{k+1}) − F(t_{k+1})) − √n (F_n(t_k) − F(t_k)) ]² / ( F(t_{k+1}) − F(t_k) ).

In the sequel, the main idea to measure the deviation from the hypothesis will be to consider test statistics consisting of terms of the type

    (observed − expected)² / expected.

By Theorem 4, we have that

    √n (F_n(t_{k+1}) − F(t_{k+1})) − √n (F_n(t_k) − F(t_k)) → b_{F(t_{k+1})} − b_{F(t_k)}

in distribution. Now by the

Continuous mapping principle. If a sequence X_n of random variables converges in law to another random variable X and if ψ is a continuous map, then also ψ(X_n) → ψ(X) in distribution


(for an exact statement and a proof see for instance [3], 9.3.7, p. 232), we obtain that

    (n_k − N_k)²/N_k → ( b_{F(t_{k+1})} − b_{F(t_k)} )² / ( F(t_{k+1}) − F(t_k) )

as n tends to infinity, since

    ψ(x) = x² / ( F(t_{k+1}) − F(t_k) )

is continuous. Let now T_k := F(t_k). Then by monotonicity of the cumulative distribution function

    0 = T_0 ≤ T_1 ≤ ... ≤ T_{K+1} = 1,

and again applying the continuous mapping principle, this time to the map ψ(x_0, ..., x_K) = Σ_{k=0}^K x_k, we obtain:

Corollary 3 As n → ∞, the random variable

    S_n² := Σ_{k=0}^K (n_k − N_k)²/N_k

converges to

    S² := Σ_{k=0}^K ( b_{T_{k+1}} − b_{T_k} )² / ( T_{k+1} − T_k )

in distribution. Here N_k = n p_k.

In the sequel, we will use the following two facts about the χ²-distribution for integer values n, m ∈ N of the number of degrees of freedom:

(i) Let X_1, ..., X_n ~ N(0, 1) be independent standard normal variables. Then the sum X := X_1² + ... + X_n² is χ²_n-distributed.

(ii) Let X ~ χ²_n, Y ~ χ²_m be independent. Then X + Y ~ χ²_{n+m}.


The distribution of S² is given by (note that there are K + 1 bins):

Lemma 14 The distribution of S² is a χ²-distribution with K degrees of freedom.

Proof: First, we calculate the covariance structure of the centered Gaussian variables

    X_k := ( b_{T_{k+1}} − b_{T_k} ) / √( T_{k+1} − T_k ),   k = 0, ..., K.

That yields

    C_kl = E X_k X_l = δ_kl − √( (T_{k+1} − T_k)(T_{l+1} − T_l) ).   (2.4)

In particular, the increments of a Brownian bridge are far from being independent. It thus seems that we cannot use characterization (i) of the χ²-distribution above. But in fact we can, and what follows now is a very useful idea in dealing with these distributions. First of all, the covariance matrix C_kl is symmetric. Hence there is an orthogonal matrix U such that U^T C U = D = diag(λ_0, ..., λ_K) is diagonal. That implies that the variables (Y_0, ..., Y_K) = (X_0, ..., X_K) U are independent centered normal variables with variance

    Var Y_i = Var( Σ_j X_j U_ji ) = Σ_{j,s} U_ji U_si E X_j X_s = Σ_{j,s} U_ji U_si C_js = λ_i.

In particular, if λ_i = 0 then Y_i = 0 almost surely. The crucial fact is now that the covariance matrix in our case is idempotent, i.e. C² = C. We see this by (2.4) and the explicit calculation

    (C²)_ij = Σ_s C_is C_sj
            = Σ_s ( δ_is − √((T_{i+1} − T_i)(T_{s+1} − T_s)) ) ( δ_sj − √((T_{s+1} − T_s)(T_{j+1} − T_j)) )
            = δ_ij − 2 √((T_{i+1} − T_i)(T_{j+1} − T_j)) + √((T_{i+1} − T_i)(T_{j+1} − T_j)) Σ_s (T_{s+1} − T_s)
            = δ_ij − √((T_{i+1} − T_i)(T_{j+1} − T_j)) = C_ij,

since Σ_s (T_{s+1} − T_s) = 1.


Hence, C is symmetric and idempotent, which means it is a projection, and projections only have eigenvalues zero or one. The diagonal matrix D therefore has the form D = diag(1, ..., 1, 0, ..., 0), where the number of ones is equal to the rank rk D = rk C of the covariance matrix. Thus, the independent normal variables Y_i are either almost surely zero or standard normal, and since U is an orthogonal matrix, we have almost surely

    Y_0² + ... + Y²_{rk C − 1} = Y_0² + ... + Y_K² = ‖(Y_0, ..., Y_K)‖² = ‖(X_0, ..., X_K) U‖² = X_0² + ... + X_K²,

and the Y_k, k = 0, ..., rk C − 1, are independent standard normal variables. Hence X_0² + ... + X_K² is χ²-distributed with rk C degrees of freedom. It remains to show that for the covariance matrix above, rk C = K. Here we use again that the eigenvalues are only zero or one. That implies that the rank of D equals the number of eigenvalues equal to one, and this is the trace of D. Thus

    rk D = tr D = tr(U^T C U) = tr C = Σ_s C_ss = Σ_s (1 − (T_{s+1} − T_s)) = (K + 1) − 1 = K.

From that, we can derive the second test on goodness of fit.

Definition 22 (χ²-goodness of fit test) The χ²-goodness of fit test of the hypothesis F = F_0 against H1: F ≠ F_0 is given by the test statistic

    X² := Σ_{k=0}^K (n_k − N_k)²/N_k

which, under H0, is asymptotically χ²_K-distributed. The critical region for significance level α > 0 is therefore given by

    C = { X² > χ²_{1−α, K} }

where χ²_{1−α, K} denotes the corresponding quantile.
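Computing X² from data is straightforward. A small Python sketch (our own function names; the uniform null hypothesis is just an example):

    import math
    import random

    def chi2_gof_statistic(sample, breakpoints, cdf):
        """X^2 = sum_k (n_k - N_k)^2 / N_k for bins I_k = [t_k, t_{k+1})
        with t_0 = -infinity and t_{K+1} = +infinity."""
        n = len(sample)
        ts = [-math.inf] + sorted(breakpoints) + [math.inf]
        stat = 0.0
        for k in range(len(ts) - 1):
            nk = sum(1 for x in sample if ts[k] <= x < ts[k + 1])
            pk = cdf(ts[k + 1]) - cdf(ts[k])
            stat += (nk - n * pk) ** 2 / (n * pk)
        return stat

    # H0: Unif(0,1); its cdf extended to the whole real line
    def unif_cdf(t):
        return min(max(t, 0.0), 1.0)

    sample = [random.random() for _ in range(1000)]
    x2 = chi2_gof_statistic(sample, [0.2, 0.4, 0.6, 0.8], unif_cdf)
    print(x2)   # compare with the chi^2 quantile for K = 4 degrees of freedom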


Remark. Even though the test was constructed for samples of real random variables, we can apply it also to categorical data, with just a discrete probability distribution determining which observation falls into which bin. Even the proof does not change if you construct an artificial random variable X and choose the bins in a way that the probability P(X ∈ I_k) equals the probability that the original (categorical) data falls within bin number k.

2.4 A Chi-square test on independence

Now we will present another way to use the basic chi-square idea, namely to construct a test on independence of two random variables. Let X and Y be random variables with cumulative distribution functions F_X, F_Y, respectively. We divide the range of X into K + 1 subintervals I_0, I_1, ..., I_K and the range of Y into L + 1 subintervals J_0, J_1, ..., J_L. Then we perform an experiment with N paired observations (X_i, Y_i), i = 1, ..., N. We keep track of how many of the data pairs fall into the different bins according to the following scheme:

            I_0     I_1     ...     I_K
    J_0     n_00    n_01    ...     n_0K     m_0
    ...     ...     ...     ...     ...      ...
    J_L     n_L0    n_L1    ...     n_LK     m_L
            n_0     n_1     ...     n_K      N

where

    n_k = Σ_{l=0}^L n_lk,   m_l = Σ_{k=0}^K n_lk,   N = Σ_{l=0}^L Σ_{k=0}^K n_lk.   (2.5)

Our aim is now to prove that the statistic

    S_N² = Σ_{l=0}^L Σ_{k=0}^K ( n_lk − m_l n_k/N )² / ( m_l n_k/N )

is asymptotically χ²-distributed with K·L degrees of freedom.


Assume first that the marginal distributions are given by P(X ∈ I_k) = p_k, P(Y ∈ J_l) = q_l. We consider the statistic

    X_N² := Σ_{l=0}^L Σ_{k=0}^K ( n_lk − N p_k q_l )² / ( N p_k q_l ),

and our aim is to test the hypothesis H0: X and Y are independent against the alternative H1: X and Y are not independent. By Lemma 14, X_N² is asymptotically χ²-distributed with (K + 1)(L + 1) − 1 degrees of freedom.
The difference between X_N² and S_N² is that, in the case that the p_k and q_l are not known, we have to estimate the marginals from the sample. Thus, in S_N², the true marginals are substituted by the maximum likelihood estimators

    p̂_k = n_k/N,   q̂_l = m_l/N.

It is a not at all obvious fact that this procedure reduces the number of degrees of freedom by the number of parameters which have to be estimated. In this case, we have to estimate the marginal probabilities p_0, ..., p_K and q_0, ..., q_L. These are K + L + 2 probabilities, but since both the p_k and the q_l sum up to one, we effectively have to estimate only K + L numbers. That yields

    (K + 1)(L + 1) − 1 − (K + L) = K·L

degrees of freedom for the resulting χ²-distribution. Please note again that this is far from being a proof, for which we refer to the reference below.
Theorem 7 Under H0, the statistic S_N² is asymptotically χ²-distributed with K·L degrees of freedom.

Proof: See [2], Sec. 30.3, p. 426 ff.

Using this, we can finally construct an asymptotic test on independence.

Definition 23 (χ²-independence test) The χ²-test on independence for a paired sample (X_i, Y_i)_{i=1,...,N} is given by the test statistic

    S_N² = Σ_{l=0}^L Σ_{k=0}^K ( n_lk − m_l n_k/N )² / ( m_l n_k/N )

where n_k = Σ_{l=0}^L n_lk and m_l = Σ_{k=0}^K n_lk. The hypothesis that X and Y are independent is rejected at a significance level α > 0 if S_N² > χ²_{1−α, KL}.
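As a sketch of how S_N² is evaluated on a contingency table (Python, our own function name):

    def chi2_independence_statistic(table):
        """S_N^2 for an (L+1) x (K+1) contingency table of counts n_lk."""
        row_sums = [sum(row) for row in table]            # m_l
        col_sums = [sum(col) for col in zip(*table)]      # n_k
        N = sum(row_sums)
        stat = 0.0
        for l, row in enumerate(table):
            for k, nlk in enumerate(row):
                expected = row_sums[l] * col_sums[k] / N  # m_l * n_k / N
                stat += (nlk - expected) ** 2 / expected
        return stat

    table = [[12, 7, 9],
             [8, 15, 11]]   # L = 1, K = 2, hence K*L = 2 degrees of freedom
    print(chi2_independence_statistic(table))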


Remark. For the same reasons as in the preceding paragraph, we can apply this test to categorical data as well.

Degrees of freedom. Note that the number of degrees of freedom can also be read off from (2.5). According to I. J. Good [8], the number of degrees of freedom of a statistical problem is, independently of the occurrence of F- or χ²-distributions or quadratic forms, defined as the codimension of the hypothesis within a larger hypothesis, meaning the full space of all distributions under consideration. In our case, an arbitrary distribution on the (K + 1)(L + 1) bins is characterized by the same number of non-negative numbers with sum 1, meaning that we consider in total a ((K + 1)(L + 1) − 1 = KL + K + L)-dimensional simplex of all possible distributions as the larger hypothesis. Product distributions are characterized completely by their marginals, which are given by a K- and an L-dimensional simplex, respectively. The hypothesis that the two variables are independent thus consists of a space of distributions of dimension K + L. By the definition above, the number of degrees of freedom is thus the codimension

    KL + K + L − (K + L) = KL.

For further examples of this interpretation see [8].

Appendix A The functional delta method


The purpose of this appendix is an informal discussion of the so-called functional delta method. It considerably extends the technique of formulating and solving asymptotic problems in terms of the empirical distribution function. You can also consider the final example as a comment on the asymptotic normality result, Theorem 2, p. 41. For a concise outline of the general method and the technical details involved, see the encyclopedia article [9].

A.1 The Mann-Whitney statistic

Recall that the continuous mapping principle enabled us to conclude that Donsker's result √n{F_n(x) − F(x)} → b_{F(x)} implies

    ψ( √n {F_n(x) − F(x)} ) → ψ( b_{F(x)} )

for continuous ψ. Consider now the Wilcoxon statistic in the two sample case. Recall that the underlying random variables were supposed to be continuous. We have

    W(x_1, ..., y_m) = Σ_{k=1}^{n+m} k R_k(x_1, ..., y_m) = Σ_{j=1}^m ( n F_n(y_j) + m G_m(y_j) )


where the two samples are given by (x_1, ..., x_n), (y_1, ..., y_m) and F_n, G_m are the empirical distribution functions of X, Y, respectively. Clearly,

    Σ_{j=1}^m m G_m(y_j) = Σ_{j=1}^m j = m(m+1)/2,

independent of the sample. Thus we may reduce the statistic to

    W̃ = n Σ_{j=1}^m F_n(y_j) = nm ∫_R F_n dG_m.

Thus, the Wilcoxon statistic in terms of the empirical distribution functions is basically given by

    φ(F_n, G_m) := ∫_R F_n dG_m   (A.1)

and this is called the Mann-Whitney form of the Wilcoxon statistic. It can be shown that φ is continuous in both arguments with respect to convergence in probability of the random variables associated to the distribution functions. Convergence in Kolmogorov distance of the cumulative distribution functions implies convergence in probability of the associated variables, and therefore we have

    φ(F_n, G_m) → φ(F, G) as m, n → ∞

by the Glivenko-Cantelli theorem.

Remark. We can even compute the limit (assuming for simplicity that G has a density g):

    φ(F, G) = ∫_R F dG = ∫ dy g(y) P(X ≤ y) = P(X ≤ Y).   (A.2)
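The Mann-Whitney form has a simple empirical counterpart: ∫ F_n dG_m = #{(i, j) : x_i ≤ y_j} / (nm). A short Python sketch of (A.1) and (A.2) (our own function name):

    import random

    def mann_whitney_phi(xs, ys):
        """phi(F_n, G_m) = integral of F_n dG_m = #{(i,j): x_i <= y_j} / (n*m)."""
        n, m = len(xs), len(ys)
        return sum(1 for x in xs for y in ys if x <= y) / (n * m)

    # under F = G (continuous), this fluctuates around P(X <= Y) = 1/2
    xs = [random.gauss(0, 1) for _ in range(200)]
    ys = [random.gauss(0, 1) for _ in range(300)]
    print(mann_whitney_phi(xs, ys))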

A.2 Hadamard differentiability and asymptotic normality

The difference is now that we have to compute an asymptotic result, as N → ∞ and m/n → λ/(1 − λ), for

    √(nm/N) ( φ(F_n, G_m) − φ(F, G) )

to prove Theorem 2 for the special case of the Wilcoxon statistic.

The idea is to use a Taylor expansion: if (u_n, v_n) → (u, v) and f is sufficiently differentiable at (u, v), then

    f(u_n, v_n) = f(u, v) + ∂f/∂u (u, v) (u_n − u) + ∂f/∂v (u, v) (v_n − v) + R(u_n, v_n)

where the remainder R is asymptotically negligible. The problem is clearly that the space of cumulative distribution functions is infinite dimensional and that it is very hard to find suitable notions of differentiability. In fact, there are several ways to do that, depending on the given problem. The space of distribution functions is not a vector space, because the sum of two distribution functions is no longer a distribution function. But for α, β ≥ 0 with α + β = 1, the convex combination αF + βG of two distribution functions F and G is again a distribution function. Thus, the set of distribution functions D forms a convex subset of the vector space of distribution functions of signed and finite measures

    S = {aF − bG : a, b ∈ R, F, G ∈ D}.

In the definition below, L_F is thus meant to be linear in the sense of a linear map on S. In that picture, we may think of the set

    T_F D := { U ∈ S : U = lim_{t→0} |t|^{−1} (F_t − F), F_t, F ∈ D, F_t → F } ⊂ T_F S

as the tangent cone along the submanifold D. Here, the limit is again understood in the sense of convergence in probability.

Definition 24 A functional φ on the space of distribution functions is called Hadamard differentiable at F if there is some linear functional L_F with

    lim_{t→0} (1/|t|) { φ(F_t) − φ(F) } = L_F(U)   (A.3)

for all sequences F_t for which the limit |t|^{−1}(F_t − F) → U ∈ T_F D exists.


Remark. A proper description of the limit in (A.3) requires fixing a suitable metric on the space of these functionals. Choosing this metric appropriately for a given problem is an important technical point.

Without proof, we will use that φ is Hadamard differentiable with respect to both arguments F and G. That implies

    φ(F_n, G_m) − φ(F, G) = ∫ F_n dG_m − ∫ F dG
    = ∫ (F_n − F) dG + ∫ F d(G_m − G) + ∫ (F_n − F) d(G_m − G)
    = ∫ (F_n − F) dG − ∫ (G_m − G) dF + ∫ (F_n − F) d(G_m − G),

where we used the partial integration

    ∫ F d(G_m − G) + ∫ (G_m − G) dF = (G_m − G) F |_{−∞}^{∞} = 0

for the last step. That implies, with √(nm/N) = √(m/N) √n = √(n/N) √m, that

    √(nm/N) ( φ(F_n, G_m) − φ(F, G) )
    = √(m/N) ∫ √n (F_n − F) dG − √(n/N) ∫ √m (G_m − G) dF + √(nm/N) ∫ (F_n − F) d(G_m − G),

and thus, by the continuous mapping principle, as N → ∞ and m/N → λ (the last term vanishes in the limit), this converges to

    √λ ∫ b^(1)_{F(x)} dG(x) − √(1−λ) ∫ b^(2)_{G(x)} dF(x)   (A.4)

where b^(1) and b^(2) are two independent Brownian bridges. Hence (A.4) is the difference of two independent normal variables, which implies that the Wilcoxon statistic is asymptotically normal.

Furthermore, under the hypothesis F = G, we have for the limit variable

    Z = ∫_R ( √λ b^(1)_{F(x)} − √(1−λ) b^(2)_{F(x)} ) dF(x) = ∫₀¹ ( √λ b^(1)_s − √(1−λ) b^(2)_s ) ds

that it is centered Gaussian with variance

    Var Z = E Z²
    = E ∫₀¹ ds ∫₀¹ dt ( √λ b^(1)_s − √(1−λ) b^(2)_s )( √λ b^(1)_t − √(1−λ) b^(2)_t )
    = ∫₀¹ ds ∫₀¹ dt ( λ E b^(1)_s b^(1)_t − 2√(λ(1−λ)) E b^(1)_s b^(2)_t + (1−λ) E b^(2)_s b^(2)_t )
    = ∫₀¹ ds ∫₀¹ dt ( min(s, t) − st )
    = ∫₀¹ ds ( ∫₀^s t dt + s ∫_s¹ dt ) − ( ∫₀¹ s ds )²
    = 1/3 − 1/4 = 1/12,

where the mixed term vanishes since b^(1) and b^(2) are independent and centered, and λ + (1 − λ) = 1.

Altogether, we showed for N → ∞, m/N → λ > 0 that

    √(nm/N) ( φ(F_n, G_m) − φ(F, G) ) = (1/√(nmN)) { W̃ − E W̃ }

converges to a normal variable Z with mean zero and variance 1/12. This provides one example of how the general idea of the functional delta method works: Hadamard differentiability of a functional φ together with Donsker's theorem automatically implies an asymptotic result, by letting F_t = F_n, t = 1/√n and

    √n ( φ(F_n) − φ(F) ) = ( φ(F_t) − φ(F) )/t → L_F( lim_n √n (F_n − F) ) = L_F(b_F),

provided the differential L_F exists.
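The variance 1/12 is easy to confirm by simulation. A self-contained Python sketch (our own names; the finite-sample variance is (N+1)/(12N), hence close to 1/12):

    import random

    def phi_hat(xs, ys):
        """Empirical Mann-Whitney functional, integral of F_n dG_m."""
        return sum(1 for x in xs for y in ys if x <= y) / (len(xs) * len(ys))

    n, m = 40, 60
    N = n + m
    vals = []
    for _ in range(5000):
        xs = [random.random() for _ in range(n)]
        ys = [random.random() for _ in range(m)]
        vals.append((n * m / N) ** 0.5 * (phi_hat(xs, ys) - 0.5))
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    print(mean, var)   # approximately 0 and 1/12 = 0.0833...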


Appendix B Some Exercises


Exercise 1. Prove that μ is a location parameter if and only if the distribution of X − μ does not depend on μ. What is the analogous result for scale parameters?

Exercise 2. (Bain/Engelhardt, p. 495) The following 20 observations are obtained from a random number generator: 0.48, 0.10, 0.29, 0.31, 0.86, 0.91, 0.81, 0.92, 0.27, 0.21, 0.31, 0.39, 0.39, 0.47, 0.84, 0.81, 0.97, 0.51, 0.59, 0.70
1. Test H0: med = 0.5 against H1: med > 0.5 at level α = 0.1.
2. Test H0: med = 0.25 against H1: med > 0.25 at level α = 0.1.
For the first test, is it necessary to actually compute the rejection region?

Exercise 3. (i) Can you draw the cumulative distribution function of a random variable X with P(X = med(X)) > 0? (ii) Can you modify the sign test to include also such distributions?

Exercise 4. Prove that for symmetric distributions, the mean coincides with the median.

Exercise 5. What happens to Pitman's asymptotic efficiency if, in the parametric location problem for the normal distribution N(μ, σ²), the variance σ² is unknown?



Exercise 6. Prove Lemma 4.1 on the significance of order statistics.

Exercise 7. Let X be a random variable with cumulative distribution function F and (X_1, ..., X_n) an associated random sample. Ordering this random sample by magnitude, X_(1) ≤ ... ≤ X_(k) ≤ ... ≤ X_(n), yields the order statistics X_(k), k = 1, ..., n. In particular, X_(1) = min{X_1, ..., X_n} and X_(n) = max{X_1, ..., X_n}.
1. Compute the distribution function of X_(k).
2. Compute the expectation value of X_(k) if X ~ Unif(0, 1) is uniformly distributed on the interval [0, 1].

Exercise 8. (i) Let G = R operate on R^n by u(x_1, ..., x_n) := (x_1 + u, ..., x_n + u). Construct a maximal invariant map for this operation. (ii) Let G = R_+ := {u ∈ R : u > 0} operate on R^n by u(x_1, ..., x_n) := (u x_1, ..., u x_n). Construct a maximal invariant map for this operation.

Exercise 9. A composition of an integer n ≥ 1 is a way to write n as a sum of positive integers where the order of summation is taken into account. The sixteen compositions of 5 are, for instance, given by {5, 4+1, 1+4, 2+3, 3+2, 1+1+3, 1+3+1, 3+1+1, 2+2+1, 2+1+2, 1+2+2, 1+1+1+2, 1+1+2+1, 1+2+1+1, 2+1+1+1, 1+1+1+1+1}.
1. Prove that there are actually 2^{n−1} compositions of n ≥ 1.
2. Prove that the number of compositions of n into k parts is given by the binomial coefficient C(n−1, k−1).

Exercise 10. Prove that the set M := {f : R → R : f strictly monotone, continuous and onto} forms a group if we take as group multiplication the composition of maps, f ∘ g(x) := f(g(x)).

Exercise 11. Prove Lemma 5 on the distribution of ordered rank statistics. Hint: use the representation [xyxx...xyy] of the information contained in the ordered rank statistic, mentioned in the lecture.

Exercise 12. Let X be a two-dimensional random vector distributed according to one of the distributions with density

    f(x, y) = (1/(2πσ²)) exp( −(x² + y²)/(2σ²) )

and σ > 0. Given a sample X = (X_1, ..., X_n), prove that S(X) := (‖X_1‖, ..., ‖X_n‖) is a sufficient statistic (‖·‖ denotes the Euclidean norm).

Exercise 13. Consider paired observations (X_i, Y_i)_{i=1,...,n} where the X_i are observations of the continuous random variable X and the Y_i are observations of the continuous random variable Y.
1. If you compare this situation to the two-sample situation in the patient example, in which kind of situations would you assume that the observations are paired?
2. (adapted from Engelhardt, p. 469) Twelve pairs of twin male lambs were selected; diet plan I was given to one twin and diet plan II was given to the other twin in each case. The weights at eight months are given by the table below. Use a sign test for the differences X_i − Y_i to test the conjecture that diet I is more effective than diet II at a significance level of α = 0.05.
3. Which piece of (probably useful) information about the sample do you not take into account when you use a sign test?

    I (X_i)    111  102   90  110  108  125   99  121  133  115   90  101
    II (Y_i)    97   90   96   95  110  107   85  104  119   98   97  104

Exercise 14. Let U_1, ..., U_n be independent U(0, 1)-distributed. Denote by U_(k) the value of the k-th order statistic. Let (k_n) be a sequence of numbers such that k_n/n → x ∈ [0, 1] as n tends to infinity. Prove that

    lim_{n→∞} E f(U_(k_n)) = f(x)


for all bounded continuous functions f.

Exercise 15. Consider a one-sample location problem where we assume that the underlying distribution is continuous and symmetric. Based on a random sample X_1, ..., X_n we want to test the hypothesis H0: med(X) = θ_0 versus the alternative H1: med(X) > θ_0. We decide for the Wilcoxon signed rank test, i.e. the test statistic is constructed as follows: we consider the rank statistic of (Z_1 := |X_1 − θ_0|, ..., Z_n := |X_n − θ_0|) and sum up the ranks of those numbers |X_k − θ_0| where X_k − θ_0 > 0.
1. Do you see intuitively why we have to assume that the distributions are symmetric?
2. Is it reasonable to expect that this assumption is fulfilled for the data in Exercise 13?
3. Denote the test statistic by R^+. Show that you can write

    R^+(X) = Σ_{i=1}^n V_i(X_1, ..., X_n) r_i(Z)

where X = (X_1, ..., X_n) denotes the random sample, V_i : R^n → R is a suitably chosen function, and r_i(Z) denotes the rank of Z_i in Z. (Hint: it just looks complicated.)
4. Is the signed rank test invariant under monotone transformations?

Exercise 16. Perform the Wilcoxon signed rank test for the data from Exercise 13. In the case of paired samples (X, Y), you use Z_i := |X_i − Y_i| instead of the Z_i for the one-sample case defined above (why?). The hypothesis is rejected if R^+ exceeds a given (tabulated) critical value. You can find another description of the signed rank test in the English Wikipedia at http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test. A table of the critical values is provided by the first of the external links at the bottom of the page. What does the assumption of symmetry of the distribution (cf. Exercise 15) mean for paired observations? Do you think this assumption is justified? Compare the result with those obtained by

using a sign-test and a t-test for the same problem.

Exercise 17. Use the recursion relation (1.18) from Lemma 10, p. 41 to write a program (in R, C, whatever) with which you can generate a list of the tail probabilities P(W_{n,m} ≥ k) for given n, m.

Exercise 18. Use Proposition 9.1 to calculate expectation and variance of the Mood test. Is the distribution of the test statistic symmetric around its mean?

Exercise 19. Prove that the distribution of the Wilcoxon statistic W_N equals the distribution of the Siegel-Tukey statistic S_N.

Exercise 20. Prove that two identically distributed Bernoulli variables are independent if and only if their covariance is zero.

Exercise 21. Let X_1, ..., X_n be independent, identically distributed standard normal variables with mean

    X̄ = (1/n) Σ_{i=1}^n X_i.

Prove that (n − 1) s_n² = Σ_{i=1}^n (X_i − X̄)² is χ²-distributed with n − 1 degrees of freedom (at least for n = 3).

Exercise 22. Prove that the law of a centered Gaussian process is uniquely determined by its covariance structure.
Exercise 23. Give an alternative proof that the statistics D_n, D_n^+ and D_n^− from the Kolmogorov-Smirnov one-sample test are distribution free. Assume that the cumulative distribution F is continuous and strictly monotone. Proceed as follows:
1. Rewrite for instance D_n^+ by using the order statistic (X_(1), ..., X_(n)) of the sample as

    D_n^+ = max_{1≤i≤n} max{ i/n − F(X_(i)), 0 }.   (*)



2. Give an argument why the distribution of D_n^+ does not depend on F any more.
3. Give a short argument why this implies the same for D_n and D_n^−.

Exercise 24. Show that the alternative representation (*) of the Kolmogorov-Smirnov statistic in Exercise 23 is a representation as a maximum of finitely many rank statistics.

Exercise 25. A Brownian bridge is a process (X_t)_{t∈[0,1]} such that for 0 < t_1 < t_2 < ... < t_n < 1 and Borel subsets A_1, ..., A_n ⊂ R we have

    P(X_{t_1} ∈ A_1, ..., X_{t_n} ∈ A_n) = √(2π) ∫_{A_1} dx_1 ∫_{A_2} dx_2 ... ∫_{A_n} dx_n Π_{k=1}^{n+1} (1/√(2π(t_k − t_{k−1}))) exp( −(x_k − x_{k−1})² / (2(t_k − t_{k−1})) )

where t_0 = 0, x_0 = 0, t_{n+1} = 1, x_{n+1} = 0. Show that

    S_n := Σ_{k=1}^{n+1} (X_{t_k} − X_{t_{k−1}})² / (t_k − t_{k−1})

is χ²_n-distributed, at least for n = 2, 3.

Exercise 26. Let X and Y be independent continuous random variables with cumulative distribution functions F and G, respectively. Let F_n and G_m be the empirical distribution functions given two independent samples (x_1, ..., x_n), (y_1, ..., y_m) of sizes n, m, respectively. Show that

    rank(y_k) = n F_n(y_k) + m G_m(y_k),

where rank(y_k) denotes the rank of y_k in the joint sample.

Exercise 27. Let P be a symmetric and idempotent n×n matrix, i.e. P^T = P and P² = P.
1. Prove that P can have no eigenvalues other than 0 or 1.
2. Show that the rank of P (the number of linearly independent columns) is equal to the trace of P (i.e. the sum of the diagonal elements). (Hint: you may use cyclicity of the trace, i.e. trace(ABC) = trace(BCA) = trace(CAB).)

Exercise 28. Let x = (x_1, ..., x_4) ∈ R⁴.
1. Determine the orbits of R⁴ under the action of the permutation group Σ_4 of four elements given by σ(x_1, ..., x_4) = (x_{σ(1)}, ..., x_{σ(4)}), σ ∈ Σ_4.
2. Determine all possible rank statistics r(x), x ∈ R⁴, modulo permutations of the x_i.
3. When is r(x) a permutation of {1, 2, 3, 4}?

Exercise 29. Let (X_1, ..., X_n) be a random sample. Assume that the cumulative distribution function F is continuous. Prove that

    P( {(X_1, ..., X_n) | ∃σ ∈ Σ_n : r(X_1, ..., X_n) = (σ(1), ..., σ(n))} ) = 1.

Exercise 30. The following tables were obtained from two different machines producing steel bolts. The data represent the deviation (in mm) from the nominal length of the bolts.
Machine I: 0.15, -1.99, -1.08, -1.98, 2.87, 5.19, -0.37, -0.53, -1.09, 0.56, 1.15, -0.02, -1.32, 0.06, -0.21, -0.25, -1.35, -1.68, -1.41, -0.82
Machine II: 1.18, 1.26, 3.65, -0.81, 2.64, 0.31, 2.92, -3.60, 1.81, 1.38, 2.76, -3.25, -1.085, 1.19, 1.92, 1.53, 1.56, 3.09
1. Draw a qq-plot to justify your suspicion that the deviation is not normally distributed. (Use R for convenience.)
2. Perform a Wilcoxon rank-sum test at significance level α = 0.05 to check whether the means of the deviations of the two machines fulfill μ_II > μ_I.

Exercise 31. You suspect that the deviations above are distributed according to a two-sided exponential distribution with density

    ψ(x) = (1/2) e^{−|x|}


(but different locations). Construct a rank statistic which in that case performs best in the sense of Pitman's asymptotic efficiency.

Exercise 32. (Removal of ties) A lazy observer only tabulated the data from Exercise 30 rounded to one digit, getting
Machine I: 0.2, -2.0, -1.1, -2.0, 2.9, 5.2, -0.4, -0.5, -1.1, 0.6, 1.2, -0.0, -1.3, 0.1, -0.2, -0.3, -1.4, -1.7, -1.4, -0.8
Machine II: 1.2, 1.3, 3.7, -0.8, 2.6, 0.3, 2.9, -3.6, 1.8, 1.4, 2.8, -3.3, -1.1, 1.2, 1.9, 1.5, 1.6, 3.1
Some of the numbers are now equal. To perform again a Wilcoxon rank test, you proceed as follows: for every number X_i in the pooled sample you simulate a Uniform(0,1) random variable U_i such that the U_i are independent. Then you assign modified ranks r* to the X_i according to: r*(X_i) < r*(X_j) if and only if either X_i < X_j, or X_i = X_j and U_i < U_j.
1. Construct a critical region for the Wilcoxon rank-sum test based on these modified ranks r*. What is the distribution of the joint ordered rank statistic based on the modified ranks r*?
2. Discuss possible weak points of this method.

Exercise 33. Let X be a random variable with strictly monotone cumulative distribution function F. Let Y = f(X) with f ∈ C_0^∞(R) smooth with compact support.
1. Prove the strong law of large numbers for Y by using Glivenko-Cantelli.
2. Prove the central limit theorem for Y by using Donsker's theorem.

Exercise 34. Compare the example about the asymptotics of the Wilcoxon statistic T_W following Theorem 2 with the asymptotic result for the Mann-Whitney statistic at the end of Appendix A.

Exercise 35. Let x_1, ..., x_n be a sample from a random variable X and F_n the associated empirical distribution function. Show that if the real function f is continuous in the points x_1, ..., x_n, we have

    ∫_R f(t) dF_n(t) = (1/n) Σ_{i=1}^n f(x_i).

Bibliography
[1] Chandra, T. K. (1999). A First Course in Asymptotic Theory of Statistics. Narosa Publ. House, New Delhi.
[2] Cramér, H. (1999). Mathematical Methods of Statistics. 19th printing, Princeton University Press, Princeton, NJ.
[3] Dudley, R. M. (2002). Real Analysis and Probability. Cambridge University Press, New York.
[4] Hájek, J. (1969). Nonparametric Statistics. Holden-Day, San Francisco.
[5] Krengel, U. Mathematische Statistik, Vorlesungsausarbeitung. Göttingen, WS 73/74.
[6] van der Vaart, A., Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
[7] Donsker, M. D. (1952). Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Annals of Mathematical Statistics, 23, pp. 277-281.
[8] Good, I. J. (1973). What are Degrees of Freedom? The American Statistician, December 1973, Vol. 27, No. 5.
[9] Römisch, W. (2006). Delta Method, Infinite Dimensional. In: Encyclopedia of Statistical Sciences, J. Wiley & Sons Inc.


Index
Brownian bridge, 49
chi-square distribution, 55
Donsker's theorem, 48
empirical distribution function, 45
functional delta method, 61
Gaussian process, 48
  centered, 48
Glivenko-Cantelli theorem, 46
group action, 12
  effective, 12
Hadamard differentiable, 63
interlacing pattern
  first representation, 22
  second representation, 33
joint ordered rank statistics, 20
Kolmogorov distance, 46
Kronecker symbol, 47
linear rank statistic, 34
location/scale family, 38
maximal invariant map, 12
median, 7
monotone maps, group of, 14
orbit, 12
orbit space, 12
order statistics, 13
parameter
  location, 32
  scale, 33
Pitman's asymptotic efficiency, 11
rank statistics, 15
Siegel-Tukey test, 40
stochastic domination, 18
sufficiency, 20
tangent cone, 63
test
  chi-square goodness of fit, 57
  chi-square independence, 60
  Fisher-Yates, 30
  Freund-Ansari-Bradley-David-Barton, 39
  invariant, 17
  Kolmogorov-Smirnov, 52
  Kolmogorov-Smirnov, two sample case, 53
  Kuiper, 53
  Mood, 39
  Terry-Hoeffding, 37
  Van der Waerden X, 32
  Wilcoxon, 28
test on domination, 18
Wilcoxon statistic
  distribution, 40
  Mann-Whitney form, 62
