
Robust Nonparametric Statistical Methods

Thomas P. Hettmansperger
Penn State University
and
Joseph W. McKean
Western Michigan University
Copyright © 1997, 2008, 2010 by Thomas P. Hettmansperger and Joseph W. McKean
All rights reserved.
Dedication: To Ann and to Marge
Contents

Preface ix

1 One Sample Problems 1
1.1 Introduction . . . 1
1.2 Location Model . . . 2
1.3 Geometry and Inference in the Location Model . . . 4
1.3.1 Computation . . . 12
1.4 Examples . . . 12
1.5 Properties of Norm-Based Inference . . . 17
1.5.1 Basic Properties of the Power Function $\gamma_S(\theta)$ . . . 18
1.5.2 Asymptotic Linearity and Pitman Regularity . . . 21
1.5.3 Asymptotic Theory and Efficiency Results for $\hat{\theta}$ . . . 24
1.5.4 Asymptotic Power and Efficiency Results for the Test Based on $S(\theta)$ . . . 25
1.5.5 Efficiency Results for Confidence Intervals Based on $S(\theta)$ . . . 27
1.6 Robustness Properties of Norm-Based Inference . . . 30
1.6.1 Robustness Properties of $\hat{\theta}$ . . . 30
1.6.2 Breakdown Properties of Tests . . . 33
1.7 Inference and the Wilcoxon Signed-Rank Norm . . . 35
1.7.1 Null Distribution Theory of T(0) . . . 36
1.7.2 Statistical Properties . . . 37
1.7.3 Robustness Properties . . . 42
1.8 Inference Based on General Signed-Rank Norms . . . 44
1.8.1 Null Properties of the Test . . . 46
1.8.2 Efficiency and Robustness Properties . . . 47
1.9 Ranked Set Sampling . . . 53
1.10 Interpolated Confidence Intervals for the $L_1$ Inference . . . 56
1.11 Two Sample Analysis . . . 60
1.12 Exercises . . . 65

2 Two Sample Problems 73
2.1 Introduction . . . 73
2.2 Geometric Motivation . . . 74
2.2.1 Least Squares (LS) Analysis . . . 77
2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis . . . 78
2.2.3 Computation . . . 80
2.3 Examples . . . 80
2.4 Inference Based on the Mann-Whitney-Wilcoxon . . . 83
2.4.1 Testing . . . 83
2.4.2 Confidence Intervals . . . 92
2.4.3 Statistical Properties of the Inference Based on the MWW . . . 92
2.4.4 Estimation of $\Delta$ . . . 96
2.4.5 Efficiency Results Based on Confidence Intervals . . . 97
2.5 General Rank Scores . . . 99
2.5.1 Statistical Methods . . . 102
2.5.2 Efficiency Results . . . 103
2.5.3 Connection between One and Two Sample Scores . . . 107
2.6 $L_1$ Analyses . . . 108
2.6.1 Analysis Based on the $L_1$ Pseudo Norm . . . 108
2.6.2 Analysis Based on the $L_1$ Norm . . . 112
2.7 Robustness Properties . . . 115
2.7.1 Breakdown Properties . . . 115
2.7.2 Influence Functions . . . 116
2.8 Lehmann Alternatives and Proportional Hazards . . . 118
2.8.1 The Log Exponential and the Savage Statistic . . . 119
2.8.2 Efficiency Properties . . . 121
2.9 Two Sample Rank Set Sampling (RSS) . . . 123
2.10 Two Sample Scale Problem . . . 125
2.10.1 Optimal Rank-Based Tests . . . 125
2.10.2 Efficacy of the Traditional F-Test . . . 133
2.11 Behrens-Fisher Problem . . . 135
2.11.1 Behavior of the Usual MWW Test . . . 135
2.11.2 General Rank Tests . . . 137
2.11.3 Modified Mathisen's Test . . . 138
2.11.4 Modified MWW Test . . . 140
2.11.5 Efficiencies and Discussion . . . 141
2.12 Paired Designs . . . 143
2.12.1 Behavior under Alternatives . . . 145
2.13 Exercises . . . 148

3 Linear Models 153
3.1 Introduction . . . 153
3.2 Geometry of Estimation and Tests . . . 153
3.2.1 Estimation . . . 154
3.2.2 The Geometry of Testing . . . 156
3.3 Examples . . . 159
3.4 Assumptions for Asymptotic Theory . . . 164
3.5 Theory of Rank-Based Estimates . . . 166
3.5.1 R-Estimators of the Regression Coefficients . . . 166
3.5.2 R-Estimates of the Intercept . . . 170
3.6 Theory of Rank-Based Tests . . . 176
3.6.1 Null Theory of Rank-Based Tests . . . 177
3.6.2 Theory of Rank-Based Tests under Alternatives . . . 181
3.6.3 Further Remarks on the Dispersion Function . . . 185
3.7 Implementation of the R-Analysis . . . 187
3.7.1 Estimates of the Scale Parameter $\tau_\varphi$ . . . 188
3.7.2 Algorithms for Computing the R-Analysis . . . 191
3.7.3 An Algorithm for a Linear Search . . . 193
3.8 $L_1$-Analysis . . . 194
3.9 Diagnostics . . . 196
3.9.1 Properties of R-Residuals and Model Misspecification . . . 196
3.9.2 Standardization of R-Residuals . . . 202
3.9.3 Measures of Influential Cases . . . 208
3.10 Survival Analysis . . . 214
3.11 Correlation Model . . . 220
3.11.1 Huber's Condition for the Correlation Model . . . 221
3.11.2 Traditional Measure of Association and its Estimate . . . 223
3.11.3 Robust Measure of Association and its Estimate . . . 223
3.11.4 Properties of R-Coefficients of Multiple Determination . . . 225
3.11.5 Coefficients of Determination for Regression . . . 230
3.12 High Breakdown (HBR) Estimates . . . 232
3.12.1 Geometry of the HBR-Estimates . . . 232
3.12.2 Weights . . . 233
3.12.3 Asymptotic Normality of $\hat{\boldsymbol{\beta}}_{HBR}$ . . . 235
3.12.4 Robustness Properties of the HBR Estimates . . . 239
3.12.5 Discussion . . . 242
3.12.6 Implementation and Examples . . . 243
3.12.7 Studentized Residuals . . . 244
3.12.8 Examples . . . 245
3.13 Diagnostics for Differentiating between Fits . . . 247
3.14 Rank-Based Procedures for Nonlinear Models . . . 252
3.14.1 Implementation . . . 255
3.15 Exercises . . . 257
3.16 Exercises . . . 260

4 Experimental Designs: Fixed Effects 275
4.1 Introduction . . . 275
4.2 Oneway Design . . . 276
4.2.1 R-Fit of the Oneway Design . . . 277
4.2.2 Rank-Based Tests of $H_0\colon \mu_1 = \cdots = \mu_k$ . . . 281
4.2.3 Tests of General Contrasts . . . 283
4.2.4 More on Estimation of Contrasts and Location . . . 284
4.2.5 Pseudo-observations . . . 286
4.3 Multiple Comparison Procedures . . . 288
4.3.1 Discussion . . . 294
4.4 Twoway Crossed Factorial . . . 296
4.5 Analysis of Covariance . . . 300
4.6 Further Examples . . . 304
4.7 Rank Transform . . . 310
4.7.1 Monte Carlo Study . . . 312
4.8 Exercises . . . 317

5 Models with Dependent Error Structure 323
5.1 General Mixed Models . . . 323
5.1.1 Applications . . . 326
5.2 Simple Mixed Models . . . 327
5.2.1 Variance Component Estimators . . . 328
5.2.2 Studentized Residuals . . . 329
5.2.3 Example and Simulation Studies . . . 330
5.2.4 Simulation Studies of Validity . . . 331
5.2.5 Simulation Study of Other Score Functions . . . 333
5.3 Rank-Based Procedures Based on Arnold Transformations . . . 333
5.3.1 R Fit Based on Arnold Transformed Data . . . 334
5.4 General Estimating Equations (GEE) . . . 339
5.4.1 Asymptotic Theory . . . 342
5.4.2 Implementation and a Monte Carlo Study . . . 343
5.4.3 Example . . . 345
5.5 Time Series . . . 348
5.6 Exercises . . . 349

6 Multivariate 351
6.1 Multivariate Location Model . . . 351
6.2 Componentwise Methods . . . 356
6.2.1 Estimation . . . 359
6.2.2 Testing . . . 361
6.2.3 Componentwise Rank Methods . . . 364
6.3 Spatial Methods . . . 366
6.3.1 Spatial Sign Methods . . . 366
6.3.2 Spatial Rank Methods . . . 373
6.4 Affine Equivariant and Invariant Methods . . . 377
6.4.1 Blumen's Bivariate Sign Test . . . 377
6.4.2 Affine Invariant Sign Tests in the Multivariate Case . . . 379
6.4.3 The Oja Criterion Function . . . 387
6.4.4 Additional Remarks . . . 391
6.5 Robustness of Multivariate Estimates of Location . . . 392
6.5.1 Location and Scale Invariance: Componentwise Methods . . . 392
6.5.2 Rotation Invariance: Spatial Methods . . . 392
6.5.3 The Spatial Hodges-Lehmann Estimate . . . 394
6.5.4 Affine Equivariant Spatial Median . . . 394
6.5.5 Affine Equivariant Oja Median . . . 394
6.6 Linear Model . . . 395
6.6.1 Test for Regression Effect . . . 397
6.6.2 The Estimate of the Regression Effect . . . 404
6.6.3 Tests of General Hypotheses . . . 405
6.7 Experimental Designs . . . 412
6.8 Exercises . . . 416

A Asymptotic Results 421
A.1 Central Limit Theorems . . . 421
A.2 Simple Linear Rank Statistics . . . 422
A.2.1 Null Asymptotic Distribution Theory . . . 423
A.2.2 Local Asymptotic Distribution Theory . . . 424
A.2.3 Signed-Rank Statistics . . . 431
A.3 Results for Rank-Based Analysis of Linear Models . . . 433
A.3.1 Convex Functions . . . 436
A.3.2 Asymptotic Linearity and Quadraticity . . . 437
A.3.3 Asymptotic Distance Between $\hat{\boldsymbol{\beta}}$ and $\tilde{\boldsymbol{\beta}}$ . . . 439
A.3.4 Consistency of the Test Statistic $F_\varphi$ . . . 440
A.3.5 Proof of Lemma 3.5.1 . . . 442
A.4 Asymptotic Linearity for the $L_1$ Analysis . . . 443
A.5 Influence Functions . . . 446
A.5.1 Influence Function for Estimates Based on Signed-Rank Statistics . . . 447
A.5.2 Influence Functions for Chapter 3 . . . 448
A.5.3 Influence Function of $\hat{\boldsymbol{\beta}}_{HBR}$ of Chapter 5 . . . 454
A.6 Asymptotic Theory for Chapter 5 . . . 455

B Larger Data Sets 465
Preface

I don't believe I can really do without teaching. The reason is, I have to have something so that when I don't have any ideas and I'm not getting anywhere I can say to myself, "At least I'm living; at least I'm doing something; I'm making some contribution; it's just psychological."

Richard Feynman
We are currently revising these notes. Any corrections and/or comments are welcome.

This book is based on the premise that nonparametric or rank-based statistical methods are a superior choice in many data-analytic situations. We cover location models, regression models including designed experiments, and multivariate models. Geometry provides a unifying theme throughout much of the development. We emphasize the similarity in interpretation with least squares methods. Basically, we replace the Euclidean norm with a weighted L-1 norm. This results in rank-based methods or L-1 methods depending on the choice of weights. The rank-based methods proceed much like the traditional analysis. Using the norm, models are easily fitted. Diagnostic procedures can then be used to check the quality of fit (model criticism) and to locate outlying points and points of high influence. Upon satisfaction with the fit, rank-based inferential procedures can be used to conduct the statistical analysis. The benefits include significant gains in power and efficiency when the error distribution has tails heavier than those of a normal distribution and superior robustness properties in general.

The main text concentrates on Wilcoxon and L-1 methods. The theoretical development for general scores (weights) is contained in the Appendix. By restricting attention to Wilcoxon rank methods, we can recommend a unified approach to data analysis beginning with the simple location models and extending through complex regression models and designed experiments. All major methodology is illustrated on real data. The examples are intended as guides for the application of the rank and L-1 methods. Furthermore, all the data sets in this book can be obtained from the web site: http://www.stat.wmich.edu/home.html.

Selected topics from the first four chapters provide a basic graduate course in rank-based methods. The prerequisites are an introductory course in mathematical statistics and some background in applied statistics. The first seven sections of Chapter 1 and the first four sections of Chapter 2 are fundamental for the development of Wilcoxon signed-rank and Mann-Whitney-Wilcoxon rank sum methods in the one- and two-sample location models. In
Chapter 3, on the linear model, sections one through seven and section nine present the basic
material for estimation, testing and diagnostic procedures for model criticism. Sections two
through four of Chapter 4 give extensive development of methods for the one- and two-way
layouts. Then, depending on individual tastes, there are several more exotic topics in each
chapter to choose from.
Chapters 5 and 6 contain more advanced material. In Chapter 5 we extend rank based
methods for a linear model to bounded influence, high breakdown estimates and tests. In
Chapter 6 we take up the concept of multidimensional rank. We then discuss various ap-
proaches to the development of rank-like procedures that satisfy various invariant/equivariant
restrictions.
Computation of the procedures discussed in this book is very important. Minitab contains
an undocumented RREG (rank regression) command. It contains various subcommands that
allow for testing and estimation in the linear model. The reader can contact Minitab at (put
email address or web page address here) and request a technical report that describes the
RREG command. In many of the examples of this book the package rglm is used to obtain
the rank-based analyses. The basic algorithms behind this package are described in Chapter
3. Information (including online rglm analyses of examples) can be obtained from the web
site: http://www.stat.wmich.edu/home.html. Students can also be encouraged to write their
own S-plus functions for specific methods.
We are indebted to many of our students and colleagues for valuable discussions, stim-
ulation, and motivation. In particular, the first author would like to express his sincere
thanks for many stimulating hours of discussion with Steve Arnold, Bruce Brown, and Hannu
Oja while the second author wants to express his sincere thanks for discussions with John
Kapenga, Joshua Naranjo, Jerry Sievers, and Tom Vidmar. We both would like to express
our debt to Simon Sheather, our friend, colleague, and co-author on many papers. Finally,
we would like to thank Jun Recta for assistance in creating several of the plots.
Tom Hettmansperger
Joe McKean
July 2008
State College, PA
Kalamazoo, MI
Chapter 1
One Sample Problems
1.1 Introduction

Traditional statistical procedures are widely used because they offer the user a unified methodology with which to attack a multitude of problems, from simple location problems to highly complex experimental designs. These procedures are based on least squares fitting. Once the problem has been cast into a model then least squares offers the user:

1. a way of fitting the model by minimizing the Euclidean normed distance between the responses and the conjectured model;

2. diagnostic techniques that check the adequacy of the fit of the model, explore the quality of fit, and detect outlying and/or influential cases;

3. inferential procedures, including confidence procedures, tests of hypotheses and multiple comparison procedures;

4. computational feasibility.

Procedures based on least squares, though, are easily impaired by outlying observations. Indeed one outlying observation is enough to spoil the least squares fit, its associated diagnostics and inference procedures. Even though traditional inference procedures are exact when the errors in the model follow a normal distribution, they can be quite inefficient when the distribution of the errors has longer tails than the normal distribution.

For simple location problems, nonparametric methods were proposed by Wilcoxon (1945). These methods consist of test statistics based on the ranks of the data and associated estimates and confidence intervals for location parameters. The test statistics are distribution free in the sense that their null distributions do not depend on the distribution of the errors. It was soon realized that these procedures are almost as efficient as the traditional methods when the errors follow a normal distribution and, furthermore, are often much more efficient relative to the traditional methods when the error distributions deviate from normality; see Hodges and Lehmann (1956). These procedures possess both robustness of validity and power. In recent years these nonparametric methods have been extended to linear and nonlinear models. In addition, from the perspective of modern robustness theory, contrary to least squares estimates, these rank-based procedures have bounded influence functions and positive breakdown points.
Often these nonparametric procedures are thought of as disjoint methods that differ from one problem to another. In this text, we intend to show that this is not the case. Instead, these procedures present a unified methodology analogous to the traditional methods. The four items cited above for the traditional analysis hold for these procedures too. Indeed the only operational difference is that the Euclidean norm is replaced by another norm.

There are computational procedures available for the rank-based procedures discussed in this book. We offer the reader a collection of computational functions written in the software language R at the site http://www.stat.wmich.edu/mckean/Rfuncs/. We refer to these computational algorithms as rank-based R algorithms or RBR. We discuss these functions throughout the text and use them in many of the examples, simulation studies, and exercises. The programming language R (see Ihaka and Gentleman, 1996) is freeware and can run on all (PC, Mac, Linux) platforms. To download the R software and accompanying information, visit the site http://www.r-project.org/. The language R has intrinsic functions for computation of some of the procedures discussed in this and the next chapter.
1.2 Location Model

In this chapter we will consider the one sample location problem. This will allow us to explore some useful concepts such as distribution freeness and robustness in a simple setting. We will extend many of these concepts to more complicated situations in later chapters. We need to first define a location parameter. For a random variable X we often subscript its distribution function by X to avoid confusion.

Definition 1.2.1. Let T(H) be a function defined on the set of distribution functions. We say T(H) is a location functional if

1. If G is stochastically larger than F (i.e., $G(x) \le F(x)$ for all x), then $T(G) \ge T(F)$;

2. $T(H_{aX+b}) = aT(H_X) + b$, $a > 0$;

3. $T(H_{-X}) = -T(H_X)$.

Then, we will call $\theta = T(H)$ a location parameter of H.

Note that if X has location parameter $\theta$ it follows from the second item in the above definition that the random variable $e = X - \theta$ has location parameter 0. Suppose $X_1, \ldots, X_n$ is a random sample having the common distribution function H(x) and $\theta = T(H)$ is a location parameter of interest. We express this by saying that $X_i$ follows the statistical location model,
$$X_i = \theta + e_i, \quad i = 1, \ldots, n, \qquad (1.2.1)$$
where $e_1, \ldots, e_n$ are independent and identically distributed random variables with distribution function F(x) and density function f(x) and location $T(F) = 0$. It follows that $H(x) = F(x - \theta)$ and that $T(H) = \theta$. We next discuss three examples of location parameters that we will use throughout this chapter. Other location parameters are discussed in Section 1.8. See Bickel and Lehmann (1975) for additional discussion of location functionals.
Example 1.2.1. The Median Location Functional

First define the inverse of the cdf H(x) by $H^{-1}(u) = \inf\{x : H(x) \ge u\}$. Generally we will suppose that H(x) is strictly increasing on its support and this will eliminate ambiguities on the selection of the parameter. Now define $\theta_1 = T_1(H) = H^{-1}(1/2)$. This is the median functional. Note that if $G(x) \le F(x)$ for all x, then $G^{-1}(u) \ge F^{-1}(u)$ for all u; and, in particular, $G^{-1}(1/2) \ge F^{-1}(1/2)$. Hence, $T_1(H)$ satisfies the first condition for a location functional. Next let $H^*(x) = P(aX + b \le x) = H[a^{-1}(x - b)]$. Then it follows at once that $H^{*-1}(u) = aH^{-1}(u) + b$ and the second condition is satisfied. The third condition follows with an argument similar to the one for the second condition.
Example 1.2.2. The Mean Location Functional

For the mean functional let $\theta_2 = T_2(H) = \int x\, dH(x)$, when the mean exists. Note that $\int x\, dH(x) = \int H^{-1}(u)\, du$. Now if $G(x) \le F(x)$ for all x, then $x \le G^{-1}(F(x))$. Let $x = F^{-1}(u)$ and we have $F^{-1}(u) \le G^{-1}(F(F^{-1}(u))) \le G^{-1}(u)$. Hence, $T_2(G) = \int G^{-1}(u)\, du \ge \int F^{-1}(u)\, du = T_2(F)$ and the first condition is satisfied. The other two conditions follow easily from the definition of the integral.
Example 1.2.3. The Pseudo-Median Location Functional

Assume that $X_1$ and $X_2$ are independent and identically distributed (iid) with distribution function H(x). Let $Y = (X_1 + X_2)/2$. Then Y has distribution function $H^*(y) = P(Y \le y) = \int H(2y - x)\, h(x)\, dx$. Let $\theta_3 = T_3(H) = H^{*-1}(1/2)$. To show that $T_3$ is a location functional, suppose $G(x) \le F(x)$ for all x. Then
$$G^*(y) = \int G(2y - x)\, g(x)\, dx = \int \left[\int_{-\infty}^{2y-x} g(t)\, dt\right] g(x)\, dx \le \int \left[\int_{-\infty}^{2y-x} f(t)\, dt\right] g(x)\, dx$$
$$= \int \left[\int_{-\infty}^{2y-t} g(x)\, dx\right] f(t)\, dt \le \int \left[\int_{-\infty}^{2y-t} f(x)\, dx\right] f(t)\, dt = F^*(y);$$
hence, as in Example 1.2.1, it follows that $G^{*-1}(u) \ge F^{*-1}(u)$ and, hence, that $T_3(G) \ge T_3(F)$. For the second property, let $W = aX + b$ where X has distribution function H and $a > 0$. Then W has distribution function $F_W(t) = H((t - b)/a)$. Then by the change of variable $z = (x - b)/a$, we have
$$F_W^*(y) = \int H\left(\frac{2y - x - b}{a}\right) \frac{1}{a}\, h\left(\frac{x - b}{a}\right) dx = \int H\left(2\,\frac{y - b}{a} - z\right) h(z)\, dz.$$
Thus the defining equation for $T_3(F_W)$ is
$$\frac{1}{2} = \int H\left(2\,\frac{T_3(F_W) - b}{a} - z\right) h(z)\, dz,$$
which is satisfied for $T_3(F_W) = aT_3(H) + b$. For the third property, let $V = -X$ where X has distribution function H. Then V has distribution function $F_V(t) = 1 - H(-t)$. Hence, by the change in variable $z = -x$,
$$F_V^*(y) = \int (1 - H(-2y + x))\, h(-x)\, dx = 1 - \int H(-2y - z)\, h(z)\, dz.$$
Because the defining equation of $T_3(F_V)$ can be written as
$$\frac{1}{2} = \int H(2(-T_3(F_V)) - z)\, h(z)\, dz,$$
it follows that $T_3(F_V) = -T_3(H)$. Therefore, $T_3$ is a location functional. It has been called the pseudo-median by Høyland (1965) and is more appropriate for symmetric distributions.
The next theorem characterizes all the location functionals for a symmetric distribution.

Theorem 1.2.1. Suppose that the pdf h(x) is symmetric about some point a. If T(H) is a location functional, then T(H) = a.

Proof. Let the random variable X have pdf h(x) symmetric about a. Let $Y = X - a$; then Y has pdf $g(y) = h(y + a)$ symmetric about 0. Hence Y and $-Y$ have the same distribution. By the third property of location functionals, this means that $T(G_Y) = T(G_{-Y}) = -T(G_Y)$; i.e., $T(G_Y) = 0$. But by the second property, $0 = T(G_Y) = T(H) - a$; that is, $a = T(H)$.

This theorem means that when we sample from a symmetric distribution we can unambiguously define location as the center of symmetry. Then all location functionals that we may wish to study will specify the same location parameter.
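To make the three functionals concrete, the following minimal R sketch (our illustration, not part of the text) computes their sample analogues on one simulated symmetric sample; by Theorem 1.2.1, all three target the same center of symmetry.

## Sample analogues of the location functionals of Examples 1.2.1-1.2.3
set.seed(1)
x <- rt(30, df = 3)                # symmetric, heavy-tailed sample centered at 0
T1 <- median(x)                    # median functional T_1(H_n)
T2 <- mean(x)                      # mean functional T_2(H_n)
w <- outer(x, x, "+") / 2          # pairwise averages (x_i + x_j)/2
T3 <- median(w[lower.tri(w, diag = TRUE)])   # pseudo-median T_3(H_n), i <= j
c(median = T1, mean = T2, pseudo.median = T3)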
1.3 Geometry and Inference in the Location Model

Letting $\mathbf{X} = (X_1, \ldots, X_n)'$ and $\mathbf{e} = (e_1, \ldots, e_n)'$, we then write the statistical location model, (1.2.1), as,
$$\mathbf{X} = \theta \mathbf{1} + \mathbf{e}, \qquad (1.3.1)$$
where $\mathbf{1}$ denotes the vector all of whose components are 1 and $T(F_e) = 0$. If $\Omega_F$ denotes the one-dimensional subspace spanned by $\mathbf{1}$, then we can express the model more compactly as $\mathbf{X} = \boldsymbol{\eta} + \mathbf{e}$, where $\boldsymbol{\eta} \in \Omega_F$. The subscript F on $\Omega$ stands for full model in the context of hypothesis testing as discussed below.
Let $\mathbf{x}$ be a realization of $\mathbf{X}$. Note that except for random error, $\mathbf{x}$ would lie in $\Omega_F$. Hence an intuitive fitting criterion is to estimate $\theta$ by a value $\hat{\theta}$ such that the vector $\hat{\theta}\mathbf{1} \in \Omega_F$ lies closest to $\mathbf{x}$, where closest is defined in terms of a norm. Furthermore, a norm, as the following general discussion shows, provides a complete inference for the parameter $\theta$.

Recall that a norm is a nonnegative function, $\|\cdot\|$, defined on $\mathbb{R}^n$ such that $\|\mathbf{y}\| \ge 0$ for all $\mathbf{y}$; $\|\mathbf{y}\| = 0$ if and only if $\mathbf{y} = \mathbf{0}$; $\|a\mathbf{y}\| = |a|\,\|\mathbf{y}\|$ for all real a; and $\|\mathbf{y} + \mathbf{z}\| \le \|\mathbf{y}\| + \|\mathbf{z}\|$. The distance between two vectors is $d(\mathbf{z}, \mathbf{y}) = \|\mathbf{z} - \mathbf{y}\|$.
Given a location model, (1.3.1), and a specified norm, $\|\cdot\|$, the estimate of $\theta$ induced by the norm is
$$\hat{\theta} = \operatorname{argmin}_\theta \|\mathbf{x} - \theta\mathbf{1}\|, \qquad (1.3.2)$$
i.e., the value which minimizes the distance between $\mathbf{x}$ and the space $\Omega_F$. As discussed in Exercise 1.12.1, a minimizing value always exists. The dispersion function induced by the norm is given by,
$$D(\theta) = \|\mathbf{x} - \theta\mathbf{1}\|. \qquad (1.3.3)$$
The minimum distance between the vector of observations $\mathbf{x}$ and the space $\Omega_F$ is $D(\hat{\theta})$. As Exercise 1.12.3 shows, $D(\theta)$ is a convex, continuous function of $\theta$ which is differentiable almost everywhere. Actually the norms discussed in this book are differentiable at all but at most a finite number of points. We define the gradient process by the function
$$S(\theta) = -\frac{d}{d\theta} D(\theta). \qquad (1.3.4)$$
As Exercise 1.12.3 shows, $S(\theta)$ is a nonincreasing function. Its discontinuities are the points where $D(\theta)$ is nondifferentiable. Furthermore the minimizing value is a value where $S(\theta)$ is 0 or, due to a discontinuity, steps through 0. We express this by saying that $\hat{\theta}$ solves the equation
$$S(\hat{\theta}) \doteq 0. \qquad (1.3.5)$$
Suppose we can represent the above estimate by $\hat{\theta} = \hat{\theta}(\mathbf{x}) = \hat{\theta}(H_n)$, where $H_n$ denotes the empirical distribution function of the sample. The notation $\hat{\theta}(H_n)$ is suggestive of the functional notation used in the last section. This is as it should be, since it is easy to show that $\hat{\theta}$ satisfies the sample analogues of properties (2) and (3) of Definition 1.2.1. For property (2), consider the estimating equation of the translated sample $\mathbf{y} = a\mathbf{x} + \mathbf{1}b$, for $a > 0$, given by
$$\hat{\theta}(\mathbf{y}) = \operatorname{argmin}_\theta \|\mathbf{y} - \theta\mathbf{1}\| = a\, \operatorname{argmin}_\theta \left\|\mathbf{x} - \frac{\theta - b}{a}\mathbf{1}\right\|.$$
From this we immediately have that $\hat{\theta}(\mathbf{y}) = a\hat{\theta}(\mathbf{x}) + b$. For property (3), the defining equation for the sample $\mathbf{y} = -\mathbf{x}$ is
$$\hat{\theta}(\mathbf{y}) = \operatorname{argmin}_\theta \|\mathbf{y} - \theta\mathbf{1}\| = \operatorname{argmin}_\theta \|\mathbf{x} - (-\theta)\mathbf{1}\|,$$
from which we have $\hat{\theta}(\mathbf{y}) = -\hat{\theta}(\mathbf{x})$. Furthermore, for the norms considered in this book it is easy to check that $\hat{\theta}(H_n) \ge \hat{\theta}(G_n)$ when $H_n$ and $G_n$ are empirical cdfs for which $H_n$ is stochastically larger than $G_n$. Hence, the norms generate location functionals on the set of empirical cdfs. The $L_1$ norm provides an easy example. We can think of $\hat{\theta}(H_n) = H_n^{-1}(\frac{1}{2})$ as the restriction of $\theta(H) = H^{-1}(\frac{1}{2})$ to the class of discrete distributions which assign mass $1/n$ to n points. Generally we can think of $\hat{\theta}(H_n)$ as the restriction of $\theta(H)$ or, conversely, we can think of $\theta(H)$ as the extension of $\hat{\theta}(H_n)$. We let the norm determine the location. This is especially simple in the symmetric location model where all location functionals are equal to the point of symmetry.
Next consider the hypotheses,
$$H_0\colon \theta = \theta_0 \quad\text{versus}\quad H_A\colon \theta \ne \theta_0, \qquad (1.3.6)$$
for a specified $\theta_0$. Because of the second property of location functionals in Definition 1.2.1, we can assume without loss of generality that $\theta_0 = 0$; otherwise we need only subtract $\theta_0$ from each $X_i$. Based on the data, the most acceptable value of $\theta$ is the value at which the gradient $S(\theta)$ is zero. Hence large values of $|S(0)|$ favor $H_A$. Formally the level $\alpha$ gradient test or score test for the hypotheses (1.3.6) is given by
$$\text{Reject } H_0 \text{ in favor of } H_A \text{ if } |S(0)| \ge c, \qquad (1.3.7)$$
where c is such that $P_0[|S(0)| \ge c] = \alpha$. Typically, the null distribution of S(0) is symmetric so there is no loss in generality in considering symmetrical critical regions.
A second formulation of a test statistic is based on the difference in minimizing dispersions or the reduction in dispersion. Call Model (1.2.1) the full model. As noted above, the distance between $\mathbf{x}$ and the subspace $\Omega_F$ is $D(\hat{\theta})$. The reduced model is the full model subject to $H_0$. In this case the reduced model space is $\{\mathbf{0}\}$. Hence the distance between $\mathbf{x}$ and the reduced model space is $D(0)$. Under $H_0$, $\mathbf{x}$ should be close to this space; therefore, the reduction in dispersion test is given by
$$\text{Reject } H_0 \text{ in favor of } H_A \text{ if } RD = D(0) - D(\hat{\theta}) \ge m, \qquad (1.3.8)$$
where m is determined by the null distribution of RD. This test will be used in Chapter 3 and subsequent chapters.
A third formulation is based on the standardized estimate:
$$\text{Reject } H_0 \text{ in favor of } H_A \text{ if } \frac{|\hat{\theta}|}{\sqrt{\operatorname{Var}\hat{\theta}}} \ge \gamma, \qquad (1.3.9)$$
where $\gamma$ is determined by the null distribution of $\hat{\theta}$. Tests based directly on the estimate are often referred to as Wald-type tests.
The following useful theorem allows us to shift between computing probabilities when $\theta = 0$ and for general $\theta$. Its proof is a straightforward application of a change of variables. See Theorem A.2.4 of the Appendix for a more general result.

Theorem 1.3.1. Suppose that we can write $S(\theta) = S(x_1 - \theta, \ldots, x_n - \theta)$. Then $P_\theta(S(0) \le t) = P_0(S(-\theta) \le t)$.
We now turn to the problem of the construction of a $(1-\alpha)100\%$ confidence interval for $\theta$ based on $S(\theta)$. Such an interval is easily obtained by inverting the acceptance region of the level $\alpha$ test given by (1.3.7). The acceptance region is $|S(0)| < c$. Define
$$\hat{\theta}_L = \inf\{t : S(t) < c\} \quad\text{and}\quad \hat{\theta}_U = \sup\{t : S(t) > -c\}. \qquad (1.3.10)$$
Then because $S(\theta)$ is nonincreasing,
$$\{\theta : |S(\theta)| < c\} = \{\theta : \hat{\theta}_L \le \theta \le \hat{\theta}_U\}. \qquad (1.3.11)$$
Thus from Theorem 1.3.1,
$$P_\theta(\hat{\theta}_L \le \theta \le \hat{\theta}_U) = P_\theta(|S(\theta)| < c) = P_0(|S(0)| < c) = 1 - \alpha. \qquad (1.3.12)$$
Hence, inverting a size $\alpha$ test results in the $(1-\alpha)100\%$ confidence interval $(\hat{\theta}_L, \hat{\theta}_U)$.
Thus a norm not only provides a fitting criterion but also a complete inference. As with all statistical analyses, checks on the appropriateness of the model and the quality of fit are needed. Useful plots here include: stem-leaf plots and $q$-$q$ plots to check shape and distributional assumptions, boxplots and dotplots to check for outlying observations, and a plot of $X_i$ versus i (or other appropriate variables) to check for dependence between observations. Some of these diagnostic checks are performed in the next section of numerical examples.

In the next three examples, we discuss the inference for the norms associated with the location functionals presented in the last section. We state the results of their associated inference, which we will derive in later sections.
Example 1.3.1. $L_1$-Norm

Recall that the $L_1$ norm is defined as $\|\mathbf{x}\|_1 = \sum |x_i|$; hence the associated dispersion and negative gradient functions are given respectively by $D_1(\theta) = \sum |X_i - \theta|$ and $S_1(\theta) = \sum \operatorname{sgn}(X_i - \theta)$. Letting $H_n$ denote the empirical cdf, we can write the estimating equation as
$$0 = n^{-1} \sum \operatorname{sgn}(x_i - \theta) = \int \operatorname{sgn}(x - \theta)\, dH_n(x).$$
The solution, of course, is $\hat{\theta}$, the median of the observations. If we replace the empirical cdf $H_n$ by the true underlying cdf H then the estimating equation becomes the defining equation for the parameter $\theta = T(H)$. In this case, we have
$$0 = \int \operatorname{sgn}(x - T(H))\, dH(x) = -\int_{-\infty}^{T(H)} dH(x) + \int_{T(H)}^{\infty} dH(x);$$
hence, $H(T(H)) = 1/2$ and solving for T(H) we find $T(H) = H^{-1}(1/2)$ as expected.
As we show in Section 1.5,
$$\hat{\theta} \text{ has an asymptotic } N(\theta, \tau_S^2/n) \text{ distribution}, \qquad (1.3.13)$$
where $\tau_S = 1/(2h(\theta))$. Estimation of the standard deviation of $\hat{\theta}$ is discussed in Section 1.5.
Turning next to testing the hypotheses (1.3.6), the gradient test statistic is $S_1(0) = \sum \operatorname{sgn}(X_i)$. But we can write $S_1(0) = S_1^+ - S_1^- + S_1^0$, where $S_1^+ = \sum I(X_i > 0)$, $S_1^- = \sum I(X_i < 0)$, and $S_1^0 = \sum I(X_i = 0) = 0$ with probability one since we are sampling from a continuous distribution, and $I(\cdot)$ is the indicator function. In practice, we must deal with ties and this is usually done by setting aside those observations that are equal to the hypothesized value and carrying out the test with a reduced sample size. Now note that $n = S_1^+ + S_1^-$ so that we can write $S_1 = 2S_1^+ - n$ and the test can be based on $S_1^+$. The null distribution of $S_1^+$ is binomial with parameters n and 1/2. Hence the level $\alpha$ sign test of the hypotheses (1.3.6) is
$$\text{Reject } H_0 \text{ in favor of } H_A \text{ if } S_1^+ \le c_1 \text{ or } S_1^+ \ge n - c_1, \qquad (1.3.14)$$
and $c_1$ satisfies
$$P[\operatorname{bin}(n, 1/2) \le c_1] = \alpha/2, \qquad (1.3.15)$$
where bin(n, 1/2) denotes a binomial random variable based on n trials and with probability of success 1/2. Note that the critical value of the test can be determined without specifying the shape of F. In this sense, the test based on $S_1$ is distribution free or nonparametric. Using the asymptotic null distribution of $S_1^+$, $c_1$ can be approximated as $c_1 \doteq n/2 - n^{1/2} z_{\alpha/2}/2 - .5$, where $\Phi(-z_{\alpha/2}) = \alpha/2$, $\Phi(\cdot)$ is the standard normal cdf, and .5 is the continuity correction.
For the associated $(1-\alpha)100\%$ confidence interval, we follow the general development above, (1.3.12). Hence, we must find $\hat{\theta}_L = \inf\{t : S_1^+(t) < n - c_1\}$, where $c_1$ is given by (1.3.15). Note that $S_1^+(t) < n - c_1$ if and only if the number of $X_i$ greater than t is less than $n - c_1$. But $\#\{i : X_i > X_{(c_1+1)}\} = n - c_1 - 1$ and $\#\{i : X_i > X_{(c_1+1)} - \epsilon\} \ge n - c_1$ for any $\epsilon > 0$. Hence, $\hat{\theta}_L = X_{(c_1+1)}$. A similar argument shows that $\hat{\theta}_U = X_{(n-c_1)}$. We can summarize this by saying that the $(1-\alpha)100\%$ $L_1$ confidence interval is the half open, half closed interval
$$[X_{(c_1+1)}, X_{(n-c_1)}) \text{ where } \alpha/2 = P(S_1^+(0) \le c_1) \text{ determines } c_1. \qquad (1.3.16)$$
The critical value $c_1$ can be determined from the binomial(n, 1/2) distribution or from the normal approximation cited above. The interval developed here is a distribution-free confidence interval since the confidence coefficient is determined from the binomial distribution without making any shape assumption on the underlying model distribution.
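The pieces of this example are easily assembled in base R. The following sketch is our illustration (the function name sign.analysis is ours; it is not part of the RBR collection): it computes $S_1^+$, its exact binomial p-value, and the interval (1.3.16).

## Sign analysis of a sample x with hypothesized value theta0 (base R only)
sign.analysis <- function(x, theta0 = 0, alpha = 0.05) {
  est <- median(x)                    # L_1 estimate of theta
  xt <- x[x != theta0]                # ties with theta0 set aside for the test
  S1.plus <- sum(xt > theta0)         # S_1^+ = #{X_i > theta0}
  pval <- binom.test(S1.plus, length(xt), p = 0.5)$p.value  # exact bin(n,1/2) null
  n <- length(x)
  c1 <- qbinom(alpha / 2, n, 0.5) - 1 # largest c_1 with P[bin(n,1/2) <= c_1] <= alpha/2
  xs <- sort(x)
  list(estimate = est, S1.plus = S1.plus, p.value = pval,
       conf.int = c(xs[c1 + 1], xs[n - c1]))  # [X_((c1+1)), X_((n-c1)))
}

Because of the discreteness of the binomial distribution, the attained confidence coefficient of the returned interval is at least $(1-\alpha)100\%$.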
Example 1.3.2. $L_2$-Norm

Recall that the square of the $L_2$-norm is given by $\|\mathbf{x}\|_2^2 = \sum_{i=1}^n x_i^2$. As shown in Exercise 1.12.4, the estimate determined by this norm is the sample mean $\overline{X}$ and the functional parameter is $\mu = \int x h(x)\, dx$, provided it exists. Hence the $L_2$ norm is consistent for the mean location problem. The associated test statistic is equivalent to Student's t-test. The approximate distribution of $\overline{X}$ is $N(0, \sigma^2/n)$, provided the variance $\sigma^2 = \operatorname{Var} X_1$ exists. Hence, the test statistic is not distribution free. In practice, $\sigma$ is replaced by its estimate $s = \left(\sum (X_i - \overline{X})^2/(n-1)\right)^{1/2}$ and the test is based on the t-ratio, $t = \sqrt{n}\,\overline{X}/s$, which, under the null hypothesis, is asymptotically $N(0, 1)$. The usual confidence interval is $\overline{X} \pm t_{\alpha/2, n-1}\, s/\sqrt{n}$, where $t_{\alpha/2, n-1}$ is the $(1 - \alpha/2)$-quantile of a t-distribution with $n-1$ degrees of freedom. This interval has the approximate confidence coefficient $(1-\alpha)100\%$, unless the errors, $e_i$, follow a normal distribution in which case it has exact confidence.
Example 1.3.3. Weighted $L_1$ Norm

Consider the function
$$\|\mathbf{x}\|_3 = \sum_{i=1}^n R(|x_i|)\,|x_i|, \qquad (1.3.17)$$
where $R(|x_i|)$ denotes the rank of $|x_i|$ among $|x_1|, \ldots, |x_n|$. As the next theorem shows, this function is a norm on $\mathbb{R}^n$. See Section 1.8 for a general weighted $L_1$ norm.
Theorem 1.3.2. The function $\|\mathbf{x}\|_3 = \sum j|x|_{(j)} = \sum R(|x_j|)|x_j|$ is a norm, where $R(|x_j|)$ is the rank of $|x_j|$ among $|x_1|, \ldots, |x_n|$ and $|x|_{(1)} \le \cdots \le |x|_{(n)}$ are the ordered absolute values.

Proof. The equality relating $\|\mathbf{x}\|_3$ to the ranks is clear. To show that we have a norm, we first note that $\|\mathbf{x}\|_3 \ge 0$ and that $\|\mathbf{x}\|_3 = 0$ if and only if $\mathbf{x} = \mathbf{0}$. Also clearly $\|a\mathbf{x}\|_3 = |a|\,\|\mathbf{x}\|_3$ for any real a. Hence, to finish the proof, we must verify the triangle inequality. Now
$$\|\mathbf{x}+\mathbf{y}\|_3 = \sum j|x+y|_{(j)} = \sum R(|x_i+y_i|)\,|x_i+y_i| \le \sum R(|x_i+y_i|)\,|x_i| + \sum R(|x_i+y_i|)\,|y_i|. \qquad (1.3.18)$$
Consider the first term on the right side. By summing through another index we can write it as,
$$\sum R(|x_i+y_i|)\,|x_i| = \sum b_j |x|_{(j)},$$
where $b_1, \ldots, b_n$ is a permutation of the integers $1, \ldots, n$. Suppose $b_j$ is not in order; then there exist a t and an s such that $|x|_{(t)} \le |x|_{(s)}$ but $b_t > b_s$. Whence,
$$[b_s|x|_{(t)} + b_t|x|_{(s)}] - [b_t|x|_{(t)} + b_s|x|_{(s)}] = (b_t - b_s)(|x|_{(s)} - |x|_{(t)}) \ge 0.$$
Hence such an interchange never decreases the sum. This leads to the result,
$$\sum R(|x_i+y_i|)\,|x_i| \le \sum j|x|_{(j)}.$$
A similar result holds for the second term on the right side of (1.3.18). Therefore, $\|\mathbf{x}+\mathbf{y}\|_3 \le \sum j|x|_{(j)} + \sum j|y|_{(j)} = \|\mathbf{x}\|_3 + \|\mathbf{y}\|_3$, and this completes the proof. The above argument is taken from Hardy, Littlewood, and Polya (1952).
We shall call this norm the weighted $L_1$ norm. In the next theorem, we offer an interesting identity satisfied by this norm. First, though, we need another representation of it. For a random sample $X_1, \ldots, X_n$, define the anti-ranks to be the random variables $D_1, \ldots, D_n$ such that
$$Z_1 = |X_{D_1}| \le \cdots \le Z_n = |X_{D_n}|. \qquad (1.3.19)$$
For example, if $D_1 = 2$ then $|X_2|$ is the smallest absolute value and $Z_1$ has rank 1. Note that the anti-rank function is just the inverse of the rank function. We can then write
$$\|\mathbf{x}\|_3 = \sum_{j=1}^n j|x|_{(j)} = \sum_{j=1}^n j|x_{D_j}|. \qquad (1.3.20)$$
Theorem 1.3.3. For any vector $\mathbf{x}$,
$$\|\mathbf{x}\|_3 = \sum_{i \le j}\left|\frac{x_i + x_j}{2}\right| + \sum_{i<j}\left|\frac{x_i - x_j}{2}\right|. \qquad (1.3.21)$$
Proof: Letting the index run through the anti-ranks, we have
$$\sum_{i \le j}\left|\frac{x_i + x_j}{2}\right| + \sum_{i<j}\left|\frac{x_i - x_j}{2}\right| = \sum_{i=1}^n |x_i| + \sum_{i<j}\left[\left|\frac{x_{D_i} + x_{D_j}}{2}\right| + \left|\frac{x_{D_j} - x_{D_i}}{2}\right|\right]. \qquad (1.3.22)$$
For $i < j$, hence $|x_{D_i}| \le |x_{D_j}|$, consider the expression,
$$\left|\frac{x_{D_i} + x_{D_j}}{2}\right| + \left|\frac{x_{D_j} - x_{D_i}}{2}\right|.$$
There are four cases to consider: where $x_{D_i}$ and $x_{D_j}$ are both positive; where they are both negative; and the two cases where they have mixed signs. In all these cases, though, it is easy to show that
$$\left|\frac{x_{D_i} + x_{D_j}}{2}\right| + \left|\frac{x_{D_j} - x_{D_i}}{2}\right| = |x_{D_j}|.$$
Using this, we have that the right side of expression (1.3.22) is equal to:
$$\sum_{i=1}^n |x_i| + \sum_{i<j} |x_{D_j}| = \sum_{j=1}^n |x_{D_j}| + \sum_{j=1}^n (j-1)|x_{D_j}| = \sum_{j=1}^n j|x_{D_j}| = \|\mathbf{x}\|_3, \qquad (1.3.23)$$
and we are finished.
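As a quick numerical sanity check (ours, not from the text), identity (1.3.21) can be verified in R for an arbitrary vector:

## Numerical check of (1.3.21): sum_j j|x|_(j) equals the sum of the
## absolute Walsh averages plus the sum over i < j of |x_i - x_j|/2
set.seed(2)
x <- rnorm(8)
n <- length(x)
lhs <- sum((1:n) * sort(abs(x)))    # ||x||_3 computed via the ranks
s <- outer(x, x, "+") / 2           # (x_i + x_j)/2
d <- outer(x, x, "-") / 2           # (x_i - x_j)/2
rhs <- sum(abs(s[lower.tri(s, diag = TRUE)])) + sum(abs(d[lower.tri(d)]))
all.equal(lhs, rhs)                 # TRUE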
The associated gradient function is
$$T(\theta) = \sum_{i=1}^n R(|X_i - \theta|)\operatorname{sgn}(X_i - \theta) = \sum_{i \le j} \operatorname{sgn}\left(\frac{X_i + X_j}{2} - \theta\right). \qquad (1.3.24)$$
The middle term is due to the fact that the ranks only change values at the finite number of points determined by $|X_i - \theta| = |X_j - \theta|$; otherwise $R(|X_i - \theta|)$ is constant. The third term is obtained immediately from the identity (1.3.21). The $n(n+1)/2$ pairwise averages $\{(X_i + X_j)/2 : 1 \le i \le j \le n\}$ are called the Walsh averages. Hence, the estimate of $\theta$ is the median of the Walsh averages, which we shall denote as,
$$\hat{\theta}_3 = \operatorname{med}_{i \le j}\left\{\frac{X_i + X_j}{2}\right\}, \qquad (1.3.25)$$
first discussed by Hodges and Lehmann (1963). Often $\hat{\theta}_3$ is called the Hodges-Lehmann estimate of location. In order to obtain the corresponding location functional, note that
$$R(|X_i - \theta|) = \#\{|X_j - \theta| \le |X_i - \theta|\} = \#\{\theta - |X_i - \theta| \le X_j \le \theta + |X_i - \theta|\} = nH_n(\theta + |X_i - \theta|) - nH_n^-(\theta - |X_i - \theta|),$$
where $H_n^-$ is the left limit of $H_n$. Hence (1.3.24) becomes
$$\int \left\{H_n(\theta + |x - \theta|) - H_n^-(\theta - |x - \theta|)\right\} \operatorname{sgn}(x - \theta)\, dH_n(x) = 0,$$
and in the limit we have,
$$\int \left\{H(\theta + |x - \theta|) - H(\theta - |x - \theta|)\right\} \operatorname{sgn}(x - \theta)\, dH(x) = 0,$$
that is,
$$-\int_{-\infty}^{\theta} \left\{H(2\theta - x) - H(x)\right\} dH(x) + \int_{\theta}^{\infty} \left\{H(x) - H(2\theta - x)\right\} dH(x) = 0.$$
This simplifies to
$$\int_{-\infty}^{\infty} H(2\theta - x)\, dH(x) = \frac{1}{2}. \qquad (1.3.26)$$
Hence, the functional is the pseudo-median defined in Example 1.2.3. If the density h(x) is symmetric then from (1.7.11)
$$\hat{\theta}_3 \text{ has an approximate } N(\theta_3, \tau^2/n) \text{ distribution}, \qquad (1.3.27)$$
where $\tau = 1/(\sqrt{12}\int h^2(x)\, dx)$. Estimation of $\tau$ is discussed in Section 3.7.
The most convenient form of the gradient process is
$$T^+(\theta) = \sum_{i \le j} I\left(\frac{X_i + X_j}{2} > \theta\right) = \sum_{i=1}^n R(|X_i - \theta|)\, I(X_i > \theta). \qquad (1.3.28)$$
The corresponding gradient test statistic for the hypotheses (1.3.6) is $T^+(0)$. In Section 1.7, provided that h(x) is symmetric, it is shown that $T^+(0)$ is distribution free under $H_0$ with null mean and variance $n(n+1)/4$ and $n(n+1)(2n+1)/24$, respectively. This test is often referred to as the Wilcoxon signed-rank test. Thus the test for the hypotheses (1.3.6) is
$$\text{Reject } H_0 \text{ in favor of } H_A \text{ if } T^+(0) \le k \text{ or } T^+(0) \ge \frac{n(n+1)}{2} - k, \qquad (1.3.29)$$
where $P(T^+(0) \le k) = \alpha/2$. An approximation for k is given in the next paragraph.
Because of the similarity between the sign and signed-rank processes, the confidence interval based on $T^+(\theta)$ follows immediately from the argument given in Example 1.3.1 for the sign process. Instead of the order statistics which were used in the confidence interval based on the sign process, in this case we use the ordered Walsh averages, which we denote as $W_{(1)}, \ldots, W_{(n(n+1)/2)}$. Hence a $(1-\alpha)100\%$ confidence interval for $\theta$ is given by
$$[W_{(k+1)}, W_{((n(n+1)/2)-k)}) \text{ where } k \text{ is such that } \alpha/2 = P(T^+(0) \le k). \qquad (1.3.30)$$
As with the sign process, k can be approximated using the asymptotic normal distribution of $T^+(0)$ by
$$k \doteq \frac{n(n+1)}{4} - z_{\alpha/2}\sqrt{\frac{n(n+1)(2n+1)}{24}} - .5,$$
where $z_{\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution. Provided that h(x) is symmetric, this confidence interval is distribution free.
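A minimal base R sketch of the estimate (1.3.25) and the interval (1.3.30) follows; it is our illustration (hl.analysis is not an RBR function), with k taken from the exact null distribution of $T^+(0)$ via qsignrank.

## Hodges-Lehmann estimate and Walsh-average confidence interval
hl.analysis <- function(x, alpha = 0.05) {
  n <- length(x)
  w <- outer(x, x, "+") / 2
  walsh <- sort(w[lower.tri(w, diag = TRUE)])  # ordered Walsh averages W_(1) <= ...
  k <- qsignrank(alpha / 2, n) - 1             # largest k with P(T+(0) <= k) <= alpha/2
  list(estimate = median(walsh),               # Hodges-Lehmann estimate (1.3.25)
       conf.int = c(walsh[k + 1], walsh[n * (n + 1) / 2 - k]))
}

For samples without ties, wilcox.test(x, conf.int = TRUE) returns the same estimate and essentially the same interval.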
1.3.1 Computation

The three procedures discussed in this section are easily computed in R. The R intrinsic functions t.test and wilcox.test compute the t- and Wilcoxon signed-rank tests, respectively. Our collection of R functions, RBR, contains the functions onesampwil and onesampsgn which compute the asymptotic versions of the Wilcoxon signed-rank and sign tests, respectively. These functions also compute the associated estimates, confidence intervals and standard errors. Their use is discussed in the examples. Minitab (see ??) also can be used to compute these tests. At command line, the Minitab commands stest, wtest, and ttest compute the sign, Wilcoxon signed-rank, and t-tests, respectively.
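For reference, assuming the sample is in a vector x, a minimal sketch of the base R calls is:

wilcox.test(x, mu = 0, conf.int = TRUE)  # signed-rank test, HL estimate, CI
t.test(x, mu = 0)                        # t-test, sample mean, t-interval
binom.test(sum(x > 0), length(x))        # exact sign test of theta = 0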
1.4 Examples

In applications by convention, when testing the null hypothesis $H_0\colon \theta = \theta_0$ using the sign test, any data point equal to $\theta_0$ is set aside and the sample size is reduced. On the other hand, these values are not set aside for point estimation or confidence intervals. The output of the RBR functions onesampwil and onesampsgn includes the test statistics T and S, respectively, and a continuity corrected standardized value z. The p-values are approximated
Table 1.4.1: Excess hours of sleep under the influence of two drugs and the difference in excesses.

Row Dextro Laevo Diff (L-D)
1 -0.1 -0.1 0.0
2 0.8 1.6 0.8
3 3.4 4.4 1.0
4 0.7 1.9 1.2
5 -0.2 1.1 1.3
6 -1.2 0.1 1.3
7 2.0 3.4 1.4
8 3.7 5.5 1.8
9 -1.6 0.8 2.4
10 0.0 4.6 4.6
by computing normal probabilities on z. Especially for small sample sizes, for the test based on the signs, S, the approximate and exact p-values can be somewhat different. In calculating the signed-ranks for the test statistic T, we use average ranks. For t-tests, we report the p-values and confidence intervals using the t-distribution with $n-1$ degrees of freedom.
Example 1.4.1. Cushney-Peebles Data.

The data given in Table 1.4.1 are the average excess number of hours of sleep that each of 10 patients achieved from the use of two drugs. The third column gives the difference (Laevo - Dextro) in excesses across the two drugs. This is a famous data set. Gosset, writing under the pseudonym Student, published his landmark paper on the t-test in 1908 and used this data set for illustration. The differences, however, suggest that the $L_2$ methods may not be the methods of choice in this case. The normal quantile plot, Panel A of Figure 1.4.1, shows that the tails may be heavy and that there may be an outlier. A normal quantile plot has the data (differences) on the vertical axis and the expected values of the standard normal order statistics on the horizontal axis. When the data are consistent with a normal assumption, the plot should be roughly linear. The boxplot, with 95% $L_1$ confidence interval, Panel B of Figure 1.4.1, further illustrates the presence of an outlier. The box is defined by the quartiles and the shaded notch represents the confidence interval.

For the sake of discussion and comparison of methods, we provide the p-values for the sign test, the Wilcoxon signed-rank test, and the t-test. We used the RBR functions onesampwil, onesampsgn, and onesampt to compute the results for the Wilcoxon signed-rank test, the sign test, and the t-test, respectively. For each function, the following display shows the necessary R code (these are preceded with the prompt >) to compute these functions, which is then followed by the results. The standard errors (SE) for the sign and signed-rank estimates are given by (1.5.28) and (1.7.12), respectively, and are discussed in general in Section 1.5.5. These functions also produce a boxplot of the data. The boxplot produced by the function onesampsgn is shown in Figure 1.4.1.
Figure 1.4.1: Panel A: Normal q-q plot of Cushney-Peebles data; Panel B: Boxplot with 95% notched confidence interval; Panel C: Sensitivity curve for t-test; Panel D: Sensitivity curve for sign test. [Figure not reproduced. Panels A and B display the differences (Laevo - Dextro); Panels C and D plot the t-test and standardized sign test statistics, respectively, against the value of the 10th difference.]
Assumes that the differences are in the vector diffs
> onesampwil(diffs)
Results for the Wilcoxon-Signed-Rank procedure
Test of theta = 0 versus theta not equal to 0
Test-Stat. is T 54 Standardized (z) Test-Stat. is 2.70113 p-vlaue 0.00691043
Estimate 1.3 SE is 0.484031
95 % Confidence Interval is ( 0.9 , 2.7 )
Estimate of the scale parameter tau 1.530640
> onesampsgn(diffs)
Results for the Sign procedure
Test of theta = 0 versus theta not equal to 0
Test stat. S is 9 Standardized (z) Test-Stat. 2.666667 p-vlaue 0.007660761
Estimate 1.3 SE is 0.4081708
95 % Confidence Interval is ( 0.8 , 2.4 )
Estimate of the scale parameter tau 1.290749
> temp=onesampt(diffs)
Results for the t-test procedure
Test of theta = 0 versus theta not equal to 0
Test stat. Ave(x) - 0 is 1.58 Standardized (t) Test-Stat. 4.062128 p-vlaue 0.00283289
Estimate 1.58 SE is 0.3889587
95 % Confidence Interval is ( 0.7001142 , 2.459886 )
Estimate of the scale parameter sigma 1.229995
The confidence interval corresponding to the sign test is (0.8, 2.4), which is shifted above 0. Hence, there is strong support for the alternative hypothesis that the location of the difference distribution is not equal to zero. That is, we reject $H_0\colon \theta = 0$ in favor of $H_A\colon \theta \ne 0$ at $\alpha = .05$. All three tests support this conclusion. The estimates of location corresponding to the three tests are the median (1.3), the median of the Walsh averages (1.3), and the mean of the sample differences (1.58). Note that the outlier had an effect on the sample mean.

In order to see how sensitive the test statistics are to outliers, we change the value of the outlier (the difference in the 10th row of Table 1.4.1) and plot the value of the test statistic against the value of this difference; see Panel C of Figure 1.4.1. Note that as the value of the 10th difference changes the t-test changes quite rapidly. In fact, the t-test can be pulled out of the rejection region by making the difference sufficiently small or large. However, the sign test, Panel D of Figure 1.4.1, stays constant until the difference crosses zero and then only changes by 2. This illustrates the high sensitivity of the t-test to outliers and the relative resistance of the sign test. A similar plot can be prepared for the Wilcoxon signed-rank test; see Exercise 1.12.8. In addition, the corresponding p-values can be plotted to see how sensitive the decision to reject the null hypothesis is to outliers. Sensitivity plots are similar to influence functions. We discuss influence functions for estimates in Section 1.6.
Example 1.4.2. Shoshoni Rectangles.
Table 1.4.2: Width to Length Ratios of Rectangles
0.553 0.570 0.576 0.601 0.606 0.606 0.609 0.611 0.615 0.628
0.654 0.662 0.668 0.670 0.672 0.690 0.693 0.749 0.844 0.933
The golden rectangle is a rectangle in which the ratio of the width to length is approximately 0.618. It can be characterized in various ways. For example, $w/l = l/(w+l)$ characterizes the golden rectangle. It is considered to be an aesthetic standard in Western civilization and appears in art and architecture going back to the ancient Greeks. It now appears in such items as credit and business cards. In a cultural anthropology study, DuBois (1960) reports on a study of the Shoshoni beaded baskets. These baskets contain beaded rectangles and the question was whether the Shoshonis use the same aesthetic standard as the West. A sample of twenty width to length ratios from Shoshoni baskets is given in Table 1.4.2.

Panel A of Figure 1.4.2 shows the notched boxplot containing the 95% $L_1$ confidence interval for the median of the population of w/l ratios. It shows two outliers, which are also apparent in the normal quantile plot, Panel B of Figure 1.4.2. We used the sign procedure to analyze the data, performing the computations with the RBR function onesampsgn.
Figure 1.4.2: Panel A: Boxplot of width to length ratios of Shoshoni rectangles; Panel B: Normal q-q plot of the ratios. [Figure not reproduced.]
For this problem, it is of interest to test $H_0\colon \theta = 0.618$ (the golden rectangle). The display below shows this evaluation for the sign test along with a 90% confidence interval for $\theta$.
> onesampsgn(x,theta0=.618,alpha=.10)
Results for the Sign procedure
Test of theta = 0.618 versus theta not equal to 0.618
Test stat. S is 2 Standardized (z) Test-Stat. 0.2236068 p-vlaue 0.8230633
Estimate 0.641 SE is 0.01854268
90 % Confidence Interval is ( 0.609 , 0.67 )
Estimate of the scale parameter tau 0.0829254
With a p-value of 0.823, there is no evidence to refute the null hypothesis. Further, we see that the golden rectangle 0.618 is contained in the confidence interval. This suggests that there is no evidence in this data that the Shoshonis are using a different standard.

For comparison, the analysis based on the t-procedure is
> onesampt(x,theta0=.618,alpha=.10)
Results for the t-test procedure
Test of theta = 0.618 versus theta not equal to 0.618
Test stat. Ave(x) - 0.618 is 0.0425 Standardized (t) Test-Stat. 2.054523
p-vlaue 0.05394133
Estimate 0.6605 SE is 0.02068606
90 % Confidence Interval is ( 0.624731 , 0.696269 )
Estimate of the scale parameter sigma 0.09251088
Based on the t-test with its p-value of 0.053, and the fact that the 90% t-interval does not contain the golden rectangle ratio, a researcher might conclude that there is evidence that the Shoshonis are using a different standard. Hence, the robust and traditional approaches lead to different practical conclusions for this problem. The outliers, of course, impaired the t-analysis. For this data, we have more faith in the simple sign test.
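As a base R cross-check of the sign test in this example (our illustration), binom.test applied to the data of Table 1.4.2 gives the exact p-value, which agrees with the approximate value 0.823 reported above:

x <- c(0.553, 0.570, 0.576, 0.601, 0.606, 0.606, 0.609, 0.611, 0.615, 0.628,
       0.654, 0.662, 0.668, 0.670, 0.672, 0.690, 0.693, 0.749, 0.844, 0.933)
binom.test(sum(x > 0.618), length(x), p = 0.5)  # S_1^+ = 11 of n = 20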
1.5 Properties of Normed-Based Inference
In this section, we establish statistical properties of the inference described in Section 1.3 for the norm fit of a location model. These properties describe the null and alternative distributions of the test, (1.3.7), and the asymptotic distribution of the estimate, (1.3.2). Furthermore, these properties allow us to derive relative efficiencies between competing procedures. While our discussion is general, we will illustrate the inference based on the $L_1$ and $L_2$ norms as we proceed. The inference based on the signed-rank norm will be considered in Section 1.7 and that based on norms of general signed-rank scores in Section 1.8.
We assume then that Model (1.2.1) holds for a random sample $X_1, \ldots, X_n$ with common distribution and density functions $H(x) = F(x - \theta)$ and $h(x) = f(x - \theta)$, respectively. Next a norm is specified to fit the model. We will assume that the induced functional is 0 at $F$, i.e., $T(F) = 0$. Let $S(\theta)$ be the gradient function induced by the norm. We establish the properties of the inference by considering the null and alternative behavior of the gradient test. For convenience, we consider the one sided hypothesis
$$ H_0: \theta = 0 \ \text{ versus } \ H_A: \theta > 0 . \qquad (1.5.1) $$
Since $S(\theta)$ is nonincreasing, a level $\alpha$ test of these hypotheses based on $S(0)$ is
$$ \text{Reject } H_0 \text{ in favor of } H_A \text{ if } S(0) \ge c , \qquad (1.5.2) $$
where $c$ is such that $P_0[S(0) \ge c] = \alpha$.
The power function of this test is given by
$$ \gamma_S(\theta) = P_\theta[S(0) \ge c] = P_0[S(-\theta) \ge c] , \qquad (1.5.3) $$
where the last equality follows from Theorem 1.3.1.

The power function forms a convenient summary of the test based on $S(0)$. The probability of a Type I error (level of the test) is given by $\gamma_S(0)$. The probability of a Type II error at the alternative $\theta$ is $\beta_S(\theta) = 1 - \gamma_S(\theta)$. For a given test of hypotheses (1.5.1) we want the power function to be increasing in $\theta$ with an upper limit of one. In the first subsection below, we establish these properties for the test (1.5.2). We can also compare level $\alpha$-tests of (1.5.1) by comparing their powers at alternative hypotheses. These are efficiency considerations and they are covered in later subsections.
1.5.1 Basic Properties of the Power Function $\gamma_S(\theta)$

As a first step we show that $\gamma_S(\theta)$ is nondecreasing:
Theorem 1.5.1. Suppose the test of $H_0: \theta = 0$ versus $H_A: \theta > 0$ rejects when $S(0) \ge c$. Then the power function is nondecreasing in $\theta$.

Proof. Recall that $S(\theta)$ is nonincreasing in $\theta$ since $D(\theta)$ is convex. By Theorem 1.3.1, $\gamma_S(\theta) = P_0[S(-\theta) \ge c]$. Now, if $\theta_1 \le \theta_2$ then $S(-\theta_1) \le S(-\theta_2)$ and, hence, $S(-\theta_1) \ge c$ implies that $S(-\theta_2) \ge c$. It then follows that $P_0(S(-\theta_1) \ge c) \le P_0(S(-\theta_2) \ge c)$ and the power function is monotone in $\theta$ as required.
This theorem shows that the test of $H_0: \theta = 0$ versus $H_A: \theta > 0$ based on $S(0)$ is unbiased, that is, $P_\theta(S(0) \ge c) \ge \alpha$ for positive $\theta$, where $\alpha$ is the size of the test. At times it is convenient to consider the more general null hypothesis:
$$ H_0^*: \theta \le 0 \ \text{ versus } \ H_A: \theta > 0 . \qquad (1.5.4) $$
A test of $H_0^*$ versus $H_A$ with power function $\gamma_S$ is said to have level $\alpha$ if
$$ \sup_{\theta \le 0} \gamma_S(\theta) = \alpha . $$
The proof of Theorem 1.5.1 shows that $\gamma_S(\theta)$ is nondecreasing in all $\theta \in \mathbb{R}$. Since the gradient test has level $\alpha$ for $H_0$, it follows immediately that it has level $\alpha$ for $H_0^*$ also.
We next show that the power function of the gradient test converges to 1 as $\theta \to \infty$. We formally define this as:

Definition 1.5.1. Consider a level $\alpha$ test for the hypotheses (1.5.1) which has power function $\gamma_S(\theta)$. We say the test is resolving if $\gamma_S(\theta) \to 1$ as $\theta \to \infty$.
Theorem 1.5.2. Suppose the test of $H_0: \theta = 0$ versus $H_A: \theta > 0$ rejects when $S(0) \ge c$. Further, let $s^* = \sup_\theta S(\theta)$ and suppose that $s^*$ is attained for some finite value of $\theta$. Then the test is resolving, that is, $P_\theta(S(0) \ge c) \to 1$ as $\theta \to \infty$.

Proof. Since $S(\theta)$ is nonincreasing, for any unbounded increasing sequence $\theta_m$, $S(-\theta_m) \le S(-\theta_{m+1})$. For fixed $n$ and $F$, there is a real number $a$ such that $P_0(|X_i| \le a,\ i = 1, \ldots, n) > 1 - \varepsilon$ for any specified $\varepsilon > 0$. Let $A_\varepsilon$ denote the event $\{|X_i| \le a,\ i = 1, \ldots, n\}$. Now,
$$ P_{\theta_m}(S(0) \ge c) = P_0(S(-\theta_m) \ge c) = 1 - P_0(S(-\theta_m) < c) = 1 - P_0(\{S(-\theta_m) < c\} \cap A_\varepsilon) - P_0(\{S(-\theta_m) < c\} \cap A_\varepsilon^c) . $$
The hypothesis of the theorem implies that, for sufficiently large $m$, $\{S(-\theta_m) < c\} \cap A_\varepsilon$ is empty. Further, $P_0(\{S(-\theta_m) < c\} \cap A_\varepsilon^c) \le P_0(A_\varepsilon^c) < \varepsilon$. Hence, for $m$ sufficiently large, $P_{\theta_m}(S(0) \ge c) \ge 1 - \varepsilon$ and the proof is complete.
The condition of boundedness imposed on $S(\theta)$ in the above theorem holds for almost all the nonparametric tests discussed in this book; hence, these nonparametric tests will be resolving. Thus they will be able to discern large alternative hypotheses with high power. What can be said at a fixed alternative? Recall the definition of a consistent test:

Definition 1.5.2. We say that a test is consistent if the power tends to one for each fixed alternative as the sample size $n$ increases. The alternatives consist of specified values of $\theta$ and a cdf $F$.
Consistency implies that the test is behaving as expected when the sample size increases and the alternative hypothesis is true. To obtain consistency of the gradient test, we need to impose the following two assumptions on $S(\theta)$: first,
$$ \bar{S}(\theta) = S(\theta)/n^\gamma \xrightarrow{P_\theta} \mu(\theta) , \ \text{ where } \mu(0) = 0 \text{ and } \mu(0) < \mu(\theta) \text{ for all } \theta > 0 , \qquad (1.5.5) $$
for some $\gamma > 0$ and, secondly,
$$ E_0 \bar{S}(0) = 0 \ \text{ and } \ \sqrt{n}\,\bar{S}(0) \xrightarrow{D} N(0, \sigma^2(0)) \text{ under } H_0 \text{ for all } F , \qquad (1.5.6) $$
for some positive constant $\sigma(0)$. The first assumption means that $\bar{S}(0)$ separates the null from the alternative hypothesis. Note, it is not crucial that $\mu(0) = 0$, since this can always be achieved by recentering. It will be useful to have the following result concerning the asymptotic null distribution of $\bar{S}(0)$. Its proof follows readily from the definition of convergence in distribution.
Theorem 1.5.3. Assume (1.5.6). The test defined by rejecting $H_0$ when $\sqrt{n}\,\bar{S}(0) \ge z_\alpha\,\sigma(0)$, where $z_\alpha$ is the upper $\alpha$ percentile of the standard normal cdf, i.e., $1 - \Phi(z_\alpha) = \alpha$, is asymptotically size $\alpha$. Hence, $P_0(\sqrt{n}\,\bar{S}(0) \ge z_\alpha\,\sigma(0)) \to \alpha$.
It follows that a gradient test is consistent; i.e.,

Theorem 1.5.4. Assume conditions (1.5.5) and (1.5.6). Then the gradient test $\sqrt{n}\,\bar{S}(0) \ge z_\alpha\,\sigma(0)$ is consistent, i.e., the power at fixed alternatives tends to one as $n$ increases.

Proof. Fix $\theta^* > 0$ and $F$. For $\varepsilon > 0$ and for large $n$, we have $n^{-1/2} z_\alpha\,\sigma(0) < \mu(\theta^*) - \varepsilon$. This leads to the following string of inequalities:
$$ P_{\theta^*,F}(\bar{S}(0) \ge n^{-1/2} z_\alpha\,\sigma(0)) \ge P_{\theta^*,F}(\bar{S}(0) \ge \mu(\theta^*) - \varepsilon) \ge P_{\theta^*,F}(|\bar{S}(0) - \mu(\theta^*)| \le \varepsilon) \to 1 , $$
which is the desired result.
Example 1.5.1. The $L_1$ Case

Assume that the model cdf $F$ has the unique median 0. Consider the $L_1$ norm. The associated level $\alpha$ gradient test of (1.5.1) is equivalent to the sign test given by:
$$ \text{Reject } H_0 \text{ in favor of } H_A \text{ if } S_1^+ = \sum I(X_i > 0) \ge c , $$
where $c$ is such that $P[\mathrm{bin}(n, 1/2) \ge c] = \alpha$. The test is nonparametric, i.e., it does not depend on $F$. From the above discussion its power function is nondecreasing in $\theta$. Since $S_1^+(\theta)$ is bounded and attains its bound on a finite interval, the test is resolving. For consistency, take $\gamma = 1$ in expression (1.5.5). Then $E_\theta[n^{-1} S_1^+(0)] = P_\theta(X > 0) = 1 - F(-\theta) = \mu(\theta)$. An application of the Weak Law of Large Numbers shows that the limit in condition (1.5.5) holds. Further, $\mu(0) = 1/2 < \mu(\theta)$ for all $\theta > 0$ and all $F$. Finally, apply the Central Limit Theorem to show that (1.5.6) holds. Hence, the sign test is consistent for location alternatives. Further, it is consistent for each pair $\theta$, $F$ such that $P(X > 0) > 1/2$.
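As a numerical illustration of this consistency (our addition, not part of the original discussion), the following sketch estimates the power of the asymptotic size $\alpha = .05$ sign test at the fixed alternative $\theta = 0.3$ with standard normal errors; the Monte Carlo power climbs toward one as $n$ grows.

set.seed(123)
power.sign <- function(n, theta = 0.3, nsim = 2000) {
  crit <- n/2 + qnorm(0.95) * sqrt(n/4)    # normal approximation to c
  mean(replicate(nsim, sum(rnorm(n, mean = theta) > 0) >= crit))
}
sapply(c(20, 50, 100, 200), power.sign)    # increases toward 1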
A discussion of these properties for the gradient test based on the $L_2$ norm can be found in Exercise 1.12.5.
1.5.2 Asymptotic Linearity and Pitman Regularity

In the last section we discussed some of the basic properties of the power function for a gradient test. Next we establish some general results that will allow us to compare power functions for different level $\alpha$-tests. These results will also lead to the asymptotic distributions of the location estimators $\hat{\theta}$ based on norm fits. We will also make use of them in later sections and chapters.

Assume the setup found at the beginning of this section; i.e., we are considering the location model (1.3.1) and we have specified a norm with gradient function $S(\theta)$. We first define a Pitman regular process:
Definition 1.5.3. We will say an estimating function $S(\theta)$ is Pitman Regular if the following four conditions hold: first,
$$ S(\theta) \text{ is nonincreasing in } \theta ; \qquad (1.5.7) $$
second, letting $\bar{S}(\theta) = S(\theta)/n^\gamma$ for some $\gamma > 0$,
$$ \text{there exists a function } \mu(\theta) \text{ such that } \mu(0) = 0 ,\ \mu'(\theta) \text{ is continuous at } 0 ,\ \mu'(0) > 0 \text{ and either } \bar{S}(0) \xrightarrow{P_\theta} \mu(\theta) \text{ or } E_\theta[\bar{S}(0)] = \mu(\theta) ; \qquad (1.5.8) $$
third,
$$ \sup_{|b| \le B} \left| \sqrt{n}\,\bar{S}\!\left(\frac{b}{\sqrt{n}}\right) - \sqrt{n}\,\bar{S}(0) + \mu'(0)\,b \right| \xrightarrow{P} 0 , \qquad (1.5.9) $$
for any $B > 0$; and fourth, there is a constant $\sigma(0)$ such that
$$ \sqrt{n}\,\frac{\bar{S}(0)}{\sigma(0)} \xrightarrow{D_0} N(0, 1) . \qquad (1.5.10) $$
Further, the quantity
$$ c = \mu'(0)/\sigma(0) \qquad (1.5.11) $$
is called the efficacy of $S(\theta)$.
Condition (1.5.9) is called the asymptotic linearity of the process $S(\theta)$. Often we can compute $c$ when we have the mean under general $\theta$ and the variance under $\theta = 0$. Thus
$$ \mu'(0) = \frac{d}{d\theta} E_\theta[\bar{S}(0)]\Big|_{\theta = 0} \ \text{ and } \ \sigma^2(0) = \lim n\,\mathrm{Var}_0(\bar{S}(0)) . \qquad (1.5.12) $$
Hence, another way of expressing the asymptotic linearity of $S(\theta)$ is
$$ \sqrt{n}\,\frac{\bar{S}(b/\sqrt{n})}{\sigma(0)} = \sqrt{n}\,\frac{\bar{S}(0)}{\sigma(0)} - c\,b + o_p(1) . \qquad (1.5.13) $$
If we replace $b$ by $\sqrt{n}\,\theta_n$ where, of course, $|\sqrt{n}\,\theta_n| \le B$ for $B > 0$, then we can write
$$ \sqrt{n}\,\frac{\bar{S}(\theta_n)}{\sigma(0)} = \sqrt{n}\,\frac{\bar{S}(0)}{\sigma(0)} - c\,\sqrt{n}\,\theta_n + o_p(1) . \qquad (1.5.14) $$
We record one more result on limiting distributions whose proof follows from Theorems 1.3.1 and 1.5.6.

Theorem 1.5.5. Suppose $S(\theta)$ is Pitman Regular. Then
$$ \sqrt{n}\,\frac{\bar{S}(b/\sqrt{n})}{\sigma(0)} \xrightarrow{D_0} Z - c\,b \qquad (1.5.15) $$
and
$$ \sqrt{n}\,\frac{\bar{S}(0)}{\sigma(0)} \xrightarrow{D_{b/\sqrt{n}}} Z + c\,b , \qquad (1.5.16) $$
where $Z \sim N(0, 1)$ and, so, $Z + c\,b \sim N(c\,b, 1)$.

The second part of this theorem says that the limiting distribution of $\bar{S}(0)$, when standardized by $\sigma(0)$ and computed along a sequence of alternatives $b/n^{1/2}$, is still normal with the same variance of one but with a new mean, namely $c\,b$. This result will be useful in approximating the power near the null hypothesis.
We will find asymptotic linearity to be useful in establishing statistical properties. Our next result provides sufficient conditions for linearity.

Theorem 1.5.6. Let $\bar{S}(\theta) = (1/n^\gamma)\,S(\theta)$ for some $\gamma > 0$ such that the conditions (1.5.7), (1.5.8) and (1.5.10) of Definition 1.5.3 hold. Suppose, in addition, for any $b \in \mathbb{R}$,
$$ n\,\mathrm{Var}_0(\bar{S}(n^{-1/2} b) - \bar{S}(0)) \to 0 , \ \text{ as } n \to \infty . \qquad (1.5.17) $$
Then
$$ \sup_{|b| \le B} \left| \sqrt{n}\,\bar{S}\!\left(\frac{b}{\sqrt{n}}\right) - \sqrt{n}\,\bar{S}(0) + \mu'(0)\,b \right| \xrightarrow{P} 0 , \qquad (1.5.18) $$
for any $B > 0$.
Proof. First consider $U_n(b) = [\bar{S}(n^{-1/2} b) - \bar{S}(0)]/(b/\sqrt{n})$. By (1.5.8) we have
$$ E_0(U_n(b)) = \frac{\sqrt{n}}{b}\,\mu\!\left(-\frac{b}{\sqrt{n}}\right) = \frac{\sqrt{n}}{b}\left[\mu(0) - \mu'(\xi_n)\,\frac{b}{\sqrt{n}}\right] \to -\mu'(0) , \qquad (1.5.19) $$
where $\xi_n$ lies between $0$ and $-b/\sqrt{n}$. Furthermore,
$$ \mathrm{Var}_0\,U_n(b) = \frac{n}{b^2}\,\mathrm{Var}_0\!\left(\bar{S}\!\left(\frac{b}{\sqrt{n}}\right) - \bar{S}(0)\right) \to 0 . \qquad (1.5.20) $$
As Exercise 1.12.9 shows, (1.5.19) and (1.5.20) imply that $U_n(b)$ converges to $-\mu'(0)$ in probability, pointwise in $b$, i.e., $U_n(b) = -\mu'(0) + o_p(1)$.
For the second part of the proof, let $W_n(b) = \sqrt{n}\,[\bar{S}(b/\sqrt{n}) - \bar{S}(0) + \mu'(0)\,b/\sqrt{n}]$. Further, let $\varepsilon > 0$ and $\delta > 0$, and partition $[-B, B]$ into $-B = b_0 < b_1 < \cdots < b_m = B$ so that $b_i - b_{i-1} \le \varepsilon/(2\,|\mu'(0)|)$ for all $i$. There exists $N$ such that $n \ge N$ implies $P[\max_i |W_n(b_i)| > \varepsilon/2] < \delta$.

Now suppose that $W_n(b) \ge 0$ (a similar argument can be given for $W_n(b) < 0$). Then, for $b_{i-1} \le b \le b_i$, since $\bar{S}$ is nonincreasing,
$$ |W_n(b)| = \sqrt{n}\left[\bar{S}\!\left(\frac{b}{\sqrt{n}}\right) - \bar{S}(0)\right] + b\,\mu'(0) \le \sqrt{n}\left[\bar{S}\!\left(\frac{b_{i-1}}{\sqrt{n}}\right) - \bar{S}(0)\right] + b_{i-1}\,\mu'(0) + (b - b_{i-1})\,\mu'(0) \le |W_n(b_{i-1})| + (b - b_{i-1})\,|\mu'(0)| \le \max_i |W_n(b_i)| + \varepsilon/2 . $$
Hence,
$$ P_0\!\left(\sup_{|b| \le B} |W_n(b)| > \varepsilon\right) \le P_0\!\left(\max_i |W_n(b_i)| + \varepsilon/2 > \varepsilon\right) < \delta , $$
and
$$ \sup_{|b| \le B} |W_n(b)| \xrightarrow{P} 0 . $$

In the next three subsections we use these tools to handle the issues of power and efficiency for a general norm-based inference, but first we show that the $L_1$ gradient function is Pitman Regular.
Example 1.5.2. Pitman Regularity of the $L_1$ Process

Assume that the model pdf satisfies $f(0) > 0$. Recall that the $L_1$ gradient function is
$$ S_1(\theta) = \sum_{i=1}^n \mathrm{sgn}(X_i - \theta) . $$
Take $\gamma = 1$ in Theorem 1.5.6; hence, the average of interest is $\bar{S}_1(\theta) = n^{-1} S_1(\theta)$. This is nonincreasing, so condition (1.5.7) is satisfied. Next it is easy to check that $\mu(\theta) = E_\theta\,\bar{S}_1(0) = E_\theta\,\mathrm{sgn}(X_i) = E_0\,\mathrm{sgn}(X_i + \theta) = 1 - 2F(-\theta)$. Hence, $\mu'(0) = 2f(0)$, and condition (1.5.8) is satisfied. We now consider condition (1.5.17). Consider the case $b > 0$ (the case $b < 0$ is similar):
$$ \bar{S}_1(b/\sqrt{n}) - \bar{S}_1(0) = n^{-1} \sum_{i=1}^n [\mathrm{sgn}(X_i - b/\sqrt{n}) - \mathrm{sgn}(X_i)] = -(2/n) \sum_{i=1}^n I(0 < X_i < b/n^{1/2}) . $$
Because this is a sum of independent Bernoulli variables, we have
$$ n\,\mathrm{Var}_0[\bar{S}_1(b/n^{1/2}) - \bar{S}_1(0)] \le 4\,P(0 < X_1 < b/\sqrt{n}) = 4\,[F(b/\sqrt{n}) - F(0)] \to 0 . $$
The convergence to 0 occurs since $F$ is continuous. Thus condition (1.5.17) is satisfied. Finally, note that $\sigma(0) = 1$, so $\sqrt{n}\,\bar{S}_1(0)$ converges in distribution to $Z \sim N(0, 1)$ by the Central Limit Theorem. Therefore the $L_1$ gradient process $S_1(\theta)$ is Pitman regular. It follows that the efficacy of the $L_1$ process is
$$ c_{L_1} = 2f(0) . \qquad (1.5.21) $$
For future reference, we state the asymptotic linearity result for the $L_1$ process: if $|\sqrt{n}\,\theta_n| \le B$ then
$$ \sqrt{n}\,\bar{S}_1(\theta_n) = \sqrt{n}\,\bar{S}_1(0) - 2f(0)\,\sqrt{n}\,\theta_n + o_p(1) . \qquad (1.5.22) $$
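A quick simulation (ours) checks the implied asymptotic distribution: for standard normal data, $\sqrt{n}$ times the sample median should have standard deviation near $1/c_{L_1} = 1/(2f(0)) \doteq 1.25$.

set.seed(123)
n <- 100
meds <- replicate(5000, median(rnorm(n)))
sd(sqrt(n) * meds)    # Monte Carlo value, close to
1 / (2 * dnorm(0))    # the asymptotic value 1.2533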
Example 1.5.3. Pitman Regularity of the $L_2$ Process

In Exercise 1.12.6 it is shown that, provided $X_i$ has finite variance, the $L_2$ gradient function is Pitman Regular and that the efficacy is simply $c_{L_2} = 1/\sigma_f$.

We are now in a position to investigate the efficiency and power properties of the statistical methods based on the $L_1$ norm relative to the statistical methods based on the $L_2$ norm. As we will see in the next three subsections, these properties depend only on the efficacies.
1.5.3 Asymptotic Theory and Efficiency Results for $\hat{\theta}$

As at the beginning of this section, suppose we have the location model, (1.2.1), and that we have chosen a norm to fit the model with gradient function $S(\theta)$. In this part we will develop the asymptotic distribution of the estimate. The asymptotic variance will provide the basis for efficiency comparisons. We will use the asymptotic linearity that accompanies Pitman Regularity. To do this, however, we first need to show that $\sqrt{n}\,\hat{\theta}$ is bounded in probability.

Lemma 1.5.1. If the gradient function $S(\theta)$ is Pitman Regular, then $\sqrt{n}\,(\hat{\theta} - \theta) = O_p(1)$.
Proof. Assume without loss of generality that $\theta = 0$ and take $t > 0$. By the monotonicity of $S(\theta)$, if $S(t/\sqrt{n}) < 0$ then $\hat{\theta} \le t/\sqrt{n}$. Hence, $P_0(S(t/\sqrt{n}) < 0) \le P_0(\hat{\theta} \le t/\sqrt{n})$. Theorem 1.5.5 implies that the first probability can be made as close to $\Phi(tc)$ as desired, which, in turn, can be made as close to 1 as desired. In a similar vein we note that if $S(-t/\sqrt{n}) > 0$, then $\hat{\theta} \ge -t/\sqrt{n}$. Again, the probability of this event can be made arbitrarily close to 1. Hence, $P_0(\sqrt{n}\,|\hat{\theta}| \le t)$ is arbitrarily close to 1 for $t$ large, and we have boundedness in probability.
We are now in a position to exploit this boundedness in probability to determine the asymptotic distribution of the estimate.

Theorem 1.5.7. Suppose $S(\theta)$ is Pitman regular with efficacy $c$. Then $\sqrt{n}\,(\hat{\theta} - \theta)$ converges in distribution to $Z \sim N(0, c^{-2})$.

Proof. As usual we assume, without loss of generality, that $\theta = 0$. First recall that $\hat{\theta}$ is defined by $n^{-1/2} S(\hat{\theta}) \doteq 0$. From Lemma 1.5.1, we know that $\hat{\theta}$ is bounded in probability so that we can apply (1.5.13) to deduce
$$ \sqrt{n}\,\frac{\bar{S}(\hat{\theta})}{\sigma(0)} = \sqrt{n}\,\frac{\bar{S}(0)}{\sigma(0)} - c\,\sqrt{n}\,\hat{\theta} + o_p(1) . $$
Solving, we have
$$ \sqrt{n}\,\hat{\theta} = c^{-1}\,\sqrt{n}\,\bar{S}(0)/\sigma(0) + o_p(1) ; $$
hence, the result follows because $\sqrt{n}\,\bar{S}(0)/\sigma(0)$ is asymptotically $N(0, 1)$.
Definition 1.5.4. If we have two Pitman Regular estimates with efficacies $c_1$ and $c_2$, respectively, then the efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$ is defined to be the reciprocal ratio of their asymptotic variances, namely, $e(\hat{\theta}_1, \hat{\theta}_2) = c_1^2/c_2^2$.

The next example compares the $L_1$ estimate to the $L_2$ estimate.
Example 1.5.4. Relative Efficiency between the $L_1$ and $L_2$ Estimates

In this example we compare the $L_1$ and $L_2$ estimates, namely, the sample median and mean. We have seen that their respective efficacies are $2f(0)$ and $1/\sigma_f$, and their asymptotic variances are $1/[4f^2(0)n]$ and $\sigma_f^2/n$, respectively. Hence, the relative efficiency of the median with respect to the mean is
$$ e(\tilde{X}, \bar{X}) = \mathrm{asyvar}(\sqrt{n}\,\bar{X})/\mathrm{asyvar}(\sqrt{n}\,\tilde{X}) = c_{\tilde{X}}^2/c_{\bar{X}}^2 = 4f^2(0)\,\sigma_f^2 , \qquad (1.5.23) $$
where $\tilde{X}$ is the sample median and $\bar{X}$ is the sample mean. The efficiency computation depends only on the Pitman efficacies. We illustrate the computation of the efficiency using the contaminated normal distribution. The pdf of the contaminated normal distribution consists of mixing the standard normal pdf with a normal pdf having mean zero and variance $\sigma_c^2 > 1$. For $\epsilon$ between 0 and 1, the pdf can be written:
$$ f_\epsilon(x) = (1 - \epsilon)\,\phi(x) + \epsilon\,\sigma_c^{-1}\,\phi(\sigma_c^{-1} x) \qquad (1.5.24) $$
with $\sigma_f^2 = 1 + \epsilon(\sigma_c^2 - 1)$. This distribution has tails heavier than the standard normal distribution and can be used to model data contamination; see Tukey (1960) for more discussion. We can think of $\epsilon$ as the fraction of the data that is contaminated. In Table 1.5.1 we provide values of the efficiencies for various values of contamination $\epsilon$ and with $\sigma_c = 3$. Note that when we have 10 percent contamination the efficiency is 1. This indicates that, for this distribution, the median and mean are equally effective. Finally, this example exhibits a distribution for which the median is superior to the mean as an estimate of the center. See Exercise 1.12.12 for other examples.
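The entries of Table 1.5.1 follow directly from (1.5.23) and (1.5.24); a short R sketch (ours) reproduces them with $\sigma_c = 3$:

eff.cn <- function(eps, sig.c = 3) {
  f0 <- (1 - eps) * dnorm(0) + eps * dnorm(0) / sig.c   # f_eps(0)
  4 * f0^2 * (1 + eps * (sig.c^2 - 1))                  # 4 f^2(0) sigma_f^2
}
round(eff.cn(c(0, .03, .05, .10, .15)), 3)
# 0.637 0.758 0.833 1.000 1.134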
1.5.4 Asymptotic Power and Efficiency Results for the Test Based on $S(\theta)$

Consider the location model, (1.2.1), and assume that we have chosen a norm to fit the model with gradient function $S(\theta)$. Consider the gradient test (1.5.2) of the hypotheses (1.5.1). In Section 1.5.1, we showed that the power function of this test is nondecreasing with upper limit one and that it is typically resolving. Further, we showed that for a fixed alternative, the test is consistent. Thus the power will tend to one as the sample size increases.
Table 1.5.1: Efficiencies of the median relative to the mean for contaminated normal models.

 $\epsilon$   $e(\tilde{X}, \bar{X})$
 .00     .637
 .03     .758
 .05     .833
 .10    1.000
 .15    1.134
To offset this effect, we will let the alternative converge to the null value at a rate that stabilizes the power away from one. This will enable us to compare two tests along the same alternative sequence. Consider the null hypothesis $H_0: \theta = 0$ versus $H_{An}: \theta = \theta_n$, where $\theta_n = \theta^*/\sqrt{n}$ and $\theta^* > 0$. Recall that the asymptotic size $\alpha$ test based on $\bar{S}(0)$ rejects $H_0$ if $\sqrt{n}\,\bar{S}(0)/\sigma(0) \ge z_\alpha$, where $1 - \Phi(z_\alpha) = \alpha$.
The following theorem is called the asymptotic power lemma. Its proof follows immediately from expression (1.5.13).

Theorem 1.5.8. Assume that $S(0)$ is Pitman Regular with efficacy $c$. Then the asymptotic local power along the sequence $\theta_n = \theta^*/\sqrt{n}$ is
$$ \gamma_S(\theta_n) = P_{\theta_n}\!\left[\sqrt{n}\,\bar{S}(0)/\sigma(0) \ge z_\alpha\right] = P_0\!\left[\sqrt{n}\,\bar{S}(-\theta_n)/\sigma(0) \ge z_\alpha\right] \to 1 - \Phi(z_\alpha - \theta^* c) , $$
as $n \to \infty$.
Note that larger values of the efficacy imply larger values of the asymptotic local power.

Definition 1.5.5. The Pitman asymptotic relative efficiency of one test relative to another is defined to be $e(S_1, S_2) = c_1^2/c_2^2$.

Note that this is the same formula as the efficiency of one estimate relative to another given in Definition 1.5.4. Therefore, the efficiency results discussed in Example 1.5.4 between the $L_1$ and $L_2$ estimates apply for the sign and t tests also. Hence, we have an example in which the simple sign test is asymptotically more powerful than the t-test.
We can also develop a sample size interpretation for the asymptotic power. Suppose we specify a power $\gamma < 1$. Further, let $z_\gamma$ be defined by $1 - \Phi(z_\gamma) = \gamma$. Then $1 - \Phi(z_\alpha - c\,n^{1/2}\theta_n) = 1 - \Phi(z_\gamma)$ and $z_\alpha - c\,n^{1/2}\theta_n = z_\gamma$. Solving for $n$ yields
$$ n \doteq \frac{(z_\alpha - z_\gamma)^2}{c^2\,\theta_n^2} . \qquad (1.5.25) $$
Typically we take $\theta_n = k_n \sigma$ with $k_n$ small. Now if $S_1(0)$ and $S_2(0)$ are two Pitman Regular asymptotically size $\alpha$ tests, then the ratio of sample sizes required to achieve the same asymptotic power along the same sequence of alternatives is given by the approximation $n_2/n_1 \doteq c_1^2/c_2^2$. This provides additional motivation for the above definition of Pitman efficiency of two tests. The initial development of asymptotic efficiency was done by Pitman (1948) in an unpublished manuscript and later published by Noether (1955).
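For illustration (our sketch), (1.5.25) gives the approximate sample size for the sign test, whose efficacy at the normal is $c = 2f(0)$, to detect $\theta = 0.5$ with power .90 at $\alpha = .05$:

z.alpha <- qnorm(0.95)                 # upper .05 critical value
z.gamma <- qnorm(0.10)                 # 1 - Phi(z.gamma) = .90
c.sign  <- 2 * dnorm(0)                # sign test efficacy, normal errors
ceiling((z.alpha - z.gamma)^2 / (c.sign * 0.5)^2)   # about 54 observations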
1.5.5 Efficiency Results for Confidence Intervals Based on $S(\theta)$

In this part we consider the length of the confidence interval as a measure of its efficiency. Suppose that we specify $\gamma = 1 - \alpha$ for the confidence coefficient. Then let $z_{\alpha/2}$ be defined by $1 - \Phi(z_{\alpha/2}) = \alpha/2$. Again we suppose throughout the discussion that the estimating functions are Pitman Regular. Then the endpoints $\hat{\theta}_L$ and $\hat{\theta}_U$ of the $100\gamma$ percent confidence interval are given asymptotically by
$$ \sqrt{n}\,\frac{\bar{S}(\hat{\theta}_L)}{\sigma(0)} = z_{\alpha/2} \ \text{ and } \ \sqrt{n}\,\frac{\bar{S}(\hat{\theta}_U)}{\sigma(0)} = -z_{\alpha/2} ; \qquad (1.5.26) $$
see (1.3.10) for the exact versions of the endpoints.
The next theorem provides the asymptotic behavior of the length of this interval and, further, it shows that the standardized length of the confidence interval is a consistent estimate of the asymptotic standard deviation of $\sqrt{n}\,\hat{\theta}$.

Theorem 1.5.9. Suppose $S(\theta)$ is a Pitman Regular estimating function with efficacy $c$. Let $L$ be the length of the corresponding confidence interval. Then
$$ \frac{\sqrt{n}\,L}{2 z_{\alpha/2}} \xrightarrow{P} \frac{1}{c} . $$

Proof: Using the same argument as in Lemma 1.5.1, we can show that $\hat{\theta}_L$ and $\hat{\theta}_U$ are bounded in probability when multiplied by $\sqrt{n}$. Hence, the above estimating equations can be linearized to obtain, for example:
$$ z_{\alpha/2} = \sqrt{n}\,\bar{S}(\hat{\theta}_L)/\sigma(0) = \sqrt{n}\,\bar{S}(0)/\sigma(0) - c\,\sqrt{n}\,\hat{\theta}_L + o_P(1) . $$
This can then be solved to find:
$$ \sqrt{n}\,\hat{\theta}_L = \sqrt{n}\,\bar{S}(0)/[c\,\sigma(0)] - z_{\alpha/2}/c + o_P(1) . $$
When this is also done for $\hat{\theta}_U$ and the difference is taken, we have:
$$ n^{1/2}\,(\hat{\theta}_U - \hat{\theta}_L) = 2 z_{\alpha/2}/c + o_P(1) , $$
which concludes the argument.
From Theorem 1.5.7, $\hat{\theta}$ has an approximate normal distribution with variance $c^{-2}/n$. So by Theorem 1.5.9, a consistent estimate of the standard error of $\hat{\theta}$ is
$$ SE(\hat{\theta}) = \frac{\sqrt{n}\,L}{2 z_{\alpha/2}} \cdot \frac{1}{\sqrt{n}} = \frac{L}{2 z_{\alpha/2}} . \qquad (1.5.27) $$
If the ratio of squared asymptotic lengths is used as a measure of efficiency, then the efficiency of one confidence interval relative to another is again the ratio of the squares of the efficacies.
The discussion of the properties of estimation, testing, and confidence interval construction shows that, asymptotically at least, the relative merit of a procedure is measured by its efficacy. This measure is the slope of the linear approximation of the standardized estimating function that determines these procedures. In the comparison of $L_1$ and $L_2$ methods, we have seen that the efficiency $e(L_1, L_2) = 4\,\sigma_f^2\,f^2(0)$. There are other types of asymptotic efficiency that have been studied in the literature, along with finite sample versions of these asymptotic efficiencies. The conclusions drawn from these other efficiencies are consistent with the picture presented here. Finally, conclusions of simulation studies have also been consistent with the material presented here. Hence, we will not discuss these other measures; see Section 2.6 of Hettmansperger (1984a) for further references.
Example 1.5.5. Estimation of the Standard Error of the Sample Median

Recall that the sample median, when properly standardized, has a limiting normal distribution. Suppose we have a sample of size $n$ from $H(x) = F(x - \theta)$ where $\theta$ is the unknown median. From Theorem 1.5.7, we know that the approximating distribution for $\hat{\theta}$, the sample median, is normal with mean $\theta$ and variance $1/[4 n h^2(\theta)]$. We refer to this variance as the asymptotic variance. This normal distribution can be used to approximate probabilities concerning the sample median. When the underlying form of the distribution $H$ is unknown, we must estimate this asymptotic variance. Theorem 1.5.9 provides one key to the estimation of the asymptotic variance. The square root of the asymptotic variance is sometimes called the asymptotic standard error of the sample median. We will discuss the estimation of this standard error rather than the asymptotic variance.

As a simple example, in expression (1.5.27) take $\alpha = .05$, $z_{\alpha/2} = 2$, and $k = n/2 - n^{1/2}$; then we have the following consistent estimate of the asymptotic standard error of the median:
$$ SE(\mathrm{median}) \doteq [X_{(n/2 + n^{1/2})} - X_{(n/2 - n^{1/2})}]/4 . \qquad (1.5.28) $$
This simple estimate of the asymptotic standard error is based on the length of the 95% confidence interval for the median. Sheather (1987) shows that the estimate can be improved by using the interpolated confidence intervals discussed in Section 1.10. Of course, other confidence intervals with different confidence coefficients can be used also. We recommend using 90% or 95%; again, see McKean and Schrader (1984) and Sheather (1987). This SE is computed by our R function onesampsgn for general $\alpha$. The default value of $\alpha$ is set at 0.05.
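A direct transcription of (1.5.28) is only a few lines of R (our sketch; onesampsgn computes a refined version, and the rounding of the order statistic indices below is our choice):

se.median <- function(x) {
  n <- length(x); x <- sort(x)
  lo <- max(1, floor(n/2 - sqrt(n)))     # lower order statistic index
  hi <- min(n, ceiling(n/2 + sqrt(n)))   # upper order statistic index
  (x[hi] - x[lo]) / 4
}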
There are other approaches to the estimation of this standard error. For example, we could estimate the density $h(x)$ directly and then use $h_n(\hat{\theta})$, where $h_n$ is the density estimate. Another possibility is to estimate the finite sample standard error of the sample median directly. Sheather (1987) surveys these approaches. We will discuss one further possibility here, namely, the bootstrap. The bootstrap has gained wide attention recently because of its versatility in estimation and testing in nonstandard situations. See Efron and Tibshirani (1993) for a very readable account of the bootstrap.
If we know the underlying distribution $H(x)$, then we could estimate the standard error of the median by repeatedly drawing samples with a computer from the distribution $H$. If we

Table 1.5.2: Generated N(0,1) variates (placed in order)
-1.79756 -1.66132 -1.46531 -1.45333 -1.21163 -0.92866 -0.86812
-0.84697 -0.81584 -0.78912 -0.68127 -0.37479 -0.33046 -0.22897
-0.02502 -0.00186 0.09666 0.13316 0.17747 0.31737 0.33125
0.80905 0.88860 0.90606 0.99640 1.26032 1.46174 1.52549
1.60306 1.90116

have $B$ samples from $H$ and have computed and stored the $B$ values of the sample median, then our estimate of the standard error of the median is simply the sample standard deviation of these $B$ values. When $H$ is unknown we replace it by $H_n$, the empirical distribution function, and proceed with the simulation. Later in the chapter we will encounter an example where we want to compute a bootstrap p-value for a test. The bootstrap approach based on $H_n$ is called the nonparametric bootstrap since nothing is assumed about the form of the underlying distribution $H$. In another version, called the parametric bootstrap, we suppose that we know the form of the underlying distribution $H$ but there are some unknown parameters, such as the mean and variance. We use the sample to estimate these unknown parameters, insert the values into $H$, and use this distribution to draw the $B$ samples. In this book we will be concerned mainly with the nonparametric bootstrap, and we will use the generic term bootstrap to refer to this approach. In either case, ready access to high speed computing makes this method appealing. The following example illustrates the computations.
Example 1.5.6. Generated Data

Using Minitab, the 30 data points in Table 1.5.2 were generated from a normal distribution with mean 0 and variance 1. Thus, we know that the asymptotic standard error should be about $1/[30^{1/2} \cdot 2 f(0)] = 0.23$. We will use this to check what happens if we try to estimate the standard error from the data.

Using expression (1.3.16), the 95% confidence interval for the median is $(-0.789, 0.331)$. Hence, the length of confidence interval estimate, given in expression (1.5.28), is $(0.331 + 0.789)/4 = 0.28$. A simple R function was written to bootstrap the sample; see Exercise 1.12.7. Using this function, we obtained 1000 bootstrap samples and the resulting standard deviation of the 1000 bootstrap medians was 0.27. For this instance, the bootstrap procedure essentially agrees with the length of confidence interval estimate.
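A minimal version of such a bootstrap function (ours; Exercise 1.12.7 asks the reader to write one) is:

boot.se.median <- function(x, B = 1000) {
  sd(replicate(B, median(sample(x, replace = TRUE))))
}

Applied to the data in Table 1.5.2, it returns a value near the 0.27 reported above, varying slightly with the random samples drawn.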
Note that, from the data, the sample mean is 0.03575 and the sample standard deviation is 1.04769. If we assume the underlying distribution $H$ is normal with unknown mean and variance, we would use the parametric bootstrap. Hence, instead of sampling from the empirical distribution function, we want to sample from a normal distribution with mean 0.03575 and standard deviation 1.04769. Using R (see Exercise 1.12.7), we obtained 1000 parametric bootstrapped samples. The sample standard deviation of the resulting medians was 0.23, just the value we would expect. You should not expect to get the precise value every time you bootstrap, either parametrically or nonparametrically. It is, however, a very versatile method for estimating such quantities as standard errors of estimates and p-values of tests.
An unusual aspect of this example is that the bootstrap distribution of the sample median can be found in closed form and does not have to be simulated as described above. The variance of the sample median computed from the bootstrap distribution can then be found. The result is another estimate of the variance of the sample median. This was discovered independently by Maritz and Jarrett (1978) and Efron (1979). We do not pursue this development here because in most cases we must simulate the bootstrap distribution, and that is where the real strength of the bootstrap approach lies. For an interesting comparison of the various estimates of the variance of the sample median see McKean and Schrader (1984).
1.6 Robustness Properties of Norm-Based Inference

We have just considered the statistical properties of the inference procedures. We have looked at ideas such as efficiency and power. We now turn to stability or robustness properties. By this we mean how the inference procedures are affected by outliers or corruption of portions of the data. Ideally, we would like procedures (tests and estimates) which do not respond too quickly to a single outlying value when it is introduced into the sample. Further, we would not like procedures that can be changed by arbitrary amounts by corrupting a small amount of the data. Response to outliers is measured by the influence curve and response to data corruption is measured by the breakdown value. We will introduce finite sample versions of these concepts. They are easy to work with and, in the limit, they generally equal the more abstract versions based on the study of statistical functionals. We consider first the robustness properties of the estimates and secondly those of the tests. As in the last section, the discussion will be general, but the $L_1$ and $L_2$ procedures will be discussed as we proceed. The robustness properties of the procedures based on the weighted $L_1$ norm will be covered in Sections 1.7 and 1.8. See Section A.5 of the Appendix for a development based on functionals.
1.6.1 Robustness Properties of $\hat{\theta}$

We begin with the definition of breakdown for the estimator $\hat{\theta}$.

Definition 1.6.1. (Estimation Breakdown) Let $x = (x_1, \ldots, x_n)$ represent a realization of a sample and let
$$ x^{(m)} = (x_1^*, \ldots, x_m^*, x_{m+1}, \ldots, x_n) $$
represent the corruption of any $m$ of the $n$ observations. We define the bias of an estimator $\hat{\theta}$ to be $\mathrm{bias}(m; \hat{\theta}, x) = \sup |\hat{\theta}(x^{(m)}) - \hat{\theta}(x)|$, where the sup is taken over all possible corrupted samples $x^{(m)}$. Note that we change only $x_1^*, \ldots, x_m^*$ while $x_{m+1}, \ldots, x_n$ are fixed at their original values. If the bias is infinite, we say the estimate has broken down, and the finite sample breakdown value is given by
$$ \epsilon_n^* = \min\{ m/n : \mathrm{bias}(m; \hat{\theta}, x) = \infty \} . \qquad (1.6.1) $$
This approach to breakdown is called replacement breakdown because observations are replaced by corrupted values; see Donoho and Huber (1983) for more discussion of this approach. Often there exists an integer $m$ such that $x_{(m)} \le \hat{\theta} \le x_{(n-m+1)}$ and either $\hat{\theta}$ tends to $-\infty$ as $x_{(m)}$ tends to $-\infty$ or $\hat{\theta}$ tends to $+\infty$ as $x_{(n-m+1)}$ tends to $+\infty$. If $m^*$ is the smallest such integer, then $\epsilon_n^* = m^*/n$. Hodges (1967) was the first to introduce these ideas. To remove the effects of sample size, the limit, when it exists, can be computed. In this case we call $\epsilon^* = \lim \epsilon_n^*$ the asymptotic breakdown value.
Example 1.6.1. Breakdown Values for the $L_1$ and $L_2$ Estimates

The $L_1$ estimate is the sample median. If the sample size is $n = 2k$, then it is easy to see that when $x_{(k)}$ tends to $-\infty$, the median also tends to $-\infty$. Hence, the breakdown value of the sample median is $k/n$, which tends to .5. By a similar argument, when the sample size is $n = 2k + 1$, the breakdown value is $(k+1)/n$ and it also tends to .5 as the sample size increases. Hence, we say that the sample median is a 50% breakdown estimate. The $L_2$ estimate is the sample mean. A similar analysis shows that the breakdown value is $1/n$, which tends to zero. Hence, we say the sample mean is a zero breakdown estimate. This sharply contrasts the two estimates since we see that the median is the most resistant estimate and the sample mean is the least resistant estimate. In Exercise 1.12.13, the reader is asked to show that the pseudo-median induced by the signed-rank norm, (1.3.25), has breakdown .29.
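A two-line demonstration (ours) of this contrast: corrupt a single observation and the mean follows it, while the median moves only to an adjacent order statistic.

x <- c(-1.2, -0.4, 0.1, 0.3, 0.8)
x.bad <- x; x.bad[5] <- 1e6            # one wild outlier
c(mean(x), mean(x.bad))                # -0.08 versus about 200000
c(median(x), median(x.bad))            # 0.1 versus 0.3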
We have just considered the effect of corrupting some of the observations. The estimate breaks down if we can force the estimate to change by an arbitrary amount by changing the observations over which we have control. Another important concept of stability entails measuring the effect of the introduction of a single outlier. An estimate is stable or resistant if it does not change by a large amount when the outlier is introduced. In particular, we want the change to be bounded no matter what the value of the outlier.

Suppose we have a sample of observations $x_1, \ldots, x_n$ from a distribution centered at 0 and an estimate $\hat{\theta}_n$ based on these observations. By Pitman Regularity, Definition 1.5.3, and Theorem 1.5.7, we have
$$ n^{1/2}\,\hat{\theta}_n = c^{-1}\,n^{-1/2}\,S(0)/\sigma(0) + o_P(1) , \qquad (1.6.2) $$
provided the true parameter is 0. Further, we often have a representation of $S(0)$ as a sum of independent random variables. We may have to make a projection of $S(0)$ to achieve this; see the next chapter for examples of projections. In any case, we then have the following representation
$$ c^{-1}\,n^{-1/2}\,S(0)/\sigma(0) = n^{-1/2} \sum_{i=1}^n \Omega(x_i) + o_P(1) , \qquad (1.6.3) $$
where $\Omega(\cdot)$ is the function needed in the representation.
When we combine the above two statements we have
$$ n^{1/2}\,\hat{\theta}_n = n^{-1/2} \sum_{i=1}^n \Omega(x_i) + o_P(1) . \qquad (1.6.4) $$
Recall that the distribution that we are sampling is assumed to be centered at 0. The difference $(\hat{\theta}_n - 0)$ is approximated by the average of $n$ independent and identically distributed random variables. Since $\Omega(x_i)$ represents the effect of the $i$th observation on $\hat{\theta}_n$, it is called the influence function.

The influence function approximates the rate of change of the estimate when an outlier is introduced. Let $x_{n+1} = x^*$ represent a new, outlying, observation. Since $\hat{\theta}_n$ should be roughly 0, we have
$$ (n+1)\,\hat{\theta}_{n+1} - (n+1)\,\hat{\theta}_n \doteq \Omega(x^*) $$
and
$$ \hat{\theta}_{n+1} - \hat{\theta}_n \doteq \frac{1}{n+1}\,\Omega(x^*) , \qquad (1.6.5) $$
and this reveals the differential character of the influence function. Hampel (1974) developed the influence function from the theory of von Mises differentiable functions. In Sections A.5 and A.5.2 of the Appendix, we use his formulation to derive several influence functions for later situations. Here, though, we will identify influence functions for the estimates through the approximations described above. We now illustrate this approach.
Example 1.6.2. Influence Function for the $L_1$ and $L_2$ Estimates

We will briefly describe the influence functions for the sample median and the sample mean, the $L_1$ and $L_2$ estimates. From Example 1.5.2 we have immediately that, for the sample median,
$$ n^{1/2}\,\hat{\theta}_n \doteq n^{-1/2} \sum_{i=1}^n \frac{\mathrm{sgn}(X_i)}{2 f(0)} $$
and
$$ \Omega(x) = \frac{\mathrm{sgn}(x)}{2 f(0)} . $$
Note that the influence function is bounded but not continuous. Hence, outlying observations cannot have an arbitrarily large effect on the estimate. It is this feature, along with the 50% breakdown property, that makes the sample median the prototype of resistant estimates. The sample mean, on the other hand, has an unbounded influence function. It is easy to see that $\Omega(x) = x$, linear and unbounded. Hence, a single large outlier is sufficient to carry the sample mean beyond any bound. The unbounded influence is connected to the 0 breakdown property. Hence, the $L_2$ estimate is the prototype of an estimate highly efficient at a specified model, the normal model in this case, but not resistant. This means that quite close to the model for which the estimate is optimal, the estimate may perform very poorly; recall Table 1.5.1.
1.6.2 Breakdown Properties of Tests

We now turn to the issue of breakdown in testing hypotheses. The problems are a bit different in this case since we typically want to move, by data corruption, a test statistic into or out of a critical region. It is not a matter of sending the statistic beyond any finite bound as it is in estimation breakdown.

Definition 1.6.2. Suppose that $V$ is a statistic for testing $H_0: \theta = 0$ versus $H_A: \theta > 0$ and we reject the null hypothesis when $V \ge k$, where $P_0(V \ge k) = \alpha$ determines $k$. The rejection breakdown of the test is defined by
$$ \epsilon_n^*(\mathrm{reject}) = \min\{ m/n : \inf_x \sup_{x^{(m)}} V \ge k \} , \qquad (1.6.6) $$
where the sup is taken over all possible corruptions of $m$ data points. Likewise the acceptance breakdown is defined to be
$$ \epsilon_n^*(\mathrm{accept}) = \min\{ m/n : \sup_x \inf_{x^{(m)}} V < k \} . \qquad (1.6.7) $$
Rejection breakdown is the smallest portion of the data that can be corrupted to guarantee that the test will reject. Acceptance breakdown is interpreted as the smallest portion of the data that must be corrupted to guarantee that the test statistic will not be in the critical region; i.e., the test is guaranteed to fail to reject the null hypothesis. We turn immediately to a comparison of the $L_1$ and $L_2$ tests.
Example 1.6.3. Rejection Breakdown of the $L_1$ Test

We first consider the one sided sign test for testing $H_0: \theta = 0$ versus $H_A: \theta > 0$. The asymptotically size $\alpha$ test rejects the null hypothesis when $n^{-1/2} S_1(0) \ge z_\alpha$, the upper $\alpha$ quantile from a standard normal distribution. It is easier to see exactly what happens if we convert the test to $S_1^+(0) = \sum I(X_i > 0) \ge n/2 + (n^{1/2} z_\alpha)/2$. Now each time we make an observation positive it makes $S_1^+(0)$ increase by one. Hence, if we wish to guarantee that the test will reject, we make $m$ observations positive where $m^* = [n/2 + (n^{1/2} z_\alpha)/2] + 1$, $[\cdot]$ the greatest integer function. Then the rejection breakdown is
$$ \epsilon_n^*(\mathrm{reject}) = m^*/n \doteq \frac{1}{2} + \frac{z_\alpha}{2 n^{1/2}} . $$
Likewise,
$$ \epsilon_n^*(\mathrm{accept}) \doteq \frac{1}{2} - \frac{z_\alpha}{2 n^{1/2}} . $$
Note that the rejection breakdown converges down to the estimation breakdown and the acceptance breakdown converges up to it.
We next turn to the one-sided Student's t-test. Acceptance breakdown for the t-test is simple. By making a single observation approach $-\infty$, the t statistic can be made negative; hence we can always guarantee acceptance with control of one observation. The rejection breakdown is more interesting. If we increase an observation, both the sample mean and the sample standard deviation increase. Hence, it is not at all clear what will happen to the t-statistic. In fact, it is not sufficient to increase a single observation in order to force the t-statistic to move into the critical region. We now show that the rejection breakdown for the t-statistic is:
$$ \epsilon_n^*(\mathrm{reject}) = \frac{t_\alpha^2}{n - 1 + t_\alpha^2} \to 0 , \ \text{ as } n \to \infty , $$
where $t_\alpha$ is the upper $\alpha$ quantile from a t-distribution with $n - 1$ degrees of freedom. The infimum part of the definition suggests that we set all observations at $-B < 0$ and then change $m$ observations to $M > 0$. The result is
$$ \bar{x} = \frac{mM - (n - m)B}{n} \ \text{ and } \ s^2 = \frac{m(n - m)(M + B)^2}{(n - 1)\,n} . $$
Putting these two quantities together we have
$$ n^{1/2}\,\frac{\bar{x}}{s} = \frac{m - (n - m)B/M}{1 + B/M}\left[\frac{n - 1}{m(n - m)}\right]^{1/2} \to \left[\frac{m(n - 1)}{n - m}\right]^{1/2} , $$
as $M \to \infty$. We now equate the limit to $t_\alpha$ and solve for $m$ to get $m = n\,t_\alpha^2/(n - 1 + t_\alpha^2)$ (actually we would take the greatest integer and add one). Then the rejection breakdown is $m$ divided by $n$ as stated. Table 1.6.1 compares rejection breakdown values for the sign and t-tests. We assume $\alpha = .05$ and the sample sizes are chosen so that the size of the sign test is quite close to .05. For further discussion, see Ylvisaker (1977).
These definitions of breakdown assume a worst case scenario. They assume that the test statistic is as far away from the critical region (for rejection breakdown) as possible. In practice, however, it may be the case that a test statistic is quite near the edge of the critical region and only one observation is needed to change the decision from fail-to-reject to reject. An alternative form of breakdown considers the average number of observations that must be corrupted, conditional on the test statistic being in the acceptance region, to force a rejection.

Let $M_R$ be the number of observations that must be corrupted to force a rejection; then $M_R$ is a random variable. The expected rejection breakdown is defined to be
$$ \mathrm{Exp}\,\epsilon_n^*(\mathrm{reject}) = E_{H_0}[M_R \mid M_R > 0]/n . \qquad (1.6.8) $$
Note that we condition on $M_R > 0$ since $M_R = 0$ is equivalent to a rejection. It is left as Exercise 1.12.14 to show that the expected breakdown can be computed with unconditional expectation as
$$ \mathrm{Exp}\,\epsilon_n^*(\mathrm{reject}) = E_{H_0}[M_R]/[n\,(1 - \alpha)] . \qquad (1.6.9) $$
In the following example we illustrate this computation on the sign test and show how it compares to the worst case breakdown introduced earlier.
Table 1.6.1: Rejection breakdown values for size $\alpha = .05$ tests.

  n     Sign     t
  10     .71    .27
  13     .70    .21
  18     .67    .15
  30     .63    .09
 100     .58    .03
 $\infty$   .50     0

Table 1.6.2: Comparison of expected breakdown and worst case breakdown for the size $\alpha = .05$ sign test.

  n    Exp $\epsilon_n^*$(reject)   $\epsilon_n^*$(reject)
  10        .27                       .71
  13        .24                       .70
  18        .20                       .67
  30        .16                       .63
 100        .08                       .58
 $\infty$    0                        .50
Example 1.6.4. Expected Rejection Breakdown of the Sign Test

Refer to Example 1.6.3. The one sided sign test rejects when $\sum I(X_i > 0) \ge n/2 + n^{1/2} z_\alpha/2$. Hence, given that we fail to reject the null hypothesis, we will need to change (corrupt) $n/2 + n^{1/2} z_\alpha/2 - \sum I(X_i > 0)$ negative observations into positive ones. This is precisely $M_R$, and $E[M_R] \doteq n^{1/2} z_\alpha/2$. It follows that $\mathrm{Exp}\,\epsilon_n^*(\mathrm{reject}) = z_\alpha/[2 n^{1/2}(1 - \alpha)] \to 0$ as $n \to \infty$, rather than the .5 which happens in the worst case breakdown. Table 1.6.2 compares the two types of rejection breakdown. This simple calculation clearly shows that even highly resistant tests such as the sign test may break down quite easily. This is contrary to what the worst case breakdown analysis would suggest. For additional reading on test breakdown see Coakley and Hettmansperger (1992). He, Simpson and Portnoy (1990) discuss asymptotic test breakdown.
1.7 Inference and the Wilcoxon Signed-Rank Norm

In this section we develop the statistical properties for the procedures based on the Wilcoxon signed-rank norm, (1.3.17), that was defined in Example 1.3.3 of Section 1.3. Recall that the norm and its associated gradient function are given in expressions (1.3.17) and (1.3.24), respectively. Recall for a sample $X_1, \ldots, X_n$ that the estimate of $\theta$ is the median of the Walsh averages given by (1.3.25). As in Section 1.3, our hypotheses of interest are
$$ H_0: \theta = 0 \ \text{ versus } \ H_A: \theta \ne 0 . \qquad (1.7.1) $$
The level $\alpha$ test associated with the signed-rank norm is
$$ \text{Reject } H_0 \text{ in favor of } H_A \text{ if } |T(0)| \ge c , \qquad (1.7.2) $$
where $c$ is such that $P_0[|T(0)| \ge c] = \alpha$. To complete the test we need to determine the null distribution of $T(0)$, which is given by Theorems 1.7.1 and 1.7.2.

In order to develop the statistical properties, in addition to (1.2.1), we assume that
$$ h(x) \text{ is symmetrically distributed about } \theta . \qquad (1.7.3) $$
We refer to this as the symmetric location model. Under symmetry, by Theorem 1.2.1, $T(H) = \theta$ for all location functionals $T$.
1.7.1 Null Distribution Theory of T(0)

In addition to expression (1.3.24), a third representation of $T(0)$ will be helpful in establishing its null distribution. Recall the definition of the anti-ranks, $D_1, \ldots, D_n$, given in expression (1.3.19). Using these anti-ranks, we can write
$$ T(0) = \sum R(|X_i|)\,\mathrm{sgn}(X_i) = \sum j\,\mathrm{sgn}(X_{D_j}) = \sum j\,W_j , $$
where $W_j = \mathrm{sgn}(X_{D_j})$.
Lemma 1.7.1. Under $H_0$, $|X_1|, \ldots, |X_n|$ are independent of $\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n)$.

Proof: Since $X_1, \ldots, X_n$ is a random sample from $H(x)$, it suffices to show that $P[|X_i| \le x, \mathrm{sgn}(X_i) = 1] = P[|X_i| \le x]\,P[\mathrm{sgn}(X_i) = 1]$. But due to $H_0$ and the symmetry of $h(x)$, this follows from the following string of equalities:
$$ P[|X_i| \le x, \mathrm{sgn}(X_i) = 1] = P[0 < X_i \le x] = H(x) - \frac{1}{2} = [2H(x) - 1]\,\frac{1}{2} = P[|X_i| \le x]\,P[\mathrm{sgn}(X_i) = 1] . $$
Based on this lemma, the vector of ranks and, hence, the vector of anti-ranks $(D_1, \ldots, D_n)$, are independent of the vector $(\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n))$. Based on these facts, we can obtain the distribution of $(W_1, \ldots, W_n)$, which we summarize in the following lemma; see Exercise 1.12.15 for its proof.

Lemma 1.7.2. Under $H_0$ and the symmetry of $h(x)$, $W_1, \ldots, W_n$ are iid random variables with $P[W_i = 1] = P[W_i = -1] = 1/2$.
We can now easily derive the null distribution theory of $T(0)$, which we summarize in the following theorems. Details are given in Exercise 1.12.16.

Theorem 1.7.1. Under $H_0$ and the symmetry of $h(x)$,
$$ T(0) \text{ is distribution free and its distribution is symmetric;} \qquad (1.7.4) $$
$$ E_0[T(0)] = 0 ; \qquad (1.7.5) $$
$$ \mathrm{Var}_0(T(0)) = \frac{n(n+1)(2n+1)}{6} ; \qquad (1.7.6) $$
$$ \frac{T(0)}{\sqrt{\mathrm{Var}_0(T(0))}} \text{ has an asymptotically } N(0, 1) \text{ distribution.} \qquad (1.7.7) $$
The exact distribution of $T(0)$ cannot be found in closed form. We do, however, have the following recursion formula; see Exercise 1.12.17.

Theorem 1.7.2. Consider the version of the signed-rank test statistic given by $T^+$, (1.3.28). Let $p_n(k) = P[T^+ = k]$ for $k = 0, \ldots, \frac{n(n+1)}{2}$. Then
$$ p_n(k) = \frac{1}{2}\,[\,p_{n-1}(k) + p_{n-1}(k - n)\,] , \qquad (1.7.8) $$
where
$$ p_0(0) = 1 ;\ p_0(k) = 0 \text{ for } k \ne 0 ;\ \text{and } p_n(k) = 0 \text{ for } k < 0 . $$
Using this formula, algorithms can be developed which obtain the null distribution of the signed-rank test statistic. The moment generating function can also be inverted to find the null distribution; see Hettmansperger (1984a, Section 2.2). As discussed in Section 1.3.1, software is now available which computes critical values and p-values of the null distribution.
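For example, the recursion (1.7.8) takes only a few lines of R (our sketch); its output can be checked against R's built-in dsignrank:

wilcox.null <- function(n) {
  p <- 1                                  # p_0: point mass at k = 0
  for (m in 1:n) {
    q <- c(p, numeric(m))                 # support grows to 0, ..., m(m+1)/2
    q[(m+1):length(q)] <- q[(m+1):length(q)] + p   # the p_{m-1}(k - m) term
    p <- q / 2
  }
  p                                       # p[k+1] = P(T+ = k)
}
all.equal(wilcox.null(10), dsignrank(0:55, 10))    # TRUE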
Theorem 1.7.1 justifies the confidence interval for $\theta$ given in display (1.3.30); i.e., the $(1 - \alpha)100\%$ confidence interval given by $[W_{(k+1)}, W_{((n(n+1)/2) - k)})$, where $W_{(i)}$ denotes the $i$th ordered Walsh average and $P(T^+(0) \le k) = \alpha/2$. Based on (1.7.7), $k$ can be approximated as $k \approx n(n+1)/4 - .5 - z_{\alpha/2}\,[n(n+1)(2n+1)/24]^{1/2}$. As noted in Section 1.3.1, the computation of the estimate and confidence interval can be obtained by our R function onesampwil or the R intrinsic function wilcox.test.
1.7.2 Statistical Properties

From our earlier analysis of the statistical properties of the $L_1$ and $L_2$ methods, we see that Pitman Regularity is crucial. In particular, we need to compute the Pitman efficacy which determines the asymptotic variance of the estimate, the asymptotic local power of the test, and the asymptotic length of the confidence interval. In the following theorem we show that the weighted $L_1$ gradient function is Pitman Regular and determine the efficacy. Then we make some preliminary efficiency comparisons with the $L_1$ and $L_2$ methods.
Theorem 1.7.3. Suppose that $h$ is symmetric and that $\int h^2(x)\,dx < \infty$. Let
$$ \bar{T}(\theta) = \frac{2}{n(n+1)} \sum_{i \le j} \mathrm{sgn}\!\left(\frac{X_i + X_j}{2} - \theta\right) . $$
Then the conditions of Definition 1.5.3 are satisfied and, thus, $\bar{T}(\theta)$ is Pitman Regular. Moreover, the Pitman efficacy is given by
$$ c = \sqrt{12} \int_{-\infty}^{\infty} h^2(x)\,dx . \qquad (1.7.9) $$
Proof. Since we have the $L_1$ norm applied to the Walsh averages, the estimating function is a nonincreasing step function with steps at the Walsh averages. Hence, (1.5.7) holds. Next note that $h(-x) = h(x)$ and, hence,
$$ \mu(\theta) = E_\theta\,\bar{T}(0) = \frac{2}{n+1}\,E_\theta\,\mathrm{sgn}(X_1) + \frac{n-1}{n+1}\,E_\theta\!\left[\mathrm{sgn}\!\left(\frac{X_1 + X_2}{2}\right)\right] . $$
Now
$$ E_\theta\,\mathrm{sgn}(X_1) = \int \mathrm{sgn}(x + \theta)\,h(x)\,dx = 1 - 2H(-\theta) , $$
and
$$ E_\theta\,\mathrm{sgn}\!\left(\frac{X_1 + X_2}{2}\right) = \int\!\!\int \mathrm{sgn}[(x + y)/2 + \theta]\,h(x)h(y)\,dx\,dy = \int [1 - 2H(-2\theta - y)]\,h(y)\,dy . $$
Differentiate with respect to $\theta$ and set $\theta = 0$ to get
$$ \mu'(0) = \frac{2h(0)}{n+1} + \frac{4(n-1)}{n+1} \int_{-\infty}^{\infty} h^2(y)\,dy \to 4 \int h^2(y)\,dy . $$
The finiteness of the integral is sufficient to ensure that the derivative can be passed through the integral; see Hodges and Lehmann (1961) or Olshen (1967). Hence, (1.5.8) also holds.
We next establish Condition (1.5.9). Since
$$ \bar{T}(\theta) = \frac{2}{n(n+1)} \sum_{i=1}^n \mathrm{sgn}(X_i - \theta) + \frac{2}{n(n+1)} \sum_{i<j} \mathrm{sgn}\!\left(\frac{X_i + X_j}{2} - \theta\right) , $$
the first term is of smaller order and we need only consider the second term. Now, for $b > 0$, let
$$ V_n = \frac{2}{n(n+1)} \sum_{i<j} \left[\mathrm{sgn}\!\left(\frac{X_i + X_j}{2} - n^{-1/2} b\right) - \mathrm{sgn}\!\left(\frac{X_i + X_j}{2}\right)\right] = -\frac{4}{n(n+1)} \sum_{i<j} I\!\left(0 < \frac{X_i + X_j}{2} < n^{-1/2} b\right) . $$
Hence,
$$ n\,\mathrm{Var}(V_n) = \frac{16 n}{n^2 (n+1)^2}\,E \sum_{i<j} \sum_{s<t} (I_{ij} I_{st} - E I_{ij}\,E I_{st}) , $$
where $I_{ij} = I(0 < (X_i + X_j)/2 < n^{-1/2} b)$. This becomes
$$ n\,\mathrm{Var}(V_n) = \frac{16 n^2 (n-1)}{2 n^2 (n+1)^2}\,\mathrm{Var}(I_{12}) + \frac{16 n^2 (n-1)(n-2)}{2 n^2 (n+1)^2}\,[E I_{12} I_{13} - E I_{12}\,E I_{13}] . $$
The first term tends to zero since it behaves like $1/n$. In the second term, consider $|E I_{12} I_{13} - E I_{12}\,E I_{13}| \le E I_{12} + E^2 I_{12} = E I_{12}(1 + E I_{12})$. Now, as $n \to \infty$,
$$ E I_{12} = P\!\left(0 < \frac{X_i + X_j}{2} < n^{-1/2} b\right) = \int [H(2 n^{-1/2} b - x) - H(-x)]\,h(x)\,dx \to 0 . $$
Hence, by Theorem 1.5.6, Condition (1.5.9) is true. Finally, asymptotic normality of the null distribution is established in Theorem 1.7.1, which also yields $n\,\mathrm{Var}_0\,\bar{T}(0) \to 4/3 = \sigma^2(0)$. It follows that the Pitman efficacy is
$$ c = \frac{4 \int h^2(y)\,dy}{\sqrt{4/3}} = \sqrt{12} \int h^2(y)\,dy . $$
For future reference we display the asymptotic linearity result:
$$ \frac{T(\theta)}{\sqrt{n(n+1)(2n+1)/6}} = \frac{T(0)}{\sqrt{n(n+1)(2n+1)/6}} - \sqrt{12} \int_{-\infty}^{\infty} h^2(x)\,dx\,\sqrt{n}\,\theta + o_p(1) , \qquad (1.7.10) $$
for $\sqrt{n}\,|\theta| \le B$, where $B > 0$.

An immediate consequence of this theorem and Theorem 1.5.7 is that
$$ \sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{D} Z \sim N\!\left(0,\ 1\Big/\left[12\left(\int h^2(t)\,dt\right)^2\right]\right) , \qquad (1.7.11) $$
and we thus have the limiting distribution of the median of the Walsh averages. Exercise 1.12.20 shows that $\int h^2(t)\,dt < \infty$ when $h$ has finite Fisher information.
From our general discussion, a simple estimate of the standard error of the median of the Walsh averages is proportional to the length of a distribution free confidence interval. Consider the $(1 - \alpha)100\%$ confidence interval given by $[W_{(k+1)}, W_{((n(n+1)/2) - k)})$, where $W_{(i)}$ denotes the $i$th ordered Walsh average and $P(T^+(0) \le k) = \alpha/2$. Then by expression (1.5.27), a consistent estimate of the SE of the median of the Walsh averages (medWA) is
$$ SE(\mathrm{medWA}) = \frac{W_{((n(n+1)/2) - k)} - W_{(k+1)}}{2 z_{\alpha/2}} . \qquad (1.7.12) $$
Our R function onesampwil computes this standard error for general $\alpha$ (the default is set at 0.05). We will have more to say about this particular $c$ in the next chapter, where we will encounter it in the two-sample location model and later in the linear model, where a better estimator of this SE is presented.
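A direct computation of the estimate, the distribution free confidence interval, and the SE in (1.7.12) is sketched below (ours; onesampwil and wilcox.test automate this, and the qsignrank call gives a value of k with $P(T^+ \le k)$ approximately $\alpha/2$):

hl.est <- function(x, alpha = 0.05) {
  n <- length(x)
  ij <- which(upper.tri(diag(n), diag = TRUE), arr.ind = TRUE)
  walsh <- sort((x[ij[, 1]] + x[ij[, 2]]) / 2)  # the n(n+1)/2 Walsh averages
  k <- qsignrank(alpha/2, n) - 1                # P(T+ <= k) near alpha/2
  N <- length(walsh)
  list(estimate = median(walsh),
       conf.int = c(walsh[k + 1], walsh[N - k]),
       SE = (walsh[N - k] - walsh[k + 1]) / (2 * qnorm(1 - alpha/2)))
}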
From Example 1.5.3 and Definition 1.5.4, the asymptotic relative efficiency between the signed-rank Wilcoxon process and the $L_2$ process is given by
$$ e(\mathrm{Wilcoxon}, L_2) = 12\,\sigma_h^2 \left(\int h^2(x)\,dx\right)^2 , \qquad (1.7.13) $$
where $h$ is the underlying density with variance $\sigma_h^2$.

In the following example, we consider the contaminated normal distribution and then find the efficiency of the rank methods relative to the $L_1$ and $L_2$ methods.
Example 1.7.1. Asymptotic Relative Efficiency for Contaminated Normal Distributions

Let $f_\epsilon(x)$ denote the pdf of the contaminated normal distribution used in Example 1.5.4; the proportion of contamination is $\epsilon$ and the variance of the contaminated part is 9. A straightforward computation shows that
$$ \int f_\epsilon^2(y)\,dy = \frac{(1 - \epsilon)^2}{2\sqrt{\pi}} + \frac{\epsilon^2}{6\sqrt{\pi}} + \frac{2\,\epsilon(1 - \epsilon)}{\sqrt{20\pi}} , $$
and we use this in the formula for $c$ given above. The efficacies for the $L_1$ and $L_2$ methods are given in Example 1.5.4. We first consider the special case of $\epsilon = 0$, corresponding to an underlying normal distribution. In this case we have for the rank methods $c_R^2 = 12/(4\pi) = 3/\pi = .955$, for the $L_1$ methods $c_1^2 = 2/\pi = .637$, and for the $L_2$ methods $c_2^2 = 1$. We have already seen that the efficiency $e_{\mathrm{normal}}(L_1, L_2) = c_1^2/c_2^2 = .637$ from the first line of Table 1.5.1. We now have
$$ e_{\mathrm{normal}}(\mathrm{Wilcoxon}, L_2) = 3/\pi \doteq .955 \ \text{ and } \ e_{\mathrm{normal}}(\mathrm{Wilcoxon}, L_1) = 1.5 . \qquad (1.7.14) $$
The efficiency of the rank methods relative to the $L_2$ methods is extraordinary. It says that even at the distribution for which the t-test is uniformly most powerful, the Wilcoxon signed rank test is almost as efficient. This means that replacing the values of the observations by their ranks (retaining only the order information) does not affect the statistical properties of the test. This was considered highly nonintuitive in the 1950s, since nonparametric methods were thought of as quick and dirty. Now they must be considered highly efficient competitors of the optimal methods and, in addition, they are more robust than the optimal methods. This provides powerful motivation for the continued study of rank methods in other statistical models such as the two-sample location model and the linear model. The early work in the area of efficiency of rank methods is due largely to Lehmann and his students. See Lehmann and Hodges (1956, 1961) for two important early papers and Lehmann (1975, Appendix) for more discussion.

We complete this example with a table of efficiencies of the rank methods relative to the $L_1$ and $L_2$ methods for the contaminated normal model with $\sigma_c = 3$. Table 1.7.1 shows these
Table 1.7.1: Efficiencies of the Rank, $L_1$, and $L_2$ methods for the Contaminated Normal Distribution.

 $\epsilon$   $e(L_1, L_2)$   $e(R, L_1)$   $e(R, L_2)$
 .00      .637    1.500     .955
 .01      .678    1.488    1.009
 .03      .758    1.462    1.108
 .05      .833    1.436    1.196
 .10     1.000    1.373    1.373
 .15     1.134    1.320    1.497
efficiencies and extends Table 1.5.1. As $\epsilon$ increases, the weight in the tails of the distribution also increases. Note that the efficiencies of both the $L_1$ and rank methods relative to the $L_2$ methods increase with $\epsilon$. On the other hand, the efficiency of the rank methods relative to the $L_1$ methods decreases slightly. The rank methods are still more efficient; however, this illustrates the fact that the $L_1$ methods are good for heavy tailed distributions. The overall implication of this example is that the $L_2$ methods, such as the sample mean, the t test and confidence interval, are not particularly efficient once the underlying distribution departs from the normal distribution. Further, the rank methods, such as the Wilcoxon signed rank test, its confidence interval, and the median of the Walsh averages, are surprisingly efficient, even at the normal distribution. Note that the rank methods are more efficient than the $L_2$ methods even for 1% contamination.
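The entries of Table 1.7.1 can be reproduced from the efficacies (our sketch, using the expression for $\int f_\epsilon^2$ displayed above, with $\sigma_c = 3$):

are.cn <- function(eps) {
  int.f2 <- (1-eps)^2/(2*sqrt(pi)) + eps^2/(6*sqrt(pi)) +
            2*eps*(1-eps)/sqrt(20*pi)            # integral of f_eps^2
  c2.R  <- 12 * int.f2^2                          # squared Wilcoxon efficacy
  c2.L1 <- (2 * ((1-eps)*dnorm(0) + eps*dnorm(0)/3))^2
  c2.L2 <- 1 / (1 + 8*eps)                        # 1 / sigma_f^2
  round(c(e.L1.L2 = c2.L1/c2.L2, e.R.L1 = c2.R/c2.L1, e.R.L2 = c2.R/c2.L2), 3)
}
t(sapply(c(0, .01, .03, .05, .10, .15), are.cn))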
Finally, the following theorem shows that the Wilcoxon signed rank statistic never loses much efficiency relative to the t-statistic. Let $\mathcal{F}_s$ denote the family of distributions which have symmetric densities and finite Fisher information; see Exercise 1.12.20.

Theorem 1.7.4. Let $X_1, \ldots, X_n$ be a random sample from $H \in \mathcal{F}_s$. Then
$$ \inf_{\mathcal{F}_s} e(\mathrm{Wilcoxon}, L_2) = 0.864 . \qquad (1.7.15) $$
Proof: By (1.7.13), $e(\mathrm{Wilcoxon}, L_2) = 12\,\sigma_h^2\,(\int h^2(x)\,dx)^2$. If $\sigma_h^2 = \infty$ then $e(\mathrm{Wilcoxon}, L_2) > .864$; hence, we can restrict attention to $H \in \mathcal{F}_s$ such that $\sigma_h^2 < \infty$. As Exercise 1.12.21 indicates, $e(\mathrm{Wilcoxon}, L_2)$ is location and scale invariant, so we can further assume that $h$ is symmetric about 0 and $\sigma_h^2 = 1$. The problem, then, is to minimize $\int h^2$ subject to $\int h = \int x^2 h = 1$ and $\int x h = 0$. This is equivalent to minimizing
$$ \int h^2 + 2b \int x^2 h - 2 b a^2 \int h , \qquad (1.7.16) $$
where $a$ and $b$ are positive constants to be determined later. We now write (1.7.16) as
$$ \int \left[ h^2 + 2b(x^2 - a^2)h \right] = \int_{|x| \le a} \left[ h^2 + 2b(x^2 - a^2)h \right] + \int_{|x| > a} \left[ h^2 + 2b(x^2 - a^2)h \right] . \qquad (1.7.17) $$
First complete the square on the first term on the right side of (1.7.17) to get
$$ \int_{|x| \le a} \left[ h + b(x^2 - a^2) \right]^2 - \int_{|x| \le a} b^2 (x^2 - a^2)^2 . \qquad (1.7.18) $$
Now (1.7.17) is equal to the two terms of (1.7.18) plus the second term on the right side of (1.7.17). We can now write down the density that minimizes (1.7.16). If $|x| > a$, take $h(x) = 0$, since $x^2 > a^2$; and if $|x| \le a$, take $h(x) = b(a^2 - x^2)$, since the integral in the first term of (1.7.18) is nonnegative. We can now determine the values of $a$ and $b$ from the side conditions. From $\int h = 1$ we have
$$ \int_{-a}^{a} b(a^2 - x^2)\,dx = 1 , $$
which implies that $a^3 b = \frac{3}{4}$. Further, from $\int x^2 h = 1$ we have
$$ \int_{-a}^{a} x^2\,b(a^2 - x^2)\,dx = 1 , $$
from which $a^5 b = \frac{15}{4}$. Hence, solving for $a$ and $b$ yields $a = \sqrt{5}$ and $b = 3\sqrt{5}/100$. Now
$$ \int h^2 = \int_{-\sqrt{5}}^{\sqrt{5}} \left[ \frac{3\sqrt{5}}{100}\,(5 - x^2) \right]^2 dx = \frac{3\sqrt{5}}{25} , $$
which leads to the result,
$$ \inf_{\mathcal{F}_s} e(\mathrm{Wilcoxon}, L_2) = 12 \left( \frac{3\sqrt{5}}{25} \right)^2 = \frac{108}{125} \doteq 0.864 . $$
1.7.3 Robustness Properties

We complete this section with a discussion of the breakdown point of the estimate and test and a heuristic derivation of the influence function of the estimate. In Example 1.6.1 we discussed the breakdown of the sample median and mean. In those cases we saw that the median is the most resistant while the mean is the least resistant. In Exercise 1.12.13 you are asked to show that the breakdown point of the median of the Walsh averages, the R-estimate, is roughly .29. Our next result gives the influence function of $\hat{\theta}$.

Theorem 1.7.5. The influence function of $\hat{\theta} = \mathrm{med}_{i \le j}(x_i + x_j)/2$ is given by:
$$ \Omega(x) = \frac{H(x) - 1/2}{\int_{-\infty}^{\infty} h^2(t)\,dt} . $$
We sketch a derivation of this result here; a rigorous development is offered in Section A.5 of the Appendix. From Theorems 1.7.3 and 1.5.6 we have
$$ n^{1/2}\,\bar{T}(\theta)/\sigma(0) \doteq n^{1/2}\,\bar{T}(0)/\sigma(0) - c\,n^{1/2}\,\theta , $$
and hence
$$ \sqrt{n}\,\hat{\theta} \doteq n^{1/2}\,\bar{T}(0)/[c\,\sigma(0)] , $$
where $\sigma(0) = (4/3)^{1/2}$ and $c = (12)^{1/2} \int h^2(t)\,dt$. Make these substitutions to get
$$ \hat{\theta}_n \doteq \frac{1}{n(n+1) \cdot 2 \int h^2(t)\,dt} \sum_{i \le j} \mathrm{sgn}\!\left(\frac{X_i + X_j}{2}\right) . $$
Now introduce an outlier $x_{n+1} = x^*$ and take the difference between $\hat{\theta}_{n+1}$ and $\hat{\theta}_n$. The result is
$$ 2 \int h^2(t)\,dt\,\left[ (n+2)\,\hat{\theta}_{n+1} - n\,\hat{\theta}_n \right] \doteq \frac{1}{n+1} \sum_{i=1}^{n+1} \mathrm{sgn}\!\left(\frac{x_i + x^*}{2}\right) . $$
We can replace $n + 2$ and $n + 1$ by $n$ where convenient without affecting the asymptotics. Using the symmetry of the density of $H$, we have
$$ \frac{1}{n} \sum_{i=1}^{n} \mathrm{sgn}\!\left(\frac{x_i + x^*}{2}\right) \doteq 1 - 2H_n(-x^*) \to 1 - 2H(-x^*) = 2H(x^*) - 1 . $$
It now follows that $(n+1)(\hat{\theta}_{n+1} - \hat{\theta}_n) \doteq \Omega(x^*)$, as given in the statement of the theorem; see the discussion of the influence function in Section 1.6.
Note that we have a bounded inuence function since the cdf H is a bounded function.
Further, it is continuous, unlike the inuence function of the median. Finally, as an additional
check, note that E
2
(X) = 1/12[
_
h
2
(t)dt]
2
= 1/c
2
, the asymptotic variance of n
1/2

.
Let

c
= med
i,j
(X
i
cX
j
)/(1 c) for 1 c < 1 . This extension of the Hodges-
Lehmann estimate, ( 1.3.25), has some very interesting robustness properties for c > 0. The
inuence function of

c
is not only bounded but also redescending, similar to the most robust
M-estimates. In addition,

c
has 50% breakdown. For a complete discussion of this estimate
see Maritz, Wu and Staudte (1977) and Brown and Hettmansperger (1994).
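The following R lines give a hedged sketch of this generalized estimate; ghl is a hypothetical name, not an RBR function, and the median here runs over all $n^2$ ordered pairs $(i,j)$, so at $c = -1$ it agrees with the median of the Walsh averages up to the treatment of the diagonal pairs.

ghl = function(x,cc){
   # med_{i,j} (x_i - cc*x_j)/(1 - cc); cc = -1 essentially gives the
   # Hodges-Lehmann estimate, the median of the Walsh averages
   median(outer(x,x,function(a,b){(a - cc*b)/(1 - cc)}))
}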
In the next theorem we develop the test breakdown for the Wilcoxon signed-rank test.

Theorem 1.7.6. The rejection breakdown, Definition 1.6.2, for the Wilcoxon signed-rank test is
$$\epsilon^*_n \doteq 1 - \left[ \frac{1}{2} - \frac{z_\alpha}{(3n)^{1/2}} \right]^{1/2} \rightarrow 1 - \frac{1}{2^{1/2}} \doteq .29 .$$

Table 1.7.2: Rejection breakdown values for size $\alpha = .05$ tests.

        n     Sign      t    Signed-rank Wilcoxon
       10      .71    .27     .57
       13      .70    .21     .53
       18      .67    .15     .48
       30      .63    .09     .43
      100      .58    .03     .37
  $\infty$     .50    0       .29
Proof. Consider the form $T^+(0) = \sum\sum I[(x_i + x_j)/2 > 0]$, where the double sum is over all $i \le j$. The asymptotically size $\alpha$ test rejects $H_0: \theta = 0$ in favor of $H_A: \theta > 0$ when $T^+(0) \ge c \doteq n(n+1)/4 + z_\alpha [n(n+1)(2n+1)/24]^{1/2}$. Now we must guarantee that $T^+(0)$ is in the critical region. This requires at least $c$ positive Walsh averages. Let $x_{(1)} \le \cdots \le x_{(n)}$ be the ordered observations. Then contamination of $x_{(n)}$ results in $n$ contaminated Walsh averages, namely those Walsh averages that include $x_{(n)}$. Contamination of $x_{(n-1)}$ yields $n - 1$ additional contaminated Walsh averages. When we proceed in this way, contamination of the $b$ ordered values $x_{(n)}, \ldots, x_{(n-b+1)}$ yields $n + (n-1) + \cdots + (n-b+1) = [n(n+1)/2] - [(n-b)(n-b+1)/2]$ contaminated Walsh averages. We now set $[n(n+1)/2] - [(n-b)(n-b+1)/2] \doteq c$ and solve the resulting quadratic for $b$. We must solve $b^2 - (2n+1)b + 2c \doteq 0$. The appropriate root in this case is
$$b \doteq \frac{2n + 1 - [(2n+1)^2 - 8c]^{1/2}}{2} .$$
Substituting the approximate critical value for $c$, dividing by $n$, and ignoring higher order terms leads to the stated result.
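As a hedged numerical check on this argument, the following R sketch evaluates the approximate breakdown $b/n$ using the normal-approximation critical value; because Table 1.7.2 is based on exact critical values, the small-$n$ entries differ slightly.

rejbreak = function(n,alpha=.05){
   # approximate one-sided critical value of the signed-rank Wilcoxon test
   cv = n*(n+1)/4 + qnorm(1-alpha)*sqrt(n*(n+1)*(2*n+1)/24)
   # appropriate root of b^2 - (2n+1)b + 2c = 0 from the proof above
   b = (2*n+1 - sqrt((2*n+1)^2 - 8*cv))/2
   b/n
}
round(sapply(c(10,13,18,30,100),rejbreak),2)   # tends to .29 as n grows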
Table 1.7.2 displays the finite rejection breakdowns of the Wilcoxon signed-rank test over the same sample sizes as the rejection breakdowns of the sign test given in Table 1.6.1. For convenience we have reproduced the results for the sign and $t$ tests, also. The rejection breakdown for the Wilcoxon test converges from above to the estimation breakdown of .29. The Wilcoxon test is more resistant than the $t$-test but not as resistant as the simple sign test. It is interesting to note that, from the discussion of efficiency, it is clear that we can now achieve high efficiency and not pay the price in lack of robustness. The rank-based methods seem to be a very attractive alternative to the highly resistant but relatively inefficient (at the normal model) $L_1$ methods and the highly efficient (at the normal model) but nonrobust $L_2$ methods.
1.8 Inference Based on General Signed-Rank Norms

In this section, we develop properties for a generalized signed-rank process. It includes the $L_1$ and the weighted $L_1$ as special cases. The development is similar to that of the weighted $L_1$ so a brief sketch suffices. For $x \in R^n$, consider the function,
$$\|x\|_{\varphi^+} = \sum_{i=1}^{n} a^+(R|x_i|)|x_i| , \qquad (1.8.1)$$
where the scores $a^+(i)$ are generated as $a^+(i) = \varphi^+(i/(n+1))$ for a positive valued, nondecreasing, square-integrable function $\varphi^+(u)$ defined on the interval $(0,1)$. The proof that $\|\cdot\|_{\varphi^+}$ is a norm on $R^n$ follows in the same way as in the weighted $L_1$ case; see the proof of Theorem 1.3.2 and Exercise 1.12.22. The gradient function associated with this norm is
$$T_{\varphi^+}(\theta) = \sum_{i=1}^{n} a^+(R|X_i - \theta|)\,\mathrm{sgn}(X_i - \theta) . \qquad (1.8.2)$$
Note that it reduces to the $L_1$ norm if $\varphi^+(u) \equiv 1$ and the weighted $L_1$, Wilcoxon signed-rank, norm if $\varphi^+(u) = u$. A family of simple score functions between the weighted $L_1$ and the $L_1$ are of the form
$$\varphi_c^+(u) = \left\{ \begin{array}{ll} u & 0 < u < c \\ c & c \le u < 1 \end{array} \right. , \qquad (1.8.3)$$
where the parameter $c$ is between 0 and 1. These scores were proposed by Policello and Hettmansperger (1976); see, also, Hogg (1974). The frequently used normal scores are generated by the score function,
$$\varphi_\Phi^+(u) = \Phi^{-1}\left( \frac{u + 1}{2} \right) , \qquad (1.8.4)$$
where $\Phi$ is the standard normal distribution function. Note that $\varphi_\Phi^+(u)$ is the inverse cdf (or quantile function) of the absolute value of a standard normal random variable. The normal scores were originally proposed by Fraser (1957).
For the location model (1.2.1), the estimate of $\theta$ based on the norm (1.8.1) is the value of $\theta$ which minimizes the distance $\|X - \theta 1\|_{\varphi^+}$ or, equivalently, solves the equation
$$T_{\varphi^+}(\hat{\theta}) \doteq 0 . \qquad (1.8.5)$$
A simple tracing algorithm suffices to compute $\hat{\theta}$. As Exercise 1.12.18 shows, $T_{\varphi^+}(\theta)$ is a decreasing step function of $\theta$ which steps down only at the Walsh averages. So first sort the Walsh averages. Next select a starting value $\hat{\theta}^{(0)}$, such as the median of the Walsh averages, which corresponds to the signed-rank Wilcoxon scores. Then proceed through the sorted Walsh averages left or right, depending on whether $T_{\varphi^+}(\hat{\theta}^{(0)})$ is negative or positive. The algorithm continues until the sign of $T_{\varphi^+}(\theta)$ changes. This is the algorithm behind our RBR function onesampr which solves equation (1.8.5) for general score functions; see Exercise 1.12.33. Also, the linear searches discussed in Chapter 3, Section 3.7.3, can be used to compute $\hat{\theta}$.
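The following R lines are a minimal sketch of this tracing algorithm; the names tphiplus and trace1samp are hypothetical, the default scores are the Wilcoxon scores $\varphi^+(u) = u$, and in practice the RBR function onesampr should be used.

tphiplus = function(theta,x,phi=function(u){u}){
   # gradient T_phi+(theta) = sum a+(R|x_i - theta|) sgn(x_i - theta)
   n = length(x)
   sum(phi(rank(abs(x - theta))/(n+1))*sign(x - theta))
}
trace1samp = function(x,phi=function(u){u}){
   # sorted Walsh averages (x_i + x_j)/2 over i <= j
   n = length(x)
   walsh = sort(outer(x,x,"+")[outer(1:n,1:n,"<=")]/2)
   i = floor(length(walsh)/2)   # start near the median of the Walsh averages
   # step left or right through the sorted Walsh averages until the sign changes
   while(i > 1 && tphiplus(walsh[i],x,phi) < 0){ i = i - 1 }
   while(i < length(walsh) && tphiplus(walsh[i],x,phi) > 0){ i = i + 1 }
   walsh[i]
}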
To determine the corresponding functional, note that we can write $R|X_i - \theta| = \#_j\{\theta - |X_i - \theta| \le X_j \le |X_i - \theta| + \theta\}$. Let $H_n$ denote the empirical distribution function of the sample $X_1, \ldots, X_n$ and let $H_n^-$ denote the left limit of $H_n$. We can then write the defining equation of $\hat{\theta}$ as,
$$\int \varphi^+\left( H_n(|x - \theta| + \theta) - H_n^-(\theta - |x - \theta|) \right) \mathrm{sgn}(x - \theta)\,dH_n(x) = 0 ,$$
which converges to
$$\gamma(\theta) = \int \varphi^+\left( H(|x - \theta| + \theta) - H(\theta - |x - \theta|) \right) \mathrm{sgn}(x - \theta)\,dH(x) = 0 . \qquad (1.8.6)$$
For convenience, a second representation of $\gamma(\theta)$ can be obtained if we extend $\varphi^+(u)$ to the interval $(-1, 0)$ as follows:
$$\varphi^+(t) = -\varphi^+(-t) , \quad \mbox{for } -1 < t < 0 . \qquad (1.8.7)$$
Using this extension, the functional $\theta = T(H)$ is the solution of
$$\gamma(\theta) = \int \varphi^+\left( H(x) - H(2\theta - x) \right) dH(x) = 0 . \qquad (1.8.8)$$
Compare expressions (1.8.8) and (1.3.26).

The level $\alpha$ test of the hypotheses (1.3.6) based on $T_{\varphi^+}(0)$ is
$$\mbox{Reject } H_0 \mbox{ in favor of } H_A, \mbox{ if } |T_{\varphi^+}(0)| \ge c , \qquad (1.8.9)$$
where $c$ solves $P_0[|T_{\varphi^+}(0)| \ge c] = \alpha$. We briefly develop the statistical and robustness properties of this test and the estimator $\hat{\theta}_{\varphi^+}$ in the next two subsections.
1.8.1 Null Properties of the Test

For this subsection on null properties and the following subsection on efficiency properties of the test (1.8.9), we will assume that the sample $X_1, \ldots, X_n$ follows the symmetric location model, (1.7.3), with common symmetric density function $h(x) = f(x - \theta)$, where $f(x)$ is symmetric about 0. Let $H(x)$ denote the distribution function associated with $h(x)$.

As in Section 1.7.1, we can express $T_{\varphi^+}(0)$ in terms of the anti-ranks as,
$$T_{\varphi^+}(0) = \sum a^+(R(|X_i|))\,\mathrm{sgn}(X_i) = \sum a^+(j)\,\mathrm{sgn}(X_{D_j}) = \sum a^+(j) W_j ; \qquad (1.8.10)$$
see the corresponding expression (1.3.20) for the weighted $L_1$ norm. Recall that under $H_0$ and the symmetry of $h(x)$, the variables $W_1, \ldots, W_n$ are iid with $P[W_i = 1] = P[W_i = -1] = 1/2$ (Lemma 1.7.2). Thus we immediately have that $T_{\varphi^+}(0)$ is distribution free under $H_0$ with mean and variance
$$E_0[T_{\varphi^+}(0)] = 0 \qquad (1.8.11)$$
$$\mathrm{Var}_0[T_{\varphi^+}(0)] = \sum_{i=1}^{n} a^{+2}(i) . \qquad (1.8.12)$$
Tables can be constructed for the null distribution of $T_{\varphi^+}(0)$ from which critical values, $c$, can be obtained to complete the test described in (1.8.9).

For the asymptotic null distribution of $T_{\varphi^+}(0)$, the following additional assumption on the scores will be sufficient:
$$\frac{\max_j a^{+2}(j)}{\sum_{i=1}^{n} a^{+2}(i)} \rightarrow 0 . \qquad (1.8.13)$$
Because $\varphi^+$ is square integrable, we have
$$\frac{1}{n} \sum_{i=1}^{n} a^{+2}(i) \rightarrow \sigma^2_{\varphi^+} = \int_0^1 (\varphi^+(u))^2\,du , \quad 0 < \sigma^2_{\varphi^+} < \infty , \qquad (1.8.14)$$
i.e., the left side is a Riemann sum of the integral. Under these assumptions and the symmetric location model, Corollary A.1.1 of the Appendix can be used to show that the null distribution of $T_{\varphi^+}(0)$ is asymptotically normal; see, also, Exercise 1.12.16. Hence, an asymptotic level $\alpha$ test is
$$\mbox{Reject } H_0 \mbox{ in favor of } H_A, \mbox{ if } \frac{|T_{\varphi^+}(0)|}{\sqrt{n}\,\sigma_{\varphi^+}} \ge z_{\alpha/2} . \qquad (1.8.15)$$
An approximate $(1 - \alpha)100\%$ confidence interval for $\theta$ based on the process $T_{\varphi^+}(\theta)$ is the interval $(\hat{\theta}_{\varphi^+,L}, \hat{\theta}_{\varphi^+,U})$ such that
$$T_{\varphi^+}(\hat{\theta}_{\varphi^+,L}) = z_{\alpha/2}\sqrt{n}\,\sigma_{\varphi^+} \quad \mbox{and} \quad T_{\varphi^+}(\hat{\theta}_{\varphi^+,U}) = -z_{\alpha/2}\sqrt{n}\,\sigma_{\varphi^+} ; \qquad (1.8.16)$$
see (1.5.26). These equations can be solved by the simple tracing algorithm discussed immediately following expression (1.8.5).
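A hedged sketch of this solution in R, reusing tphiplus from the sketch above (walshci is a hypothetical name): since $T_{\varphi^+}$ is decreasing, the endpoints are located by scanning the sorted Walsh averages for the crossings of $\pm z_{\alpha/2}\sqrt{n}\,\sigma_{\varphi^+}$, here approximated by the exact finite-sample standard deviation from (1.8.12).

walshci = function(x,alpha=.05,phi=function(u){u}){
   n = length(x)
   siga = sqrt(sum(phi((1:n)/(n+1))^2))   # sqrt of (1.8.12)
   zc = qnorm(1-alpha/2)
   walsh = sort(outer(x,x,"+")[outer(1:n,1:n,"<=")]/2)
   tv = sapply(walsh,tphiplus,x=x,phi=phi)   # decreasing in theta
   c(walsh[min(which(tv <= zc*siga))],walsh[max(which(tv >= -zc*siga))])
}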
1.8.2 Efficiency and Robustness Properties

We derive the efficiency properties of the analysis described above by establishing the four conditions of Definition 1.5.3 to show that the process $T_{\varphi^+}(\theta)$ is Pitman regular. Assume that $\varphi^+(u)$ is differentiable. First define the quantity $\gamma_h$ as
$$\gamma_h = \int_0^1 \varphi^+(u)\,\varphi_h^+(u)\,du , \qquad (1.8.17)$$
where
$$\varphi_h^+(u) = -\frac{h'\left( H^{-1}\left( \frac{u+1}{2} \right) \right)}{h\left( H^{-1}\left( \frac{u+1}{2} \right) \right)} . \qquad (1.8.18)$$
As discussed below, $\varphi_h^+(u)$ is called the optimal score function. We assume that our scores are such that $\gamma_h > 0$.

Since it is the negative of a gradient of a norm, $T_{\varphi^+}(\theta)$ is nonincreasing in $\theta$; hence, the first condition, (1.5.7), holds. Let $\overline{T}_{\varphi^+}(\theta) = T_{\varphi^+}(\theta)/n$ and consider
$$\mu_{\varphi^+}(\theta) = E_\theta[\overline{T}_{\varphi^+}(0)] = E_0[\overline{T}_{\varphi^+}(-\theta)] .$$
Note that $\overline{T}_{\varphi^+}(\theta)$ converges in probability to $\gamma(\theta)$ in (1.8.8). Hence, $\mu_{\varphi^+}(\theta) = \gamma(-\theta)$, where in (1.8.8) $H$ is a distribution function with point of symmetry at 0, without loss of generality. If we differentiate $\mu(\theta)$ and set $\theta = 0$, we get
$$\mu'_{\varphi^+}(0) = 2\int_{-\infty}^{\infty} \varphi^{+\prime}(2H(x) - 1)h(x)\,dH(x) = 4\int_0^{\infty} \varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx = \int_0^1 \varphi^+(u)\,\varphi_h^+(u)\,du > 0 , \qquad (1.8.19)$$
where the third equality in (1.8.19) follows from an integration by parts. Hence the second Pitman regularity condition holds.
For the third condition, (1.5.9), the asymptotic linearity for the process $T_{\varphi^+}(\theta)$ is given in Theorem A.2.11 of the Appendix. We restate the result here for reference:
$$P_0\left[ \sup_{\sqrt{n}|\theta| \le B} \left| \frac{1}{\sqrt{n}} T_{\varphi^+}(\theta) - \frac{1}{\sqrt{n}} T_{\varphi^+}(0) + \gamma_h \sqrt{n}\theta \right| \ge \epsilon \right] \rightarrow 0 , \qquad (1.8.20)$$
for all $\epsilon > 0$ and all $B > 0$. Finally the fourth condition, (1.5.10), concerns the asymptotic null distribution, which was discussed above. The null variance of $T_{\varphi^+}(0)/\sqrt{n}$ is given by expression (1.8.12). Therefore the process $T_{\varphi^+}(\theta)$ is Pitman regular with efficacy given by
$$c_{\varphi^+} = \frac{\int_0^1 \varphi^+(u)\varphi_h^+(u)\,du}{\sqrt{\int_0^1 (\varphi^+(u))^2\,du}} = \frac{2\int_{-\infty}^{\infty} \varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx}{\sqrt{\int_0^1 (\varphi^+(u))^2\,du}} . \qquad (1.8.21)$$

As our first result, we obtain the asymptotic power lemma for the process $T_{\varphi^+}(\theta)$. This, of course, follows immediately from Theorem 1.5.8 so we state it as a corollary.

Corollary 1.8.1. Under the symmetric location model,
$$P_{\theta_n}\left[ \frac{T_{\varphi^+}(0)}{\sqrt{n}\,\sigma_{\varphi^+}} \ge z_\alpha \right] \rightarrow 1 - \Phi(z_\alpha - \theta^* c_{\varphi^+}) , \qquad (1.8.22)$$
for the sequence of hypotheses
$$H_0: \theta = 0 \mbox{ versus } H_{An}: \theta = \theta_n = \frac{\theta^*}{\sqrt{n}} \mbox{ for } \theta^* > 0 .$$

Based on Pitman regularity, the asymptotic distribution of the estimate $\hat{\theta}_{\varphi^+}$ is
$$\sqrt{n}(\hat{\theta}_{\varphi^+} - \theta) \stackrel{D}{\rightarrow} N(0, \tau^2_{\varphi^+}) , \qquad (1.8.23)$$
where the scale parameter $\tau_{\varphi^+}$ is defined by the reciprocal of (1.8.21),
$$\tau_{\varphi^+} = c^{-1}_{\varphi^+} = \frac{\sigma_{\varphi^+}}{\int_0^1 \varphi^+(u)\varphi_h^+(u)\,du} . \qquad (1.8.24)$$
Using the general result of Theorem 1.5.9, the length of the confidence interval for $\theta$, (1.8.16), can be used to obtain a consistent estimate of $\tau_{\varphi^+}$. This in turn can be used to obtain a consistent estimate of the standard error of $\hat{\theta}_{\varphi^+}$; see Exercise ??.
The asymptotic relative efficiency between two estimates or two tests based on score functions $\varphi_1^+(u)$ and $\varphi_2^+(u)$ is the ratio
$$e(\varphi_1^+, \varphi_2^+) = \frac{c^2_{\varphi_1^+}}{c^2_{\varphi_2^+}} = \frac{\tau^2_{\varphi_2^+}}{\tau^2_{\varphi_1^+}} . \qquad (1.8.25)$$
This can be used to compare different tests. For a specific distribution we can determine the optimum scores. Such a score should make the scale parameter $\tau_{\varphi^+}$ as small as possible. This scale parameter can be written as,
$$c_{\varphi^+} = \frac{1}{\tau_{\varphi^+}} = \left[ \frac{\int_0^1 \varphi^+(u)\varphi_h^+(u)\,du}{\sigma_{\varphi^+} \sqrt{\int_0^1 \varphi_h^{+2}(u)\,du}} \right] \sqrt{\int_0^1 \varphi_h^{+2}(u)\,du} . \qquad (1.8.26)$$
The quantity in brackets is a correlation coefficient; hence, to minimize the scale parameter $\tau_{\varphi^+}$, we need to maximize the correlation coefficient, which can be accomplished by selecting the optimal score function given by
$$\varphi^+(u) = \varphi_h^+(u) ,$$
where $\varphi_h^+(u)$ is given by expression (1.8.18). The quantity $\sqrt{\int_0^1 (\varphi_h^+(u))^2\,du}$ is the square root of Fisher information; see Exercise 1.12.23. Therefore for this choice of scores the estimate $\hat{\theta}_{\varphi_h^+}$ is asymptotically efficient. This is the reason for calling the score function $\varphi_h^+$ the optimal score function.

It is shown in Exercise 1.12.24 that the optimal scores are the normal scores if $h(x)$ is a normal density, the Wilcoxon weighted $L_1$ scores if $h(x)$ is a logistic density, and the $L_1$ scores if $h(x)$ is a double exponential density. It is further shown that the scores generated by (1.8.3) are optimal for symmetric densities with a logistic center and exponential tails.
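The following R lines sketch these three optimal score functions on a common grid; this is only an illustration of their shapes under the parameterizations used above.

u = seq(.01,.99,by=.01)
ns = qnorm((u+1)/2)      # normal scores (1.8.4): optimal for the normal
wil = u                  # Wilcoxon signed-rank scores: optimal for the logistic
sgn = rep(1,length(u))   # sign (L1) scores: optimal for the double exponential
matplot(u,cbind(ns,wil,sgn),type="l",lty=1:3,xlab="u",ylab="phi+(u)")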
From Exercise 1.12.24, the efficiency of the normal scores methods relative to the least squares methods is
$$e(NS, LS) = \sigma_f^2 \left[ \int_{-\infty}^{\infty} \frac{f^2(x)}{\phi(\Phi^{-1}(F(x)))}\,dx \right]^2 , \qquad (1.8.27)$$
where $F \in \mathcal{F}_s$, the family of symmetric distributions with positive, finite Fisher information, and $\phi = \Phi'$ is the $N(0,1)$ pdf.
We now prove a result similar to Theorem 1.7.4. We prove that the normal scores methods always have efficiency at least equal to one relative to the LS methods. Further, it is only equal to 1 at the normal distribution. The result was first proved by Chernoff and Savage (1958); however, the proof presented below is due to Gastwirth and Wolff (1968).

Theorem 1.8.1. Let $X_1, \ldots, X_n$ be a random sample from $F \in \mathcal{F}_s$. Then
$$\inf_{\mathcal{F}_s} e(NS, LS) = 1 , \qquad (1.8.28)$$
and it is only equal to 1 at the normal distribution.

Proof: If $\sigma_f^2 = \infty$ then $e(NS, LS) > 1$; hence, we suppose that $\sigma_f^2 = 1$. Let $e = e(NS, LS)$. Then from (1.8.27) we can write
$$\sqrt{e} = E\left[ \frac{f(X)}{\phi(\Phi^{-1}(F(X)))} \right] = E\left[ \frac{1}{\phi(\Phi^{-1}(F(X)))/f(X)} \right] .$$
Applying Jensen's inequality to the convex function $h(x) = 1/x$, we have
$$\sqrt{e} \ge \frac{1}{E\left[ \phi(\Phi^{-1}(F(X)))/f(X) \right]} .$$
Hence,
$$\frac{1}{\sqrt{e}} \le E\left[ \frac{\phi(\Phi^{-1}(F(X)))}{f(X)} \right] = \int_{-\infty}^{\infty} \phi\left( \Phi^{-1}(F(x)) \right) dx .$$
We now integrate by parts, using $u = \phi(\Phi^{-1}(F(x)))$, $du = \phi'(\Phi^{-1}(F(x)))f(x)\,dx/\phi(\Phi^{-1}(F(x))) = -\Phi^{-1}(F(x))f(x)\,dx$, since $\phi'(x)/\phi(x) = -x$. Hence, with $dv = dx$, we have
$$\int \phi\left( \Phi^{-1}(F(x)) \right) dx = x\,\phi\left( \Phi^{-1}(F(x)) \right)\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} x\,\Phi^{-1}(F(x))f(x)\,dx . \qquad (1.8.29)$$
Now transform $x\,\phi(\Phi^{-1}(F(x)))$ into $F^{-1}(\Phi(w))\phi(w)$ by first letting $t = F(x)$ and then $w = \Phi^{-1}(t)$. The integral $\int F^{-1}(\Phi(w))\phi(w)\,dw = \int x f(x)\,dx < \infty$; hence the limit of the integrand must be 0 as $x \rightarrow \pm\infty$. This implies that the first term on the right side of (1.8.29) is 0. Hence, applying the Cauchy-Schwarz inequality,
$$\frac{1}{\sqrt{e}} \le \int_{-\infty}^{\infty} x\,\Phi^{-1}(F(x))f(x)\,dx = \int_{-\infty}^{\infty} x\sqrt{f(x)}\;\Phi^{-1}(F(x))\sqrt{f(x)}\,dx \le \left[ \int_{-\infty}^{\infty} x^2 f(x)\,dx \int_{-\infty}^{\infty} \left( \Phi^{-1}(F(x)) \right)^2 f(x)\,dx \right]^{1/2} = 1 ,$$
since $\int x^2 f(x)\,dx = 1$ and, because $\Phi^{-1}(F(X))$ has a $N(0,1)$ distribution, $\int (\Phi^{-1}(F(x)))^2 f(x)\,dx = \int x^2 \phi(x)\,dx = 1$. Hence $e^{-1/2} \le 1$ and $e \ge 1$, which completes the proof. It should be noted that the inequality is strict except at the normal distribution. Hence the normal scores are strictly more efficient than the LS procedures except at the normal model, where the asymptotic relative efficiency is 1.
The influence function for $\hat{\theta}_{\varphi^+}$ is derived in Section A.5 of the Appendix. It is given by
$$\Omega(t, \hat{\theta}_{\varphi^+}) = \frac{\varphi^+(2H(t) - 1)}{4\int_0^{\infty} \varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx} . \qquad (1.8.30)$$
Note, also, that $E[\Omega^2(X, \hat{\theta}_{\varphi^+})] = \tau^2_{\varphi^+}$, as a check on the asymptotic distribution of $\hat{\theta}_{\varphi^+}$. Note that the influence function is bounded provided the score function is bounded. Thus the estimates based on the scores discussed in the last paragraph are all robust except for the normal scores. In the case of the normal scores, when $H(t) = \Phi(t)$, the influence function is $\Omega(t) = \Phi^{-1}(t)$; see Exercise 1.12.25.
The asymptotic breakdown of the estimate $\hat{\theta}_{\varphi^+}$ is $\epsilon^*$ given by
$$\int_{1 - \epsilon^*}^{1} \varphi^+(u)\,du = \frac{1}{2}\int_0^1 \varphi^+(u)\,du . \qquad (1.8.31)$$
We provide a heuristic argument for (1.8.31); for a rigorous development see Huber (1981). Recall Definition 1.6.1. The idea is to corrupt enough data so that the estimating equation, (1.8.5), no longer has a solution. Suppose that $[\epsilon n]$ observations are corrupted, where $[\cdot]$ denotes the greatest integer function. Push the corrupted observations out towards $+\infty$ so that
$$\sum_{i = [(1-\epsilon)n]+1}^{n} a^+(R(|X_i - \theta|))\,\mathrm{sgn}(X_i - \theta) = \sum_{i = [(1-\epsilon)n]+1}^{n} a^+(i) .$$
This restrains the estimating function from crossing the horizontal axis provided
$$-\sum_{i=1}^{[(1-\epsilon)n]} a^+(i) + \sum_{i = [(1-\epsilon)n]+1}^{n} a^+(i) > 0 .$$
Replacing the sums by integrals in the limit yields
$$\int_{1-\epsilon}^{1} \varphi^+(u)\,du > \int_0^{1-\epsilon} \varphi^+(u)\,du .$$
Now use the fact that
$$\int_0^{1-\epsilon} \varphi^+(u)\,du + \int_{1-\epsilon}^{1} \varphi^+(u)\,du = \int_0^1 \varphi^+(u)\,du$$
and that we want the smallest possible $\epsilon$ to get (1.8.31).
Example 1.8.1. Breakdowns of Estimates Based on Wilcoxon and Normal Scores

Table 1.8.1: Empirical AREs based on $n = 30$ and 10,000 simulations.

  Estimators   Normal   Contaminated Normal
  NS, LS        0.983        1.035
  WIL, LS       0.948        1.007
  NS, WIL       1.037        1.028

For $\hat{\theta} = \mathrm{med}(X_i + X_j)/2$, $\varphi^+(u) = u$ and it follows at once that $\epsilon^* = 1 - (1/\sqrt{2}) \doteq .293$. For the estimate based on the normal scores, where $\varphi^+(u)$ is given by (1.8.4), expression (1.8.31) becomes
$$\exp\left\{ -\frac{1}{2}\left[ \Phi^{-1}\left( 1 - \frac{\epsilon^*}{2} \right) \right]^2 \right\} = \frac{1}{2}$$
and $\epsilon^* = 2(1 - \Phi(\sqrt{\log 4})) \doteq .239$. Hence we have the unusual situation that the estimate based on the normal scores has positive breakdown but an unbounded influence curve.
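A quick numerical check of these two breakdown values in R:

epswil = 1 - 1/sqrt(2)                # Wilcoxon scores: about .293
epsns = 2*(1 - pnorm(sqrt(log(4))))   # normal scores: about .239
c(epswil,epsns)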
Example 1.8.2. Small Sample Empirical AREs of the Estimator Based on Normal Scores

As discussed above, the ARE between the normal scores estimator and the sample mean is 1 at the normal distribution. This is an asymptotic result. To answer the question about this efficiency at small samples, we conducted a small simulation study. We set the sample size at $n = 30$ and ran 10,000 simulations from a normal distribution. We also selected the contaminated normal distribution with $\epsilon = 0.01$ and $\sigma_c = 3$, which is a very mildly contaminated distribution. We consider three estimators: the rank-based estimator based on normal scores (NS), the rank-based estimator based on Wilcoxon scores (WIL), and the sample mean (LS). We used the RBR command onesampr(x,score=phinscp,grad=spnsc,maktable=F) to compute the normal scores estimator; see Exercise 1.12.29. As our empirical ARE we used the ratios of empirical mean square errors of the three estimators. Table 1.8.1 summarizes the results. The empirical AREs for the NS and WIL estimators, at the normal, are close to their asymptotic counterparts. Note that the NS estimator loses less than 2% efficiency relative to LS. For this small amount of contamination the NS estimator dominates the LS estimator. It also dominates the Wilcoxon estimator. In Exercise 1.12.29, the reader is asked to extend this study to other situations.
Example 1.8.3. Shoshoni Rectangles, Continued

The next display shows the normal scores analysis of the Shoshoni Rectangles data; see Example 1.4.2. We conducted the same analysis as we did for the sign test and traditional $t$-test discussed in Example 1.4.2. Note that the call to the RBR function onesampr with the values score=phinscp,grad=spnsc computes the normal scores analysis.

> onesampr(x,theta0=.618,alpha=.10,score=phinscp,grad=spnsc)

Test of Theta = 0.618 Alternative selected is 0
Test Stat. Tphi+ is 7.809417 Standardized (z) Test-Stat. 1.870514 and p-value 0.06141252
Estimate 0.6485 SE is 0.02502799
90 % Confidence Interval is ( 0.61975 , 0.7 )
Estimate of the scale parameter tau 0.1119286

While not as sensitive to the outliers as the traditional analysis, the outliers still had some influence on the normal scores analysis. The normal scores test rejects the null hypothesis at level 0.06 while the 90% confidence interval just misses the value 0.618.
1.9 Ranked Set Sampling

In this section we discuss an alternative to simple random sampling (SRS) called ranked set sampling (RSS). This method of data collection is useful when measurements are destructive or expensive while ranking of the data is relatively easy. Johnson, Nussbaum, Patil and Ross (1996) give an interesting application to environmental sampling. As a simple example consider the problem of estimating the mean volume of trees in a forest. To measure the volume, we must destroy the tree. On the other hand, an expert may well be able to rank the trees by volume in a small sample. The idea is to take a sample of size $k$ of trees and ask the expert to pick the one with smallest volume. This tree is cut down and the volume measured, and the other $k - 1$ trees are returned to the population for possible future selection. Then a new sample of size $k$ is taken and the expert identifies the second smallest, which is then cut down and measured. This is repeated until we have $k$ measurements, having looked at $k^2$ trees. This ends cycle 1. The measurements are represented as $x_{(1)1} \le \cdots \le x_{(k)1}$, where the number in parentheses indicates an order statistic and the second number indicates the cycle. We repeat the process for $n$ cycles to get $nk$ measurements:
$$x_{(1)1}, \ldots, x_{(1)n} \mbox{ iid } h_{(1)}(t)$$
$$x_{(2)1}, \ldots, x_{(2)n} \mbox{ iid } h_{(2)}(t)$$
$$\vdots$$
$$x_{(k)1}, \ldots, x_{(k)n} \mbox{ iid } h_{(k)}(t)$$
It is important to note that all $nk$ measurements are independent but are identically distributed only within each row. The density function $h_{(j)}(t)$ represents the pdf of the $j$th order statistic from a sample of size $k$ and is given by:
$$h_{(j)}(t) = \frac{k!}{(j-1)!(k-j)!}\,H^{j-1}(t)[1 - H(t)]^{k-j}\,h(t)$$
We suppose the measurements are distributed as $H(x) = F(x - \theta)$ and we wish to make a statistical inference concerning $\theta$, such as an estimate, test, or confidence interval. We will illustrate the ideas on the $L_1$ methods since they are simple to work with. We also wish to compute the efficiency of the RSS $L_1$ methods relative to the SRS $L_1$ methods. We will see that there is a substantial increase in efficiency when using the RSS design. In particular, we will compare the RSS methods to SRS methods based on a sample of size $nk$. The RSS method was first applied by McIntyre (1952) in measuring mean pasture yields. See Hettmansperger (1995) for a development of the RSS $L_1$ methods. The most convenient form of the RSS sign statistic is the number of positive measurements, given by
$$S^+_{RSS} = \sum_{j=1}^{k}\sum_{i=1}^{n} I(X_{(j)i} > 0) . \qquad (1.9.1)$$
Now note that $S^+_{RSS}$ can be written as $S^+_{RSS} = \sum S^+_{(j)}$, where $S^+_{(j)} = \sum_i I(X_{(j)i} > 0)$ has a binomial distribution with parameters $n$ and $1 - H_{(j)}(0)$. Further, $S^+_{(j)}$, $j = 1, \ldots, k$, are stochastically independent. It follows at once that
$$E\,S^+_{RSS} = n\sum_{j=1}^{k}(1 - H_{(j)}(0)) \qquad (1.9.2)$$
$$\mathrm{Var}\,S^+_{RSS} = n\sum_{j=1}^{k}(1 - H_{(j)}(0))H_{(j)}(0) .$$
With $k$ fixed and $n \rightarrow \infty$, it follows from the independence of $S^+_{(j)}$, $j = 1, \ldots, k$, that
$$(nk)^{-1/2}\left[ S^+_{RSS} - n\sum_{j=1}^{k}(1 - H_{(j)}(0)) \right] \stackrel{D}{\rightarrow} Z \sim n(0, \sigma^2) , \qquad (1.9.3)$$
and the asymptotic variance is given by
$$\sigma^2 = k^{-1}\sum_{j=1}^{k}[1 - H_{(j)}(0)]H_{(j)}(0) = \frac{1}{4} - k^{-1}\sum_{j=1}^{k}\left( H_{(j)}(0) - \frac{1}{2} \right)^2 . \qquad (1.9.4)$$
It is convenient to introduce a parameter $\delta^2 = 1 - (4/k)\sum(H_{(j)}(0) - 1/2)^2$; then $\sigma^2 = \delta^2/4$. The reader is asked to prove the second equality above in Exercise 1.12.26. Using the formulas for the pdfs of the order statistics it is straightforward to verify that
$$h(t) = k^{-1}\sum_{j=1}^{k} h_{(j)}(t) \quad \mbox{and} \quad H(t) = k^{-1}\sum_{j=1}^{k} H_{(j)}(t) .$$
We now consider testing $H_0: \theta = 0$ versus $H_A: \theta \ne 0$. The following theorem provides the mean and variance of the RSS sign statistic under the null hypothesis.
Theorem 1.9.1. Under the assumption that $H_0: \theta = 0$ is true, $F(0) = 1/2$,
$$F_{(j)}(0) = \frac{k!}{(j-1)!(k-j)!}\int_0^{1/2} u^{j-1}(1 - u)^{k-j}\,du ,$$
and $E\,S^+_{RSS} = nk/2$ and $\mathrm{Var}\,S^+_{RSS} = nk\left[ 1/4 - k^{-1}\sum(F_{(j)}(0) - 1/2)^2 \right]$.

Proof. Use the fact that $k^{-1}\sum F_{(j)}(0) = F(0) = 1/2$, and the expectation formula follows at once. Note that
$$F_{(j)}(0) = \frac{k!}{(j-1)!(k-j)!}\int_{-\infty}^{0} F(t)^{j-1}(1 - F(t))^{k-j}f(t)\,dt ,$$
and then make the change of variable $u = F(t)$.

The variance of $S^+_{RSS}$ does not depend on $H$, as expected; however, its computation requires the evaluation of the incomplete beta integral. Table 1.9.1 provides the values of $F_{(j)}(0)$ under $H_0: \theta = 0$. The bottom line of the table provides the values of $\delta^2 = 1 - (4/k)\sum(F_{(j)}(0) - 1/2)^2$, an important parameter in assessing the gain of RSS over SRS.

Table 1.9.1: Values of $F_{(j)}(0)$, $j = 1, \ldots, k$, and $\delta^2 = 1 - (4/k)\sum(F_{(j)}(0) - 1/2)^2$.

   j\k:      2     3     4     5     6     7     8     9    10
    1     .750  .875  .938  .969  .984  .992  .996  .998  .999
    2     .250  .500  .688  .813  .891  .938  .965  .981  .989
    3           .125  .313  .500  .656  .773  .856  .910  .945
    4                 .063  .188  .344  .500  .637  .746  .828
    5                       .031  .109  .227  .363  .500  .623
    6                             .016  .063  .145  .254  .377
    7                                   .008  .035  .090  .172
    8                                         .004  .020  .055
    9                                               .002  .011
   10                                                     .001
  delta^2 .750  .625  .547  .490  .451  .416  .393  .371  .352
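Since $F_{(j)}(0)$ is the incomplete beta integral pbeta(1/2, j, k-j+1), Table 1.9.1 is easy to reproduce; the following R sketch (rssnull is a hypothetical name) returns the $F_{(j)}(0)$ column and $\delta^2$ for a given $k$.

rssnull = function(k){
   Fj0 = pbeta(.5,1:k,k:1)                 # F_(j)(0), j = 1, ..., k
   delta2 = 1 - (4/k)*sum((Fj0 - .5)^2)
   list(Fj0=round(Fj0,3),delta2=round(delta2,3))
}
rssnull(5)    # compare with the k = 5 column of Table 1.9.1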
We will compare the SRS sign statistic $S^+_{SRS}$, based on a sample of size $nk$, to the RSS sign statistic $S^+_{RSS}$. Note that the variance of $S^+_{SRS}$ is $nk/4$. Then the ratio of variances is $\mathrm{Var}\,S^+_{RSS}/\mathrm{Var}\,S^+_{SRS} = \delta^2 = 1 - (4/k)\sum(F_{(j)}(0) - 1/2)^2$. The reduction in variance is given in the last row of Table 1.9.1 and can be quite large.

We next show that the parameter $\delta$ is an integral part of the efficacy of the RSS $L_1$ methods. It is straightforward, using the methods of Section 1.5 and Example 1.5.2, to show that the RSS $L_1$ estimating function is Pitman regular. To compute the efficacy we first note that
$$\overline{S}_{RSS}(\theta) = (nk)^{-1}\sum_{j=1}^{k}\sum_{i=1}^{n}\mathrm{sgn}(X_{(j)i} - \theta) = (nk)^{-1}[2S^+_{RSS}(\theta) - nk] .$$
We then have at once that
$$(nk)^{1/2}\,\overline{S}_{RSS}(0) \stackrel{D_0}{\rightarrow} Z \sim n(0, \delta^2) , \qquad (1.9.5)$$
and $\mu'(0) = 2f(0)$; see Exercise 1.12.27. See Babu and Koti (1996) for a development of the exact distribution. Hence, the efficacy of the RSS $L_1$ methods is given by
$$c_{RSS} = \frac{2f(0)}{\delta} = \frac{2f(0)}{\left[ 1 - (4/k)\sum_{j=1}^{k}(F_{(j)}(0) - 1/2)^2 \right]^{1/2}} .$$
We now summarize the inference methods and their efficiency in the following:

1. The test. Reject $H_0: \theta = 0$ in favor of $H_A: \theta > 0$ at significance level $\alpha$ if $S^+_{RSS} > (nk/2) + z_\alpha\,\delta\,(nk/4)^{1/2}$, where, as usual, $1 - \Phi(z_\alpha) = \alpha$.

2. The estimate. $(nk)^{1/2}(\mathrm{med}\,X_{(j)i} - \theta) \stackrel{D}{\rightarrow} Z \sim n(0, \delta^2/4f^2(0))$.

3. The confidence interval. Let $X^*_{(1)}, \ldots, X^*_{(nk)}$ be the ordered values of $X_{(j)i}$, $j = 1, \ldots, k$ and $i = 1, \ldots, n$. Then $[X^*_{(m+1)}, X^*_{(nk-m)}]$ is a $(1 - \alpha)100\%$ confidence interval for $\theta$, where $P(S^+_{RSS} \le m) = \alpha/2$. Using the normal approximation we have $m \doteq (nk/2) - z_{\alpha/2}\,\delta\,(nk/4)^{1/2}$.

4. Efficiency. The efficiency of the RSS methods with respect to the SRS methods is given by $e(RSS, SRS) = c^2_{RSS}/c^2_{SRS} = \delta^{-2}$. Hence, the reciprocal of the last line of Table 1.9.1 provides the efficiency values, and they can be quite substantial. Recall from the discussion following Definition 1.5.5 that efficiency can be interpreted as the ratio of sample sizes needed to achieve the same approximate variances, the same approximate local power, and the same confidence interval length. Hence, we write $(nk)_{RSS} \doteq \delta^2 (nk)_{SRS}$. This is really the point of the RSS design; see the simulation sketch following this list. Returning to the example of estimating the volume of wood in a forest, if we let $k = 5$, then from Table 1.9.1 we would need to destroy and measure only about one half as many trees using the RSS method rather than the SRS method.
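The following R sketch simulates an RSS data set under perfect ranking (rsssample is a hypothetical name) and checks the variance reduction of the RSS sign statistic; with $k = 5$ and normal errors the ratio of variances should be near $\delta^2 \doteq .49$.

rsssample = function(n,k,rdist=rnorm){
   x = matrix(0,k,n)
   for(cyc in 1:n){
      for(j in 1:k){
         set = rdist(k)              # draw a fresh set of k units
         x[j,cyc] = sort(set)[j]     # measure only the jth smallest
      }
   }
   x                                 # row j is an iid sample from h_(j)
}
srss = replicate(2000,sum(rsssample(10,5) > 0))
var(srss)/(10*5/4)                   # close to delta^2 for k = 5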
As a final note, we mention the problem of assessing the effect of imperfect ranking. Suppose that the expert makes a mistake when asked to identify the $j$th ordered value in a set of $k$ observations. As expected, there is less gain from using the RSS method. The interesting point is that if the expert simply identifies the supposed $j$th ordered value by random guess, then $\delta^2 = 1$ and the two sign tests have the same information; see Hettmansperger (1995) for more detail.
1.10 Interpolated Confidence Intervals for the $L_1$ Inference

When we construct $L_1$ confidence intervals, we are limited in our choice of confidence coefficients because of the discreteness of the binomial distribution. The effect does not wear off very quickly as the sample size increases. For example, with a sample of size 50, we can have either a 93.5% or a 96.7% confidence interval, and that is as close as we can come to 95%. In the following discussion we provide a method to interpolate between confidence intervals. The method is nonlinear and seems to be essentially distribution-free. We will begin by presenting and illustrating the method and then derive its properties.

Suppose $\gamma$ is the desired confidence coefficient. Further, suppose the following intervals are available from the binomial table: interval $(x_{(k)}, x_{(n-k+1)})$ with confidence coefficient $\gamma_k$ and interval $(x_{(k+1)}, x_{(n-k)})$ with confidence coefficient $\gamma_{k+1}$, where $\gamma_{k+1} \le \gamma \le \gamma_k$. Then the interpolated interval is $[\hat{\theta}_L, \hat{\theta}_U]$,
$$\hat{\theta}_L = (1 - \lambda)x_{(k)} + \lambda x_{(k+1)} \quad \mbox{and} \quad \hat{\theta}_U = (1 - \lambda)x_{(n-k+1)} + \lambda x_{(n-k)} \qquad (1.10.1)$$
where
$$\lambda = \frac{(n - k)I}{k + (n - 2k)I} \quad \mbox{and} \quad I = \frac{\gamma_k - \gamma}{\gamma_k - \gamma_{k+1}} . \qquad (1.10.2)$$
We call $I$ the interpolation factor and note that if we were using linear interpolation then $\lambda = I$. Hence, we see that the interpolation is distinctly nonlinear.

As a simple example we take $n = 10$ and ask for a 95% confidence interval. For $k = 2$ we find $\gamma_k = .9786$ and $\gamma_{k+1} = .8907$. Then $I = .325$ and $\lambda = .658$. Hence, $\hat{\theta}_L = .342x_{(2)} + .658x_{(3)}$ and $\hat{\theta}_U = .342x_{(9)} + .658x_{(8)}$. Note that linear interpolation is almost the reverse of the recommended mixture, namely $\lambda = I = .325$, and this can make a substantial difference in small samples.
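A hedged R sketch of (1.10.1) and (1.10.2) follows; interpL1 is a hypothetical name, and the RBR function interpci, used in Example 1.10.1 below, performs this computation in practice.

interpL1 = function(x,k,gamma){
   n = length(x); xs = sort(x)
   gk = 1 - 2*pbinom(k-1,n,.5)     # coverage of (x_(k), x_(n-k+1))
   gk1 = 1 - 2*pbinom(k,n,.5)      # coverage of (x_(k+1), x_(n-k))
   I = (gk - gamma)/(gk - gk1)
   lam = (n - k)*I/(k + (n - 2*k)*I)
   c((1 - lam)*xs[k] + lam*xs[k+1],(1 - lam)*xs[n-k+1] + lam*xs[n-k])
}
# for n = 10, k = 2, gamma = .95 this gives I = .325 and lambda = .658,
# matching the example above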
The method is based on the following theorem. This theorem highlights the nonlinear relationship between the interpolation factor and $\lambda$. After proving the theorem we will need to develop an approximate solution and then show that it works in practice.

Theorem 1.10.1. The interpolation factor $I$ is given by
$$I = \frac{\gamma_k - \gamma}{\gamma_k - \gamma_{k+1}} = 1 - (n - k)2^n\int_0^{\infty} F^k\left( \frac{-\lambda y}{1 - \lambda} \right)(1 - F(y))^{n-k-1}f(y)\,dy$$

Proof. Without loss of generality we will assume that $\theta$ is 0. Then we can write:
$$\gamma_k = P_0(x_{(k)} \le 0 \le x_{(n-k+1)}) = P_0(k - 1 < S^+_1(0) < n - k + 1)$$
and
$$\gamma_{k+1} = P_0(x_{(k+1)} \le 0 \le x_{(n-k)}) = P_0(k < S^+_1(0) < n - k) .$$
Taking the difference, we have, using $\binom{n}{k}$ to denote the binomial coefficient,
$$\gamma_k - \gamma_{k+1} = P_0(S^+_1(0) = k) + P_0(S^+_1(0) = n - k) = \binom{n}{k}(1/2)^{n-1} . \qquad (1.10.3)$$
We now consider the lower tail probability associated with the confidence interval. First consider
$$P_0(X_{(k+1)} > 0) = \frac{1 - \gamma_{k+1}}{2} = \int_0^{\infty}\frac{n!}{k!(n-k-1)!}F^k(t)(1 - F(t))^{n-k-1}\,dF(t) \qquad (1.10.4)$$
$$= P_0(S^+_1(0) \ge n - k) = P_0(S^+_1(0) \le k) .$$
We next consider the lower end of the interpolated interval:
$$\frac{1 - \gamma}{2} = P_0((1 - \lambda)X_{(k)} + \lambda X_{(k+1)} > 0)$$
$$= \int_0^{\infty}\int_{-\lambda y/(1-\lambda)}^{y}\frac{n!}{(k-1)!(n-k-1)!}F^{k-1}(x)(1 - F(y))^{n-k-1}f(x)f(y)\,dx\,dy$$
$$= \int_0^{\infty}\frac{n!}{(k-1)!(n-k-1)!}\,\frac{1}{k}\left[ F^k(y) - F^k\left( \frac{-\lambda y}{1 - \lambda} \right) \right](1 - F(y))^{n-k-1}f(y)\,dy$$
$$= \frac{1 - \gamma_{k+1}}{2} - \int_0^{\infty}\frac{n!}{k!(n-k-1)!}F^k\left( \frac{-\lambda y}{1 - \lambda} \right)(1 - F(y))^{n-k-1}f(y)\,dy \qquad (1.10.5)$$
Use (1.10.4) in the last line above. Now, with (1.10.3), substitute into the formula for the interpolation factor and the result follows.
Clearly, not only is the relationship between $I$ and $\lambda$ nonlinear but it also depends on the underlying distribution $F$. Hence, the interpolated interval is not distribution free. There is one interesting case in which we have a distribution free interval, given in the following corollary.

Corollary 1.10.1. Suppose $F$ is the cdf of a symmetric distribution. Then $I(1/2) = k/n$, where we write $I(\lambda)$ to denote the dependence of the interpolation factor on $\lambda$.

This shows that when we sample from a symmetric distribution, the interval that lies halfway between the available intervals does not depend on the underlying distribution. Other interpolated intervals are not distribution free. Our next theorem shows how to approximate the solution, and the solution is essentially distribution free. We show by example that the approximate solution works in many cases.

Theorem 1.10.2.
$$I(\lambda) \doteq \frac{\lambda k}{\lambda(2k - n) + n - k}$$
Proof. We consider the integral
$$\int_0^{\infty} F^k\left( \frac{-\lambda y}{1 - \lambda} \right)(1 - F(y))^{n-k-1}f(y)\,dy .$$
The integrand decreases rapidly for moderate powers; hence, we expand the integrand around $y = 0$. First take logarithms; then
$$k\log F\left( \frac{-\lambda y}{1 - \lambda} \right) = k\log F(0) - \frac{\lambda}{1 - \lambda}\,k\,\frac{f(0)}{F(0)}\,y + o(y)$$
and
$$(n - k - 1)\log(1 - F(y)) = (n - k - 1)\log(1 - F(0)) - (n - k - 1)\frac{f(0)}{1 - F(0)}\,y + o(y) .$$
Substitute $r = \lambda k/(1 - \lambda)$ and $F(0) = 1 - F(0) = 1/2$ into the above equations, and add the two equations together. Add and subtract $r\log(1/2)$, and group terms so that the right side of the second equation appears on the right side along with $(k - r)\log(1/2)$. Hence, we have
$$k\log F\left( \frac{-\lambda y}{1 - \lambda} \right) + (n - k - 1)\log(1 - F(y)) = (k - r)\log(1/2) + (n + r - k - 1)\log(1 - F(y)) + o(y) ,$$
and, hence,
$$\int_0^{\infty} F^k\left( \frac{-\lambda y}{1 - \lambda} \right)(1 - F(y))^{n-k-1}f(y)\,dy \doteq \int_0^{\infty} 2^{-(k-r)}(1 - F(y))^{n+r-k-1}f(y)\,dy = \frac{1}{2^n(n + r - k)} . \qquad (1.10.6)$$
Substitute this approximation into the formula for $I(\lambda)$, use $r = \lambda k/(1 - \lambda)$, and the result follows.

Note that the approximation agrees with Corollary 1.10.1. In addition, Exercise 1.12.28 shows that the approximation formula is exact for the double exponential (Laplace) distribution. In Table 1.10.1 we show how well the approximation works for several other distributions. The exact results were obtained by numerical integration of the integral in Theorem 1.10.1. Similar close results were found for asymmetric examples. For further reading see Hettmansperger and Sheather (1986) and Nyblom (1992).

Table 1.10.1: Confidence coefficients for interpolated confidence intervals in Example 1.10.1. DE(Approx) = double exponential and the approximation in Theorem 1.10.2, U = uniform, N = normal, C = Cauchy, Linear = linear interpolation.

  lambda  DE(Approx)      U      N      C  Linear
    0.1        0.976  0.977  0.976  0.976   0.970
    0.2        0.973  0.974  0.974  0.974   0.961
    0.3        0.970  0.971  0.971  0.970   0.952
    0.4        0.966  0.967  0.966  0.966   0.943
    0.5        0.961  0.961  0.961  0.961   0.935
    0.6        0.955  0.954  0.954  0.954   0.926
    0.7        0.946  0.944  0.944  0.946   0.917
    0.8        0.935  0.930  0.931  0.934   0.908
    0.9        0.918  0.912  0.914  0.918   0.899
Example 1.10.1. Cushney-Peebles Example 1.4.1, Continued

We now return to this example, using it to illustrate the sign test and the $L_1$ interpolated confidence interval. We use the RBR function interpci for the computations. We take as our location model: $X_1, \ldots, X_{10}$ iid from $H(x) = F(x - \theta)$, $F$ and $\theta$ both unknown, along with the $L_1$ norm. We have already seen that the estimate of $\theta$ is the sample median, equal to 1.3. Besides obtaining an interpolated 95% confidence interval, we test $H_0: \theta = 0$ versus $H_A: \theta \ne 0$. Assuming that the sample is in the vector x, the output for a test and a 95% interpolated confidence interval is:

> tm=interpci(.05,x)

Estimation of Median
Sample Median is 1.3
Confidence Interval ( 1 , 1.8 ) 89.0625 %
Confidence Interval ( 0.9315 , 2.0054 ) 95 % Interpolated
Confidence Interval ( 0.8 , 2.4 ) 97.8516 %

Results for the Sign Test
Test of theta = 0 versus theta not equal to 0
Test stat. S is 9 p-value 0.00390625

Note the p-value of the test is .0039 and we would easily reject the null hypothesis at any reasonable level of significance. The interpolated 95% confidence interval for $\theta$ shows the reasonable set of values of $\theta$ to be between .9315 and 2.0054, given the level of confidence.
1.11 Two Sample Analysis

We now propose a simple way to extend our one sample methods to the comparison of two samples. Suppose $X_1, \ldots, X_m$ are iid $F(x - \theta_x)$ and $Y_1, \ldots, Y_n$ are iid $F(y - \theta_y)$, and the two samples are independent. Let $\Delta = \theta_y - \theta_x$ and we wish to test the null hypothesis $H_0: \Delta = 0$ versus the alternative hypothesis $H_A: \Delta \ne 0$. Without loss of generality we can consider $\theta_x = 0$, so that the $X$ sample is from a distribution with cdf $F(x)$ and the $Y$ sample is from a distribution with cdf $F(y - \Delta)$.

The hypothesis testing rule that we propose is:

1. Construct $L_1$ confidence intervals $[X_L, X_U]$ and $[Y_L, Y_U]$.
2. Reject $H_0$ if the intervals are disjoint.

If we consider the confidence interval as a set of reasonable values for the parameter, given the confidence coefficient, then we reject the null hypothesis when the respective reasonable values are disjoint. We must determine the significance level for the test. In particular, for given $\gamma_x$ and $\gamma_y$, what is the value of $\alpha_c$, the significance level for the comparison? Perhaps more pertinent: given $\alpha_c$, what values should we choose for $\gamma_x$ and $\gamma_y$? Below we show that for a broad range of sample sizes,
$$\mbox{Comparing two 84\% CIs yields a 5\% test of } H_0: \Delta = 0 \mbox{ versus } H_A: \Delta \ne 0 , \qquad (1.11.1)$$
where CI denotes confidence interval. In the following theorem we provide the relationship between $\alpha_c$ and the pair $\gamma_x$, $\gamma_y$. Define $z_x$ by $\gamma_x = 2\Phi(z_x) - 1$ and likewise $z_y$ by $\gamma_y = 2\Phi(z_y) - 1$.
Theorem 1.11.1. Suppose $m, n \rightarrow \infty$ so that $m/N \rightarrow \lambda$, $0 < \lambda < 1$, $N = m + n$. Then, under the null hypothesis $H_0: \Delta = 0$,
$$\alpha_c = P(X_L > Y_U) + P(Y_L > X_U) \rightarrow 2\Phi\left( -(1 - \lambda)^{1/2}z_x - \lambda^{1/2}z_y \right)$$

Proof. We will consider $\alpha_c/2 = P(X_L > Y_U)$. From (1.5.22) we have
$$X_L \doteq \frac{S_x(0)}{m\,2f(0)} - \frac{z_x}{m^{1/2}\,2f(0)} \quad \mbox{and} \quad Y_U \doteq \frac{S_y(0)}{n\,2f(0)} + \frac{z_y}{n^{1/2}\,2f(0)} .$$
Since $m/N \rightarrow \lambda$,
$$N^{1/2}X_L \stackrel{D}{\rightarrow} \lambda^{-1/2}Z_1 , \quad Z_1 \sim n(-z_x/2f(0),\ 1/4f^2(0)) ,$$
and
$$N^{1/2}Y_U \stackrel{D}{\rightarrow} (1 - \lambda)^{-1/2}Z_2 , \quad Z_2 \sim n(z_y/2f(0),\ 1/4f^2(0)) .$$
Now $\alpha_c/2 = P(X_L > Y_U) = P(N^{1/2}(Y_U - X_L) < 0)$ and $X_L$, $Y_U$ are independent; hence
$$N^{1/2}(Y_U - X_L) \stackrel{D}{\rightarrow} (1 - \lambda)^{-1/2}Z_2 - \lambda^{-1/2}Z_1 ,$$
and
$$(1 - \lambda)^{-1/2}Z_2 - \lambda^{-1/2}Z_1 \sim n\left( \frac{1}{2f(0)}\left[ \frac{z_x}{\lambda^{1/2}} + \frac{z_y}{(1 - \lambda)^{1/2}} \right] ,\ \frac{1}{4f^2(0)}\left[ \frac{1}{\lambda} + \frac{1}{1 - \lambda} \right] \right) .$$
It then follows that
$$P(N^{1/2}(Y_U - X_L) < 0) \rightarrow \Phi\left( -\left[ \frac{z_x}{\lambda^{1/2}} + \frac{z_y}{(1 - \lambda)^{1/2}} \right] \Big/ \left[ \frac{1}{\lambda(1 - \lambda)} \right]^{1/2} \right) ,$$
which, when simplified, yields the result in the statement of the theorem.
Table 1.11.1: Confidence coefficients for a 5% comparison.

  lambda = m/N        .500  .550  .600  .650  .750
  m/n                 1.00  1.22  1.50  1.86  3.00
  z_x = z_y           1.39  1.39  1.39  1.40  1.43
  gamma_x = gamma_y    .84   .84   .84   .85   .86

To illustrate, we take equal sample sizes, so that $\lambda = 1/2$, and we take $z_x = z_y = 2$. Then we have two 95% confidence intervals and we will reject the null hypothesis $H_0: \Delta = 0$ if the two intervals are disjoint. The above theorem says that the significance level is approximately equal to $\alpha_c = 2\Phi(-2.83) = .0046$. This is a very small level and it will be difficult to reject the null hypothesis. We might prefer a significance level of, say, $\alpha_c = .05$. We then must find $z_x$ and $z_y$ so that $.05 = 2\Phi(-(.5)^{1/2}(z_x + z_y))$. Note that now we have an infinite number of solutions. If we impose the reasonable condition that the two confidence coefficients are the same, then we require that $z_x = z_y = z$. Then we have the equation $.025 = \Phi(-(2)^{1/2}z)$, and hence $1.96 = (2)^{1/2}z$. So $z = 1.96/2^{1/2} = 1.39$ and the confidence coefficient for the two intervals is $\gamma = \gamma_x = \gamma_y = 2\Phi(1.39) - 1 \doteq .84$. Hence, if we have equal sample sizes and we use two 84% confidence intervals, then we have a 5% two-sided comparison of the two samples.

If we set $\alpha_c = .10$, this would correspond to a 5% one-sided test. This means that we compare the two confidence intervals in the direction specified by the alternative hypothesis. For example, if we specify $\Delta = \theta_y - \theta_x > 0$, then we would reject the null hypothesis if the $X$-interval is completely below the $Y$-interval. To determine which confidence intervals to use, we again assume that the two intervals have the same confidence coefficient. Then we must find $z$ such that $.05 = \Phi(-(2)^{1/2}z)$, and this leads to $1.645 = (2)^{1/2}z$ and $z = 1.16$. Hence, the confidence coefficient for the two intervals is $\gamma = \gamma_x = \gamma_y = 2\Phi(1.16) - 1 = .75$. Hence, for a one-sided 5% test or a 10% two-sided test, when you have equal sample sizes, use two 75% confidence intervals.

We must now consider what to do if the sample sizes are not equal. Let $z_c$ be determined by $\alpha_c/2 = \Phi(-z_c)$; then, again if we use the same confidence coefficient for the two intervals, $z = z_x = z_y = z_c/(\lambda^{1/2} + (1 - \lambda)^{1/2})$. When $m = n$, so that $\lambda = 1 - \lambda = .5$, we had $z = z_c/2^{1/2} = .707z_c$, and so $z = 1.39$ when $\alpha_c = .05$. We now show by example that when $\alpha_c = .05$, $z$ is not sensitive to the value of $\lambda$. Table 1.11.1 gives the relevant information. Hence, if we use 84% confidence intervals, then the significance level will be roughly 5% for the comparison for a broad range of ratios of sample sizes. Likewise, we would use 75% intervals for a 10% comparison. See Hettmansperger (1984b) for additional discussion.
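The entries of Table 1.11.1 follow from a two-line R computation; cicoef is a hypothetical name.

cicoef = function(alphac,lambda){
   zc = qnorm(1 - alphac/2)
   z = zc/(sqrt(lambda) + sqrt(1 - lambda))
   c(z=z,gamma=2*pnorm(z) - 1)
}
cicoef(.05,.5)    # z = 1.39, gamma = .84, as in the text
cicoef(.10,.5)    # z = 1.16, gamma = .75: the one-sided 5% comparison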
Next suppose that we want a confidence interval for $\Delta = \theta_y - \theta_x$. In the following simple theorem we show that the proposed test based on comparing two confidence intervals is equivalent to checking whether zero is contained in a different confidence interval. This new interval will be a confidence interval for $\Delta$.

Theorem 1.11.2. $[X_L, X_U]$ and $[Y_L, Y_U]$ are disjoint if and only if 0 is not contained in $[Y_L - X_U, Y_U - X_L]$.

If we specify our significance level to be $\alpha_c$, then we have immediately that
$$1 - \alpha_c = P_\Delta(Y_L - X_U \le \Delta \le Y_U - X_L)$$
and $[Y_L - X_U, Y_U - X_L]$ is a $\gamma_c = 1 - \alpha_c$ confidence interval for $\Delta$.

This theorem simply points out that the hypothesis test can be equivalently based on a single confidence interval. Hence, two 84% intervals produce a roughly 95% confidence interval for $\Delta$. The confidence interval is easy to construct, since we need only find the least and greatest differences of the end points between the respective $Y$ and $X$ intervals.

Recall that one way to measure the efficiency of a confidence interval is to find its asymptotic length. This is directly related to the Pitman efficacy of the procedure; see Section 1.5.5. This would seem to be the most natural way to study the efficiency of the test based on confidence intervals. In the following theorem we determine the asymptotic length of the interval for $\Delta$.

Theorem 1.11.3. Suppose $m, n \rightarrow \infty$ in such a way that $m/N \rightarrow \lambda$, $0 < \lambda < 1$, $N = m + n$. Further suppose that $\gamma_c = 2\Phi(z_c) - 1$. Let $\Lambda$ be the length of $[Y_L - X_U, Y_U - X_L]$. Then
$$N^{1/2}\Lambda \stackrel{P}{\rightarrow} \frac{2z_c}{[\lambda(1 - \lambda)]^{1/2}\,2f(0)}$$

Proof. First note that $\Lambda = \Lambda_x + \Lambda_y$, the sum of the lengths of the $X$ and $Y$ intervals, respectively. Further,
$$N^{1/2}\Lambda = \frac{N^{1/2}}{n^{1/2}}\,n^{1/2}\Lambda_y + \frac{N^{1/2}}{m^{1/2}}\,m^{1/2}\Lambda_x .$$
But by Theorem 1.5.9 this converges in probability to $[z_x/\lambda^{1/2} + z_y/(1 - \lambda)^{1/2}]/f(0)$. Now note that $(1 - \lambda)^{1/2}z_x + \lambda^{1/2}z_y = z_c$ and the result follows.

The interesting point about this theorem is that the efficiency of the interval does not depend on how $z_x$ and $z_y$ are chosen so long as they satisfy $(1 - \lambda)^{1/2}z_x + \lambda^{1/2}z_y = z_c$. In addition, this interval has inherited the efficacy of the $L_1$ interval in the one sample location model. We will discuss the two-sample location model in detail in the next chapter. In Hettmansperger (1984b) other choices for $z_x$ and $z_y$ are discussed; for example, we could choose $z_x$ and $z_y$ so that the asymptotic standardized lengths are equal. The corresponding confidence coefficients for this choice are more sensitive to unequal sample sizes than the method proposed here.
Example 1.11.1. Hendy and Charles Coin Data

Hendy and Charles (1970) study the change in silver content in Byzantine coins. During the reign of Manuel I (1143-1180) there were several mintings. We consider the research hypothesis that the silver content changed from the first to the fourth coinage. The data consist of 9 coins identified from the first coinage and 7 coins from the fourth. We suppose that they are realizations of random samples of coins from the two populations. The percentage of silver in each coin is given in Table 1.11.2. Let $\Delta = \theta_1 - \theta_4$, where the 1 and 4 indicate the coinage. To test the null hypothesis $H_0: \Delta = 0$ versus $H_A: \Delta \ne 0$ at $\alpha = .05$, we construct two 84% $L_1$ confidence intervals and reject the null hypothesis if they are disjoint. The confidence intervals can be computed by using the RBR function onesampsgn with the value alph=.16. Results pertinent to the confidence intervals are:

Table 1.11.2: Silver percentage in two mintings.

  First   5.9  6.8  6.4  7.0  6.6  7.7  7.2  6.9  6.2
  Fourth  5.3  5.6  5.5  5.1  6.2  5.8  5.8

> onesampsgn(First,alpha=.16)

Estimate 6.8 SE is 0.2135123
84 % Confidence Interval is ( 6.4 , 7 )
Estimate of the scale parameter tau 0.6405368

> onesampsgn(Fourth,alpha=.16)

Estimate 5.6 SE is 0.1779269
84 % Confidence Interval is ( 5.3 , 5.8 )
Estimate of the scale parameter tau 0.4707503

Clearly, the 84% confidence intervals are disjoint; hence, we reject the null hypothesis at the 5% significance level and claim that the emperor apparently held back a little on the fourth coinage. A 95% confidence interval for $\Delta = \theta_1 - \theta_4$ is found by taking the differences in the ends of the confidence intervals: $(6.4 - 5.8,\ 7.0 - 5.3) = (0.6,\ 1.7)$. Hence, this analysis suggests that the difference in median percentages is someplace between .6% and 1.7%, with a point estimate of $6.8 - 5.6 = 1.2\%$.

Figure 1.11.1 provides a comparison boxplot of the data for the first and fourth coinages. Marking the 84% confidence intervals on the plot, we can see the relatively large gap between the confidence intervals, i.e., the sharp reduction in silver content from the first to the fourth coinage. In addition, the box for the fourth coinage is a bit more narrow than the box for the first coinage, indicating that there may be less variation (as measured by the interquartile range) in the fourth coinage. There are no apparent outliers, as indicated by the whiskers on the boxplot. Larson and Stroup (1976) analyze this example with a two sample $t$-test.
[Figure 1.11.1: Comparison boxplots of the Hendy and Charles coin data; boxes for the first and fourth coinages, vertical axis: percentage of silver, from 5.0 to 7.5.]
1.12 Exercises

1.12.1. Show that if $\|\cdot\|$ is a norm, then there always exists a value of $\theta$ which minimizes $\|x - \theta 1\|$ for any $x_1, \ldots, x_n$.

1.12.2. Figure 1.12.1 displays the graph of $Z(\theta)$ versus $\theta$ for $n = 20$ data points (count the steps), where
$$Z(\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{20}\mathrm{sign}(X_i - \theta) ,$$
i.e., the standardized sign (median) process.

(a) From the plot, what are the minimum and maximum values of the sample?

(b) From the plot, what is the associated point estimate of $\theta$?

(c) From the plot, determine a 95% confidence interval for $\theta$ (approximate, but show on the graph).

(d) From the plot, determine the value of the test statistic and the associated $p$-value for testing $H_0: \theta = 0$ versus $H_A: \theta > 0$.
[Figure 1.12.1: The graph of $Z(\theta)$ versus $\theta$; horizontal axis: $\theta$ from $-1$ to 3; vertical axis: $Z(\theta)$ from $-5$ to 5.]
1.12.3. Show $D(\theta)$, (1.3.3), is convex and continuous as a function of $\theta$. Further, argue that $D(\theta)$ is differentiable almost everywhere. Let $S(\theta)$ be a function such that $S(\theta) = -D'(\theta)$ where the derivative exists. Then show that $S(\theta)$ is a nonincreasing function.

1.12.4. Consider the $L_2$ norm. Show that $\hat{\theta} = \bar{x}$ and that $S_2(0) = \sqrt{n}\,t/\sqrt{n - 1 + t^2}$, where $t = \sqrt{n}\,\bar{x}/s$ and $s$ is the sample standard deviation. Further, show $S_2(0)$ is an increasing function of $t$, so the test based on $t$ is equivalent to $S_2(0)$.

1.12.5. Discuss the consistency of the $t$-test. Is the $t$-test resolving?

1.12.6. Discuss the Pitman regularity in the $L_2$ case.

1.12.7. The following R function computes a bootstrap distribution of the sample median.
bootmed = function(x,nb){
# Sample is in x and nb is the number of bootstraps
n = length(x)
bootmed = rep(0,nb)
for(i in 1:nb){
y = sample(x,size=n,replace=T)
bootmed[i] = median(y)
}
bootmed
}
(a). Use this code to obtain 1000 bootstrapped medians for the Shoshoni data of Example 1.4.2. Determine the standard error of this bootstrap sample of medians and compare it with the estimate based on the length of the confidence interval for the Shoshoni data.

(b). Now find the mean and variance of the Shoshoni data. Use these estimates to perform a parametric bootstrap of the sample median, as discussed in Example ??. Determine the standard error of this parametric bootstrap sample of medians and compare it with the estimates in Part (a).

1.12.8. Using languages such as Minitab or R, obtain a plot of the test sensitivity curves based on the signed-rank Wilcoxon statistic for the Cushney-Peebles data, Example 1.4.1, similar to the sensitivity curves based on the $t$ test and the sign test as shown in Figure 1.4.1.

1.12.9. In the proof of Theorem 1.5.6, show that (1.5.19) and (1.5.20) imply that $U_n(b)$ converges to $\mu'(0)$ in probability, pointwise in $b$, i.e., $U_n(b) = \mu'(0) + o_p(1)$.
1.12.10. Suppose we are sampling from the distribution with pdf
$$f(x) = \frac{3}{4}\,\frac{1}{\Gamma(2/3)}\exp\{-|x|^{3/2}\} , \quad -\infty < x < \infty ,$$
and we are considering whether to use the Wilcoxon or sign test. Using the efficacies of these tests, determine which test to use.

1.12.11. For which of the following distributions is the signed-rank Wilcoxon more powerful? Why?
$$f_1(x) = \left\{ \begin{array}{ll} \frac{3}{2}x^2 & -1 < x < 1 \\ 0 & \mbox{elsewhere} \end{array} \right.$$
$$f_2(x) = \left\{ \begin{array}{ll} \frac{3}{4}(1 - x^2) & -1 < x < 1 \\ 0 & \mbox{elsewhere} \end{array} \right.$$

1.12.12. Show that (1.5.23) is scale invariant. Hence, the efficiency does not change if $X$ is multiplied by a positive constant. Let
$$f(x, \delta) = \frac{\exp\{-|x|^\delta\}}{2\Gamma(\delta^{-1})\delta^{-1}} , \quad -\infty < x < \infty ,\ 1 \le \delta \le 2 .$$
When $\delta = 2$, $f$ is a normal distribution and when $\delta = 1$, $f$ is a Laplace distribution. Compute and plot, as a function of $\delta$, the efficiency (1.5.23).
1.12.13. Show that the finite sample breakdown of the Hodges-Lehmann estimate (1.3.25) is $\epsilon^*_n = m/n$, where $m$ is the solution to the quadratic inequality $2m^2 - (4n + 2)m + n^2 + n \le 0$. Table $\epsilon^*_n$ as a function of $n$ and show that $\epsilon^*_n$ converges to $1 - \frac{1}{\sqrt{2}} \doteq .29$.

1.12.14. Derive (1.6.9).

1.12.15. Prove Lemma 1.7.2.

1.12.16. Prove Theorem 1.7.1. In particular, check the conditions of the Lindeberg Central Limit Theorem to verify (1.7.7).

1.12.17. Prove Theorem 1.7.2.

1.12.18. For the general signed-rank norm given by (1.8.1), show that the function $T_{\varphi^+}(\theta)$, (1.8.2), is a decreasing step function which steps down only at the Walsh averages. Hint: First show that the ranks of $|X_i - \theta|$ and $|X_j - \theta|$ switch for $\theta_1 < \theta_2$ if and only if
$$\theta_1 < \frac{X_i + X_j}{2} < \theta_2 ,$$
(replace ranks by signs if $i = j$).
1.12.19. Use the results of the last exercise to write in some detail the tracing algorithm, described after expression (1.8.5), for obtaining the location estimator $\hat{\theta}_{\varphi^+}$ and its associated standard error.

1.12.20. Suppose $h(x)$ has finite Fisher information:
$$I(h) = \int \frac{(h'(x))^2}{h(x)}\,dx < \infty .$$
Prove that $h(x)$ is bounded and that $\int h^2(x)\,dx < \infty$.
Hint: Write
$$h(x) = \int_{-\infty}^{x} h'(t)\,dt \le \int_{-\infty}^{x} |h'(t)|\,dt .$$

1.12.21. Repeat Exercise 1.12.12 for (1.7.13).

1.12.22. Show that (1.8.1) is a norm.

1.12.23. Show that $\int \varphi_h^{+2}(u)\,du$, with $\varphi_h^+(u)$ given by (1.8.18), is equal to Fisher information,
$$\int \frac{(h'(x))^2}{h(x)}\,dx .$$

1.12.24. Find (1.8.18) when $h$ is a normal, logistic, and Laplace (double exponential) density, respectively.
1.12.25. Verify that the influence function of the normal scores estimate is unbounded when the underlying distribution is normal.

1.12.26. Verify (1.9.4).

1.12.27. Derive the limit distribution in expression (1.9.5).

1.12.28. Show that approximation (1.10.6) is exact for the double exponential (Laplace) distribution.

1.12.29. Extend the simulation study of Example 1.8.2 to the other contaminated normal situations found in Table 1.7.1. Comment on the results. Compare the empirical results for the Wilcoxon with the asymptotic results found in the table.

The following R code performs the contaminated normal simulation discussed in Example 1.8.2. (Semicolons are end of line indicators. As indicated in the call to onesampr, the normal scores estimator is computed by using the gradient R function spnsc and score function phinscp.)
nsims = 10000; n = 30; itype = 1; eps = .01; sigc = 3
collls = rep(0,nsims); collwil = rep(0,nsims); collnsc = rep(0,nsims)
for(i in 1:nsims){
   if(itype == 0){x = rnorm(n)}
   if(itype == 1){x = rcn(n,eps,sigc)}
   collls[i] = mean(x)
   collnsc[i] = onesampr(x,score=phinscp,grad=spnsc,maktable=F)$est
   collwil[i] = onesampwil(x,maktable=F)$est
}
msels = mean(collls^2); msensc = mean(collnsc^2); msewil = mean(collwil^2)
arensc = msels/msensc; arewil = msels/msewil; arenscwil = msewil/msensc
1.12.30. Consider the one sample location problem. Let $T(\theta)$ be a nonincreasing process. Consider the hypotheses:
$$H_0: \theta = 0 \mbox{ versus } H_A: \theta > 0 .$$
Assume that $T(\theta)$ is standardized so that the decision rule of the (asymptotic) level $\alpha$ test is given by
$$\mbox{Reject } H_0: \theta = 0 \mbox{ in favor of } H_A: \theta > 0, \mbox{ if } T(0) > z_\alpha .$$
Further assume that for all $|\theta| \le B$, $B > 0$,
$$T(\theta/\sqrt{n}) = T(0) - 1.2\theta + o_p(1) .$$

(a) For $\theta_0 > 0$, determine the asymptotic power $\gamma(\theta_0)$, i.e., determine
$$\gamma(\theta_0) = P_{\theta_0}[T(0) > z_\alpha] .$$

(b) Evaluate $\gamma(\theta_0)$ for $n = 36$ and $\theta_0 = 0.5$.
1.12.31. Suppose $X_1, \ldots, X_{2n}$ are independent observations such that $X_i$ has cdf $F(x - \theta_i)$. For testing $H_0: \theta_1 = \cdots = \theta_{2n}$ versus $H_A: \theta_1 \le \cdots \le \theta_{2n}$, with at least one strict inequality, consider the test statistic,
$$S = \sum_{i=1}^{n} I(X_{n+i} > X_i) .$$

(a.) Discuss the small sample and asymptotic distribution of $S$ under $H_0$.

(b.) Determine the alternative distribution of $S$ under the alternative $\theta_{n+i} - \theta_i = \Delta$, $\Delta > 0$, for all $i = 1, \ldots, n$. Show that the test is consistent for this alternative. This test is called Mann's (1945) test for trend.
1.12.32. The data in Table 1.12.1 constitute a sample of size 59 of information on professional baseball players. The data were recorded from the back of a deck of baseball cards (compliments of Carrie McKean).

(a). Obtain dotplots of the weights and heights of the baseball players.

(b). Assume the weight of a typical adult male is 175 pounds. Use the Wilcoxon test statistic to test the hypotheses
$$H_0: \theta_W = 175 \mbox{ versus } H_A: \theta_W \ne 175 ,$$
where $\theta_W$ is the median weight of a professional baseball player. Compute the $p$-value. Next obtain a 95% confidence interval for $\theta_W$ using the confidence interval procedure based on the Wilcoxon. Use the dotplot in Part (a) to comment on the assumption of symmetry.

(c). Let $\theta_H$ be the median height of a baseball player. Repeat the analysis of Part (b) for the hypotheses
$$H_0: \theta_H = 70 \mbox{ versus } H_A: \theta_H \ne 70 .$$

1.12.33. The signed-rank Wilcoxon scores are optimal for the logistic distribution while the sign scores are optimal for the Laplace distribution. A family of score functions which are optimal for distributions with logistic middles and Laplace tails are the bent scores. These are continuous score functions $\varphi^+(u)$ with a linear (positive slope and intercept 0) piece for $0 < u < b$ and a constant piece for $b < u < 1$, for a specified value of $b$; see Policello and Hettmansperger (1976). These are called signed-rank Winsorized Wilcoxon scores.

(a) Obtain the standardized scores such that $\int [\varphi^+(u)]^2\,du = 1$.

(b) For these scores with $b = 0.75$, obtain the corresponding estimate of location and an estimate of its standard error for the following data set:

7.94 8.13 8.11 7.96 7.83 7.04 7.91 7.82
7.42 8.06 8.51 7.88 8.96 7.58 8.14 8.06

The software RBR computes this estimate with the call onesampr(x,score=phipb,grad=sphipb,param=c(.75)).
Table 1.12.1: Data for professional baseball players, Exercise 1.12.32. The variables are: (H) height in inches; (W) weight in pounds; (B) side of plate from which the player bats (1 = right-handed, 2 = left-handed, 3 = switch-hitter); (A) throwing arm (0 = right, 1 = left); (P) pitch-hit indicator (0 = pitcher, 1 = hitter); and (Ave) average (ERA if pitcher, batting average if hitter).

H W B A P Ave H W B A P Ave
74 218 1 1 0 3.330 79 232 2 1 0 3.100
75 185 1 0 1 0.286 72 190 1 0 1 0.238
77 219 2 1 0 3.040 75 200 2 0 0 3.180
73 185 1 0 1 0.271 70 175 2 0 1 0.279
69 160 3 0 1 0.242 75 200 1 0 1 0.274
73 222 1 0 0 3.920 78 220 1 0 0 3.880
78 225 1 0 0 3.460 73 195 1 0 0 4.570
76 205 1 0 0 3.420 75 205 2 1 1 0.284
77 230 2 0 1 0.303 74 185 1 0 1 0.286
78 225 1 0 0 3.460 71 185 3 0 1 0.218
76 190 1 0 0 3.750 73 210 1 0 1 0.282
72 180 3 0 1 0.236 76 210 2 1 0 3.280
73 185 1 0 1 0.245 73 195 1 0 1 0.243
73 200 2 1 0 4.800 75 205 1 0 0 3.700
74 195 1 0 1 0.276 73 175 1 1 0 4.650
75 195 1 0 0 3.660 73 190 2 1 1 0.238
72 185 2 1 1 0.300 74 185 3 1 0 4.070
75 190 1 0 1 0.239 72 190 3 0 1 0.254
76 200 1 0 0 3.380 73 210 1 0 0 3.290
76 180 2 1 0 3.290 71 195 1 0 1 0.244
72 175 2 1 1 0.290 71 166 1 0 1 0.274
76 195 2 1 0 4.990 71 185 1 1 0 3.730
68 175 2 0 1 0.283 73 160 1 0 0 4.760
73 185 1 0 1 0.271 74 170 2 1 1 0.271
69 160 1 0 1 0.225 76 185 1 0 0 2.840
76 211 3 0 1 0.282 71 155 3 0 1 0.251
77 190 3 0 1 0.212 76 190 1 0 0 3.280
74 195 1 0 1 0.262 71 160 3 0 1 0.270
75 200 1 0 0 3.940 70 155 3 0 1 0.261
73 207 3 0 1 0.251
Chapter 2

Two Sample Problems

2.1 Introduction

Let $X_1, \ldots, X_{n_1}$ be a random sample with common distribution function $F(x)$ and density function $f(x)$. Let $Y_1, \ldots, Y_{n_2}$ be another random sample, independent of the first, with common distribution function $G(x)$ and density $g(x)$. We will call this the general model throughout this chapter. A natural null hypothesis is $H_0: F(x) = G(x)$. In this chapter we will consider rank and sign tests of this hypothesis. A general alternative to $H_0$ is $H_A: F(x) \ne G(x)$ for some $x$. Except for Section 2.10 on the scale model, we will generally be concerned with alternative models where one distribution is stochastically larger than the other; for example, the alternative that $G$ is stochastically larger than $F$, which can be expressed as $H_A: G(x) \le F(x)$ with a strict inequality for some $x$. This family of alternatives includes the location model, described next, and the Lehmann alternative models discussed in Section 2.7, which are used in survival analysis.

As in Chapter 1, the location models will be of primary interest. For these models $G(x) = F(x - \Delta)$ for some parameter $\Delta$. Thus the parameter $\Delta$ represents a shift in location between the two distributions. It can be expressed as $\Delta = \theta_Y - \theta_X$, where $\theta_Y$ and $\theta_X$ are the medians of the distributions $G$ and $F$, or, equivalently, as $\Delta = \mu_Y - \mu_X$ where, provided they exist, $\mu_Y$ and $\mu_X$ are the means of $G$ and $F$. In the location problem the null hypothesis becomes $H_0: \Delta = 0$. In addition to tests of this hypothesis we will develop estimates and confidence intervals for $\Delta$. We will call this the location model throughout this chapter and we will show that this is a generalization of the location problem defined in Chapter 1.

As in Chapter 1 with the one-sample problems, for the two-sample problems we offer the reader computational R functions which do the computation for the rank-based analyses discussed in this chapter.
74 CHAPTER 2. TWO SAMPLE PROBLEMS
2.2 Geometric Motivation
In this section, we work with the location model described above. As in Chapter 1, we will derive sign and rank-based tests and estimates from a geometric point of view. As we shall show, their development is analogous to that of least squares procedures in that other norms are used in place of the least squares Euclidean norm. In order to do this we place the problem into the context of a linear model. This will facilitate our geometric development and will also serve as an introduction to Chapter 3, linear models.
Let $\mathbf{Z}' = (X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2})$ denote the vector of all observations; let $n = n_1 + n_2$ denote the total sample size; and let
\[ c_i = \begin{cases} 0 & \text{if } 1 \leq i \leq n_1 \\ 1 & \text{if } n_1 + 1 \leq i \leq n \end{cases} \,. \tag{2.2.1} \]
Then we can write the location model as
\[ Z_i = \Delta c_i + e_i \,, \quad 1 \leq i \leq n \,, \tag{2.2.2} \]
where $e_1, \ldots, e_n$ are iid with distribution function $F(x)$. Let $\mathbf{C} = [c_i]$ denote the $n \times 1$ design matrix and let $\Omega_{FULL}$ denote the column space of $\mathbf{C}$. We can express the location model as
\[ \mathbf{Z} = \mathbf{C}\Delta + \mathbf{e} \,, \tag{2.2.3} \]
where $\mathbf{e}' = (e_1, \ldots, e_n)$ is the $n \times 1$ vector of errors. Note that, except for random error, the observations $\mathbf{Z}$ would lie in $\Omega_{FULL}$. Thus, given a norm, we estimate $\Delta$ so that $\mathbf{C}\widehat{\Delta}$ minimizes the distance between $\mathbf{Z}$ and the subspace $\Omega_{FULL}$; i.e., $\mathbf{C}\widehat{\Delta}$ is the vector in $\Omega_{FULL}$ closest to $\mathbf{Z}$.
Before turning our attention to $\Delta$, however, we write the problem in terms of the geometry discussed in Chapter 1. Consider any location functional $T$ of the distribution of $e$. Let $\theta = T(F)$. Define the random variable $e^* = e - \theta$. Then the distribution function of $e^*$ is $F^*(x) = F(x + \theta)$ and its functional is $T(F^*) = 0$. Thus the model (2.2.3) can be expressed as
\[ \mathbf{Z} = \theta \mathbf{1} + \mathbf{C}\Delta + \mathbf{e}^* \,. \tag{2.2.4} \]
Note that this is a generalization of the location problem discussed in Chapter 1. From the last paragraph, the distribution function of $X_i$ can be expressed as $F(x) = F^*(x - \theta)$; hence, $T(F) = \theta$ is a location functional of $X_i$. Further, the distribution function of $Y_j$ can be written as $G(x) = F^*(x - (\theta + \Delta))$. Thus $T(G) = \theta + \Delta$ is a location functional of $Y_j$. Therefore, $\Delta$ is precisely the difference in location functionals between $X_i$ and $Y_j$. Furthermore, $\Delta$ does not depend on which location functional is used and will be called the shift parameter.
Let $\mathbf{b} = (\theta, \Delta)'$. Given a norm, we want to choose as our estimate of $\mathbf{b}$ a value $\widehat{\mathbf{b}}$ such that $[\mathbf{1}\;\mathbf{C}]\widehat{\mathbf{b}}$ minimizes the distance between the vector of observations $\mathbf{Z}$ and the column space $V$ of the matrix $[\mathbf{1}\;\mathbf{C}]$. Thus we can use the norms defined in Chapter 1 to estimate $\mathbf{b}$.
If, as an example, we select the $L_1$ norm, then our estimate of $\mathbf{b}$ minimizes
\[ D(\mathbf{b}) = \sum_{i=1}^{n} |Z_i - \theta - \Delta c_i| \,. \tag{2.2.5} \]
Differentiating $D$ with respect to $\theta$ and $\Delta$, respectively, and setting the resulting equations to 0, we obtain the equations
\[ \sum_{i=1}^{n_1} \mathrm{sgn}(X_i - \theta) + \sum_{j=1}^{n_2} \mathrm{sgn}(Y_j - \theta - \Delta) \doteq 0 \tag{2.2.6} \]
\[ \sum_{j=1}^{n_2} \mathrm{sgn}(Y_j - \theta - \Delta) \doteq 0 \,. \tag{2.2.7} \]
Subtracting the second equation from the first we get $\sum_{i=1}^{n_1} \mathrm{sgn}(X_i - \theta) \doteq 0$; hence, $\widehat{\theta} = \mathrm{med}\, X_i$. Substituting this into the second equation, we get $\widehat{\Delta} = \mathrm{med}\, Y_j - \widehat{\theta} = \mathrm{med}\, Y_j - \mathrm{med}\, X_i$; hence, $\widehat{\mathbf{b}} = (\mathrm{med}\, X_i,\ \mathrm{med}\, Y_j - \mathrm{med}\, X_i)'$. We will obtain inference based on the $L_1$ norm in Sections 2.6.1 and 2.6.2.
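Since the $L_1$ solution above is in closed form, it is immediate to compute; the following is a minimal R sketch (the helper name l1est is ours, not an RBR routine).

    # L1 estimates for the two-sample location model:
    # theta-hat = med X_i and Delta-hat = med Y_j - med X_i.
    l1est <- function(x, y) {
      c(theta = median(x), delta = median(y) - median(x))
    }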
If we select the $L_2$ norm then, as shown in Exercise 2.13.1, the LS-estimate is $\widehat{\mathbf{b}} = (\overline{X},\ \overline{Y} - \overline{X})'$. Another norm discussed in Chapter 1 was the weighted $L_1$ norm. In this case $\mathbf{b}$ is estimated by minimizing
\[ D(\mathbf{b}) = \sum_{i=1}^{n} R(|Z_i - \theta - \Delta c_i|)\, |Z_i - \theta - \Delta c_i| \,. \tag{2.2.8} \]
This estimate cannot be obtained in closed form; however, fast minimization algorithms for such problems are discussed later in Chapter 3.
In the initial statement of the problem, though, $\theta$ is a nuisance parameter and we are really interested in $\Delta$, the shift in location between the populations. Hence, we want to define distance in terms of norms which are invariant to $\theta$. The type of norm that is invariant to $\theta$ is a pseudo-norm, which we define next.
Definition 2.2.1. An operator $\|\cdot\|_*$ is called a pseudo-norm if it satisfies the following four conditions:
\[ \|\mathbf{u} + \mathbf{v}\|_* \leq \|\mathbf{u}\|_* + \|\mathbf{v}\|_* \quad \text{for all } \mathbf{u}, \mathbf{v} \in R^n \]
\[ \|\alpha \mathbf{u}\|_* = |\alpha|\, \|\mathbf{u}\|_* \quad \text{for all } \alpha \in R,\ \mathbf{u} \in R^n \]
\[ \|\mathbf{u}\|_* \geq 0 \quad \text{for all } \mathbf{u} \in R^n \]
\[ \|\mathbf{u}\|_* = 0 \quad \text{if and only if } u_1 = \cdots = u_n \,. \]
Note that a regular norm satisfies the first three properties but, in lieu of the fourth property, the norm of a vector is 0 if and only if the vector is $\mathbf{0}$. The following inequalities establish the invariance of pseudo-norms to the parameter $\theta$:
\[ \|\mathbf{Z} - \theta\mathbf{1} - \mathbf{C}\Delta\|_* \leq \|\mathbf{Z} - \mathbf{C}\Delta\|_* + \|\theta\mathbf{1}\|_* = \|\mathbf{Z} - \mathbf{C}\Delta\|_* = \|\mathbf{Z} - \theta\mathbf{1} - \mathbf{C}\Delta + \theta\mathbf{1}\|_* \leq \|\mathbf{Z} - \theta\mathbf{1} - \mathbf{C}\Delta\|_* \,. \]
Hence, $\|\mathbf{Z} - \theta\mathbf{1} - \mathbf{C}\Delta\|_* = \|\mathbf{Z} - \mathbf{C}\Delta\|_*$.
Given a pseudo-norm, denote the associated dispersion function by $D_*(\Delta) = \|\mathbf{Z} - \mathbf{C}\Delta\|_*$. It follows from the above properties of a pseudo-norm that $D_*(\Delta)$ is a non-negative, continuous, and convex function of $\Delta$.
We next develop an inference which includes estimation of $\Delta$ and tests of hypotheses concerning $\Delta$ for a general pseudo-norm. As an estimate of the shift parameter $\Delta$, we choose a value $\widehat{\Delta}_*$ which solves
\[ \widehat{\Delta}_* = \mathrm{Argmin}\, D_*(\Delta) = \mathrm{Argmin}\, \|\mathbf{Z} - \mathbf{C}\Delta\|_* \,; \tag{2.2.9} \]
i.e., $\mathbf{C}\widehat{\Delta}_*$ minimizes the distance between $\mathbf{Z}$ and $\Omega_{FULL}$. Another way of defining $\widehat{\Delta}_*$ is as the stationary point of the gradient of the pseudo-norm. Define the function $S_*$ by
\[ S_*(\Delta) = -\nabla \|\mathbf{Z} - \mathbf{C}\Delta\|_* \,, \tag{2.2.10} \]
where $\nabla$ denotes the gradient of $\|\mathbf{Z} - \mathbf{C}\Delta\|_*$ with respect to $\Delta$. Because $D_*(\Delta)$ is convex, it follows immediately that
\[ S_*(\Delta) \text{ is nonincreasing in } \Delta \,. \tag{2.2.11} \]
Hence $\widehat{\Delta}_*$ is such that
\[ S_*(\widehat{\Delta}_*) \doteq 0 \,. \tag{2.2.12} \]
Given a location functional $\theta = T(F)$, i.e. Model (2.2.4), once $\Delta$ has been estimated we can base an estimate of $\theta$ on the residuals $Z_i - \widehat{\Delta} c_i$. For example, if we chose the median as our location functional, then we could use the median of the residuals to estimate it. We will discuss this in more detail for general linear models in Chapter 3.
Next consider the hypotheses
\[ H_0\colon \Delta = 0 \quad \text{versus} \quad H_A\colon \Delta \neq 0 \,. \tag{2.2.13} \]
The closer $S_*(0)$ is to 0, the more plausible is the hypothesis $H_0$. More formally, we define the gradient test of $H_0$ versus $H_A$ by the rejection rule,
\[ \text{Reject } H_0 \text{ in favor of } H_A \text{ if } S_*(0) \leq k \text{ or } S_*(0) \geq l \,, \]
where the critical values $k$ and $l$ depend on the null distribution of $S_*(0)$. Typically, the null distribution of $S_*(0)$ is symmetric about 0 and $k = -l$. The reduction in dispersion test is given by
\[ \text{Reject } H_0 \text{ in favor of } H_A \text{ if } D_*(0) - D_*(\widehat{\Delta}_*) \geq m \,, \]
where the critical value $m$ is determined by the null distribution of the test statistic. In this chapter, as in Chapter 1, we will be concerned with the gradient test, while in Chapter 3 we will use the reduction in dispersion test. A confidence interval for $\Delta$ of confidence $(1-\alpha)100\%$ is the interval $\{\Delta\colon k < S_*(\Delta) < l\}$ and
\[ 1 - \alpha = P_\Delta[k < S_*(\Delta) < l] \,. \tag{2.2.14} \]
Since $D_*(\Delta)$ is convex, $S_*(\Delta)$ is nonincreasing and we have
\[ \widehat{\Delta}_L = \inf\{\Delta\colon S_*(\Delta) < l\} \quad \text{and} \quad \widehat{\Delta}_U = \sup\{\Delta\colon S_*(\Delta) > k\} \,; \tag{2.2.15} \]
compare (1.3.10). Often we will be able to invert $k < S_*(\Delta) < l$ to find an explicit formula for the upper and lower end points.
We will discuss a large class of general pseudo norms in Section 2.5, but now we present
the pseudo norms that yield the pooled t-test and the Mann-Whitney-Wilcoxon test.
2.2.1 Least Squares (LS) Analysis
The traditional analysis is based on the squared pseudo-norm given by
\[ \|\mathbf{u}\|^2_{LS} = \sum_{i=1}^{n}\sum_{j=1}^{n} (u_i - u_j)^2 \,, \quad \mathbf{u} \in R^n \,. \tag{2.2.16} \]
It follows (see Exercise 2.13.1) that
\[ -\nabla \|\mathbf{Z} - \mathbf{C}\Delta\|^2_{LS} = 4 n_1 n_2 (\overline{Y} - \overline{X} - \Delta) \,; \]
hence the classical estimate is $\widehat{\Delta}_{LS} = \overline{Y} - \overline{X}$. Eliminating the constant factor $4 n_1 n_2$, the classical test is based on the statistic
\[ S_{LS}(0) = \overline{Y} - \overline{X} \,. \]
As shown in Exercise 2.13.1, standardizing $S_{LS}$ results in the two-sample pooled $t$-statistic. An approximate confidence interval for $\Delta$ is given by
\[ \overline{Y} - \overline{X} \pm t_{(\alpha/2,\, n_1+n_2-2)}\, \widehat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \,, \]
where $\widehat{\sigma}$ is the usual pooled estimate of the common standard deviation. This confidence interval is exact if $e_i$ has a normal distribution. Asymptotically, we replace $t_{(\alpha/2,\, n_1+n_2-2)}$ by $z_{\alpha/2}$. The test is asymptotically distribution free.
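For numeric samples x and y, this LS analysis corresponds to the familiar pooled two-sample $t$ procedure in base R; the call below is a sketch of that correspondence (it should agree with the RBR function twosampt up to output format).

    # Pooled-t test and confidence interval for Delta, assuming samples x and y:
    t.test(y, x, var.equal = TRUE, conf.level = 0.95)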
2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis
The rank based analysis is based on the pseudo-norm defined by
\[ \|\mathbf{u}\|_R = \sum_{i=1}^{n}\sum_{j=1}^{n} |u_i - u_j| \,, \quad \mathbf{u} \in R^n \,. \tag{2.2.17} \]
Note that this pseudo-norm is the $L_1$-norm based on the differences between the components and that it is the second term of expression (1.3.20), which defines the norm of the signed rank analysis of Chapter 1. Note further that this pseudo-norm differs from the least squares pseudo-norm in that the square root is taken inside the double summation. In Exercise 2.13.2 the reader is asked to show that this indeed is a pseudo-norm and that, further, it can be written in terms of ranks as
\[ \|\mathbf{u}\|_R = 4 \sum_{i=1}^{n} \left( R(u_i) - \frac{n+1}{2} \right) u_i \,. \]
From (2.2.17), it follows that the MWW gradient is
\[ -\nabla \|\mathbf{Z} - \mathbf{C}\Delta\|_R = 2 \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathrm{sgn}(Y_j - X_i - \Delta) \,. \]
Our estimate of $\Delta$ is a value which makes the gradient zero; that is, makes half of the differences positive and the other half negative. Thus the rank based estimate of $\Delta$ is
\[ \widehat{\Delta}_R = \mathrm{med}_{i,j}\{ Y_j - X_i \} \,. \tag{2.2.18} \]
This pseudo-norm estimate is often called the Hodges-Lehmann estimate of shift for the two sample problem (Hodges and Lehmann, 1963). As we show in Section 2.4.4, $\widehat{\Delta}_R$ has an approximate normal distribution with mean $\Delta$ and standard deviation $\tau \sqrt{(1/n_1) + (1/n_2)}$, where the scale parameter $\tau$ is given in display (2.4.22).
From the gradient we define
\[ S_R(\Delta) = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathrm{sgn}(Y_j - X_i - \Delta) \,. \tag{2.2.19} \]
Next define
\[ S_R^+(\Delta) = \#(Y_j - X_i > \Delta) \,. \tag{2.2.20} \]
Note that we have (with probability one) that $S_R(\Delta) = 2 S_R^+(\Delta) - n_1 n_2$. The statistic $S_R^+ = S_R^+(0)$, originally proposed by Mann and Whitney (1947), will be more convenient to use. The gradient test for the hypotheses (2.2.13) is
\[ \text{Reject } H_0 \text{ in favor of } H_A \text{ if } S_R^+ \leq k \text{ or } S_R^+ \geq n_1 n_2 - k \,, \]
where $k$ is chosen by $P_0(S_R^+ \leq k) = \alpha/2$. We show in Section 2.4 that the test statistic is distribution free under $H_0$ and that, further, it has an asymptotic normal distribution with mean $n_1 n_2/2$ and standard deviation $\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$ under $H_0$. Hence, an asymptotic level $\alpha$ test rejects $H_0$ in favor of $H_A$ if
\[ |z| > z_{\alpha/2} \quad \text{where} \quad z = \frac{S_R^+ - (n_1 n_2/2)}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} \,. \tag{2.2.21} \]
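A minimal sketch of this asymptotic test computed from first principles follows (the helper name mww.test is ours; the RBR function twosampwil automates the full analysis).

    mww.test <- function(x, y) {
      n1 <- length(x); n2 <- length(y); n <- n1 + n2
      srplus <- sum(outer(y, x, "-") > 0)          # S_R^+ = #(Y_j - X_i > 0)
      z <- (srplus - n1 * n2 / 2) / sqrt(n1 * n2 * (n + 1) / 12)
      c(SR.plus = srplus, z = z, p.two.sided = 2 * pnorm(-abs(z)))
    }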
As shown in Section 2.4.2, the $(1-\alpha)100\%$ MWW confidence interval for $\Delta$ is given by
\[ [D_{(k+1)},\, D_{(n_1 n_2 - k)}) \,, \tag{2.2.22} \]
where $k$ is such that $P_0[S_R^+ \leq k] = \alpha/2$ and $D_{(1)} \leq \cdots \leq D_{(n_1 n_2)}$ denote the ordered $n_1 n_2$ differences $Y_j - X_i$. It follows from the asymptotic null distribution of $S_R^+$ that $k$ can be approximated as
\[ k \doteq \frac{n_1 n_2}{2} - \frac{1}{2} - z_{\alpha/2} \sqrt{\frac{n_1 n_2 (n+1)}{12}} \,. \]
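Combining (2.2.22) with this approximation to $k$ gives a direct computation of the interval; a sketch (our own helper, rounding $k$ down to an integer, and assuming no ties among the differences):

    mww.ci <- function(x, y, alpha = 0.05) {
      n1 <- length(x); n2 <- length(y); n <- n1 + n2
      d <- sort(as.vector(outer(y, x, "-")))   # ordered differences D_(1) <= ... <= D_(n1 n2)
      k <- floor(n1 * n2 / 2 - 0.5 - qnorm(1 - alpha / 2) * sqrt(n1 * n2 * (n + 1) / 12))
      c(lower = d[k + 1], upper = d[n1 * n2 - k])
    }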
A rank formulation of the MWW test statistic $S_R^+(\Delta)$ will also prove useful. Letting $R(u_i)$ denote the rank of $u_i$ among $u_1, \ldots, u_n$, we can write
\begin{align*}
\sum_{j=1}^{n_2} R(Y_j - \Delta) &= \sum_{j=1}^{n_2} \left[ \#_i(X_i < Y_j - \Delta) + \#_i(Y_i - \Delta \leq Y_j - \Delta) \right] \\
&= \#(Y_j - X_i > \Delta) + \frac{n_2 (n_2 + 1)}{2} \,.
\end{align*}
Defining
\[ W(\Delta) = \sum_{j=1}^{n_2} R(Y_j - \Delta) \,, \tag{2.2.23} \]
we thus have the relationship that
\[ S_R^+(\Delta) = W(\Delta) - \frac{n_2 (n_2 + 1)}{2} \,. \tag{2.2.24} \]
The test statistic $W(0)$ was proposed by Wilcoxon (1945). Since it is a linear function of the Mann-Whitney test statistic, it has identical statistical properties. We will refer to the statistic $S_R^+$ as the Mann-Whitney-Wilcoxon statistic and will label it as MWW.
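The identity (2.2.24) is easy to check numerically; a quick sketch:

    set.seed(1)
    x <- rnorm(8); y <- rnorm(10)
    W <- sum(rank(c(x, y))[length(x) + seq_along(y)])   # W(0): rank sum of the Y's
    SRplus <- sum(outer(y, x, "-") > 0)                 # S_R^+(0)
    SRplus == W - length(y) * (length(y) + 1) / 2       # TRUE, per (2.2.24)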
As a final note on the geometry of the rank based analysis, reconsider the model with the location functional $\theta$ in it, i.e. (2.2.4). Suppose we obtain the R-estimate of $\Delta$, (2.2.18). Let $\widehat{\mathbf{e}}_R = \mathbf{Z} - \mathbf{C}\widehat{\Delta}_R$ denote the residuals. Next suppose we want to estimate the location parameter $\theta$ by using the weighted $L_1$ norm which was discussed for estimation of location in Section 1.7 of Chapter 1. Let $\|\mathbf{u}\|_{SR} = \sum_{j=1}^{n} j |u|_{(j)}$ denote this norm. For the residual vector $\widehat{\mathbf{e}}_R$, expression (1.3.10) of Chapter 1 is given by
\[ \|\widehat{\mathbf{e}}_R - \theta \mathbf{1}\|_{SR} = \sum_{i \leq j} \left| \frac{\widehat{e}_i + \widehat{e}_j}{2} - \theta \right| + (1/4) \|\widehat{\mathbf{e}}_R\|_R \,. \tag{2.2.25} \]
Hence the estimate of $\theta$ determined by this geometry is the Hodges-Lehmann estimate based on the residuals; i.e.,
\[ \widehat{\theta}_R = \mathrm{med}_{i \leq j} \left\{ \frac{\widehat{e}_i + \widehat{e}_j}{2} \right\} \,. \tag{2.2.26} \]
Asymptotic theory for the joint distribution of the random vector $(\widehat{\theta}_R, \widehat{\Delta}_R)'$ will be discussed in Chapter 3.
2.2.3 Computation
The Mann-Whitney-Wilcoxon analysis which we described above is easily computed using the RBR function twosampwil. This function returns the value of the Mann-Whitney-Wilcoxon test statistic $S_R^+ = S_R^+(0)$, (2.2.20), the estimate $\widehat{\Delta}_R$, (2.2.18), the associated confidence interval (2.2.22), and comparison boxplots of the samples. Also, the R intrinsic function wilcox.test and the minitab command MANN compute this Mann-Whitney-Wilcoxon analysis.
2.3 Examples
In this section we present two examples which illustrate the methods discussed in the last section. The calculations were performed by the RBR functions twosampwil and twosampt, which compute the Mann-Whitney-Wilcoxon and LS analyses, respectively. By convention, for each difference $Y_j - X_i = 0$, we add the value 1/2 to the test statistic $S_R^+$. Further, the returned $p$-value is calculated with the usual continuity correction. The estimate of $\tau$ and the standard error (SE) of $\widehat{\Delta}_R$ displayed in the results are based on expression (2.4.27), where a full discussion is given. The LS analysis, computed by twosampt, is based on the traditional pooled two-sample $t$-test.
Example 2.3.1. Quail Data
The data for this problem are drawn from a high volume drug screen designed to find compounds which reduce low density lipoprotein (LDL) cholesterol in quail; see McKean, Vidmar and Sievers (1989) for a discussion of this screen. For the purposes of the present example, we have taken the plasma LDL levels of one group of quail who were fed, over a specified period of time, a special diet mixed with a drug compound, and the LDL levels of a second group of quail who were fed the same special diet but without the drug compound over the same length of time. A completely randomized design was employed. We will refer to the first group as the treatment group and the second group as the control group. The data are displayed in Table 2.3.1. Let $\theta_C$ and $\theta_T$ denote the true median levels of LDL for the control and treatment populations, respectively. The parameter of interest is $\Delta = \theta_C - \theta_T$. We are interested in the alternative hypothesis that the treatment has been effective; hence the hypotheses are:
\[ H_0\colon \Delta = 0 \quad \text{versus} \quad H_A\colon \Delta > 0 \,. \]
Table 2.3.1: Data for Quail Example
Control 64 49 54 64 97 66 76 44 71 89
70 72 71 55 60 62 46 77 86 71
Treated 40 31 50 48 152 44 74 38 81 64
The comparison boxplots returned by the RBR function twosampwil are found in Figure 2.3.1. Note that there is one outlier, the fifth observation of the treated group, which has the value 152. Outliers such as this were typical with most of the data in this study; see McKean et al. (1989). For the data at hand, the treated group appears to have lower LDL levels.
[Figure 2.3.1: Comparison boxplots of the treated and control quail samples; vertical axis: LDL cholesterol.]
The analyses returned by the functions twosampwil and twosampt are given below. The Mann-Whitney-Wilcoxon test statistic has the value 134.5 with $p$-value 0.067, while the $t$-test statistic has value 0.557 with $p$-value 0.291. The MWW indicates, with marginal significance, that the treatment performed better than the placebo. The two sample $t$ analysis was impaired by the outlier.
The Hodges-Lehmann estimate of $\Delta$, (2.2.18), is 14 and the 90% confidence interval is $(-2.0, 24.0)$. In contrast, the least squares estimate of shift is 5 and the corresponding 90% confidence interval is $(-10.25, 20.25)$.
> twosampwil(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",
nameresp="LDL cholesterol")
Test of Delta = 0 Alternative selected is 1
Test Stat. S+ is 134.5 Standardized (z) Test-Stat. 1.495801 and
p-vlaue 0.06735282
MWW estimate of the shift in location is 14 SE is 8.180836
90 % Confidence Interval is ( -2 , 24 )
Estimate of the scale parameter tau 21.12283
> twosampt(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",
nameresp="LDL cholesterol")
Test of Delta = 0 Alternative selected is 1
Test Stat. ybar-xbar- 0 is 5 Standardized (t) Test-Stat. 0.5577585
and p-vlaue 0.2907209
Mean of y minus the mean of x is 5 SE is 8.964454
90 % Confidence Interval is ( -10.24971 , 20.24971 )
Estimate of the scale parameter sigma 23.14612
As noted above, these data were drawn from a high-speed drug screen to discover drug compounds which have the potential to reduce LDL cholesterol. In this screen, if a compound was at least marginally significant, the investigation of it would continue; otherwise it would be eliminated from further scrutiny. Hence, for this drug compound, the robust and LS analyses would result in different practical outcomes.
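As a check, the MWW analysis of the quail data can be reproduced, up to differences in tie handling and continuity-correction details, with base R's wilcox.test; the data below are those of Table 2.3.1.

    ctrl <- c(64, 49, 54, 64, 97, 66, 76, 44, 71, 89,
              70, 72, 71, 55, 60, 62, 46, 77, 86, 71)
    trt  <- c(40, 31, 50, 48, 152, 44, 74, 38, 81, 64)
    # One-sided test of H_A: Delta = theta_C - theta_T > 0, with the
    # Hodges-Lehmann estimate and 90% confidence interval:
    wilcox.test(ctrl, trt, alternative = "greater", conf.int = TRUE, conf.level = 0.90)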
Example 2.3.2. Hendy-Charles Coin Data, continuation of Example 1.11.1
Recall that the 84% $L_1$ confidence intervals for the data are disjoint. Thus we reject the null hypothesis that the silver content is the same for the two mintings at the 5% level. We now apply the MWW test and confidence interval to this data and find the Hodges-Lehmann estimate of shift. If the tailweights of the underlying distributions are moderate, the MWW methods are more efficient.
The output from the RBR function twosampwil is:
> twosampwil(Fourth,First)
Test of Delta = 0 Alternative selected is 0
Test Stat. S+ is 61.5 Standardized (z) Test-Stat. 3.122611
and p-vlaue 0.001792544
MWW estimate of the shift in location is 1.1 SE is 0.2999926
95 % Confidence Interval is ( 0.6 , 1.7 )
Estimate of the scale parameter tau 0.5952794
Note that there is strong statistical evidence that the mintings are different. The Hodges-Lehmann estimate (2.2.18) is 1.1, which suggests that there is roughly a 1.1% decrease in the silver content from the first to the fourth mintings. The 95% confidence interval, (2.2.22), is (0.6, 1.7). Half the length of the confidence interval is 0.55, and this could be reported as the margin of error in estimating $\Delta$, the change in median silver contents from the first to the fourth mintings. Hence we could report $1.1\% \pm 0.55\%$.
2.4 Inference Based on the Mann-Whitney-Wilcoxon
We next develop the theory for inference based on the Mann-Whitney-Wilcoxon statistic, including the test, the estimate and the confidence interval. Although much of the development is for the location model, the general model will also be considered. We begin with testing.
2.4.1 Testing
Although the geometric motivation of the test statistic $S_R^+$ was derived under the location model, the test can be used for more general models. Recall that the general model is comprised of a random sample $X_1, \ldots, X_{n_1}$ with cdf $F(x)$ and a random sample $Y_1, \ldots, Y_{n_2}$ with cdf $G(x)$. For the discussion we select the hypotheses
\[ H_0\colon F(x) = G(x) \text{ for all } x \quad \text{versus} \quad H_A\colon F(x) \geq G(x) \text{, with strict inequality for some } x \,. \tag{2.4.1} \]
Under this stochastically ordered alternative $Y$ tends to dominate $X$; i.e., $P(Y > X) > 1/2$. Our rank-based decision rule is to reject $H_0$ in favor of $H_A$ if $S_R^+$ is too large, where $S_R^+ = \#(Y_j - X_i > 0)$. Our immediate goal is to make this precise. What we discuss will of course hold for the other one-sided alternative, $F(x) \leq G(x)$, and the two-sided alternative, $F(x) \leq G(x)$ or $F(x) \geq G(x)$, as well. Furthermore, since the location model is a submodel of the general model, what holds for the general model will hold for it also. It will always be clear which set of hypotheses is being considered.
Under $H_0$, we first show that $S_R^+$ is distribution free and then show it is symmetrically distributed about $n_1 n_2 / 2$.

Theorem 2.4.1. Under the general null hypothesis in (2.4.1), $S_R^+$ is distribution free.

Proof: Under the null hypothesis, the combined samples $X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2}$ constitute a random sample of size $n$ from the distribution function $F(x)$. Hence any assignment of $n_2$ ranks from the set of integers $\{1, \ldots, n\}$ to $Y_1, \ldots, Y_{n_2}$ is equilikely; i.e., has probability $\binom{n}{n_2}^{-1}$, independent of $F$.

Theorem 2.4.2. Under $H_0$ in (2.4.1), the distribution of $S_R^+$ is symmetric about $n_1 n_2 / 2$.
Proof: Under $H_0$ in (2.4.1), $\mathcal{L}(Y_j - X_i) = \mathcal{L}(X_i - Y_j)$ for all $i, j$; see Exercise 2.13.3. Thus if $S_R^- = \#(X_i - Y_j > 0)$ then, under $H_0$, $\mathcal{L}(S_R^+) = \mathcal{L}(S_R^-)$. Since $S_R^- = n_1 n_2 - S_R^+$, we have the following string of equalities, which proves the result:
\begin{align*}
P\left[ S_R^+ \geq \frac{n_1 n_2}{2} + x \right] &= P\left[ n_1 n_2 - S_R^- \geq \frac{n_1 n_2}{2} + x \right] \\
&= P\left[ S_R^- \leq \frac{n_1 n_2}{2} - x \right] = P\left[ S_R^+ \leq \frac{n_1 n_2}{2} - x \right] \,.
\end{align*}
Hence for the hypotheses (2.4.1), a level $\alpha$ test based on $S_R^+$ would reject $H_0$ if $S_R^+ \geq c_{\alpha, n_1, n_2}$, where $P_{H_0}[S_R^+ \geq c_{\alpha, n_1, n_2}] = \alpha$. From the symmetry, note that the lower $\alpha$ critical point is given by $n_1 n_2 - c_{\alpha, n_1, n_2}$.
Although $S_R^+$ is distribution free under the null hypothesis, its distribution cannot be obtained in closed form. The next theorem gives a recursive formula for its distribution. The proof can be found in Exercise 2.13.4; see, also, Hettmansperger (1984, p. 136-137).

Theorem 2.4.3. Under the general null hypothesis in (2.4.1), let $P_{n_1, n_2}(k) = P_{H_0}[S_R^+ = k]$. Then
\[ P_{n_1, n_2}(k) = \frac{n_2}{n_1 + n_2}\, P_{n_1, n_2 - 1}(k - n_1) + \frac{n_1}{n_1 + n_2}\, P_{n_1 - 1, n_2}(k) \,, \]
where $P_{n_1, n_2}(k)$ satisfies the boundary conditions $P_{i,j}(k) = 0$ if $k < 0$, and $P_{i,0}(k)$ and $P_{0,j}(k)$ are 1 or 0 as $k = 0$ or $k \neq 0$.
Based on these recursion formulas, tables of the null distribution can be obtained readily, which then can be used to obtain the critical values for the rank based test. Alternatively, the asymptotic null distribution of $S_R^+$ can be used to determine approximate critical values. This asymptotic test will be discussed later; see Theorem 2.4.9.
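The recursion translates into a few lines of R; the implementation below is our own sketch (base R also supplies this null distribution directly through dwilcox and pwilcox).

    # P_{n1,n2}(k) = P_{H_0}[S_R^+ = k] via the recursion of Theorem 2.4.3.
    pn <- function(n1, n2, k) {
      if (k < 0) return(0)
      if (n1 == 0 || n2 == 0) return(as.numeric(k == 0))
      (n2 / (n1 + n2)) * pn(n1, n2 - 1, k - n1) +
        (n1 / (n1 + n2)) * pn(n1 - 1, n2, k)
    }
    pn(3, 3, 5)        # should agree with dwilcox(5, 3, 3)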
We next derive the mean and variance of $S_R^+$ under the three models:
(a) the general model where $X$ has distribution function $F(x)$ and $Y$ has distribution function $G(x)$;
(b) the location model where $G(x) = F(x - \Delta)$;
(c) and the null model in which $F(x) = G(x)$.
Of course, from Theorem 2.4.2, the null mean of $S_R^+$ is $n_1 n_2 / 2$. In our derivation we repeatedly make use of the fact that if $H$ is the distribution function of a random variable $Z$, then the random variable $H(Z)$ has a uniform distribution over the interval $(0,1)$; see Exercise 2.13.5.

Theorem 2.4.4. Assuming that $X_1, \ldots, X_{n_1}$ are iid $F(x)$ and $Y_1, \ldots, Y_{n_2}$ are iid $G(x)$, and that these two samples are independent of one another, the means of $S_R^+$ under the three models (a)-(c) are:
\[ \text{(a)} \quad E[S_R^+] = n_1 n_2 \left[ 1 - E[G(X)] \right] = n_1 n_2\, E[F(Y)] \]
\[ \text{(b)} \quad E[S_R^+] = n_1 n_2 \left[ 1 - E[F(X - \Delta)] \right] = n_1 n_2\, E[F(X + \Delta)] \]
\[ \text{(c)} \quad E[S_R^+] = \frac{n_1 n_2}{2} \,. \]
Proof: We shall prove only (a), since results (b) and (c) follow directly from it. We can write $S_R^+$ in terms of indicator functions as
\[ S_R^+ = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} I(Y_j - X_i > 0) \,, \tag{2.4.2} \]
where $I(t > 0)$ is 1 or 0 for $t > 0$ or $t \leq 0$, respectively. Let $Y$ have distribution function $G$, let $X$ have distribution function $F$, and let $X$ and $Y$ be independent. Then
\[ E[I(Y - X > 0)] = E[P[Y > X \mid X]] = E[1 - G(X)] = E[F(Y)] \,, \]
where the second equality follows from the independence of $X$ and $Y$. The results then follow.
Theorem 2.4.5. The variances of $S_R^+$ under the models (a)-(c) are:
\[ \text{(a)} \quad \mathrm{Var}[S_R^+] = n_1 n_2 \left\{ E[G(X)] - E^2[G(X)] \right\} + n_1 n_2 (n_1 - 1)\, \mathrm{Var}[F(Y)] + n_1 n_2 (n_2 - 1)\, \mathrm{Var}[G(X)] \]
\[ \text{(b)} \quad \mathrm{Var}[S_R^+] = n_1 n_2 \left\{ E[F(X - \Delta)] - E^2[F(X - \Delta)] \right\} + n_1 n_2 (n_1 - 1)\, \mathrm{Var}[F(Y)] + n_1 n_2 (n_2 - 1)\, \mathrm{Var}[F(X - \Delta)] \]
\[ \text{(c)} \quad \mathrm{Var}[S_R^+] = \frac{n_1 n_2 (n+1)}{12} \,. \]
Proof: Again only the result (a) will be obtained. Using the indicator formulation of $S_R^+$, (2.4.2), we have
\[ \mathrm{Var}[S_R^+] = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathrm{Var}[I(Y_j - X_i > 0)] + \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\sum_{l=1}^{n_1}\sum_{k=1}^{n_2} \mathrm{Cov}[I(Y_j - X_i > 0),\, I(Y_k - X_l > 0)] \,, \]
where the sums for the covariance terms are over all possible combinations except $(i,j) = (l,k)$. For the first term, note that the variance of $I(Y - X > 0)$ is
\begin{align*}
\mathrm{Var}[I(Y > X)] &= E[I(Y > X)] - E^2[I(Y > X)] \\
&= E[1 - G(X)] - E^2[1 - G(X)] \\
&= E[G(X)] - E^2[G(X)] \,.
\end{align*}
This yields the first term in (a). For the covariance terms, note that a covariance is 0 unless either $j = k$ or $i = l$. This leads to the following two cases:
Case (i): For the covariance terms with $j = k$ and $i \neq l$, we need $E[I(Y > X_1) I(Y > X_2)]$, which is
\begin{align*}
E[I(Y > X_1) I(Y > X_2)] &= P[Y > X_1,\, Y > X_2] \\
&= E[P[Y > X_1,\, Y > X_2 \mid Y]] \\
&= E[P[Y > X_1 \mid Y]\, P[Y > X_2 \mid Y]] \\
&= E\left[ F(Y)^2 \right] \,.
\end{align*}
There are $n_2$ ways to get a $j$ and $n_1 (n_1 - 1)$ ways to get $i \neq l$; hence there are $n_1 n_2 (n_1 - 1)$ covariances of this form. This leads to the second term of (a).

Case (ii): The terms for the covariances where $i = l$ and $j \neq k$ follow similarly to Case (i). This leads to the third and final term of (a).
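A quick Monte Carlo check of the null moments in part (c) of Theorems 2.4.4 and 2.4.5 (a sketch):

    set.seed(42)
    n1 <- 6; n2 <- 8; n <- n1 + n2
    sr <- replicate(20000, {
      z <- rnorm(n)                                    # F = G, so H_0 holds
      sum(outer(z[n1 + seq_len(n2)], z[seq_len(n1)], "-") > 0)
    })
    c(mean(sr), n1 * n2 / 2)                           # null mean, Theorem 2.4.4 (c)
    c(var(sr), n1 * n2 * (n + 1) / 12)                 # null variance, Theorem 2.4.5 (c)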
The last two theorems suggest that the random variable
\[ Z = \frac{S_R^+ - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n+1)}{12}}} \]
has an approximate $N(0,1)$ distribution under $H_0$. This follows from the next results, which yield the asymptotic distribution of $S_R^+$ under general alternatives as well as under the null hypothesis. We will obtain these results by projecting our statistic $S_R^+$ down onto a set of linear combinations of independent random variables. Then we can use central limit theory on the projection. See Hájek and Šidák (1967) for a discussion of this technique.
Let $T = T(Z_1, \ldots, Z_n)$ be a random variable based on a sample $Z_1, \ldots, Z_n$ such that $E[T] = 0$. Let
\[ p_k^*(x) = E[T \mid Z_k = x] \,, \quad k = 1, \ldots, n \,. \]
Next define the random variable $T_p$ to be
\[ T_p = \sum_{k=1}^{n} p_k^*(Z_k) \,. \tag{2.4.3} \]
In the next theorem we show that $T_p$ is the projection of $T$ onto the space of linear functions of $Z_1, \ldots, Z_n$. Note that, unlike $T$, $T_p$ is a linear combination of independent random variables; hence, its asymptotic distribution is often easier to obtain than that of $T$. As the following projection theorem shows, it is in a sense the closest linear function of the form $\sum p_i(Z_i)$ to $T$.
Theorem 2.4.6. If $W = \sum_{i=1}^{n} p_i(Z_i)$, then $E[(T - W)^2]$ is minimized by taking $p_i(x) = p_i^*(x)$. Furthermore, $E[(T - T_p)^2] = \mathrm{Var}[T] - \mathrm{Var}[T_p]$.

Proof: First note that $E[p_k^*(Z_k)] = 0$. We have
\begin{align*}
E[(T - W)^2] &= E\left[ \left\{ (T - T_p) - (W - T_p) \right\}^2 \right] \tag{2.4.4} \\
&= E[(T - T_p)^2] + E[(W - T_p)^2] - 2 E[(T - T_p)(W - T_p)] \,.
\end{align*}
We can write one-half the cross product term as
\begin{align*}
\sum_{i=1}^{n} E[(T - T_p)(p_i(Z_i) - p_i^*(Z_i))] &= \sum_{i=1}^{n} E\left[ E[(T - T_p)(p_i(Z_i) - p_i^*(Z_i)) \mid Z_i] \right] \\
&= \sum_{i=1}^{n} E\left[ (p_i(Z_i) - p_i^*(Z_i))\, E\left[ T - \sum_{j=1}^{n} p_j^*(Z_j) \,\Big|\, Z_i \right] \right] \,.
\end{align*}
The conditional expectation can be written as
\[ \left( E[T \mid Z_i] - p_i^*(Z_i) \right) - \sum_{j \neq i} E\left[ p_j^*(Z_j) \right] = 0 - 0 = 0 \,. \]
Hence the cross-product term is zero, and therefore the left-hand side of expression (2.4.4) is minimized with respect to $W$ by taking $W = T_p$. Also, since this holds in particular for $W = 0$, we get
\[ E[T^2] = E[(T - T_p)^2] + E[T_p^2] \,. \]
Since both $T$ and $T_p$ have zero means, the second result of the theorem also follows.
From these results a strategy for obtaining the asymptotic distribution of $T$ is apparent. Namely, find the asymptotic distribution of its projection, $T_p$, and then show $\mathrm{Var}[T] - \mathrm{Var}[T_p] \to 0$ as $n \to \infty$. This implies that $T$ and $T_p$ have the same asymptotic distribution; see Exercise 2.13.7. We shall apply this strategy to get the asymptotic distribution of the rank based methods. As a first step we obtain the projection of $S_R^+ - E[S_R^+]$ under the general model.
Theorem 2.4.7. Under the general model, the projection of the random variable $S_R^+ - E[S_R^+]$ is
\[ T_p = n_1 \sum_{j=1}^{n_2} \left( F(Y_j) - E[F(Y_j)] \right) - n_2 \sum_{i=1}^{n_1} \left( G(X_i) - E[G(X_i)] \right) \,. \tag{2.4.5} \]

Proof: Define the $n$ random variables $Z_1, \ldots, Z_n$ by
\[ Z_i = \begin{cases} X_i & \text{if } 1 \leq i \leq n_1 \\ Y_{i - n_1} & \text{if } n_1 + 1 \leq i \leq n \end{cases} \,. \]
We have
\[ p_k^*(x) = E[S_R^+ \mid Z_k = x] - E[S_R^+] = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} E[I(Y_j > X_i) \mid Z_k = x] - E[S_R^+] \,. \tag{2.4.6} \]
There are two cases depending on whether $1 \leq k \leq n_1$ or $n_1 + 1 \leq k \leq n_1 + n_2 = n$.
Case (1): Suppose $1 \leq k \leq n_1$. Then the conditional expectation in expression (2.4.6), depending on the value of $i$, becomes
\[ \text{(a)}\ i \neq k\colon\quad E[I(Y_j > X_i) \mid X_k = x] = E[I(Y_j > X_i)] = P[Y > X] \]
\[ \text{(b)}\ i = k\colon\quad E[I(Y_j > X_i) \mid X_i = x] = P[Y > X \mid X = x] = 1 - G(x) \,. \]
Hence, in this case,
\[ p_k^*(x) = n_2 (n_1 - 1) P[Y > X] + n_2 (1 - G(x)) - E[S_R^+] \,. \]
Case (2): Next suppose that $n_1 + 1 \leq k \leq n$. Then the conditional expectation in expression (2.4.6), depending on the value of $j$, becomes
\[ \text{(a)}\ j \neq k\colon\quad E[I(Y_j > X_i) \mid Y_k = x] = P[Y > X] \]
\[ \text{(b)}\ j = k\colon\quad E[I(Y_j > X_i) \mid Y_j = x] = F(x) \,. \]
Hence, in this case,
\[ p_k^*(x) = n_1 (n_2 - 1) P[Y > X] + n_1 F(x) - E[S_R^+] \,. \]
Combining these results we get
\begin{align*}
T_p &= \sum_{i=1}^{n_1} p_i^*(X_i) + \sum_{j=1}^{n_2} p_j^*(Y_j) \\
&= n_1 n_2 (n_1 - 1) P[Y > X] + n_2 \sum_{i=1}^{n_1} (1 - G(X_i)) \\
&\quad + n_1 n_2 (n_2 - 1) P[Y > X] + n_1 \sum_{j=1}^{n_2} F(Y_j) - n E[S_R^+] \,.
\end{align*}
This can be simplified by noting that
\[ P(Y > X) = E[P(Y > X \mid X)] = E[1 - G(X)] \]
or, similarly,
\[ P(Y > X) = E[F(Y)] \,. \]
From (a) of Theorem 2.4.4,
\[ E[S_R^+] = n_1 n_2 (1 - E[G(X)]) = n_1 n_2 P(Y > X) \,. \]
Substituting these three results into (2.4.6) we get the desired result.
An immediate outcome is

Corollary 2.4.1. Under the general model, if $T_p$ is given by (2.4.5), then
\[ \mathrm{Var}(T_p) = n_1^2 n_2\, \mathrm{Var}(F(Y)) + n_1 n_2^2\, \mathrm{Var}(G(X)) \,. \]
From this it follows that $T_p$ should be standardized as
\[ T_p^* = \frac{1}{\sqrt{n\, n_1 n_2}}\, T_p \,. \]
In order to obtain the asymptotic distribution of $T_p$, and subsequently $S_R^+$, we need the following assumption on the design (sample sizes):
\[ (D.1)\colon \quad \frac{n_i}{n} \to \lambda_i \,, \quad 0 < \lambda_i < 1 \,. \tag{2.4.7} \]
This says that the sample sizes go to $\infty$ at the same rate. Note that $\lambda_1 + \lambda_2 = 1$. The asymptotic variance of $T_p^*$ is thus
\[ \mathrm{Var}(T_p^*) \to \lambda_1\, \mathrm{Var}(F(Y)) + \lambda_2\, \mathrm{Var}(G(X)) \,. \]
We first want to obtain the asymptotic distribution under general alternatives. In order to do this we need an assumption concerning the ranges of $X$ and $Y$. The support of a continuous random variable with distribution function $H$ and density $h$ is defined to be the set $\{x\colon h(x) > 0\}$, which is denoted by $\mathcal{S}(H)$. Our second assumption states that the intersection of the supports of $F$ and $G$ has a nonempty interior; that is,
\[ (E.3)\colon \quad \text{There is an open interval } I \text{ such that } I \subset \mathcal{S}(F) \cap \mathcal{S}(G) \,. \tag{2.4.8} \]
Note that the asymptotic variance of $T_p^*$ is not zero under (E.3). We are now in the position to find the asymptotic distribution of $T_p^*$.
Theorem 2.4.8. Under the general model and the assumptions (D.1) and (E.3), $T_p^*$ has an asymptotic $N(0,\, \lambda_1 \mathrm{Var}(F(Y)) + \lambda_2 \mathrm{Var}(G(X)))$ distribution.

Proof: By (2.4.5) we can write
\[ T_p^* = \sqrt{\frac{n_1}{n\, n_2}} \sum_{j=1}^{n_2} \left( F(Y_j) - E[F(Y_j)] \right) - \sqrt{\frac{n_2}{n\, n_1}} \sum_{i=1}^{n_1} \left( G(X_i) - E[G(X_i)] \right) \,. \tag{2.4.9} \]
Note that both sums on the right side of expression (2.4.9) are composed of independent and identically distributed random variables and that the sums are independent of one another. The result then follows immediately by applying the simple central limit theorem to each sum.
This is the key result we need in order to obtain the asymptotic distribution of our test statistic $S_R^+$. We first obtain the result under the general model and then under the null hypothesis. As we will see, both results are immediate.
Theorem 2.4.9. Under the general model and the conditions (E.3) and (D.1), the random variable
\[ \frac{S_R^+ - E[S_R^+]}{\sqrt{\mathrm{Var}(S_R^+)}} \]
has a limiting $N(0,1)$ distribution.

Proof: By the last theorem and Theorem 2.4.6, we need only show that the difference in the variances of $S_R^+/\sqrt{n\, n_1 n_2}$ and $T_p^*$ goes to 0 as $n \to \infty$. Note that
\begin{align*}
\mathrm{Var}\left( \frac{S_R^+}{\sqrt{n\, n_1 n_2}} \right) &= \frac{n_1 n_2}{n\, n_1 n_2} \left\{ E[G(X)] - E[G(X)]^2 \right\} \\
&\quad + \frac{n_1 n_2 (n_1 - 1)}{n\, n_1 n_2}\, \mathrm{Var}(F(Y)) + \frac{n_1 n_2 (n_2 - 1)}{n\, n_1 n_2}\, \mathrm{Var}(G(X)) \,;
\end{align*}
hence, $\mathrm{Var}(T_p^*) - \mathrm{Var}(S_R^+/\sqrt{n\, n_1 n_2}) \to 0$ and the result follows from Exercise 2.13.7.
The asymptotic distribution of the test statistic under the null hypothesis follows immediately from this theorem. We record it in the next corollary.

Corollary 2.4.2. Under $H_0\colon F(x) = G(x)$ and (D.1) only, the test statistic $S_R^+$ is approximately $N\left( \frac{n_1 n_2}{2},\, \frac{n_1 n_2 (n+1)}{12} \right)$.
Therefore an asymptotic size $\alpha$ test for $H_0\colon F(x) = G(x)$ versus $H_A\colon F(x) \neq G(x)$ is to reject $H_0$ if $|z| \geq z_{\alpha/2}$, where
\[ z = \frac{S_R^+ - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n+1)}{12}}} \quad \text{and} \quad 1 - \Phi(z_{\alpha/2}) = \alpha/2 \,. \]
Since we approximate a discrete random variable with a continuous one, we think it is advisable in cases of small samples to use a continuity correction. Fix and Hodges (1955) give an Edgeworth approximation to the distribution of $S_R^+$ and Bickel (1974) discusses the error of this approximation.
Since the standard normal distribution function, $\Phi$, is continuous on the entire real line, we can strengthen the convergence in Theorem 2.4.9 to uniform convergence; that is, the distribution function of the standardized MWW converges uniformly to $\Phi$. Using this, it is not hard to show that the standardized critical values of the MWW converge to their counterparts at the standard normal. Thus if $c_{\alpha,n}$ is the MWW critical value defined by $\alpha = P_{H_0}[S_R^+ \leq c_{\alpha,n}]$, then
\[ \frac{c_{\alpha,n} - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n+1)}{12}}} \to -z_\alpha \,, \tag{2.4.10} \]
where $1 - \alpha = \Phi(z_\alpha)$; see Exercise 2.13.8 for details. This result will prove useful in the next section.
We now consider when the test based on $S_R^+$ is consistent. Consider the general set up; i.e., $X_1, \ldots, X_{n_1}$ is a random sample with distribution function $F(x)$ and $Y_1, \ldots, Y_{n_2}$ is a random sample with distribution function $G(x)$. Consider the hypotheses
\[ H_0\colon F = G \quad \text{versus} \quad H_{A1}\colon F(x) \geq G(x) \text{ with } F(x_0) > G(x_0) \text{ for some } x_0 \in \mathrm{Int}(\mathcal{S}(F) \cap \mathcal{S}(G)) \,. \tag{2.4.11} \]
Such an alternative is called a stochastically ordered alternative. The next theorem shows that the MWW test statistic is consistent for this alternative. Likewise it is consistent for the other one sided stochastically ordered alternative with $F$ and $G$ interchanged, $H_{A2}$, and, also, for the two sided alternative which consists of the union of $H_{A1}$ and $H_{A2}$. These results imply that the MWW test is consistent for location alternatives, provided $F$ and $G$ have overlapping support. As Exercise 2.13.9 shows, it will also be consistent when one support is shifted to the right of the other support.
Theorem 2.4.10. Suppose that the assumptions (D.1), (2.4.7), and (E.3), (2.4.8), hold. Under the stochastic ordering alternatives given above, $S_R^+$ is a consistent test.

Proof: Assume the stochastic ordering alternative $H_{A1}$, (2.4.11). For an arbitrary level $\alpha$, select the critical level $c_\alpha$ such that the test that rejects $H_0$ if $S_R^+ \geq c_\alpha$ has asymptotic level $\alpha$. We want to show that the power of the test goes to 1 as $n \to \infty$. Since $F(x_0) > G(x_0)$ for some point $x_0$ in the interior of $\mathcal{S}(F) \cap \mathcal{S}(G)$, there exists an interval $N$ such that $F(x) > G(x)$ on $N$. Hence
\[ E_{H_A}[G(X)] = \int_N G(y) f(y)\, dy + \int_{N^c} G(y) f(y)\, dy < \int_N F(y) f(y)\, dy + \int_{N^c} F(y) f(y)\, dy = \frac{1}{2} \,. \tag{2.4.12} \]
The power of the test is given by
\[ P_{H_A}\left[ S_R^+ \geq c_\alpha \right] = P_{H_A}\left[ \frac{S_R^+ - E_{H_A}(S_R^+)}{\sqrt{V_{H_A}(S_R^+)}} \geq \frac{c_\alpha - (n_1 n_2/2)}{\sqrt{V_{H_A}(S_R^+)}} + \frac{(n_1 n_2/2) - E_{H_A}(S_R^+)}{\sqrt{V_{H_A}(S_R^+)}} \right] \,. \]
Note by (2.4.10) that
\[ \frac{c_\alpha - (n_1 n_2/2)}{\sqrt{V_{H_A}(S_R^+)}} = \frac{c_\alpha - (n_1 n_2/2)}{\sqrt{V_{H_0}(S_R^+)}} \cdot \frac{\sqrt{V_{H_0}(S_R^+)}}{\sqrt{V_{H_A}(S_R^+)}} \to \kappa z_\alpha \,, \]
where $\kappa$ is a real number (since the variances are of the same order). But by (2.4.12),
\[ \frac{(n_1 n_2/2) - E_{H_A}(S_R^+)}{\sqrt{V_{H_A}(S_R^+)}} = \frac{(n_1 n_2/2) - n_1 n_2 \left[ 1 - E_{H_A}(G(X)) \right]}{\sqrt{V_{H_A}(S_R^+)}} = \frac{n_1 n_2 \left( -\frac{1}{2} + E_{H_A}(G(X)) \right)}{\sqrt{V_{H_A}(S_R^+)}} \,, \]
which diverges to $-\infty$, since by (2.4.12) the numerator is negative and of order $n^2$ while the denominator is of order $n^{3/2}$.
By Theorem 2.4.9, under $H_A$ the random variable $(S_R^+ - E_{H_A}(S_R^+))/\sqrt{V_{H_A}(S_R^+)}$ converges in distribution to a standard normal variate. Since the convergence is uniform, it follows from the above limits that the power converges to 1. Hence the MWW test is consistent.
2.4.2 Confidence Intervals
Consider the location model (2.2.4). We next obtain a distribution free confidence interval for $\Delta$ by inverting the MWW test. As a first step we have the following result on the function $S_R^+(\Delta)$, (2.2.20):

Lemma 2.4.1. $S_R^+(\Delta)$ is a decreasing step function of $\Delta$ which steps down by 1 at each difference $Y_j - X_i$. Its maximum is $n_1 n_2$ and its minimum is 0.

Proof: Let $D_{(1)} \leq \cdots \leq D_{(n_1 n_2)}$ denote the ordered $n_1 n_2$ differences $Y_j - X_i$. The results follow immediately by writing $S_R^+(\Delta) = \#(D_{(i)} > \Delta)$.
Let $\alpha$ be given and choose $c_{\alpha/2}$ to be the lower $\alpha/2$ critical point of the MWW distribution; i.e., $P_\Delta[S_R^+(\Delta) \leq c_{\alpha/2}] = \alpha/2$. By the above lemma we have
\[ 1 - \alpha = P_\Delta\left[ c_{\alpha/2} < S_R^+(\Delta) < n_1 n_2 - c_{\alpha/2} \right] = P_\Delta\left[ D_{(c_{\alpha/2}+1)} \leq \Delta < D_{(n_1 n_2 - c_{\alpha/2})} \right] \,. \]
Thus $[D_{(c_{\alpha/2}+1)},\, D_{(n_1 n_2 - c_{\alpha/2})})$ is a $(1-\alpha)100\%$ confidence interval for $\Delta$; compare (1.3.30). From the asymptotic null distribution theory for $S_R^+$, Corollary 2.4.2, we can approximate $c_{\alpha/2}$ as
\[ c_{\alpha/2} \doteq \frac{n_1 n_2}{2} - z_{\alpha/2} \sqrt{\frac{n_1 n_2 (n+1)}{12}} - .5 \,. \tag{2.4.13} \]
2.4.3 Statistical Properties of the Inference Based on the MWW
In this section we derive the efficiency properties of the MWW test statistic and properties of its power function under the location model (2.2.4).
We begin with an investigation of the power function of the MWW test. For definiteness we will consider the one sided alternative,
\[ H_0\colon \Delta = 0 \quad \text{versus} \quad H_A\colon \Delta > 0 \,. \tag{2.4.14} \]
Results similar to those given below can be obtained for the power function of the other one sided and the two sided alternatives. Given a level $\alpha$, let $c_{\alpha, n_1, n_2}$ denote the upper critical value for the MWW test of this hypothesis; hence, the test rejects $H_0$ if $S_R^+ \geq c_{\alpha, n_1, n_2}$. The power function of this test is given by
\[ \gamma(\Delta) = P_\Delta\left[ S_R^+ \geq c_{\alpha, n_1, n_2} \right] \,, \tag{2.4.15} \]
where the subscript $\Delta$ on $P$ denotes that the probability is determined when the true parameter is $\Delta$. Recall that $S_R^+(\Delta) = \#\{Y_j - X_i > \Delta\}$.
The following theorem will prove useful; its proof is similar to that of Theorem 1.3.1 of Chapter 1 and the more general result, Theorem A.2.4, of the Appendix.

Theorem 2.4.11. For all $t$, $P_\Delta[S_R^+(0) \leq t] = P_0[S_R^+(-\Delta) \leq t]$.
From Lemma 2.4.1 and Theorem 2.4.11 we have our first important result on the power function of the MWW test; namely, that it is monotone.

Theorem 2.4.12. For the above hypotheses (2.4.14), the function $\gamma(\Delta)$ is monotonically increasing in $\Delta$.

Proof: Let $\Delta_1 < \Delta_2$. Then $-\Delta_2 < -\Delta_1$ and, hence, from Lemma 2.4.1, we have $S_R^+(-\Delta_2) \geq S_R^+(-\Delta_1)$. By applying Theorem 2.4.11, the desired result, $\gamma(\Delta_2) \geq \gamma(\Delta_1)$, follows from the following:
\begin{align*}
1 - \gamma(\Delta_2) &= P_{\Delta_2}[S_R^+(0) < c_{\alpha, n_1, n_2}] \\
&= P_0[S_R^+(-\Delta_2) < c_{\alpha, n_1, n_2}] \\
&\leq P_0[S_R^+(-\Delta_1) < c_{\alpha, n_1, n_2}] \\
&= P_{\Delta_1}[S_R^+(0) < c_{\alpha, n_1, n_2}] = 1 - \gamma(\Delta_1) \,.
\end{align*}
From this we immediately have that the MWW test is unbiased; that is, its power function evaluated at an alternative is always at least as large as its level of significance. We state it as a corollary.

Corollary 2.4.3. For the above hypotheses (2.4.14), $\gamma(\Delta) \geq \alpha$ for all $\Delta > 0$.

A more general null hypothesis is given by
\[ H_0'\colon \Delta \leq 0 \quad \text{versus} \quad H_A\colon \Delta > 0 \,. \]
If $T$ is any test for these hypotheses with critical region $C$, then we say $T$ is a size $\alpha$ test provided
\[ \sup_{\Delta \leq 0} P_\Delta[T \in C] = \alpha \,. \]
For a selected $\alpha$, it follows from the monotonicity of the MWW power function that the MWW test has size $\alpha$ for this more general null hypothesis.
From the above theorems, we have that the MWW power function is monotonically increasing in $\Delta$. Since $S_R^+(\Delta)$ achieves its maximum for $\Delta$ finite, we have by Theorem 1.5.2 of Chapter 1 that the MWW test is resolving; hence, its power function approaches one as $\Delta \to \infty$. Even for the location model, though, we cannot get the power function of the MWW test in closed form. For local alternatives, however, we can obtain an asymptotic expression for the power function. Applications of this result include sample size determination for the MWW test and efficiency comparisons of the MWW with other tests, both of which we consider.
We will need the assumption that the density $f(x)$ has finite Fisher information, i.e.,
\[ (E.1)\colon \quad f \text{ is absolutely continuous, } 0 < I(f) = \int_0^1 \varphi_f^2(u)\, du < \infty \,, \tag{2.4.16} \]
where
\[ \varphi_f(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))} \,. \tag{2.4.17} \]
As discussed in Section 3.4, assumption (E.1) implies that $f$ is uniformly bounded.
Once again we will consider the one sided alternative (2.4.14) (similar results hold for the other one sided and two sided alternatives). Consider a sequence of local alternatives of the form
\[ H_{An}\colon \Delta_n = \frac{\delta}{\sqrt{n}} \,, \tag{2.4.18} \]
where $\delta > 0$ is arbitrary but fixed.
As a first step, we need to show that $S_R^+(\Delta)$ is Pitman regular as discussed in Chapter 1. Let $\overline{S}_R^+(\Delta) = S_R^+(\Delta)/(n_1 n_2)$. We need to verify the four conditions of Definition 1.5.3 of Chapter 1. The first condition is true by Lemma 2.4.1 and the fourth condition follows from Corollary 2.4.2. By (b) of Theorem 2.4.4, we have
\[ \mu(\Delta) = E_\Delta\left[ \overline{S}_R^+(0) \right] = 1 - E[F(X - \Delta)] \,. \tag{2.4.19} \]
By assumption (E.1), (2.4.16), $\int f^2(x)\, dx \leq \sup f \int f(x)\, dx < \infty$. Hence, differentiating (2.4.19), we obtain $\mu'(0) = \int f^2(x)\, dx > 0$ and, thus, the second condition is true. Hence we need only show that the third condition, asymptotic linearity of $\overline{S}_R^+(\Delta)$, is true. This will follow provided we can show the variance condition (1.5.17) of Theorem 1.5.6 is true. Note that
\[ \overline{S}_R^+(\delta/\sqrt{n}) - \overline{S}_R^+(0) = -(n_1 n_2)^{-1}\, \#(0 < Y_j - X_i \leq \delta/\sqrt{n}) \,. \]
This is similar to the MWW statistic itself. Using essentially the same argument as that for the variance of the MWW statistic, Theorem 2.4.5, we get
\[ n \mathrm{Var}_0\left[ \overline{S}_R^+(\delta/\sqrt{n}) - \overline{S}_R^+(0) \right] = \frac{n}{n_1 n_2}(a_n - a_n^2) + \frac{n(n_1 - 1)}{n_1 n_2}(b_n - c_n) + \frac{n(n_2 - 1)}{n_1 n_2}(d_n - a_n^2) \,, \]
where $a_n = E_0[F(X + \delta/\sqrt{n}) - F(X)]$, $b_n = E_0[(F(Y) - F(Y - \delta/\sqrt{n}))^2]$, $c_n = E_0^2[F(Y) - F(Y - \delta/\sqrt{n})]$, and $d_n = E_0[(F(X + \delta/\sqrt{n}) - F(X))^2]$. Using the Lebesgue Dominated Convergence Theorem, it is easy to see that $a_n$, $b_n$, $c_n$, and $d_n$ all converge to 0. Therefore Condition (1.5.17) of Theorem 1.5.6 holds and we have thus established the asymptotic linearity result given by:
\[ \sup_{|\delta| \leq B} \left| n^{1/2} \overline{S}_R^+(\delta/\sqrt{n}) - n^{1/2} \overline{S}_R^+(0) + \delta \int f^2(x)\, dx \right| \stackrel{P}{\to} 0 \,, \tag{2.4.20} \]
for any $B > 0$. Therefore, it follows that $\overline{S}_R^+(\Delta)$ is Pitman regular.
In order to get the efficacy of the MWW test, we need the quantity $\sigma^2(0)$ defined by
\[ \sigma^2(0) = \lim_{n \to \infty} n \mathrm{Var}_0(\overline{S}_R^+(0)) = \lim_{n \to \infty} \frac{n\, n_1 n_2 (n+1)}{n_1^2 n_2^2\, 12} = (12 \lambda_1 \lambda_2)^{-1} \,; \]
see expression (1.5.12). Therefore by (1.5.11) the efficacy of the MWW test is
\[ c_{MWW} = \mu'(0)/\sigma(0) = \sqrt{\lambda_1 \lambda_2}\, \sqrt{12} \int f^2(x)\, dx = \sqrt{\lambda_1 \lambda_2}\, \tau^{-1} \,, \tag{2.4.21} \]
where $\tau$ is the scale parameter given by
\[ \tau = \left( \sqrt{12} \int f^2(x)\, dx \right)^{-1} \,. \tag{2.4.22} \]
In Exercise 2.13.10 it is shown that the efficacy of the two sample pooled $t$-test is $\sqrt{\lambda_1 \lambda_2}\, \sigma^{-1}$, where $\sigma^2$ is the common variance of $X$ and $Y$. Hence the efficiency of the MWW test to the two sample $t$ test is the ratio $\sigma^2/\tau^2$. This of course is the same efficiency as that of the signed rank Wilcoxon test to the one sample $t$ test; see (1.7.13). In particular, if the distribution of $X$ is normal, then the efficiency of the MWW test to the two sample $t$ test is .955. For heavier tailed distributions, this efficiency is usually larger than 1; see Example 1.7.1.
As in Chapter 1 it is convenient to summarize the asymptotic linearity result as follows:
\[ \sqrt{n}\left( \frac{\overline{S}_R^+(\delta/\sqrt{n}) - \mu(0)}{\sigma(0)} \right) = \sqrt{n}\left( \frac{\overline{S}_R^+(0) - \mu(0)}{\sigma(0)} \right) - \delta c_{MWW} + o_p(1) \,, \tag{2.4.23} \]
uniformly for $|\delta| \leq B$ and any $B > 0$.
The next theorem is the asymptotic power lemma for the MWW test. As in Chapter 1 (see Theorem 1.5.8), its proof follows from the Pitman regularity of the MWW test.

Theorem 2.4.13. Under the sequence of local alternatives, (2.4.18),
\[ \lim_{n \to \infty} \gamma(\Delta_n) = P_0\left[ Z \geq z_\alpha - \delta c_{MWW} \right] = 1 - \Phi\left( z_\alpha - \delta \sqrt{12 \lambda_1 \lambda_2} \int f^2 \right) \,, \]
where $Z$ is $N(0,1)$.
In Exercise 2.13.10, it is shown that if $\gamma_{LS}(\Delta)$ denotes the power function of the usual two sample $t$-test, then
\[ \lim_{n \to \infty} \gamma_{LS}(\Delta_n) = 1 - \Phi\left( z_\alpha - \delta \frac{\sqrt{\lambda_1 \lambda_2}}{\sigma} \right) \,, \tag{2.4.24} \]
where $\sigma^2$ is the common variance of $X$ and $Y$. By comparing these two power functions, it is seen that the Wilcoxon is asymptotically more powerful if $\tau < \sigma$, i.e., if $e = c^2_{MWW}/c^2_t > 1$.
As an application of the asymptotic power lemma, we consider sample size determination. Consider the MWW test for the one sided hypothesis (2.4.14). Suppose the level, $\alpha$, and the power, $\beta$, for a particular alternative $\Delta_A$ are specified. For convenience, assume equal sample sizes, i.e. $n_1 = n_2 = n^*$, where $n^*$ denotes the common sample size; hence, $\lambda_1 = \lambda_2 = 2^{-1}$. Express $\Delta_A$ as $\sqrt{2n^*}\Delta_A/\sqrt{2n^*}$. Then by Theorem 2.4.13 we have
\[ \beta \doteq 1 - \Phi\left( z_\alpha - \sqrt{\tfrac{1}{4}}\, \frac{\sqrt{2n^*}\, \Delta_A}{\tau} \right) \,. \]
But this implies
\[ z_\beta = z_\alpha - \frac{\sqrt{2n^*}\, \Delta_A}{2\tau} \quad \text{and} \tag{2.4.25} \]
\[ n^* = \left( \frac{(z_\alpha - z_\beta)\, \tau}{\Delta_A} \right)^2 2 \,. \]
The above value of $n^*$ is the approximate sample size. Note that it does depend on $\tau$ which, in applications, would have to be guessed or estimated in a pilot study; see the discussion in Section 2.4.5 (estimates of $\tau$ are discussed in Sections 2.4.5 and 3.7.1). For a specified distribution it can be evaluated; for instance, if the underlying density is assumed to be normal with standard deviation $\sigma$ then $\tau = \sqrt{\pi/3}\, \sigma$.
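This sample size rule codes directly; a sketch, with the normal-model value $\tau = \sqrt{\pi/3}\,\sigma$ used in the illustration (the function name nstar.mww is ours):

    # Approximate common sample size n* for the one-sided level-alpha MWW test
    # to achieve power beta at the alternative Delta_A, given tau; see (2.4.25).
    nstar.mww <- function(alpha, beta, DeltaA, tau) {
      ceiling(2 * ((qnorm(1 - alpha) - qnorm(1 - beta)) * tau / DeltaA)^2)
    }
    # Example: alpha = .05, power = .80, Delta_A = 0.5, normal errors with sigma = 1:
    nstar.mww(0.05, 0.80, DeltaA = 0.5, tau = sqrt(pi / 3))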
Using (2.4.24), a similar derivation can be obtained for the usual two sample $t$-test, resulting in an approximate sample size of
\[ n^*_{LS} = \left( \frac{(z_\alpha - z_\beta)\, \sigma}{\Delta_A} \right)^2 2 \,. \]
The ratio of the sample size needed by the MWW test to that of the two sample $t$ test is $\tau^2/\sigma^2$. This provides additional motivation for the definition of efficiency.
2.4.4 Estimation of $\Delta$
Recall from the geometry earlier in this chapter that the estimate of $\Delta$ based on the rank pseudo-norm is $\widehat{\Delta}_R = \mathrm{med}_{i,j}\{Y_j - X_i\}$, (2.2.18). We now obtain several properties of this estimate, including its asymptotic distribution. This will lead again to the efficiency properties of the rank based methods discussed in the last section.
For convenience, we note some equivariances of $\widehat{\Delta}_R = \widehat{\Delta}(\mathbf{Y}, \mathbf{X})$, which are established in Exercise 2.13.11. First, $\widehat{\Delta}_R$ is translation equivariant; i.e.,
\[ \widehat{\Delta}_R(\mathbf{Y} + \Delta + \theta,\, \mathbf{X} + \theta) = \widehat{\Delta}_R(\mathbf{Y}, \mathbf{X}) + \Delta \,, \]
for any $\Delta$ and $\theta$. Second, $\widehat{\Delta}_R$ is scale equivariant; i.e.,
\[ \widehat{\Delta}_R(a\mathbf{Y}, a\mathbf{X}) = a\, \widehat{\Delta}_R(\mathbf{Y}, \mathbf{X}) \,, \]
for any $a$. Based on these, we next show that $\widehat{\Delta}_R$ is an unbiased estimate of $\Delta$ under certain conditions.
Theorem 2.4.14. If the errors, $e_i^*$, in the location model (2.2.4) are symmetrically distributed about 0, then $\widehat{\Delta}_R$ is symmetrically distributed about $\Delta$.

Proof: Due to translation equivariance there is no loss of generality in assuming that $\Delta$ and $\theta$ are 0. Then $Y$ and $X$ are symmetrically distributed about 0; hence, $\mathcal{L}(Y) = \mathcal{L}(-Y)$ and $\mathcal{L}(X) = \mathcal{L}(-X)$. Thus from the above equivariance properties we have
\[ \mathcal{L}(-\widehat{\Delta}(\mathbf{Y}, \mathbf{X})) = \mathcal{L}(\widehat{\Delta}(-\mathbf{Y}, -\mathbf{X})) = \mathcal{L}(\widehat{\Delta}(\mathbf{Y}, \mathbf{X})) \,. \]
Therefore $\widehat{\Delta}_R$ is symmetrically distributed about 0, and, in general, it is symmetrically distributed about $\Delta$.

Theorem 2.4.15. Under Model (2.2.4), if $n_1 = n_2$ then $\widehat{\Delta}_R$ is symmetrically distributed about $\Delta$.
The reader is asked to prove this in Exercise 2.13.12. In general, $\widehat{\Delta}_R$ may be biased if the error distribution is not symmetrically distributed, but as the following result shows, $\widehat{\Delta}_R$ is always asymptotically unbiased. Since the MWW process $S_R^+(\Delta)$ was shown to be Pitman regular, the asymptotic distribution of $\sqrt{n}(\widehat{\Delta}_R - \Delta)$ is $N(0, c_{MWW}^{-2})$. In practice, we say that
\[ \widehat{\Delta}_R \text{ has an approximate } N(\Delta,\, \tau^2 (n_1^{-1} + n_2^{-1})) \text{ distribution} \,, \]
where $\tau$ was defined in (2.4.22).
Recall from Definition 1.5.4 of Chapter 1 that the asymptotic relative efficiency of two Pitman regular estimators is the reciprocal of the ratio of their asymptotic variances. As Exercise 2.13.10 shows, the least squares estimate $\widehat{\Delta}_{LS} = \overline{Y} - \overline{X}$ of $\Delta$ is approximately $N\left( \Delta,\, \sigma^2\left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right)$; hence,
\[ e(\widehat{\Delta}_R, \widehat{\Delta}_{LS}) = \frac{\sigma^2}{\tau^2} = 12 \sigma_f^2 \left( \int f^2(x)\, dx \right)^2 \,. \]
This agrees with the asymptotic relative efficiency results for the MWW test relative to the $t$ test and (1.7.13).
2.4.5 Efficiency Results Based on Confidence Intervals
Let $L_{1-\alpha}$ be the length of the $(1-\alpha)100\%$ distribution free confidence interval based on the MWW statistic discussed in Section 2.4.2. Since this interval is based on the Pitman regular process $S_R^+(\Delta)$, it follows from Theorem 1.5.9 of Chapter 1 that
\[ \sqrt{\frac{n_1 n_2}{n}}\, \frac{L_{1-\alpha}}{2 z_{\alpha/2}} \stackrel{P}{\to} \tau \,; \tag{2.4.26} \]
that is, the standardized length of a distribution-free confidence interval is a consistent estimate of the scale parameter $\tau$. It further follows from (2.4.26) that, as in Chapter 1, if efficiency is based on the relative squared asymptotic lengths of confidence intervals, then we obtain the same efficiency results as quoted above for tests and estimates.
In the RBR computational function twosampwil a simple degree of freedom adjustment is made in the estimation of $\tau$. In the traditional LS analysis based on the pooled $t$, this adjustment is equivalent to dividing the pooled estimate of variance by $n_1 + n_2 - 2$ instead of $n_1 + n_2$. Hence, as our estimate of $\tau$, the function twosampwil uses
\[ \widehat{\tau} = \sqrt{\frac{n_1 + n_2}{n_1 + n_2 - 2}}\, \sqrt{\frac{n_1 n_2}{n}}\, \frac{L_{1-\alpha}}{2 z_{\alpha/2}} \,. \tag{2.4.27} \]
Thus the standard error (SE) of the estimator $\widehat{\Delta}_R$ is given by $\widehat{\tau}\sqrt{(1/n_1) + (1/n_2)}$.
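The estimate (2.4.27) and the SE are simple functions of the interval length; a sketch (the helper name tau.se is ours, and the sample sizes $n_1 = 9$, $n_2 = 7$ in the illustration are what we take the coin data of Example 2.4.1 below to be, since they reproduce its reported value 0.595):

    # tau-hat of (2.4.27) from the length L of the (1 - alpha)100% MWW interval,
    # and the resulting SE of the shift estimate.
    tau.se <- function(L, n1, n2, alpha = 0.05) {
      n <- n1 + n2
      tauhat <- sqrt(n / (n - 2)) * sqrt(n1 * n2 / n) * L / (2 * qnorm(1 - alpha / 2))
      c(tau = tauhat, SE = tauhat * sqrt(1 / n1 + 1 / n2))
    }
    tau.se(L = 1.10, n1 = 9, n2 = 7)   # tau-hat = 0.595, as in Example 2.4.1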
The distribution free confidence interval is not symmetric about $\widehat{\Delta}_R$. Often in practice symmetric intervals are desired. Based on the asymptotic distribution of $\widehat{\Delta}_R$ we can formulate the approximate interval
\[ \widehat{\Delta}_R \pm z_{\alpha/2}\, \widehat{\tau} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \,, \tag{2.4.28} \]
where $\widehat{\tau}$ is a consistent estimate of $\tau$. If we use (2.4.26) as our estimate of $\tau$ with the level $\alpha$, then the confidence interval simplifies to
\[ \widehat{\Delta}_R \pm \frac{L_{1-\alpha}}{2} \,. \tag{2.4.29} \]
Besides the estimate given in (2.4.26), a consistent estimate of $\tau$ was proposed by Koul, Sievers and McKean (1987) and will be discussed in Section 3.7. Using this estimate, small sample studies indicate that $z_{\alpha/2}$ should be replaced by the $t$ critical value $t_{(\alpha/2, n-1)}$; see McKean and Sheather (1991) for a review of small sample studies on R-estimates. In this case, the symmetric confidence interval based on $\widehat{\Delta}_R$ is directly analogous to the usual $t$ interval based on least squares, in that the only difference is that $\widehat{\sigma}$ is replaced by $\widehat{\tau}$.
Example 2.4.1. Hendy and Charles Coin Data, continued from Examples 1.11.1 and 2.3.2
Recall from Chapter 1 that this example concerned the silver content in two coinages (the first and the fourth) minted during the reign of Manuel I. The data are given in Chapter 1. The Hodges-Lehmann estimate of the difference between the first and the fourth coinage is 1.10 percent of silver, and a 95% confidence interval for the difference is (.60, 1.70). The length of this confidence interval is 1.10; hence, the estimate of $\tau$ given in expression (2.4.27) is 0.595. The symmetrized confidence interval (2.4.28) based on the $t$ upper .025 critical value is (0.46, 1.74). Both of these intervals are in agreement with the confidence interval obtained in Example 1.11.1 based on the two $L_1$ confidence intervals.
Another estimate of $\tau$ can be obtained from a similar consideration of the distribution free confidence intervals based on the signed-rank statistic discussed in Chapter 1; see Exercise 2.13.13. Note in this case, though, that for consistency we would have to assume that $f$ is symmetric.
2.5 General Rank Scores
In this section we will be concerned with the location model; i.e., $X_1, \ldots, X_{n_1}$ are iid $F(x)$, $Y_1, \ldots, Y_{n_2}$ are iid $G(x) = F(x - \Delta)$, and the samples are independent of one another. We will present an analysis for this problem based on general rank scores. In this terminology, the Mann-Whitney-Wilcoxon procedures are based on a linear score function. We will present the results for the hypotheses
\[ H_0\colon \Delta = 0 \quad \text{versus} \quad H_A\colon \Delta > 0 \,. \tag{2.5.1} \]
The results for the other one sided and two sided alternatives are similar. We will also be concerned with estimation and confidence intervals for $\Delta$. As in the preceding sections, we will first present the geometry.
Recall that the pseudo-norm which generated the MWW analysis could be written as a linear combination of ranks times residuals. This is easily generalized. Consider the function
\[ \|\mathbf{u}\|_* = \sum_{i=1}^{n} a(R(u_i)) u_i \,, \tag{2.5.2} \]
where the scores $a(i)$ are such that $a(1) \leq \cdots \leq a(n)$ and $\sum a(i) = 0$. For the next theorem, we will also assume that $a(i) = -a(n+1-i)$; although, this is only used to show the scalar multiplicative property.

Theorem 2.5.1. Suppose that $a(1) \leq \cdots \leq a(n)$, $\sum a(i) = 0$, and $a(i) = -a(n+1-i)$. Then the function $\|\cdot\|_*$ is a pseudo-norm.
Proof: By the connection between ranks and order statistics we can write
\[ \|\mathbf{u}\|_* = \sum_{i=1}^{n} a(i) u_{(i)} \,. \]
Next suppose that $u_{(j)}$ is the last order statistic with a negative score. Since the scores sum to 0, we can write
\[ \|\mathbf{u}\|_* = \sum_{i=1}^{n} a(i)(u_{(i)} - u_{(j)}) = \sum_{i \leq j} a(i)(u_{(i)} - u_{(j)}) + \sum_{i \geq j} a(i)(u_{(i)} - u_{(j)}) \,. \tag{2.5.3} \]
Both terms on the right side are nonnegative; hence, $\|\mathbf{u}\|_* \geq 0$. Since all the terms in (2.5.3) are nonnegative, $\|\mathbf{u}\|_* = 0$ implies that all the terms are zero. But since the scores are not all 0, yet sum to zero, we must have $a(1) < 0$ and $a(n) > 0$. Hence we must have $u_{(1)} = u_{(j)} = u_{(n)}$; i.e., $u_{(1)} = \cdots = u_{(n)}$. Conversely, if $u_{(1)} = \cdots = u_{(n)}$ then $\|\mathbf{u}\|_* = 0$. By the condition $a(i) = -a(n+1-i)$ it follows that $\|\alpha\mathbf{u}\|_* = |\alpha|\, \|\mathbf{u}\|_*$; see Exercise 2.13.16.
In order to complete the proof we need to show that the triangle inequality holds. This is established by the following string of inequalities:
\begin{align*}
\|\mathbf{u} + \mathbf{v}\|_* &= \sum_{i=1}^{n} a(R(u_i + v_i))(u_i + v_i) \\
&= \sum_{i=1}^{n} a(R(u_i + v_i)) u_i + \sum_{i=1}^{n} a(R(u_i + v_i)) v_i \\
&\leq \sum_{i=1}^{n} a(i) u_{(i)} + \sum_{i=1}^{n} a(i) v_{(i)} = \|\mathbf{u}\|_* + \|\mathbf{v}\|_* \,.
\end{align*}
The proof of the above inequality is similar to that of Theorem 1.3.2 of Chapter 1.
Based on a set of scores satisfying the above assumptions, we can establish a rank inference for the two sample problem similar to the MWW analysis. We shall do so for general rank scores of the form
\[ a_\varphi(i) = \varphi\left( \frac{i}{n+1} \right) \,, \tag{2.5.4} \]
where $\varphi(u)$ satisfies the following assumptions:
\[ \begin{cases} \varphi(u) \text{ is a nondecreasing function defined on the interval } (0,1) \\ \int_0^1 \varphi(u)\, du = 0 \text{ and } \int_0^1 \varphi^2(u)\, du = 1 \end{cases} \,; \tag{2.5.5} \]
see (S.1), (3.4.10), in Chapter 3, also. The last assumptions, concerning standardization of the scores, are for convenience. The Wilcoxon scores are generated in this way by the linear function $\varphi_R(u) = \sqrt{12}\,(u - (1/2))$ and the sign scores are generated by $\varphi_S(u) = \mathrm{sgn}(2u - 1)$. We will denote the corresponding pseudo-norm for scores generated by $\varphi(u)$ as
\[ \|\mathbf{u}\|_\varphi = \sum_{i=1}^{n} a_\varphi(R(u_i)) u_i \,. \tag{2.5.6} \]
These two sample sign and Wilcoxon scores are generalizations of the sign and Wilcoxon scores discussed in Chapter 1 for the one sample problem. In Section 1.8 of Chapter 1 we presented one sample analyses based on general score functions. Similar to the sign and Wilcoxon cases, we can generate a two sample score function from any one sample score function. For reference we establish this in the following theorem:

Theorem 2.5.2. As discussed at the beginning of Section 1.8, let $\varphi^+(u)$ be a score function for the one sample problem. For $u \in (-1, 0)$, let $\varphi^+(u) = -\varphi^+(-u)$. Define
\[ \varphi(u) = \varphi^+(2u - 1) \,, \quad \text{for } u \in (0,1) \,, \tag{2.5.7} \]
and
\[ \|\mathbf{x}\|_\varphi = \sum_{i=1}^{n} \varphi(R(x_i)/(n+1))\, x_i \,. \tag{2.5.8} \]
Then $\|\cdot\|_\varphi$ is a pseudo-norm on $R^n$. Furthermore,
\[ \varphi(u) = -\varphi(1 - u) \,, \tag{2.5.9} \]
and
\[ \int_0^1 \varphi^2(u)\, du = \int_0^1 (\varphi^+(u))^2\, du \,. \tag{2.5.10} \]

Proof: As discussed in the beginning of Section 1.8 (see expression (1.8.1)), $\varphi^+(u)$ is a positive valued and nondecreasing function defined on the interval $(0,1)$. Based on these properties, it follows that $\varphi(u)$ is nondecreasing and that $\int_0^1 \varphi(u)\, du = 0$. Hence $\|\cdot\|_\varphi$ is a pseudo-norm on $R^n$. Properties (2.5.9) and (2.5.10) follow readily; see Exercise 2.13.17 for details.
The two sample sign and Wilcoxon scores, cited above, are easily seen to be generated this way from their one sample counterparts $\varphi^+(u) = 1$ and $\varphi^+(u) = \sqrt{3}\, u$, respectively. As discussed further in Section 2.5.3, properties such as efficiencies of the analysis based on the one sample scores are the same for a two sample analysis based on their corresponding two sample scores.
In the notation of (2.2.3), the estimate of $\Delta$ is
\[ \widehat{\Delta}_\varphi = \mathrm{Argmin}\, \|\mathbf{Z} - \mathbf{C}\Delta\|_\varphi \,. \]
Denote the negative of the gradient of $\|\mathbf{Z} - \mathbf{C}\Delta\|_\varphi$ by $S_\varphi(\Delta)$. Then based on (2.5.6),
\[ S_\varphi(\Delta) = \sum_{j=1}^{n_2} a_\varphi(R(Y_j - \Delta)) \,. \tag{2.5.11} \]
Hence $\widehat{\Delta}_\varphi$ equivalently solves the equation
\[ S_\varphi(\widehat{\Delta}_\varphi) \doteq 0 \,. \tag{2.5.12} \]
As with pseudo-norms in general, the function $\|\mathbf{Z} - \mathbf{C}\Delta\|_\varphi$ is a convex function of $\Delta$. The negative of its derivative, $S_\varphi(\Delta)$, is a decreasing step function of $\Delta$ which steps down at the differences $Y_j - X_i$; see Exercise 2.13.18. Unlike the MWW function $S_R(\Delta)$, the step sizes of $S_\varphi(\Delta)$ are not necessarily the same size. Based on MWW starting values, a simple trace algorithm through the differences can be used to obtain the estimator $\widehat{\Delta}_\varphi$. The R function twosampr2 computes the rank-based analysis for general scores.
The gradient rank test statistic for the hypotheses (2.5.1) is
\[ S_\varphi = \sum_{j=1}^{n_2} a_\varphi(R(Y_j)) \,. \tag{2.5.13} \]
Since the test statistic only depends on the ranks of the combined sample, it is distribution free under the null hypothesis. As shown in Exercise 2.13.18,
\[ E_0[S_\varphi] = 0 \tag{2.5.14} \]
\[ \sigma_\varphi^2 = V_0[S_\varphi] = \frac{n_1 n_2}{n(n-1)} \sum_{i=1}^{n} a_\varphi^2(i) \,. \tag{2.5.15} \]
Note that we can write the variance as
\[ \sigma_\varphi^2 = \frac{n_1 n_2}{n-1} \left\{ \sum_{i=1}^{n} a_\varphi^2(i)\, \frac{1}{n} \right\} \doteq \frac{n_1 n_2}{n-1} \,, \tag{2.5.16} \]
where the approximation is due to the fact that the term in braces is a Riemann sum of $\int \varphi^2(u)\, du = 1$ and, hence, converges to 1.
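A sketch of the standardized gradient test for an arbitrary score generator $\varphi$ follows (the helper name score.test is ours). The default gives the Wilcoxon scores $\varphi_R$; since $\Phi^{-1}$ already satisfies the standardization (2.5.5), normal (van der Waerden type) scores can be obtained by passing phi = qnorm.

    # Standardized gradient test S_phi / sigma_phi of (2.5.13) and (2.5.15).
    score.test <- function(x, y, phi = function(u) sqrt(12) * (u - 0.5)) {
      z <- c(x, y); n <- length(z)
      a <- phi(rank(z) / (n + 1))              # scores a_phi(R(Z_i)), (2.5.4)
      a <- a - mean(a)                         # center so that sum a(i) = 0
      S <- sum(a[length(x) + seq_along(y)])    # S_phi: contributions of the Y's
      v <- length(x) * length(y) * sum(a^2) / (n * (n - 1))
      c(S.phi = S, z = S / sqrt(v))
    }
    # e.g., score.test(x, y)           # Wilcoxon scores
    #       score.test(x, y, qnorm)    # normal scores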
It will be convenient from time to time to use rank statistics based on unstandardized scores; i.e., a rank statistic of the form
\[ S_a = \sum_{j=1}^{n_2} a(R(Y_j)) \,, \tag{2.5.17} \]
where $a(i) = \varphi(i/(n+1))$, $i = 1, \ldots, n$, is a set of scores. As Exercise 2.13.18 shows, the null mean $\mu_S$ and null variance $\sigma_S^2$ of $S_a$ are given by
\[ \mu_S = n_2 \overline{a} \quad \text{and} \quad \sigma_S^2 = \frac{n_1 n_2}{n(n-1)} \sum (a(i) - \overline{a})^2 \,. \tag{2.5.18} \]
2.5.1 Statistical Methods
The asymptotic null distribution of the statistic $S_\varphi$, (2.5.13), follows easily from Theorem A.2.1 of the Appendix. To see this, note that we can use the notation (2.2.1) and (2.2.2) to write $S_\varphi$ as a linear rank statistic; i.e.,
\[ S_\varphi = \sum_{i=1}^{n} c_i\, a(R(Z_i)) = \sum_{i=1}^{n} (c_i - \overline{c})\, \varphi\left( \frac{n}{n+1} F_n(Z_i) \right) \,, \tag{2.5.19} \]
where $F_n$ is the empirical distribution function of $Z_1, \ldots, Z_n$. Our score function is monotone and square integrable; hence, the conditions on scores in Section A.2 are satisfied. Also, $F$ is continuous, so the distributional assumption is satisfied. Finally, we need only show that the constants $c_i$ satisfy conditions D.2, (3.4.7), and D.3, (3.4.8). It is a simple exercise to show that
\[ \sum_{i=1}^{n} (c_i - \overline{c})^2 = \frac{n_1 n_2}{n} \qquad \max_{1 \leq i \leq n} (c_i - \overline{c})^2 = \max\left\{ \frac{n_2^2}{n^2},\, \frac{n_1^2}{n^2} \right\} \,. \]
2.5. GENERAL RANK SCORES 103
Under condition (D.1), (2.4.7), 0 <
i
< 1 where lim(n
i
/n) =
i
for i = 1, 2. Using this
along with the last two expressions, it is immediate that Noethers condition, (3.4.9), holds
for the c
i
s. Thus the assumptions of Section A.2 hold for the statistic S

.
As in expression (A.2.7) of Section A.2, dene the random variable T

as
T

=
n

i=1
(c
i
c)(F(Z
i
)) . (2.5.20)
By comparing expressions (2.5.19) and (2.5.20), it seems that the variable T

is an approx-
imation of S

. This follows from Section A.2. Briey, under H


0
the distribution of T

is approximately normal and Var((T

)/

) 0; hence, S

is asymptotically normal
with mean and variance given by expressions (2.5.14) and (2.5.15), respectively. Hence, an
asymptotic level test of the hypotheses (2.5.1) is
Reject H
0
in favor of H
A
, if S

,
where

is dened by (2.5.15).
As discussed above, the estimate

of solves the equation (2.5.12). The interval


(

L
,

U
) is a (1 )100% condence interval for (based on the asymptotic distribution)
provided

L
and

U
solve the equations
S

U
)
.
= z
/2
_
n
1
n
2
n
and S

L
)
.
= z
/2
_
n
1
n
2
n
, (2.5.21)
where 1 (z
/2
) = /2. As with the estimate of , these equations can be easily solved
with an iterative algorithm; see Exercise 2.13.18.
2.5.2 Eciency Results
In order to obtain the eciency results for these statistics, we rst show that the process
S

() is Pitman regular. For general scores we need to further assume that the density
has nite Fisher information; i.e., satises condition (E.1), (2.4.16). Recall that Fisher
information is given by I(f) =
_
1
0

2
F
(u) du, where

f
(u) =
f

(F
1
(u))
f(F
1
(u))
. (2.5.22)
Below we will show that the score function
f
is optimal. Dene the parameter

as,

=
_
(u)
f
(u)du . (2.5.23)
Estimation of

is dicussed in Section 3.7.


104 CHAPTER 2. TWO SAMPLE PROBLEMS
To show that the process S

() is Pitman regular, we show that the four conditions of


Denition 1.5.3 are true. As noted after expression (2.5.12), S

() is nonincreasing; hence,
the rst condition holds. For the second condition, note that we can write
S

() =
n
2

i=1
a(R(Y
i
)) =
n
2

i=1

_
n
1
n + 1
F
n
1
(Y
i
) +
n
2
n + 1
F
n
2
(Y
i
)
_
, (2.5.24)
where F
n
1
and F
n
2
are the empirical cdfs of the samples X
1
, . . . , X
n
1
and Y
1
, . . . , Y
n
2
, respec-
tively. Hence, passing to the limit we have,
E
0
_
1
n
S

()
_

2
_

[
1
F(x) +
2
F(x )]f(x ) dx
=
2
_

[
1
F(x + ) +
2
F(x)]f(x) dx =

() ; (2.5.25)
see Cherno and Savage (1958) for a rigorous proof of the limit. Dierentiating

() and
evaluating the derivative at 0 we obtain

(0) =
1

2
_

[F(t)]f
2
(t) dt
=
1

2
_

[F(t)]
_

(t)
f(t)
_
f(t) dt
=
1

2
_
1
0
(u)
f
(u) du =
1

> 0 . (2.5.26)
Hence, the second condition is satised.
The null asymptotic distribution of S

(0) was established in the Section 2.5.1; hence the


fourth condition is true. Hence, we need only establish asymptotic linearity. This result
follows from the results for general rank regression statistics which are developed in Section
A.2.2 of the Appendix. By Theorem A.2.8 of the Appendix, the asymptotic linearity result
for S

() is given by
1

n
S

(/

n) =
1

n
S

(0)
1

2
+ o
p
(1) , (2.5.27)
uniformly for [[ B, where B > 0 and

is dened in (2.5.23).
Therefore, following Denition 1.5.3 of Chapter 1, the estimating function is Pitman
regular.
By the discussion following (2.5.20), we have that n
1/2
S

(0)/

2
is asymptotically
N(0, 1). The ecacy of the test based on S

is thus given by
c

2
=
1

2
. (2.5.28)
2.5. GENERAL RANK SCORES 105
As with the MWW analysis, several important items follow immediately from Pitman
regularity. Consider rst the behavior of S

under local alternatives. Specically consider


a level test based on S

for the hypothesis (2.5.1) and the sequence of local alternatives


H
n
:
n
= /

n. As in Chapter 1, it is easy to show that the asymptotic power of the


test based on S

is given by
lim
n
P
/

n
[S

] = 1 (z

) . (2.5.29)
Based on this result, sample size determination for the test based on S

can be conducted
similar to that based on the MWW test statistic; see (2.4.25).
Next consider the asymptotic distribution of the estimator

. Recall that the


estimate

solves the equation S

)
.
= 0. Based on Pitman regularity and Theorem
1.5.7 of Chapter 1 the asymptotic distribution

is given by

n(

)
D
N(0,
2

(
1

2
)
1
) ; (2.5.30)
By using (2.5.27) and T

(0) to approximate S

(0), we have the following useful result:

2
1

n
T

(0) +o
p
(1) . (2.5.31)
We want to select scores such that the ecacy c

, (2.5.28), is as large as possible, or


equivalently such that the asymptotic variance of

is as small as possible. How large can


the ecacy be? Similar to (1.8.26), note that we can write

=
_
(u)
f
(u)du
=

_

2
f
(u)du
_
(u)
f
(u)du
_
_

2
f
(u)du
_
_

2
(u)du
=

_

2
f
(u)du . (2.5.32)
The second equation is true since the scores were standardized as above. In the third equation
is a correlation coecient and
_

2
f
(u)du is Fisher location information, (2.4.16), which
we denoted by I(f). By the Rao-Cramer lower bound, the smallest asymptotic variance
obtainable by an asymptotically unbiased estimate is (
1

2
I(f))
1
. Such an estimate is called
asymptotically ecient. Choosing a score function to maximize (2.5.32) is equivalent to
choosing a score function to make = 1. This can be achieved by taking the score function
to be (u) =
f
(u), (2.5.22). The resulting estimate,

, is asymptotically ecient. Of
course this can be accomplished only provided that the form of f is known; see Exercise
2.13.19. Evidently, the closer the chosen score is to
f
, the more powerful the rank analysis
will be.
106 CHAPTER 2. TWO SAMPLE PROBLEMS
In Exercise 2.13.19, the reader is ask to show that the MWW analysis is asymptotically
ecient if the errors have a logistic distribution. For normal errors, it follows in a few
steps from expression (2.4.17) that the optimal scores are generated by the normal scores
function,

N
(u) =
1
(u) , (2.5.33)
where (u) is the distribution function of a standard normal random variable. Exercise
2.13.19 shows that this score function is standardized. These scores yield an asymptotically
ecient analysis if the the errors truly have a normal distribution and, further, e(
N
, L
2
) 1;
see Theorem 1.8.1. Also, unlike the Mann-Whitney-Wilcoxon analysis, the estimate of the
shift based on the normal scores cannot be obtained in closed form. But as mentioned
above for general scores, provided the score function is nondecreasing, simple iterative al-
gorithms can be used to obtain the estimate and the corresponding condence interval for
. In the next sections we will discuss analyses that are asymptotically ecient for other
distributions.
Example 2.5.1. Quail Data, continued from Example 2.3.1
In the larger study, McKean et al. (1989), from which these data were drawn, the re-
sponses were positively skewed with long right tails; although, outliers frequently occurred
in the left tail also. McKean et al. conducted an investigation of estimates of the score func-
tions for over 20 of these experiments. Classes of simple scores which seemed appropriate
for such data were piecewise linear with one piece which is linear on the rst part on the
interval (0, b) and with a second piece which is constant on the second part (b, 1); i.e., scores
of the form

b
(u) =
_
2
b(2b)
u 1 if 0 < u < b
b
2b
if b u < 1
. (2.5.34)
These scores are optimal for densities with left logistic and right exponential tails; see Ex-
ercise 2.13.19. A value of b which seemed appropriate for this type of data was 3/4. Let
S
3/4
=

a
3/4
(R(Y
j
)) denote the test statistic based on these scores. The RBR function
phibentr with the argument param = 0.75 computes these scores. Using the RBR func-
tion twosampr2 with the argument score = phibentr, computes the rank-based analysis
for the score function (2.5.34). Assuming that the treated and control observations are in x
and y, respectively, the call and the resulting analysis for a one sided test as computed by
R is:
> tempb = twosampr2(x,y,test=T,alt=1,delta0=0,score=phibentr,grad=sphir,
param=.75,alpha=.05,maktable=T)
Test of Delta = 0 Alternative selected is 1
Standardized (z) Test-Statistic 1.787738 and p-vlaue 0.03690915
Estimate 15.5 SE is 7.921817
2.5. GENERAL RANK SCORES 107
95 % Confidence Interval is ( -2 , 28 )
Estimate of the scale parameter tau 20.45404
Comparing p-values, the analysis based on the score function (2.5.34) is a little more precise
than the MWW analysis given in Example 2.3.1. Recall that the data are right skewed, so
this result is not surprising.
For another class of scores similar to (2.5.34), see the discussion around expression (3.10.6)
in Chapter 3.
2.5.3 Connection between One and Two Sample Scores
In Theorem 2.5.2 we discussed how to obtain a corresponding two sample score function
given a one sample score function. Here we reverse the problem, showing how to obtain
a one sample score function from a two sample score function. This will provide a natural
estimate of in (2.2.4). We also show the eciencies and asymptotic properties are the same
for such corresponding scores functions.
Consider the location model but further assume that X has a symmetric distribution.
Then Y also has a symmetric distribution. For associated one sample problems, we could
then use the signed rank methods developed in Chapter 1. What one sample scores should
we select?
First consider what two sample scores would be suitable under symmetry. Assume with-
out loss of generality that X is symmetrically distributed about 0. Recall that the optimal
scores are given by the expression (2.5.22). Using the fact that F(x) = 1 F(x), it is easy
to see (Exercise 2.13.20) that the optimal scores satisfy,

f
(u) =
f
(1 u) , for 0 < u < 1 ,
that is, the optimal score function is odd about
1
2
. Hence for symmetric distributions, it
makes sense to consider two sample scores which are odd about
1
2
.
For this sub-section then assume that the two sample score generating function satises
the property
(S.3) (1 u) = (u) . (2.5.35)
Note that such scores satisfy: (1/2) = 0 and (u) 0 for u 1/2. Dene a one sample
score generating function as

+
(u) =
_
u + 1
2
_
(2.5.36)
and the one sample scores as
a
+
(i) =
+
_
i
n + 1
_
. (2.5.37)
It follows that these one sample scores are nonnegative and nonincreasing.
For example, if we use Wilcoxon two sample scores, that is, scores generated by the
function, (u) =

12
_
u
1
2
_
then the associated one sample score generating function is
108 CHAPTER 2. TWO SAMPLE PROBLEMS

+
(u) =

3u and, hence, the one sample scores are the Wilcoxon signed-rank scores. If
instead we use the two sample sign scores, (u) = sgn(2u 1) then the one sample score
function is
+
(u) = 1. This results in the one sample sign scores.
Suppose we use two sample scores which satisfy (2.5.35) and use the associated one
sample scores. Then the corresponding one and two sample ecacies satisfy
c

=
_

2
c

+ , (2.5.38)
where the ecacies are given by expressions (2.5.28) and (1.8.21). Hence the eciency and
asymptotic properties of the one and two sample analyses are the same. As a nal remark,
if we write the model as in expression (2.2.4), then we can use the rank statistic based on
the two sample to estimate . We next form the residuals Z
i


c
i
. Then using the one
sample scores statistic of Chapter 1, we can estimate based on these residuals, as discussed
in Chapter 1. In terms of a regression problem we are estimating the intercept parameter
based on the residuals after tting the regression coecient . This is discussed in some
detail in Section 3.5.
2.6 L
1
Analyses
In this section, we present analyses based on the L
1
norm and pseudo norm. We discuss the
pseudo norm rst, showing that the corresponding test is the familiar Moods (1950) test.
The test which corresponds to the norm is Mathisens (1943) test.
2.6.1 Analysis Based on the L
1
Pseudo Norm
Consider the sign scores. These are the scores generated by the function (u) = sgn(u1/2).
The corresponding pseudo norm is given by,
|u|

=
n

i=1
sgn
_
R(u
i
)
n + 1
2
_
u
i
. (2.6.1)
This pseudo norm is optimal for double exponential errors; see Exercise 2.13.19.
We have the following relationship between the L
1
pseudo norm and the L
1
norm. Note
that we can write
|u|

=
n

i=1
sgn
_
i
n + 1
2
_
u
(i)
.
Next consider,
n

i=1
[u
(i)
u
(ni+1)
[ =
n

i=1
sgn(u
(i)
u
(ni+1)
)(u
(i)
u
(ni+1)
)
= 2
n

i=1
sgn(u
(i)
u
(ni+1)
)u
(i)
.
2.6. L
1
ANALYSES 109
Finally note that
sgn(u
(i)
u
(ni+1)
) = sgn(i (n i + 1)) = sgn
_
i
n + 1
2
_
.
Putting these results together we have the relationship,
n

i=1
[u
(i)
u
(ni+1)
[ = 2
n

i=1
sgn
_
i
n + 1
2
_
u
(i)
= 2|u|

. (2.6.2)
Recall that the pseudo norm based Wilcoxon scores can be expressed as the sum of all
absolute dierences between the components; see (2.2.17). In contrast the pseudo norm
based on the sign scores only involves the n symmetric absolute dierences [u
(i)
u
(ni+1)
[.
In the two sample location model the corresponding R-estimate based on the pseudo
norm (2.6.1) is a value of which solves the equation
S

() =
n
2

j=1
sgn
_
R(Y
j
)
n + 1
2
_
.
= 0 . (2.6.3)
Note that we are ranking the set X
1
, . . . , X
n
1
, Y
1
, . . . , Y
n
2
which is equivalent to
ranking the set X
1
med X
i
, . . . , X
n
1
med X
i
, Y
1
med X
i
, . . . , Y
n
2
med X
i
.
We must choose so that half of the ranks of the Y part of this set are above (n+1)/2 and
half are below. Note that in the X part of the second set, half of the X part is below 0 and
half is above 0. Thus we need to choose so that half of the Y part of this set is below 0
and half is above 0. This is achieved by taking

= med Y
j
med X
i
. (2.6.4)
This is the same estimate as produced by the L
1
norm, see the discussion following (2.2.5).
We shall refer to the above pseudo norm (2.6.1) as the L
1
pseudo norm. Actually, as
pointed out in Section 2.2, this equivalence between estimates based on the L
1
norm and
the L
1
pseudo norm is true for general regression problems in which the model includes an
intercept, as it does here.
The corresponding test statistic for H
0
: = 0 is

n
2
j=1
sgn(R(Y
j
)
n+1
2
). Note that
the sgn function here is only counting the number of Y
j
s which are above the combined
sample median

M = med X
1
, . . . , X
n
1
, Y
1
, . . . , Y
n
2
minus the number below

M. Hence a
more convenient but equivalent test statistic is
M
+
0
= #(Y
j
>

M) , (2.6.5)
which is called Moods median test statistic; see Mood (1950).
110 CHAPTER 2. TWO SAMPLE PROBLEMS
Testing
Since this L
1
-analysis is based on a rank-based pseudo-norm we could use the general theory
discussed in Section 2.5 to handle the theory for estimation and testing. As we will point
out, though, there are some interesting results pertaining to this analysis.
For the null distribution of M
+
0
, rst assume that n is even. Without loss of generality,
assume that n = 2r and n
1
n
2
. Consider the combined sample as a population of n items,
where n
2
of the items are Y s and n
1
items are Xs. Think of the n/2 items which exceed

M. Under H
0
these items are as likely to be an X as a Y . Hence, M
+
0
, the number of Y s
in the top half of the sample follows the hypergeometric distribution, i.e.,
P(M
+
0
= k) =
_
n
2
k
__
n
1
rk
_
_
n
r
_ k = 0, . . . , n
2
,
where r = n/2. If n is odd the same result holds except in this case r = (n 1)/2. Thus as
a level decision rule, we would reject H
0
: = 0 in favor of H
A
: > 0, if M
+
0
c

,
where c

could be determined from the hypergeometric distribution or approximated by the


binomial distribution. From the properties of the hypergeometic distribution, E
0
[M
+
0
] =
r(n
2
/n) and V
0
[M
+
0
] = (rn
1
n
2
(n r))/(n
2
(n 1)). Under the assumption D.1, (2.4.7), it
follows that the limiting distribution of M
+
0
is normal.
Condence Intervals
Exercise 2.13.21 shows that, for n = 2r,
M
+
0
() = #(Y
j
>

M) =
n
2

i=1
I(Y
(i)
X
(ri+1)
> 0) , (2.6.6)
and furthermore that the n = 2r dierences,
Y
(1)
X
(r)
< Y
(2)
X
(r1)
< < Y
(n
2
)
X
(rn
2
+1)
,
can be ordered only knowing the order statistics from the individual samples. It is further
shown that if k is such that P(M
+
0
k) = /2 then a (1 )100% condence interval for
is given by
(Y
(k+1)
X
(rk)
, Y
(n
2
k)
X
(rn
2
+k+1)
) .
The above condence interval simplies when n
1
= n
2
= m, say. In this case the interval
becomes
(Y
(k+1)
X
(mk)
, Y
(mk)
X
(k+1)
) ,
which is the dierence in endpoints of the two simple L
1
condence intervals (X
(k+1)
, X
(mk)
)
and (Y
(k+1)
, Y
(mk)
) which were discussed in Section 1.11. Using the normal approximation
2.6. L
1
ANALYSES 111
to the hypergeometric we have k = m/2 Z
/2
_
m
2
/(4(2m1)) .5. Hence, the above
two intervals have condence coecient

.
= 1 2
_
k m/2
_
m/4
_
= 1 2
_
z
/2
_
m/(2m1)
_
.
= 1 2
_
z
/2
2
1/2
_
.
For example, for the equal sample size case, a 5% two sided Moods test is equivalent to
rejecting the null hypothesis if the 84% one sample L
1
condence intervals are disjoint.
While this also could be done for the unequal sample sizes case, we recommend the direct
approach of Section 1.11.
Eciency Results
We will obtain the eciency results from the asymptotic distribution of the estimate,

=
med Y
j
med X
i
of . Equivalently, we could obtain the results by asymptotic linearity
that was derived for arbitrary scores in (2.5.27); see Exercise 2.13.22.
Theorem 2.6.1. Under the conditions cited in Example 1.5.2, (L
1
Pitman regularity con-
ditions), and (2.4.7), we have

n(

)
D
N(0, (
1

2
4f
2
(0))
1
) . (2.6.7)
Proof: Without loss of generality assume that and are 0. We can write,

=
_
n
n
2

n
2
med Y
j

_
n
n
1

n
1
med X
i
.
From Example 1.5.2, we have

n
2
med Y
j
=
1
2f(0)
1

n
2
n
2

j=1
sgnY
j
+ o
p
(1)
hence,

n
2
med Y
j
D
Z
2
where Z
2
is N(0, (4f
2
(0))
1
). Likewise

n
1
med X
i
D
Z
1
where Z
1
is N(0, (4f
2
(0))
1
). Since Z
1
and Z
2
are independent, we have that

n

D
(
2
)
1/2
Z
2

(
1
)
1/2
Z
1
which yields the result.
The ecacy of Moods test is thus

2
2f(0). The asymptotic relative eciency of
Moods test to the two-sample t test is 4
2
f
2
(0), while its asymptotic relative eciency with
the MWW test is f
2
(0)/(3(
_
f
2
)
2
). These are the same as the eciency results of the sign
test to the t test and to the Wilcoxon signed-rank test, respectively, that were obtained in
Chapter 1; see Section 1.7.
Example 2.6.1. Quail Data, continued, Example 2.3.1
112 CHAPTER 2. TWO SAMPLE PROBLEMS
For the quail data the median of the combined samples is

M = 64. For the subsequent
test based on Moods test we eliminated the three data points which had this value. Thus
n = 27, n
1
= 9 and n
2
= 18. The value of Moods test statistic is M
+
0
= #(P
j
> 64) = 11.
Since E
H
0
(M
+
0
) = 8.67 and V
H
0
(M
+
0
) = 1.55, the standardized value (using the continuity
correction) is 1.47 with a p-value of .071. Using all the data, the point estimate corresponding
to Moods test is 19 while a 90% condence interval, using the normal approximation, is
(10, 31).
2.6.2 Analysis Based on the L
1
Norm
Another sign type procedure is based on the L
1
norm. Reconsider expression (2.2.7) which is
the partial derivative of the L
1
dispersion function with respect to . We take the parameter
as a nuisance parameter and we estimate it by med X
i
. An aligned sign test procedure
for is then obtained by aligning the Y
j
s with respect to this estimate of . The process
of interest, then, is
S() =
n
2

j=1
sgn(Y
j
med X
i
) .
A test of H
0
: = 0 is based on the statistic
M
+
a
= #(Y
j
> med X
i
) . (2.6.8)
This statistic was proposed by Mathisen (1943) and is also referred to as the control median
test; see Gastwirth (1968). The estimate of obtained by solving S()
.
= 0 is, of course,
the L
1
estimate

= med Y
j
med X
i
.
Testing
Mathisens test statistic, similar to Moods, has a hypergeometric distribution under H
0
.
Theorem 2.6.2. Suppose n
1
is odd and is written as n
1
= 2n

1
+1. Then under H
0
: = 0,
P(M
+
a
= t) =
_
n

1
+t
n

1
__
n
2
t+n

1
n

1
_
_
n
n
1
_ , t = 0, 1, . . . , n
2
.
Proof: The proof will be based on a conditional argument. Given X
(n

1
+1)
= x, M
+
a
is
binomial with n
2
trials and 1 F(x) as the probability of success. The density of X
(n

1
+1)
is
f

(x) =
n
1
!
(n

1
!)
2
(1 F(x))
n

1
F(x)
n

1
f(x) .
2.6. L
1
ANALYSES 113
Using this and the fact that the samples are independent we get,
P(M
+
a
= t) =
_ _
n
2
t
_
(1 F(x))
t
F(x)
n
2
t
f(x)dx
=
_
n
2
t
_
n
1
!
(n

1
!)
2
_
(1 F(x))
t+n

1
F(x)
n

1
+n
2
t
f(x)dx
=
_
n
2
t
_
n
1
!
(n

1
!)
2
_
1
0
(1 u)
t+n

1
u
n

1
+n
2
t
du .
By properties of the function this reduces to the result.
Once again using the conditional argument, we obtain the moments of M
+
a
as
E
0
[M
+
a
] =
n
2
2
(2.6.9)
V
0
[M
+
a
] =
n
2
(n + 1)
4(n
1
+ 2)
; (2.6.10)
see Exercise 2.13.23.
The result when n
1
is even is found in Exercise 2.13.23. For the asymptotic null distribu-
tion of M
+
a
we shall make use of the linearity result for the sign process derived in Chapter
1; see Example 1.5.2.
Theorem 2.6.3. Under H
0
and D.1, (2.4.7), M
+
a
has an approximate N(
n
2
2
,
n
2
(n+1)
4(n
1
+2)
) distri-
bution.
Proof: Assume without loss of generality that the true median of X and Y is 0. Let

= med X
i
. Note that
M
+
a
= (
n
2

j=1
sgn(Y
j

) + n
2
)/2 . (2.6.11)
Clearly under (D.1),

n
2

is bounded in probability. Hence by the asymptotic linearity


result for the L
1
analysis, obtained in Example 1.5.2, we have
n
1/2
2
n
2

j=1
sgn(Y
j

) = n
1/2
2
n
2

j=1
sgn(Y
j
) 2f(0)

n
2

+ o
p
(1) .
But we also have

n
1

= (2f(0)

n
1
)
1
n
1

i=1
sgn(X
i
) + o
p
(1) .
Therefore
n
1/2
2
n
2

j=1
sgn(Y
j

) = n
1/2
2
n
2

j=1
sgn(Y
j
)
_
n
2
/n
1
n
1/2
1
n
1

i=1
sgn(X
i
) + o
p
(1) .
114 CHAPTER 2. TWO SAMPLE PROBLEMS
Note that
n
1/2
2
n
2

j=1
sgn(Y
j
)
D
N(0,
1
1
) .
and
_
n
2
/n
1
n
1/2
1
n
1

i=1
sgn(X
i
)
D
N(0,
2
/
1
) .
The result follows from these asymptotic distributions, the independence of the samples,
expression (2.6.11), and the fact that asymptotically the variance of M
+
a
satises
n
2
(n + 1)
4(n
1
+ 2)
.
= n
2
(4
1
)
1
.
Condence Intervals
Note that M
+
a
() = #(Y
j
>

) = #(Y
j

> ); hence, if k is such that P


0
(M
+
a

k) = /2 then (Y
(k+1)

, Y
(n
2
k)

) is a (1)100% condence interval for . For testing


the two sided hypothesis H
0
: = 0 versus H
A
: ,= 0 we would reject H
0
if 0 is not in
the condence interval. This is equivalent, however, to rejecting if

is not in the interval
(Y
(k+1)
, Y
(n
2
k)
).
Suppose we determine k by the normal approximation. Then
k
.
=
n
2
2
z
/2

n
2
(n + 1)
4(n
1
+ 2)
.5
.
=
n
2
2
z
/2
_
n
2
4
1
.5 .
The condence interval (Y
(k+1)
, Y
(n
2
k)
), is a 100%, ( = 1 2(z
/2
(
1
)
1/2
), condence
interval based on the sign procedure for the sample Y
1
, . . . , Y
n
2
. Suppose we take = .05
and have the equal sample sizes case so that
1
= .5. Then = 1 2(2

2). Hence, the


two sided 5% test rejects H
0
: = 0 if

is not in the condence interval.
Remarks on Eciency
Since the estimator of based on the Mathisen procedure is the same as that of Moods
procedure, the asymptotic relative eciency results for Mathisens procedure are the same as
that of Moods. Using another type of eciency due to Bahadur (1967), Killeen, Hettman-
sperger and Sievers (1972) show it is generally better to compute the median of the smaller
sample.
Curtailed sampling on the Y s is one situation where Mathisens test would be used
instead of Moods test since with Mathisens test an early decision could be made; see
Gastwirth (1968).
Example 2.6.2. Quail Data, continued, Examples 2.3.1 and 2.6.1
2.7. ROBUSTNESS PROPERTIES 115
For this data, med T
i
= 49. Since one of the placebo values was also 49, we eliminated
it in the subsequent computation of Mathisens test. The test statistic has the value M
+
a
=
#(C
j
> 49) = 17. Using n
2
= 19 and n
1
= 10 the null mean and variance are 9.5 and
11.875, respectively. This leads to a standardized test statistic of 2.03 (using the continuity
correction) with a p-value of .021. Utilizing all the data, the corresponding point estimate
and condence interval are 19 and (6, 27). This diers from MWW and Mood analyses; see
Examples 2.3.1 and 2.6.1, respectively.
2.7 Robustness Properties
In this section we obtain the breakdown points and the inuence functions of the L
1
and
MWW estimates. We rst consider the breakdown properties.
2.7.1 Breakdown Properties
We begin with the denition of an equivariant estimator of . For convenience let the
vectors X and Y denote the samples X
1
, . . . , X
n
1
and Y
1
, . . . , Y
n
2
, respectively. Also let
X+ a1 = (X
1
+ a, . . . , X
n
1
+ a)

.
Denition 2.7.1. An estimator

(X, Y) of is said to be an equivariant estimator of
if

(X+ a1, Y) =

(X, Y) a and

(X, Y+ a1) =

(X, Y) + a.
Note that the L
1
estimator and the Hodges-Lehmann estimator are both equivariant
estimators of . Indeed as Exercise 2.13.24 shows any estimator based on the rank pseudo
norms discussed in Section 2.5 are equivariant estimators of . As the following theorem
shows the breakdown point of an equivariant estimator is bounded above by .25.
Theorem 2.7.1. Suppose n
1
n
2
. Then the breakdown point of an equivariant estimator
satises

[(n
1
+ 1)/2] + 1/n, where [] denotes the greatest integer function.
Proof: Let m = [(n
1
+ 1)/2] + 1. Suppose

is an equivariant estimator such that

> m/n. Then the estimator remains bounded if m points are corrupted. Let X

=
(X
1
+ a, . . . , X
m
+ a, X
m+1
, . . . , X
n
1
)

. Since we have corrupted m points there exists a


B > 0 such that
[

(X

, Y)

(X, Y)[ B . (2.7.1)
Next let X

= (X
1
, . . . , X
m
, X
m+1
a, . . . , X
n
1
a)

. Then X

contains n
1
m = [n
1
/2] m
altered points. Therefore,
[

(X

, Y)

(X, Y)[ B . (2.7.2)
Equivariance implies that

(X

, Y) =

(X

, Y) + a. By (2.7.1) we have

(X, Y) B

(X

, Y)

(X, Y) + B (2.7.3)
116 CHAPTER 2. TWO SAMPLE PROBLEMS
while from (2.7.2) we have

(X, Y) B + a

(X

, Y)

(X, Y) + B + a . (2.7.4)
Taking a = 3B leads to a contradiction between (2.7.2) and (2.7.4).
By this theorem the maximum breakdown point of any equivariant estimator is roughly
half of the smaller sample proportion. If the sample sizes are equal then the best possible
breakdown is 1/4.
Example 2.7.1. Breakdown of L
1
and MWW estimates
The L
1
estimator of ,

= med Y
j
med X
i
, achieves the maximal breakdown since
med Y
j
achieves the maximal breakdown in the one sample problem.
The Hodges-Lehmann estimate

R
= med Y
j
X
i
also achieves maximal breakdown.
To see this, suppose we corrupt an X
i
. Then n
2
dierences Y
j
X
i
are corrupted. Hence
between samples we maximize the corruption by corrupting the items in the smaller sample,
so without loss of generality we can assume that n
1
n
2
. Suppose we corrupt m X
i
s. In
order to corrupt med Y
j
X
i
we must corrupt (n
1
n
2
)/2 dierences. Therefore mn
2

(n
1
n
2
)/2; i.e., m n
1
/2. Hence med Y
j
X
i
has maximal breakdown. Based on Exercise
1.12.13 of Chapter 1, the one sample estimate based on the Wilcoxon signed rank statistic
does not achieve the maximal breakdown value of 1/2 in the one sample problem.
2.7.2 Inuence Functions
Recall from Section 1.6.1 that the inuence function of a Pitman regular estimator based
on a single sample X
1
, . . . , X
n
is the function (z) when the estimator has the represen-
tation n
1/2

(X
i
) + o
p
(1). The estimators we are concerned with in this section are
Pitman regular; hence, to determine their inuence functions we need only obtain similar
representations for them.
For the L
1
estimate we have from the proof of Theorem 2.6.1 that

= med Y
j
med X
i
=
1
2f(0)
1

n
_
n
2

j=1
sgn (Y
j
)

n
1

i=1
sgn (X
i
)

1
_
+ o
p
(1) .
Hence the inuence function of the L
1
estimate is
(z) =
_
(
1
2f(0))
1
sgn z if z is an x
(
2
2f(0))
1
sgn z if z is an y
,
which is a bounded discontinuous function.
For the Hodges-Lehmann estimate, (2.2.18), note that we can write the linearity result
(2.4.23) as

n(S
+
(/

n) 1/2) =

n(S
+
(0) 1/2)
_
f
2
+ o
p
(1) ,
2.7. ROBUSTNESS PROPERTIES 117
which upon substituting

n

R
for leads to

R
=
__
f
2
_
1

n(S
+
(0) 1/2) +o
p
(1) .
Recall the projection of the statistic S
R
(0)1/2 given in Theorem 2.4.7. Since the dierence
between it and this statistic goes to zero in probability we can, after some algebra, obtain
the following representation for the Hodges-Lehmann estimator,

R
=
__
f
2
_
1
1

n
_
n
2

j=1
F(Y
j
) 1/2

n
2

i=1
F(X
i
) 1/2

1
_
+ o
p
(1) .
Therefore the inuence function for the Hodges-Lehmann estimate is
(z) =
_

_

1
_
f
2
_
1
(F(z) 1/2) if z is an x
_

2
_
f
2
_
1
(F(z) 1/2) if z is an y
,
which is easily seen to be bounded and continuous.
For least squares, since the estimate is Y X the inuence function is
(Z) =
_
(
1
)
1
z if z is an x
(
2
)
1
z if z is an y
,
which is unbounded and continuous. The Hodges-Lehmann and L
1
estimates attain the
maximal breakdown point and have bounded inuence functions; hence they are robust.
On the other hand, the least squares estimate has 0 percent breakdown and an unbounded
inuence function. One bad point can destroy a least squares analysis.
For a general score function (u), by (2.5.31) we have the asymptotic representation

=
1

n
_
n
1

i=1
_

1
_
(F(X
i
)) +
n
2

i=1
_

2
_
(F(Y
i
))
_
.
Hence, the inuence function of the R-estimate based on the score function is given by
(z) =
_

1
(F(z)) if z is an x

2
(F(z)) if z is an y
,
where

is dened by expression (2.5.23). In particular, the inuence function is bounded


provided the score generating function is bounded. Note that the inuence function for the
R-estimate based on normal scores is unbounded; hence, this estimate is not robust. Recall
Example 1.8.1 in which the one sample normal scores estimate has an unbounded inuence
function (non robust) but has positive breakdown point (resistant). A rigorous derivation of
these inuence functions can be based on the inuence function derived in Section A.5.2 of
the Appendix.
118 CHAPTER 2. TWO SAMPLE PROBLEMS
2.8 Lehmann Alternatives and Proportional Hazards
Consider a two sample problem where the responses are lifetimes of subjects. We shall
continue to denote the independent samples by X
1
, . . . , X
n
1
and Y
1
, . . . , Y
n
2
. Let X
i
and Y
j
have distribution functions F(x) and G(x) respectively. Since we are dealing with lifetimes
both X
i
and Y
j
are positive valued random variables. The hazard function for X
i
is dened
by
h
X
(t) =
f(t)
1 F(t)
and represents the likelihood that a subject will die at time t given that he has survived
until that time; see Exercise 2.13.25.
In this section, we will consider the class of lifetime models that are called Lehmann
alternative models for which the distribution function G satises
1 G(x) = (1 F(x))

, (2.8.1)
where the parameter > 0. See Section 4.4 of Maritz (1981) for an overview of nonparamet-
ric methods for these models. The Lehmann model generalizes the exponential scale model
F(x) = 1 exp(x) and G(x) = 1 (1 F(x))

= 1 exp(x). As shown in Exercise


2.13.25, the hazard function of Y
j
is given by h
Y
(t) = h
X
(t); i.e., the hazard function of Y
j
is proportional to the hazard function of X
i
; hence, these models are also referred to as pro-
portional hazards models; see, also, Section 3.10. The null hypothesis can be expressed
as H
L0
: = 1. The alternative we will consider is H
LA
: < 1; that is, Y is less hazardous
than X; i.e., Y has more chance of long survival than X and is stochastically larger than X.
Note that,
P

(Y > X) = E

[P(Y > X [ X)]


= E

[1 G(X)]
= E

[(1 F(X))

] = ( + 1)
1
(2.8.2)
The last equality holds, since 1 F(X) has a uniform (0, 1) distribution. Under H
LA
, then,
P

(Y > X) > 1/2; i.e., Y tends to dominate X.


The MWW test statistic S
+
R
= #(Y
j
> X
i
) is a consistent test statistic for H
L0
versus
H
LA
, by Theorem 2.4.10. We reject H
L0
in favor of H
LA
for large values of S
+
R
. Furthermore
by Theorem 2.4.4 and (2.8.2), we have that
E

[S
+
R
] = n
1
n
2
E

[1 G(X)] =
n
1
n
2
1 +
.
This suggests as an estimate of , the statistic,
= ((n
1
n
2
)/S
+
R
) 1 . (2.8.3)
By Theorem 2.4.5 it can be shown that
V

(S
+
R
) =
n
1
n
2
( + 1)
2
+
n
1
n
2
(n
1
1)
( + 2)( + 1)
2
+
n
1
n
2
(n
2
1)
2
(2 + 1)( + 1)
2
; (2.8.4)
2.8. LEHMANN ALTERNATIVES AND PROPORTIONAL HAZARDS 119
see Exercise 2.13.27. Using this result and the asymptotic distribution of S
+
R
under general
alternatives, Theorem 2.4.9, we can obtain, by the delta method, the asymptotic variance of
given by
Var
.
=
(1 +)
2

n
1
n
2
_
1 +
n
1
1
+ 2
+
(n
2
1)
2 + 1
_
. (2.8.5)
This can be used to obtain an asymptotic condence interval for ; see Exercise 2.13.27 for
details. As in the example below the bootstrap could also be used to estimate the Var( ).
2.8.1 The Log Exponential and the Savage Statistic
Another rank test which is frequently used in this situation is the log rank test proposed
by Savage (1956). In order to obtain this test, rst consider the special case where X has
the exponential distribution function, F(x) = 1 e
x/
, for > 0. In this case the hazard
function of X is a constant function. Consider the random variable = log X log . In a
few steps we can obtain its distribution function as,
P[ t] = P[log X log t]
= 1 exp (e
t
) ;
i.e., has an extreme value distribution. The density of is f

(t) = exp (t e
t
). Hence, we
can model log X as the location model:
log X = log + . (2.8.6)
Next consider the distribution of the log Y . Using expression (2.8.1) and a few steps of
algebra we get
P[log Y t] = 1 exp (

e
t
) .
But from this it is easy to see that we can model Y as
log Y = log + log
1

+ , (2.8.7)
where the error random variable has the above extreme value distribution. From (2.8.6) and
(2.8.7) we see that the log-transformation problem is simply a two sample location problem
with shift parameter = log . Here, H
L0
is equivalent to H
0
: = 0 and H
LA
is
equivalent to H
A
: > 0. We shall refer to this model as the log exponential model for
the remainder of this section. Thus any of the rank-based analyses that we have discussed
in this chapter can be used to analyze this model.
Lets consider the analysis based on the optimal score function for the model. Based on
Section 2.5 and Exercise 2.13.19, the optimal scores for the extreme value distribution are
generated by the function

f
(u) = (1 + log(1 u)) . (2.8.8)
120 CHAPTER 2. TWO SAMPLE PROBLEMS
Hence the optimal rank test in the log exponential model is given by
S
L
=
n
2

j=1

f
_
R(Y
j
)
n + 1
_
=
n
2

j=1
_
1 + log
_
1
R(log Y
j
)
n + 1
__
=
n
2

j=1
_
1 + log
_
1
R(Y
j
)
n + 1
__
. (2.8.9)
We reject H
L0
in favor of H
LA
for large values of S
L
. By (2.5.14) the null mean of S
L
is 0
while from (2.5.18) its null variance is given by

f
=
n
1
n
2
n(n 1)
n

i=1
_
1 + log
_
1
i
n + 1
__
2
. (2.8.10)
Then an asymptotic level test rejects H
L0
in favor of H
LA
if S
L
z

f
.
Certainly the statistic S
L
can be used in the general Lehmann alternative model described
above, although, it is not optimal if X does not have an exponential distribution. We shall
discuss the eciency of this test below.
For estimation, let

be the estimate of based on the optimal score function
f
; that
is,

solves the equation
n
2

j=1
_
1 + log
_
1
R[log(Y
j
) ]
n + 1
__
.
= 0 . (2.8.11)
Besides estimation, the condence intervals discussed in Section 2.5 for general scores, can
be obtained for the score function
f
; see Example 2.8.1 for an illustration.
Thus another estimate of would be = exp
_

_
. As discussed in Exercise 2.13.27,
an asymptotic condence interval for can be formulated from this relationship. Keep in
mind, though, that we are assuming that X is exponentially distributed.
As a further note, since
f
(u) is an unbounded function it follows from Section 2.7.2
that the inuence function of

is unbounded. Thus the estimate is not robust.
A frequently used, equivalent test statistic to S
L
was proposed by Savage. To derive it,
denote R(Y
j
) by R
j
. Then we can write
log
_
1
R
j
n + 1
_
=
_
1R
j
/(n+1)
1
1
t
dt =
_
0
R
j
/(n+1)
1
1 t
dt .
We can approximate this last integral by the following Riemann sum:
1
1 R
j
/(n + 1)
1
n + 1
+
1
1 (R
j
1)/(n + 1)
1
n + 1
+ +
1
1 (R
j
(R
j
1))/(n + 1)
1
n + 1
.
This simplies to
1
n + 1 1
+
1
n + 1 2
+ +
1
n + 1 R
j
=
n

i=n+1R
j
1
i
.
2.8. LEHMANN ALTERNATIVES AND PROPORTIONAL HAZARDS 121
This suggests the rank statistic

S
L
= n
2
+
n
2

j=1
n

i=nR
j
+1
1
i
. (2.8.12)
This statistic was proposed by Savage (1956). Note that it is a rank statistic with scores
dened by
a
j
= 1 +
n

i=nj+1
1
i
. (2.8.13)
Exercise 2.13.28 shows that its null mean and variance are given by
E
H
0
[

S
L
] = 0

2
=
n
1
n
2
n 1
_
1
1
n
n

j=1
1
j
_
. (2.8.14)
Hence an asymptotic level test is to reject H
L0
in favor of H
LA
if

S
L
z

.
Based on the above Riemann sum it would seem that

S
L
and S
L
are close statistics.
Indeed they are asymptotically equivalent and, hence, both are optimal when X is exponen-
tially distributed; see Hajek and

Sidak (1967) or Kalbeish and Prentice (1980) for details.
2.8.2 Eciency Properties
We next derive the asymptotic relative eciences for the log exponential model with f

(t) =
exp (t e
t
). The MWW statistic, S
+
R
, is a consistent test for the log exponential model. By
(2.4.21), the ecacy of the Wilcoxon test is
c
MWW
=

12
_
f
2

2
=
_
3
4
_

2
; ,
Since the Savage test is asymptotically optimal its ecacy is the square root of Fisher infor-
mation, i.e., I
1/2
(f

) discussed in Section 2.5. This ecacy is

2
. Hence the asymptotic
relative eciency of the Mann-Whitney-Wilcoxon test to the Savage test at the log expo-
nential model, is 3/4; see Exercise 2.13.29.
Recall that the ecacy of the L
1
procedures, both Moods and Mathisens, is 2f

2
,
where

denotes the median of the extreme value distribution. This turns out to be

= log(log 2)). Hence f

) = (log 2)/2, which leads to the ecacy

2
log 2 for the L
1
methods. Thus the asymptotic relative eciency of the L
1
procedures with respect to the
procedure based on Savage scores is (log 2)
2
= .480. The asymptotic relative eciency of
the L
1
methods to the MWW at this model is .6406. Therefore there is a substantial loss of
eciency if L
1
methods are used for the log exponential model. This makes sense since the
extreme value distribution has very light tails.
122 CHAPTER 2. TWO SAMPLE PROBLEMS
The variance of a random variable with density f

is
2
/6; hence the asymptotic relative
eciency of the t test to the Savage test at the log exponential model is 6/
2
= .608. Hence,
for the procedures analyzed in this chapter on the log exponential model the Savage test is
optimal followed, in order, by the MWW, t, and L
1
tests.
Example 2.8.1. Lifetimes of an Insulation Fluid.
The data below are drawn from an example on page 3 of Lawless (1982); see, also, Nelson
(1982, p. 227). They consist of the breakdown times (in minutes) of an electrical insulating
uid when subject to two dierent levels of voltage stress, 30 and 32 kV. Suppose we are
interested in testing to see if the lower level is less hazardous than the higher level.
Voltage Level Times to Breakdown (Minutes)
30 kV 17.05 22.66 21.02 175.88 139.07 144.12 20.46 43.40
Y 194.90 47.30 7.74
32 kV 0.40 82.85 9.88 89.29 215.10 2.75 0.79 15.93
X 3.91 0.27 0.69 100.58 27.80 13.95 53.24
Let Y and X denote the log of the breakdown times of the insulating uid at the voltage
stesses of 30 kV and 32 kVs, respectively. Let =
Y

X
denote the shift in locations.
We are interested in testing H
0
: = 0 versus H
A
: > 0. The comparison boxplots for
the log-transformed data are displayed in the left panel of Figure 2.8.1. It appears that the
lower level (30 kV) is less hazardous.
The RBR function twosampwiltwosampr2+ with the score argument set at philogr
obtains the analysis based on the log-rank scores. . Briey, the results are:
Test of Delta = 0 Alternative selected is 1
Standardized (z) Test-Statistic 1.302 and p-vlaue 0.096
Estimate 0.680 SE is 0.776
95 % Confidence Interval is (-0.261, 2.662)
Estimate of the scale parameter tau 1.95
The corresponding Mann-Whitney-Wilcoxon analysis is
Test of Delta = 0 Alternative selected is 1
Test Stat. S+ is 118 Standardized (z) Test-Stat. 1.816 and p-vlaue 0.034
MWW estimate of the shift in location is 1.297 SE is 0.944
95 % Confidence Interval is (-0.201, 3.355)
Estimate of the scale parameter tau 2.37
2.9. TWO SAMPLE RANK SET SAMPLING (RSS) 123
Figure 2.8.1: Comparison Boxplots of Treatment and Control Quail LDL Levels
0.0 0.5 1.0 1.5 2.0 2.5
0
5
0
1
0
0
1
5
0
2
0
0
Exponential Quantiles
V
o
l
t
a
g
e

l
e
v
e
l
Exponential qq Plot
log 30 kv log 32 kv

1
0
1
2
3
4
5
B
r
e
a
k
d
o
w
n

t
i
m
e
Comparison Boxplots of log 32 kv and log 30 kv
While the log-rank is insignicant, the MWW analysis is signicant at level 0.034. This
dierence is not surprising upon considering the qq plot of the original data at the 32 kV
level found in the right panel of Figure 2.8.1. The population quantiles are drawn from an
exponential distribution. The plot indicates heavier tails than that of an exponential distri-
bution. In turn, the error distribution for the location model would have heavier tails than
the light-tailed extreme ,valued distribution. Thus the MWW analysis is more appropriate.
The two sample t-test has value 1.34 with the p-value also of .096. It was impaired by the
heavy tails too.
Although, the exponential model on the original data seems unlikely, for illustration we
consider it. The sum of the ranks of the 30 kV (Y ) sample is 184. The estimate of based
on the MWW statistic is .40. A 90% condence interval for based on the approximate (via
the delta-method) variance, (2.8.5), is (.06, .74); while, a 90% bootstrap condence interval
based 1000 bootstrap samples is (.15, .88). Hence the MWW test, the corresponding estimate
of and the two condence intervals indicate that the lower voltage level is less hazardous
than the higher level.
2.9 Two Sample Rank Set Sampling (RSS)
The basic background for rank set sampling was discussed in Section 1.9. In this section we
extend these ideas to the two sample location problem. Suppose we have the two samples
in which X
1
, . . . , X
n
1
are iid F(x) and Y
1
, . . . , Y
n
2
are iid F(x ) and the two samples
124 CHAPTER 2. TWO SAMPLE PROBLEMS
are independent of one another. In the corresponding RSS design, we take n
1
cycles of k
samples for X and n
2
cycles of q samples for Y . Proceeding as in Section 1.9, we display the
measured data as:
X
(1)1
, . . . , X
(1)n
1
iid f
(1)
(t) Y
(1)1
, . . . , Y
(1)n
2
iid f
(1)
(t )



X
(k)1
, . . . , X
(k)n
1
iid f
(k)
(t) Y
(q)1
, . . . , Y
(q)n
2
iid f
(q)
(t )
.
To test H
0
: = 0 versus H
A
: > 0 we compute the Mann-Whitney-Wilcoxon
statistic with these rank set samples. Letting U
si
=

n
2
t=1

n
1
j=1
I(Y
(s)t
> X
(i)j
), the test
statistic is
U
RSS
=
q

s=1
k

i=1
U
si
.
Note that U
si
is the Mann-Whitney-Wilcoxon statistic computed on the sample of the sth Y
order statistics and the ith X order statistics. Even under the null hypothesis H
0
: = 0,
U
si
is not based on identically distributed samples unless s = i. This complicates the null
distribution of U
RSS
.
Bohn and Wolfe (1992) present a thorough treatment of the distribution theory for U
RSS
.
We note that under H
0
: = 0, U
RSS
is distribution free and further, using the same ideas
as in Theorem 1.9.1, E
H
0
(U
RSS
) = qkn
1
n
2
/2. For xed k and q, provided assumption D.1,
(2.4.7), holds, Theorem 2.4.2 can be applied to show that (U
RSS
qkn
1
n
2
/2)/
_
V
H
0
(U
RSS
)
has a limiting N(0, 1) distribution. The diculty is in the calculation of the V
H
0
(U
RSS
); recall
Theorem 1.9.1 for a similar calculation for the sign statistic. Bohn and Wolfe (1992) present
a complex formula for the variance. Bohn and Wolfe provide a table of the approximate null
distribution of U
RSS
for q = k = 2, n
1
= 1, . . . , 5, n
2
= 1, . . . , 5 and likewise for q = k = 3.
Another way to approximate the null distribution of U
RSS
is to bootstrap it. Consider,
for simplicity, the case k = q = 3 and n
1
= n
2
= m. Hence the expert must rank three
observations and each of the m cycles consists of three samples of size three for each of the
X and Y measurements. In order to bootstrap the null distribution of U
RSS
, rst align the
Y -RSSs with

, the Hodges-Lehmann estimate of shift computed across the two RSSs.
Our bootstrap sampling is on the data with the indicated sampling distributions:
X
(1)1
, . . . , X
(1)m
sample

F
(1)
(x) Y
(1)1
, . . . , Y
(1)m
sample

F
(1)
(y

)
X
(2)1
, . . . , X
(2)m
sample

F
(2)
(x) Y
(2)1
, . . . , Y
(2)m
sample

F
(2)
(y

)
X
(3)1
, . . . , X
(3)m
sample

F
(3)
(x) Y
(3)1
, . . . , Y
(3)m
sample

F
(3)
(y

)
.
In the bootstrap process, for each row i = 1, 2, 3, we take random samples X

(i)1
, . . . , X

(i)m
from

F
(i)
(x) and Y

(i)1
, . . . , Y

(i)m
from

F
(2)
(y

). We then compute U

RSS
on these samples.
Repeating this B times, we obtain the sample of test statistics U

RSS,1
, . . . , U

RSS,B
. Then
the bootstrap p-value for our test is #(U

RSS,j
U
RSS
)/B, where U
RSS
is the value of the
statistic based on the original data. Generally we take B = 1000 for a p-value. It is clear
how to modify the above argument to allow for k ,= q and n
1
,= n
2
.
2.10. TWO SAMPLE SCALE PROBLEM 125
2.10 Two Sample Scale Problem
Frequently it is of interest to investigate whether or not one random variable is more dispersed
than another. The general case is when the random variables dier in both location and scale.
Suppose the distribution functions of X and Y are given by F(x) and G(y) = F((y )/),
respectively; hence L(Y ) = L(X +). For discussion, we consider one-sided hypotheses of
the form
H
0
: = 1 versus H
A
: > 1. (2.10.1)
The other one-sided or two-sided hypotheses can be handled similarly. Let X
1
, . . . , X
n
1
and
Y
1
, . . . , Y
n
2
be samples drawn on the random variables X and Y , respectively.
The traditional test of H
0
is the F-test which is based on the ratio of sample variances.
As we discuss in Section 2.10.2, though, this test is generally not asymptotically correct, (one
of the exceptions is when F(t) is a normal cdf). Indeed, as many simulation studies have
shown, this test is extremely liberal in many non-normal situations; see Conover, Johnson
and Johnson (1981).
Tests of H
0
should be invariant to the locations. One way of ensuring this is to rst
center the observations. For the F-test, the centering is by sample means; instead, we prefer
to use the sample medians. Let

X
and

Y
denote the sample medians of the X and Y
samples, respectively. Then the samples of interest are the folded aligned samples given by
[X

1
[, . . . , [X

n
1
[ and [Y

1
[, . . . , [Y

n
2
[, where X

i
= X
i

X
and Y

i
= Y
i

Y
.
2.10.1 Optimal Rank-Based Tests
To obtain appropriate score functions for the scale problem, rst consider the case when
the location parameters of X and Y are known. Without loss of generality, we can then
assume that they are 0 and, hence, that L(Y ) = L(X). Further because > 0, we have
L([Y [) = L([X[). Let Z

= (log [X
1
[, . . . , log [X
n
1
[, log [Y
1
[, . . . , log [Y
n
2
[) and c
i
, (2.2.1), be
the dummy indicator variable, i.e., c
i
= 0 or 1, depending on whether Z
i
is an X or Y ,
repectively. Then an equivalent formulation of this problem is
Z
i
= c
i
+ e
i
, 1 i n , (2.10.2)
where = log , e
1
, . . . , e
n
are iid with distribution function F

(x) which is the cdf of log [X[.


The hypotheses, (2.10.1), are equivalent to
H
0
: = 0 versus H
A
: > 1. (2.10.3)
Of course, this is the two sample location problem based on the logs of the absolute values
of the observations. Hence, the optimal score function for Model 2.10.2 is given by

f
(u) =
f

(F
1
(u)))
f

(F
1
(u)))
. (2.10.4)
126 CHAPTER 2. TWO SAMPLE PROBLEMS
After some simplication, see Exercise 2.13.30, we have

(x)
f

(x)
=
e
x
[f

(e
x
) f

(e
x
)]
f (e
x
) + f (e
x
)
+ 1 . (2.10.5)
If we further assume that f is symmetric, then expression (2.10.5) for the optimal scores
function simplies to

f
(u) = F
1
_
u + 1
2
_
f

_
F
1
_
u+1
2
__
f
_
F
1
_
u+1
2
__ 1. (2.10.6)
This expression is convenient to work with because it depends on F(t) and f(t), the
cdf and pdf of X, in the original formulation of this scale problem. The following two
examples obtain the optimal score function for the normal and double exponential situations,
respectively.
Example 2.10.1. L(X) Is Normal
Without loss of generality, assume that f(x) is the standard normal density. In this case
expression (2.10.6) simplies to

FK
(u) =
_

1
_
u + 1
2
__
2
1 , (2.10.7)
where is the standard normal distribution function; see Exercise 2.13.33. Hence, if we are
sampling from a normal distribution this suggests the rank test statistic
S
FK
=
n
2

j=1
_

1
_
R[Y
j
[
2(n + 1)
+
1
2
__
2
, (2.10.8)
where the FK subscript is due to Fligner and Killeen (1976), who discussed this score
function in their work on the two-sample scale problem.
Example 2.10.2. L(X) Is Double Exponential
Suppose that the density of X is the double exponential, f(x) = 2
1
exp [x[, <
x < . Then as Exercise 2.13.33 shows the optimal rank score function is given by
(u) = (log (1 u) + 1) . (2.10.9)
These scores are not surprising, because the distribution of [X[ is exponential. Hence, this
is precisely the log linear problem with exponentially distributed lifetime that was discussed
in Section 2.8; see the discussion around expression (2.8.8).
Example 2.10.3. L([X[) Is a Member of the Generalized F-family: MWW Statistic
2.10. TWO SAMPLE SCALE PROBLEM 127
In Section 3.10 a discussion is devoted to a large family of commonly used distributions
called the generalized F family for survival type data. In particular, as shown there, if
[X[ follows an F(2, 2)-distribution, then it follows, (Exercise 2.13.31), that the log [X[ has
a logistic distribution. Thus the MWW statistic is the optimal rank score statistic in this
case.
Notice the relationship between tail-weight of the distribution and the optimal score
function for the scale problem over these last three examples. If the underlying distribution
is normal then the optimal score function (2.10.8) is for very light-tailed distributions. Even
at the double-exponential, the score function (2.10.9) is still for light-tailed errors. Finally,
for the heavy-tailed (variance is ) F(2, 2) distribution the score function is the bounded
MWW score function. The reason for the dierence in location and scale scores is that
the optimal score function for the scale case is based on the distribution of the logs of the
original variables.
Once a scale score function is selected, following Section 2.5 the general scores process
for this problem is given by
S

() =
n
2

j=1
a

(R(log [Y
j
[ )) , (2.10.10)
where the scores a(i) are generated by a(i) = (i/(n + 1)).
A rank test statistic for the hypotheses, (2.10.3), is given by
S

= S

(0) =
n
2

j=1
a

(R(log [Y
j
[) =
n
2

j=1
a

(R([Y
j
[) , (2.10.11)
where the last equality holds because the log function is strictly increasing. This is not
necessarily a standardized score function, but it follows from the discussion on general scores
found in Section 2.5 and (2.5.18) that the null mean

and null variance


2

of the statistic
are given by

= n
2
a and
2

=
n
1
n
2
n(n 1)

(a(i) a)
2
. (2.10.12)
The asymptotic version of this test statistic rejects H
0
at approximate level if z z

where
z =
S

. (2.10.13)
The ecacy of the test based on S

is given by expression (2.5.28); i.e.,


c

=
1

2
, (2.10.14)
where

is given by

=
_
1
0
(u)
f
(u) du (2.10.15)
128 CHAPTER 2. TWO SAMPLE PROBLEMS
and the optimal scores function
f
(u) is given in expression (2.10.4). Note that this formula
for the eciacy is under the assumption that the score function (u) is standardized.
Recall the original (realistic) problem, where the distribution functions of X and Y are
given by F(x) and G(y) = F((y )/), respectively and the dierence in locations, , is
unknown. In this case, L(Y ) = L(X + ). As noted above, the samples of interest are
the folded aligned samples given by [X

1
[, . . . , [X

n
1
[ and [Y

1
[, . . . , [Y

n
2
[, where X

i
= X
i

X
and Y

i
= Y
i

Y
, where

X
and

Y
denote the sample medians of the X and Y samples,
respectively.
Given a score function (u), we consider the linear rank statistic, (2.10.11), where the
ranking is performed on the folded-aligned observations; i.e.,
S

=
n
2

j=1
a(R([Y

j
[)). (2.10.16)
The statistic S

is no longer distribution free for nite samples. However, if we further assume


that the distributions of X and Y are symmetric, then the test statistic S

is asymptotically
distribution free and has the same eciency properties as S

; see Puri (1968) and Fligner


and Hettmansperger (1979). The requirement that f is symmetric is discussed in detail by
Fligner and Hettmansperger (1979). In particular, we standardize the statistic using the
mean and variance given in expression (2.10.12).
Estimation and condence intervals for the parameter are based on the process
S

() =
n
2

j=1
a

(R(log [Y

j
[ )) , (2.10.17)
An estimate of is a value

which solves the equation (2.5.12); i.e.,
S

)
.
= 0 . (2.10.18)
An estimate of , the ratio of scale parameters, is then
= e
b

. (2.10.19)
The interval (

L
,

U
) where

L
and

U
solve the equations (2.5.21), forms (asymptotically)
a (1 )100% condence interval for . The corresponding condence interval for is
(exp

L
, exp

U
).
As a simple rank-based analysis, consider the test and estimation given above based on
the optimal scores (2.10.7) for the normal situation. The folded aligned samples version of
the test statistic (2.10.8) is the statistic
S

FK
=
n
2

j=1
_

1
_
R[Y

j
[
2(n + 1)
+
1
2
__
2
. (2.10.20)
2.10. TWO SAMPLE SCALE PROBLEM 129
The standardized test statistic is z

FK
= (S

FK

FK
)/
FK
, where
FK
abd
FK
are the
vaules of (2.10.12) for the scores (2.10.7). This statistic for non-aligned samples is given on
page 74 of Hajek and

Sidak (1967). A version of it was also discussed by Fligner and Killeen
(1976). We refer to this test and the associated estimator and condence interval as the
Fligner-Killeen analysis. The RBR function twoscale with the score function phiscalefk
computes the Fligner-Killeen analysis. We next obtain the ecacy of this analysis.
Example 2.10.4. Ecacy for the Score Function
FK
(u).
To use expression (2.5.28) for the ecacy, we must rst standardize the score function

FK
(u) =
1
[(u +1)/2]
2
1, (2.10.7). Using the substitution (u +1)/2 = (t), we have
_
1
0

FK
(u) du =
_

t
2
(t) dt 1 = 1 1 = 0.
Hence, the mean is 0. In the same way,
_
1
0
[
FK
(u)]
2
du =
_

t
4
(t) dt 2
_

t
2
(t) dt + 1 = 2.
Thus the standardized score function is

FK
(u) =
1
[(u + 1)/2]
2
1]/

2. (2.10.21)
Hence, the ecacy of the Fligner-Killeen analysis is
c

FK
=
_

2
_
1
0
1

1
[(u + 1)/2]
2
1]
f
(u) du, (2.10.22)
where the optimal score function
f
(u) is given in expression (2.10.4). In particular, the
ecacy at the normal distribution is given by
c

FK
(normal) =
_

2
_
1
0
1

1
[(u + 1)/2]
2
1]
2
du, =

2
_

2
. (2.10.23)
We illustrate the Fligner-Killeen analysis with the following example.
Example 2.10.5. Doksum and Sievers Data.
Doksum and Sievers (1976) describe an experiment involving the eect of ozone on weight
gain of rats. The experimental group consisted of n
2
= 22 rats which were placed in an ozone
environment for seven days, while the control group contained n
1
= 21 rats which were placed
in an ozone free environment for the same amount of time. The response was the weight
gain in a rat over the time period. Figure 2.10.1 displays the comparison boxplots for the
data. There appears to be a dierence in scale. Using the RBR software discussed above,
130 CHAPTER 2. TWO SAMPLE PROBLEMS
Figure 2.10.1: Comparison Boxplots of Treated and Control Weight Gains in rats.
Control Ozone

1
0
0
1
0
2
0
3
0
4
0
5
0
W
e
i
g
h
t

G
a
i
n
Comparison Boxplots of Control and Ozone
the Fligner-Killeen test statistic S

FK
= 28.711 and its standardized value is z

FK
= 2.095.
The corresponding p-value for a two sided test is 0.036, conrming the impression from the
plot. The associated estimate of the ratio (ozone to control) of scales is = 2.36 with a 95%
condence interval of (1.09, 5.10).
Conover, Johnson and Johnson (1981) performed a large Monte Carlo study of tests of dispersion, including these folded-aligned rank tests, over a wide variety of situations for the c-sample scale problem. The traditional F-test (Bartlett's test) did poorly (as would be expected from our comments below about the lack of robustness of the classical F-test). In certain null situations its empirical levels exceeded .80 when the nominal level was .05. One rank test that performed very well was the aligned rank version of a test statistic similar to $S^*_{FK}$, (2.10.20), but with an exponent of one instead of two in the definition of the score function. This test performed well overall in terms of validity and power except for highly asymmetric distributions, where it has a tendency to be liberal. However, in the following simulation study the Fligner-Killeen test (2.10.20) (exponent of two) is empirically valid over the asymmetric situations covered.
Example 2.10.6. Simulation Study for the Validity of the Tests $S^*_\varphi$.

Table 2.10.1 displays the results of a small simulation study of the validity of the rank-based tests of scale for various score functions over mostly skewed error distributions. The scores in the study are: (fk^2), the optimal score function for the normal distribution; (fk), similar to the last except the exponent is one; (Wilcoxon), the linear Wilcoxon score function; (Quad), the score function $\varphi(u) = u^2$; and (Logistic), the optimal score function if the distribution of X is logistic (see Exercise 2.13.32). The error distributions include the normal and the $\chi^2(1)$ distributions and several members of the skewed contaminated normal distribution. In the latter case, the random variable X is written as $X = X_1(1 - I_\varepsilon) + I_\varepsilon X_2$, where $X_1$ and $X_2$ have $N(0,1)$ and $N(\mu_c, \sigma_c^2)$ distributions, respectively, $I_\varepsilon$ has a Bernoulli distribution with probability of success $\varepsilon$, and $X_1$, $X_2$, and $I_\varepsilon$ are mutually independent. For the study, $\varepsilon$ was set at 0.3 and $\mu_c$ and $\sigma_c$ varied. The pdfs of the three SCN distributions in Table 2.10.1 are shown in Figure 2.10.2. The pdf in the bottom right corner panel of the figure is that of the $\chi^2(1)$-distribution. For all but the last situation in Table 2.10.1, the sample sizes are $n_1 = 20$ and $n_2 = 25$. The last situation is for $n_1 = n_2 = 10$. The number of simulations for each situation was set at 1000. For each run, the two-sided alternative, $H_A\colon \eta \neq 1$, was tested and the estimator of $\eta$ and an associated confidence interval for $\eta$ were obtained. Computations were performed by RBR functions.
The table shows the empirical levels at the nominal 0.10, 0.05, and 0.01 levels; the empirical confidence coefficient for a nominal 95% confidence interval; the mean of the estimates of $\eta$; and the MSE for $\hat{\eta}$. Of the five analyses, overall the Fligner-Killeen analysis (fk^2) performed the best. This analysis was valid (nominal levels and empirical coverage) in all the situations, except for the $\chi^2(1)$ distribution at the 10% level and the larger sample sizes. Even here, its empirical level is 0.128. The other tests were liberal in the skewed situations, some, such as the Wilcoxon, quite liberal. Also, the fk analysis (exponent of one in its score function) was liberal for the $\chi^2(1)$ situations. Notice that the Fligner-Killeen analysis achieved the lowest MSE in all the situations.

Hall and Padmanabhan (1997) developed a percentile bootstrap for these rank-based tests which, in their accompanying study, performed quite well for skewed error distributions as well as symmetric error distributions.
As a final remark, another class of linear rank statistics for the two-sample scale problem consists of simple linear rank statistics of the form

    S = \sum_{j=1}^{n_2} a(R(Y_j)) ,   (2.10.24)

where the scores are generated as $a(i) = \varphi(i/(n+1))$. The folded rank statistics discussed above suggest that $\varphi$ be a convex (or concave) function. One popular score function is the quadratic function $\varphi(u) = (u - 1/2)^2$. The resulting statistic,

    S_M = \sum_{j=1}^{n_2} \left( \frac{R(Y_j)}{n+1} - \frac{1}{2} \right)^2 ,   (2.10.25)
was proposed by Mood (1954) as a test statistic for the two-sample scale hypotheses. For the realistic problem with unknown location, though, the observations first have to be aligned. Asymptotic theory holds, provided the underlying distribution is symmetric. This class of aligned rank tests, though, did not perform nearly as well as the folded rank statistics, (2.10.16), in the large Monte Carlo study of Conover et al. (1981). Hence, we recommend the folded rank-based analyses discussed above.
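A direct computation makes the alignment step in Mood's statistic concrete. The sketch below is ours (not RBR code); it aligns each sample by its own median, as the text prescribes, before ranking in the combined sample:

    ## Mood's aligned statistic (2.10.25) for two samples x and y.
    mood.scale <- function(x, y) {
      ax <- x - median(x); ay <- y - median(y)   # aligned observations
      n <- length(ax) + length(ay)
      ry <- rank(c(ax, ay))[(length(ax) + 1):n]  # ranks of the aligned Y's
      sum((ry / (n + 1) - 0.5)^2)                # S_M of (2.10.25)
    }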
Table 2.10.1: Empirical Levels, Confidences, and MSEs for the Monte Carlo Study of Example 2.10.6.

    Normal Errors, n_1 = 20, n_2 = 25
    Score     alpha.10  alpha.05  alpha.01  Cnf.95  mean(eta-hat)  MSE(eta-hat)
    Logistic  0.083     0.041     0.006     0.961   1.037          0.060
    Quad.     0.080     0.030     0.008     0.970   1.043          0.076
    Wilcoxon  0.073     0.033     0.004     0.967   1.042          0.097
    fk^2      0.087     0.039     0.004     0.960   1.036          0.057
    fk        0.077     0.033     0.005     0.969   1.037          0.067

    SKCN(mu_c = 2, sigma_c = sqrt(2), epsilon = 0.3), n_1 = 20, n_2 = 25
    Logistic  0.106     0.036     0.006     0.965   1.035          0.076
    Quad.     0.106     0.046     0.008     0.953   1.040          0.095
    Wilcoxon  0.103     0.049     0.007     0.952   1.043          0.117
    fk^2      0.100     0.034     0.006     0.966   1.033          0.073
    fk        0.099     0.047     0.006     0.953   1.034          0.085

    SKCN(mu_c = 6, sigma_c = sqrt(2), epsilon = 0.3), n_1 = 20, n_2 = 25
    Logistic  0.081     0.033     0.006     0.966   1.067          0.166
    Quad.     0.122     0.068     0.020     0.933   1.105          0.305
    Wilcoxon  0.163     0.103     0.036     0.897   1.125          0.420
    fk^2      0.072     0.026     0.005     0.974   1.057          0.126
    fk        0.111     0.057     0.015     0.942   1.075          0.229

    SKCN(mu_c = 12, sigma_c = sqrt(2), epsilon = 0.3), n_1 = 20, n_2 = 25
    Logistic  0.084     0.046     0.007     0.954   1.091          0.298
    Quad.     0.138     0.085     0.018     0.916   1.183          0.706
    Wilcoxon  0.171     0.116     0.038     0.886   1.188          0.782
    fk^2      0.074     0.042     0.007     0.958   1.070          0.201
    fk        0.115     0.069     0.015     0.932   1.109          0.400

    chi^2(1), n_1 = 20, n_2 = 25
    Logistic  0.154     0.086     0.023     0.913   1.128          0.353
    Quad.     0.249     0.149     0.047     0.851   1.170          0.482
    Wilcoxon  0.304     0.197     0.067     0.804   1.196          0.611
    fk^2      0.128     0.066     0.018     0.936   1.120          0.336
    fk        0.220     0.131     0.039     0.870   1.154          0.432

    chi^2(1), n_1 = 10, n_2 = 10
    Logistic  0.132     0.062     0.018     0.934   1.360          1.495
    Quad.     0.192     0.099     0.035     0.900   1.457          2.108
    Wilcoxon  0.276     0.166     0.042     0.833   1.560          3.311
    fk^2      0.111     0.057     0.013     0.941   1.335          1.349
    fk        0.199     0.103     0.033     0.893   1.450          2.086
Figure 2.10.2: Pdfs of Skewed Distributions in the Simulation Study of Example 2.10.6. (Four panels, each plotting $f(x)$ versus $x$: SCN with $\mu_c = 2$, $\sigma_c = 1.41$, $\varepsilon = .3$; SCN with $\mu_c = 6$, $\sigma_c = 1.41$, $\varepsilon = .3$; SCN with $\mu_c = 12$, $\sigma_c = 1.41$, $\varepsilon = .3$; and the $\chi^2$ distribution with one degree of freedom.)
2.10.2 Efficacy of the Traditional F-Test

We next obtain the efficacy of the traditional F-test for the ratio of scale parameters.
Actually, for our development we need not assume that X and Y have the same locations. Let $\sigma_2^2$ and $\sigma_1^2$ denote the variances of Y and X, respectively. Then, in the notation of the first paragraph of this section, $\eta^2 = \sigma_2^2/\sigma_1^2$. The classical F-test of the two-sample scale hypotheses is to reject $H_0$ if $F^* \ge F(\alpha, n_2 - 1, n_1 - 1)$, where

    F^* = \hat{\sigma}_2^2 / \hat{\sigma}_1^2 ,

and $\hat{\sigma}_2^2$ and $\hat{\sigma}_1^2$ are the sample variances of the samples $Y_1, \ldots, Y_{n_2}$ and $X_1, \ldots, X_{n_1}$, respectively. The F-test is exact size $\alpha$ if $f$ is a normal pdf. Also, the test is invariant to differences in location.
We first need the asymptotic distribution of $F^*$ under the null hypothesis. Instead of working with $F^*$, it is more convenient mathematically to work with the equivalent test statistic $\sqrt{n}\log F^*$. We will assume that X has a finite fourth central moment; i.e., $\mu_{X,4} = E[(X - E(X))^4] < \infty$. Let $\xi = (\mu_{X,4}/\sigma_1^4) - 3$ denote the kurtosis of X. It easily follows that Y has the same kurtosis under the null and alternative hypotheses. A key result, established in Exercise 2.13.38, is that under these conditions

    \sqrt{n_i}\,(\hat{\sigma}_i^2 - \sigma_i^2) \to_D N(0, \sigma_i^4(\xi + 2)) , \quad i = 1, 2 .   (2.10.26)

It follows immediately by the delta method that

    \sqrt{n_i}\,(\log\hat{\sigma}_i^2 - \log\sigma_i^2) \to_D N(0, \xi + 2) , \quad i = 1, 2 .   (2.10.27)
Under $H_0$, $\sigma_i = \sigma$, say, and by the last result,

    \sqrt{n}\log F^* = \sqrt{n/n_2}\,\sqrt{n_2}(\log\hat{\sigma}_2^2 - \log\sigma^2) - \sqrt{n/n_1}\,\sqrt{n_1}(\log\hat{\sigma}_1^2 - \log\sigma^2) \to_D N(0, (\xi+2)/(\lambda_1\lambda_2)) .   (2.10.28)

The approximate test rejects $H_0$ if

    \frac{\sqrt{n}\log F^*}{\sqrt{(\xi+2)/(\lambda_1\lambda_2)}} \ge z_\alpha .   (2.10.29)
Note that $\xi = 0$ if X is normal. In practice, the test which is used assumes $\xi = 0$; that is, $F^*$ is not corrected by an estimate of $\xi$. This is one reason that the usual F-test for a ratio of variances does not possess robustness of validity; that is, the significance level is not asymptotically distribution free. Unlike the t-test, the F-test for variances is not even asymptotically distribution free under $H_0$.
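A two-line simulation makes this point vivid. The sketch below (ours, not from the text) estimates the true level of the usual F-test, as computed by R's var.test, when both samples come from a $t(5)$ distribution, for which $\xi = 6$:

    set.seed(2103)
    rej <- replicate(5000, {
      x <- rt(20, df = 5); y <- rt(25, df = 5)   # null: equal scales
      var.test(y, x)$p.value < 0.05
    })
    mean(rej)   # empirical level; typically well above the nominal 0.05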
In order to obtain the efficacy of the F-test, consider the sequence of contiguous alternatives $\Delta_n = \delta/\sqrt{n}$ discussed above. Assume without loss of generality that the locations of X and Y are the same. Under this sequence of alternatives we have $Y_j = e^{\Delta_n} U_j$, where $U_j$ is a random variable with cdf $F(x)$, while $Y_j$ has cdf $F(e^{-\Delta_n} x)$. We also get $\hat{\sigma}_2^2 = \exp\{2\Delta_n\}\hat{\sigma}_U^2$, where $\hat{\sigma}_U^2$ denotes the sample variance of $U_1, \ldots, U_{n_2}$. Let $\gamma_F(\Delta)$ denote the power function of the F-test. The asymptotic power lemma for the F-test is
asymptotic power lemma for the F test is
Theorem 2.10.1. Assuming that X has a nite fourth moment, with = (
X,4
/
4
1
) 3,
lim
n

F
(
n
) = P(Z z

c
F
) ,
where Z has a standard normal distribution and ecacy
c
F
= 2
_

2
/
_
+ 2 . (2.10.30)
Proof: The conclusion follows directly upon observing

    \sqrt{n}\log F^* = \sqrt{n}(\log\hat{\sigma}_2^2 - \log\hat{\sigma}_1^2)
                     = \sqrt{n}(\log\hat{\sigma}_U^2 + 2(\delta/\sqrt{n}) - \log\hat{\sigma}_1^2)
                     = 2\delta + \sqrt{n/n_2}\,\sqrt{n_2}(\log\hat{\sigma}_U^2 - \log\sigma^2) - \sqrt{n/n_1}\,\sqrt{n_1}(\log\hat{\sigma}_1^2 - \log\sigma^2)

and that the last quantity converges in distribution to a $N(2\delta, (\xi+2)/(\lambda_1\lambda_2))$ variate.
Let $\varphi(u)$ denote a general score function for a folded-aligned rank-based analysis as discussed above. It then follows that the asymptotic relative efficiency of this analysis to the F-test is the ratio of the squares of their efficacies, i.e., $e(S_\varphi, F) = c_\varphi^2/c_F^2$, where $c_\varphi$ is given in expression (2.5.28).

Suppose we use the Fligner-Killeen analysis. Then its efficacy is $c_{FK}$, which is given in expression (2.10.22). The ARE between the Fligner-Killeen analysis and the traditional F-test analysis is the ratio $c_{FK}^2/c_F^2$. In particular, if we assume that the underlying distribution is normal, then by (2.10.23), this ratio is one.
2.11 Behrens-Fisher Problem

Consider the general model in Section 2.1 of this chapter, where $X_1, \ldots, X_{n_1}$ is a random sample on the random variable X which has distribution function $F(x)$ and density function $f(x)$, and $Y_1, \ldots, Y_{n_2}$ is a second random sample, independent of the first, on the random variable Y which has distribution function $G(x)$ and density $g(x)$. Let $\theta_X$ and $\theta_Y$ denote the medians of X and Y, respectively, and let $\Delta = \theta_Y - \theta_X$. In Section 2.4 we showed that the MWW test was consistent for the stochastically ordered alternative. In the location model, where the distributions of X and Y differ by at most a shift in location, the hypothesis $F = G$ is equivalent to the null hypothesis that $\Delta = 0$. In this section we drop the location model assumption; that is, we will assume that X and Y have distribution functions F and G, respectively, but we still consider the null hypothesis that $\Delta = 0$. In order to avoid confusion with Section 2.4, we explicitly state the hypotheses of this section as

    H_0\colon \Delta = 0 \text{ versus } H_A\colon \Delta > 0 , \text{ where } \Delta = \theta_Y - \theta_X , \ \mathcal{L}(X) = F , \text{ and } \mathcal{L}(Y) = G .   (2.11.1)

As in the previous sections, we have selected a specific alternative for the discussion.

The above hypothesis is our most general hypothesis of this section, and the modified Mathisen's test defined below is consistent for it. We will also consider the case where the forms of F and G are the same; that is, $G(x) = F(x/\eta)$, for some parameter $\eta$. Note in this case that $\mathcal{L}(Y) = \mathcal{L}(\eta X)$; hence, $\eta = T(Y)/T(X)$ where $T(X)$ is any scale functional ($T(X) > 0$ and $T(aX) = aT(X)$ for $a \ge 0$). If $T(X) = \sigma_X$, the standard deviation of X, then this is a Behrens-Fisher problem with F unknown. If we further assume that the distributions of X and Y are symmetric, then the modified MWW, defined below, can be used to test that $\Delta = 0$. The most restrictive case is when both F and G are assumed to be normal distribution functions. This is, of course, the classical Behrens-Fisher problem, and the classical solution to it is the Welch type t-test, discussed below. For motivation we first show the behavior of the usual MWW statistic. We then consider general rank procedures and finally specialize to analogues of the $L_1$ and MWW analyses.
2.11.1 Behavior of the Usual MWW Test

In order to motivate the problem, consider the null behavior of the usual MWW test under (2.11.1) with the further restriction that the distributions of X and Y are symmetric. Under $H_0$, since we are examining null behavior, there is no loss of generality if we assume that $\theta_X = \theta_Y = 0$. The asymptotic form of the MWW test rejects $H_0$ in favor of $H_A$ if

    S_R^+ = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} I(Y_j - X_i > 0) \ge \frac{n_1 n_2}{2} + z_\alpha \sqrt{\frac{n_1 n_2 (n+1)}{12}} .

This test would have asymptotic level $\alpha$ if $F = G$. As Exercise 2.13.41 shows, we still have $E_{H_0}(S_R^+) = n_1 n_2/2$ when the densities of X and Y are symmetric. From Theorem 2.4.5, Part
(a), the variance of the MWW statistic under $H_0$ satisfies the limit

    \frac{Var_{H_0}(S_R^+)}{n_1 n_2 (n+1)} \to \lambda_1 Var(F(Y)) + \lambda_2 Var(G(X)) .
Recall that we obtained the asymptotic distribution of $S_R^+$, Theorem 2.4.9, under general conditions which cover the current assumptions; hence, the true significance level of the MWW test has the following limiting behavior:

    \alpha_{S_R^+} = P_{H_0}\left[ S_R^+ \ge \frac{n_1 n_2}{2} + z_\alpha \sqrt{\frac{n_1 n_2 (n+1)}{12}} \right]
                   = P_{H_0}\left[ \frac{S_R^+ - n_1 n_2/2}{\sqrt{Var_{H_0}(S_R^+)}} \ge z_\alpha \sqrt{\frac{n_1 n_2 (n+1)}{12\,Var_{H_0}(S_R^+)}} \right]
                   \to 1 - \Phi\left( z_\alpha (12)^{-1/2} (\lambda_1 Var(F(Y)) + \lambda_2 Var(G(X)))^{-1/2} \right) .   (2.11.2)
Under the assumptions that the sample sizes are the same and that $\mathcal{L}(X)$ and $\mathcal{L}(Y)$ have the same form, we can simplify expression (2.11.2) further. We express the result in the following theorem.

Theorem 2.11.1. Suppose that the null hypothesis in (2.11.1) is true. Assume that the distributions of Y and X are symmetric, $n_1 = n_2$, and $G(x) = F(x/\eta)$ where $\eta$ is an unknown parameter. Then the maximum observed significance level is $1 - \Phi(.816 z_\alpha)$, which is approached as $\eta \to 0$ or $\eta \to \infty$.
Proof: Under the assumptions of the theorem, note that $Var(F(Y)) = \int F^2(\eta t)\,dF(t) - \frac{1}{4}$ and $Var(G(X)) = \int F^2(x/\eta)\,dF(x) - \frac{1}{4}$. Differentiating (2.11.2) with respect to $\eta$, we get

    -\phi\left( z_\alpha (12)^{-1/2} \left( \tfrac{1}{2}Var(F(Y)) + \tfrac{1}{2}Var(G(X)) \right)^{-1/2} \right) z_\alpha (12)^{-1/2}
      \times \left( -\tfrac{1}{2} \right) \left\{ \int F(\eta t)\,t f(\eta t) f(t)\,dt + \int F(t/\eta) f(t/\eta)(-t/\eta^2) f(t)\,dt \right\}
      \times \left( \tfrac{1}{2}Var(F(Y)) + \tfrac{1}{2}Var(G(X)) \right)^{-3/2} .   (2.11.3)

Making the substitution $u = \eta t$ in the first integral, the quantity in braces reduces to $\eta^{-2}\int (F(u) - F(u/\eta))\,u f(u) f(u/\eta)\,du$. Note that the other factors in (2.11.3) are strictly positive. Thus, to determine the graphical behavior of (2.11.2) with respect to $\eta$, we need only consider the factor in braces. First note that it has a critical point at $\eta = 1$. Next consider the case $\eta > 1$. In this case $F(u) - F(u/\eta) < 0$ on the interval $(-\infty, 0)$ and is positive on the interval $(0, \infty)$; hence the factor in braces is positive for $\eta > 1$. Using a similar argument, this factor is negative for $0 < \eta < 1$. Therefore the limit of the function $\alpha_{S_R^+}(\eta)$ is decreasing on the interval $(0, 1)$, has a minimum at $\eta = 1$, and is increasing on the interval $(1, \infty)$.
Thus the minimum level of significance occurs at $\eta = 1$ (the location model), where it is $\alpha$. By the graphical behavior of the function, maximum levels would occur at the extremes of 0 and $\infty$. But it follows that

    Var(F(Y)) = \int F^2(\eta t)\,dF(t) - \frac{1}{4} \to \{ 0 \text{ if } \eta \to 0 ; \ \tfrac{1}{4} \text{ if } \eta \to \infty \}

and

    Var(G(X)) = \int F^2(x/\eta)\,dF(x) - \frac{1}{4} \to \{ \tfrac{1}{4} \text{ if } \eta \to 0 ; \ 0 \text{ if } \eta \to \infty \} .

From these two results and (2.11.2), the true significance level of the MWW test satisfies

    \alpha_{S_R^+} \to 1 - \Phi(z_\alpha (3/2)^{-1/2}) , \text{ whether } \eta \to 0 \text{ or } \eta \to \infty .

Hence,

    \alpha_{S_R^+} \to 1 - \Phi(z_\alpha (3/2)^{-1/2}) = 1 - \Phi(.816 z_\alpha) .

Thus the maximum observed significance level is $1 - \Phi(.816 z_\alpha)$, which is approached as $\eta \to 0$ or $\eta \to \infty$.
For example, if $\alpha = .05$ then $.816 z_\alpha = 1.34$ and $\alpha_{S_R^+} \le 1 - \Phi(1.34) = .09$. Thus, in the equal sample size case when F and G differ only in a scale parameter and are symmetric, the nominal 5% level of the MWW test will not be worse than .09. In order to guarantee that $\alpha \le .05$, choose $z_\alpha$ so that $1 - \Phi(.816 z_\alpha) = .05$. This leads to $z_\alpha = 2.02$, which is the critical value for $\alpha = .02$. Hence another way of saying this is: by performing a 2% MWW test we are guaranteed that the true (asymptotic) level is at most 5%.
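The arithmetic of this bound is immediate in R (our restatement of the numbers above, with $.816 = \sqrt{2/3}$):

    alpha <- 0.05
    1 - pnorm(sqrt(2/3) * qnorm(1 - alpha))   # maximum level, about 0.09
    zadj <- qnorm(1 - alpha) / sqrt(2/3)      # adjusted critical value, about 2.02
    1 - pnorm(zadj)                           # i.e., run the MWW at about the 2% level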
2.11.2 General Rank Tests

Assuming the most general hypothesis, (2.11.1), we follow the development of Fligner and Policello (1981) to construct general tests. Suppose T represents a rank test statistic, used in the case $F = G$, and that the test rejects $H_0\colon \Delta = 0$ in favor of $H_A\colon \Delta > 0$ for large values of T. Suppose further that $n^{1/2}(T - \mu_{F,G})/\sigma_{F,G}$ converges in distribution to a standard normal. Let $\mu_0$ denote the null mean of T and assume that it is independent of F. Next suppose that $\hat{\sigma}$ is a consistent estimate of $\sigma_{F,G}$ which is a function only of the ranks of the combined sample. This will ensure distribution freeness under $H_0$; otherwise, the test statistic will only be asymptotically distribution free. The modified test statistic is

    \widetilde{T} = \frac{n^{1/2}(T - \mu_0)}{\hat{\sigma}} .   (2.11.4)

Such a test can be used for the general hypothesis (2.11.1). Fligner and Policello (1981) applied this approach to Mood's statistic; see also Hettmansperger and Malin (1975). In the next section, we consider Mathisen's test.
2.11.3 Modified Mathisen's Test

We next present a modified version of Mathisen's test for the most general hypothesis (2.11.1). Let $\hat{\theta}_X = \mathrm{med}_i\, X_i$ and define the sign process

    S_2(\theta) = \sum_{j=1}^{n_2} \mathrm{sgn}(Y_j - \theta) .   (2.11.5)

Recall from expression (2.6.8), Section 2.6.2, that Mathisen's test statistic (centered version) is given by $S_2(\hat{\theta}_X)$. This will be our test statistic. The modification lies in its asymptotic distribution, which is given in the next theorem.

Theorem 2.11.2. Assume the null hypothesis in expression (2.11.1) is true. Then, under the assumption (D.1), (2.4.7), $\frac{1}{\sqrt{n_2}} S_2(\hat{\theta}_X)$ is asymptotically normal with mean 0 and asymptotic variance $1 + K_{12}^2$, where $K_{12}^2$ is defined by

    K_{12}^2 = \frac{\lambda_2}{\lambda_1}\,\frac{g^2(\theta_Y)}{f^2(\theta_X)} .   (2.11.6)
Proof: Assume without loss of generality that $\theta_X = \theta_Y = 0$. From the asymptotic linearity results discussed in Example 1.5.2 of Chapter 1, we have that

    \frac{1}{\sqrt{n_2}} S_2(\theta_n) \doteq \frac{1}{\sqrt{n_2}} S_2(0) - 2g(0)\sqrt{n_2}\,\theta_n ,

for $\sqrt{n}|\theta_n| \le c$, $c > 0$. Since $\sqrt{n_2}\,\hat{\theta}_X$ is bounded in probability, upon substitution in the last expression we get

    \frac{1}{\sqrt{n_2}} S_2(\hat{\theta}_X) \doteq \frac{1}{\sqrt{n_2}} S_2(0) - 2g(0)\sqrt{n_2}\,\hat{\theta}_X .   (2.11.7)

In Example 1.5.2, we also have the approximation

    \hat{\theta}_X \doteq \frac{1}{n_1}\,\frac{1}{2f(0)}\,S_1(0) ,   (2.11.8)

where $S_1(0) = \sum_{i=1}^{n_1} \mathrm{sgn}(X_i)$. Combining (2.11.7) and (2.11.8), we get

    \frac{1}{\sqrt{n_2}} S_2(\hat{\theta}_X) \doteq \frac{1}{\sqrt{n_2}} S_2(0) - \frac{g(0)}{f(0)}\sqrt{\frac{n_2}{n_1}}\,\frac{1}{\sqrt{n_1}} S_1(0) .   (2.11.9)

The result follows because of the independent samples and because $S_i(0)/\sqrt{n_i} \to_D N(0, 1)$, for $i = 1, 2$.
In order to use this test we need an estimate of $K_{12}$. As in Chapter 1, selected order statistics from the sample $X_1, \ldots, X_{n_1}$ provide a confidence interval for the median of X. Hence, given a level $\alpha$, the interval $(L_1, U_1)$, where $L_1 = X_{(k+1)}$, $U_1 = X_{(n_1-k)}$, and $k = n_1/2 - z_{\alpha/2}(\sqrt{n_1}/2)$, is an approximate $(1-\alpha)100\%$ confidence interval for the median of X. Let $D_X$ denote the length of this confidence interval. By Theorem 1.5.9 of Chapter 1,

    \frac{\sqrt{n_1}\,D_X}{2 z_{\alpha/2}} \to_P \frac{1}{2f(0)} .   (2.11.10)

In the same way, let $D_Y$ denote the length of the corresponding $(1-\alpha)100\%$ confidence interval for the median of Y. Define

    \widehat{K}_{12} = \frac{D_Y}{D_X} .   (2.11.11)

From (2.11.10) and the corresponding result for $D_Y$, the estimate $\widehat{K}_{12}$ is a consistent estimate of $K_{12}$, under both $H_0$ and $H_A$.
Thus the modified Mathisen's test for the general hypotheses (2.11.1) is to reject $H_0$ at approximately level $\alpha$ if

    Z_M = \frac{S_2(\hat{\theta}_X)}{\sqrt{n_2(1 + \widehat{K}_{12}^2)}} \ge z_\alpha .   (2.11.12)
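The test assembles from pieces we already have. Below is a sketch of ours: it uses the distribution-free order-statistic intervals for $D_X$ and $D_Y$ exactly as described above, with the interpolation refinements of Chapter 1 omitted.

    ## Modified Mathisen test (2.11.12), upper-tailed.
    mod.mathisen <- function(x, y, conf = 0.95) {
      n2 <- length(y)
      S2 <- sum(sign(y - median(x)))                 # S_2(theta-hat_X)
      D <- function(v) {                             # CI length for a median
        n <- length(v); z <- qnorm(1 - (1 - conf) / 2)
        k <- floor(n / 2 - z * sqrt(n) / 2)
        vs <- sort(v); vs[n - k] - vs[k + 1]
      }
      K12 <- D(y) / D(x)                             # K-hat_12 of (2.11.11)
      zM <- S2 / sqrt(n2 * (1 + K12^2))
      c(Z.M = zM, p.value = 1 - pnorm(zM))
    }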
To derive the efficacy of this statistic we use the development of Section 1.5.2. The average to consider is $n^{-1}S_2(\hat{\theta}_X)$. Let $\Delta$ denote the shift in medians and, without loss of generality, let $\theta_X = 0$. Then the mean function we need is

    \lim_{n\to\infty} E_\Delta(n^{-1}S_2(\hat{\theta}_X)) = \mu(\Delta) .

Note that we can re-express the expansion (2.11.9) as

    \frac{1}{n}S_2(\hat{\theta}_X) = \frac{n_2}{n}\,\frac{1}{n_2}S_2(\hat{\theta}_X)
        \doteq \frac{n_2}{n}\left\{ \frac{1}{n_2}S_2(0) - \frac{g(0)}{f(0)}\,\frac{1}{n_1}S_1(0) \right\}
        \to_P \lambda_2\left\{ E_\Delta[\mathrm{sgn}(Y)] - \frac{g(0)}{f(0)}E_\Delta[\mathrm{sgn}(X)] \right\}
        = \lambda_2 E_\Delta[\mathrm{sgn}(Y)] = \mu(\Delta) ,   (2.11.13)

where the next-to-last equality holds since $\theta_X = 0$. Using $E_\Delta(\mathrm{sgn}(Y)) = 1 - 2G(-\Delta)$, we obtain the derivative

    \mu'(0) = 2\lambda_2 g(0) .   (2.11.14)

By Theorem 2.11.2 we have the asymptotic null variance of the test statistic $S_2(\hat{\theta}_X)/\sqrt{n}$. From the above discussion, then, the statistic $S_2(\hat{\theta}_X)$ is Pitman regular with efficacy

    c_{MM} = \frac{2\lambda_2 g(0)}{\sqrt{\lambda_2(1 + K_{12}^2)}} = \sqrt{\lambda_1\lambda_2}\,\frac{2g(0)}{\sqrt{\lambda_1 + \lambda_2(g^2(0)/f^2(0))}} .   (2.11.15)

Using Theorem 1.5.4 of Chapter 1, consistency of the modified Mathisen's test for the hypotheses (2.11.1) is obtained provided $\mu(\Delta) > \mu(0)$. But this follows immediately from the inequality $G(\Delta) > G(0)$.
2.11.4 Modified MWW Test

Recall by Theorem 2.4.9 that the mean of the MWW test statistic $S_R^+$ is $n_1 n_2 P(Y > X) = n_1 n_2 [1 - \int G(x)f(x)\,dx]$. For general F and G, though, $P(Y > X)$ may not be 1/2 under $H_0$. Since this section is concerned with methods for testing the specific hypothesis that $\Delta = 0$, we add the further restriction that the distributions of X and Y are symmetric. Recall from Section 2.11.1 that, under this assumption and $\Delta = 0$, $E(S_R^+) = n_1 n_2/2$; see Exercise 2.13.41.

Using the general development of rank tests, Section 2.11.2, our modified rank test is given by: reject $H_0\colon \Delta = 0$ in favor of $H_A\colon \Delta > 0$ if $Z \ge z_\alpha$, where

    Z = \frac{S_R^+ - (n_1 n_2)/2}{\sqrt{\widehat{Var}(S_R^+)}} ,   (2.11.16)

and $\widehat{Var}(S_R^+)$ is a consistent estimate of $Var(S_R^+)$ under $H_0$. From the asymptotic distribution theory obtained for $S_R^+$ under general conditions, Theorem 2.4.9, it follows that this test has approximate level $\alpha$. By Theorem 2.4.5, we can express the variance as

    Var(S_R^+) = n_1 n_2 \left[ \int G\,dF - \left(\int G\,dF\right)^2 \right]
               + n_1 n_2 (n_1 - 1) \left[ \int F^2\,dG - \left(\int F\,dG\right)^2 \right]
               + n_1 n_2 (n_2 - 1) \left[ \int (1-G)^2\,dF - \left(\int (1-G)\,dF\right)^2 \right] .   (2.11.17)

Following the suggestion of Fligner and Policello (1981), we estimate $Var(S_R^+)$ by replacing F and G by the empirical cdfs $F_{n_1}$ and $G_{n_2}$, respectively. As Exercise 2.13.42 demonstrates, this estimate is consistent and, further, it is a function of the ranks of the combined sample. Thus the test is distribution free when $F(x) = G(x)$ and is asymptotically distribution free when F and G have symmetric densities.
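Plugging the empirical cdfs into (2.11.17) is mechanical; here is a sketch of ours for the one-sided test:

    ## Modified MWW (2.11.16) with the Fligner-Policello variance estimate.
    mod.mww <- function(x, y) {
      n1 <- length(x); n2 <- length(y)
      Fn <- ecdf(x); Gn <- ecdf(y)
      Splus <- sum(outer(y, x, ">"))                 # S_R^+
      GdF <- mean(Gn(x))                             # integrals in (2.11.17),
      F2dG <- mean(Fn(y)^2); FdG <- mean(Fn(y))      #   with F, G replaced by
      H2dF <- mean((1 - Gn(x))^2); HdF <- mean(1 - Gn(x))  # their empirical cdfs
      V <- n1 * n2 * (GdF - GdF^2) +
           n1 * n2 * (n1 - 1) * (F2dG - FdG^2) +
           n1 * n2 * (n2 - 1) * (H2dF - HdF^2)
      z <- (Splus - n1 * n2 / 2) / sqrt(V)
      c(Z = z, p.value = 1 - pnorm(z))
    }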
The efficacy for the modified MWW follows using an argument similar to that for the MWW in Section 2.4. As there, the function $S_R^+(\Delta)$ is a decreasing function of $\Delta$. Its mean function is given by

    E_\Delta(S_R^+) = E_0(S_R^+(-\Delta)) = n_1 n_2 \int (1 - G(x - \Delta)) f(x)\,dx .

The average to consider here is $\bar{S}_R = (n_1 n_2)^{-1} S_R^+$. Letting $\mu(\Delta)$ denote the mean of $\bar{S}_R$ under $\Delta$, we have $\mu'(0) = \int g(x)f(x)\,dx > 0$. The variance we need is $\sigma^2(0) = \lim_{n\to\infty} n\,Var_0(\bar{S}_R)$, which, using the above result on the variance, simplifies to

    \sigma^2(0) = \frac{1}{\lambda_2}\left[ \int F^2\,dG - \left(\int F\,dG\right)^2 \right] + \frac{1}{\lambda_1}\left[ \int (1-G)^2\,dF - \left(\int (1-G)\,dF\right)^2 \right] .
The process $S_R^+(\Delta)$ is Pitman regular and, in particular, its efficacy is given by

    c_{MMWW} = \frac{\sqrt{\lambda_1\lambda_2}\int g(x)f(x)\,dx}{\sqrt{ \lambda_1\left[ \int F^2\,dG - (\int F\,dG)^2 \right] + \lambda_2\left[ \int (1-G)^2\,dF - (\int (1-G)\,dF)^2 \right] }} .   (2.11.18)

As with the modified Mathisen's test, we show consistency of the modified MWW test by using Theorem 1.5.4. Again we need only show that $\mu(0) < \mu(\Delta)$. But this follows immediately provided the supports of F and G overlap in a neighborhood of 0. Note that this shows that the modified MWW is consistent for the hypotheses (2.11.1) under the further restriction that the densities of X and Y are symmetric.
2.11.5 Efficiencies and Discussion

Before obtaining the asymptotic relative efficiencies of the above procedures, we briefly discuss traditional methods. Suppose we restrict F and G to have symmetric densities of the same form with finite variance; that is, $F(x) = F_0((x - \theta_X)/\sigma_X)$ and $G(x) = F_0((x - \theta_Y)/\sigma_Y)$, where $F_0$ is some distribution function with symmetric density $f_0$, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively.

Under these assumptions, it follows that $\sqrt{n}(\bar{Y} - \bar{X} - \Delta)$ converges in distribution to $N(0, (\sigma_X^2/\lambda_1) + (\sigma_Y^2/\lambda_2))$; see Exercise 2.13.43. The test is to reject $H_0\colon \Delta = 0$ in favor of $H_A\colon \Delta > 0$ if $t_W \ge z_\alpha$, where

    t_W = \frac{\bar{Y} - \bar{X}}{\sqrt{ \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2} }} ,

and $s_X^2$ and $s_Y^2$ are the sample variances of the $X_i$ and the $Y_j$, respectively. Under these assumptions, it follows that these sample variances are consistent estimates of $\sigma_X^2$ and $\sigma_Y^2$, respectively; hence, the test has approximate level $\alpha$. If $F_0$ is also normal then, under $H_0$, $t_W$ has an approximate t distribution with a degrees-of-freedom correction proposed by Welch (1949). This test is frequently used in practice and we shall subsequently call it the Welch t-test.
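In R, for instance, the two-sample t.test computes the Welch statistic with its degrees-of-freedom correction by default; the samples x and y here stand for the two data vectors:

    t.test(y, x, alternative = "greater")                    # Welch t (var.equal = FALSE)
    t.test(y, x, alternative = "greater", var.equal = TRUE)  # pooled t, by contrast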
In contrast, the pooled t-test can behave poorly in this situation, since we have

    t_p = \frac{\bar{Y} - \bar{X}}{\sqrt{ \frac{(n_1-1)s_X^2 + (n_2-1)s_Y^2}{n_1 + n_2 - 2}\left( \frac{1}{n_1} + \frac{1}{n_2} \right) }} \doteq \frac{\bar{Y} - \bar{X}}{\sqrt{ \frac{s_X^2}{n_2} + \frac{s_Y^2}{n_1} }} ;

that is, the sample variances are divided by the wrong sample sizes. Hence, unless the sample sizes are fairly close, the pooled t is not asymptotically distribution free. Exercise 2.13.44 obtains the true asymptotic level of $t_p$.
In order to get the efficacy of the Welch t, consider the statistic $\bar{Y} - \bar{X}$. The mean function at $\Delta$ is $\mu(\Delta) = \Delta$; hence, $\mu'(0) = 1$. It follows from the asymptotic distribution discussed above that

    \sqrt{n}\left( \frac{\bar{Y} - \bar{X} - \Delta}{\sqrt{(\sigma_X^2/\lambda_1) + (\sigma_Y^2/\lambda_2)}} \right) \to_D N(0, 1) ;

hence, $\sigma(0) = \sqrt{(\sigma_X^2/\lambda_1) + (\sigma_Y^2/\lambda_2)}$. Thus the efficacy of $t_W$ is given by

    c_{t_W} = \frac{\mu'(0)}{\sigma(0)} = \frac{1}{\sqrt{(\sigma_X^2/\lambda_1) + (\sigma_Y^2/\lambda_2)}} .   (2.11.19)
We obtain the AREs of the above procedures for the case where $G(x) = F(x/\eta)$ and $F(x)$ has density $f(x)$ symmetric about 0 with variance 1. Thus $\eta$ is the ratio of standard deviations $\sigma_Y/\sigma_X$. For this case the efficacies (2.11.15), (2.11.18), and (2.11.19) reduce to

    c_{MM} = \frac{2\sqrt{\lambda_1\lambda_2}\,f(0)}{\sqrt{\lambda_2 + \lambda_1\eta^2}}

    c_{MMWW} = \frac{\sqrt{\lambda_1\lambda_2}\int gf}{\sqrt{ \lambda_1[\int F^2\,dG - (\int F\,dG)^2] + \lambda_2[\int (1-G)^2\,dF - (\int (1-G)\,dF)^2] }}

    c_{t_W} = \frac{\sqrt{\lambda_1\lambda_2}}{\sqrt{\lambda_2 + \lambda_1\eta^2}} .
Thus the ARE between the modified Mathisen's procedure and the Welch procedure is the ratio $c_{MM}^2/c_{t_W}^2 = 4\sigma_X^2 f^2(0) = 4f_0^2(0)$. This is the same ARE as in the location problem. In particular, the ARE does not depend on $\eta = \sigma_Y/\sigma_X$. Thus the modified Mathisen's test in comparison to $t_W$ would have poor efficiency at the normal distribution, .63, but in general it would be much more efficient than $t_W$ for heavy-tailed distributions. Similar to the modified Mathisen's test, the Mood test can also be modified for these problems; see Exercise 2.13.45. Its efficacy is the same as that of the Mathisen's test.

Asymptotic relative efficiencies involving the modified Wilcoxon do depend on the ratio of scale parameters $\eta$. Fligner and Rust (1982) show that if the variances of X and Y are quite different, then the modified Mathisen's test may be as efficient as the modified MWW irrespective of the shape of the underlying distribution.

Fligner and Policello (1981) conducted a simulation study of the pooled t, Welch's t, the MWW, and the modified MWW over situations where F and G differ in scale only. The unmodified tests did not maintain their level. Welch's t performed well when F and G were normal, whereas the modified MWW performed well over all situations, including unequal sample sizes and normal and contaminated normal distributions. In the simulation study performed by Fligner and Rust (1982), they found that the modified Mood test maintains its level over the situations that were considered by Fligner and Policello (1981).

As a final note, Welch's t requires distributions with the same shape and the modified MWW requires symmetric densities. The modified Mathisen's test and the modified Mood test, though, are consistent tests for the general problem stated in expression (2.11.1).
2.12 Paired Designs

Consider the situation where we have two treatments of interest, say, A and B, which can be applied to subjects from a population of interest. Suppose we are interested in a particular response after these treatments have been applied. Let X denote the response of a subject after treatment A has been applied and let Y be the corresponding measurement for a subject after treatment B has been applied. The natural null hypothesis, $H_0$, is that there is no difference in treatment effects. A one-sided alternative would be that the response of a subject under treatment B is in general larger than that of a subject under treatment A. Reversing the roles of A and B would yield the other one-sided alternative, while the union of these two alternatives would result in the two-sided alternative. Again, for definiteness, we choose as our alternative, $H_A$, the first one-sided alternative.

The completely randomized design and the paired design are two experimental designs which are often employed in this situation. In the completely randomized design, n subjects are selected at random from the population of interest and $n_1$ of them are randomly assigned to treatment A while the remaining $n_2 = n - n_1$ are assigned to treatment B. At the end of the treatment period, we then have two samples, one on X while the other is on Y. The two-sample procedures discussed in the previous sections can be used to analyze the data. Proper randomization along with carefully controlled experimental conditions give credence to the assumptions that the samples are random and are independent of one another. The design that produced the data of Example 2.3.1 was a completely randomized design.

While the completely randomized design is often used in practice, the underlying variability may impair the power of any procedure, robust or classical, to detect alternative hypotheses. The design discussed next usually results in a more powerful analysis, but it does require a pairing device; i.e., a block of length two.

Suppose we have a pairing device. Some examples include identical twins for a study on human subjects, litter mates for a study on animal subjects, or the same exterior wall of a house for a study on the durability of exterior house paints. In the paired design, n pairs of subjects are randomly selected from the population of interest. Within each pair, one member is randomly assigned to treatment A while the other receives treatment B. Again let X and Y denote the responses of subjects after treatments A and B, respectively, have been applied. This experimental design results in a sample of pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$. The sample differences $D_1 = X_1 - Y_1, \ldots, D_n = X_n - Y_n$, however, become the single sample of interest. Note that the random pairing in this design induces, under the null hypothesis, a symmetric distribution for the differences.
Theorem 2.12.1. In a randomized paired design, under the null hypothesis of no treatment effect, the differences $D_i$ are symmetrically distributed about 0.

Proof: Let $F(x, y)$ denote the joint distribution of $(X, Y)$. Under the null hypothesis of no treatment effect and randomized pairing, it follows that X and Y are exchangeable random variables; that is, $P(X \le x, Y \le y) = P(X \le y, Y \le x)$. Hence, for a difference $D = Y - X$ we have

    P[D \le t] = P[Y - X \le t] = P[X - Y \le t] = P[-D \le t] .

Thus D and $-D$ have the same distribution; hence D is symmetrically distributed about 0.
Let $\theta$ be a location functional for the distribution of $D_i$. We shall further assume that $D_i$ is symmetrically distributed under alternative models also. Then we can express the above hypotheses by $H_0\colon \theta = 0$ versus $H_A\colon \theta > 0$.

Note that the one-sample analyses based on signs and signed ranks discussed in Chapter 1 are appropriate for the randomly paired design. The appropriate sign test statistic is $S = \sum \mathrm{sgn}(D_i)$, while the signed-rank statistic is $T = \sum \mathrm{sgn}(D_i)R(|D_i|)$.

From Chapter 1 we summarize the analysis based on the signed-rank statistic. A level $\alpha$ test would reject $H_0$ in favor of $H_A$ if $T \ge c_\alpha$, where $c_\alpha$ is determined from the null distribution of the Wilcoxon signed-rank test or from the asymptotic approximation to the distribution. The test is consistent for $\theta > 0$ and it has the efficiency results discussed in Chapter 1. In particular, for normal errors the efficiency of T with respect to the usual paired t-test is .955. The associated point estimate of $\theta$ is the Hodges-Lehmann estimate given by $\hat{\theta} = \mathrm{med}_{i\le j}(D_i + D_j)/2$. A distribution-free confidence interval for $\theta$ is constructed based on the Walsh averages $(D_i + D_j)/2$, $i \le j$, as discussed in Chapter 1. Instead of using Wilcoxon scores, general signed-rank scores as discussed in Chapter 1 can also be used.

A similar summary holds for the analysis based on the sign statistic. In fact, for the sign scores we need not assume that $D_1, \ldots, D_n$ are identically distributed; that is, there can be a block effect. This is discussed further in Chapter 4.

We should mention that if the pairing is not done randomly, then $D_i$ may or may not be symmetrically distributed. If the symmetry assumption is realistic, then both sign and signed-rank analyses can be used. If, however, it is not realistic, then the sign analysis would still be valid, but caution would be necessary in interpreting the results of the signed-rank analysis.
Example 2.12.1. Darwin Data.

The data, Table 2.12.1, are some measurements recorded by Charles Darwin in 1878. They consist of 15 pairs of heights in inches of cross-fertilized plants and self-fertilized plants (Zea mays), each pair grown in the same pot.

Table 2.12.1: Plant Growth
    Pot     1       2       3       4       5       6       7       8
    Cross-  23.500  12.000  21.000  22.000  19.125  21.500  22.125  20.375
    Self-   17.375  20.375  20.000  20.000  18.375  18.625  18.625  15.250
    Pot     9       10      11      12      13      14      15
    Cross-  18.250  21.625  23.250  21.000  22.125  23.000  12.000
    Self-   16.500  18.000  16.250  18.000  12.750  15.500  18.000

RBR Results for Darwin Data

Results for the Wilcoxon-Signed-Rank procedure
Test of theta = 0 versus theta not equal to 0
Test-Stat. is T 72 Standardized (z) Test-Stat. is 2.016 p-vlaue 0.043
Estimate 3.1375 SE is 1.244385
95 % Confidence Interval is ( 0.5 , 5.2125 )
Estimate of the scale parameter tau 4.819484

Results for the Sign procedure
Test of theta = 0 versus theta not equal to 0
Test stat. S is 11 Standardized (z) Test-Stat. 2.581 p-vlaue 0.009
Estimate 3 SE is 1.307422
95 % Confidence Interval is ( 1 , 6.125 )
Estimate of the scale parameter tau 5.063624
Let $D_i$ denote the difference between the heights of the cross-fertilized and self-fertilized plants of the ith pot and let $\theta$ denote the median of the distribution of $D_i$. Suppose we are interested in testing for an effect; that is, the hypotheses are $H_0\colon \theta = 0$ versus $H_A\colon \theta \neq 0$. The boxplot of the differences is displayed in Panel A of Figure 2.12.1, while Panel B gives the normal q-q plot of the differences. As the plots indicate, the differences for Pot 2 and, perhaps, Pot 15 are possible outliers. The results from the RBR functions onesampwil and onesampsgn are shown above. The value of the signed-rank Wilcoxon statistic for these data is $T = 72$ with the approximate p-value of .044. The corresponding estimate of $\theta$ is 3.14 inches and the 95% confidence interval is (.50, 5.21).

There are 13 positive differences, so the standardized value of the sign test statistic is 2.58, with the p-value of 0.01. The corresponding estimate of $\theta$ is 3 inches and the 95% interpolated confidence interval is (1.00, 6.13). The paired t-test statistic has the value of 2.15 with p-value 0.050. The difference in sample means is 2.62 inches and the corresponding 95% confidence interval is (0, 5.23). Note that the outliers impaired the t-test and, to a lesser degree, the Wilcoxon signed-rank test; see Exercise 2.13.46 for further analyses.
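Since the data appear in Table 2.12.1, the analysis is easy to cross-check in base R (our illustration, not part of the RBR output). Note that wilcox.test reports $V$, the sum of the ranks of the positive differences, which relates to the text's T by $T = 2V - n(n+1)/2$; here $V = 96$ and so $T = 72$.

    cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125, 20.375,
               18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
    self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625, 15.250,
               16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)
    d <- cross - self
    wilcox.test(d, conf.int = TRUE)     # signed-rank test, HL estimate, and CI
    binom.test(sum(d > 0), length(d))   # sign test: 13 of the 15 differences positive
    t.test(d)                           # paired t, for comparison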
Figure 2.12.1: Boxplot of Darwin Data (paired differences).

2.12.1 Behavior under Alternatives

In this section we compare sample size determination for the paired design with sample size determination for the completely randomized design. For the paired design, let $\gamma^+(\theta)$ denote the power function of the Wilcoxon signed-rank test statistic for the alternative $\theta$. Then the asymptotic power lemma, Theorem 1.5.8, with $c = \tau_1^{-1} = \sqrt{12}\int f^2(t)\,dt$ for the signed-rank Wilcoxon from Chapter 1, states that at significance level $\alpha$ and under the sequence of contiguous alternatives $\theta_n = \theta/\sqrt{n}$,

    \lim_{n\to\infty} \gamma^+(\theta_n) = P\left( Z \ge z_\alpha - \theta/\tau_1 \right) .
We will only consider the case where the random vector $(Y, X)$ is jointly normal with variance-covariance matrix

    V = \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} .

Then $\tau_1 = \sqrt{\pi/3}\,\sqrt{2(1-\rho)}\,\sigma$.

Now suppose we select the sample size $n^*$ so that the Wilcoxon signed-rank test has power $\gamma^+(\theta_0)$ to detect the one-sided alternative $\theta_0 > 0$ for a level $\alpha$ test. Then, writing $\theta_0 = \theta^*/\sqrt{n^*}$, we have by the asymptotic power lemma and (1.5.25) that

    \gamma^+(\theta_0) \doteq 1 - \Phi(z_\alpha - \sqrt{n^*}\theta_0/\tau_1) ,

and

    n^* \doteq \frac{(z_\alpha - z_{\gamma^+(\theta_0)})^2\,\tau_1^2}{\theta_0^2} .

Substituting the value of $\tau_1$ into this final equation, we have that the necessary sample size for the paired design to achieve the desired local power is

    n^* \doteq \frac{(z_\alpha - z_{\gamma^+(\theta_0)})^2}{\theta_0^2}\,(\pi/3)\,\sigma^2\,2(1-\rho) .   (2.12.1)
Next consider a two-sample design with equal sample sizes $n_i = n^*$. Assume that X and Y are iid normal with variance $\sigma^2$. Then $\tau^2 = (\pi/3)\sigma^2$. Hence, by (2.4.25), the necessary sample size for the completely randomized design to achieve power $\gamma^+(\theta_0)$ at the one-sided alternative $\theta_0 > 0$ for a level $\alpha$ test is given by

    n^* = \left( \frac{z_\alpha - z_{\gamma^+(\theta_0)}}{\theta_0} \right)^2 2(\pi/3)\sigma^2 .   (2.12.2)

Based on expressions (2.12.1) and (2.12.2), the sample size needed for the paired design is $(1-\rho)$ times the sample size needed for the completely randomized design. If the pairing device is such that X and Y are strongly, positively correlated, then it pays to use the paired design. The paired design is a disaster, of course, if the variables are negatively correlated.
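For concreteness, here is the arithmetic of (2.12.1) and (2.12.2) as R functions (our sketch; note that $z_{\gamma^+(\theta_0)} = -\Phi^{-1}(\mathrm{power})$, so $z_\alpha - z_{\gamma^+(\theta_0)}$ is qnorm(1 - alpha) + qnorm(power)):

    n.paired <- function(theta0, sigma, rho, alpha = 0.05, power = 0.80)
      (qnorm(1 - alpha) + qnorm(power))^2 * (pi/3) * sigma^2 * 2 * (1 - rho) / theta0^2
    n.crd <- function(theta0, sigma, alpha = 0.05, power = 0.80)
      (qnorm(1 - alpha) + qnorm(power))^2 * 2 * (pi/3) * sigma^2 / theta0^2
    n.paired(1, 1, rho = 0.5) / n.crd(1, 1)   # ratio is 1 - rho = 0.5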
2.13 Exercises
2.13.1. (a). Derive the $L_2$ estimates of intercept and shift based on the $L_2$ norm on Model (2.2.4).

(b). Next apply the pseudo-norm, (2.2.16), to (2.2.4) and derive the estimating function. Show that the natural test statistic is the pooled t-statistic.

2.13.2. Show that (2.2.17) is a pseudo-norm. Show, also, that it can be written in terms of ranks; see the formula following (2.2.17).
2.13.3. In the proof of Theorem 2.4.2, verify that $\mathcal{L}(Y_j - X_i) = \mathcal{L}(X_i - Y_j)$.

2.13.4. Prove Theorem 2.4.3.

2.13.5. Prove that if a continuous random variable Z has cdf $H(z)$, then the random variable $H(Z)$ has a uniform distribution on $(0, 1)$.

2.13.6. In Theorem 2.4.4, show that $E(F(Y)) = \int F(y)\,dG(y) = \int (1 - G(x))\,dF(x) = E(1 - G(X))$.
2.13.7. Prove that if $Z_n$ converges in distribution to Z and if $Var(Z_n - W_n)$ and $E(Z_n) - E(W_n)$ converge to 0, then $W_n$ also converges in distribution to Z.

2.13.8. Verify (2.4.10).
2.13.9. Explain what happens to the MWW statistic when one support is shifted completely to the right of the other support. What does this imply about the consistency of the MWW in this case?

2.13.10. Show that the $L_2$ estimating function is Pitman regular and derive the efficacy of the pooled t-test. Also, establish the asymptotic power lemma, Theorem 2.4.13, for the $L_2$ case. Finally, establish the asymptotic distribution of $\sqrt{n}(\bar{Y} - \bar{X})$.
2.13.11. Prove that the Hodges-Lehmann estimate of shift, (2.2.18), is translation and scale equivariant. (See the discussion in Section 2.4.4.)

2.13.12. Prove Theorem 2.4.15.

2.13.13. In Example 2.4.1, form the residuals $Z_i - \hat{\Delta}c_i$, $i = 1, \ldots, n$. Then, similar to Section 1.5.5, use these residuals to estimate $\tau$ based on (1.3.30).
2.13.14. Simulate independent random samples from $N(20, 5^2)$ and $N(22, 5^2)$ distributions of sizes 10 and 15, respectively. Let $\Delta$ denote the shift in the locations of the distributions.

(a.) Obtain comparison boxplots for your samples.

(b.) Use the Wilcoxon procedure to test $H_0\colon \Delta = 0$ versus $H_A\colon \Delta \neq 0$ at level .05.
(c.) Use the Wilcoxon procedure to estimate $\Delta$ and obtain a 95% confidence interval for it.

(d.) Obtain the true value of $\tau$. Use your confidence interval in the last item to obtain an estimate of $\tau$. Obtain a symmetric 95% confidence interval for $\Delta$ based on your estimate.

(e.) Form a pooled estimate of $\tau$ based on the Wilcoxon signed-rank process for each sample. Obtain a symmetric 95% confidence interval for $\Delta$ based on your estimate. Compare it with the estimate from the last item and the true value.

2.13.15. Write minitab macros to bootstrap the distribution of $\hat{\Delta}$. Obtain the bootstrap distribution for 500 bootstraps of the data of Problem 2.13.14. What is your bootstrap estimate of $\tau$? Compare with the true value and the other estimates.
2.13.16. Verify the scalar multiple condition for the pseudo-norm in the proof of Theorem 2.5.1.

2.13.17. Verify (2.5.9) and (2.5.10).

2.13.18. Consider the process $S_\varphi(\Delta)$, (2.5.11):

(a). Show that $S_\varphi(\Delta)$ is a decreasing step function, with steps occurring at $Y_j - X_i$.

(b). Using Part (a) and the MWW estimator as a starting value, write with some details an algorithm which obtains the estimator $\hat{\Delta}_\varphi$.

(c). Verify expressions (2.5.14), (2.5.15), and (2.5.16).
2.13.19. Consider the optimal score function (2.5.22):

(a). Show it is location invariant and scale equivariant. Hence, show that if $g(x) = \frac{1}{\sigma}f(\frac{x}{\sigma})$, then $\varphi_g = \varphi_f$.

(b). Use (2.5.22) to show that the MWW is asymptotically efficient when the underlying distribution is logistic ($F(x) = (1 + \exp(-x))^{-1}$, $-\infty < x < \infty$).

(c). Show that (2.6.1) is optimal for a Laplace or double exponential distribution ($f(x) = \frac{1}{2}\exp(-|x|)$, $-\infty < x < \infty$).

(d). Show that the optimal score function for the extreme value distribution ($f(x) = \exp\{x - e^x\}$, $-\infty < x < \infty$) is given by (2.8.8).

(e). Show that the optimal score function for the normal distribution is given by (2.5.33). Show that it is standardized.

(f). Show that (2.5.34) is the optimal score function for an underlying distribution that has a left logistic tail and a right exponential tail.
2.13.20. Show that when the underlying density f is symmetric, then $\varphi_f(1-u) = -\varphi_f(u)$.

2.13.21. Show that expression (2.6.6) is true and that the $n = 2r$ differences,

    Y_{(1)} - X_{(r)} < Y_{(2)} - X_{(r-1)} < \cdots < Y_{(n_2)} - X_{(r-n_2+1)} ,

can be ordered only knowing the order statistics from the individual samples.

2.13.22. Develop the asymptotic linearity formula for Mood's estimating function given in (2.6.3). Then give an alternative proof of Theorem 2.6.1 based on this result.

2.13.23. Verify the moment formulas (2.6.9) and (2.6.10).
2.13.24. Show that any estimator based on the pseudo-norm (2.5.2) is equivariant. Hence, if we multiply the combined sample observations by a constant, then the estimator is multiplied by that same constant.

2.13.25. Suppose X is a continuous random variable representing the time until failure of some process. The hazard function for a continuous random variable X with cdf F is defined to be the instantaneous rate of failure at $X = t$, conditional on survival to time t. It is formally given by:

    h_X(t) = \lim_{\delta t \to 0^+} \frac{P(t \le X < t + \delta t \mid X \ge t)}{\delta t} .

(a). Show that

    h_X(t) = \frac{f(t)}{1 - F(t)} .

(b). Suppose that Y has cdf given by (2.8.1). Show that the hazard function of Y is given by $h_Y(t) = \alpha h_X(t)$.
2.13.26. Verify (2.8.4).

2.13.27. Apply the delta method of finding the asymptotic distribution of a function to (2.8.3) to find the asymptotic distribution of $\hat{\alpha}$. Then verify (2.8.5). Explain how this can be used to find an approximate $(1-\alpha)100\%$ confidence interval for $\alpha$.

2.13.28. Verify (2.8.14).

2.13.29. Show that the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model is 3/4.

2.13.30. Verify (2.10.5).

2.13.31. Show that if $|X|$ has an $F(2, 2)$ distribution, then $\log|X|$ has a logistic distribution.

2.13.32. Suppose $f(t)$ is the logistic pdf. Show that the optimal scores function, (2.10.6), is given by $\varphi(u) = u\log[(u+1)/(1-u)]$.
2.13.33. (a). Verify (2.10.6).

(b). Apply (2.10.6) to the normal distribution.

(c). Apply (2.10.6) to the Laplace or double exponential distribution.

2.13.34. We consider the Siegel-Tukey (1960) test for the equality of variances when the underlying centers are equal but possibly unknown. The test statistic is the sum of the ranks of the Y sample in the combined sample (MWW statistic). However, the ranks are assigned in a different way: in the ordered combined sample, assign rank 1 to the smallest value, rank 2 to the largest value, rank 3 to the second largest value, rank 4 to the second smallest value, and so on, alternately assigning ranks to end values. To test $H_0\colon \mathrm{var}\,X = \mathrm{var}\,Y$ vs $H_A\colon \mathrm{var}\,X > \mathrm{var}\,Y$, reject $H_0$ when the sum of the ranks of the Y sample is large. Find the mean, variance, and the limiting distribution of the test statistic. Show how to find an approximate size $\alpha$ test.

2.13.35. Develop a sample size formula for the scale problem similar to the sample size formula in the location problem, (2.4.25).

2.13.36. Verify (??).

2.13.37. Compute the efficacy of Mood's scale test, the Ansari-Bradley scale test, and Klotz's scale test discussed in Section ??.

2.13.38. Verify the asymptotic properties given in (2.10.26), (2.10.27), and (2.10.28).

2.13.39. Compute the efficiency of Mood's scale test and the Ansari-Bradley scale test relative to the classical F-test for equality of variances.

2.13.40. Show that the Ansari-Bradley scale test is optimal for $f(x) = \frac{1}{2}(1 + |x|)^{-2}$, $-\infty < x < \infty$.

2.13.41. Show that when F and G have densities symmetric at 0 (or any common point), the expected value of $S_R^+$ is $n_1 n_2/2$.

2.13.42. Show that the estimate of (2.11.17) based on the empirical cdfs is consistent and that it is a function only of the combined sample ranks.

2.13.43. Under the general model in Section 2.11.5, derive the limiting distribution of $\sqrt{n}(\bar{Y} - \bar{X})$.

2.13.44. Find the true asymptotic level of the pooled t-test under the null hypothesis in (2.11.1).

2.13.45. Develop a modified Mood's test similar to the modified Mathisen's test discussed in Section 2.11.5.
2.13.46. Construct and discuss a normal quantile plot of the differences from Table 2.12.1. Carry out the Boos test for asymmetry (??). Why do these results suggest that the $L_1$ analysis may be the best analysis in this example?

2.13.47. Consider the data set of information on professional baseball players given in Exercise 1.12.32. Let $\Delta$ denote the shift parameter of the difference between the height of a pitcher and the height of a hitter.

(a.) Obtain comparison dotplots between the heights of the pitchers and hitters. Does a shift model seem appropriate?

(b.) Use the MWW test statistic to test the hypotheses $H_0\colon \Delta = 0$ versus $H_A\colon \Delta > 0$. Compute the p-value.

(c.) Determine a point estimate for $\Delta$ and a 95% confidence interval for $\Delta$ based on the MWW procedure.

(d.) Obtain an estimate of the standard deviation of $\hat{\Delta}$. Use it to obtain an approximate 95% confidence interval for $\Delta$.

2.13.48. Repeat Exercise 2.13.47 when $\Delta$ is the shift parameter for the difference in pitchers' and hitters' weights.

2.13.49. Repeat Exercise 2.13.47 when $\Delta$ is the shift parameter for the difference in left-handed (A-1) and right-handed (A-0) pitchers' ERAs and the hypotheses are $H_0\colon \Delta = 0$ versus $H_A\colon \Delta \neq 0$.
Chapter 3
Linear Models
3.1 Introduction
In this chapter we discuss the theory for a rank-based analysis of a general linear model. Applications of this analysis to experimental design models will be discussed in Chapter 4. The rank-based analysis is complete, consisting of estimation, testing, and diagnostic tools for checking the adequacy of fit of the model, outlier detection, and detection of influential cases. As in the earlier chapters, we present the analysis in terms of its geometry.

The analysis could be based on either rank scores or signed-rank scores. We have chosen to use the general rank scores of Chapter 2. This allows the error distribution to be either asymmetric or symmetric. An analysis based on signed-rank scores would parallel the one based on rank scores, except that the theory would require a symmetric error distribution; see Hettmansperger and McKean (1983) for discussion. Although the results are established for general score functions, we illustrate the methods with Wilcoxon and sign scores throughout. We will commonly use the subscripts R and S for results based on Wilcoxon and sign scores, respectively.
3.2 Geometry of Estimation and Tests
For $i = 1, \ldots, n$, let $Y_i$ denote the ith observation and let $x_i$ denote a $p \times 1$ vector of explanatory variables. Consider the linear model

    Y_i = x_i'\beta + e_i^* ,   (3.2.1)

where $\beta$ is a $p \times 1$ vector of unknown parameters. In this chapter, the components of $\beta$ are the parameters of interest. We are interested in estimating $\beta$ and testing linear hypotheses concerning it. However, it will be convenient to also have a location parameter. So accordingly let $\alpha = T(e_i^*)$ be a location functional. One that we will frequently use is the median. Let $e_i = e_i^* - \alpha$; then $T(e_i) = 0$ and the model can be written as

    Y_i = \alpha + x_i'\beta + e_i .   (3.2.2)
The parameter $\alpha$ is called an intercept parameter. An argument similar to the one concerning the shift parameter $\Delta$ of Chapter 2 shows that $\beta$ does not depend on the location functional used.

Let $Y = (Y_1, \ldots, Y_n)'$ denote the $n \times 1$ vector of observations and let X denote the $n \times p$ matrix whose ith row is $x_i'$. We can then express the model as

    Y = 1\alpha + X\beta + e ,   (3.2.3)

where 1 is an $n \times 1$ vector of ones, and $e' = (e_1, \ldots, e_n)$. Since the model includes an intercept parameter, $\alpha$, there is no loss in generality in assuming that X is centered; i.e., the columns of X sum to 0. Further, in this chapter, we will assume that X has full column rank p. Let $\Omega_F$ denote the column space spanned by the columns of X. Note that we can then write the model as

    Y = 1\alpha + \eta + e , \text{ where } \eta \in \Omega_F .   (3.2.4)

This model is often called the coordinate-free model.
Besides estimation of the regression coefficients, we are interested in tests of general linear hypotheses of the form

    H_0\colon M\beta = 0 \text{ versus } H_A\colon M\beta \neq 0 ,   (3.2.5)

where M is a $q \times p$ matrix of full row rank. In this section, we discuss the geometry of estimation and testing with rank-based procedures for the linear model.
3.2.1 Estimation
With respect to model (3.2.4), we estimate $\eta$ by minimizing the distance between Y and the subspace $\Omega_F$. In this chapter we define distance in terms of the norms or pseudo-norms presented in Chapter 2. Consider, first, the general R pseudo-norm discussed in Chapter 2, which is given by expression (2.5.2) and which we write for convenience as

    \|v\|_\varphi = \sum_{i=1}^n a(R(v_i))\,v_i ,   (3.2.6)

where $a(1) \le a(2) \le \cdots \le a(n)$ is a set of scores generated as $a(i) = \varphi(i/(n+1))$ for some nondecreasing score function $\varphi(u)$ defined on the interval $(0, 1)$ and standardized such that $\int \varphi(u)\,du = 0$ and $\int \varphi^2(u)\,du = 1$. This was shown to be a pseudo-norm in Chapter 2. Recall that the Wilcoxon pseudo-norm is generated by the linear score function $\varphi(u) = \sqrt{12}(u - 1/2)$. We will also discuss the sign pseudo-norm, which is generated by $\varphi(u) = \mathrm{sgn}(u - 1/2)$, and show that it is equivalent to using the $L_1$ norm. In Section 3.10 we will also discuss a class of score functions appropriate for survival-type analyses.

For the general R pseudo-norm given above by (3.2.6), an R-estimate of $\eta$ is a vector $\widehat{Y}_\varphi$ such that

    D_\varphi(Y, \Omega_F) = \|Y - \widehat{Y}_\varphi\|_\varphi = \min_{\eta \in \Omega_F} \|Y - \eta\|_\varphi .   (3.2.7)
Figure 3.2.1: The R-estimate of $\eta$ is a vector $\widehat{Y}_\varphi$ which minimizes the normed differences, (3.2.6), between Y and $\Omega_F$. The distance between Y and the space $\Omega_F$ is $D_\varphi(Y, \Omega_F)$.
These quantities are represented geometrically in Figure 3.2.1.

Once $\eta$ has been estimated, $\beta$ can be estimated by solving the equation $X\beta = \widehat{Y}_\varphi$; that is, the R-estimate of $\beta$ is $\hat{\beta}_\varphi = (X'X)^{-1}X'\widehat{Y}_\varphi$. As discussed later in Section 3.7, the intercept $\alpha$ can be estimated by a location estimate based on the residuals $\hat{e} = Y - \widehat{Y}_\varphi$. One that we will frequently use is the median of the residuals, which we denote as $\hat{\alpha}_S = \mathrm{med}\{Y_i - x_i'\hat{\beta}_\varphi\}$. Theorem 3.5.7 shows, under regularity conditions, that

    (\hat{\alpha}_S, \hat{\beta}_\varphi')' \text{ has an approximate } N_{p+1}\left( (\alpha, \beta')',\ \begin{pmatrix} n^{-1}\tau_S^2 & 0' \\ 0 & \tau_\varphi^2(X'X)^{-1} \end{pmatrix} \right) \text{ distribution} ,   (3.2.8)

where $\tau_\varphi$ and $\tau_S$ are the scale parameters defined in displays (3.4.4) and (3.4.6), respectively. From this result, an asymptotic confidence interval for the linear function $h'\beta$ is given by

    h'\hat{\beta}_\varphi \pm t_{(\alpha/2,\, n-p-1)}\,\hat{\tau}_\varphi \sqrt{h'(X'X)^{-1}h} ,   (3.2.9)

where the estimate $\hat{\tau}_\varphi$ is discussed in Section 3.7.1. The use of t-critical values instead of z-critical values is documented in the small sample studies cited in Section 3.7. Note the close analogy between this confidence interval and those based on LS estimates. The only difference is that $\hat{\sigma}$ has been replaced by $\hat{\tau}_\varphi$.
We will make use of the coordinate-free model, especially in Chapter 4; however, in this chapter we are primarily concerned with the properties of the estimator $\hat{\beta}_\varphi$, and it will be more convenient to use the coordinate model (3.2.3). Define the dispersion function by

    D_\varphi(\beta) = \|Y - X\beta\|_\varphi .   (3.2.10)

Then $D_\varphi(\hat{\beta}_\varphi) = D_\varphi(Y, \Omega_F) = \|Y - \widehat{Y}_\varphi\|_\varphi$ is the R-distance between Y and the subspace $\Omega_F$. It is also the residual dispersion.
Because $D_\varphi$ is expressed in terms of a norm, it is a continuous and convex function of $\beta$; see Exercise 1.12.3. Exercise 3.16.2 shows that the ranks of the residuals can only change at the boundaries of the regions defined by the ${n \choose 2}$ equations $y_i - x_i'\beta = y_j - x_j'\beta$. Note that in the simple linear regression case, these equations define the sample slopes $(Y_j - Y_i)/(x_j - x_i)$. Hence, in the interior of these regions the ranks are constant. Therefore, $D_\varphi(\beta)$ is a piecewise linear, continuous, convex function of $\beta$ with gradient (defined almost everywhere) given by

    \nabla D_\varphi(\beta) = -S_\varphi(Y - X\beta) ,   (3.2.11)

where

    S_\varphi(Y - X\beta) = X'a(R(Y - X\beta)) ,   (3.2.12)

and $a(R(Y - X\beta))' = (a(R(Y_1 - x_1'\beta)), \ldots, a(R(Y_n - x_n'\beta)))$. Thus $\hat{\beta}_\varphi$ solves the equations

    S_\varphi(Y - X\hat{\beta}_\varphi) = X'a(R(Y - X\hat{\beta}_\varphi)) \doteq 0 ,   (3.2.13)

which are called the R normal equations. A quadratic form in $S_\varphi(Y - X\beta_0)$ serves as the gradient R-test statistic for testing $H_0\colon \beta = \beta_0$ versus $H_A\colon \beta \neq \beta_0$.
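To make the geometry concrete, the dispersion (3.2.10) and its minimization fit in a few lines of R. The sketch below is ours: because $D_\varphi$ is convex and piecewise linear, a general-purpose optimizer started at the LS fit suffices for small problems; production implementations (the authors' RBR functions and, e.g., the CRAN package Rfit) use dedicated algorithms and should be preferred in practice.

    ## Wilcoxon dispersion D(beta) of (3.2.10) and a bare-bones R-fit.
    wdisp <- function(beta, X, y) {
      e <- as.numeric(y - X %*% beta)                    # residuals (alpha drops out)
      a <- sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)  # Wilcoxon scores a(R(e_i))
      sum(a * e)
    }
    rfit.sketch <- function(X, y) {
      X <- as.matrix(scale(X, scale = FALSE))            # center X, as in (3.2.3)
      b <- optim(coef(lm(y ~ X))[-1], wdisp, X = X, y = y)$par
      list(betahat = b,                                  # minimizer of D(beta)
           alphahat = median(y - X %*% b))               # median-residual intercept
    }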
In terms of the simple regression problem, $S_\varphi(\beta)$ is a decreasing step function of $\beta$, which steps down at each sample slope. There may be an interval of solutions of $S_\varphi(\beta) = 0$, or $S_\varphi(\beta)$ may step across the horizontal axis. Let $\hat{\beta}_\varphi$ denote any point in the interval in the former case and the crossing point in the latter case. The gradient test statistic is $S_\varphi(\beta_0) = \sum x_i a(R(y_i - x_i\beta_0))$. If the x's are distinct and equally spaced, then for Wilcoxon scores this test statistic is equivalent to the test for correlation based on Spearman's $r_S$; see Exercise 3.16.4.
For the asymptotic distribution theory of estimation and testing, we note that the estimate is location and scale equivariant. Let $\hat{\beta}_\varphi(Y)$ denote the R-estimate for the linear model (3.2.3). Then, as shown in Exercise 3.16.6, $\hat{\beta}_\varphi(Y + X\delta) = \hat{\beta}_\varphi(Y) + \delta$ and $\hat{\beta}_\varphi(kY) = k\hat{\beta}_\varphi(Y)$. In particular, these results imply, without loss of generality, that the theory developed in the following sections can be accomplished under the assumption that the true $\beta$ is 0.
As a final note, we outline the least squares estimates. The LS estimate of $\eta$ in model (3.2.4) is given by

    \widehat{Y}_{LS} = \mathrm{Argmin}_{\eta \in \Omega_F} \|Y - \eta\|^2_{LS} ,

where $\|\cdot\|_{LS}$ denotes the least squares pseudo-norm given by (2.2.16) of Chapter 2. The value of $\eta$ which minimizes this pseudo-norm is

    \widehat{Y}_{LS} = HY ,   (3.2.14)

where H is the projection matrix onto the space $\Omega_F$; i.e., $H = X(X'X)^{-1}X'$. Denote the sum of squared residuals by $SSE = \min_{\eta \in \Omega_F} \|Y - \eta\|^2_{LS} = \|(I - H)Y\|^2_{LS}$. In order to have similar notation, we shall denote this minimum by $D^2_{LS}(Y, \Omega_F)$. Also, it is easy to show that the least squares estimate of $\beta$ is $\hat{\beta}_{LS} = (X'X)^{-1}X'Y$.
3.2.2 The Geometry of Testing

We next discuss the geometry behind rank-based tests of the general linear hypotheses given by (3.2.5). As above, consider the model (3.2.4),

    Y = 1\alpha + \eta + e , \text{ where } \eta \in \Omega_F ,   (3.2.15)

and $\Omega_F$ is the column space of the full model design matrix X. Let $\widehat{Y}_{\varphi,\Omega_F}$ denote the R-fitted value in the full model. Note that $D_\varphi(Y, \Omega_F)$ is the amount of residual dispersion not accounted for in fitting the Model (3.2.4). These are shown geometrically in Figure 3.2.2.
Next let $\omega$ denote the subspace of $\Omega_F$ subject to $H_0$. In symbols, $\omega = \{\eta \in \Omega_F : \eta = X\beta, \text{ for some } \beta \text{ such that } M\beta = 0\}$. In Exercise 3.16.7 the reader is asked to show that $\omega$ is a subspace of $\Omega_F$ of dimension $p - q$. Let $\widehat{Y}_{\varphi,\omega}$ denote the R-estimate of $\eta$ when the reduced model is fit and let $D_\varphi(Y, \omega) = \|Y - \widehat{Y}_{\varphi,\omega}\|_\varphi$ denote the distance between Y and the subspace $\omega$. These are illustrated by Figure 3.2.2. The nonnegative quantity

    RD_\varphi = D_\varphi(Y, \omega) - D_\varphi(Y, \Omega_F)   (3.2.16)

denotes the reduction in residual dispersion when we pass from the reduced model to the full model. Large values of $RD_\varphi$ indicate $H_A$, while small values support $H_0$.
Figure 3.2.2: The reduction in dispersion $RD_\varphi$ is the difference in normed distances between Y and the subspaces $\Omega_F$ and $\omega$.
This drop in residual dispersion, $RD_\varphi$, is analogous to the drop in residual sums of squares for the LS analysis. In fact, to obtain this reduction in sums of squares, we need only replace the R-norm with the square of the Euclidean norm in the above development. Thus the drop in sums of squared errors is

    SS = D^2_{LS}(Y, \omega) - D^2_{LS}(Y, \Omega_F) ,

where $D^2_{LS}(Y, \Omega_F)$ is defined above. Hence the reduction in sums of squared residuals can be written as

    SS = \|(I - H_\omega)Y\|^2_{LS} - \|(I - H_{\Omega_F})Y\|^2_{LS} .

The traditional least squares F-test is given by

    F_{LS} = \frac{SS/q}{\hat{\sigma}^2} ,   (3.2.17)

where $\hat{\sigma}^2 = D^2_{LS}(Y, \Omega_F)/(n - p)$. Other than replacing one norm with another, Figures 3.2.1 and 3.2.2 remain the same for the two analyses, LS and R.

In order to be useful as a test statistic, similar to least squares, the reduction in dispersion RD must be standardized. The asymptotic distribution theory that follows suggests the standardization

    F_\varphi = \frac{RD/q}{\hat{\tau}_\varphi/2} ,   (3.2.18)

where $\hat{\tau}_\varphi$ is the estimate of $\tau_\varphi$ discussed in Section 3.7. Small sample studies cited in Section 3.7 indicate that $F_\varphi$ should be compared with F-critical values with q and $n - (p+1)$ degrees of freedom, analogous to the LS classical F-test statistic. Similar to the LS F-test, the test based on $F_\varphi$ can be summarized in the ANOVA table, Table 3.2.1. Note that the reduction in dispersion replaces the reduction in sums of squares in the classical table. These robust ANOVA tables were first discussed by Schrader and McKean (1976).
158 CHAPTER 3. LINEAR MODELS
Table 3.2.1: Robust ANOVA Table for H
0
: M = 0
Source Reduction Mean Reduction
in Dispersion in Dispersion df in Dispersion F

Regression RD

=
_
D

(Y, ) D

(Y,
F
)
_
q RD/q F

Error n (p + 1)

/2
Table 3.2.2: Robust ANOVA Table for H
0
: = 0
Source Reduction Mean Reduction
in Dispersion in Dispersion df in Dispersion F

Regression RD =
_
D

(0) D

(Y,
F
)
_
p RD/p F

Error n p 1

/2
Tests that all Regression Coecients are 0
As discussed more fully in Section 3.6, there are three R-test statistics for the hypotheses
( 3.2.5). These are the R-analogues of the classical tests: the likelihood ratio test, the scores
test, and the Wald test. We shall introduce them here for the special null hypothesis that
all the regression parameters are 0; i.e.,
H
0
: = 0 versus H
0
: = 0 . (3.2.19)
Their asymptotic theory and small sample properties are discussed in more detail in later
sections.
In this case, the reduced model dispersion is just the dispersion of the response vector
Y, i.e., D

(0). Hence, the R-test based on the reduction in dispersion is


F

=
_
D

(0) D

(Y,
F
)
_
/p

/2
. (3.2.20)
As discussed above, F

should be compared with F(, p, np1)-critical values. Similar to


the general hypothesis, the test based on F

can be expressed in the robust ANOVA table


given in Table 3.2.2. This is the robust analogue of the traditional ANOVA table that is
printed out for a regression analysis by most least squares regression packages.
The R-scores test is the test based on the gradient. Theorem 3.5.2, below, gives the
asymptotic distribution of the gradient S

(0) under the null hypothesis. This leads to the


asymptotic level test, reject H
0
if
S

(0)(X

X)
1
S

(0)
2

(p) . (3.2.21)
Note that this test avoids the estimation of

.
The R-Wald test is a quadratic form in the full model estimates. Based on the asymp-
totic distribution of the full model estimate

given in Corollary 3.5.1, an asymptotic level


3.3. EXAMPLES 159
Table 3.3.1: Data for Example 3.3.1. The number of calls is in tens of millions and the
years are from 1950-1973.
Year 50 51 52 53 54 55 56 57 58 59 60 61
No. Calls 0.44 0.47 0.47 0.59 0.66 0.73 0.81 0.88 1.06 1.20 1.35 1.49
Year 62 63 64 65 66 67 68 69 70 71 72 73
No. Calls 1.61 2.12 11.90 12.40 14.20 15.90 18.20 21.20 4.30 2.40 2.70 2.90
test, rejects H
0
if

(X

X)

/p

2

F(, p, n p 1) . (3.2.22)
3.3 Examples
We oer several examples to illustrate the rank-based estimates and test procedures discussed
in the last section. For all the examples, we use Wilcoxon scores, (u) =

12(u (1/2)),
for the rank-based estimates of the regression coecients. We estimate the intercept by the
median of the residuals and we estimate the scale parameter

as discussed in Section 3.7.


We begin with a simple regression data set and proceed to multiple regression problems.
Example 3.3.1. Telephone Data
The response for this data set is the number of telephone calls (tens of millions) made
in Belgium for the years 1950 through 1973. Time, the years, serves as our only predictor
variable. The data is discussed in Rousseeuw and Leroy (1987) and, for convenience, is
displayed in Table 3.3.1.
The Wilcoxon estimates of the intercept and slope are 7.13 and .145, respectively, while
the LS estimates are 26 and .504. The reason for this disparity in ts is easily seen in Panel
A of Figure 3.3.1 which is a scatterplot of the data overlaid with the LS and Wilcoxon ts.
Note that the years 1964 through 1969 had a profound eect on the LS t while the Wilcoxon
t was much less sensitive to these years. As discussed in Rousseeuw and Leroy the recording
system for the years 1964 through 1969 diered from the other years. Panels B and C of
Figure 3.3.1 are the studentized residual plots of the ts; see ( 3.9.31) of Section 3.9. As
with internal LS-studentized residuals, values of the internal R-studentized residuals which
exceed 2 in absolute value are potential outliers. Note that the internal Wilcoxon studentized
residuals clearly show that the years 1964-1969 are outliers while the internal LS studentized
residuals only detect 1969. The Wilcoxon studentized residuals also mildly detect the year
1970. Based on the scatterplot, this point does not follow the trend of the early (before
1964) years either. The scatterplot and Wilcoxon residual plot indicate that there may be a
quadratic trend over the years before the outliers occur. The last few years, though, do not
seem to follow this trend. Hence, a linear model for this data is questionable. On the basis
of these plots, we will not discuss any formal inference for this data set.
160 CHAPTER 3. LINEAR MODELS
Figure 3.3.1: Panel A: Scatterplot of the Telephone Data, overlaid with the LS and Wilcoxon
ts; Panel B: Internal LS studentized residual plot; Panel C: Internal Wilcoxon studentized
residual plot; and Panel D: Wilcoxon dispersion function.


Year
N
u
m
b
e
r

o
f

c
a
l
l
s
50 55 60 65 70
0
5
1
0
1
5
2
0
Panel A
LS-Fit
Wilcoxon-Fit

LS-Fit
L
S
-
S
t
u
d
e
n
t
i
z
e
d

r
e
s
i
d
u
a
l
s
0 2 4 6 8 10
-
1
0
1
2
Panel B


Wilcoxon-Fit
W
i
l
c
o
x
o
n
-
S
t
u
d
e
n
t
i
z
e
d

r
e
s
i
d
u
a
l
s
0 1 2 3
0
1
0
2
0
3
0
4
0
5
0
Panel C
Beta
W
i
l
c
o
x
o
n

d
i
s
p
e
r
s
i
o
n
-0.2 0.0 0.2 0.4 0.6
1
1
0
1
2
0
1
3
0
1
4
0
1
5
0
Panel D
Panel D of Figure 3.3.1 depicts the Wilcoxon dispersion function over the interval
(.2, .6). Note that Wilcoxon estimate

R
= .145 is the minimizing value. Next consider
the hypotheses H
0
: = 0 versus H
A
: ,= 0. The basis for the test statistic F

can be
read from this plot. The reduction in dispersion is given by RD = D(0) D(.145). Also,
the gradient test of these hypotheses would be the negative of the slope of the dispersion
function at 0; i.e., D

(0).
Example 3.3.2. Baseball Salaries
As a large data set, we consider data on the salaries of professional baseball pitchers for
the 1987 baseball season. This data set was taken from the data set on baseball salaries
which was used in the 1988 ASA Graphics Section Poster Session. It can be obtained at
the web site: http://lib.stat.cmu.edu/datasets. Our analysis concerns a subdata set of
176 pitchers, which can be obtained from the authors upon request. Our response variable
is the 1987 beginning salary (in log dollars) of these pitchers. As predictors, we took the
career summary statistics through the end of the 1986 season. The names of these variables
are listed in Table 3.3.2. Panels A - G of Figure 3.9.2 show the scatter plots of the log of
salary versus each of the predictors. Certainly the strongest predictor on the basis of these
plots is log years; although, linearity in this plot is questionable.
3.3. EXAMPLES 161
Figure 3.3.2: Panels A - G: Plots of log-salary versus each of the predictors for the baseball
data of Example 3.3.2; Panel H: Internal Wilcoxon studentized residual plot.

Log Years
L
o
g
s
a
la
r
y
0.0 0.5 1.0 1.5 2.0 2.5 3.0
4
5
6
7
Panel A

Ave. wins
L
o
g
s
a
la
r
y
0 5 10 15 20
4
5
6
7
Panel B

Ave. loses
L
o
g
s
a
la
r
y
2 4 6 8 10 12
4
5
6
7
Panel C

ERA
L
o
g
s
a
la
r
y
2.5 3.0 3.5 4.0 4.5 5.0 5.5
4
5
6
7
Panel D

Ave. games
L
o
g
s
a
la
r
y
0 20 40 60 80
4
5
6
7
Panel E

Ave. innings
L
o
g
s
a
la
r
y
0 50 100 150 200 250
4
5
6
7
Panel F

Ave. saves
L
o
g
s
a
la
r
y
0 5 10 15 20 25
4
5
6
7
Panel G



Wilcoxon fit
S
tu
d
e
n
tiz
e
d
r
e
s
id
.
4 5 6 7 8
-
8
-
6
-
4
-
2
0
2
4
6
Panel H
OO
O
The internal Wilcoxon studentized residuals, ( 3.9.31), versus tted values are displayed
in the Panel H of Figure 3.9.2. Based on Panels A and H, the pattern in the residual
plot follows from the fact that log years is not a linear predictor. Better tting models are
pursued in Exercise 3.16.1. Note that there are several large outliers. The three identied
outliers, circled points in Panel H, are interesting. These correspond to the pitchers Steve
Carlton, Phil Niekro and Rick Sutcli. These were very good pitchers, but in 1987 they
were at the end of their careers, (21, 23, and 21 years of pitching, respectively); hence,
they missed the rapid rise in baseball salaries. A diagnostic analysis, (see Section 3.9 and
Exercise 3.16.1), indicates a few mildly inuential points, also. For illustration, though, we
will consider the model that we t. Table 3.3.2 also displays the estimated coecients and
their standard errors. The outliers impaired the LS t, somewhat. The LS estimate of is
.515 in comparison to the estimate of which is .388.
Table 3.3.3 displays the robust ANOVA table for testing that all the coecients, except
162 CHAPTER 3. LINEAR MODELS
Table 3.3.2: Predictors for Baseball Salaries of Pitchers and Their Estimated (Wilcoxon Fit)
Coecients
Predictor Estimate Stand. Error t-ratio
log Years in professional baseball .839 .044 19.15
Average wins per year .045 .028 1.63
Average losses per year -.024 .026 -.921
Earned run average -.146 .070 -2.11
Average games per year -.006 .004 1.60
Average innings per year .004 .003 1.62
Average saves per year .012 .011 1.07
Intercept 4.22 .324
Scale () .388
Table 3.3.3: Wilcoxon ANOVA Table for H
0
: = 0
Source Reduction Mean Reduction
in Dispersion in Dispersion df in Dispersion F

Regression 78.287 7 11.18 57.65


Error 168 .194
the intercept, are 0. Based on the large value of F

, ( 3.2.20), the predictors are helpful


in explaining the response. In particular, based on Table 3.3.2, the predictors years in
professional baseball, earned run average, average innings per year, and average number of
saves per year seem more important than the variables wins, losses, and games. These last
three variables form a similar group of variables; hence, as an illustration of the rank-based
statistic F

, the hypothesis that the coecients for these three predictors are 0 was tested.
The reduction in dispersion for this hypothesis is RD = 1.24 which leads to F

= 2.12
which is signicant at the 10% level. This conrms the above observations on the regression
coecients.
Example 3.3.3. Potency Data
This example is part of an n = 34 multivariate data set discussed in Chapter 6; see Table
6.6.2 for the data.. The experiment concerned the potency of drug compounds which were
manufactured under dierent levels of 4 factors. Here we shall consider only one of the
response variables POT2, which is the potency of a drug compound at the end of two weeks.
The factors are: SAI, the amount of intragranular steric acid, which was set at the three
levels 1, 0 and 1; SAE, the amount of extragranular steric acid, which was set at the three
levels 1, 0 and 1; ADS, the amount of cross carmellose sodium, which was set at the three
levels 1, 0 and 1; and TYPE of steric acid which was set at two levels 1 and 1. The initial
potency of the compound, POT0, served as a covariate.
In Example 3.9.2 of Section 3.9 a residual analysis of this data set is performed. This
analysis indicates that the model which includes the covariate, the linear terms of the factors,
3.3. EXAMPLES 163
Table 3.3.4: Wilcoxon and LS Estimates for the Potency Data
Wilcoxon Estimates LS Estimates
Terms Parameter Est. SE Est. SE
Intercept 7.184 2.96 5.998 4.50

1
0.072 0.05 0.000 0.08
Linear
2
0.023 0.05 -0.018 0.07

3
0.166 0.05 0.135 0.07

4
0.020 0.04 -0.011 0.05

5
0.042 0.05 0.086 0.08

6
-0.040 0.05 0.035 0.08
Two-way
7
0.040 0.05 0.102 0.07
Inter.
8
-0.085 0.06 -0.030 0.09

9
0.024 0.05 0.070 0.07

10
-0.049 0.05 -0.011 0.07

11
-0.002 0.10 0.117 0.15
Quad.
12
-0.222 0.09 -0.240 0.13

13
0.022 0.09 -0.007 0.14
Covariate
14
0.092 0.31 0.217 0.47
Scale or .204 .310
the simple two-way interaction terms of the factors, and the quadratic terms of the three
factors SAE, SAI and ADS is adequate. Let x
j
for j = 1, . . . , 4 denote the level of the factors
SAI, SAE, ADS, and TYPE, respectively, and let c
i
denote the value of the covariate. Then
the model is expressed as,
y
i
= +
1
x
1,i
+
2
x
2,i
+
3
x
3,i
+
4
x
4,i
+
5
x
1,i
x
2,i
+
6
x
1,i
x
3,i
+
7
x
1,i
x
4,i
+
8
x
2,i
x
3,i
+
9
x
2,i
x
4,i
+
10
x
3,i
x
4,i
+
11
x
2
1,i
+
12
x
2
2,i
+
13
x
2
3,i
+
14
c
i
+ e
i
. (3.3.1)
The Wilcoxon and LS estimates of the regression coecients and their standard errors are
given in Table 3.3.4. The Wilcoxon estimates are more precise. As the diagnostic analysis
of Example 3.9.2 shows, this is due the outliers in this data set.
Note that the Wilcoxon estimate of the parameter
13
, the quadratic term of the factor
ADS is signicant. Again referring to the residual analysis given in Example 3.9.2, there is
some graphical evidence to retain the three quadratic coecients in the model. In order to
statistically conrm this evidence, we will test the hypotheses
H
0
:
12
=
13
=
14
= 0 versus H
A
:
i
,= 0 for some i = 12, 13, 14 .
The Wilcoxon test is summarized in Table 3.3.5 and it is based on the test statistic ( 3.2.18).
The value of the test statistic is signicant at the .05 level. The LS F-test statistic, though,
has the value 1.19. As with its estimates of the regression coecients, the LS F-test statistic
has been impaired by the outliers.
164 CHAPTER 3. LINEAR MODELS
Table 3.3.5: Wilcoxon ANOVA Table for H
0
:
12
=
13
=
14
= 0
Source Reduction Mean Reduction
of Dispersion in Dispersion df in Dispersion F

Quadratic Terms .977 3 .326 3.20


Error 19 .102
3.4 Assumptions for Asymptotic Theory
For the asymptotic theory developed in this chapter certain assumptions on the distribution
of the errors, the design matrix, and the scores are needed. The required assumptions for
each section may dier, but for easy reference, we have placed them in this section.
The major assumption on the error density function f for much of the rank-based anal-
yses, is:
(E.1) f is absolutely continuous, 0 < I(f) < . (3.4.1)
where I(f) denotes Fisher information, ( 2.4.16). Since f is absolutely continuous, we can
write
f(s) f(t) =
_
s
t
f

(x)dx
for some function f

. An application of the Cauchy-Schwartz inequality yields


[f(s) f(t)[ I(f)
1/2
_
[F(s) F(t)[ ; (3.4.2)
see Exercise 1.12.20. It follows from ( 3.4.2), that assumption (E.1) implies that f is
uniformly bounded and is uniformly continuous.
An assumption that will be used for analyses based on the L
1
norm is:
(E.2) f(
e
) > 0 , (3.4.3)
where
e
denotes the median of the error distribution, i.e.,
e
= F
1
(1/2).
For easy reference, we list again the scale parameter

, ( 2.5.23),

=
_
(u)
f
(u)du , (3.4.4)
where

f
(u) =
f

(F
1
(u))
f(F
1
(u))
. (3.4.5)
Under (E.1) the scale parameter

is well dened. Another scale parameter that will be


needed is
S
dened as:

S
= (2f(
e
))
1
; (3.4.6)
see ( 1.5.21). Note that it is well dened under Assumption (E.2).
3.4. ASSUMPTIONS FOR ASYMPTOTIC THEORY 165
As above let H = X(X

X)
1
X

denote the projection matrix onto , the column space


of X. Our asymptotic theory assumes that the design matrix X is imbedded in a sequence
of design matices which satisfy the next two properties. We should subscript quantities such
as X and the projection matrix with n to show this, but as a matter of convenience we have
not done so. We will subscript the leverage values h
iin
which are the diagonal entries of the
projection matrix H. We will often impose the next two conditions on the design matrix:
(D.2) lim
n
max
1in
h
iin
= 0 (3.4.7)
(D.3) lim
n
n
1
X

X = , (3.4.8)
where is a pp positive denite matrix. The rst condition has become known as Hubers
condition. Huber (1981) showed that (D.2) is a necessary and sucient design condition for
the least squares estimates to have an asymptotic normal distribution provided the errors,
e
i
, are iid with nite variance. Conditions (D.3) reduces to assumption (D.1), ( 2.4.7), of
Chapter 2 for the two sample problem.
Another design condition is Noethers condition which is given by
(N.1) max
1in
x
2
ik

n
j=1
x
2
jk
0 for all k = 1, . . . p . (3.4.9)
Although this condition will be convenient, as the next lemma shows it is implied by Hubers
condition.
Lemma 3.4.1. (D.2) implies (N.1).
Proof: By the generalized Cauchy-Schwarz inequality, (see Graybill, (1976), page 224),
for all i = 1, . . . , n we have the following equalities:
sup
=1

x
i
x

X
= x

i
(X

X)
1
x
i
= h
nii
.
Next for k = 1, . . . , p take to be
k
, the p 1 vector of zeroes except for 1 in the kth
component. Then the above equalities imply that
x
2
ik

n
j=1
x
2
jk
h
nii
, i = 1, . . . , n, k = 1, . . . , p .
Hence
max
1kp
max
1in
x
2
ik

n
j=1
x
2
jk
max
1in
h
nii
.
Therefore Hubers condition implies Noethers condition.
As in Chapter 2, we will often assume that the score generating function (u) satises
assumption ( 2.5.5). We will in addition assume that it is bounded. For reference, we will
assume that (u) is a function dened on (0, 1) such that
(S.1)
_
(u) is a nondecreasing, square-integrable, and bounded function
_
1
0
(u) du = 0 and
_
1
0

2
(u) du = 1
. (3.4.10)
166 CHAPTER 3. LINEAR MODELS
Occasionally we will need further assumptions on the score function. In Section 3.7, we
will need to assume that
(S.2) is dierentiable . (3.4.11)
When estimating the intercept parameter based on signed-rank scores, we need to assume
that the score function is odd about
1
2
, i.e.,
(S.3) (1 u) = (u) ; (3.4.12)
see, also, ( 2.5.5).
3.5 Theory of Rank-Based Estimates
Consider the linear model given by ( 3.2.3). To avoid confusion, we will denote the true
vector of parameters by (
0
,
0
)

; that is, the true model is Y = 1


0
+ X
0
+ e. In this
section we will derive the asymptotic theory for the R-analysis, estimation and testing, under
the assumptions (E.1), (D.2), (D.3), and (S.1). We will occassionally supress the subscripts
and R from the notation. For example, we will denote the R-estimate by simply

.
3.5.1 R-Estimators of the Regression Coecients
A key result for both estimation and testing concerns the gradient S(Y X), ( 3.2.12).
We rst derive its mean and covariance matrix and then obtain its asymptotic distribution.
Theorem 3.5.1. Under Model ( 3.2.3),
E [S(Y X
0
)] = 0
V [S(Y X
0
)] =
2
a
X

X ,

2
a
= (n 1)
1

n
i=1
a
2
(i)
.
= 1.
Proof: Note that S(Y X
0
) = X

a(R(e)). Under Model ( 3.2.3), e


1
, . . . , e
n
are iid;
hence, the ith component a(R(e)) has mean
E [a(R(e
i
))] =
n

j=1
a(j)n
1
= 0 ,
from which the result for the expectation follows.
For the result on the variance-covariance matrix, note that, V [S(Y X
0
)] = X

V [a(R(e)] X.
The digaonal entries for the covariance matrix on the RHS are:
V [a(R(e
i
))] = E
_
a
2
(R(e
i
))

=
n

j=1
a(j)
2
n
1
=
n 1
n

2
a
.
3.5. THEORY OF RANK-BASED ESTIMATES 167
The o-diagonal entries are the covariances given by
cov(a(R(e
i
)), a(R(e
l
))) = E [a(R(e
i
)a(R(e
l
)]
=

n
j=1

n
k=1
j=k
a(j)a(k)(n(n 1))
1
= (n(n 1))
1
n

j=1
a
2
(j)
=
2
a
/n , (3.5.1)
where the third step in the derivation follows from 0 =
_

n
j=1
a(j)
_
2
. The result, ( 3.5.1),
is obtained directly from these variances and covariances.
Under (D.3), we have that
V
_
n
1/2
S(YX
0
)

. (3.5.2)
This anticpates our next result,
Theorem 3.5.2. Under the Model ( 3.2.3), (E.1), (D.2), (D.3), and (S.1) in Section 3.4,
n
1/2
S(Y X
0
)
D
N
p
(0, ) . (3.5.3)
Proof: Let S(0) = S(Y X
0
) and let T(0) = X

(F(Y X
0
)). Under the above
assumptions, the discussion around Theorem A.3.1 of the appendix shows that (T(0)
S(0))/

n converges to 0 in probability. Hence we need only show that T(0)/

n converges
to the intended distribution. Letting W

= n
1/2
t

T(e) where t ,= 0 is an arbitrary p 1


vector, it suces to show that W

converges in distribution to a N(0, t

t) distribution.
Note that we can write W

as,
W

= n
1/2
n

k=1
t

x
k
(F(e
k
)) . (3.5.4)
Since F is the distribution function of e
k
, it follows from
_
du = 0 that E [W

] = 0, from
_

2
du = 1, and (D.3) that
V [W

] = n
1
n

k=1
(t

x
k
)
2
= t

n
1
X

Xt t

t > 0 . (3.5.5)
Since W

is a sum of independent random variables which are not identically distributed


we establish the limit distribution by the Lindeberg-Feller Central Limit Theorem; see The-
orem A.1.1 of the Appendix. In the notation of this theorem let B
2
n
= V [W

]. By ( 3.5.5),
B
2
n
converges to a positive real number. We need to show,
limB
2
n
n

k=1
E
_
1
n
(x

k
t)
2

2
(F(e
k
))I
_

n
(x

k
t)(F(e
k
))

> B
n
__
= 0 . (3.5.6)
168 CHAPTER 3. LINEAR MODELS
The key is the factor n
1/2
(x

k
t) in the indicator function. By the Cauchy-Schwarz inequality
and (D.2) we have the string of inequalities:
n
1/2
[(x

k
t)[ n
1/2
|x
k
||t|
=
_
n
1
p

j=1
x
2
kj
_
1/2
|t|

_
p max
j
n
1
x
2
kj
_
1/2
|t| . (3.5.7)
By assumptions (D.2) and (D.3), it follows that the quantity in brackets in equation ( 3.5.7),
and, hence, n
1/2
[(x

k
t)[ converges to zero as n . Call the term on the right side of
equation ( 3.5.7) M
n
. Note that it does not depend on k and M
n
0. From this string of
inequalities, the limit on the left side of (3.5.6) is less than or equal to
limB
2
n
limE
_

2
(F(e
1
))I
_
[(F(e
1
))[ >
B
n
M
n
__
limn
1
n

k=1
(x

k
t)
2
.
The rst and third limits are positive reals. For the second limit, note that the random
variable inside the expectation is bounded; hence, by Lebesgue Dominated Convergence
Theorem we can interchange the limit and expectation. Since
Bn
Mn
the expectation
goes to 0 and our desired result is obtained.
Similar to Chapter 2, Exercise 3.16.9 obtains the proof of the above theorem for the
special case of the Wilcoxon scores by rst getting the projection of the statistic W.
Note from this theorem we have the gradient test that all the regression coecients are
0; that is, H
0
: = 0 versus H
A
: ,= 0. Consider the test statistic
T =
2
a
S(Y)

(X

X)
1
S(Y) . (3.5.8)
From the last theorem an approximate level test for H
0
versus H
A
is:
Reject H
0
in favor of H
A
if T
2
(, p) , (3.5.9)
where
2
(, p) denotes the upper level critical value of
2
-distribution with p degrees of
freedom.
Theorem A.3.8 of the Appendix gives the following linearity result for the process S(
n
):
1

n
S(
n
) =
1

n
S(
0
)
1

n(
n

0
) + o
p
(1) , (3.5.10)
for

n(
n

0
) = O(1), where the scale parameter

is given by ( 3.4.4). Recall that we


have made use of this result in Section 2.5 when we showed that the two sample location
process under general scores functions is Pitman regular. If we integrate the RHS of this
3.5. THEORY OF RANK-BASED ESTIMATES 169
result we obtain a locally smooth approximation of the dispersion function D(
n
) which is
given by the following quadratic function:
Q(YX) = (2

)
1
(
0
)

X(
0
)(
0
)

S(YX
0
)+D(YX
0
) . (3.5.11)
Note that Q depends on

and
0
so it cannot be used to estimate . As we will show, the
function Q is quite useful for establishing asymptotic properties of the R-estimates and test
statistics. As discussed in Section 3.7.3, it also leads to a Gauss-Newton type algorithm for
obtaining R-estimates.
The following theorem shows that Q provides a local approximation to D. This is an
asymptotic quadraticity result which was proved by Jaeckel (1972). It in turn is based on
an asymptotic linearity result derived by Jureckova (1971) and displayed above, ( 3.5.10).
It is proved in the Appendix; see Theorem A.3.8.
Theorem 3.5.3. Under the Model ( 3.2.3) and the assumptions (E.1), (D.1), (D.2), and
(S.1) of Section 3.4, for any > 0 and c > 0,
P
_
max

0
<c/

n
[D(YX) Q(Y X)[
_
0 , (3.5.12)
as n .
We will use this result to obtain the asymptotic distribution of the R-estimate. With-
out loss of generality assume that the true
0
= 0. Then we can write Q(Y X) =
(2

)
1

S(Y) +D(Y). Because Q is a quadratic function it follows from dier-


entiation that it is minimized by

(X

X)
1
S(Y) . (3.5.13)
Hence,

is a linear function of S(Y). Thus we immediately have from Theroem 3.5.2:
Theorem 3.5.4. Under the Model ( 3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

n(


0
)
D
N
p
(0,
2

1
) . (3.5.14)
Since Q is a local approximation to D, it would seem that their minimizing values are
close also. As the next result shows this indeed is the case. The proof rst appeared in
Jaeckel (1972) and is sketched in the Appendix; see Theorem A.3.9.
Theorem 3.5.5. Under the Model ( 3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

n(

)
P
0 .
Combining this last result with Theorem 3.5.4, we get the next corollary which gives the
asymptotic distribution of the R-estimate.
170 CHAPTER 3. LINEAR MODELS
Corollary 3.5.1. Under the Model ( 3.2.3), (E.1), (D.1), (D.2) and (S.1),

n(

0
)
D
N
p
(0,
2

1
) . (3.5.15)
Under the further restriction that the errors have nite variance
2
, Exercise 3.16.10
shows that the least squares estimate

LS
of satises

n(

LS
)
D
N
p
(0,
2

1
).
Hence as in the location problems of Chapters 1 and 2, the asymptotic relative eciency
between the R-estimates and least squares is the ratio
2
/
2

, where

is the scale parameter


( 3.4.4). Thus the R-estimates of regression coecients have the same high eciency relative
to LS estimates as do the rank-based estimates in the location problem. In particular, the
eciency of the Wilcoxon estimates relative to the LS estimates at the normal distribution
is .955. For longer tailed errors distributions this realtive ecency is much higher; see the
eciency discussion for contaminated normal distributions in Example 1.7.1.
From the above corollary, R-estimates are asymptotically unbiased. It follows from the
invariance properties, if we additionally asume that the errors have a symmetric distribution,
that R-estimates are unbiased for all sample sizes; see Exercise 3.16.11 for details.
The random vector

, ( 3.5.13), is an asymptotic representation of the R-estimate

.
The following representation will be useful later:
Corollary 3.5.2. Under the Model ( 3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,
n
1/2
(

0
) =

(n
1
X

X)
1
n
1/2
X

(F(Y X
0
)) +o
p
(1) , (3.5.16)
where the notation (F(Y)) means the n 1 vector whose ith component is (F(Y
i
)).
Proof: This follows immediately from ( A.3.9), ( A.3.10), the proof of Theorem 3.5.2,
and equation ( 3.5.13).
Based on this last corollary, we have that the inuence function of the R-estimate is
given by
(x
0
, y
0
;

) =

1
(F(y
0
))x
0
. (3.5.17)
A more rigorous derivation of this result, based on Frechet derivatives, is given in the Ap-
pendix; see Section A.5.2. Note that the inuence function is bounded in the Y -space but
it is unbounded in the x-space. Hence an outlier in the x-space can seriously impair an
R-estimate. Although as noted above the R-estimates are highly ecient relative to the
LS estimates, it follows from its inuence function that the breakdown of the R-estimate is
0. In Section 3.12, we present the HBR estimates whose inuence function is bounded in
both spaces and which can attain 50% breakdown; although, it is less ecient than the R
estimate.
3.5.2 R-Estimates of the Intercept
As discussed in Section 3.2, the intercept parameter requires the specication of a location
functional, T(e
i
). In this section we shall take T(e
i
) = med(e
i
). Since we assume, without
3.5. THEORY OF RANK-BASED ESTIMATES 171
loss of generality, that T(e
i
) = 0, = T(Y
i
x

i
). This leads immediately to estimating
by the median of the R-residuals. Note that this is analogous to LS, since the LS estimate
of the intercept is the arithmetic average of the LS residuals. Further, this estimate is
associated with the sign test statistic and the L
1
norm. More generally we could also consider
estimates associated with signed-rank test statistics. For example, if we consider the signed-
rank Wilcoxon scores of Chapter 1 then the corresponding estimate is the median of the
Walsh averages of the residuals. The theory of such estimates based on signed-rank tests,
though, requires symmetrically distributed errors. Thus, while we briey discuss these later,
we now concentrate on the median of the residuals which does not require this symmetry
assumption. We will make use of Assumption (E.2), ( 3.4.3), i.e, f(0) > 0.
The process we consider is the sign process based on residuals given by
S
1
(Y 1 X

) =
n

i=1
sgn(Y
i
x
i

) . (3.5.18)
As with the sign process in Chapter 1, this process is a nondecreasing step function of
which steps down at the residuals. The solution to the equation
S
1
(Y 1 X

)
.
= 0 (3.5.19)
is the median of the residuals which we shall denote by
S
= medY
i
x
i

. Our goal is
to obtain the asymptotic joint distribution of the estimate

b

= (
S
,

.
Similar to the R-estimate of the estimate of the intercept is location and scale equivari-
ant; hence, without loss of generality we will assume that the true intercept and regression
parameters are 0. We begin with a lemma.
Lemma 3.5.1. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4. For
any > 0 and for any a 1,
lim
n
P[[S
1
(Y an
1/2
1 X

) S
1
(Yan
1/2
1)[

n] = 0 .
The proof of this lemma was rst given by Jureckova (1971) for general signed-rank scores
and it is briey sketched in the Appendix for the sign scores; see Lemma A.3.2. This lemma
leads to the asymptotic linearity result for the process ( 3.5.18).
We need the following linearity result:
Theorem 3.5.6. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4.
For any > 0 and c > 0,
lim
n
P[sup
|a|c
[n
1/2
S
1
(Yan
1/2
1 X

) n
1/2
S
1
(Y X

) + a
1
S
[ ] = 0 ,
where
s
is the scale parameter dened in expression ( 3.4.6).
172 CHAPTER 3. LINEAR MODELS
Proof: For any xed a write
[n
1/2
S
1
(Y an
1/2
1 X

) n
1/2
S
1
(YX

) + a
1
S
[
[n
1/2
S
1
(Y an
1/2
1 X

) n
1/2
S
1
(Yan
1/2
1)[
+[n
1/2
S
1
(Yan
1/2
1) n
1/2
S
1
(Y) + a
1
S
[ +[n
1/2
S
1
(Y) n
1/2
S
1
(Y X

)[ .
We can apply Lemma 3.5.1 to the rst and third terms on the right side of the above
inequality. For the middle term we can use the asymptotic linearity result in Chapter 1 for
the sign process, ( 1.5.22). This yields the result for any a and the sup will follow from the
monotonicity of the process, similar to the proof of Theorem 1.5.6 of Chapter 1.
Letting a = 0 in Lemma 3.5.1, we have that the dierence n
1/2
S
1
(Y X

)
n
1/2
S
1
(Y) goes to zero in probability. Thus the asymptotic distribution of n
1/2
S
1
(Y
X

) is the same as that of n


1/2
S
1
(Y), namely, N(0, 1). We have two applications of these
results. The rst is found in the next lemma.
Lemma 3.5.2. Assume conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4.
The random variable, n
1/2

S
is bounded in probability.
Proof: Let > 0 be given. Since n
1/2
S
1
(YX

) is asymptotically N(0, 1) there exists


a c < 0 such that
P[n
1/2
S
1
(Y X

) < c] <

2
. (3.5.20)
Take c

=
1
S
(c ). By the processs monotonicity and the denition of , we have the
implication n
1/2

S
< c

n
1/2
S
1
(Y c

n
1/2
1 X

) 0. Adding in and subtracting


out the above linearity result, leads to
P[n
1/2

S
< c

P[n
1/2
S
1
(Y n
1/2
c

1 X

) 0]
P[[n
1/2
S
1
(Y c

n
1/2
1 X

) (n
1/2
S
1
(YX

) c

1
S
[ ]
+ P[n
1/2
S
1
(Y X

) c

1
S
< ]] (3.5.21)
The rst term on the right side can be made less that /2 for suciently large n whereas the
second term is ( 3.5.20). From this it follows that n
1/2

S
is bounded below in probability.
To nish the proof a similar argument shows that n
1/2

S
is bounded above in probability.
As a second application we can write the linearity result of the last theorem as
n
1/2
S
1
(Yan
1/2
1 X

) = n
1/2
S
1
(Y) a
1
S
+o
p
(1) (3.5.22)
uniformly for all [a[ c and for c > 0.
Because
S
is a solution to equation ( 3.5.19) and n
1/2

S
is bounded in probability, the
second linearity result, ( 3.5.22), yields, after some simplication, the following asymptotic
representation of our result for the estimate of the intercept for the true intercept
0
,
n
1/2
(
S

0
) =
S
n
1/2
n

i=1
sgn(Y
i

0
) + o
p
(1) , (3.5.23)
3.5. THEORY OF RANK-BASED ESTIMATES 173
where
S
is given in ( 3.4.6). From this we have that n
1/2
(
S

0
)
D
N(0,
2
S
). Our interest,
though, is in the joint distribution of
S
and

.
By Corollary 3.5.2 the corresponding asymptotic representation of

for the true vector


of regression coecients
0
is
n
1/2
(

0
) =

(n
1
X

X)
1
n
1/2
X

(F(Y)) +o
p
(1) , (3.5.24)
where

is given by ( 3.4.4). The joint asymptotic distribution is given in the following


theorem.
Theorem 3.5.7. Under (D.1), (D.2), (S.1), (E.1) and (E.2) in Section 3.4,

=
_

S

_
has an approximate N
p+1
__

0

0
_
,
_
n
1

2
S
0

0
2

(X

X)
1
__
distribution .
Proof: As above assume without loss of generality that the true parameters are 0. It
is easier to work with the random vector T
n
= (
1
s

n
S
,

n(
1

(n
1
X

X)

. Let
t = (t
1
, t

2
)

be an arbitrary, nonzero, vector in 1


p+1
. We need only show that Z
n
=
t

T
n
has an asymptotically univariate normal distribution. Based on the above asymptotic
representations of
S
, ( 3.5.23), and

, ( 3.5.24), we have
Z
n
= n
1/2
n

k=1
(t
1
sgn(Y
k
) + (t

2
x
k
)(F(Y
k
)) +o
p
(1) , (3.5.25)
Denote the sum on the right side of ( 3.5.25) as Z

n
. We need only show that Z

n
converges
in distribution to a univariate normal distribution. Denote the kth summand as Z

nk
. We
shall use the Lindeberg-Feller Central Limit Theorem. Our application of this theorem is
similar to its use in the proof of Theorem 3.5.2. First note that since the score function
is standardized (
_
= 0) that E(Z

n
) = 0. Let B
2
n
= Var(Z

n
). Because the individual
summands are independent, Y
k
are identically distributed, is standardized (
_

2
= 1), and
the design is centered, B
2
N
simplies to
B
2
n
= n
1
(
n

k=1
t
2
1
+
n

k=1
(t

2
x
k
)
2
+ 2t
1
cov(sgn(Y
1
), (F(Y
1
))t

2
n

k=1
x
k
= t
2
1
+t

2
(n
1
X

X)t
2
+ 0 .
Hence by (D.2),
lim
n
B
2
n
= t
2
1
+t

2
t
2
, (3.5.26)
which is a positive number. To satisfy the Lindeberg-Feller condition, we need to show that
for any > 0
lim
n
B
2
n
n

k=1
E[Z
2
nk
I([Z

nk
[ > B
n
)] = 0 . (3.5.27)
174 CHAPTER 3. LINEAR MODELS
Since B
2
n
converges to a positive constant we need only show that the sum converges to 0.
By the triangle inequality we can show that the indicator function satises
I(n
1/2
[t
1
[ + n
1/2
[t

2
x
k
[[(F(Y
k
))[ > B
n
) I([Z

nk
[ > B
n
) . (3.5.28)
Following the discussion after expression ( 3.5.7), we have that n
1/2
[(x

k
t)[ M
n
where M
n
is independent of k and, furthermore, M
n
0. Hence, we have
I([(F(Y
k
))[ >
B
n
n
1/2
t
1
M
n
) I(n
1/2
[t
1
[ + n
1/2
[t

2
x
k
[[(F(Y
k
))[ > B
n
) . (3.5.29)
Thus the sum in expression ( 3.5.27) is less than or equal to
n

k=1
E
_
Z
2
nk
I
_
[(F(Y
k
))[ >
B
n
n
1/2
t
1
M
n
__
= t
1
E
_
I
_
[(F(Y
1
))[ >
B
n
n
1/2
t
1
M
n
__
+ (2/n)E
_
sgn(Y
1
)(F(Y
1
))I
_
[(F(Y
1
))[ >
B
n
n
1/2
t
1
M
n
__
t

2
n

k=1
x
k
+ E
_

2
(F(Y
1
))I
_
[(F(Y
1
))[ >
B
n
n
1/2
t
1
M
n
__
(1/n)
n

k=1
(t

2
x
k
)
2
.
Because the design is centered the middle term on the right side is 0. As remarked above, the
term (1/n)

n
k=1
(t

2
x
k
)
2
= (1/n)t

2
X

Xt
2
converges to a positive constant. In the expression
Bnn
1/2
t
1
Mn
, the numerator converges to a positive constant as the denominator converges to
0; hence, the expression goes to . Therefore since is bounded, the indicator function
converges to 0. Again using the boundedness of , we can interchange limit and expectation
by the Lebesgue Dominated Convergence Theorem. Thus condition ( 3.5.27) is true and,
hence, Z

n
converges in distribution to a univariate normal distribution. Therefore, T
n
con-
verges to a multivariate normal distribution. Note by ( 3.5.26) it follows that the asymptotic
covariance of

b

is the result displayed in the theorem.


In the above development, we considered the centered design. In practice, though, we are
often concerned with an uncentered design. Let

denote the intercept for the uncentered


model. Then

= x

where x denoted the vector of column averages of the uncentered


design matrix. An estimate of

based on R-estimates is given by

S
=
S
x

. Based
on the last theorem, it follows, (Exercise 3.16.14), that
_

_
is approximately N
p+1
__

0

0
_
,
_

n

2

(X

X)
1

(X

X)
1
x
2

(X

X)
1
__
, (3.5.30)
where
n
= n
1

2
S
+
2

(X

X)
1
x and
S
and and

are given respectively by ( 3.4.6) and


( 3.4.4).
3.5. THEORY OF RANK-BASED ESTIMATES 175
Intercept Estimate Based on Signed-Rank Scores
Suppose we additionally assume that the errors have a symmetric distribution; i.e., f(x) =
f(x). In this case, all location functionals are the same. Let
f
(u) = f

(F
1
(u))/f(F
1
(u))
denote the optimal scores for the density f(x). Then as Exercise 3.16.12 shows,
f
(1u) =

f
(u); that is, the scores are odd about 1/2. Hence, in this subsection we will additionally
assume that the scores satisfy property (S.3), ( 3.4.12).
For scores satisfying (S.3), the corresponding signed-rank scores are generated as a
+
(i) =

+
(i/(n+1)) where
+
(u) = ((u+1)/2); see the discussion in Section 2.5.3. For example
if Wilcoxon scores are used, (u) =

12(u 1/2), then the signed-rank score function is

+
(u) =

3u. Recall from Chapter 1, that these signed-rank scores can be used to dene a
norm and a subsequent R-analysis. Here we only want to apply the associated one sample
signed-rank procedure to the residuals in order to obtain an estimate of the intercept. So
consider the process
T
+
(e
R
1) =
n

i=1
sgn(e
Ri
1)a
+
(R[e
Ri
[) , (3.5.31)
where e
Ri
= y
i
x

; see ( 1.8.2). Note that this is the process discussed in Section 1.8,
except now the iid observations are replaced by residuals. The process is still a nonincreasing
function of which steps down at the Walsh averages of the residuals; see Exercise 1.12.28.
The estimate of the intercept is a value
+

which solves the equation


T
+
(e
R
)
.
= 0. (3.5.32)
If Wilcoxon scores are used then the estimate is the median of the Walsh averages, ( 1.3.25)
while if sign scores are used the estimate is the median of the residuals.
Let

b
+

= (
+

. We next briey sketch the development of the asymptotic distri-


bution of

b
+

. Assume without loss of generality that the true parameter vector (


0
,

0
)

is
0. Suppose instead of the residuals we had the true errors in ( 3.5.31). Theorem A.2.11
of the Appendix then yields an asymptotic linearity result for the process. McKean and
Hettmansperger (1976) show that this result holds for the residuals also; that is,
1

n
S
+
(e
R
1) = S
+
(e)
1

+ o
p
(1) , (3.5.33)
for all [[ c, where c > 0. Using arguments similar to those in McKean and Hettmansperger
(1976), we can show that

n
+

is bounded in probability; hence, by ( 3.5.33) we have that

n
+

n
S
+
(e) + o
p
(1) . (3.5.34)
176 CHAPTER 3. LINEAR MODELS
But by ( A.2.43) and ( A.2.45) of the Appendix, we have the second representation given by,

n
+

n
n

i=1

+
(F
+
[e
i
[)sgn(e
i
) + o
p
(1)
=

n
n

i=1

+
(2F(e
i
) 1) +o
p
(1) , (3.5.35)
where F
+
is the distribution function of the absolute errors [e
i
[. Due to symmetry, F
+
(t) =
2F(t)1. Then using the relationship between the rank and the signed-rank scores,
+
(u) =
((u + 1)/2), we obtain nally

n
+

n
n

i=1
(F(Y
i
)) . (3.5.36)
Therefore using expression ( 3.5.2), we have the asypmtotic representation of the estimates:

n
_

+

_
=

n
_
1

(F(Y))
(X

X)
1
X

(F(Y))
_
. (3.5.37)
This and an application of the Lindeberg Central Limit Theorem, similar to the proof of
Theorem 3.5.7, leads to the theorem,
Theorem 3.5.8. Under assumptions (D.1), (D.2), (E.1), (E.2), (S.1) and (S.3) of Section
3.4
_

+

_
has an approximate N
p+1
__

0

0
_
,
2

(X

1
X
1
)
1
_
distribution , (3.5.38)
where X
1
= [1 X].
3.6 Theory of Rank-Based Tests
Consider the general linear hypotheses discussed in Section 3.2,
H
0
: M = 0 versus H
A
: M ,= 0 , (3.6.1)
where M is a q p matrix of full row rank. The geometry of R testing, Section 3.2.2,
indicated the statistic based on the reduction of dispersion between the reduced and full
models, F

= (RD/q)/(

/2), see ( 3.2.18), as a test statistic. In this section we develop the


asymptotic theory for this test statistic under null and alternative hypotheses. This theory
will be sucient for two other rank-based tests which we will discuss later. See Table 3.2.2
and the discussion relating to that table for the special case when M = I.
3.6. THEORY OF RANK-BASED TESTS 177
3.6.1 Null Theory of Rank Based Tests
We proceed with two lemmas about the dispersion function D() and its quadratic approx-
imation Q() given by expression ( 3.5.11).
Lemma 3.6.1. Let

denote the R-estimate of in the full model ( 3.2.3), then under
(E.1), (S.1), (D.1) and (D.2) of Section 3.4,
D(

) Q(

)
P
0 . (3.6.2)
Proof: Assume without loss of generality that the true is 0. Let > 0 be given. Choose
c
0
such that P
_

n|

| > c
0
_
< /2, for n suciently large. Using asymptotic quadraticity,
Theorem A.3.8, we have for n suciently large
P
_
[D(

) Q(

)[ <
_
P
__
max
<c
0
/

n
[D() Q()[ <
_

n|

| < c
0
_
_
> 1 . (3.6.3)
From this we obtain the result.
The last result shows that D and Q are close at the R-estimate of . Our next result
shows that Q(

) is close to the minimum of Q.


Lemma 3.6.2. Let

denote the minimizing value of the quadratic function Q then under
(E.1), (S.1), (D.1) and (D.2) of Section 3.4,
Q(

) Q(

)
P
0 . (3.6.4)
Proof: By simple algebra we have,
Q(

) Q(

) = (2

)
1
(

X(

) (

S(Y)
= (2

)
1

n(

_
n
1
X

n((

) n
1/2
S(Y)
_
.
It is shown in Exercise 3.16.15 that the term in brackets in the last equation is bounded
in probability. Since the left factor converges to zero in probability by Theorem 3.5.5 the
desired result follows.
It is easier to work with the the equivalent formulation of the linear hypotheses given by
Lemma 3.6.3. An equivalent formulation of the model and the hypotheses is:
Y = 1 +X

1
+X

2
+e , (3.6.5)
with the hypotheses H
0
:

2
= 0 versus H
A
:

2
,= 0, where X

i
and

i
, i = 1, 2, are dened
in display ( 3.6.7).
178 CHAPTER 3. LINEAR MODELS
Proof: Consider the QR-decomposition of M given by
M

= [Q
2
Q
1
] =
_
R
O
_
= Q
2
R , (3.6.6)
where the columns of Q
1
form an orthonormal basis for the kernel of the matrix M, the
columns of Q
2
form an orthonormal basis for the column space of M

, O is a (p q) q
matrix of 0s, and R is a q q upper triangular, nonsingular matrix. Dene
X

i
= XQ
i
and

i
= Q

i
for i = 1, 2 . (3.6.7)
It follows that,
Y = 1 +X +e
= 1 +X

1
+X

2
+e .
Further, M = 0 if and only if

2
= 0, which yields the desired result.
Without loss of generality, by the last lemma, for the remainder of the section, we will
consider a model of the form
Y = 1 +X
1

1
+X
2

2
+e , (3.6.8)
with the hypotheses
H
0
:
2
= 0 versus H
A
:
2
,= 0 . (3.6.9)
With these lemmas we are now ready to obtain the asymptotic distribution of F

. Let

r
= (

1
, 0

denote the reduced model vector of parameters, let


r,1
denote the reduced
model R-estimate of
1
, and let

r
= (

r,1
, 0

. We shall use similar notation with the min-


imizing value of the approximating quadratic Q. With this notation the drop in dispersion
becomes RD

= D(

r
) D(

). McKean and Hettmansperger (1976) proved the following:


Theorem 3.6.1. Suppose the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4
hold. Then under H
0
,
RD

/2
D

2
(q) ,
where RD

is formally dened in expression ( 3.2.16).


Proof: Assume that the true vector of parameters is 0 and suppress the subscript on
RD. Write RD as the sum of ve dierences:
RD = D(

r
) D(

)
=
_
D(

r
) Q(

r
)
_
+
_
Q(

r
) Q(

r
)
_
+
_
Q(

r
)
Q(

)
_
+
_
Q(

) Q(

)
_
+
_
Q(

) D(

)
_
.
3.6. THEORY OF RANK-BASED TESTS 179
By Lemma 3.6.1 the rst and fth dierences go to zero in probability and by Lemma 3.6.2
the second and fourth dierences go to zero in probability. Hence we need only show that
the middle dierence converges in distribution to the intended distribution. As in Lemma
3.6.2, algebra leads to
Q(

) = 2
1

S(Y)

(X

X)
1
S(Y) + D(Y) ,
while
Q(

r
) = 2
1

S(Y)

_
(X

1
X
1
)
1
0
0 0
_
S(Y) + D(Y) .
Combining these last two results the middle dierence becomes
Q(

r
) Q(

) = 2
1

S(Y)

_
(X

X)
1

_
(X

1
X
1
)
1
0
0 0
__
S(Y) .
Using a well known matrix identity, (see page 27 of Searle, 1971),
(X

X)
1
=
_
(X

1
X
1
)
1
0
0 0
_
+
_
A
1
1
B
I
_
W
_
B

A
1
1
I

,
where
X

X =
_
A
1
B
B

A
2
_
W =
_
A
2
B

A
1
1
B
_
1
. (3.6.10)
Hence after some simplication we have
RD

/2
= S(Y)

__
A
1
1
B
I
_
W
_
B

A
1
1
I

_
S(Y) + o
p
(1)
=
__
B

A
1
1
I

S(Y)
_

W
__
B

A
1
1
I

S(Y)
_
+ o
p
(1)
=
__
B

A
1
1
I

n
1/2
S(Y)
_

nW
__
B

A
1
1
I

n
1/2
S(Y)
_
+ o
p
(1) . (3.6.11)
Using n
1
X

X and the asymptotic distribution of n


1/2
S(Y), Theorem 3.5.2, it follows
that the right side of ( 3.6.11) converges in distribution to a
2
random variable with q degrees
of freedom, which completes the proof of the theorem.
A consistent estimate of

is discussed in Section 3.7. We shall denote this estimate by


. The test statistic we shall subsequently use is given by


F

=
RD

/q

/2
. (3.6.12)
Although the test statistic qF

has an asymptotic
2
distribution, small sample studies, (see
below), have indicated that it is best to compare the test statistic with F-critical values
having q and n p 1 degrees of freedom; that is, the test at nominal level is:
Reject H
0
: M = 0 in favor of H
A
: M ,= 0 if F

F(, q, n p 1) . (3.6.13)
180 CHAPTER 3. LINEAR MODELS
McKean and Sheather (1991) review numerous small sample studies concerning the valid-
ity of the rank-based analysis based on the test statistic F

. These small sample studies


demonstrate that the empirical level of F

over a variety of designs, sample sizes and error


distributions are close to the nominal values.
In classical inference there are three tests of general hypotheses: the likelihood ratio
test (reduction in sums of squares test), Walds test and Raos scores (gradient) test. A
good discussion of these tests can be found in Rao (1973). When the hypotheses are the
general linear hypotheses ( 3.6.1), the errors have a normal distribution, and the least squares
procedure is used then the three tests statistics are algebraically equivalent. Actually the
equivalence holds without normality, although in this case the reduction in sums of squares
statistic is not the likelihood ratio test; see the discussion in Hettmansperger and McKean
(1983).
There are three rank-based tests for the general linear hypotheses, also. The reduction
in dispersion test statistic F

is the analogue of the likelihood ratio test, i.e., the reduction


in sums of squares test. Since Walds test statistic is a quadratic form in full model
estimates, its rank analogue is given by
F
,Q
=
_
M

_
_
M(X

X)
1
M

1
_
M

_
/q

2

. (3.6.14)
Provided

is a consistent estimate of

it follows from the asymptotic distribution of

R
, Corollary 3.5.1, that under H
0
, qF
,Q
has an asymptotic
2
distribution. Hence the
test statistics F

and F
,Q
have the same null asymptotic distributions. Actually as Ex-
ercise 3.16.16 shows, the dierence of the test statistics converges to zero in probability
under H
0
. Unlike the classical methods, though, they are not algebraically equivalent, see
Hettmansperger and McKean (1983).
The rank gradient scores test is easiest to dene in terms of the reparameterized
model, ( 3.6.20); that is, the null hypothesis is H
0
:
2
= 0. Rewrite the random vector
dened in ( 3.6.11) of Theorem 3.6.1 using as the true parameter under H
0
,
0
= (
01
, 0

,
i.e.,
__
B

A
1
1
I

n
1/2
S(YX
0
)
_

nW
__
B

A
1
1
I

n
1/2
S(Y X
0
)
_
. (3.6.15)
From the proof of Theorem 3.6.1 this quadratic form has an asymptotic
2
distribution
with q degrees of freedom. Since it does depend on
0
, it can not be used as a test statistic.
Suppose we substitute the reduced model R-estimate of
1
; i.e., the rst p q components
of

r
, dened immediately after expression ( 3.6.9). We shall call it

01
. Now since this is
the reduced model R-estimate, we have
S(Y X

r
)
.
=
_
0
S
2
(YX
1

r,1
)
_
, (3.6.16)
where the subscript 2 on S denotes the last p q components of S. This yields
A

= S
2
(YX
1

r,1
)

_
X

2
X
2
X

2
X
1
(X

1
X
1
)
1
X

1
X
2
_
1
S
2
(YX
1

r,1
) (3.6.17)
3.6. THEORY OF RANK-BASED TESTS 181
as a test statistic. This is often called the aligned rank test, since the observations are
aligned by the reduced model estimate. Exercise 3.16.17 shows that under H
0
, A

has an
asymptotic
2
distribution. As the proof shows, the dierence between qF

and A

converges
to zero in probability under H
0
. Aligned rank tests were introduced by Hodges and Lehmann
(1962) and are developoed in the linear model by Puri and Sen (1985).
Suppose in ( 3.6.16) we use a reduced model estimate

r,1
which is not the R-estimate;
for example, it may be the LS-estimate. Then we have
S(YX

r
)
.
=
_
S
1
(Y X
1

r,1
)
S
2
(Y X
1

r,1
)
_
. (3.6.18)
The reduced model estimate must satisfy

n(

0
) = O
p
(1), under H
0
. Then the statistic
in ( 3.6.17) is
A

= S

2
_
X

2
X
2
X

2
X
1
(X

1
X
1
)
1
X

1
X
2
_
1
S

2
, (3.6.19)
where, from ( 3.6.11),
S

2
= S
2
(Y X
1

r,1
) X

2
X
1
(X

1
X
1
)
1
S
1
(YX
1

r,1
) . (3.6.20)
Note that when the R-estimate is used, the second term in S

2
vanishes and we have ( 3.6.17);
see Adichi(1978) and Chiang and Puri (1984).
Hettmansperger and McKean (1983) give a general discussion of these three tests. Note
that both F
,Q
and F

require estimation of full model estimates and the scale parameter

while A

does not. However when using a linear model, one is usually interested in more
than hypothesis testing. Of primary interest is checking the quality of the t; i.e., does the
model t the data. This requires estimation of the full model parameters and an estimate
of

. Diagnostics for ts based on R-estimates are the topics of Section 3.9. One is also
usually interested in estimating contrasts and their standard errors. For R-estimates this
requires an estimate of

. Moreover, as discussed in Hettmansperger and McKean (1983),


the small sample properties of the aligned rank test can be poor on certain designs.
The inuence function of the test statistic F

is derived in Appendix A.5.2. As


discussed there, it is easier to work with the
_
qF

. The result is given by,


(x
0
, y
0
;
_
qF

) = [[F(y
0
x

r
)][
_
x

0
_
(X

X)
1

_
(X

1
X
1
)
1
0
0 0
__
x
0
_
1/2
. (3.6.21)
As shown in the Appendix, the null distribution of F

can be read from this result. Note


that similar to the R-estimates, the inuence function of F

is bounded in the Y -space but


not in the x-space; see ( 3.5.17).
3.6.2 Theory of Rank-Based Tests under Alternatives
In the last section, we developed the null asymptotic theory of the rank-based tests based
on a general score function. In this section we obtain some properties of these tests under
182 CHAPTER 3. LINEAR MODELS
alternative models. We show rst that the test based on the reduction of dispersion, RD

,
( 3.2.16), is consistent under any alternative to the general linear hypothesis. We then show
the eciency of these tests is the same as the eciency results obtained in Chapter 2.
Consistency
We want to show that the test statistic F

is consistent for the general linear hypothesis,


( 3.2.5). Without loss of generality, we will again reparameterize the model as in ( 3.6.20)
and consider as our hypothesis H
0
:
2
= 0 versus H
A
:
2
,= 0. Let
0
= (

01
,

02
)

be
the true parameter. We will assume that the alternative is true; hence,
02
,= 0. Let be a
given level of signicance. Let T(

) = RD

/(

/2) where RD

= D(

r
) D(

). Because
we estimate

under the full model by a consistent estimate, to show consistency of F

it
suces to show
P

0
[T(

)
2
,q
] 1 , (3.6.22)
as n .
As in the proof under the null hypothesis, it is convenient to work with the approximating
quadratic function Q(YX), ( 3.5.11). As above, let

and

denote the minimizing values
of Q and D respectively under the full model. The present argument simplies if, for the
full model, we replace

by

in T(

). We can do this because we can write


D(YX

) D(Y X

) =
_
D(YX

) Q(YX

)
_
+
_
Q(Y X

) Q(YX

)
_
+
_
Q(Y X

) D(YX

)
_
.
Applying asymptotic quadraticity, Theorem A.3.8, the rst and third dierences go to 0 in
probability while the second dierence goes to 0 in probability by Lemma 3.6.2; hence the
left side goes to 0 in probability under the alternative model. Thus we need only show that
P

0
[(2/

)(D(

r
) D(

))
2
,q
] 1 , (3.6.23)
where, as above,

r
denotes the reduced model R-estimate. We state the result next. The
proof can be found in the appendix; see Theorem A.3.10.
Theorem 3.6.2. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold.
The test statistic F

is consistent for the hypotheses ( 3.2.5).


Eciency Results
The above result establishes that the rank-based test statistic F

is consistent for the general


linear hypothesis, ( 3.2.5). We next derive the eciency results of the test. Our rst step is
to obtain the asymptotic power of F

along a sequence of alternatives. This generalizes the


asymptotic power lemmas discussed in Chapters 1 and 2. From this the eciency results
will follow. As with the consistency discussion it is more convenient to work with the model
( 3.6.20).
3.6. THEORY OF RANK-BASED TESTS 183
The sequence of alternative models to the hypothesis H
0
:
2
= 0 is:
Y = 1 +X
1

1
+X
2
(/

n) +e , (3.6.24)
where is a nonzero vector. Because R-estimates are invariant to location shifts, we can
assume without loss of generality that
1
= 0. Let
n
= (0

n)

and let H
n
denote
the hypothesis that ( 3.6.24) is the true model. The concept of contiguity will prove helpful
with the asymptotic theory of the statistic F

under this sequence of models. A discussion


of contiguity is given in the appendix; see Section A.2.2.
Theorem 3.6.3. Under the sequence of models ( 3.6.24) and the assumptions (E.1), (D.1),
(D.2), and (S.1) of Section 3.4,
P

n
(T(

) t) P(
2
q
(

) t) , (3.6.25)
where
2
q
(

) has a noncentral
2
-distribution with q degrees of freedom and non-centrality
parameter

=
2

W
1
0
, (3.6.26)
where W
0
= lim
n
nW and W is dened in display ( 3.6.10).
Proof: As in the proof of Theorem 3.6.1 we can write the drop in dispersion as the sum of
the same ve dierences. Since the rst two and last two dierences go to zero in probability
under the null model, it follows from the discussion on contiguity, (Section A.2.2), that
these dierences go to zero in probability under the model ( 3.6.24). Hence we need only be
concerned about the middle dierence. Since
1
= 0, the middle dierence reduces to the
same quantity as in Theorem 3.6.1; i.e., we obtain,
RD

/2
=
__
B

A
1
1
I

S(Y)
_

W
__
B

A
1
1
I

S(Y)
_
+ o
p
(1) .
The asymptotic linearity result derived in the Appendix, (Theorem A.3.8), is
sup

nc
|n
1/2
S(YX)
_
n
1/2
S(Y)
1

n
_
| = o
p
(1) ,
for all c > 0. Since

n|
n
| = ||, we can take c = || and get
|n
1/2
S(Y X
n
)
_
n
1/2
S(Y)
1

(0

_
| = o
p
(1) . (3.6.27)
The above probability statements hold under the null model and, hence, by contiguity under
the sequence of models ( 3.6.24) also. Under the sequence of models ( 3.6.24), however,
n
1/2
S(Y X
n
)
D
N
p
(0, ) .
Hence, under the sequence of models ( 3.6.24)
n
1/2
S(Y)
D
N
p
(
1

(0

, ) . (3.6.28)
184 CHAPTER 3. LINEAR MODELS
Then under the sequence of models ( 3.6.24),
_
B

A
1
1
I

n
1/2
S(Y)
D
N
q
(
1

W
0
, W
0
) .
From this last result, the conclusion readily follows.
Several interesting remarks follow from this theorem. First, since W
0
is positive denite,
under alternatives the noncentrality parameter > 0. Thus the asymptotic distribution of
T(

) under the sequence of models ( 3.6.24) has mean q + . Furthermore, the asymptotic
power of a level test based on T(

) is P[
2
q
()
2
,q
].
Second, note that that we can write the non-centrality parameter as
= (
2

n)
1
[

A
2
(B)

A
1
1
B] .
Both matrices A
2
and A
1
1
are positive denite; hence, the non-centrality parameter is
maximized when is in the kernel of B. One way of assuring this for a design is to take
B = 0. Because B = X

1
X
2
this condition holds for orthogonal designs. Therefore orthogonal
designs are generally more ecient than non-orthogonal designs.
We next obtain the asymptotic relative efficiency of the test statistic $F_\varphi$ with respect to the classical least squares F-test, $F_{LS}$, defined by (3.2.17) in Section 3.2.2. The theory for $F_{LS}$ under local alternatives is outlined in Exercise 3.16.18, where it is shown that, under the additional assumption that the random errors $e_i$ have finite variance $\sigma^2$, the null asymptotic distribution of $qF_{LS}$ is a central $\chi^2_q$ distribution. Thus both $F_\varphi$ and $F_{LS}$ have the same asymptotic null distribution. As outlined in Exercise 3.16.18, under the sequence of models (3.6.24), $qF_{LS}$ has an asymptotic noncentral $\chi^2_{q,\eta_{LS}}$ distribution with noncentrality parameter
$$\eta_{LS} = (\sigma^2)^{-1}\theta'W_0^{-1}\theta \;. \qquad (3.6.29)$$
Based on Theorem 3.6.3, the asymptotic relative efficiency of $F_\varphi$ and $F_{LS}$ is the ratio of their noncentrality parameters; i.e.,
$$e(F_\varphi, F_{LS}) = \frac{\eta_\varphi}{\eta_{LS}} = \frac{\sigma^2}{\tau_\varphi^2} \;.$$
Thus the efficiency results for the rank-based estimates and tests discussed in this section are the same as the efficiency results presented in Chapters 1 and 2. An asymptotically efficient analysis can be obtained if the selected rank score function is $\varphi_{f_0}(u) = -f_0'(F_0^{-1}(u))/f_0(F_0^{-1}(u))$, where $f_0$ is the form of the density of the error distribution. If the errors have a logistic distribution then the Wilcoxon scores result in an asymptotically efficient analysis.

Usually we have no knowledge of the distribution of the errors, in which case we recommend using Wilcoxon scores. With them, the loss in efficiency relative to the classical analysis at the normal distribution is only 5%, while the gain in efficiency over the classical analysis for long-tailed error distributions can be substantial, as discussed in Chapters 1 and 2.
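To make these figures concrete, the following is a minimal sketch (our own illustration in Python/SciPy, not part of the original software) that evaluates $e(F_\varphi, F_{LS}) = \sigma^2/\tau_\varphi^2$ numerically for Wilcoxon scores, for which $\tau_\varphi = (\sqrt{12}\int f^2(x)\,dx)^{-1}$. At the standard normal it returns $3/\pi \approx .955$, the 5% loss cited above; at a 10% contaminated normal it exceeds 1, so the rank-based test is more efficient there.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def wilcoxon_are(pdf, var):
    """e(F_phi, F_LS) = sigma^2 / tau_phi^2 for Wilcoxon scores,
    where tau_phi = (sqrt(12) * integral of f^2)^{-1}."""
    int_f2, _ = quad(lambda x: pdf(x) ** 2, -np.inf, np.inf)
    tau = 1.0 / (np.sqrt(12.0) * int_f2)
    return var / tau ** 2

# Standard normal errors: 3/pi = 0.955, the 5% loss cited above.
print(wilcoxon_are(stats.norm.pdf, 1.0))

# 10% contaminated normal (contamination scale 3): ARE now exceeds 1.
eps, sc = 0.10, 3.0
cn_pdf = lambda x: (1 - eps) * stats.norm.pdf(x) + eps * stats.norm.pdf(x / sc) / sc
print(wilcoxon_are(cn_pdf, (1 - eps) + eps * sc ** 2))
```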
Many of the studies reviewed in the article by McKean and Sheather (1991) included power comparisons of the rank-based analyses with the least squares F-test, $F_{LS}$. The empirical power of $F_{LS}$ at normal error distributions was slightly better than the empirical power of $F_\varphi$ under Wilcoxon scores. Under error distributions with heavier tails than the normal distribution, the empirical power of $F_\varphi$ was generally larger, often much larger, than the empirical power of $F_{LS}$. These studies provide empirical evidence that the good asymptotic efficiency properties of the rank-based analysis hold in the small sample setting.

As discussed above, the noncentrality parameters of the test statistics $F_\varphi$ and $F_{LS}$ differ only in the scale parameters. Hence, in practice, planning designs based on the noncentrality parameter of $F_\varphi$ can proceed similar to the planning of a design using the noncentrality parameter of $F_{LS}$; see, for example, the discussion in Chapter 4 of Graybill (1976).
3.6.3 Further Remarks on the Dispersion Function

Let $\hat e$ denote the rank-based residuals when the linear model, (3.2.4), is fit using the scores based on the function $\varphi$. Suppose the same assumptions hold as above; i.e., (E.1), (D.1), and (D.2) in Section 3.4. In this section, we explore further properties of the residual dispersion $D(\hat e)$; see also Sections 3.9.2 and 3.11.

The functional corresponding to the dispersion function evaluated at the errors $e_i$ is determined as follows: letting $F_n$ denote the empirical distribution function of the iid errors $e_1, \ldots, e_n$, we have
$$\frac{1}{n}D(e) = \sum_{i=1}^n a(R(e_i))e_i\frac{1}{n} = \sum_{i=1}^n \varphi\!\left(\frac{n}{n+1}F_n(e_i)\right)e_i\frac{1}{n} = \int \varphi\!\left(\frac{n}{n+1}F_n(x)\right)x\,dF_n(x) \xrightarrow{P} \int \varphi(F(x))x\,dF(x) = \overline{D}_e \;. \qquad (3.6.30)$$
As Exercise 3.16.19 shows, $\overline{D}_e$ is a scale parameter; see also the examples below.
Let $D(\hat e)$ denote the residual dispersion $D(\hat\beta_\varphi) = D(Y, \Omega)$, the minimum dispersion between $Y$ and the column space of $X$. We next show that $n^{-1}D(\hat e)$ also converges in probability to $\overline D_e$, a result which will prove useful in Sections 3.9.2 and 3.11. Assume without loss of generality that the true $\beta$ is 0. We can write
$$D(\hat e) = \left(D(\hat e) - Q(\hat e)\right) + \left(Q(\hat e) - Q(\tilde e)\right) + Q(\tilde e) \;,$$
where $\tilde e = Y - X\tilde\beta$ and $\tilde\beta$ minimizes the approximating quadratic $Q$. By Lemmas 3.6.1 and 3.6.2 the two differences on the right-hand side converge to 0 in probability. After some algebra, we obtain
$$Q(\tilde e) = -\frac{\tau_\varphi}{2}\left\{\frac{1}{\sqrt n}S(e)'\left[\frac1n X'X\right]^{-1}\frac{1}{\sqrt n}S(e)\right\} + D(e) \;.$$
By Theorem 3.5.2 the term in braces on the right side converges in distribution to a $\chi^2$ random variable with $p$ degrees of freedom. This implies that $(D(e) - D(\hat e))/(\tau_\varphi/2)$ also converges in distribution to a $\chi^2$ random variable with $p$ degrees of freedom. Although this is a stronger result than we need, it does imply that $n^{-1}(D(e) - D(\hat e))$ converges to 0 in probability. Hence, $n^{-1}D(\hat e)$ converges in probability to $\overline D_e$.
The natural analog of the least squares F-test statistic is
$$\overline F_\varphi = \frac{RD/q}{\hat\sigma_D/2} \;, \qquad (3.6.31)$$
where $\hat\sigma_D = D(\hat e)/(n - p - 1)$, rather than $F_\varphi$. But we have
$$q\overline F_\varphi = \frac{\hat\tau_\varphi/2}{n^{-1}D(\hat e)/2}\, qF_\varphi \xrightarrow{D} \delta_F^{-1}\chi^2(q) \;, \qquad (3.6.32)$$
where $\delta_F$ is defined by
$$\tau_\varphi^{-1}\, n^{-1} D(\hat e) \xrightarrow{P} \delta_F \;. \qquad (3.6.33)$$
Hence, to have a limiting $\chi^2$-distribution for $q\overline F_\varphi$ we need $\delta_F = 1$. Below we give several examples where this occurs. In the first example the form of the error distribution is known, while in the second example the errors are normally distributed; however, these cases rarely occur in practice.

There is an even more acute problem with using $\overline F_\varphi$, though. In Section A.5.2 of the appendix, we show that the influence function of $\overline F_\varphi$ is not bounded in the $Y$-space, while, as noted above, the influence function of the statistic $F_\varphi$ is bounded in the $Y$-space provided the score function $\varphi(u)$ is bounded. Note, however, that the influence functions of $D(\hat e)$ and $\overline F_\varphi$ are linear rather than quadratic, as is the influence function of $F_{LS}$. Hence, they are somewhat less sensitive to outliers in the $Y$-space than $F_{LS}$; see Hettmansperger and McKean (1978).
Example 3.6.1. Form of Error Density Known.

Assume that the errors have density $f(x) = \sigma^{-1}f_0(x/\sigma)$ where $f_0$ is known. Our choice of scores would then be the optimal scores given by
$$\varphi_0(u) = -\frac{1}{\sqrt{I(f_0)}}\,\frac{f_0'(F_0^{-1}(u))}{f_0(F_0^{-1}(u))} \;, \qquad (3.6.34)$$
where $I(f_0)$ denotes the Fisher information corresponding to $f_0$. These scores yield an asymptotically efficient rank-based analysis. Exercise 3.16.20 shows that with these scores
$$\tau_\varphi = \overline D_e \;. \qquad (3.6.35)$$
Thus $\delta_F = 1$ for this example and $q\overline F_{\varphi_0}$ has a limiting $\chi^2(q)$-distribution under $H_0$.
Example 3.6.2. Errors are Normally Distributed.

In this case the form of the error density is $f_0(x) = (\sqrt{2\pi})^{-1}\exp\{-x^2/2\}$; i.e., the standard normal density. This is of course a subcase of the last example. The optimal scores in this case are the normal scores $\varphi_0(u) = \Phi^{-1}(u)$, where $\Phi$ denotes the standard normal distribution function. Using these scores, the statistic $q\overline F_{\varphi_0}$ has a limiting $\chi^2(q)$-distribution under $H_0$. Note here that the score function $\varphi_0(u) = \Phi^{-1}(u)$ is unbounded; hence the above theory must be modified to obtain this result. Under further regularity conditions on the design matrix, Jureckova (1969) obtained asymptotic linearity for the unbounded score function case; see also Koul (1992, p. 51). Using these results, the limiting distribution of $q\overline F_{\varphi_0}$ can be obtained. The R-estimates based on these scores, however, have an unbounded influence function; see Section 1.8.1. We next consider this analysis for Wilcoxon and sign scores.
If Wilcoxon scores are employed then Exercise 3.16.21 shows that
$$\tau_\varphi = \sqrt{\frac{\pi}{3}}\,\sigma \qquad (3.6.36)$$
$$\overline D_e = \sqrt{\frac{3}{\pi}}\,\sigma \;. \qquad (3.6.37)$$
Thus, in this case, a consistent estimate of $\tau_\varphi/2$ is $n^{-1}D(\hat e)(\pi/6)$.

For sign scores a similar computation yields
$$\tau_S = \sqrt{\frac{\pi}{2}}\,\sigma \qquad (3.6.38)$$
$$\overline D_e = \sqrt{\frac{2}{\pi}}\,\sigma \;. \qquad (3.6.39)$$
Hence $n^{-1}D(\hat e)(\pi/4)$ is a consistent estimate of $\tau_S/2$.
Note that both examples are overly restrictive and, again, in all cases the resulting rank-based test of the general linear hypothesis $H_0$ based on $\overline F_\varphi$ has an unbounded influence function, even in the case when the errors have a normal density and the analysis is based on Wilcoxon or sign scores. In general, then, we recommend using a bounded score function $\varphi$ and the corresponding test statistic $F_\varphi$, (3.2.18), which is highly efficient and whose influence function, (3.6.21), is bounded in the $Y$-space.
3.7 Implementation of the R-Analysis

Up to this point, we have presented the geometry and asymptotic theory of the R-analysis. In order to implement the analysis we need to discuss the estimation of the scale parameters $\tau_\varphi$ and $\tau_S$. Estimation of $\tau_S$ is discussed around expression (1.5.28); here, though, the estimate is based on the residuals. We next discuss estimation of the scale parameter $\tau_\varphi$. We also discuss algorithms for obtaining the rank-based analysis.
3.7.1 Estimates of the Scale Parameter $\tau_\varphi$

The estimators of $\tau_\varphi$ that we discuss are based on the R-residuals formed after estimating $\beta$. In particular, the estimators do not depend on the estimate of the intercept parameter $\alpha$. Suppose then we have fit Model (3.2.3) based on a score function $\varphi$ which satisfies (S.1), (3.4.10); i.e., $\varphi$ is bounded and is standardized so that $\int\varphi = 0$ and $\int\varphi^2 = 1$. Let $\hat\beta_\varphi$ denote the R-estimate of $\beta$ and let $\hat e_R = Y - X\hat\beta_\varphi$ denote the residuals based on the R-fit.

There have been several estimates of $\tau_\varphi$ proposed. McKean and Hettmansperger (1976) proposed a Lehmann-type estimator based on the standardized length of a confidence interval for the intercept parameter $\alpha$. This estimator is a function of residuals and is consistent provided the density of the errors is symmetric. It is similar to the estimators of $\tau_\varphi$ discussed in Chapter 1. For Wilcoxon scores, Aubuchon and Hettmansperger (1984, 1989) obtained a density-type estimator for $\tau_\varphi$ and showed it was consistent for symmetric and asymmetric error distributions. Both of these estimators are available as options in the command RREGR in Minitab. In this section we briefly sketch the development of an estimator of $\tau_\varphi$ for bounded score functions proposed by Koul, Sievers and McKean (1987). It is a density-type estimate based on residuals which is consistent for symmetric and asymmetric error distributions which satisfy (E.1), (3.4.1). It further satisfies a uniform consistency property as stated in Theorem 3.7.1. Witt et al. (1995) derived the influence function of this estimator, showing that it is robust.

A bootstrap percentile-t procedure based on this estimator did quite well in terms of empirical validity and efficiency in the Monte Carlo study performed by George, McKean, Schucany and Sheather (1995).
Let the score function $\varphi$ satisfy (S.1), (S.2), and (S.3) of Section 3.4. Since it is bounded, consider the standardization of it given by
$$\varphi^*(u) = \frac{\varphi(u) - \varphi(0)}{\varphi(1) - \varphi(0)} \;. \qquad (3.7.1)$$
Since $\varphi^*$ is a linear function of $\varphi$, the inference properties under either score function are the same. The score function $\varphi^*$ will be useful since it is also a distribution function on $(0,1)$. Recall that $\tau_\varphi = 1/\gamma$ where
$$\gamma = \int_0^1 \varphi(u)\varphi_f(u)\,du \quad\text{and}\quad \varphi_f(u) = -\frac{f'(F^{-1}(u))}{f(F^{-1}(u))} \;.$$
Note that $\gamma^* = \int \varphi^*(u)\varphi_f(u)\,du = \gamma\,(\varphi(1) - \varphi(0))^{-1}$. For the present it will be more convenient to work with $\gamma^*$.
If we make the change of variable $u = F(x)$ in $\gamma^*$, we can rewrite it as
$$\gamma^* = -\int \varphi^*(F(x))f'(x)\,dx = \int \varphi^{*\prime}(F(x))f^2(x)\,dx = \int f(x)\,d\varphi^*(F(x)) \;,$$
where the second equality is obtained upon integration by parts using $dv = f'(x)\,dx$ and $u = \varphi^*(F(x))$.

From the above assumptions on $\varphi$, $\varphi^*(F(x))$ is a distribution function. Suppose $Z_1$ and $Z_2$ are independent random variables with distribution functions $F(x)$ and $\varphi^*(F(x))$, respectively. Let $H(y)$ denote the distribution function of $|Z_1 - Z_2|$. It then follows that
$$H(y) = P[|Z_1 - Z_2| \le y] = \begin{cases}\displaystyle\int [F(z_2 + y) - F(z_2 - y)]\,d\varphi^*(F(z_2)) & y > 0 \\ 0 & y \le 0\end{cases} \;. \qquad (3.7.2)$$
Let $h(y)$ denote the density of $H(y)$. Upon differentiating under the integral sign in expression (3.7.2) it easily follows that
$$h(0) = 2\gamma^* \;. \qquad (3.7.3)$$
So to estimate $\gamma^*$ we need to estimate $h(0)$.
Using the transformation $t = F(z_2)$, rewrite (3.7.2) as
$$H(y) = \int_0^1\left[F(F^{-1}(t) + y) - F(F^{-1}(t) - y)\right]d\varphi^*(t) \;. \qquad (3.7.4)$$
Next let $\widehat F_n$ denote the empirical distribution function of the R-residuals and let $\widehat F_n^{-1}(t) = \inf\{x : \widehat F_n(x) \ge t\}$ denote the usual inverse of $\widehat F_n$. Let $\widehat H_n$ denote the estimate of $H$ which is obtained by replacing $F$ by $\widehat F_n$. Some simplification follows by noting that for $t \in ((j-1)/n, j/n]$, $\widehat F_n^{-1}(t) = \hat e_{(j)}$. This leads to the following form of $\widehat H_n$:
$$\begin{aligned}
\widehat H_n(y) &= \int_0^1\left[\widehat F_n(\widehat F_n^{-1}(t) + y) - \widehat F_n(\widehat F_n^{-1}(t) - y)\right]d\varphi^*(t)\\
&= \sum_{j=1}^n \int_{((j-1)/n,\, j/n]}\left[\widehat F_n(\widehat F_n^{-1}(t) + y) - \widehat F_n(\widehat F_n^{-1}(t) - y)\right]d\varphi^*(t)\\
&= \sum_{j=1}^n\left[\widehat F_n(\hat e_{(j)} + y) - \widehat F_n(\hat e_{(j)} - y)\right]\left[\varphi^*\!\left(\frac jn\right) - \varphi^*\!\left(\frac{j-1}{n}\right)\right]\\
&= \frac1n\sum_{i=1}^n\sum_{j=1}^n\left[\varphi^*\!\left(\frac jn\right) - \varphi^*\!\left(\frac{j-1}{n}\right)\right]I(|\hat e_{(i)} - \hat e_{(j)}| \le y) \;. \qquad (3.7.5)
\end{aligned}$$
An estimate of $h(0)$, and hence of $\gamma^*$ by (3.7.3), is an estimate of the form $\widehat H_n(t_n)/(2t_n)$ where $t_n$ is chosen close to 0. Since $\widehat H_n$ is a distribution function, let $\hat t_{n,\delta}$ denote the $\delta$th quantile of $\widehat H_n$; i.e., $\hat t_{n,\delta} = \widehat H_n^{-1}(\delta)$. Then take $t_n = \hat t_{n,\delta}/\sqrt n$. Our estimate of $\gamma$ is given by
$$\hat\gamma_{n,\delta} = (\varphi(1) - \varphi(0))\,\frac{\widehat H_n(\hat t_{n,\delta}/\sqrt n)}{2\hat t_{n,\delta}/\sqrt n} \;. \qquad (3.7.6)$$
Its consistency is given by the following theorem:

Theorem 3.7.1. Under (E.1), (D.1), (S.1), and (S.2) of Section 3.4, and for any $0 < \delta < 1$,
$$\sup_{\varphi\in\mathcal C}\left|\hat\gamma_{n,\delta} - \gamma\right| \xrightarrow{P} 0 \;,$$
where $\mathcal C$ denotes the class of all bounded, right continuous, nondecreasing score functions defined on the interval $(0,1)$.

The proof can be found in Koul et al. (1987). It follows immediately that $\hat\tau_\varphi = 1/\hat\gamma_{n,\delta}$ is a consistent estimate of $\tau_\varphi$. Note that the uniformity condition on the scores in the theorem is more than we need here. This result, though, proves useful in adaptive procedures which estimate the score function; see McKean and Sievers (1989).
Since the scores are differentiable, an approximation of $\widehat H_n$ is obtained by an application of the mean value theorem to (3.7.5), which results in
$$\widehat H_n^*(y) = \frac{1}{c_n n}\sum_{i=1}^n\sum_{j=1}^n \varphi^{*\prime}\!\left(\frac{j}{n+1}\right)I(|\hat e_{(i)} - \hat e_{(j)}| \le y) \;, \qquad (3.7.7)$$
where $c_n = \sum_{j=1}^n \varphi^{*\prime}(j/(n+1))$ is such that $\widehat H_n^*$ is a distribution function.
The expression (3.7.5) for $\widehat H_n$ contains a density estimate of $f$ based on a rectangular kernel. Hence, in choosing $\delta$ we are really choosing a bandwidth for a density estimator. As most kernel-type density estimates are sensitive to the bandwidth, so is $\hat\tau_\varphi$ sensitive to $\delta$. Several small sample studies have been done on this estimate of $\tau_\varphi$; see McKean and Sheather (1991) for a summary. In these studies the quality of an estimator of $\tau_\varphi$ is based on how well it standardizes test statistics such as $F_\varphi$, in terms of how close the empirical $\alpha$-levels of the test statistic are to nominal $\alpha$-levels. In the same way, scale estimators used in confidence intervals were judged by how close empirical confidence levels were to nominal confidence levels. The major concern is thus the validity of the inference procedure. For moderate sample sizes where the ratio $n/p$ exceeds 5, the value $\delta = .80$ yielded valid estimates. For ratios less than 5, larger values of $\delta$, around .90, gave valid estimates. In all cases it was found that the following simple degrees of freedom correction benefited the analysis:
$$\hat\tau_\varphi = \sqrt{\frac{n}{n-p-1}}\,\hat\gamma_{n,\delta}^{-1} \;. \qquad (3.7.8)$$
Note that this is similar to the least squares correction on the maximum likelihood estimate (under normality) of the variance.
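As an illustration, the following is a minimal sketch of this estimator for Wilcoxon scores, for which $\varphi^*(u) = u$, so the weights in (3.7.5) all equal $1/n$ and $\widehat H_n$ is simply the empirical distribution function of the pairwise absolute differences of the residuals. The function name and the brute-force $O(n^2)$ pairwise computation are our own choices, not from the references.

```python
import numpy as np

def wilcoxon_tau(resid, p, delta=0.80):
    """Sketch of the Koul-Sievers-McKean estimate of tau_phi for Wilcoxon
    scores, following (3.7.5)-(3.7.8).  resid: residuals from the full-model
    R-fit; p: number of regression coefficients (excluding the intercept)."""
    e = np.sort(np.asarray(resid, dtype=float))
    n = len(e)
    # For phi*(u) = u the weights in (3.7.5) are all 1/n, so H_n is the
    # empirical cdf of the pairwise absolute differences (O(n^2) memory).
    diffs = np.abs(e[:, None] - e[None, :]).ravel()
    diffs.sort()
    # t_{n,delta} = H_n^{-1}(delta); the bandwidth is t_{n,delta}/sqrt(n).
    t_n = diffs[int(np.ceil(delta * len(diffs))) - 1] / np.sqrt(n)
    Hn_tn = np.searchsorted(diffs, t_n, side="right") / len(diffs)
    # (3.7.6): gamma-hat, rescaled by phi(1) - phi(0) = sqrt(12).
    gamma = np.sqrt(12.0) * Hn_tn / (2.0 * t_n)
    # (3.7.8): degrees-of-freedom correction.
    return np.sqrt(n / (n - p - 1)) / gamma
```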
3.7.2 Algorithms for Computing the R-Analysis

As we saw in Section 3.2, the dispersion function $D(\beta)$ is a continuous convex function of $\beta$. Gradient-type algorithms, such as steepest descent, can be used to minimize $D(\beta)$, but they are often agonizingly slow. The algorithm which we describe next is a Newton type of algorithm based on the asymptotic quadraticity of $D(\beta)$. It is generally much faster than gradient-type algorithms and is currently used in the RREGR command in Minitab and in the program RGLM (Kapenga, McKean and Vidmar, 1988). A finite algorithm to minimize $D(\beta)$ is discussed by Osborne (1985).

The Newton type of algorithm needs an initial estimate, which we denote as $\hat\beta^{(0)}$. Let $\hat e^{(0)} = Y - X\hat\beta^{(0)}$ denote the initial residuals and let $\hat\tau_\varphi^{(0)}$ denote the initial estimate of $\tau_\varphi$ based on these residuals. By (3.5.11) the approximating quadratic to $D$ based on $\hat\beta^{(0)}$ is given by
$$Q(\beta) = \left(2\hat\tau_\varphi^{(0)}\right)^{-1}\left(\beta - \hat\beta^{(0)}\right)'X'X\left(\beta - \hat\beta^{(0)}\right) - \left(\beta - \hat\beta^{(0)}\right)'S\left(Y - X\hat\beta^{(0)}\right) + D\left(Y - X\hat\beta^{(0)}\right) \;.$$
By (3.5.13), the value of $\beta$ which minimizes $Q(\beta)$ is given by
$$\hat\beta^{(1)} = \hat\beta^{(0)} + \hat\tau_\varphi^{(0)}(X'X)^{-1}S(Y - X\hat\beta^{(0)}) \;. \qquad (3.7.9)$$
This is the first Newton step. In the same way that the first step was defined in terms of the initial estimate, a second step can be defined in terms of the first step. We shall call these iterated estimates or k-step estimates. In practice, though, we would want to know if $D(\hat\beta^{(1)})$ is less than $D(\hat\beta^{(0)})$ before proceeding. A more formal algorithm is presented below.

These k-step estimates satisfy some interesting properties themselves, which we briefly discuss; details can be found in McKean and Hettmansperger (1978). Provided the initial estimate is such that $\sqrt n(\hat\beta^{(0)} - \beta)$ is bounded in probability, then for any $k \ge 1$ we have
$$\sqrt n\left(\hat\beta^{(k)} - \hat\beta_\varphi\right) \xrightarrow{P} 0 \;,$$
where $\hat\beta_\varphi$ denotes a minimizing value of $D$. Hence the k-step estimates have the same asymptotic distribution as $\hat\beta_\varphi$. Furthermore, $\hat\tau_\varphi^{(k)}$ is a consistent estimate of $\tau_\varphi$, if it is any of the scale estimates discussed in Section 3.7.1 based on k-step residuals. Let $F_\varphi^{(k)}$ denote the R-test of a general linear hypothesis based on reduced and full model k-step estimates. Then it can be shown that $F_\varphi^{(k)}$ satisfies the same asymptotic properties as the test statistic $F_\varphi$ under the null hypothesis and contiguous alternatives. Also it is consistent for any alternative $H_A$.
Formal Algorithm

In order to outline the algorithm used by RGLM, first consider the QR-decomposition of $X$, which is given by
$$Q'X = R \;, \qquad (3.7.10)$$
where $Q$ is an $n\times n$ orthogonal matrix and $R$ is an $n\times p$ upper triangular matrix of rank $p$. As discussed in Stewart (1973), $Q$ can be expressed as a product of $p$ Householder transformations. Writing $Q = [Q_1\ Q_2]$ where $Q_1$ is $n\times p$, it is easy to show that the columns of $Q_1$ form an orthonormal basis for the column space of $X$. In particular, the projection matrix onto the column space of $X$ is given by $H = Q_1Q_1'$. The software package LINPACK (1979) is a collection of subroutines which efficiently computes QR-decompositions, and it further has routines which obtain projections of vectors.

Note that we can write the kth Newton step in terms of residuals as
$$e^{(k)} = e^{(k-1)} - \hat\tau_\varphi\, H a(R(e^{(k-1)})) \;, \qquad (3.7.11)$$
where $a(R(e^{(k-1)}))$ denotes the vector whose $i$th component is $a(R(e^{(k-1)}_i))$. Let $D^{(k)}$ denote the dispersion function evaluated at $e^{(k)}$. The Newton step is a step from $e^{(k-1)}$ along the direction $\hat\tau_\varphi Ha(R(e^{(k-1)}))$. If $D^{(k)} < D^{(k-1)}$ the step has been successful; otherwise, a linear search can be made along the direction to find a value which minimizes $D$. This would then become the kth step residual. Such a search can be performed using methods such as false position, as discussed below in Section 3.7.3. Stopping rules can be based on the relative drop in dispersion; i.e., stop when
$$\frac{D^{(k-1)} - D^{(k)}}{D^{(k-1)}} < \epsilon_D \;, \qquad (3.7.12)$$
where $\epsilon_D$ is a specified tolerance. A similar stopping rule can be based on the relative size of the step. Upon stopping at step $k$, obtain the fitted value $\widehat Y = Y - e^{(k)}$ and then the estimate of $\beta$ by solving $X\beta = \widehat Y$.
A formal algorithm is as follows. Let $\epsilon_D$ and $\epsilon_s$ be the given stopping tolerances.

1. Set $k = 1$. Obtain initial residuals $e^{(k-1)}$ and, based upon these, an initial estimate $\hat\tau^{(0)}_\varphi$ of $\tau_\varphi$.

2. Obtain $e^{(k)}$ as in expression (3.7.11). If the step is successful proceed to the next step; otherwise search along the Newton direction for a value which minimizes $D$, then go to the next step. An algorithm for this search is discussed in Section 3.7.3.

3. If the relative drop in dispersion or length of step is within its respective tolerance $\epsilon_D$ or $\epsilon_s$, stop; otherwise set $e^{(k-1)} = e^{(k)}$ and go to step (2).

4. Obtain the estimate of $\alpha$ and the final estimate of $\tau_\varphi$. (A sketch of this iteration is given below.)
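The following is a minimal sketch of the iteration for Wilcoxon scores, assuming a centered design matrix and a least squares starting value. Two simplifications relative to the text are our own: step-halving is substituted for the false-position search of Section 3.7.3, and `tau_fn` stands for any scale estimate of Section 3.7.1 (e.g., the `wilcoxon_tau` sketch above).

```python
import numpy as np

def wilcoxon_scores(e):
    """a(R(e_i)) = sqrt(12)(R(e_i)/(n+1) - 1/2), the Wilcoxon scores."""
    n = len(e)
    ranks = np.argsort(np.argsort(e)) + 1
    return np.sqrt(12.0) * (ranks / (n + 1.0) - 0.5)

def dispersion(e):
    """Jaeckel's dispersion D(e) = sum a(R(e_i)) e_i for Wilcoxon scores."""
    return np.sum(wilcoxon_scores(e) * e)

def r_fit(X, y, tau_fn, eps_D=1e-6, max_steps=50):
    """Newton-type minimization of D via the residual update (3.7.11)
    and the relative-drop stopping rule (3.7.12).  X must be centered."""
    H = X @ np.linalg.solve(X.T @ X, X.T)              # projection onto col(X)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # LS starting residuals
    D_old = dispersion(e)
    for _ in range(max_steps):
        direction = tau_fn(e) * (H @ wilcoxon_scores(e))
        step, e_new = 1.0, e - direction
        while dispersion(e_new) >= D_old and step > 1e-8:
            step /= 2.0                                # crude line search
            e_new = e - step * direction
        D_new = dispersion(e_new)
        if (D_old - D_new) / D_old < eps_D:
            e = e_new
            break
        e, D_old = e_new, D_new
    beta = np.linalg.lstsq(X, y - e, rcond=None)[0]    # solve X beta = Y-hat
    alpha = np.median(e)                               # median-of-residuals intercept
    return alpha, beta, e - alpha
```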
The QR-decomposition can readily be used to form a reduced model design matrix for testing the general linear hypotheses (3.2.5), $M\beta = 0$, where $M$ is a specified $q\times p$ matrix. Recall that we called the column space of $X$, $\Omega_F$, and the space $\Omega_F$ constrained by $M\beta = 0$ the reduced model space, $\omega$. The key result lies in the following theorem:

Theorem 3.7.2. Denote the row space of $M$ by $\mathcal R(M')$. Let $Q_M$ be a $p\times(p-q)$ matrix whose columns consist of an orthonormal basis for the space $(\mathcal R(M'))^\perp$. If $U = XQ_M$, then $\mathcal R(U) = \omega$.

Proof: If $u\in\omega$ then $u = Xb$ for some $b$ where $Mb = 0$. Hence $b\in(\mathcal R(M'))^\perp$; i.e., $b = Q_Mc$ for some $c$. Conversely, if $u\in\mathcal R(U)$ then for some $c\in R^{p-q}$, $u = X(Q_Mc)$. Hence $u\in\mathcal R(X)$ and $M(Q_Mc) = (MQ_M)c = 0$.

Thus, using the LINPACK subroutines mentioned above, it is easy to write an algorithm which obtains the reduced model design matrix $U$ defined above in the theorem. The package RGLM uses such an algorithm to test linear hypotheses; see Kapenga, McKean and Vidmar (1988).
3.7.3 An Algorithm for a Linear Search

The computation of many of the quantities needed in a rank-based analysis involves simple linear searches. Examples include the estimate of the location parameter for a signed-rank procedure, the estimate of the shift in location in the two sample location problem, the estimate of $\tau_\varphi$ discussed in Section 3.7, and the search along the Newton direction for a minimizing value in Step (2) of the algorithm for the R-fit in a regression problem discussed in the last section. The following is a generic setup for these problems: solve the equation
$$S(b) = K \;, \qquad (3.7.13)$$
where $S(b)$ is a decreasing step function and $K$ is a specified constant. Without loss of generality we take $K = 0$ for the remainder of the discussion. By the monotonicity, a solution always exists, although it may be an interval of solutions. In almost all cases, $S(b)$ is asymptotically linear; so the search problem becomes relatively more efficient as the sample size increases.

There are certainly many search algorithms that can be used for solving (3.7.13). One that we have successfully employed is the Illinois version of regula falsi; see Dowell and Jarratt (1971). McKean and Ryan (1977) employed this routine to obtain the estimate and confidence interval for the two sample Wilcoxon location problem. We write the generic asymptotic linearity result as
$$S(b) \doteq S(b^{(0)}) - \zeta\,(b - b^{(0)}) \;. \qquad (3.7.14)$$
The parameter $\zeta$ is often of the form $\tau^{-1}C$, where $C$ is some constant. Since $\tau$ is a scale parameter, initial estimates of it include such estimates as the MAD, (3.9.27), or the sample standard deviation. We have found MAD to usually be preferable. An outline of an algorithm for the search is:
1. Bracket Step. Beginning with an initial estimate $b^{(0)}$, step along the $b$-axis to $b^{(1)}$ where the interval $(b^{(0)}, b^{(1)})$, or vice-versa, brackets the solution. Asymptotic linearity can be used here to make these steps; for instance, if $\hat\zeta^{(0)}$ is an estimate of $\zeta$ based on $b^{(0)}$ then the first step is
$$b^{(1)} = b^{(0)} + S(b^{(0)})/\hat\zeta^{(0)} \;.$$

2. Regula Falsi. Assume the interval $(b^{(0)}, b^{(1)})$ brackets the solution and that $b^{(1)}$ is the more recent of $b^{(0)}, b^{(1)}$. If $|b^{(1)} - b^{(0)}| < \epsilon$ then stop. Else, the next step is where the secant line determined by $b^{(0)}, b^{(1)}$ intersects the $b$-axis; i.e.,
$$b^{(2)} = b^{(0)} - \frac{b^{(1)} - b^{(0)}}{S(b^{(1)}) - S(b^{(0)})}\,S(b^{(0)}) \;. \qquad (3.7.15)$$

(a) If $(b^{(0)}, b^{(2)})$ brackets the solution then replace $b^{(1)}$ by $b^{(2)}$ and go to (2), but use $S(b^{(0)})/2$ in place of $S(b^{(0)})$ in the determination of the secant line (this is the Illinois modification).

(b) If $(b^{(2)}, b^{(1)})$ brackets the solution then replace $b^{(0)}$ by $b^{(2)}$ and go to (2).
The above algorithm is easy to implement; a sketch is given below. Such searches are used in the package RGLM; see Kapenga, McKean and Vidmar (1988).
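The following compact sketch assumes the bracketing step has already been done (e.g., via (3.7.14)), so the routine is handed an interval with $S$ taking opposite signs at the endpoints. As one use, with $S(\Delta) = \#\{Y_j - X_i > \Delta\} - mn/2$ the returned root is the Hodges-Lehmann estimate of shift in the two sample problem mentioned above.

```python
def illinois(S, a, b, eps=1e-8, max_iter=100):
    """Sketch of the Illinois version of regula falsi for solving S(b) = 0,
    (3.7.13), where S is a decreasing step function.  (a, b) must bracket
    the solution: S(a) and S(b) have opposite signs."""
    fa, fb = S(a), S(b)
    for _ in range(max_iter):
        if abs(b - a) < eps:
            break
        # Secant step, (3.7.15).
        c = b - fb * (b - a) / (fb - fa)
        fc = S(c)
        if fc * fb < 0.0:
            # (b, c) brackets: ordinary regula-falsi replacement.
            a, fa = b, fb
        else:
            # Endpoint a retained twice in a row: halve its ordinate
            # (the Illinois modification) so the bracket cannot stall.
            fa /= 2.0
        b, fb = c, fc
    return (a + b) / 2.0
```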
3.8 L$_1$-Analysis

This section is devoted to L$_1$-procedures. These are widely used procedures; see, for example, Bloomfield and Steiger (1983). We first show that they are equivalent to R-estimates based on the sign score function under Model (3.2.4). Hence the asymptotic theory for L$_1$-estimation and subsequent analysis is contained in Section 3.5. The asymptotic theory for L$_1$-estimation can also be found in Bassett and Koenker (1978) and Rao (1988) from an L$_1$ point of view.

Consider the sign scores; i.e., the scores generated by $\varphi(u) = \mathrm{sgn}(u - 1/2)$. In this section we denote the associated pseudo-norm by
$$\|v\|_S = \sum_{i=1}^n \mathrm{sgn}\!\left(R(v_i) - \frac{n+1}{2}\right)v_i \;, \qquad v\in R^n \;;$$
see also Section 2.6.1. This score function is optimal if the errors follow a double exponential (Laplace) distribution; see Exercise 2.13.19 of Chapter 2. We shall summarize the analysis based on sign scores, but first we show that the R-estimates based on sign scores are indeed also L$_1$-estimates, provided that the intercept is estimated by the median of the residuals.

Consider the intercept model, (3.2.4), as given in Section 3.2 and let $\Omega$ denote the column space of $X$ and $\Omega_1$ denote the column space of the augmented matrix $X_1 = [1\ X]$. First consider the R-estimate of $\beta$ based on the L$_1$ pseudo-norm. This is the vector $\widehat Y_S$ such that
$$\widehat Y_S = \mathop{\mathrm{Argmin}}_{\eta\in\Omega}\|Y - \eta\|_S \;.$$
Next consider the L$_1$-estimate for the space $\Omega_1$; i.e., the L$_1$-estimate of $\alpha 1 + X\beta$. This is the vector $\widehat Y_{L_1}\in\Omega_1$ such that
$$\widehat Y_{L_1} = \mathop{\mathrm{Argmin}}_{\eta\in\Omega_1}\|Y - \eta\|_{L_1} \;,$$
where $\|v\|_{L_1} = \sum|v_i|$ is the L$_1$-norm.
Theorem 3.8.1. R-estimates based on sign scores are equivalent to L$_1$-estimates; that is,
$$\widehat Y_{L_1} = \widehat Y_S + \mathrm{med}\{Y - \widehat Y_S\}\,1 \;. \qquad (3.8.1)$$
Proof: Any vector $v\in\Omega_1$ can be written uniquely as $v = a1 + v_c$, where $a$ is a scalar and $v_c\in\Omega$. Since the sample median minimizes the L$_1$-distance between a vector and the space spanned by $1$, we have
$$\|Y - v\|_{L_1} = \|Y - a1 - v_c\|_{L_1} \ge \|Y - \mathrm{med}\{Y - v_c\}1 - v_c\|_{L_1} \;.$$
But it is easy to show that $\mathrm{sgn}(Y_i - \mathrm{med}\{Y - v_c\} - v_{ci}) = \mathrm{sgn}(R(Y_i - v_{ci}) - (n+1)/2)$ for $i = 1,\ldots,n$. Putting these two results together, along with the fact that the sign scores sum to 0, we have
$$\|Y - v\|_{L_1} = \|Y - a1 - v_c\|_{L_1} \ge \|Y - \mathrm{med}\{Y - v_c\}1 - v_c\|_{L_1} = \|Y - v_c\|_S \;, \qquad (3.8.2)$$
for any vector $v\in\Omega_1$. Once more using the sign argument above, we can show that
$$\|Y - \mathrm{med}\{Y - \widehat Y_S\}1 - \widehat Y_S\|_{L_1} = \|Y - \widehat Y_S\|_S \;. \qquad (3.8.3)$$
Putting (3.8.2) and (3.8.3) together establishes the result.
Let $\hat b_S = (\hat\alpha_S, \hat\beta_S')'$ denote the R-estimate of the vector of regression coefficients $b = (\alpha, \beta')'$. It follows that these R-estimates are the maximum likelihood estimates if the errors $e_i$ are double exponentially distributed; see Exercise 3.16.13.

From the discussions in Sections 3.5 and 3.5.2, $\hat b_S$ has an approximate $N(b, \tau_S^2(X_1'X_1)^{-1})$ distribution, where $\tau_S = (2f(0))^{-1}$. From this, the efficiency properties of the L$_1$-procedures discussed in the first two chapters carry over to the L$_1$ linear model procedures. In particular, its efficiency relative to LS at the normal distribution is .63, and it can be much more efficient than LS for heavier-tailed error distributions.
As Exercise 3.16.22 shows, the drop in dispersion test based on sign scores, $F_S$, is, except for the scale parameter, the likelihood ratio test of the general linear hypothesis (3.2.5), provided the errors have a double exponential distribution. For other error distributions, the same comments about the efficiency of the L$_1$ estimates can be made about the test $F_S$.

In terms of implementation, Schrader and McKean (1987) found it more difficult to standardize the L$_1$ statistics than other R-procedures, such as the Wilcoxon. Their most successful standardization of $F_S$ was based on the following bootstrap procedure:

1. Compute the full model L$_1$ estimates $\hat\alpha_S$ and $\hat\beta_S$, the full model residuals $\hat e_1,\ldots,\hat e_n$, and the test statistic $F_S$.
2. Select $\hat e_1,\ldots,\hat e_{n^*}$, the $n^* = n - (p + 1)$ nonzero residuals.

3. Draw a bootstrap random sample $e_1^*,\ldots,e_n^*$ with replacement from $\hat e_1,\ldots,\hat e_{n^*}$. Calculate $\hat\beta_S^*$ and $F_S^*$, the L$_1$ estimate and test statistic, from the model $y_i^* = \hat\alpha_S + x_i'\hat\beta_S + e_i^*$.

4. Independently repeat step 3 a large number $B$ of times. The bootstrap p-value is $\hat p^* = \#\{F_S^* \ge F_S\}/B$.

5. Reject $H_0$ at level $\alpha$ if $\hat p^* \le \alpha$.
Notice that, by using full model residuals, the algorithm estimates the null distribution of $F_S$. The algorithm depends on the number $B$ of bootstrap samples taken; we suggest at least 2000. A sketch of the procedure is given below.
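The following sketch uses statsmodels' median (quantile .5) regression for the L$_1$ fits; the helper names are our own. One simplification relative to the text: the unscaled drop in the L$_1$ criterion is used as the test statistic on the data and on every resample. This sidesteps estimating $\tau_S$; since the text's $F_S$ would recompute $\hat\tau_S$ on each resample, the p-values need not be identical, but the calibration idea is the same.

```python
import numpy as np
import statsmodels.api as sm

def sad(resid):
    """Sum of absolute deviations: the L1 criterion."""
    return np.abs(resid).sum()

def l1_drop(y, X_full, X_red):
    """Drop in the L1 criterion between reduced and full fits (both with
    an intercept); this is RD up to the scale factor tau_S/2 in F_S."""
    full = sm.QuantReg(y, sm.add_constant(X_full)).fit(q=0.5)
    red = sm.QuantReg(y, sm.add_constant(X_red)).fit(q=0.5)
    return sad(red.resid) - sad(full.resid), full

def bootstrap_pvalue(y, X_full, X_red, B=2000, seed=0):
    """Schrader-McKean style bootstrap standardization of the sign-scores
    drop-in-dispersion test, steps 1-5 above."""
    rng = np.random.default_rng(seed)
    rd, full = l1_drop(y, X_full, X_red)                 # step 1
    e = np.asarray(full.resid)
    e = e[np.abs(e) > 1e-10]                             # step 2: nonzero residuals
    yhat = np.asarray(full.fittedvalues)
    count = 0
    for _ in range(B):
        ystar = yhat + rng.choice(e, size=len(y), replace=True)  # step 3
        rd_star, _ = l1_drop(ystar, X_full, X_red)
        count += rd_star >= rd                           # step 4
    return count / B            # step 5: reject H0 at level alpha if p* <= alpha
```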
3.9 Diagnostics

One of the most important parts of the analysis of a linear model is the examination of the resulting fit. Tools for doing this include residual plots and diagnostic techniques. Over the last fifteen years or so, these tools have been developed for fits based on least squares; see, for example, Cook and Weisberg (1982) and Belsley, Kuh and Welsch (1980). Least squares residual plots can be used to detect such things as curvature not accounted for by the fitted model; see Cook and Weisberg (1989) for a recent discussion. Further diagnostic techniques can be used to detect outliers, which are points that differ greatly from the pattern set by the bulk of the data, and to measure the influence of individual cases on the least squares fit.

In this section we explore the properties of the residuals from the rank-based fits, showing how they can be used to determine model misspecification. We present diagnostic techniques for rank-based residuals that detect outlying and influential cases. Together these tools offer the user a residual analysis for the rank-based method for the fit of a linear model similar to the residual analysis based on least squares estimates.

In this section we consider the same linear model, (3.2.3), as in Section 3.2. For a given score function $\varphi$, let $\hat\beta_\varphi$ and $\hat e_R$ denote the R-estimate of $\beta$ and the residuals from the R-fit of the model based on these scores. Much of the discussion is taken from the articles by McKean, Sheather and Hettmansperger (1990, 1991, 1993). Also see Dixon and McKean (1996) for a robust rank-based approach to modeling heteroscedasticity.

3.9.1 Properties of R-Residuals and Model Misspecification

As we discussed above, a primary use of least squares residuals is in the detection of model misspecification. In order to show that the R-residuals can also be used to detect model misspecification, consider the sequence of models
$$Y = 1\alpha + X\beta + Z\gamma + e \;, \qquad (3.9.1)$$
where $Z$ is an $n\times q$ centered matrix of constants and $\gamma = \theta/\sqrt n$, for $\theta \ne 0$. Note that this sequence of models is contiguous to Model (3.2.3). Suppose we fit Model (3.2.3), i.e., $Y = 1\alpha + X\beta + e$, when Model (3.9.1) is the true model; hence the model has been misspecified. As a first step in examining the residuals in this situation, we consider the limiting distribution of the corresponding R-estimate.

Theorem 3.9.1. Assume Model (3.9.1) is the true model. Let $\hat\beta_\varphi$ be the R-estimate for the Model (3.2.3). Suppose that conditions (E.1) and (S.1) of Section 3.4 are true and that conditions (D.1) and (D.2) are true for the augmented matrix $[X\ Z]$. Then $\hat\beta_\varphi$ has an approximate
$$N_p\!\left(\beta + (X'X)^{-1}X'Z\theta/\sqrt n,\; \tau_\varphi^2(X'X)^{-1}\right) \text{ distribution.} \qquad (3.9.2)$$
Proof: Without loss of generality assume that $\beta = 0$. Note that the situation here is the same as the situation in Theorem 3.6.3, except now the null hypothesis corresponds to $\gamma = 0$ and $\hat\beta_\varphi$ is the reduced model estimate. Thus we seek the asymptotic distribution of the reduced model estimate. As in Section 3.5.1, it is easier to consider the corresponding pseudo-estimate $\tilde\beta$, the reduced model estimate which minimizes the quadratic $Q(Y - X\beta)$, (3.5.11). Under the null hypothesis, $\gamma = 0$, $\sqrt n(\hat\beta_\varphi - \tilde\beta)\xrightarrow{P}0$; hence by contiguity $\sqrt n(\hat\beta_\varphi - \tilde\beta)\xrightarrow{P}0$ under the sequence of models (3.9.1). Thus $\hat\beta_\varphi$ and $\tilde\beta$ have the same distributions under (3.9.1); hence, it suffices to find the distribution of $\tilde\beta$. But by (3.5.13),
$$\tilde\beta = \tau_\varphi(X'X)^{-1}S(Y) \;, \qquad (3.9.3)$$
where $S(Y)$ is the first $p$ components of the vector $T(Y) = [X\ Z]'a(R(Y))$. By (3.6.28) of Theorem 3.6.3,
$$n^{-1/2}T(Y) \xrightarrow{D} N_{p+q}\!\left(\tau_\varphi^{-1}\Sigma^*(0',\theta')',\; \Sigma^*\right) \;, \qquad (3.9.4)$$
where $\Sigma^*$ is the following limit:
$$\lim_{n\to\infty}\frac1n\begin{bmatrix} X'X & X'Z\\ Z'X & Z'Z\end{bmatrix} = \Sigma^* \;.$$
Because $\tilde\beta$ is defined by (3.9.3), the result is just an algebraic computation applied to (3.9.4).

With a few more steps we can write a first order expression for $\hat\beta_\varphi$, which is given in the following corollary:
Corollary 3.9.1. Under the assumptions of the last theorem,
$$\hat\beta_\varphi = \beta + \tau_\varphi(X'X)^{-1}X'\varphi(F(e)) + (X'X)^{-1}X'Z\theta/\sqrt n + o_p(n^{-1/2}) \;. \qquad (3.9.5)$$
Proof: Without loss of generality assume that the regression coefficients are 0. By (A.3.10) and expression (3.6.27) of Theorem 3.6.3 we can write
$$\frac{1}{\sqrt n}T(Y) = \frac{1}{\sqrt n}\begin{bmatrix}X'\varphi(F(e))\\ Z'\varphi(F(e))\end{bmatrix} + \frac{1}{\tau_\varphi}\frac1n\begin{bmatrix}X'Z\theta\\ Z'Z\theta\end{bmatrix} + o_p(1) \;;$$
hence, the first $p$ components of $\frac{1}{\sqrt n}T(Y)$ satisfy
$$\frac{1}{\sqrt n}S(Y) = \frac{1}{\sqrt n}X'\varphi(F(e)) + \frac{1}{\tau_\varphi}\frac1n X'Z\theta + o_p(1) \;.$$
By expression (3.9.3) and the fact that $\sqrt n(\hat\beta_\varphi - \tilde\beta)\xrightarrow{P}0$, the result follows.
From this corollary we obtain the following first order expressions for the R-fitted values and R-residuals:
$$\widehat Y_R \doteq 1\alpha + X\beta + \tau_\varphi H\varphi(F(e)) + HZ\gamma \qquad (3.9.6)$$
$$\hat e_R \doteq e - \tau_\varphi H\varphi(F(e)) + (I - H)Z\gamma \;, \qquad (3.9.7)$$
where $H = X(X'X)^{-1}X'$. In Exercise 3.16.23 the reader is asked to show that the least squares fitted values and residuals satisfy
$$\widehat Y_{LS} = 1\alpha + X\beta + He + HZ\gamma \qquad (3.9.8)$$
$$\hat e_{LS} = e - He + (I - H)Z\gamma \;. \qquad (3.9.9)$$
In terms of model misspecification the coefficients of interest are the regression coefficients. Hence, at this time we need not consider the effect of the estimation of the intercept. This avoids the problem of which estimate of the intercept to use. In practice, though, for both R- and LS-fits, the intercept will also be fitted and its effect removed from the residuals. We will also include the effect of estimation of the intercept in our discussion of the standardization of residuals and fitted values in Sections 3.9.2 and 3.9.3, respectively.
Suppose that the linear model (3.2.3) is correct. Based on its first order expression when $\gamma = 0$, $\hat e_R$ is a function of the random errors similar to $\hat e_{LS}$; hence, it follows that a plot of $\hat e_R$ versus $\widehat Y_R$ should generally be a random scatter, similar to the least squares residual plot.

In the case of model misspecification, note that the R-residuals and least squares residuals have the same asymptotic bias, namely $(I-H)Z\gamma$. Hence R-residual plots, similar to those of least squares, are useful in identifying model misspecification.

For least squares residual plots, since least squares residuals and the fitted values are uncorrelated, any pattern in this plot is due to model misspecification and not the fitting procedure used. The converse, however, is not true. As the example on the potency of drug compounds below illustrates, the least squares residual plot can exhibit a random scatter for a poorly fitted model. This orthogonality in the LS residual plot does, however, make it easier to pick out patterns in the plot. Of course the R-residuals are not orthogonal to the R-fitted values, but they are usually close to orthogonality; see Naranjo et al. (1994). We introduce the following parameter to measure the extent of departure from orthogonality.

Denote general fitted values and residuals by $\widehat Y$ and $\hat e$, respectively. The expected departure from orthogonality is the parameter $\nu$ defined by
$$\nu = E\left[\hat e'\widehat Y\right] \;. \qquad (3.9.10)$$
For least squares, $\nu_{LS}$ is of course 0. For R-fits, we have the following first order expression for it:

Theorem 3.9.2. Under the assumptions of Theorem 3.9.1 and either Model (3.2.3) or Model (3.9.1),
$$\nu_R \doteq p\,\tau_\varphi\left(E[\varphi(F(e_1))e_1] - \tau_\varphi\right) \;. \qquad (3.9.11)$$
Proof: Suppose Model (3.9.1) holds. Using the above first order expressions we have
$$\nu_R \doteq E\left[\left(e + 1\alpha - \tau_\varphi H\varphi(F(e)) + (I-H)Z\gamma\right)'\left(X\beta + \tau_\varphi H\varphi(F(e)) + HZ\gamma\right)\right] \;.$$
Using $E[\varphi(F(e))] = 0$, $E[e] = E(e_1)1$, and the fact that $X$ is centered, this expression simplifies to
$$\nu_R \doteq \tau_\varphi E\left[\mathrm{tr}\,H\varphi(F(e))e'\right] - \tau_\varphi^2 E\left[\mathrm{tr}\,H\varphi(F(e))\varphi(F(e))'\right] \;.$$
Since the components of $e$ are independent, the result follows. The result is invariant to either of the models.

Although in general $\nu_R \ne 0$ for R-estimates, if, as the next corollary shows, optimal scores (see Examples 3.6.1 and 3.6.2) are used, then the expected departure from orthogonality is 0.

Corollary 3.9.2. Under the hypothesis of the last theorem, if optimal R-scores are used then $\nu_R = 0$.

Proof: Let $\varphi(u) = -c\,\dfrac{f'(F^{-1}(u))}{f(F^{-1}(u))}$, where $c$ is chosen so that $\int\varphi^2(u)\,du = 1$. Then
$$\tau_\varphi = \left\{\int\varphi(u)\left[-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))}\right]du\right\}^{-1} = c \;.$$
Some simplification and an integration by parts shows
$$\int\varphi(F(e))e\,dF(e) = -c\int e f'(e)\,de = c \;.$$
Naranjo et al. (1994) conducted a simulation study to investigate the above properties of rank-based and LS residuals over several small sample situations of null models (the true model was fitted) and misspecified models. Error distributions included the normal distribution and a contaminated normal distribution. Wilcoxon scores were used. The first part of the study concerned the amount of association between residuals and fitted values, where the association was measured by several correlation coefficients, including Pearson's r and Kendall's $\tau$. Because of the orthogonality between the LS residuals and fitted values, Pearson's r is always 0 for LS. On the other measures of association, however, the results for the Wilcoxon analysis and LS were about the same. In general, there was little association. The second part investigated measures of randomness in a residual plot, including a runs test and a quadrant count test (the quadrants were determined by the medians of the residuals and fitted values). The results were similar for the LS and Wilcoxon fits. Both showed validity over the null models and exhibited similar power over the misspecified models. In a power study over a quadratic misspecified model, the Wilcoxon analysis exhibited more power for long-tailed error distributions. In summary, the simulation study provided empirical evidence that residual analyses based on Wilcoxon fits are similar to LS-based residual analyses.

Table 3.9.1: Cloud Data
%I-8        0    1    2    3    4    5    6    7    8    0
Cloud Point 22.1 24.5 26.0 26.8 28.2 28.9 30.0 30.4 31.4 21.9
%I-8        2    4    6    8    10   0    3    6    9
Cloud Point 26.1 28.5 30.3 31.5 33.1 22.8 27.3 29.8 31.8
There are other useful residual plots. Two that we briefly discuss are q-q plots and added variable plots. As with standard residual plots, the internal R-studentized residuals (see Section 3.9.2) can be used in place of the residuals. Since the R-estimates of $\beta$ are consistent, the distribution of the residuals should resemble the distribution of the errors. This leads to consideration of another useful residual plot, a q-q plot. In this plot, the quantiles of the target distribution form the horizontal coordinates while the sample quantiles (ordered residuals) form the vertical coordinates. Linearity of this plot indicates the appropriateness of the target distribution as the true model distribution; see Exercise 3.16.24. McKean and Sievers (1989) discuss how to use these plots adaptively to select appropriate rank scores. In the next example, we use them to examine how well the R-fit fits the bulk of the data and to highlight outliers.

For the added variable plot, let $\hat e_R$ denote the residuals from the R-fit of the model $Y = 1\alpha + X\beta + e$. In this case, $Z$ is a known vector and we wish to decide whether or not to add it to the regression model. For the added variable plot, we regress $Z$ on $X$; we denote the residuals from this fit by $\hat e(Z\mid X) = (I-H)Z$. The added variable plot consists of the scatter plot of the residuals $\hat e_R$ versus $\hat e(Z\mid X)$. Under model misspecification, $\gamma \ne 0$, from expression (3.9.7) the residuals $\hat e_R$ are also a function of $(I-H)Z$. Hence, the plot can be quite powerful in determining the potential of $Z$ as a predictor.
Example 3.9.1. Cloud Data.

The data for this example can be found in Table 3.9.1. It is taken from an exercise on p. 162 of Draper and Smith (1966). The dependent variable is the cloud point of a liquid, a measure of degree of crystallization in a stock. The independent variable is the percentage of I-8 in the base stock. The subsequent R-fits for this data set were all based on Wilcoxon scores with the intercept estimate $\hat\alpha_S$, the median of the residuals.

Panel A of Figure 3.9.1 displays the residual plot (R-residuals versus R-fitted values) of the R-fit of the simple linear model. The curvature in the plot indicates that this model is a poor choice and that a higher degree polynomial model would be more appropriate. Panel B of Figure 3.9.1 displays the residual plot from the R-fit of a quadratic model. Some curvature is still present in the plot. A cubic polynomial was fitted next. Its R-residual plot, found in Panel C of Figure 3.9.1, is much more of a random scatter than the first two plots. On the basis of residual plots the cubic polynomial is an adequate model. Least squares residual plots would also lead to a third degree polynomial.
Figure 3.9.1: Panels A through C are the residual plots of the Wilcoxon fits of the linear, quadratic and cubic models, respectively, for the Cloud Data (Wilcoxon fit versus Wilcoxon residuals). Panel D is the q-q plot based on the Wilcoxon fit of the cubic model (normal quantiles versus Wilcoxon residuals). [Figure not reproduced.]
In the R-residual plot of the cubic model, several points appear to be outlying from the bulk of the data. These points are also apparent in Panel D of Figure 3.9.1, which displays the q-q plot of the R-residuals. Based on these plots, the R-regression appears to have fit the bulk of the data well. The q-q plot suggests that the underlying error distribution has slightly heavier tails than the normal distribution. A scale would be helpful in interpreting these residual plots, as discussed in the next section. Table 3.9.2 displays the estimated coefficients along with their standard errors. The Wilcoxon and least squares fits are practically the same.
Table 3.9.2: Wilcoxon and LS estimates of the regression coefficients for the Cloud Data. Standard errors are in parentheses.
Method          Intercept    Linear      Quadratic   Cubic        Scale
Wilcoxon        22.35 (.18)  2.24 (.17)  -.23 (.04)  .01 (.003)   $\hat\tau_\varphi = .307$
Least Squares   22.31 (.15)  2.22 (.15)  -.22 (.04)  .01 (.003)   $\hat\sigma = .281$

Example 3.9.2. Potency Data, Example 3.3.3 continued.
This example was discussed in Section 3.3. Recall that the data were the result of an experiment concerning the potency of drug compounds manufactured under different levels of 4 factors and one covariate. Here we want to discuss a residual analysis of the rank-based fits of the two models that were fit in Example 3.3.3.

First consider Model (3.3.1) without the quadratic terms, i.e., without the parameters $\beta_{11}$, $\beta_{12}$ and $\beta_{13}$. The residuals used are the internal R-studentized residuals defined in the next section; see (3.9.31). They provide a convenient scale for detecting outliers. The curvature in the Wilcoxon-residual plot of this model, Panel A of Figure 3.9.2, is quite apparent, indicating the need for quadratic terms in the model; whereas the LS residual plot, Panel C of Figure 3.9.2, does not exhibit this quadratic effect. As the R-residual plot indicates, there are outliers in the data, and these had an effect on the LS fit. Panels B and D display the residual plots when the squared terms of the factors are added to the model, i.e., when Model (3.3.1) was fit. This R-residual plot no longer exhibits the quadratic effect, indicating a better fitting model. Also, by examining the R-plots for both models, it is seen that the outlyingness of some of the outliers indicated in the plot for the first model was accounted for by the larger model.
3.9.2 Standardization of R-Residuals

In this section we want to obtain an expression for the variance of the R-residuals under the model (3.2.3). We will assume in this section that $\sigma^2$, the variance of the errors, is finite. As we show below, similar to the least squares residual, the variance of an R-residual depends both on its location in the x-space and the underlying variation of the errors. The internal studentized least squares residuals (residuals divided by their estimated standard errors) have proved useful in diagnostic procedures since they correct for both the model and the underlying variance. The internal R-studentized residuals defined below, (3.9.31), are similarly studentized R-residuals.

A diagnostic use of a studentized residual is in detecting outlying observations. The R-method provides a robust fit to the bulk of the data. Thus any case with a large studentized residual can be considered an outlier from this model. Even though a robust fit is resistant to outliers, it is still useful to detect such points; indeed, in practice these are often the points of most interest. The value of an internally studentized residual is in its simplicity: it tells how many estimated standard errors a residual is away from the center of the data.

The standardization depends on which estimate of the intercept is selected. We shall obtain the result for $\hat\alpha_S$, the median of the $\hat e_{R,i}$, and only state the results for the intercept estimate appropriate for symmetric errors.
Figure 3.9.2: Panels A and B are the Wilcoxon internal studentized residual plots for the models without and with, respectively, the three quadratic terms $\beta_{11}$, $\beta_{12}$ and $\beta_{13}$. Panels C and D are the analogous plots for the LS fit. [Figure not reproduced.]
Thus the residuals we seek to standardize are given by
$$\hat e_R = Y - \hat\alpha_S 1 - X\hat\beta_\varphi \;. \qquad (3.9.12)$$
We will obtain a first order approximation of $\mathrm{cov}(\hat e_R)$. Since the residuals are invariant to the regression coefficients, we can assume without loss of generality that the true parameters are zero. Recall that $h_{ci}$ is the ith diagonal element of $H = X(X'X)^{-1}X'$ and $h_i = n^{-1} + h_{ci}$.

Theorem 3.9.3. Under the conditions (E.1), (E.2), (D.1), (D.2) and (S.1) of Section 3.4, if the intercept estimate is $\hat\alpha_S$ then a first order representation of the variance of $\hat e_{R,i}$ is
$$\mathrm{Var}(\hat e_{R,i}) \doteq \sigma^2\left(1 - K_1 n^{-1} - K_2 h_{ci}\right) \;, \qquad (3.9.13)$$
where $K_1$ and $K_2$ are defined in expressions (3.9.18) and (3.9.19), respectively. In the case of a symmetric error distribution, when the estimate of the intercept is given by $\hat\alpha_\varphi^+$, discussed in Section 3.5.2, and (S.3) also holds,
$$\mathrm{Var}(\hat e_{R,i}) \doteq \sigma^2\left(1 - K_2 h_i\right) \;. \qquad (3.9.14)$$
Proof: Using the first order expression for $\hat\beta_\varphi$ given in (3.5.24) and the asymptotic representation of $\hat\alpha_S$ given by (3.5.23), we have
$$\hat e_R \doteq e - \tau_S\,\overline{\mathrm{sgn}(e)}\,1 - \tau_\varphi H\varphi(F(e)) \;, \qquad (3.9.15)$$
where $\overline{\mathrm{sgn}(e)} = \sum\mathrm{sgn}(e_i)/n$ and $\tau_S$ and $\tau_\varphi$ are defined in expressions (3.4.6) and (3.4.4), respectively. Because the median of $e_i$ is 0 and $\int\varphi(u)\,du = 0$, we have
$$E[\hat e_R] \doteq E(e_1)1 \;.$$
Hence,
$$\mathrm{cov}(\hat e_R) \doteq E\left[\left(e - \tau_S\overline{\mathrm{sgn}(e)}1 - \tau_\varphi H\varphi(F(e)) - E(e_1)1\right)\left(e - \tau_S\overline{\mathrm{sgn}(e)}1 - \tau_\varphi H\varphi(F(e)) - E(e_1)1\right)'\right] \;. \qquad (3.9.16)$$
Let $J = 11'/n$ denote the projection onto the space spanned by $1$. Since our design matrix is $[1\ X]$, the leverage of the ith case is $h_i = n^{-1} + h_{ci}$, where $h_{ci}$ is the ith diagonal entry of the projection matrix $H$. By expanding the above expression and using the independence of the components of $e$, we get after some simplification (see Exercise 3.16.25):
the components of e we get after some simplication (see Exercise 3.16.25):
Cov(e
R
)
.
=
2
I K
1
J K
2
H , (3.9.17)
where
K
1
=
_

_
2
_
2

S
1
_
, (3.9.18)
K
2
=
_

_
2
_
2

1
_
, (3.9.19)

S
= E[e
i
sgn(e
i
)] , (3.9.20)
= E[e
i
(F(e
i
))] , (3.9.21)

2
= Var(e
i
) = E((e
i
E(e
i
))
2
) . (3.9.22)
This yields the first result, (3.9.13). Next consider the case of a symmetric error distribution. If the estimate of the intercept is given by $\hat\alpha_\varphi^+$, discussed in Section 3.5.2, the result simplifies to (3.9.14).

From Cook and Weisberg (1982, p. 11), in the least squares case $\mathrm{Var}(\hat e_{LS,i}) = \sigma^2(1 - h_i)$, so that $K_1$ and $K_2$ are correction factors due to using the rank score function.
Based on the results in the theorem, an estimate of the variance-covariance matrix of $\hat e_R$ is
$$\widehat S = \hat\sigma^2\left(I - \widehat K_1 J - \widehat K_2 H_c\right) \;, \qquad (3.9.23)$$
where
$$\widehat K_1 = \frac{\hat\tau_S^2}{\hat\sigma^2}\left(2\frac{\hat\delta_S}{\hat\tau_S} - 1\right) \;, \qquad (3.9.24)$$
$$\widehat K_2 = \frac{\hat\tau_\varphi^2}{\hat\sigma^2}\left(2\frac{\hat\delta}{\hat\tau_\varphi} - 1\right) \;, \qquad (3.9.25)$$
$$\hat\delta_S = \frac{1}{n-p}\sum|\hat e_{R,i}| \;, \qquad (3.9.26)$$
and
$$\hat\delta = \frac{1}{n-p}D(\hat\beta_\varphi) \;.$$
The estimators $\hat\tau_S$ and $\hat\tau_\varphi$ are discussed in Section 3.7.1.
To complete the estimate of $\mathrm{Cov}(\hat e_R)$ we need to estimate $\sigma$. A robust estimate of it is given by the MAD,
$$\hat\sigma = 1.483\,\mathrm{med}_i\left|\hat e_{R,i} - \mathrm{med}_j\{\hat e_{R,j}\}\right| \;, \qquad (3.9.27)$$
which is a consistent estimate of $\sigma$ if the errors have a normal distribution. For the examples discussed here, we used this estimate in (3.9.23)-(3.9.25).
It follows from (3.9.23) that an estimate of $\mathrm{Var}(\hat e_{R,i})$ is
$$s_{R,i}^2 = \hat\sigma^2\left(1 - \widehat K_1\frac1n - \widehat K_2 h_{c,i}\right) \;, \qquad (3.9.28)$$
where $h_{ci} = x_i'(X'X)^{-1}x_i$.

Let $\hat\sigma_{LS}^2$ denote the usual least squares estimate of the variance. Least squares residuals are standardized by $s_{LS,i}$, where
$$s_{LS,i}^2 = \hat\sigma_{LS}^2(1 - h_i) \;; \qquad (3.9.29)$$
see page 11 of Cook and Weisberg (1982), and recall that $h_i = n^{-1} + x_i'(X'X)^{-1}x_i$.

If the error distribution is symmetric, (3.9.28) reduces to
$$s_{R,i}^2 = \hat\sigma^2\left(1 - \widehat K_2 h_i\right) \;. \qquad (3.9.30)$$
Internal R-Studentized Residual

We define the internal R-studentized residuals as
$$r_{R,i} = \frac{\hat e_{R,i}}{s_{R,i}} \;, \qquad i = 1,\ldots,n \;, \qquad (3.9.31)$$
where $s_{R,i}$ is the square root of either (3.9.28) or (3.9.30), depending on whether one assumes an asymmetric or symmetric error distribution, respectively.
It is interesting to compare expression (3.9.30) with the estimate of the variance of the least squares residual, $\hat\sigma_{LS}^2(1-h_i)$. The correction factor $\widehat K_2$ depends on the score function $\varphi(\cdot)$ and the underlying symmetric error distribution. If, for example, the error distribution is normal and if we use normal scores, then $\widehat K_2$ converges in probability to 1; see Exercise 3.16.26. In general, however, we will not wish to specify the error distribution, and then $\widehat K_2$ provides a natural adjustment.

A simple benchmark is useful in declaring whether or not a case is an outlier. We are certainly not advocating eliminating such cases, but flagging them as potential outliers and targeting them for further study. As we discussed in the last section, the distribution of the R-residuals should resemble the true distribution of the errors. Hence a single rule suitable for all cases is not apparent. In general, unless the residuals appear to be from a highly skewed distribution, a simple rule is to declare a case to be a potential outlier if its residual exceeds two standard errors in absolute value; i.e., $|r_{R,i}| > 2$.
The matrix $\widehat S$, (3.9.23), is an estimate of a first order approximation of $\mathrm{cov}(\hat e_R)$. It is not necessarily positive semi-definite and we have not constrained it to be so. In practice this has not proved troublesome, since only occasionally have we encountered negative estimates of the variance of the residuals. For instance, the R-fit for the cloud data resulted in one case with a negative variance. Presently, we replace (3.9.28) by $\hat\sigma\sqrt{1-h_i}$, where $\hat\sigma$ is the MAD estimate (3.9.27), in these situations. A computational sketch of these studentized residuals is given below.
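The following sketch assembles the internal R-studentized residuals (3.9.31) for the asymmetric-error case from the pieces above; the inputs are the centered design matrix, the R-residuals, the scale estimates $\hat\tau_\varphi$ and $\hat\tau_S$ of Section 3.7.1, and the minimized dispersion $D(\hat\beta_\varphi)$. The function name is ours, and the negative-variance fallback is the one described in the preceding paragraph.

```python
import numpy as np

def studentized_resid(X, resid, tau_hat, tau_s_hat, dispersion_val):
    """Internal R-studentized residuals, (3.9.31), via (3.9.24)-(3.9.28).
    X: centered design matrix; resid: R-residuals; tau_hat, tau_s_hat:
    estimates of tau_phi and tau_S; dispersion_val: D(beta-hat)."""
    n, p = X.shape
    sigma = 1.483 * np.median(np.abs(resid - np.median(resid)))   # MAD (3.9.27)
    delta_s = np.sum(np.abs(resid)) / (n - p)                     # (3.9.26)
    delta = dispersion_val / (n - p)
    K1 = (tau_s_hat / sigma) ** 2 * (2 * delta_s / tau_s_hat - 1)  # (3.9.24)
    K2 = (tau_hat / sigma) ** 2 * (2 * delta / tau_hat - 1)        # (3.9.25)
    hc = np.sum(X @ np.linalg.inv(X.T @ X) * X, axis=1)            # diag of H
    s2 = sigma ** 2 * (1 - K1 / n - K2 * hc)                       # (3.9.28)
    # Fallback when a variance estimate goes negative: sigma^2 (1 - h_i).
    s2 = np.where(s2 > 0, s2, sigma ** 2 * (1 - 1.0 / n - hc))
    return resid / np.sqrt(s2)
```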
We have already illustrated the internal R-studentized residuals for the potency data of Example 3.9.2, discussed in the last section. We use them next on the Cloud data.
Example 3.9.3. Cloud Data, Example 3.9.1, continued.

Returning to the cloud data example, Panel A of Figure 3.9.3 displays a residual plot of the internal Wilcoxon studentized residuals versus the fitted values. It is similar to Panel C of Figure 3.9.1, but it has a meaningful scale on the vertical axis. The residuals for three of the cases (4, 10, and 16) are over two standard errors from the center of the data; these should be flagged as potential outliers. Panel B of Figure 3.9.3 displays the normal q-q plot of the internal Wilcoxon studentized residuals. The underlying error structure appears to have heavier tails than the normal distribution.

As with their least squares counterparts, we think the chief benefit of the internal R-studentized residuals is their usefulness in diagnostic plots and in flagging potential outliers.
External R-Studentized Residual

Another statistic that is useful for flagging outliers is a robust version of the external t statistic. The LS version of this diagnostic is discussed in detail in Cook and Weisberg (1982). A robust version of this diagnostic is discussed in McKean, Sheather and Hettmansperger (1991). We briefly describe this latter approach.

Suppose we want to examine the ith case to see if it is an outlier. Consider the mean shift model given by
$$Y = X_1 b + \theta_i d_i + e \;, \qquad (3.9.32)$$
Figure 3.9.3: Internal Wilcoxon studentized residual plot, Panel A (Wilcoxon cubic fit versus Wilcoxon studentized residuals), and corresponding normal q-q plot, Panel B, for the Cloud Data. [Figure not reproduced.]
where $X_1$ is the augmented matrix $[1\ X]$ and $d_i$ is an $n\times 1$ vector of zeroes except for its ith component, which is 1. A formal hypothesis that the ith case is an outlier is given by
$$H_0\colon \theta_i = 0 \quad\text{versus}\quad H_A\colon \theta_i \ne 0 \;. \qquad (3.9.33)$$
One way of testing these hypotheses is to use the test procedures described in Section 3.6. This requires fitting Model (3.9.32) for each value of $i$. A second approach is described next.

Note that we can rewrite Model (3.9.32) equivalently as
$$Y = X_1 b^* + \theta_i d_i^* + e \;, \qquad (3.9.34)$$
where $d_i^* = (I - H_1)d_i$, $H_1$ is the projection matrix onto the column space of $X_1$, and $b^* = b + H_1 d_i\theta_i$; see Exercise 3.16.27. Because of the orthogonality between $X_1$ and $d_i^*$, the least squares estimate of $\theta_i$ can be obtained by a simple linear regression of $Y$ on $d_i^*$, or equivalently of $\hat e_{LS}$ on $d_i^*$. For the rank-based estimate, the asymptotic distribution theory of the regression estimates suggests a similar approach. Accordingly, let $\hat\theta_{R,i}$ denote the R-estimate when $\hat e_R$ is regressed on $d_i^*$. This is a simple regression and the estimate can be obtained by a linear search algorithm; see Section 3.7.2. As Exercise 3.16.29 shows, this estimate is the inversion of an aligned rank statistic to test the hypotheses (3.9.33). Next let $\hat\tau_{\varphi,i}$ denote the estimate of $\tau_\varphi$ produced from this regression. We define the external R-studentized residual to be the statistic
$$t_R(i) = \frac{\hat\theta_{R,i}}{\hat\tau_{\varphi,i}/\sqrt{1 - h_{1,i}}} \;, \qquad (3.9.35)$$
where $h_{1,i}$ is the ith diagonal entry of $H_1$. Note that we have standardized $\hat\theta_{R,i}$ by its asymptotic standard error.
A final remark on these external t-statistics is in order. In the mean shift model, (3.9.32), the leverage value of the ith case is 1. Hence, the design assumption (D.2), (3.4.7), is not true. This invalidates both the LS and the rank-based asymptotic theory for the external t statistics. In light of this, we do not propose the statistic $t_R(i)$ as a test statistic for the hypotheses (3.9.33) but as a diagnostic for flagging potential outliers. As a benchmark, we suggest the value 2.
3.9.3 Measures of Influential Cases

Since R-estimates have bounded influence in the y-space but not in the x-space, the R-fit may be affected by outlying points in the x-space. We next introduce a statistic which measures the influence of the ith case on the robust fit. We work with the usual model (3.2.3). First, we need the first order representation of $\widehat Y_R$. Similar to the proof of Theorem 3.9.3, which obtained the first order representation of the residuals, (3.9.15), we have
$$\widehat Y_R \doteq 1\alpha + X\beta + \tau_S\,\overline{\mathrm{sgn}(e)}\,1 + \tau_\varphi H\varphi(F(e)) \;; \qquad (3.9.36)$$
see Exercise 3.16.28.
Let $\widehat Y_R(i)$ denote the R-predicted value of $Y_i$ when the ith case is deleted from the model. We shall call this model the delete-i model. Then the change in the robust fit due to the ith case is
$$\mathrm{RDFFIT}_i = \widehat Y_{R,i} - \widehat Y_R(i) \;. \qquad (3.9.37)$$
$\mathrm{RDFFIT}_i$ is our measure of the influence of case $i$. Computation of this statistic is discussed later. Clearly, in order to be useful, $\mathrm{RDFFIT}_i$ must be assessed relative to some scale.

RDFFIT is a change in the fitted value; hence, a natural scale for assessing RDFFIT is a fitted value scale. Using $\hat\alpha_S$ as our estimate of the intercept, it follows from the expression (3.9.36) that
$$\mathrm{Var}(\widehat Y_{R,i}) \doteq n^{-1}\tau_S^2 + h_{c,i}\tau_\varphi^2 \;. \qquad (3.9.38)$$
Hence, based on a fitted-scale assessment, we standardize RDFFIT by an estimate of the square root of this quantity.
For least squares diagnostics there is some discussion on whether to use the original model or the model with the ith point deleted for the estimation of scale. Cook and Weisberg (1982) advocate the original model; in this case the scale estimate is the same for all $n$ cases, which allows casewise comparisons involving the diagnostic. Belsley, Kuh, and Welsch (1980), however, advocate scale estimation based on the delete-i model. Note that both standardizations correct for the model and the underlying variation of the errors.

Let $\hat\tau_S(i)$ and $\hat\tau_\varphi(i)$ denote the estimates of $\tau_S$ and $\tau_\varphi$ for the delete-i model, as discussed above. Then our diagnostic, in which $\mathrm{RDFFIT}_i$ is assessed relative to a fitted value scale with estimates of scale based on the delete-i model, is given by
$$\mathrm{RDFFITS}_i = \frac{\mathrm{RDFFIT}_i}{\left(n^{-1}\hat\tau_S^2(i) + h_{c,i}\hat\tau_\varphi^2(i)\right)^{1/2}} \;. \qquad (3.9.39)$$
This is an R-analogue of the least squares diagnostic $\mathrm{DFFITS}_i$ proposed by Belsley et al. (1980).
For standardization based on the original model, replace $\hat\tau_S(i)$ and $\hat\tau_\varphi(i)$ by $\hat\tau_S$ and $\hat\tau_\varphi$, respectively. We shall define
$$\mathrm{RDCOOK}_i = \frac{\mathrm{RDFFIT}_i}{\left(n^{-1}\hat\tau_S^2 + h_{c,i}\hat\tau_\varphi^2\right)^{1/2}} \;. \qquad (3.9.40)$$
If $\hat\alpha_R^+$ is used as the estimate of the intercept then, provided the errors have a symmetric distribution, the R-diagnostics are obtained by replacing $\mathrm{Var}(\widehat Y_{R,i})$ with $\mathrm{Var}(\widehat Y_{R,i}) = h_i\tau_\varphi^2$; see Exercise 3.16.30 for details. This results in the diagnostics
$$\mathrm{RDFFITS}_{\mathrm{symm},i} = \frac{\mathrm{RDFFIT}_i}{\sqrt{h_i}\,\hat\tau_\varphi(i)} \;, \qquad (3.9.41)$$
and
$$\mathrm{RDCOOK}_{\mathrm{symm},i} = \frac{\mathrm{RDFFIT}_i}{\sqrt{h_i}\,\hat\tau_\varphi} \;. \qquad (3.9.42)$$
This eliminates the need to estimate $\tau_S$.
There is also disagreement on what benchmarks to use for flagging points of potential influence. As Belsley et al. (1980) discuss in some detail, DFFITS is inversely influenced by sample size. They advocate a size-adjusted benchmark of $2\sqrt{p/n}$ for DFFITS. Cook and Weisberg (1982) suggest a more conservative value, which results in $\sqrt{p}$. We shall use both benchmarks in the examples. We realize these diagnostics only flag potentially influential points that require investigation. Similar to the two references cited above, we would never recommend indiscriminately deleting observations solely because their diagnostic values exceed the benchmark. Rather, these are potential points of influence which should be investigated.
The diagnostics described above are formed with the leverage values based on the projection matrix. These leverage values are nonrobust (see Rousseeuw and van Zomeren, 1990). For data sets with clusters of outliers in factor space, robust leverage values can be formulated in terms of high breakdown estimates of the center and scatter matrix in factor space. One such choice would be the MVE, minimum volume ellipsoid, proposed by Rousseeuw and van Zomeren (1990). Other estimates could be based on the robust singular value decomposition discussed by Ammann (1993); see, also, Simpson, Ruppert and Carroll (1992). We recommend computing $\hat{Y}_R(i)$ with a one or two step R-estimate based on the residuals from the original model; see Section 3.7.2. Each step involves a single ordering of the residuals, which are nearly in order (in fact, on the first step they are in order), and a single projection onto the range of $\mathbf{X}$ (easily obtained by using the routines in LINPACK as discussed in Section 3.7.2).
The diagnostic $\mathrm{RDFFITS}_i$ measures the change in the fitted values when the ith case is deleted. Similarly, we can also measure changes in the estimates of the regression coefficients. For the LS analysis, this is the diagnostic DFBETAS proposed by Belsley, Kuh and Welsch (1980). The corresponding diagnostics for the rank-based analysis are:
$$\mathrm{RDBETAS}_{ij} = \frac{\hat{\beta}_{R,j} - \hat{\beta}_{R,j}(i)}{\hat{\tau}_\varphi(i)\sqrt{(\mathbf{X}'\mathbf{X})^{-1}_{jj}}} \; , \quad (3.9.43)$$
where $\hat{\boldsymbol{\beta}}_R(i)$ denotes the R-estimate of $\boldsymbol{\beta}$ in the delete i model. A similar statistic can be constructed for the intercept parameter. Furthermore, an RDCOOK version can also be constructed as above. These diagnostics are often used when $|\mathrm{RDFFITS}_i|$ is large. In such cases, it may be of interest to know which components of the regression coefficients are more influential than other components. The benchmark suggested by Belsley, Kuh and Welsch (1980) is $2/\sqrt{n}$.
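A direct, if brute force, way to compute (3.9.43) is again by delete i refits; the following R sketch parallels the one above, with tauhat.i once more a placeholder for the delete i scale estimate and the model matrix taken to include the intercept.

rdbetas <- function(x, y) {
  x <- as.matrix(x); n <- nrow(x)
  xtxinv <- solve(crossprod(cbind(1, x)))   # (X'X)^{-1}, intercept column included
  full <- wwest(x, y, bij = "WIL", print.tbl = F)$tmp1$coef
  out <- matrix(0, n, length(full))
  for (i in 1:n) {
    del <- wwest(x[-i, ], y[-i], bij = "WIL", print.tbl = F)$tmp1$coef
    tauhat.i <- 1                           # placeholder for tau_phi(i)
    out[i, ] <- (full - del) / (tauhat.i * sqrt(diag(xtxinv)))
  }
  out    # flag entries exceeding the benchmark 2/sqrt(n) in absolute value
}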
Example 3.9.4. Free Fatty Acid (FFA) Data.
The data for this example can be found in Morrison (1983, p. 64) and, for convenience, in Table 3.9.3. The response is the level of free fatty acid of prepubescent boys, while the independent variables are age, weight, and skin fold thickness. The sample size is 41. Panel A of Figure 3.9.4 depicts the residual plot based on the internal LS studentized residuals. From this plot there appear to be several outliers. Certainly the cases 12, 22, 26 and 9 are outlying, and perhaps the cases 8, 10 and 38. In fact, the first four of these cases probably control the least squares fit, obscuring cases 8, 10 and 38.
As our first R-fit of this data, we used the Wilcoxon scores with the intercept estimated by the median of the residuals, $\hat{\alpha}_S$. Note that all seven cases stand out in the Wilcoxon residual plot based on the internal R-studentized residuals, (3.9.31); see Panel B of Figure 3.9.4. This is further confirmed by the fits displayed in Table 3.9.4, where the LS fit with these seven cases deleted is very similar to the Wilcoxon fit using all the cases. The q-q plot of the internal R-studentized residuals, Panel C of Figure 3.9.4, also highlights these outlying cases. Similar to the residual plot, the q-q plot suggests that the underlying error distribution is positively skewed with a light left tail. The estimates of the regression coefficients and their standard errors are displayed in Table 3.9.4. Due to the skewness in the data, it is not surprising that the LS and R estimates of the intercept are different, since the former estimates the mean of the residuals while the latter estimates the median of the residuals.
Table 3.9.3: Free Fatty Acid (FFA) Data
Case  Age (months)  Weight (lbs)  Skin Fold Thickness  Free Fatty Acid
1 105 67 0.96 0.759
2 107 70 0.52 0.274
3 100 54 0.62 0.685
4 103 60 0.76 0.526
5 97 61 1.00 0.859
6 101 62 0.74 0.652
7 99 71 0.76 0.349
8 101 48 0.62 1.120
9 107 59 0.56 1.059
10 100 51 0.44 1.035
11 100 80 0.74 0.531
12 101 57 0.58 1.333
13 104 58 1.10 0.674
14 99 58 0.72 0.686
15 101 54 0.72 0.789
16 110 66 0.54 0.641
17 109 59 0.68 0.641
18 109 64 0.44 0.355
19 110 76 0.52 0.256
20 111 50 0.60 0.627
21 112 64 0.70 0.444
22 117 73 0.96 1.016
23 109 68 0.82 0.582
24 112 67 0.52 0.325
25 111 81 1.14 0.368
26 115 74 0.82 0.818
27 115 63 0.56 0.384
28 125 74 0.72 0.509
29 131 70 0.58 0.634
30 121 63 0.90 0.526
31 123 67 0.66 0.337
32 125 82 0.94 0.307
33 122 62 0.62 0.748
34 124 67 0.74 0.401
35 122 60 0.60 0.451
36 129 98 1.86 0.344
37 128 76 0.82 0.545
38 127 63 0.26 0.781
39 140 79 0.74 0.501
40 141 60 0.62 0.524
41 139 81 0.78 0.318
Figure 3.9.4: Panel A, internal LS studentized residual plot of the original Free Fatty Acid data; Panel B, internal Wilcoxon studentized residual plot of the original data; Panel C, internal Wilcoxon studentized normal q-q plot of the original data; and Panel D, internal R-studentized residual plot of the original data based on the score function $\varphi_{.5}(u)$.
Table 3.9.5 displays the values of the R and LS diagnostics for the cases of interest. For the seven cases cited above, the internal Wilcoxon studentized residuals, (3.9.31), definitely flag three of the cases, and for two of the others the value exceeds 1.70; see Panel B of Figure 3.9.4. As RDFFITS, (3.9.39), indicates, none of these seven cases seems to have an effect on the Wilcoxon fit (the liberal benchmark is .62), whereas the 12th case appears to have an effect on the least squares fit. RDFFITS exceeded the benchmark only for case 2, for which it had the value -.64. Case 36, with $h_{36} = .53$, has high leverage, but it did not have an adverse effect on either the Wilcoxon fit or the LS fit. This is true too of cases 11 and 40, which were the only other cases whose leverage values exceeded the benchmark of $2p/n$.
Table 3.9.4: Estimates of the regression coefficients (first cell entry) and their standard errors (second cell entry) for the Free Fatty Acid Data.

                  Original Data                                            log y
Par.     LS           Wilcoxon      LS (w/o 7 pts.)  R-Bent Score   LS            Wilcoxon
beta_0   1.70  .33    1.49  .27     1.24  .21        1.37  .21      1.12  .52     .99   .54
beta_1   -.002 .003   -.001 .003    -.001 .002       -.001 .002     -.001 .005    .000  .005
beta_2   -.015 .005   -.015 .004    -.013 .003       -.015 .003     -.029 .008    -.031 .008
beta_3   .205  .167   .274  .137    .285  .103       .355  .104     .444  .263    .555  .271
Scale    .215         .178          .126             .134           .341          .350
Table 3.9.5: Regression Diagnostics for cases of interest for the Fatty Acid Data.

                  LS               Wilcoxon          Bent Score
Case    h_i     Int. t   DFFIT   Int. t   DFFIT    Int. t   DFFIT
8 0.12 1.16 0.43 1.57 0.44 1.73 0.31
9 0.04 1.74 0.38 2.14 0.13 2.37 0.26
10 0.09 1.12 0.36 1.59 0.53 1.84 0.30
12 0.06 2.84 0.79 3.30 0.33 3.59 0.30
22 0.05 2.26 0.53 2.51 -0.06 2.55 0.11
26 0.04 1.51 0.32 1.79 0.20 1.86 0.10
38 0.15 1.27 0.54 1.70 0.53 1.93 0.19
2 0.10 -1.19 -0.40 -0.17 -0.64 -0.75 -0.48
7 0.11 -1.07 -0.37 -0.75 -0.44 -0.74 -0.64
11 0.22 0.56 0.30 0.97 0.31 1.03 0.07
40 0.25 -0.51 -0.29 -0.31 -0.21 -0.35 0.06
36 0.53 0.18 0.19 -0.04 -0.27 -0.66 -0.34
As we noted above, both the residual and the q-q plots indicate that the distribution of the residuals is positively skewed. This suggests a transformation, as discussed below, or perhaps a prudent choice of a score function which would be more appropriate for skewed error distributions than the Wilcoxon scores. The score function $\varphi_{.5}(u)$, (2.5.34), is more suited to positively skewed errors. Panel D of Figure 3.9.4 displays the internal R-studentized residuals based on the R-fit using this bent score function. From this plot and the tabled diagnostics, the outliers stand out more from this fit than from the previous two fits. The RDFFITS values for this fit are even smaller than those of the Wilcoxon fit, which is expected since this score function protects on the right. While case 7 has a little influence on the bent score fit, no other cases have RDFFITS exceeding the benchmark.
Table 3.9.4 displays the estimates of the betas for the three fits along with their standard errors. At the .05 level, coefficients 2 and 3 are significant for the robust fits, while only coefficient 2 is significant for the LS fit. The robust fits appear to be an improvement over LS. Of the two robust fits, the bent score fit appears to be more precise than the Wilcoxon fit.
A practical transformation of the response variable suggested by the Box-Cox transformation is the log. Panel A of Figure 3.9.5 shows the internal R-studentized residual plot based on the Wilcoxon fit of the log transformed response. Note that 5 of the cases still stand out in the plot. The residuals from the transformed response still appear to be skewed, as is evident in the q-q plot, Panel B of Figure 3.9.5. From Table 3.9.4, the Wilcoxon fit seems slightly more precise in terms of standard errors.
Figure 3.9.5: Panel A, internal R-studentized residual plot of the log transformed Free Fatty Acid data; Panel B, corresponding normal q-q plot.
3.10 Survival Analysis
In this section we discuss scores which are appropriate for lifetime distributions when the log of lifetime follows a linear model. These are called accelerated failure time models; see Kalbfleisch and Prentice (1980). Let T denote the lifetime of a subject and let $\mathbf{x}$ be a $p \times 1$ vector of covariates associated with T. Let $h(t;\mathbf{x})$ denote the hazard function of T at time t; see Section 2.8. Suppose T follows a log linear model; that is, $Y = \log T$ follows the linear model
$$Y = \alpha + \mathbf{x}'\boldsymbol{\beta} + e \; , \quad (3.10.1)$$
where e is a random error with density f. Exponentiating both sides, we get $T = \exp\{\alpha + \mathbf{x}'\boldsymbol{\beta}\}T_0$, where $T_0 = \exp\{e\}$. Let $h_0(t)$ denote the hazard function of $T_0$. This is called the baseline hazard function. Then the hazard function of T is given by
$$h(t;\mathbf{x}) = h_0\!\left(t\exp\{-(\alpha + \mathbf{x}'\boldsymbol{\beta})\}\right)\exp\{-(\alpha + \mathbf{x}'\boldsymbol{\beta})\} \; . \quad (3.10.2)$$
Thus the covariate $\mathbf{x}$ accelerates or decelerates the failure time of T; hence the name accelerated failure time for these models.
An important subclass of the accelerated failure time models consists of those for which $T_0$ follows a Weibull distribution, i.e.,
$$f_{T_0}(t) = \lambda\gamma(\lambda t)^{\gamma-1}\exp\{-(\lambda t)^{\gamma}\} \; , \; t > 0 \; , \quad (3.10.3)$$
where λ and γ are unknown parameters. In this case it follows that the hazard function of T is proportional to the baseline hazard function, with the covariate acting through the factor of proportionality; i.e.,
$$h(t;\mathbf{x}) = h_0(t)\exp\{-\gamma(\alpha + \mathbf{x}'\boldsymbol{\beta})\} \; . \quad (3.10.4)$$
Hence these models are called proportional hazards models. Kalbfleisch and Prentice (1980) show that the only proportional hazards models which are also accelerated failure time models are those for which $T_0$ has the Weibull density. We can write the random error $e = \log T_0$ as $e = \xi + \gamma^{-1}W_0$, where $\xi = -\log\lambda$ and $W_0$ has the extreme value distribution discussed in Section 2.8 of Chapter 2. Thus the optimal rank scores for these log-linear models are generated by the function
$$\varphi_f(u) = -1 - \log(1 - u) \; ; \quad (3.10.5)$$
see (2.8.8) of Chapter 2.
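For computation, (3.10.5) is trivial to code; a minimal R sketch:

# Optimal scores (3.10.5) for extreme value errors (Weibull lifetimes).
phi.extreme <- function(u) -1 - log(1 - u)
# The n standardized score constants a(i) = phi(i/(n+1)):
a <- function(n) phi.extreme((1:n) / (n + 1))

Note that these scores are unbounded as u approaches 1, which foreshadows the lack of robustness of the REXT estimates noted in Example 3.10.2 below.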
Next we consider suitable score functions for the general failure time models, (3.10.1). As noted in Kalbfleisch and Prentice (1980), many of the error distributions currently used for these models are contained in the log-F class. In this class, $e = \log T$ is distributed, down to an unknown scale parameter, as the log of an F random variable with $2m_1$ and $2m_2$ degrees of freedom. In this case we shall say that e has a $GF(2m_1, 2m_2)$ distribution. The distribution of T is Weibull if $(m_1, m_2) \to (1, \infty)$, log-normal if $(m_1, m_2) \to (\infty, \infty)$, and generalized gamma if $(m_1, m_2) \to (\infty, 1)$; see Kalbfleisch and Prentice. If $(m_1, m_2) = (1, 1)$ then e has a logistic distribution. In general this class contains a variety of shapes. The distributions are symmetric for $m_1 = m_2$, positively skewed for $m_1 > m_2$, and negatively skewed for $m_1 < m_2$. While Kalbfleisch and Prentice discuss this class for $m_1, m_2 \geq 1$, we will extend the class to $m_1, m_2 > 0$ in order to include heavier tailed error distributions.
For random errors with distribution $GF(2m_1, 2m_2)$, the optimal rank score function is given by
$$\varphi_{m_1,m_2}(u) = \frac{m_1 m_2\left(\exp\{F^{-1}(u)\} - 1\right)}{m_2 + m_1\exp\{F^{-1}(u)\}} \; , \quad (3.10.6)$$
where F is the cdf of the $GF(2m_1, 2m_2)$ distribution; see Exercise 3.16.31. We shall label these scores as $GF(2m_1, 2m_2)$ scores. It follows that the scores are strictly increasing and bounded below by $-m_1$ and above by $m_2$. Hence an R-analysis based on these scores will have bounded influence in the Y-space.
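Since e is distributed as the log of an $F(2m_1, 2m_2)$ variate, $\exp\{F^{-1}(u)\}$ is simply the F quantile function, so (3.10.6) is easy to evaluate numerically. A minimal R sketch (the scale parameter is ignored, as it does not affect rank scores):

# GF(2m1, 2m2) optimal scores, (3.10.6): exp{F^{-1}(u)} = qf(u, 2m1, 2m2),
# because the GF variate is the log of an F(2m1, 2m2) variate.
phi.gf <- function(u, m1, m2) {
  q <- qf(u, 2 * m1, 2 * m2)
  m1 * m2 * (q - 1) / (m2 + m1 * q)
}
phi.gf(c(.01, .5, .99), 1, 1)   # m1 = m2 = 1 gives phi(u) = 2u - 1

At $m_1 = m_2 = 1$ the function reduces to $\varphi(u) = 2u - 1$, the (unstandardized) linear Wilcoxon score, consistent with the discussion of Figure 3.10.1 below.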
Figure 3.10.1: Schematic of the four classes, C1 - C4, of the $GF(2m_1, 2m_2)$ scores. [The schematic plots $m_2$ against $m_1$, each axis running from 0 to 2; the quadrants about the center (1, 1) are labeled C3 (negatively skewed) and C2 (light tailed) in the upper half and C4 (heavy tailed) and C1 (positively skewed) in the lower half.]
This class of scores can be conveniently divided into the four subclasses $C_1$ through $C_4$, which are represented by the four quadrants with center (1, 1) as depicted in Figure 3.10.1. The point (1, 1) in this figure corresponds to the linear, Wilcoxon scores. These scores are optimal for the logistic distribution, GF(2, 2), and form a natural center point for the scores. One score function from each class, with the density for which it is optimal, is plotted in Figure 3.10.2. These plots are generally representative. The score functions in $C_2$ change from concave to convex as u increases and, hence, are suitable for light tailed error structure, while those in $C_4$ pass from convex to concave and are suitable for heavy tailed error structure. The score functions in $C_3$ are always convex and are suitable for negatively skewed error structure with heavy left tails and moderate right tails, while those in $C_1$ are suitable for positively skewed errors with heavy right tails and moderate left tails.
Figure 3.10.2: Column A contains plots of the densities: the Class C1 distribution GF(3, .8);
the Class C2 distribution GF(4, 8); the Class C3 distribution GF(.5, 6); and the Class C4
distribution GF(1, .6). Column B contains the corresponding optimal score functions.
Figure 3.10.2 shows how a score function corresponds to its density. If the density has a heavy right tail, then the score function will tend to be flat on the right side; hence, the resulting estimate will be less sensitive to outliers on the right. If instead the density has a light right tail, then the scores will tend to rise on the right in order to accentuate points on the right. The plots in Figure 3.10.2 suggest approximating these scores by scores consisting of two or three line segments, such as the bent score function, (2.5.34).
Generally the $GF(2m_1, 2m_2)$ scores cannot be obtained in closed form due to $F^{-1}$, but programs such as Minitab and S-PLUS can easily produce them. There are two interesting subclasses for which closed forms are possible. These are the subclasses $GF(2, 2m_2)$ and $GF(2m_1, 2)$. As Exercise 3.16.32 shows, the random variables for these classes are the logs of variates having Pareto distributions. For the subclass $GF(2, 2m_2)$ the score generating function is
$$\varphi_{m_2}(u) = \left(\frac{m_2 + 2}{m_2}\right)^{1/2}\left\{m_2 - (m_2 + 1)(1 - u)^{1/m_2}\right\} \; . \quad (3.10.7)$$
These are the powers of rank scores discussed by Mielke (1972) in the context of two sample problems.
It is interesting to note that the asymptotic relative efficiency of the Wilcoxon scores to the optimal rank score function at the $GF(2m_1, 2m_2)$ distribution is given by
$$\mathrm{ARE} = \frac{12\,\Gamma^4(m_1 + m_2)\,\Gamma^2(2m_1)\,\Gamma^2(2m_2)\,(m_1 + m_2 + 1)}{\Gamma^4(m_1)\,\Gamma^4(m_2)\,\Gamma^2(2m_1 + 2m_2)\,m_1 m_2} \; ; \quad (3.10.8)$$
see Exercise 3.16.31. This efficiency can be arbitrarily small. For instance, in the subclass $GF(2, 2m_2)$ the efficiency reduces to
$$\mathrm{ARE} = \frac{3m_2(m_2 + 2)}{(2m_2 + 1)^2} \; , \quad (3.10.9)$$
which approaches 0 as $m_2 \to 0$ and $\frac{3}{4}$ as $m_2 \to \infty$. Thus, in the presence of severely skewed errors, the Wilcoxon scores can have arbitrarily low efficiency compared to a fully efficient R-estimate based on the optimal scores.
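Formula (3.10.8) is easy to evaluate on the log scale to avoid overflow in the Gamma functions; a small R sketch, which also checks the $GF(2, 2m_2)$ reduction (3.10.9):

# ARE (3.10.8) of Wilcoxon vs. optimal GF(2m1, 2m2) scores, via lgamma.
are.gf <- function(m1, m2) {
  exp(log(12) + 4 * lgamma(m1 + m2) + 2 * lgamma(2 * m1) + 2 * lgamma(2 * m2) +
      log(m1 + m2 + 1) - 4 * lgamma(m1) - 4 * lgamma(m2) -
      2 * lgamma(2 * m1 + 2 * m2) - log(m1 * m2))
}
are.gf(1, 1)                                            # logistic: ARE = 1
m2 <- 0.25
c(are.gf(1, m2), 3 * m2 * (m2 + 2) / (2 * m2 + 1)^2)    # agrees with (3.10.9)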
For a given problem, the choice of scores presents a problem. McKean and Sievers (1989) discuss several methods for score selection, one of which is illustrated in the next example. This method is adaptive in nature, with the adaptation depending on residuals from an initial fit. In practice, this can lead to overfitting. Its use, however, can lead to insight and may prove beneficial for fitting future data sets of the same type; see McKean et al. (1989) for such an application. Using XLISP-STAT (Tierney, 1990), Wang (1996) presents a graphical interface for methods of score selection.
Example 3.10.1. Insulating Fluid Data.
We consider a problem discussed in Nelson (1982, p. 227) and also discussed by Lawless (1982, p. 185). The data consist of breakdown times T of an electrical insulating fluid subject to seven different levels of voltage stress v. Panel A of Figure 3.10.3 displays a scatter plot of $Y = \log T$ versus $\log v$.
As a full model we consider a oneway layout, as discussed in Chapter 4, with the response variable $Y = \log T$ and with the seven voltage levels as treatments. The comparison boxplots, Panel B of Figure 3.10.3, are an appropriate display for this model. The one method for score selection that we briefly touch on here is based on q-q plots; see McKean and Sievers (1989). Using Wilcoxon scores, we obtained an initial fit of the oneway layout model as discussed in Chapter 4. Panel C of Figure 3.10.3 displays the q-q plot of the ordered residuals versus the logistic quantiles based on this fit. Although the left tail of the logistic distribution appears adequate, the right side of the plot indicates that distributions with lighter right tails might be more appropriate. This is confirmed by the near linearity of the GF(2, 10) quantiles versus the Wilcoxon residuals. After trying several R-fits using $GF(2m_1, 2m_2)$ scores with $m_1, m_2 \geq 1$, we decided that the q-q plot of the GF(2, 10) fit, Panel D of Figure 3.10.3, appeared to be most linear, and we used it to conduct the following R-analysis.
For the fit of the full model using the scores GF(2, 10), the minimum value of the dispersion function, D, is 103.298 and the estimate of $\tau_\varphi$ is 1.38. Note that this minimum value of D is the analogue of the pure sum of squared errors in a least squares analysis; hence, we will use the notation DPE = 103.298 for pure error dispersion. We first test the goodness of fit of a simple linear model. The reduced model in this case is a simple linear model. The alternative hypothesis is that the model is not linear but, other than this, it is not specified; hence, the full model is the oneway layout. Thus the hypotheses are
$$H_0: Y = \alpha + \beta\log v + e \;\; \text{versus} \;\; H_A: \text{the model is not linear.} \quad (3.10.10)$$
To test $H_0$, we fit the reduced model $Y = \alpha + \beta\log v + e$. The dispersion at the reduced model is 104.399. Since, as noted above, the dispersion at the full model is 103.298, the lack of fit is the reduction in dispersion RDLOF = 104.399 - 103.298 = 1.101. Therefore the value of the robust test statistic is $F_\varphi = .319$. There is no evidence on the basis of this test to contest a linear model.
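As a check on the arithmetic, the robust F statistic is the reduction in dispersion per degree of freedom divided by $\hat{\tau}_\varphi/2$; with seven treatments versus two linear-model parameters there are 5 degrees of freedom for lack of fit, and in R:

# Robust lack-of-fit test: F_phi = (RDLOF / df) / (tauhat / 2)
rdlof <- 104.399 - 103.298   # reduction in dispersion
tauhat <- 1.38               # GF(2, 10) estimate of tau_phi
(rdlof / 5) / (tauhat / 2)   # = 0.319, reproducing the reported value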
The GF(2, 10) fit of the simple linear model is $\hat{Y} = 64 - 17.67\log v$, which is graphed in Panel A of Figure 3.10.3. Under this linear model, the estimate of the scale parameter $\tau_\varphi$ is 1.57. From this we compute a 95% confidence interval for the slope parameter β to be $-17.67 \pm 3.67$; hence, it appears that the slope parameter differs significantly from 0. In Lawless there was interest in computing a confidence interval for $E(Y \mid x = \log 20)$. The robust estimate of this conditional mean is $\hat{Y} = 11.07$ and a confidence interval is $11.07 \pm 1.9$. Similar to the other robust confidence intervals, this interval is the same as in the least squares analysis, except that $\hat{\tau}_\varphi$ replaces $\hat{\sigma}$. A fuller discussion of the R-analysis of this data set can be found in McKean and Sievers (1989).
Example 3.10.2. Sensitivity Analysis for Insulating Fluid Data.
As noted by Lawless, engineers may suggest a Weibull distribution for breakdown times in this problem. As discussed earlier, this means the errors have an extreme value distribution. This distribution is essentially the limit of a GF(2, 2m) distribution as $m \to \infty$. For completeness we obtained, using the IMSL (1987) subroutine UMIAH, estimates based on an extreme value likelihood function. These estimates are labeled EXT. R-estimates based on the optimal R-score function (2.8.8) for the extreme value distribution are labeled as REXT. The influence functions for EXT and REXT estimates are unbounded in Y-space and, hence, neither estimate is robust; see (3.5.17).
In order to illustrate this lack of robustness, we conducted a small sensitivity analysis. We replaced the fifth point, which had the value 6.05 (log units), in the data with an outlying
Table 3.10.1: Sensitivity Analysis for Insulating Data. Cell entries are the estimates of α and β for each fit as the value of $Y_5$ is changed.

                                 Value of Y_5
            Original (6.05)    7.75           10.05          16.05           30.05
LS          59.4   -16.4       60.8  -16.8    62.7  -17.3    67.6   -18.7    79.1   -21.9
Wilcoxon    62.7   -17.2       63.1  -17.4    63.0  -17.4    63.1   -17.4    63.1   -17.4
GF(2, 10)   64.0   -17.7       65.5  -18.1    67.0  -18.5    67.1   -18.5    67.1   -18.5
REXT        64.1   -17.7       65.5  -18.1    68.3  -18.9    68.3   -18.9    68.3   -18.9
EXT         64.8   -17.7       68.4  -18.7    79.3  -21.8    114.6  -31.8    191.7  -53.5
observation. Table 3.10.1 summarizes the results for several different choices of the outlier. Note that even for the first case, when the changed point is 7.75, which is the maximum of the original data, there is a substantial change in the EXT estimates. The EXT fit is a disaster when the point is changed to 10.05, whereas the R-estimates exhibit robustness. This is even more so for succeeding cases. Although the REXT estimates have an unbounded influence function, they behaved well in this sensitivity analysis.
3.11 Correlation Model
In this section we are concerned with the correlation model defined by
$$Y = \alpha + \mathbf{x}'\boldsymbol{\beta} + e \; , \quad (3.11.1)$$
where $\mathbf{x}$ is a p-dimensional random vector with distribution function M and density function m, e is a random variable with distribution function F and density f, and $\mathbf{x}$ and e are independent. Let H and h denote the joint distribution function and joint density function of Y and $\mathbf{x}$. It follows that
$$h(\mathbf{x}, y) = f(y - \alpha - \mathbf{x}'\boldsymbol{\beta})\,m(\mathbf{x}) \; . \quad (3.11.2)$$
Denote the marginal distribution and density functions of Y by G and g.
The hypotheses of interest are:
$$H_0: Y \text{ and } \mathbf{x} \text{ are independent} \;\; \text{versus} \;\; H_A: Y \text{ and } \mathbf{x} \text{ are dependent} \; . \quad (3.11.3)$$
By (3.11.2) this is equivalent to the hypotheses $H_0: \boldsymbol{\beta} = \mathbf{0}$ versus $H_A: \boldsymbol{\beta} \neq \mathbf{0}$. For this section, we will use the additional assumptions:
$$\text{(E.2)} \quad \mathrm{Var}(e) = \sigma_e^2 < \infty \quad (3.11.4)$$
$$\text{(M.1)} \quad E[\mathbf{x}\mathbf{x}'] = \boldsymbol{\Sigma} \; , \; \boldsymbol{\Sigma} > 0 \; . \quad (3.11.5)$$
Without loss of generality, assume that $E[\mathbf{x}] = \mathbf{0}$ and $E(e) = 0$.
Let $(\mathbf{x}_1, Y_1), \ldots, (\mathbf{x}_n, Y_n)$ be a random sample from the above model. Define the $n \times p$ matrix $\mathbf{X}_1$ to be the matrix whose ith row is the vector $\mathbf{x}_i'$, and let $\mathbf{X}$ be the corresponding centered matrix, i.e., $\mathbf{X} = (\mathbf{I} - n^{-1}\mathbf{1}\mathbf{1}')\mathbf{X}_1$. Thus the notation here agrees with that found in the previous sections.
We intend to briefly describe the rank-based analysis for this model. As we will show, using conditional arguments, the asymptotic inference we developed for the fixed $\mathbf{x}$ case holds for the stochastic case also. We then want to explore measures of association between $\mathbf{x}$ and Y. These will be analogues of the classical coefficient of multiple determination, $R^2$. As with $R^2$, these robust CMDs will be 0 when $\mathbf{x}$ and Y are independent and positive when they are dependent. Besides defining these measures, we will obtain consistent estimates of them. First we show that, conditionally, the assumptions of Section 3.4 hold. Much of the discussion in this section is taken from the paper by Witt, Naranjo and McKean (1995).
3.11.1 Huber's Condition for the Correlation Model
The key assumption on the design matrix for the nonstochastic x linear model was Huber's condition, (D.2), (3.4.7). As we next show, it holds almost surely (a.s.) for the correlation model. This will allow us to easily obtain inference methods for the correlation model, as discussed below.
First define the modulus of a matrix $\mathbf{A}$ to be
$$m(\mathbf{A}) = \max_{i,j}|a_{ij}| \; . \quad (3.11.6)$$
As Exercise 3.16.33 shows, the following three facts follow from this definition: $m(\mathbf{AB}) \leq p\,m(\mathbf{A})m(\mathbf{B})$, where p is the common dimension of $\mathbf{A}$ and $\mathbf{B}$; $m(\mathbf{AA}') \geq m(\mathbf{A})^2$; and $m(\mathbf{A}) = \max a_{ii}$ if $\mathbf{A}$ is positive semidefinite. We next need a preliminary lemma found in Arnold (1980).
Lemma 3.11.1. Let $\{a_n\}$ be a sequence of nonnegative real numbers. If $n^{-1}\sum_{i=1}^n a_i \to a_0$, then $n^{-1}\sup_{1\leq i\leq n} a_i \to 0$.
Proof: We have
$$\frac{a_n}{n} = \frac{1}{n}\sum_{i=1}^n a_i - \frac{n-1}{n}\,\frac{1}{n-1}\sum_{i=1}^{n-1} a_i \to 0 \; . \quad (3.11.7)$$
Now suppose that $n^{-1}\sup_{1\leq i\leq n} a_i \not\to 0$. Then, for some $\epsilon > 0$ and for all integers N, there exists an $n_N$ such that $n_N \geq N$ and $n_N^{-1}\sup_{1\leq i\leq n_N} a_i \geq \epsilon$. Thus we can find a subsequence of integers $\{n_j\}$ such that $n_j \to \infty$ and $n_j^{-1}\sup_{1\leq i\leq n_j} a_i \geq \epsilon$. Let $a_{i_{n_j}} = \sup_{1\leq i\leq n_j} a_i$. Then
$$\epsilon \leq \frac{a_{i_{n_j}}}{n_j} \leq \frac{a_{i_{n_j}}}{i_{n_j}} \; . \quad (3.11.8)$$
Also, since $n_j \to \infty$ and $\epsilon > 0$, $i_{n_j} \to \infty$; hence, expression (3.11.8) leads to a contradiction of expression (3.11.7).
The following theorem is due to Arnold (1980).
Theorem 3.11.1. Under (3.11.5),
$$\lim_{n\to\infty} \max \mathrm{diag}\left[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right] = 0 \; , \; \text{a.s.} \quad (3.11.9)$$
Proof: Using the facts cited above on the modulus of a matrix, we have
$$m\left[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right] \leq p^2\,n^{-1}\,m(\mathbf{XX}')\,m\!\left[\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1}\right] \; . \quad (3.11.10)$$
Using the assumptions on the correlation model, the law of large numbers yields $\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right) \to \boldsymbol{\Sigma}$ a.s. Hence we need only show that $n^{-1}m(\mathbf{XX}') \to 0$ a.s. Let $U_i$ denote the ith diagonal element of $\mathbf{XX}'$. We then have
$$\frac{1}{n}\sum_{i=1}^n U_i = \frac{1}{n}\,\mathrm{tr}\,\mathbf{X}'\mathbf{X} \stackrel{a.s.}{\longrightarrow} \mathrm{tr}\,\boldsymbol{\Sigma} \; .$$
By Lemma 3.11.1 we have $n^{-1}\sup_{i\leq n} U_i \stackrel{a.s.}{\longrightarrow} 0$. Since $\mathbf{XX}'$ is positive semidefinite, the desired conclusion is obtained from the facts which followed expression (3.11.6).
Thus, given $\mathbf{X}$, we have the same assumptions on the design matrix as we did in the previous sections. By conditioning on $\mathbf{X}$, the theory derived in Section 3.5 holds for the correlation model also. Such a conditional argument is demonstrated in Theorem 3.11.2 below. For later discussion we summarize the rank-based inference for the correlation model. Given a specified score function φ, let $\hat{\boldsymbol{\beta}}_\varphi$ denote the R-estimate of $\boldsymbol{\beta}$ defined in Section 3.2. Under the correlation model (3.11.1) and the assumptions (3.11.4), (S.1), (3.4.10), and (3.11.5), $\sqrt{n}(\hat{\boldsymbol{\beta}}_\varphi - \boldsymbol{\beta}) \stackrel{D}{\to} N_p(\mathbf{0}, \tau_\varphi^2\boldsymbol{\Sigma}^{-1})$. Also, the estimates of $\tau_\varphi$ discussed in Section 3.7.1 will be consistent estimates of $\tau_\varphi$ under the correlation model. Let $\hat{\tau}_\varphi$ denote such an estimate. In terms of testing, consider the R-test statistic, $F_\varphi = (RD/p)/(\hat{\tau}_\varphi/2)$, of the above hypothesis $H_0$ of independence. Employing the usual conditional argument, it follows that $pF_\varphi \stackrel{D}{\to} \chi^2(p, \delta_R)$, a.e. M under $H_n: \boldsymbol{\beta} = \boldsymbol{\theta}/\sqrt{n}$, where the noncentrality parameter $\delta_R$ is given by $\delta_R = \boldsymbol{\theta}'\boldsymbol{\Sigma}\boldsymbol{\theta}/\tau_\varphi^2$.
Likewise for the LS estimate $\hat{\boldsymbol{\beta}}_{LS}$ of $\boldsymbol{\beta}$. Using the conditional argument (see Arnold (1980) for details), $\sqrt{n}(\hat{\boldsymbol{\beta}}_{LS} - \boldsymbol{\beta}) \stackrel{D}{\to} N_p(\mathbf{0}, \sigma^2\boldsymbol{\Sigma}^{-1})$ and, under $H_n$, $pF_{LS} \stackrel{D}{\to} \chi^2(p, \delta_{LS})$ with noncentrality parameter $\delta_{LS} = \boldsymbol{\theta}'\boldsymbol{\Sigma}\boldsymbol{\theta}/\sigma^2$. Thus the ARE of the R-test $F_\varphi$ to the least squares test $F_{LS}$ is the ratio of noncentrality parameters, $\sigma^2/\tau_\varphi^2$. This is the usual ARE of rank tests to tests based on least squares in simple location models. Hence the test statistic $F_\varphi$ has efficiency robustness. The theory of rank-based tests in Section 3.6 applies to the correlation model.
We return to measures of association and their estimates. For motivation, we consider the least squares measure first.
3.11.2 Traditional Measure of Association and its Estimate
The traditional population coefficient of multiple determination (CMD) is defined by
$$\bar{R}^2 = \frac{\boldsymbol{\beta}'\boldsymbol{\Sigma}\boldsymbol{\beta}}{\sigma_e^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}\boldsymbol{\beta}} \; ; \quad (3.11.11)$$
see Arnold (1981). Note that $\bar{R}^2$ is a measure of association between Y and $\mathbf{x}$. It lies between 0 and 1, and it is 0 if and only if Y and $\mathbf{x}$ are independent (because Y and $\mathbf{x}$ are independent if and only if $\boldsymbol{\beta} = \mathbf{0}$).
In order to obtain a consistent estimate of $\bar{R}^2$, treat $\mathbf{x}_i$ as nonstochastic and fit by least squares the model $Y_i = \alpha + \mathbf{x}_i'\boldsymbol{\beta} + e_i$, which will be called the full model. The residual amount of variation is $SSE = \sum_{i=1}^n(Y_i - \hat{\alpha}_{LS} - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{LS})^2$, where $\hat{\alpha}_{LS}$ and $\hat{\boldsymbol{\beta}}_{LS}$ are the least squares estimates. Next fit the reduced model, defined as the full model subject to $H_0: \boldsymbol{\beta} = \mathbf{0}$. The total amount of variation is $SST = \sum_{i=1}^n(Y_i - \bar{Y})^2$. The reduction in variation in fitting the full model over the reduced model is $SSR = SST - SSE$. An estimate of $\bar{R}^2$ is the proportion of explained variation given by
$$R^2 = \frac{SSR}{SST} \; . \quad (3.11.12)$$
The least squares test statistic for $H_0$ versus $H_A$ is $F_{LS} = (SSR/p)/\hat{\sigma}^2_{LS}$, where $\hat{\sigma}^2_{LS} = SSE/(n - p - 1)$. Recall that $R^2$ can be expressed as
$$R^2 = \frac{SSR}{SSR + (n - p - 1)\hat{\sigma}^2_{LS}} = \frac{\frac{p}{n-p-1}F_{LS}}{1 + \frac{p}{n-p-1}F_{LS}} \; . \quad (3.11.13)$$
Now consider the general correlation model. As shown in Arnold (1980), under (3.11.4) and (3.11.5), $R^2$ is a consistent estimate of $\bar{R}^2$. Under the multivariate normal model, $R^2$ is the maximum likelihood estimate of $\bar{R}^2$.
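The identity (3.11.13) is easy to confirm numerically; a quick R sketch (the simulated data and their dimensions are illustrative only):

# Check R^2 = SSR/SST against the F_LS form in (3.11.13).
set.seed(1)
n <- 50; p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- drop(1 + x %*% c(1, 0.5, 0) + rnorm(n))
fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2); sst <- sum((y - mean(y))^2)
fls <- ((sst - sse) / p) / (sse / (n - p - 1))
c(1 - sse / sst,
  (p / (n - p - 1)) * fls / (1 + (p / (n - p - 1)) * fls))   # identical values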
3.11.3 Robust Measure of Association and its Estimate
The rank-based analogue of the reduction in residual variation is the reduction in residual dispersion, which is given by $RD = D(\mathbf{0}) - D(\hat{\boldsymbol{\beta}}_R)$. Hence, the proportion of dispersion explained by fitting $\boldsymbol{\beta}$ is
$$R_1 = RD/D(\mathbf{0}) \; . \quad (3.11.14)$$
This is a natural CMD for any robust estimate and, as we shall show below, the population CMD for which $R_1$ is a consistent estimate does satisfy interesting properties. As expression (A.5.11) of the Appendix shows, however, the influence function of the denominator is not bounded in the Y-space. Hence the statistic $R_1$ is not robust.
In order to obtain a CMD which is robust, consider the test statistic of $H_0$, $F_\varphi = (RD/p)/(\hat{\tau}_\varphi/2)$, (3.6.12). As we indicated above, the test statistic $F_\varphi$ has efficiency robustness. Furthermore, as shown in the Appendix, the influence function of $F_\varphi$ is bounded in the Y-space. Hence the test statistic is robust.
Consider the relationship between the classical F-test and $R^2$ given by expression (3.11.13). In the same way, but using the robust test statistic $F_\varphi$, we can define a second R-coefficient of multiple determination
$$R_2 = \frac{\frac{p}{n-p-1}F_\varphi}{1 + \frac{p}{n-p-1}F_\varphi} = \frac{RD}{RD + (n - p - 1)(\hat{\tau}_\varphi/2)} \; . \quad (3.11.15)$$
It follows from the above discussion on the R-test statistic that $R_2$ has bounded influence in the Y-space.
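As a computational sketch, for Wilcoxon scores the dispersion function is $D(\boldsymbol{\beta}) = \sum a(R(e_i))e_i$ with $a(i) = \sqrt{12}(i/(n+1) - 1/2)$, so $R_1$ and $R_2$ can be computed directly from the residuals of an R fit. In the following R lines, resid, tauhat, and p are placeholders for the R residuals, the estimate of $\tau_\varphi$, and the number of predictors from whatever fitting routine is used.

# Wilcoxon dispersion D evaluated at a set of residuals e; note that
# adding a constant to e does not change the value, since the scores sum to 0.
wil.disp <- function(e) {
  n <- length(e)
  sum(sqrt(12) * (rank(e) / (n + 1) - 0.5) * e)
}
n <- length(y)
RD <- wil.disp(y) - wil.disp(resid)          # D(0) - D(betahat)
R1 <- RD / wil.disp(y)                       # (3.11.14)
R2 <- RD / (RD + (n - p - 1) * tauhat / 2)   # (3.11.15)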
The parameters that respectively correspond to the statistics $D(\mathbf{0})$ and $D(\hat{\boldsymbol{\beta}}_R)$ are $\bar{D}_y = \int \varphi(G(y))\,y\,dG(y)$ and $\bar{D}_e = \int \varphi(F(e))\,e\,dF(e)$; see the discussion in Section 3.6.3. The population CMDs associated with $R_1$ and $R_2$ are:
$$\bar{R}_1 = \overline{RD}/\bar{D}_y \quad (3.11.16)$$
$$\bar{R}_2 = \overline{RD}/(\overline{RD} + (\tau_\varphi/2)) \; , \quad (3.11.17)$$
where $\overline{RD} = \bar{D}_y - \bar{D}_e$. The properties of these parameters are discussed in the next section. The consistency of $R_1$ and $R_2$ is given in the following theorem:
Theorem 3.11.2. Under the correlation model (3.11.1) and the assumptions (E.1), (2.4.16), (S.1), (3.4.10), (S.2), (3.4.11), and (3.11.5),
$$R_i \stackrel{P}{\to} \bar{R}_i \;\; \text{a.e.}\; M \; , \; i = 1, 2 \; .$$
Proof: Note that we can write
$$\frac{1}{n}D(\mathbf{0}) = \sum_{i=1}^n \varphi\!\left(\frac{n}{n+1}F_n(Y_i)\right)Y_i\,\frac{1}{n} = \int \varphi\!\left(\frac{n}{n+1}F_n(t)\right)t\,dF_n(t) \; ,$$
where $F_n$ denotes the empirical distribution function of the random sample $Y_1, \ldots, Y_n$. As $n \to \infty$ the integral converges to $\bar{D}_y$.
Next consider the reduction in dispersion. By Theorem 3.11.1, with probability 1, we can restrict the sample space to a space on which Huber's design condition (D.1) holds and on which $n^{-1}\mathbf{X}'\mathbf{X} \to \boldsymbol{\Sigma}$. Then, conditionally given $\mathbf{X}$, we have the assumptions found in Section 3.4 for the non-stochastic model. Hence, from the discussion found in Section 3.6.3, $(1/n)D(\hat{\boldsymbol{\beta}}_R) \stackrel{P}{\to} \bar{D}_e$. Hence it is true unconditionally, a.e. M. The consistency of $\hat{\tau}_\varphi$ was discussed above. The result then follows.
Example 3.11.1. Measures of Association for Wilcoxon Scores.
For the Wilcoxon score function, $\varphi_W(u) = \sqrt{12}(u - 1/2)$, as Exercise 3.16.34 shows, $\bar{D}_y = \int \varphi_W(G(y))\,y\,dG(y) = \sqrt{3/4}\,E|Y_1 - Y_2|$, where $Y_1, Y_2$ are iid with distribution function G. Likewise, $\bar{D}_e = \sqrt{3/4}\,E|e_1 - e_2|$, where $e_1, e_2$ are iid with distribution function F. Finally, $\tau_\varphi = (\sqrt{12}\int f^2)^{-1}$. Hence, for Wilcoxon scores these coefficients of multiple determination simplify to
$$\bar{R}_{W1} = \frac{E|Y_1 - Y_2| - E|e_1 - e_2|}{E|Y_1 - Y_2|} \quad (3.11.18)$$
$$\bar{R}_{W2} = \frac{E|Y_1 - Y_2| - E|e_1 - e_2|}{E|Y_1 - Y_2| - E|e_1 - e_2| + \left(1/(6\int f^2)\right)} \; . \quad (3.11.19)$$
As discussed above, in general, $\bar{R}_{W1}$ is not robust but $\bar{R}_{W2}$ is.
Example 3.11.2. Measures of Association for Sign Scores.
For the sign score function, Exercise 3.16.34 shows that $\bar{D}_y = \int \varphi(G(y))\,y\,dG(y) = E|Y - \mathrm{med}\,Y|$, where $\mathrm{med}\,Y$ denotes the median of Y. Likewise, $\bar{D}_e = E|e - \mathrm{med}\,e|$. Hence, for sign scores, the coefficients of multiple determination are
$$\bar{R}_{S1} = \frac{E|Y - \mathrm{med}\,Y| - E|e - \mathrm{med}\,e|}{E|Y - \mathrm{med}\,Y|} \quad (3.11.20)$$
$$\bar{R}_{S2} = \frac{E|Y - \mathrm{med}\,Y| - E|e - \mathrm{med}\,e|}{E|Y - \mathrm{med}\,Y| - E|e - \mathrm{med}\,e| + (4f(\mathrm{med}\,e))^{-1}} \; . \quad (3.11.21)$$
These were obtained by McKean and Sievers (1987) from an $L_1$ point of view.
3.11.4 Properties of R-Coefficients of Multiple Determination
In this section we explore further properties of the population coefficients of multiple determination proposed in the last section. To show that $\bar{R}_1$ and $\bar{R}_2$, (3.11.16) and (3.11.17), are indeed measures of association, we have the following two theorems. The proof of the first theorem is quite similar to corresponding proofs of properties of the dispersion function for the nonstochastic model.
Theorem 3.11.3. Suppose f and g satisfy the condition (E.1), (3.4.1), and their first moments are finite. Then $\bar{D}_y > 0$ and $\bar{D}_e > 0$, where $\bar{D}_y = \int \varphi(G(y))\,y\,dG(y)$.
Proof: It suffices to show it for $\bar{D}_y$, since the proof for $\bar{D}_e$ is the same. The function φ is increasing and $\int \varphi = 0$; hence, φ must take on both negative and positive values. Thus the set $A = \{y : \varphi(G(y)) < 0\}$ is not empty and is bounded above. Let $y_0 = \sup A$. Since $\int \varphi(G(y))\,dG(y) = \int_0^1 \varphi(u)\,du = 0$, we may subtract $y_0$ inside the integral; then
$$\bar{D}_y = \int_{-\infty}^{y_0} \varphi(G(y))(y - y_0)\,dG(y) + \int_{y_0}^{\infty} \varphi(G(y))(y - y_0)\,dG(y) \; . \quad (3.11.22)$$
Since both integrands are nonnegative, it follows that $\bar{D}_y \geq 0$. If $\bar{D}_y = 0$ then it follows from (E.1) that $\varphi(G(y)) = 0$ for all $y \neq y_0$, which contradicts the facts that φ takes on both positive and negative values and that G is absolutely continuous.
The next theorem is taken from Witt (1989).
Theorem 3.11.4. Suppose f and g satisfy the conditions (E.1) and (E.2) in Section 3.4 and that φ satisfies assumption (S.2), (3.4.11). Then $\overline{RD}$ is a strictly convex function of $\boldsymbol{\beta}$ and has a minimum value of 0 at $\boldsymbol{\beta} = \mathbf{0}$.
Proof: We will show that the gradient of $\overline{RD}$ is zero at $\boldsymbol{\beta} = \mathbf{0}$ and that its second matrix derivative is positive definite. Note first that the distribution function, G, and density, g, of Y can be expressed as $G(y) = \int F(y - \boldsymbol{\beta}'\mathbf{x})\,dM(\mathbf{x})$ and $g(y) = \int f(y - \boldsymbol{\beta}'\mathbf{x})\,dM(\mathbf{x})$. We have
$$\frac{\partial \overline{RD}}{\partial \boldsymbol{\beta}} = -\iiint \varphi'[G(y)]\,y\,f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{u}\,dM(\mathbf{x})\,dM(\mathbf{u})\,dy - \iint \varphi[G(y)]\,y\,f'(y - \boldsymbol{\beta}'\mathbf{x})\,\mathbf{x}\,dM(\mathbf{x})\,dy \; . \quad (3.11.23)$$
Since $E[\mathbf{x}] = \mathbf{0}$, both terms on the right side of the above expression are $\mathbf{0}$ at $\boldsymbol{\beta} = \mathbf{0}$. Before obtaining the second derivative, we rewrite the first term of (3.11.23) as
$$-\int\left[\iint \varphi'[G(y)]\,y\,f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})\,dy\,dM(\mathbf{x})\right]\mathbf{u}\,dM(\mathbf{u}) = -\int\left[\int \varphi'[G(y)]\,g(y)\,y\,f(y - \boldsymbol{\beta}'\mathbf{u})\,dy\right]\mathbf{u}\,dM(\mathbf{u}) \; .$$
Next integrate by parts the expression in brackets with respect to y, using $dv = \varphi'[G(y)]g(y)\,dy$ and $t = y f(y - \boldsymbol{\beta}'\mathbf{u})$. Since φ is bounded and f has a finite second moment, this leads to
$$\frac{\partial \overline{RD}}{\partial \boldsymbol{\beta}} = \iint \varphi[G(y)]f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{u}\,dy\,dM(\mathbf{u}) + \iint \varphi[G(y)]\,y\,f'(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{u}\,dy\,dM(\mathbf{u}) - \iint \varphi[G(y)]\,y\,f'(y - \boldsymbol{\beta}'\mathbf{x})\,\mathbf{x}\,dy\,dM(\mathbf{x})$$
$$= \iint \varphi[G(y)]f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{u}\,dy\,dM(\mathbf{u}) \; .$$
Hence the second derivative of $\overline{RD}$ is
$$\frac{\partial^2 \overline{RD}}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'} = -\iint \varphi[G(y)]f'(y - \boldsymbol{\beta}'\mathbf{x})\,\mathbf{x}\mathbf{x}'\,dy\,dM(\mathbf{x}) - \iiint \varphi'[G(y)]f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{x}\mathbf{u}'\,dy\,dM(\mathbf{x})\,dM(\mathbf{u}) \; . \quad (3.11.24)$$
Now integrate the first term on the right side of (3.11.24) by parts with respect to y, using $dt = f'(y - \boldsymbol{\beta}'\mathbf{x})\,dy$ and $v = \varphi[G(y)]$. This leads to
$$\frac{\partial^2 \overline{RD}}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'} = -\iiint \varphi'[G(y)]f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{x}(\mathbf{u} - \mathbf{x})'\,dy\,dM(\mathbf{x})\,dM(\mathbf{u}) \; . \quad (3.11.25)$$
We have, however, the following identity:
$$\iiint \varphi'[G(y)]f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})(\mathbf{u} - \mathbf{x})(\mathbf{u} - \mathbf{x})'\,dy\,dM(\mathbf{x})\,dM(\mathbf{u})$$
$$= \iiint \varphi'[G(y)]f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{u}(\mathbf{u} - \mathbf{x})'\,dy\,dM(\mathbf{x})\,dM(\mathbf{u}) - \iiint \varphi'[G(y)]f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})\,\mathbf{x}(\mathbf{u} - \mathbf{x})'\,dy\,dM(\mathbf{x})\,dM(\mathbf{u}) \; .$$
By the symmetry of the integrand in $\mathbf{x}$ and $\mathbf{u}$, the two integrals on the right side of the last expression are negatives of each other; this, combined with expression (3.11.25), leads to
$$2\,\frac{\partial^2 \overline{RD}}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'} = \iiint \varphi'[G(y)]f(y - \boldsymbol{\beta}'\mathbf{x})f(y - \boldsymbol{\beta}'\mathbf{u})(\mathbf{u} - \mathbf{x})(\mathbf{u} - \mathbf{x})'\,dy\,dM(\mathbf{x})\,dM(\mathbf{u}) \; .$$
Since the functions f and M are continuous and the score function φ is increasing, it follows that the right side of this last expression is a positive definite matrix.
It follows from these theorems that the $\bar{R}_i$'s satisfy properties of association similar to those of $\bar{R}^2$. We have $0 \leq \bar{R}_i \leq 1$. By Theorem 3.11.4, $\bar{R}_i = 0$ if and only if $\boldsymbol{\beta} = \mathbf{0}$, if and only if Y and $\mathbf{x}$ are independent.
Example 3.11.3. Multivariate Normal Model.
Further understanding of the $\bar{R}_i$ can be gleaned from their direct relationship with $\bar{R}^2$ for the multivariate normal model.
Theorem 3.11.5. Suppose Model (3.11.1) holds. Assume further that $(\mathbf{x}, Y)$ follows a multivariate normal distribution with the variance-covariance matrix
$$\boldsymbol{\Sigma}_{(\mathbf{x},Y)} = \begin{pmatrix} \boldsymbol{\Sigma} & \boldsymbol{\Sigma}\boldsymbol{\beta} \\ \boldsymbol{\beta}'\boldsymbol{\Sigma} & \sigma_e^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}\boldsymbol{\beta} \end{pmatrix} \; . \quad (3.11.26)$$
Then, from (3.11.16) and (3.11.17),
$$\bar{R}_1 = 1 - \sqrt{1 - \bar{R}^2} \quad (3.11.27)$$
$$\bar{R}_2 = \frac{1 - \sqrt{1 - \bar{R}^2}}{1 - \sqrt{1 - \bar{R}^2}\left[1 - (1/(2T^2))\right]} \; , \quad (3.11.28)$$
where $T = \int \varphi[\Phi(t)]\,t\,d\Phi(t)$, Φ is the standard normal distribution function, and $\bar{R}^2$ is the traditional coefficient of multiple determination given by (3.11.11).
Proof: Note that $\sigma_y^2 = \sigma_e^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}\boldsymbol{\beta}$ and $E(Y) = \alpha + \boldsymbol{\beta}'E[\mathbf{x}]$. Further, the distribution function of Y is $G(y) = \Phi((y - \alpha - \boldsymbol{\beta}'E(\mathbf{x}))/\sigma_y)$, where Φ is the standard normal distribution function. Then
$$\bar{D}_y = \int \varphi[\Phi(y/\sigma_y)]\,y\,d\Phi(y/\sigma_y) \quad (3.11.29)$$
$$= \sigma_y T \; . \quad (3.11.30)$$
Similarly, $\bar{D}_e = \sigma_e T$. Hence,
$$\overline{RD} = (\sigma_y - \sigma_e)T \; . \quad (3.11.31)$$
By the definition of $\bar{R}^2$, we have $\bar{R}^2 = 1 - \frac{\sigma_e^2}{\sigma_y^2}$. This leads to the relationship
$$1 - \sqrt{1 - \bar{R}^2} = \frac{\sigma_y - \sigma_e}{\sigma_y} \; . \quad (3.11.32)$$
The result (3.11.27) follows from the expressions (3.11.31) and (3.11.32).
For the result (3.11.28), by the assumptions on the distribution of $(\mathbf{x}, Y)$, the distribution of e is $N(0, \sigma_e^2)$; i.e., $f(x) = (2\pi\sigma_e^2)^{-1/2}\exp\{-x^2/(2\sigma_e^2)\}$ and $F(x) = \Phi(x/\sigma_e)$. It follows that $-f'(x)/f(x) = \sigma_e^{-2}x$, which leads to
$$-\frac{f'(F^{-1}(u))}{f(F^{-1}(u))} = \frac{1}{\sigma_e}\,\Phi^{-1}(u) \; .$$
Hence,
$$\tau_\varphi^{-1} = \int_0^1 \varphi(u)\left\{\frac{1}{\sigma_e}\Phi^{-1}(u)\right\}du = \frac{1}{\sigma_e}\int_0^1 \varphi(u)\,\Phi^{-1}(u)\,du \; .$$
Upon making the substitution $u = \Phi(t)$, we obtain the relationship $T = \sigma_e/\tau_\varphi$. Using this, the result (3.11.31), and the definition of $\bar{R}_2$, (3.11.17), we get
$$\bar{R}_2 = \frac{\frac{\sigma_y - \sigma_e}{\sigma_y}}{\frac{\sigma_y - \sigma_e}{\sigma_y} + \frac{\sigma_e}{\sigma_y}\,\frac{1}{2T^2}} \; .$$
The result for $\bar{R}_2$ follows from this and (3.11.32).
Note that T is free of all parameters. It can be shown directly that the $\bar{R}_i$'s are one-to-one increasing functions of $\bar{R}^2$; see Exercise 3.16.35. Hence, for the multivariate normal model the parameters $\bar{R}^2$, $\bar{R}_1$, and $\bar{R}_2$ are equivalent.
Although the CMDs are equivalent for the normal model, they measure dependence between $\mathbf{x}$ and Y on different scales. We can use the relationships derived in the last theorem to have these coefficients measure the same quantity at the normal model, by simply solving for $\bar{R}^2$ in terms of $\bar{R}_1$ and $\bar{R}_2$ in (3.11.27) and (3.11.28), respectively. These parameters will be useful later, so we will call them $\bar{R}_1^*$ and $\bar{R}_2^*$ respectively. Hence, solving as indicated, we get
$$\bar{R}_1^{*2} = 1 - (1 - \bar{R}_1)^2 \quad (3.11.33)$$
$$\bar{R}_2^{*2} = 1 - \left[\frac{1 - \bar{R}_2}{1 - \bar{R}_2\left(1 - (1/(2T^2))\right)}\right]^2 \; . \quad (3.11.34)$$
Again, at the multivariate normal model we have $\bar{R}^2 = \bar{R}_1^{*2} = \bar{R}_2^{*2}$.
For Wilcoxon scores and sign scores, the reader is asked to show in Exercise 3.16.36 that $(1/(2T^2)) = \pi/6$ and $(1/(2T^2)) = \pi/4$, respectively.
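For instance, with Wilcoxon scores the conversion to the rescaled coefficients is one line of R each; a sketch of (3.11.33) and (3.11.34), with $1/(2T^2) = \pi/6$:

# Rescaled CMDs (3.11.33)-(3.11.34) for Wilcoxon scores: 1/(2T^2) = pi/6.
r1.star.sq <- function(R1) 1 - (1 - R1)^2
r2.star.sq <- function(R2) 1 - ((1 - R2) / (1 - R2 * (1 - pi/6)))^2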
Example 3.11.4. A Contaminated Normal Model.
As an illustration of these population coefficients of multiple determination, we evaluate them for the situation where the random error e has a contaminated normal distribution with proportion of contamination ε and ratio of contaminated to uncontaminated variance $\sigma_c^2$, the random variable x has a univariate normal N(0, 1) distribution, and the parameter β = 1, so that $\boldsymbol{\beta}'\boldsymbol{\Sigma}\boldsymbol{\beta} = 1$. Without loss of generality, we took α = 0 in (3.11.1). Hence Y and x are dependent. We consider the CMDs based on the Wilcoxon score function only. The density of $Y = x + e$ is given by
$$g(y) = \frac{1 - \epsilon}{\sqrt{2}}\,\phi\!\left(\frac{y}{\sqrt{2}}\right) + \frac{\epsilon}{\sqrt{1 + \sigma_c^2}}\,\phi\!\left(\frac{y}{\sqrt{1 + \sigma_c^2}}\right) \; ,$$
where φ denotes the standard normal density.
This leads to the expressions
$$\bar{D}_y = \frac{\sqrt{12}}{\sqrt{2\pi}}\left\{(1 - \epsilon)^2 + \epsilon(1 - \epsilon)\left[3 + \sigma_c^2\right]^{1/2} + 2^{-1/2}\epsilon^2\left[1 + \sigma_c^2\right]^{1/2}\right\}$$
$$\bar{D}_e = \frac{\sqrt{12}}{\sqrt{2\pi}}\left\{2^{-1/2}(1 - \epsilon)^2 + \epsilon(1 - \epsilon)\left[1 + \sigma_c^2\right]^{1/2} + 2^{-1/2}\epsilon^2\sigma_c\right\}$$
$$\tau_\varphi = \left\{\frac{\sqrt{12}}{\sqrt{2\pi}}\left[\frac{(1 - \epsilon)^2}{\sqrt{2}} + \frac{\epsilon^2}{\sqrt{2}\,\sigma_c} + \frac{2\epsilon(1 - \epsilon)}{\sqrt{\sigma_c^2 + 1}}\right]\right\}^{-1} \; ;$$
see Exercise 3.16.37. Based on these quantities, the coefficients of multiple determination $\bar{R}^2$, $\bar{R}_1$ and $\bar{R}_2$ can be readily formulated.
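These formulas are straightforward to evaluate; the following R sketch computes $\bar{R}^2$ and the rescaled robust coefficients over a grid of ε. Here $\sigma_e^2 = (1 - \epsilon) + \epsilon\sigma_c^2$, and the values agree with those in Table 3.11.1 up to rounding.

# Population CMDs at the contaminated normal model (Wilcoxon scores).
cmds <- function(eps, sc2) {
  sc <- sqrt(sc2); k <- sqrt(12) / sqrt(2 * pi)
  Dy  <- k * ((1 - eps)^2 + eps * (1 - eps) * sqrt(3 + sc2) +
              eps^2 * sqrt(1 + sc2) / sqrt(2))
  De  <- k * ((1 - eps)^2 / sqrt(2) + eps * (1 - eps) * sqrt(1 + sc2) +
              eps^2 * sc / sqrt(2))
  tau <- 1 / (k * ((1 - eps)^2 / sqrt(2) + eps^2 / (sqrt(2) * sc) +
              2 * eps * (1 - eps) / sqrt(sc2 + 1)))
  RD <- Dy - De
  R1 <- RD / Dy; R2 <- RD / (RD + tau / 2)
  c(Rbar2   = 1 / (1 + (1 - eps) + eps * sc2),         # beta'Sigma beta = 1
    R1star2 = 1 - (1 - R1)^2,
    R2star2 = 1 - ((1 - R2) / (1 - R2 * (1 - pi/6)))^2)
}
round(t(sapply(c(0, .01, .02, .05, .10, .15), cmds, sc2 = 9)), 2)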
Table 3.11.1 displays these parameters for several values of ε and for $\sigma_c^2 = 9$ and 100. For ease of interpretation, we rescaled the robust CMDs as discussed above. Thus at the normal (ε = 0) we have $\bar{R}_1^{*2} = \bar{R}_2^{*2} = \bar{R}^2$, with the common value of .5 in these situations.
Table 3.11.1: Coefficients of Multiple Determination under Contaminated Errors (e).

                e ~ CN(eps, sigma_c^2 = 9)       e ~ CN(eps, sigma_c^2 = 100)
CMD    eps:   .00  .01  .02  .05  .10  .15     .00  .01  .02  .05  .10  .15
R^2           .50  .48  .46  .42  .36  .31     .50  .33  .25  .14  .08  .06
R_1*          .50  .50  .48  .45  .41  .38     .50  .47  .42  .34  .26  .19
R_2*          .50  .50  .49  .47  .44  .42     .50  .49  .47  .45  .40  .36
Certainly, as either ε or $\sigma_c$ changes, the amount of dependence between Y and x changes; hence all the coefficients change somewhat. However, $\bar{R}^2$ decays as the percentage of contamination increases, and the decay is rapid in the case $\sigma_c^2 = 100$. This is true also, to a lesser degree, for $\bar{R}_1^*$, which is predictable since its denominator has unbounded influence in the Y-space. The coefficient $\bar{R}_2^*$ shows stability with the increase in contamination. For instance, when $\sigma_c^2 = 100$, $\bar{R}^2$ decays .44 units while $\bar{R}_2^*$ decays only .14 units. See Witt et al. (1995) for more discussion on this example.
Ghosh and Sen (1971) proposed the mixed rank test statistic to test the hypothesis of independence, (3.11.3). It is essentially the gradient test of the hypothesis $H_0: \boldsymbol{\beta} = \mathbf{0}$. As we showed in Section 3.6, this test statistic is asymptotically equivalent to $F_\varphi$. Ghosh and Sen (1971) also proposed a pure rank statistic in which both variables are ranked and scored.
3.11.5 Coefficients of Determination for Regression
We have mainly been concerned with coefficients of multiple determination as measures of dependence between the random variables Y and $\mathbf{x}$. In the regression setting, though, the statistic $R^2$ is one of the most widely used statistics, not in the sense of estimating dependence but in the sense of comparing models. As the proportion of variance accounted for, $R^2$ is intuitively appealing. Likewise $R_1$, the proportion of dispersion accounted for in the fit, is an intuitive statistic. But neither of these statistics is robust. The statistic $R_2$, though, is robust and is directly linked (a one-to-one function) to the robust test statistic $F_\varphi$. Furthermore, it lies between 0 and 1, having the value 1 for a perfect fit and 0 for a complete lack of fit. These properties make $R_2$ an attractive coefficient of determination for regression, as the following example illustrates.
Example 3.11.5. Hald Data.
This data set consists of 13 observations and 4 predictors. It can be found in Hald (1952), but it is also discussed in Draper and Smith (1966), where it serves to illustrate a method of predictor subset selection based on $R^2$. The data are given in Table 3.11.2. The response is the heat evolved in calories per gram of cement. The predictors, the percent in weight of ingredients used in the cement, are:
x1 = amount of tricalcium aluminate
x2 = amount of tricalcium silicate
x3 = amount of tetracalcium alumino ferrite
x4 = amount of dicalcium silicate.
Table 3.11.2: Hald Data used in Example 3.11.5
x1  x2  x3  x4  Response
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
Table 3.11.3: Coefficients of Multiple Determination on Hald Data

Subset of        Original Data          Changed Data
Predictors     R^2    R_1    R_2      R^2    R_1    R_2
x1, x2         .98    .86    .92      .57    .55    .92
x1, x3         .55    .33    .52      .47    .24    .41
x1, x4         .97    .84    .90      .52    .51    .88
x2, x3         .85    .63    .76      .66    .46    .72
x2, x4         .68    .46    .62      .34    .27    .57
x3, x4         .94    .76    .89      .67    .52    .83
To illustrate the use of the coefficients of determination $R_1$ and $R_2$, suppose we are interested in the best two variable predictor model based on coefficients of determination. Table 3.11.3 gives the results for two data sets. The first is the original Hald data, while in the second we changed the 11th response observation from 83.8 to 8.8.
Note that on the original data all three coefficients choose the subset $\{x_1, x_2\}$. For the changed data, though, the outlier severely affects the LS coefficient $R^2$ and the nonrobust coefficient $R_1$, but the robust coefficient $R_2$ was much less sensitive to the outlier. It chooses the same subset $\{x_1, x_2\}$ as it did with the original data; however, the LS coefficient selects the subset $\{x_3, x_4\}$, two different predictors than its selection for the original data. The nonrobust coefficient $R_1$ still chooses $\{x_1, x_2\}$, although at a relatively much smaller value.
This example illustrates that the coefficient $R_2$ can be used in the selection of predictors in a regression problem. This selection could be formalized like the MAXR procedure in SAS. In a similar vein, the stepwise model building criteria based on LS estimation (Draper and Smith, 1966) could easily be robustified by using R-estimates in place of LS estimates and the robust test statistic $F_\varphi$ in place of $F_{LS}$.
3.12 High Breakdown (HBR) Estimates
By (3.5.17), the influence function of the R-estimate is unbounded in the x-space. While in a designed experiment this is of little consequence, for non-designed experiments where there are widely dispersed x's (i.e., outliers in factor space), this is of some concern. Here we present R-estimators which have influence functions bounded in both spaces and which can attain 50% breakdown. We shall call these estimators high breakdown R (HBR) estimators. Further, we derive diagnostics which differentiate between fits based on these estimators, R-estimators and LS-estimators. Tableman (1990) provides an alternative development of bounded influence R-estimates.
3.12.1 Geometry of the HBR Estimates
Consider the linear model (3.2.3). In the previous sections, estimation and testing are based on the pseudo-norm (3.2.6). Here we shall consider the function
$$\|\mathbf{u}\|_{HBR} = \sum_{i<j} b_{ij}|u_i - u_j| \; , \quad (3.12.1)$$
where the weights $b_{ij}$ are positive and symmetric, i.e., $b_{ij} = b_{ji}$. It is then easy to show (see Exercise 3.15.1) that the function (3.12.1) is a pseudo-norm. As noted in Section 2.2.2, if the weights $b_{ij} \equiv 1$, then this pseudo-norm is proportional to the pseudo-norm based on the Wilcoxon scores. Hence we will refer to this as a generalized R (HBR) pseudo-norm.
Since this is a pseudo-norm, we can develop estimation and testing procedures using the same geometry as in the last chapter. Briefly, the HBR estimate of $\boldsymbol{\beta}$ in model (3.2.3) is a vector $\hat{\boldsymbol{\beta}}_{HBR}$ such that
$$\hat{\boldsymbol{\beta}}_{HBR} = \mathrm{Argmin}_{\boldsymbol{\beta}}\,\|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|_{HBR} \; . \quad (3.12.2)$$
Equivalently, we can define the dispersion function
$$D_{HBR}(\boldsymbol{\beta}) = \|\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\|_{HBR} \; . \quad (3.12.3)$$
Since it is based on a pseudo-norm, $D_{HBR}$ is a continuous, nonnegative, convex function of $\boldsymbol{\beta}$. The negative of its gradient is given by
$$S_{HBR}(\boldsymbol{\beta}) = \sum_{i<j} b_{ij}(\mathbf{x}_i - \mathbf{x}_j)\,\mathrm{sgn}\!\left[(Y_i - Y_j) - (\mathbf{x}_i - \mathbf{x}_j)'\boldsymbol{\beta}\right] \; . \quad (3.12.4)$$
Thus the HBR estimate solves the equation
$$S_{HBR}(\boldsymbol{\beta}) \doteq \mathbf{0} \; .$$
In the next subsection, we discuss the selection of the weights $b_{ij}$.
The HBR estimates were proposed by Chang, McKean, Naranjo and Sheather (1999). Using the package RBR, these estimates are easily computed, as discussed in the examples below.
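To make the geometry concrete, here is a minimal R sketch of the dispersion function (3.12.3) for a given weight matrix bij; it is a direct transcription of the pseudo-norm, not the algorithm RBR actually uses to minimize it.

# HBR dispersion (3.12.3): D(beta) = sum_{i<j} b_ij |e_i - e_j|, e = Y - X beta.
dhbr <- function(beta, X, y, bij) {
  e <- as.vector(y - X %*% beta)
  d <- abs(outer(e, e, "-"))       # |e_i - e_j| for all pairs
  sum((bij * d)[upper.tri(d)])     # sum over i < j only
}

With bij set identically to 1, this reduces to (a multiple of) the Wilcoxon dispersion, as noted above.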
3.12.2 Weights
The weight for a point $(\mathbf{x}_i, Y_i)$, $i = 1, \ldots, n$, for the HBR estimates is a function of two components. One component depends on the distance of the point $\mathbf{x}_i$ from the center of the X-space (factor space), and the other component depends on the size of the residual based on an initial high breakdown fit. As shown below, these components are used in combination, so the weight due to one component may be offset by the weight of the other component.
First, we consider distance in factor space. It seems reasonable to downweight points far from the center of the data. The leverage values $h_i = n^{-1} + \mathbf{x}_{ci}'(\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{x}_{ci}$, for $i = 1, \ldots, n$, measure distance (Mahalanobis) from the center relative to the scatter matrix $\mathbf{X}_c'\mathbf{X}_c$. Leverage values, though, are based on means and the usual (LS) variance-covariance scatter matrix, which are not robust estimators. There are several robust estimators of location and scatter from which to choose, including the high breakdown minimum covariance determinant (MCD), which is an ellipsoid that covers about half of the data and yet has minimum determinant. Although it is computationally intensive, Rousseeuw and Van Driessen (1999) present a fast computational algorithm for it. Let $\mathbf{v}_c$ denote the center of the ellipsoid. Letting $\mathbf{V}$ denote the MCD, the robust distances are given by
$$Q_i = (\mathbf{x}_i - \mathbf{v}_c)'\mathbf{V}^{-1}(\mathbf{x}_i - \mathbf{v}_c) \; . \quad (3.12.1)$$
We define the associated weights by $w_i = \min\left\{1, \frac{c}{Q_i}\right\}$, where c is usually set at the 95th percentile of the $\chi^2(p)$ distribution. Note that good points generally have weights 1.
We dene the associated weights by w
i
= min
_
1,
c
Q
i
_
, where c is usually set at the 95th
percentile of the
2
(p) distribution. Note that good points generally have weights 1.
The class of GR-estimates proposed by Sievers (1983) use weights of the form b
ij
= w
i
w
j
which depend only on distance in factor space. These estimates have positive breakdown
and bounded inuence in factor space, but as Exercise 3.16.41 shows they are always less
ecient than the Wilcoxon estimates, unless all weights are 1. Further, at times, the loss in
eciency can be severe; see Chang et al. (1999) for discussion. One reason is that good
points of high leverage (points that follow the model) are down weighted by the same amount
as points at the same distance from the center of factor space but which do not follow the
model (bad points of high leverage).
The weights for the HBR estimates are a function of the GR weights and residual infor-
mation from the Y -space. The residuals are based on a high breakdown initial estimate of
234 CHAPTER 3. LINEAR MODELS
the regression coecients. We have chosen to use the least trim squares (LTS) estimate
which is given by
$$\mathrm{Argmin} \sum_{i=1}^{h}\left[Y - \alpha - \mathbf{x}'\boldsymbol{\beta}\right]^2_{(i)} \; , \quad (3.12.2)$$
where $h = [n/2] + 1$ and where the notation (i) denotes the ith ordered absolute residual; see Rousseeuw and Van Driessen (1999). Let $\mathbf{e}^{(0)}$ denote the residuals from this initial fit. Define the function ψ(t) by ψ(t) = -1, t, or 1 according as t ≤ -1, -1 < t < 1, or t ≥ 1. Let σ be estimated by the initial scaling estimate $\mathrm{MAD} = 1.483\,\mathrm{med}_i\,|e_i^{(0)} - \mathrm{med}_j\,e_j^{(0)}|$.
Recall the robust distances $Q_i$ defined in expression (3.12.1) above. Let
$$m_i = \psi\!\left(\frac{b}{Q_i}\right) = \min\left\{1, \frac{b}{Q_i}\right\} \; ,$$
and consider the weights
$$\hat{b}_{ij} = \min\left\{1, \frac{c\,\hat{\sigma}}{|e_i|}\,\frac{\hat{\sigma}}{|e_j|}\,\min\left\{1, \frac{b}{Q_i}\right\}\min\left\{1, \frac{b}{Q_j}\right\}\right\} \; , \quad (3.12.3)$$
where the tuning constants b and c are both set at 4. From this point of view, it is clear that these weights downweight both outlying points in factor space and outlying responses. Note that the initial residual information is a multiplicative factor in the weight function. Hence, a good leverage point will generally have a small (in absolute value) initial residual, which will offset its distance in factor space. The following example illustrates the differences among the Wilcoxon, GR and HBR estimates.
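The pieces of (3.12.3) are simple to assemble; the following is a minimal R sketch, assuming the robustbase package for the MCD (covMcd) and the initial LTS fit (ltsReg). The function name hbrwts is hypothetical, and the sketch uses R's mad (constant 1.4826, essentially the 1.483 above) for the scale estimate.

library(robustbase)
hbrwts <- function(x, y, b = 4, cc = 4) {
  x <- as.matrix(x)
  mcd <- covMcd(x)                            # robust center and scatter
  Q <- mahalanobis(x, mcd$center, mcd$cov)    # robust distances Q_i
  e <- ltsReg(x, y)$residuals                 # residuals from initial LTS fit
  sig <- mad(e)                               # MAD scale estimate
  m <- pmin(1, b / Q)                         # psi(b/Q_i)
  pmin(1, ((cc * sig / abs(e)) %o% (sig / abs(e))) * (m %o% m))
}

The returned n x n matrix supplies the weights $\hat{b}_{ij}$; note how a small initial residual (|e_i| below c sigma-hat) offsets a large robust distance, exactly the trade-off described above.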
Example 3.12.1 (Stars Data). This data set is drawn from an astronomy study on the star cluster CYG OB1, which contains 47 stars; see Rousseeuw and Leroy (1987) for a discussion of the history of the data set. The response is the logarithm of the light intensity of the star, while the independent variable is the logarithm of the temperature of the star. The data are tabled in Table 3.12.1 and are shown in Panel A of Figure 3.12.1. Note that four of the stars, called giants, form a cluster of outliers in factor space, while the rest of the stars fall in a point cloud. Panel A also shows the overlay plot of the LS and Wilcoxon fits. Note that the cluster of four outliers in the x-space has exerted such a strong influence on the fits that it has drawn the LS and Wilcoxon fits towards the cluster. This behavior is predictable based on the influence functions of these estimates. These four giant cases have very large robust distances from the center of the data. Hence the weights used by the GR estimates severely downweight these points, resulting in a fit through the point cloud. For this data set, the initial LTS fit ignores the four giant stars and fits the point cloud. Hence, the four giant stars are bad leverage points and, hence, are downweighted for the HBR fit also.
The RBR command to compute the GR or HBR estimates is the same as for the Wilcoxon estimates, wwest, except that the argument for the weight indicator bij is set at bij="GR" or bij="HBR", respectively. For example, suppose the design matrix without the intercept column is in the variable xmat and the response vector is in the variable y. Then the following R commands place the LS, Wilcoxon, GR and HBR estimates as respective columns in the matrix ests.
ls.fit  = lm(y ~ xmat)
wil.fit = wwest(xmat, y, bij = "WIL", print.tbl = F)
gr.fit  = wwest(xmat, y, bij = "GR",  print.tbl = F)
hbr.fit = wwest(xmat, y, bij = "HBR", print.tbl = F)
ests    = cbind(ls.fit$coef, wil.fit$tmp1$coef, gr.fit$tmp1$coef, hbr.fit$tmp1$coef)
Example 3.12.2 (Stars Data, Continued). Suppose in the last example that we had no subject matter knowledge available. Then, based on the scatterplot, we may decide to fit a quadratic model. The plots of the LS, Wilcoxon, GR, and HBR fits for the quadratic model are found in Panel B of Figure 3.12.1. The quadratic fits based on the LS, Wilcoxon and HBR estimates follow the curvature in the data, while the GR fit misses the curvature, resulting in a very poor fit. For the quadratic model, the cluster of four giant stars are good data points, and the HBR weights take this into account. The weights used for the GR fit, however, ignore this residual information and severely downweight the four giant star cases, resulting in the poor fit shown in the figure.
The last two plots in the figure, Panels C and D, are the residual plots for the GR and HBR fits. Based on their fits, the LS and Wilcoxon residual plots are the same as the HBR. The pattern in the GR residual plot (Panel C), while not random, does not indicate how to proceed with model selection. This is often true for residual plots based on high breakdown fits; see McKean et al. (1993).
3.12.3 Asymptotic Normality of $\hat{\boldsymbol{\beta}}_{HBR}$
The asymptotic normality of the HBR estimates was developed by Chang (1995) and Chang et al. (1999). Much of our development is in Appendix A.6, which is taken from the article by Chang et al. (1999). Our discussion is for general weights, under assumptions that we will specify as we proceed. In order to establish asymptotic normality of $\hat{\boldsymbol{\beta}}_{HBR}$, we need some further notation and assumptions. Define the parameters
$$\gamma_{ij} = B_{ij}'(0)/E(b_{ij}) \; , \; \text{for } 1 \leq i, j \leq n \; , \quad (3.12.4)$$
where
$$B_{ij}(t) = E\left[b_{ij}\,I(0 < y_i - y_j < t)\right] \; . \quad (3.12.5)$$
Consider the symmetric $n \times n$ matrix $\mathbf{A}_n = [a_{ij}]$ defined by
$$a_{ij} = \begin{cases} -\gamma_{ij}b_{ij} & \text{if } i \neq j \\ \sum_{k\neq i}\gamma_{ik}b_{ik} & \text{if } i = j \end{cases} \; . \quad (3.12.6)$$
Table 3.12.1: Stars Data
Star  Log Temp  Log Intensity      Star  Log Temp  Log Intensity
1 4.37 5.23 25 4.38 5.02
2 4.56 5.74 26 4.42 4.66
3 4.26 4.93 27 4.29 4.66
4 4.56 5.74 28 4.38 4.90
5 4.30 5.19 29 4.22 4.39
6 4.46 5.46 30 3.48 6.05
7 3.84 4.65 31 4.38 4.42
8 4.57 5.27 32 4.56 5.10
9 4.26 5.57 33 4.45 5.22
10 4.37 5.12 34 3.49 6.29
11 3.49 5.73 35 4.23 4.34
12 4.43 5.45 36 4.62 5.62
13 4.48 5.42 37 4.53 5.10
14 4.01 4.05 38 4.45 5.22
15 4.29 4.26 39 4.53 5.18
16 4.42 4.58 40 4.43 5.57
17 4.23 3.94 41 4.38 4.62
18 4.42 4.18 42 4.45 5.06
19 4.23 4.18 43 4.50 5.34
20 3.49 5.89 44 4.45 5.34
21 4.29 4.38 45 4.55 5.54
22 4.29 4.22 46 4.45 4.98
23 4.42 4.42 47 4.42 4.50
24 4.49 4.85
Define the $p \times p$ matrix $\mathbf{C}_n$ as
$$\mathbf{C}_n = \mathbf{X}'\mathbf{A}_n\mathbf{X} \; . \quad (3.12.7)$$
Since the rows and columns of $\mathbf{A}_n$ sum to zero, it can be shown that
$$\mathbf{C}_n = \sum_{i<j}\gamma_{ij}b_{ij}(\mathbf{x}_j - \mathbf{x}_i)(\mathbf{x}_j - \mathbf{x}_i)' \; ; \quad (3.12.8)$$
see Exercise 3.15.21. Let
$$U_i = (1/n)\sum_{j=1}^n(\mathbf{x}_j - \mathbf{x}_i)\,E\left[b_{ij}\,\mathrm{sgn}(y_j - y_i)\,\middle|\,y_i\right] \; . \quad (3.12.9)$$
Besides assumptions (E.1), ( 3.4.1), (D.2), ( 3.4.7), and (D.3), ( 3.4.8) of Chapter 3, we need to assume additionally that:

(H.1) There exists a matrix $\mathbf{C}_H$ such that $n^{-2}\mathbf{C}_n = n^{-2}\mathbf{X}'\mathbf{A}_n\mathbf{X} \stackrel{P}{\rightarrow} \mathbf{C}_H$. (3.12.10)

(H.2) There exists a $p \times p$ matrix $\boldsymbol{\Sigma}_H$ such that $(1/n)\sum_{i=1}^{n} \mbox{Var}(\mathbf{U}_i) \rightarrow \boldsymbol{\Sigma}_H$. (3.12.11)

(H.3) $\sqrt{n}\big(\widehat{\boldsymbol{\beta}}^{(0)} - \boldsymbol{\beta}\big) \stackrel{D}{\rightarrow} N(\mathbf{0}, \boldsymbol{\Xi})$, where $\widehat{\boldsymbol{\beta}}^{(0)}$ is the initial estimator and $\boldsymbol{\Xi}$ is a positive definite matrix. (3.12.12)

(H.4) The weight function $b_{ij} = g\big(\mathbf{x}_i, \mathbf{x}_j, y_i, y_j, \widehat{\boldsymbol{\beta}}^{(0)}\big) \equiv g_{ij}\big(\widehat{\boldsymbol{\beta}}^{(0)}\big)$ is continuous and the gradient $\nabla g_{ij}$ is bounded uniformly in $i$ and $j$. (3.12.13)

For the correlation model, an explicit expression can be given for the matrix $\mathbf{C}_H$ assumed in (H.1); see ( 3.12.22) and, also, Lemma 3.12.1.
As our theory will show, the HBR estimate attains 50% breakdown (Section 3.12.4) and asymptotic normality, at rate $\sqrt{n}$, provided the initial estimates of location, scatter, and regression have these qualities. One such regression estimate is the least trimmed squares (LTS) estimate, which is given by expression (3.12.2). Another class of such estimates are the rank-based estimates proposed by Hossjer (1994); see also Croux, Rousseeuw and Hossjer (1994).
The development of the theory for $\widehat{\boldsymbol{\beta}}_{HBR}$ proceeds similarly to that of the R estimates. The theory is sketched in the appendix, Section A.6, and here we present only the two main results: the asymptotic distribution of the gradient and the asymptotic distribution of the estimate.
Theorem 3.12.1. Under assumptions (E.1), ( 3.4.1), and (H.1)-(H.4), ( 3.12.10)-( 3.12.13),
$$n^{-3/2}\mathbf{S}_{HBR}(\mathbf{0}) \stackrel{D}{\rightarrow} N(\mathbf{0}, \boldsymbol{\Sigma}_H)\,.$$
The proof of this theorem proceeds along the same lines as the theory used to obtain the null distribution of the gradients of the R estimates. The projection of $\mathbf{S}_{HBR}(\mathbf{0})$ is first determined and its asymptotic distribution is established as $N(\mathbf{0}, \boldsymbol{\Sigma}_H)$. The result then follows upon showing that the difference between $\mathbf{S}_{HBR}(\mathbf{0})$ and its projection goes to zero in second mean; see Theorem A.6.4 for details. The following theorem gives the asymptotic distribution of $\widehat{\boldsymbol{\beta}}_{HBR}$.
Theorem 3.12.2. Under assumptions (E.1), ( 3.4.1), and (H.1)-(H.4), ( 3.12.10)-( 3.12.13),
$$\sqrt{n}\big(\widehat{\boldsymbol{\beta}}_{HBR} - \boldsymbol{\beta}\big) \stackrel{D}{\rightarrow} N\big(\mathbf{0}, \textstyle{\frac{1}{4}}\mathbf{C}_H^{-1}\boldsymbol{\Sigma}_H\mathbf{C}_H^{-1}\big)\,.$$
The proof of this theorem is similar to that for the R estimates. First, asymptotic linearity and quadraticity are established. These results are then combined with Theorem 3.12.1 to yield the result; see Theorem A.6.1 of the Appendix for details.
The following lemma derives another representation of the limiting matrix $\mathbf{C}_H$, which will prove useful in the derivation of the influence function of $\widehat{\boldsymbol{\beta}}_{HBR}$ found in the next section and in Section 3.12.6, which concerns the implementation of these high breakdown estimates. For what follows, assume without loss of generality that the true parameter value is $\boldsymbol{\beta} = \mathbf{0}$. Let $g_{ij}\big(\widehat{\boldsymbol{\beta}}^{(0)}\big) \equiv b\big(\mathbf{x}_i, \mathbf{x}_j, y_i, y_j, \widehat{\boldsymbol{\beta}}^{(0)}\big)$ denote the weights as a function of the initial estimator. Let $g_{ij}(\mathbf{0}) \equiv b(\mathbf{x}_i, \mathbf{x}_j, y_i, y_j)$ denote the weight function evaluated at the true value $\boldsymbol{\beta} = \mathbf{0}$. The following result is proved in Lemma A.6.1 of the Appendix:
$$B_{ij}'(t) = \int\cdots\int b\big(\mathbf{x}_i, \mathbf{x}_j, y_j + t, y_j, \widehat{\boldsymbol{\beta}}^{(0)}\big)\, f(y_j + t)\, f(y_j) \prod_{k \neq i,j} f(y_k)\, dy_1 \cdots dy_n\,. \qquad (3.12.14)$$
It is further shown that $B_{ij}'(t)$ is continuous in $t$. The representation we want is:
Lemma 3.12.1. Under assumptions (E.1), ( 3.4.1), and (H.1)-(H.4), ( 3.12.10)-( 3.12.13),
$$E\left[\frac{1}{2}\,\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\int b(\mathbf{x}_i, \mathbf{x}_j, y_i, y_j)\, f^2(y_j)\, dy_j\, (\mathbf{x}_j - \mathbf{x}_i)(\mathbf{x}_j - \mathbf{x}_i)'\right] \rightarrow \mathbf{C}_H\,. \qquad (3.12.15)$$
Proof: By ( 3.12.4), ( 3.12.5), ( 3.12.8), and ( 3.12.14),
$$E\left[\frac{1}{n^2}\,\mathbf{C}_n\right] = \frac{1}{2}\,\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} B_{ij}'(0)\,(\mathbf{x}_j - \mathbf{x}_i)(\mathbf{x}_j - \mathbf{x}_i)'\,. \qquad (3.12.16)$$
Because $B_{ij}'(0)$ is uniformly bounded over all $i$ and $j$, and the matrix $(1/n^2)\sum_i\sum_j (\mathbf{x}_j - \mathbf{x}_i)(\mathbf{x}_j - \mathbf{x}_i)'$ converges to a positive definite matrix, the right side of ( 3.12.16) also converges. By Lemmas A.6.1 and A.6.3 of the Appendix, we have
$$B_{ij}'(0) = \int b(\mathbf{x}_i, \mathbf{x}_j, y_j, y_j)\, f^2(y_j)\, dy_j + o(1)\,, \qquad (3.12.17)$$
where the remainder term is uniformly small over all $i$ and $j$. Under Assumption (H.1), ( 3.12.10), the result follows.
Remark 3.12.1 (Empirical Efficiency). As noted above, there is always a loss of efficiency of the GR estimator relative to the Wilcoxon estimator. It was hoped that the HBR estimator would regain some of this efficiency. This was confirmed in a Monte Carlo study which is discussed in Section 8 of the article by Chang et al. (1999). In this study, over a series of designs which included contamination in both the responses and factor space, in all but two of the situations the empirical efficiency of the HBR estimate relative to the Wilcoxon estimate was larger than that of the GR estimate relative to the Wilcoxon estimate.
Remark 3.12.2 (Stability Study). To attain its full 50% breakdown, the HBR estimates require initial estimates with 50% breakdown. It is known that slight changes to centrally located data can cause some high breakdown estimates to change by a large amount. This was discussed for the high breakdown least median squares (LMS) estimates by Hettmansperger and Sheather (1992, 1993) and later confirmed in a Monte Carlo study by Sheather, McKean and Hettmansperger (1997). Part of the article by Chang et al. (1999) consisted of a stability study for the HBR estimator using LMS and LTS starting values. Over the situations investigated, the HBR estimates were much more stable than either the LTS or LMS estimates but were less stable than the Wilcoxon estimates.
3.12.4 Robustness Properties of the HBR Estimates
In this section we show that the HBR estimate can attain 50% breakdown and we derive its influence function. We show that its influence function is bounded in both the $x$ and the $Y$ spaces. The argument for breakdown is taken from Chang (1995), while the influence function derivation is taken from Chang et al. (1999).

Breakdown of the HBR Estimate

Let $\mathbf{Z} = \{\mathbf{z}_i = (\mathbf{x}_i, y_i),\ i = 1, \ldots, n\}$ denote the sample of data points and let $\|\cdot\|$ denote the Euclidean norm. Define the breakdown point of the estimator $\widehat{\boldsymbol{\beta}}$ at the sample $\mathbf{Z}$ as
$$\epsilon_n^*\big(\widehat{\boldsymbol{\beta}}, \mathbf{Z}\big) = \max\left\{\frac{m}{n} :\ \sup_{\mathbf{Z}^*}\big\|\widehat{\boldsymbol{\beta}}(\mathbf{Z}^*) - \widehat{\boldsymbol{\beta}}(\mathbf{Z})\big\| < \infty\right\}\,,$$
where the supremum is taken over all samples $\mathbf{Z}^*$ that can result from replacing $m$ observations in $\mathbf{Z}$ by arbitrary values. See, also, Definition 1.6.1.
We now state conditions under which the HBR estimate remains bounded.
Lemma 3.12.2. Suppose there exist finite constants $M_1 > 0$ and $M_2 > 0$ such that the following conditions hold:

(B1) $\inf_{\|\boldsymbol{\beta}\|=1}\,\sup_{ij}\, b_{ij}\big|(\mathbf{x}_j - \mathbf{x}_i)'\boldsymbol{\beta}\big| = M_1$.

(B2) $\sup_{ij}\, b_{ij}|y_j - y_i| = M_2$.

Then
$$\big\|\widehat{\boldsymbol{\beta}}_{HBR}\big\| < \frac{1}{M_1}\left[1 + 2\binom{n}{2}\right] M_2\,.$$
Proof: Note that
$$D_{HBR}(\boldsymbol{\beta}) \geq \sup_{ij}\, b_{ij}\big|y_j - y_i - (\mathbf{x}_j - \mathbf{x}_i)'\boldsymbol{\beta}\big| \geq \|\boldsymbol{\beta}\| M_1 - M_2 \geq 2\binom{n}{2} M_2$$
whenever $\|\boldsymbol{\beta}\| \geq \frac{1}{M_1}\left[1 + 2\binom{n}{2}\right] M_2$. Since $D_{HBR}(\mathbf{0}) = \sum_{i<j} b_{ij}|y_j - y_i| \leq \binom{n}{2} M_2$ and $D_{HBR}$ is a convex function of $\boldsymbol{\beta}$, it follows that $\widehat{\boldsymbol{\beta}}_{HBR} = \mbox{Argmin}\, D_{HBR}(\boldsymbol{\beta})$ satisfies
$$\big\|\widehat{\boldsymbol{\beta}}_{HBR}\big\| < \frac{1}{M_1}\left[1 + 2\binom{n}{2}\right] M_2\,.$$
The lemma follows.
For our result, we need to further assume that the data points $\mathbf{Z}$ are in general position; that is, any subset of $p+1$ of these points determines a unique solution $\boldsymbol{\beta}$. In particular, this implies that neither all of the $\mathbf{x}_i$'s nor all of the $y_i$'s are the same; hence, provided the weights have not broken down, both constants $M_1$ and $M_2$ of Lemma 3.12.2 are positive.
Theorem 3.12.3. Assume that the data points $\mathbf{Z}$ are in general position. Let $\mathbf{v}$, $\mathbf{V}$, and $\widehat{\boldsymbol{\beta}}^{(0)}$ denote the initial estimates of location, scatter, and $\boldsymbol{\beta}$. Let $\epsilon_n^*(\mathbf{v}, \mathbf{Z})$, $\epsilon_n^*(\mathbf{V}, \mathbf{Z})$, and $\epsilon_n^*\big(\widehat{\boldsymbol{\beta}}^{(0)}, \mathbf{Z}\big)$ denote their corresponding breakdown points. Then the breakdown point of the HBR estimator is
$$\epsilon_n^*\big(\widehat{\boldsymbol{\beta}}_{HBR}, \mathbf{Z}\big) = \min\left\{\epsilon_n^*(\mathbf{v}, \mathbf{Z}),\ \epsilon_n^*(\mathbf{V}, \mathbf{Z}),\ \epsilon_n^*\big(\widehat{\boldsymbol{\beta}}^{(0)}, \mathbf{Z}\big),\ 1/2\right\}\,. \qquad (3.12.18)$$
Proof: Corrupt $m$ points in the data set $\mathbf{Z}$ and let $\mathbf{Z}^*$ be the sample consisting of these corrupted points and the remaining $n - m$ points. Assume that $\mathbf{Z}^*$ is in general position and that $\mathbf{v}(\mathbf{Z}^*)$, $\mathbf{V}(\mathbf{Z}^*)$, and $\widehat{\boldsymbol{\beta}}^{(0)}(\mathbf{Z}^*)$ have not broken down. Then the constants $M_1$ and $M_2$ of Lemma 3.12.2 are positive and finite. Hence, by Lemma 3.12.2, $\|\widehat{\boldsymbol{\beta}}_{HBR}(\mathbf{Z}^*)\| < \infty$ and the theorem follows.
Based on this last result, the HBR estimate has 50% breakdown provided the initial estimates $\mathbf{v}$, $\mathbf{V}$, and $\widehat{\boldsymbol{\beta}}^{(0)}$ all have 50% breakdown. Assuming that the data points are in general position, the MCD estimates of location and scatter discussed near expression (3.12.1) have 50% breakdown. For initial estimates of the regression coefficients, again assuming that the data points are in general position, the LTS estimates, ( 3.12.2), have 50% breakdown; see, also, Hossjer (1994). The HBR estimates used in the examples of Section 3.12.6 employ the MCD estimates of location and scatter and the LTS estimate of the regression coefficients, resulting in the weights defined in (3.12.3).
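For readers who wish to compute such initial estimates directly, the following is a minimal R sketch, assuming the robustbase package is available; the wrapper init.hbr and its argument names are ours, purely for illustration, and any 50% breakdown location, scatter, and regression estimators may be substituted.

# Sketch: 50% breakdown initial estimates for the HBR weights.
# Assumes the robustbase package; covMcd computes the MCD estimates of
# location and scatter, and ltsReg computes the LTS regression estimate.
library(robustbase)
init.hbr <- function(x, y) {
  x   <- as.matrix(x)
  mcd <- covMcd(x)                      # MCD location and scatter of the x's
  lts <- ltsReg(x, y)                   # LTS fit of the regression model
  list(v = mcd$center,                  # initial location estimate v
       V = mcd$cov,                     # initial scatter estimate V
       beta0  = lts$coefficients[-1],   # initial regression coefficients
       resid0 = lts$residuals)          # initial residuals, used in the weights
}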
Influence Function of the HBR Estimate

In order to derive the influence function, we start with the gradient equation $\mathbf{S}(\boldsymbol{\beta}) \doteq \mathbf{0}$, written as
$$\mathbf{0} \doteq \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} b_{ij}\,\mbox{sgn}(z_j - z_i)(\mathbf{x}_j - \mathbf{x}_i)\,.$$
Note by Lemma A.6.3 of the Appendix that $b_{ij} = g_{ij}(\mathbf{0}) + O_p(1/\sqrt{n})$, so that the defining equation may be written as
$$\mathbf{0} \doteq \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} g_{ij}(\mathbf{0})\,\mbox{sgn}(z_j - z_i)(\mathbf{x}_j - \mathbf{x}_i)\,, \qquad (3.12.19)$$
ignoring a remainder term of magnitude $O_p(1/\sqrt{n})$.
Influence functions are derived at the model where both $\mathbf{x}$ and $y$ are stochastic; hence, consider the correlation model of Section ??,
$$y = \mathbf{x}'\boldsymbol{\beta} + e\,, \qquad (3.12.20)$$
where $e$ has density $f$, $\mathbf{x}$ is a $p \times 1$ random vector with density function $m$, and $e$ and $\mathbf{x}$ are independent. Let $F$ and $M$ denote the corresponding distribution functions of $e$ and $\mathbf{x}$. Let $H$ and $h$ denote the joint distribution function and density of $y$ and $\mathbf{x}$. It then follows that
$$h(\mathbf{x}, y) = f(y - \mathbf{x}'\boldsymbol{\beta})\, m(\mathbf{x})\,. \qquad (3.12.21)$$
If we rewrite equation ( 3.12.19) using the Stieltjes integral notation of the empirical distribution of $(\mathbf{x}_i, y_i)$, for $i = 1, \ldots, n$, we see that the functional $T(H)$ solves the equation
$$\mathbf{0} = \int\!\!\int b(\mathbf{x}_1, \mathbf{x}_2, y_1, y_2)\,\mbox{sgn}\big\{y_2 - y_1 - (\mathbf{x}_2 - \mathbf{x}_1)'T(H)\big\}(\mathbf{x}_2 - \mathbf{x}_1)\, dH(\mathbf{x}_1, y_1)\, dH(\mathbf{x}_2, y_2)\,.$$
Let $I(a < b) = 1$ or $0$, depending on whether $a < b$ or $a > b$. Then, using the fact that the sign function is odd and the symmetry of the weight function in its $x$ and $y$ arguments, we can write the defining equation of the functional $T(H)$ as
$$\mathbf{0} = \int\!\!\int \mathbf{x}_1\, b(\mathbf{x}_1, \mathbf{x}_2, y_1, y_2)\left[I\big(y_2 - y_1 < (\mathbf{x}_2 - \mathbf{x}_1)'T(H)\big) - \frac{1}{2}\right] dH(\mathbf{x}_1, y_1)\, dH(\mathbf{x}_2, y_2)\,.$$
Define the matrix $\mathbf{C}_H$ by
$$\mathbf{C}_H = \frac{1}{2}\int\!\!\int\!\!\int (\mathbf{x}_2 - \mathbf{x}_1)\, b(\mathbf{x}_1, \mathbf{x}_2, y_1, y_1)\, (\mathbf{x}_2 - \mathbf{x}_1)'\, f^2(y_1)\, dy_1\, dM(\mathbf{x}_1)\, dM(\mathbf{x}_2)\,. \qquad (3.12.22)$$
Note that under the correlation model $\mathbf{C}_H$ is the limiting matrix assumed in Assumption (H.1), ( 3.12.10); see Lemma 3.12.1.
The next theorem gives the influence function of $\widehat{\boldsymbol{\beta}}_{HBR}$. Its proof is given in Theorem A.5.1 of the Appendix.
Theorem 3.12.4. The influence function for the estimate $\widehat{\boldsymbol{\beta}}_{HBR}$ is given by
$$\Omega\big(\mathbf{x}_0, y_0; \widehat{\boldsymbol{\beta}}_{HBR}\big) = \mathbf{C}_H^{-1}\,\frac{1}{2}\int\!\!\int (\mathbf{x}_0 - \mathbf{x}_1)\, b(\mathbf{x}_1, \mathbf{x}_0, y_1, y_0)\,\mbox{sgn}\{y_0 - y_1\}\, dF(y_1)\, dM(\mathbf{x}_1)\,, \qquad (3.12.23)$$
where $\mathbf{C}_H$ is given by expression ( 3.12.22).
In order to show that the influence function correctly identifies the asymptotic distribution of the estimator, define $\mathbf{W}_i$ as
$$\mathbf{W}_i = \int\!\!\int (\mathbf{x}_i - \mathbf{x}_1)\, b(\mathbf{x}_1, \mathbf{x}_i, y_1, y_i)\,\mbox{sgn}(y_i - y_1)\, dF(y_1)\, dM(\mathbf{x}_1)\,. \qquad (3.12.24)$$
Next, write $\mathbf{W}_i$ in terms of a Stieltjes integral over the empirical distribution of $(\mathbf{x}_j, y_j)$ as
$$\mathbf{W}_i^* = \frac{1}{n}\sum_{j=1}^{n} (\mathbf{x}_i - \mathbf{x}_j)\, b(\mathbf{x}_j, \mathbf{x}_i, y_j, y_i)\,\mbox{sgn}(y_i - y_j)\,. \qquad (3.12.25)$$
If we can show that $(1/\sqrt{n})\sum_{i=1}^{n}\mathbf{W}_i^* \stackrel{D}{\rightarrow} N(\mathbf{0}, \boldsymbol{\Sigma}_H)$, then we are done. From the proof of Theorem A.6.4 in the Appendix, it will suffice to show that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big(\mathbf{U}_i - \mathbf{W}_i^*\big) \stackrel{P}{\rightarrow} \mathbf{0}\,, \qquad (3.12.26)$$
where $\mathbf{U}_i = (1/n)\sum_{j=1}^{n}(\mathbf{x}_i - \mathbf{x}_j)\, E[b_{ij}\,\mbox{sgn}(y_i - y_j) \mid y_i]$. Writing the left hand side of ( 3.12.26) as
$$(1/n^{3/2})\sum_{i=1}^{n}\sum_{j=1}^{n}(\mathbf{x}_i - \mathbf{x}_j)\left\{E[b_{ij}\,\mbox{sgn}(y_i - y_j) \mid y_i] - g_{ij}(\mathbf{0})\,\mbox{sgn}(y_i - y_j)\right\}\,,$$
where $g_{ij}(\mathbf{0}) \equiv b(\mathbf{x}_j, \mathbf{x}_i, y_j, y_i)$, the proof is analogous to the proof of Theorem A.6.4.
3.12.5 Discussion

The influence function, $\Omega(\mathbf{x}_0, y_0; \widehat{\boldsymbol{\beta}}_{HBR})$, of the HBR estimate is a continuous function of $\mathbf{x}_0$ and $y_0$. With a proper choice of weight function it is bounded in both the $x$ and $Y$ spaces. This is true for the weights given by (3.12.3); furthermore, for these weights $\Omega(\mathbf{x}_0, y_0; \widehat{\boldsymbol{\beta}}_{HBR})$ goes to zero as $\mathbf{x}_0$ and $y_0$ get large in any direction.
The influence function $\Omega(\mathbf{x}_0, y_0; \widehat{\boldsymbol{\beta}})$ is a generalization of the influence functions for the Wilcoxon estimates; see Exercise 3.15.22. Panel A of Figure 3.12.2 shows the influence function of the HBR estimate for the special case where $(x, Y)$ has a bivariate normal distribution with mean $\mathbf{0}$ and the identity matrix as the variance-covariance matrix. For this plot we used the weights given by ( ??) where $m_i = (b/x_i^2)$ with the constants $b = c = 4$. For comparison purposes, the influence functions of the Wilcoxon and GR estimates are also shown, in Panels B and C of Figure 3.12.2. The Wilcoxon influence function is bounded
in the $Y$ space but is unbounded in the $x$ space, while the GR influence function is bounded in both spaces. Note that because the weights of the GR estimate do not depend on $Y$, its influence function does not taper to 0 as $y_0 \rightarrow \infty$, as the influence function of the HBR estimate does. For all three plots, we used the method of Monte Carlo (10,000 simulations for each of 1600 grid points) to perform the numerical integration. The plot of the Wilcoxon influence function is an easily verifiable check on the Monte Carlo because of its closed form, ( 3.5.17).
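To make this numerical integration concrete, here is a minimal R sketch under the assumptions of the plot ($p = 1$, with $x$ and $Y$ independent standard normals); the function names if.mc and bfun are ours, and bfun is a generic placeholder for the weight function.

# Sketch: Monte Carlo evaluation of the influence function (3.12.23)
# at a point (x0, y0) for p = 1 with x and Y independent N(0,1).
if.mc <- function(x0, y0, bfun, nsim = 10000) {
  x1 <- rnorm(nsim); y1 <- rnorm(nsim); x2 <- rnorm(nsim)
  # inner integral of (3.12.23)
  num <- 0.5 * mean((x0 - x1) * bfun(x1, x0, y1, y0) * sign(y0 - y1))
  # C_H of (3.12.22); the integral against f^2(y) is written as E[ b * f(Y) ]
  ch  <- 0.5 * mean((x2 - x1)^2 * bfun(x1, x2, y1, y1) * dnorm(y1))
  num / ch
}
# Check: with bfun identically 1, if.mc(1, 1, function(x1, x2, y1, y2) 1)
# is close to the closed form 2 * sqrt(pi) * (pnorm(1) - 0.5) of ( 3.5.17).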
High breakdown estimates can have unbounded influence functions. Such estimates can have instability problems, as discussed in Sheather, McKean and Hettmansperger (1999) for the LMS estimate, which has unbounded influence in the $x$ space at the quartiles of $Y$. The generalized S estimators discussed in Croux et al. (1994) also have unbounded influence functions in the $x$ space at particular values of $Y$. In contrast, the influence function of the HBR estimate is bounded everywhere. This helps to explain its more stable behavior than the LMS in the stability study discussed in Chang et al. (1999).
3.12.6 Implementation and Examples

In this section, we discuss how to estimate the standard errors of the HBR estimates and how to properly standardize the residuals. We then consider two examples.

Standard Errors and Studentized Residuals

Using the asymptotic distribution of the HBR estimate as a guideline, and upon substituting the estimated weights for the true weights, we can estimate the asymptotic standard errors for these estimates. The asymptotic variance-covariance matrix of $\widehat{\boldsymbol{\beta}}_{HBR}$ is a function of the two matrices $\boldsymbol{\Sigma}_H$ and $\mathbf{C}_H$, given in ( 3.12.11) and ( 3.12.10), respectively. The matrix $\boldsymbol{\Sigma}_H$ is the variance-covariance matrix of the random vector $\mathbf{U}_i$, ( 3.12.9). We can approximate $\mathbf{U}_i$ by the expression
$$\widehat{\mathbf{U}}_i = \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \mathbf{x}_i)\,\widehat{b}_{ij}\,\big(1 - 2F_n(\widehat{e}_i)\big)\,, \qquad (3.12.1)$$
where the $\widehat{b}_{ij}$ are the estimated weights, the $\widehat{e}_i$ are the HBR residuals, and $F_n$ is the empirical distribution function of the residuals. Our estimate of $\boldsymbol{\Sigma}_H$ is then the sample variance-covariance matrix of $\widehat{\mathbf{U}}_1, \ldots, \widehat{\mathbf{U}}_n$; i.e.,
$$\widehat{\boldsymbol{\Sigma}}_H = \frac{1}{n-1}\sum_{i=1}^{n}\big(\widehat{\mathbf{U}}_i - \overline{\mathbf{U}}\big)\big(\widehat{\mathbf{U}}_i - \overline{\mathbf{U}}\big)'\,. \qquad (3.12.2)$$
For the matrix $\mathbf{C}_H$, consider the results in Lemma 3.12.1. Upon substituting the estimated weights for the weights, expression ( 3.12.17) simplifies to
$$B_{ij}'(0) \doteq \widehat{b}_{ij}\int f^2(t)\, dt = \widehat{b}_{ij}\,\frac{1}{\sqrt{12}\,\tau_W}\,, \qquad (3.12.3)$$
where $\tau_W$ is the scale parameter (3.4.4) for the Wilcoxon score function; i.e., $\tau_W = [\sqrt{12}\int f^2(t)\, dt]^{-1}$. To estimate $\tau_W$, we use the estimator $\widehat{\tau}_W$ given in expression ( 3.7.8). Now, approximating $b_{ij}$ in $\mathbf{C}_n$ using ( 3.12.3) leads to the estimate
$$n^{-2}\widehat{\mathbf{C}}_n = \frac{1}{4\sqrt{3}\,\widehat{\tau}_W\, n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\widehat{b}_{ij}\,(\mathbf{x}_j - \mathbf{x}_i)(\mathbf{x}_j - \mathbf{x}_i)'\,. \qquad (3.12.4)$$
Similar to the R and GR estimates, we estimate the intercept by
$$\widehat{\alpha}_{HBR} = \mbox{med}_{1 \leq i \leq n}\big\{y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_{HBR}\big\}\,. \qquad (3.12.5)$$
Because $\sqrt{n}\big(\widehat{\boldsymbol{\beta}}_{HBR} - \boldsymbol{\beta}\big)$ is bounded in probability and $\mathbf{X}$ is centered, it follows, using an argument very similar to the corresponding result for the R estimates (see McKean et al., 1990), that the joint asymptotic distribution of $\widehat{\alpha}_{HBR}$ and $\widehat{\boldsymbol{\beta}}_{HBR}$ is given by
$$\sqrt{n}\left[\begin{pmatrix}\widehat{\alpha}_{HBR}\\ \widehat{\boldsymbol{\beta}}_{HBR}\end{pmatrix} - \begin{pmatrix}\alpha\\ \boldsymbol{\beta}\end{pmatrix}\right] \stackrel{D}{\rightarrow} N\left[\begin{pmatrix}0\\ \mathbf{0}\end{pmatrix}, \begin{pmatrix}\tau_S^2 & \mathbf{0}'\\ \mathbf{0} & \frac{1}{4}\mathbf{C}_H^{-1}\boldsymbol{\Sigma}_H\mathbf{C}_H^{-1}\end{pmatrix}\right]\,, \qquad (3.12.6)$$
where $\tau_S$ is defined by (3.4.6); see the discussion around expression (1.5.28) for estimation of this scale parameter.
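The following R sketch assembles these pieces; it is only a direct transcription of the formulas above, not the RBR implementation. The objects xc (the centered design matrix), ehat (the HBR residuals), bhat (the $n \times n$ matrix of estimated weights), and tauhat (the estimate of $\tau_W$ from ( 3.7.8)) are assumed to be available.

# Sketch: standard errors of the HBR estimates from (3.12.1)-(3.12.4)
# and Theorem 3.12.2.
hbr.se <- function(xc, ehat, bhat, tauhat) {
  n  <- nrow(xc); p <- ncol(xc)
  Fn <- rank(ehat) / n                  # empirical cdf at the residuals
  U  <- matrix(0, n, p)                 # rows are the Uhat_i of (3.12.1)
  Cn <- matrix(0, p, p)
  for (i in 1:n) {
    d <- xc - matrix(xc[i, ], n, p, byrow = TRUE)   # rows are x_j - x_i
    U[i, ] <- colMeans(d * bhat[i, ]) * (1 - 2 * Fn[i])
    Cn <- Cn + crossprod(d * sqrt(bhat[i, ]))       # weights are nonnegative
  }
  SigH <- var(U)                                    # Sigma_H-hat, (3.12.2)
  Cn   <- Cn / (4 * sqrt(3) * tauhat * n^2)         # n^{-2} C_n-hat, (3.12.4)
  Ci   <- solve(Cn)
  sqrt(diag(Ci %*% SigH %*% Ci) / (4 * n))          # SEs via Theorem 3.12.2
}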
3.12.7 Studentized Residuals

An important use of robust residuals is in the detection of outliers. This is most easily done when the residuals are correctly studentized by an estimate of their standard deviation. Let $\widehat{\boldsymbol{\beta}}_{HBR}$ and $\widehat{\alpha}_{HBR}$ be the estimates of the last section for $\boldsymbol{\beta}$ and $\alpha$, respectively. Denote the residual for the $i$th case by
$$e_i^* = y_i - \widehat{\alpha}_{HBR} - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_{HBR}\,, \qquad (3.12.7)$$
and the vector of residuals by $\mathbf{e}^*$. Using ( 3.12.6), a first-order approximation of the standard deviation of the residuals $e_i^*$ can be obtained in the same way as in the derivation of Studentized residuals for regular R estimates; see the development proposed for robust estimates, in general, by McKean, Sheather, and Hettmansperger (1990, 1993).
As briefly outlined in the Appendix, this development for the HBR residuals results in the first-order approximation
$$\mbox{Var}(\mathbf{e}^*) \doteq \sigma^2\mathbf{I} + \tau_S^2 H_1 + \frac{1}{4}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\boldsymbol{\Sigma}_H(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - 2\tau_S\delta_1 H_1 - \frac{\sqrt{12}}{2}\,\delta_2\,\tau\left[\mathbf{A}_{\varphi}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{A}_{\varphi}'\right]\,, \qquad (3.12.8)$$
where $\sigma^2$ is the variance of $e_i$, $\delta_1 = E[|e_i|]$, $\delta_2 = E[e_i(2F(e_i) - 1)]$, $H_1 = n^{-1}\mathbf{1}\mathbf{1}'$, and $\mathbf{A}_{\varphi}$ is defined above expression ( A.6.17) of the Appendix.
We recommend estimating $\sigma$ by $\widehat{\sigma} = \mbox{MAD} = 1.483\,\mbox{med}_i\big\{|e_i^* - \mbox{med}_j\{e_j^*\}|\big\}$; $\delta_1$ by
$$\widehat{\delta}_1 = \frac{1}{n}\sum_{i=1}^{n}|e_i^*|\,; \qquad (3.12.9)$$
and $\delta_2$ by
$$\widehat{\delta}_2 = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{R(e_i^*)}{n+1} - \frac{1}{2}\right] e_i^*\,, \qquad (3.12.10)$$
which is a consistent estimate of $\delta_2$; see McKean et al. (1990). Replacing $a_{ij}$ by $\widehat{a}_{ij}$, ( ??), yields an estimate of the matrix $\mathbf{A}_{\varphi}$. Estimation of $\tau$ was discussed in Section ??. Let $\widehat{\mathbf{V}}$ denote the estimate of $\mbox{Var}(\mathbf{e}^*)$.
Let $s_{e_i^*}^2$ denote the $i$th diagonal entry of $\widehat{\mathbf{V}}$. Define the Studentized residuals by
$$\widetilde{e}_i^* = \frac{e_i^*}{s_{e_i^*}}\,. \qquad (3.12.11)$$
As in LS, these standard errors correct for both the underlying variance of the errors and location. For flagging outliers, appropriate benchmarks for these residuals are $\pm 2$; see McKean et al. (1990, 1993) for discussion.
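As a short illustration, assuming the estimate Vhat of $\mbox{Var}(\mathbf{e}^*)$ has been computed as above (the object names here are ours), flagging proceeds as follows:

# Sketch: Studentized HBR residuals (3.12.11) and the +/- 2 benchmark.
estar   <- ehat / sqrt(diag(Vhat))   # studentized residuals
flagged <- which(abs(estar) > 2)     # cases flagged as potential outliers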
3.12.8 Examples

Similar to the discussion in Section ?? on the difference between GR estimates and R estimates, high breakdown estimates and highly efficient estimates often give conflicting results. High breakdown estimates are less sensitive to outliers and clusters of outliers in the $x$-space; hence, for data sets where this is a problem, high breakdown estimates often give better fits than highly efficient fits. On the other hand, similar to the GR estimates, the HBR estimates are hampered in fitting and detecting curvature, while this is not true of the highly efficient estimates. We choose two examples which illustrate these disagreements between the high breakdown HBR fit and the highly efficient Wilcoxon fit.
To obtain the HBR estimates we used the weights given by ( ??). As initial estimates of location and scatter we chose the MVE estimates discussed in Section ??. They were computed by the algorithm proposed by Hadi and Simonoff (1993). For initial estimates of the regression coefficients we chose the LMS estimates given by
$$\big(\widehat{\alpha}^{(0)}, \widehat{\boldsymbol{\beta}}^{(0)\prime}\big) = \mbox{Argmin}\;\mbox{med}_{1 \leq i \leq n}\big\{y_i - \alpha - \mathbf{x}_i'\boldsymbol{\beta}\big\}^2\,; \qquad (3.12.12)$$
see Rousseeuw and Leroy (1987). These estimates have 50% breakdown. We computed them with the algorithm written by Stromberg (1993). The Mallows tuning constant $b$ was set at the 95th percentile of $\chi^2(p)$. The tuning constant $c$ was set at $[\mbox{med}\{a_i\} + 3\,\mbox{MAD}\{a_i\}]^2$, where $a_i = \widehat{e}_i\big(\widehat{\boldsymbol{\beta}}^{(0)}\big)/(\widehat{\sigma}\,\widehat{m}_i)$ and $\widehat{\sigma} = \mbox{MAD}$, ( ??). Once the weights are computed, a Gauss-Newton type algorithm, similar to that used by RGLM (see Kapenga et al., 1988), is used to actually obtain the HBR estimates.
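As a concrete, hedged illustration of these tuning constants, the following R sketch computes $b$ and $c$; here e0 denotes the initial LMS residuals, m the robust distances $\widehat{m}_i$, and xmat the design matrix, all assumed to have been computed already.

# Sketch: the Mallows tuning constants just described.
p   <- ncol(xmat)                       # number of regression coefficients
b   <- qchisq(0.95, df = p)             # b: 95th percentile of chi-square(p)
sig <- mad(e0)                          # sigma-hat = MAD of the initial residuals
a   <- e0 / (sig * m)                   # a_i = e_i(beta0) / (sigma-hat * m_i)
cc  <- (median(a) + 3 * mad(a))^2       # c = [med a_i + 3 MAD a_i]^2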
Example 3.12.3 (Stars Data, Example 3.12.1 Continued).
Table 3.12.1: Estimates of Coefficients for the Quadratic Data

  Fit        Intercept   Linear   Quadratic
  Wilcoxon     -.665      5.95      -.652
  HBR           .422      4.64      -.375
  LMS          1.12       3.65      -.141
In Example 3.12.1 we compared the GR and Wilcoxon fits for a data set consisting of measurements on stars. Recall that there is a cluster of outliers in the $x$-space which greatly influences the Wilcoxon fit but has little effect on the GR fit. The HBR estimates of the intercept and slope are $-6.91$ and $2.71$, respectively, which are quite close to the GR estimates displayed in Table ??. Hence, similar to the GR estimates, the outlying cluster of giant stars has little effect on the HBR fit.
Example 3.12.4 (Quadratic Data).
In order to demonstrate the problems that high breakdown estimates have in fitting curvature, we simulated data from the following quadratic model:
$$Y_i = 5.5|x_i| - 0.6x_i^2 + e_i\,, \qquad (3.12.13)$$
where the $e_i$'s were simulated iid $N(0,1)$ variates and the $x_i$'s were simulated contaminated normal variates with the contamination proportion set at .25 and the ratio of the variance of the contaminated part to the noncontaminated part set at 16. Panel A of Figure 3.12.1 displays a scatterplot of the data overlaid with the Wilcoxon, HBR, and LMS fits. The estimated coefficients for these fits are in Table 3.12.1. As shown, the Wilcoxon fit is quite good in fitting the curvature of the data, and its estimates are close to the true values. On the other hand, the high breakdown fits are quite poor. The LMS fit missed the curvature in the data. This is true too for the HBR fit, although the fit did correct itself somewhat from the poor LMS starting values. Panels B and C of Figure 3.12.1 contain the internal Studentized residual plots based on the Wilcoxon and the HBR fits, respectively. Based on the Wilcoxon residual plot, no further models would be considered. The HBR residual plot shows as outliers the two points which were fitted poorly. It also has a mild linear trend in it, which is not helpful since a linear term was fit. This trend is true for the LMS residual plot (Panel D), although it gives an overall impression of the lack of a quadratic term in the model. In such cases in practice, a higher-degree polynomial may be fitted, which in this case would be incorrect. Difficulties in reading residual plots from high breakdown fits, as encountered here, were discussed in Section ??; see, also, McKean et al. (1993).
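For readers who wish to experiment with this example, a minimal R sketch generating data from Model ( 3.12.13) follows; the sample size and seed are our choices, as they are not stated above.

# Sketch: data from the quadratic model (3.12.13) with contaminated
# normal x's (contamination proportion .25, variance ratio 16).
set.seed(101)                               # illustrative seed
n <- 50                                     # illustrative sample size
x <- rnorm(n, sd = ifelse(runif(n) < 0.25, 4, 1))
y <- 5.5 * abs(x) - 0.6 * x^2 + rnorm(n)
xmat <- cbind(x, x^2)                       # design for the quadratic fit
# e.g., wwest(xmat, y) and wwest(xmat, y, bij = "HBR") give the
# Wilcoxon and HBR fits of the quadratic model.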
3.13 Diagnostics for Differentiating between Fits

Is the least squares (LS) fit appropriate for the data at hand? How different would a more robust fit be from the LS fit? Is a high breakdown estimator necessary, or is a highly efficient robust estimator sufficient? In this section, we present simple, intuitive diagnostics which help answer these questions. These measure the difference in fits among the LS, highly efficient R, and high breakdown R fits. These diagnostics were developed by McKean et al. (1996a, 1999); see, also, McKean and Sheather (2009) for a recent discussion. The package RBR computes them; see Terpstra and McKean (2005) and McKean, Terpstra and Kloke (2009). We sketch the development for the diagnostics that differentiate between the LS and R fits first. Also, we focus on Wilcoxon scores, leaving the general scores analog to the exercises. We begin by looking at the difference in the LS and Wilcoxon (W) fits, which leads to our diagnostics.
Consider the linear model (3.2.3). The design matrix is centered, so the LS estimator is $\widehat{\boldsymbol{\beta}}_{LS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$. Let $\widehat{\boldsymbol{\beta}}_W$ denote the R estimate, immediately following expression (3.2.7), based on Wilcoxon scores ($\varphi(u) = \sqrt{12}[u - (1/2)]$). We use $\tau_W$ to denote the corresponding scale parameter $\tau_{\varphi}$. We then have the following theorem:
Theorem 3.13.1. Assume that the random errors $e_i$ of Model (3.2.3) have finite variance $\sigma^2$. Also, assume that Assumptions (E1), (E2), (D1) and (D2) of Section 3.4 are true. Then $\widehat{\boldsymbol{\beta}}_{LS} - \widehat{\boldsymbol{\beta}}_W$ is asymptotically normal with mean $\mathbf{0}$ and variance-covariance matrix
$$\mbox{Var}\big(\widehat{\boldsymbol{\beta}}_{LS} - \widehat{\boldsymbol{\beta}}_W\big) = \theta^2(\mathbf{X}'\mathbf{X})^{-1}\,, \qquad (3.13.1)$$
where $\theta^2 = \sigma^2 + \tau_W^2 - 2\delta$, $\tau_W$ is the scale parameter (??) for Wilcoxon scores, and $\delta = \sqrt{12}\,\tau_W\, E[e(F(e) - 1/2)]$.
The proof of this theorem can be found in the Appendix. The parameter $\theta^2$ is nonnegative. To see this, note that $E[e(F(e) - 1/2)] = \mbox{Cov}(e, F(e)) \geq 0$; hence $\delta \geq 0$. Then, by the Cauchy-Schwarz inequality,
$$\delta = \tau_W\, E\big[e\,\sqrt{12}(F(e) - 1/2)\big] \leq \tau_W\sqrt{E(e^2)\, E\big(\sqrt{12}(F(e) - 1/2)\big)^2} = \tau_W\,\sigma\cdot 1 = \tau_W\sigma\,,$$
so that $\theta^2 \geq \sigma^2 + \tau_W^2 - 2\tau_W\sigma = (\sigma - \tau_W)^2 \geq 0$.
Outliers have an effect on the LS estimate of the intercept, $\widehat{\alpha}_{LS}$, also. Hence this measurement of the overall difference in the R and LS estimates needs to be based on the estimates of $\alpha$ too. The problems to which we apply our diagnostics are often messy. In particular, we do not want to preclude errors with skewed distributions. Hence, as in Section ??, our R estimate of the intercept is the median of the Wilcoxon residuals; i.e., $\widehat{\alpha}_S = \mbox{med}\{Y_i - \mathbf{x}_i'\widehat{\boldsymbol{\beta}}_W\}$. This, however, raises a problem, since $\widehat{\alpha}_{LS}$ is a consistent estimate of the mean of the errors, $\mu_e$, while $\widehat{\alpha}_S$ is a consistent estimate of the median of the errors, $\theta_e$. We define the difference in these target values to be
$$\delta_d = \mu_e - \theta_e\,. \qquad (3.13.2)$$
One way to avoid a problem here is to assume that the errors have a symmetric distribution, i.e., $\delta_d = 0$, but this is undesirable in developing diagnostics for exploratory analysis. Instead,
we consider measures composed of two parts: one part measures the difference in the slope parameters and the other part measures the difference in the estimates of the intercept. First, the following notation is convenient here. Let $\mathbf{b} = (\alpha, \boldsymbol{\beta}')'$ denote the vector of parameters, and let $\widehat{\mathbf{b}}_{LS}$ and $\widehat{\mathbf{b}}_W$ denote respectively the LS and Wilcoxon estimates of $\mathbf{b}$.
The version of Theorem 3.13.1 that includes the intercept is the following corollary. Let $\tau_s = 1/(2f(\theta_e))$, where $f$ is the error density and $\theta_e$ is its median.

Corollary 3.13.1. Under the assumptions of Theorem 3.13.1, $\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{LS}$ is asymptotically normal with mean vector $(\delta_d, \mathbf{0}')'$ and variance-covariance matrix
$$\mbox{Var}\begin{pmatrix}\widehat{\alpha}_{LS} - \widehat{\alpha}_S\\ \widehat{\boldsymbol{\beta}}_{LS} - \widehat{\boldsymbol{\beta}}_W\end{pmatrix} \doteq \begin{pmatrix}\theta_s^2/n & \mathbf{0}'\\ \mathbf{0} & \theta^2(\mathbf{X}'\mathbf{X})^{-1}\end{pmatrix}\,, \qquad (3.13.3)$$
where $\theta_s^2 = \sigma^2 + \tau_s^2 - 2\tau_s E(e\,\mbox{sgn}(e))$.
By the Cauchy-Schwarz inequality, $E(e\,\mbox{sgn}(e)) = E|e| \leq \sigma$, so that $\theta_s^2 \geq (\sigma - \tau_s)^2 \geq 0$. Hence, the parameter $\theta_s^2 \geq 0$.
A simple diagnostic to measure the difference between the LS and Wilcoxon fits suggested by this corollary is given by $(\widehat{\mathbf{b}}_{LS} - \widehat{\mathbf{b}}_W)'\mathbf{A}_D^{-1}(\widehat{\mathbf{b}}_{LS} - \widehat{\mathbf{b}}_W)$, where $\mathbf{A}_D$ is the covariance matrix ( 3.13.3). This would have an asymptotic $\chi^2$ distribution with $p+1$ degrees of freedom. Monte Carlo studies, however, showed that it was too liberal. The major problem is that if the LS and Wilcoxon fits are close, then $\widehat{\tau}_W$ is close to $\widehat{\sigma}$, which can lead to a practical singularity of the matrix $\mathbf{A}_D$; see McKean et al. (1996a, 1999) for discussion. One practical solution is to standardize with the asymptotic covariance matrix of the Wilcoxon estimate. This leads to the diagnostic
$$\mbox{TDBETAS}(LS, W) = (\widehat{\mathbf{b}}_{LS} - \widehat{\mathbf{b}}_W)'\widehat{\mathbf{A}}_W^{-1}(\widehat{\mathbf{b}}_{LS} - \widehat{\mathbf{b}}_W)\,, \qquad (3.13.4)$$
where
$$\widehat{\mathbf{A}}_W = \begin{pmatrix}\widehat{\tau}_s^2/n & \mathbf{0}'\\ \mathbf{0} & \widehat{\tau}_W^2(\mathbf{X}'\mathbf{X})^{-1}\end{pmatrix}\,, \qquad (3.13.5)$$
$\widehat{\tau}_W$ is the robust estimator of $\tau_W$ given in expression (3.7.8), for Wilcoxon scores, and $\widehat{\tau}_s$ is the robust estimator of $\tau_s$ discussed around expression (??).
The diagnostic TDBETAS(LS, W) decomposes into separate intercept and slope terms:
$$\begin{array}{rcl}
\mbox{TDBETAS}(LS, W) &=& (n/\widehat{\tau}_s^2)(\widehat{\alpha}_{LS} - \widehat{\alpha}_S)^2 + (1/\widehat{\tau}_W^2)\big(\widehat{\boldsymbol{\beta}}_{LS} - \widehat{\boldsymbol{\beta}}_W\big)'\mathbf{X}'\mathbf{X}\big(\widehat{\boldsymbol{\beta}}_{LS} - \widehat{\boldsymbol{\beta}}_W\big)\\
&=& (n/\widehat{\tau}_s^2)(\widehat{\alpha}_{LS} - \widehat{\alpha}_S)^2 + (1/\widehat{\tau}_W^2)\big\|\widehat{\mathbf{Y}}_{LS} - \widehat{\mathbf{Y}}_W\big\|^2\,;
\end{array} \qquad (3.13.6)$$
the first term, which we denote by TDINT(LS, W), measures the difference in the intercept estimates, while the second measures the difference in the fits attributable to the slope estimates. Even if the Wilcoxon and LS fits of the slope parameters are essentially the same, TDBETAS(LS, W) can be large because of asymmetry of the errors, i.e., because TDINT(LS, W) is large. Hence, both parts of the decomposition are useful diagnostics. Below we give a benchmark for large values of TDBETAS(LS, W).
If TDBETAS(LS, W) is large, then we often want to determine the cases which are the main contributors to this large difference. Hence, consider the correspondingly standardized statistic for the difference in the $i$th fitted values,
$$\mbox{CFITS}_i(LS, W) = \frac{\widehat{Y}_{W,i} - \widehat{Y}_{LS,i}}{\mbox{SE}\big(\widehat{Y}_{W,i}\big)}\,, \qquad (3.13.7)$$
where $\mbox{SE}\big(\widehat{Y}_{W,i}\big) = \widehat{\tau}_W[h_i - (1/n)]^{1/2}$ (the expression in brackets involves $h_i$, the $i$th leverage value of the design matrix). As with TDBETAS(LS, W), the standardization of $\mbox{CFITS}_i(LS, W)$ is robust. Note that this standardization is similar to that proposed for the diagnostic RDFFITS by McKean, Sheather and Hettmansperger (1990).
Note that ( 3.13.7) standardizes by only one fitted value in the numerator (instead of the difference). Belsley, Kuh, and Welsch (1980) used a similar standardization in assessing the difference between $\widehat{y}_{LS,i}$ and $\widehat{y}_{LS,(i)}$, the $i$th deleted fitted value. They suggested a benchmark of $2\sqrt{(p+1)/n}$, and we propose the same benchmark for $\mbox{CFITS}_i(LS, W)$. Having said this, we have found it useful in many cases to ignore the benchmarks and simply look for gaps that separate large CFITS values from small ones (see the examples presented below).
Simulations, such as those discussed in McKean et al. (1999), show that standardization at the Wilcoxon fit is successful and has much better performance than standardization using the asymptotic covariance of the difference in fits (Theorem 3.13.1). These simulations were performed over a wide range of error and $x$-variable distributions.
Using the benchmark for CFITS, we can derive an analogous benchmark for TDBETAS by replacing $\widehat{\tau}_s$ with $\widehat{\tau}_W$. We realize that this may be a crude approximation, but we are only deriving a benchmark. Let $\mathbf{X}_1 = [\mathbf{1} : \mathbf{X}]$ and denote the projection matrix by $\mathbf{H} = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$. Replacing $\widehat{\tau}_s$ with $\widehat{\tau}_W$, we have $\mbox{Cov}\big(\widehat{\mathbf{b}}_W\big) \doteq \tau_W^2(\mathbf{X}_1'\mathbf{X}_1)^{-1}$ and $\mbox{SE}\big(\widehat{Y}_{W,i}\big) \doteq \tau_W\sqrt{h_{ii}}$. Under this approximation, it follows from (??) that an observation is flagged by the diagnostic $\mbox{CFITS}_i(LS, W)$ whenever
$$\frac{\big|\widehat{Y}_{W,i} - \widehat{Y}_{LS,i}\big|}{\widehat{\tau}_W\sqrt{h_{ii}}} > 2\sqrt{(p+1)/n}\,. \qquad (3.13.8)$$
We use this expression to obtain a benchmark for the diagnostic TDBETAS(LS, W) as follows:
$$\begin{array}{rcl}
\mbox{TDBETAS}(LS, W) &=& \big(\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{LS}\big)'\big[\widehat{\tau}_W^2(\mathbf{X}_1'\mathbf{X}_1)^{-1}\big]^{-1}\big(\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{LS}\big)\\
&=& (1/\widehat{\tau}_W^2)\big[\mathbf{X}_1\big(\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{LS}\big)\big]'\big[\mathbf{X}_1\big(\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{LS}\big)\big]\\
&=& (1/\widehat{\tau}_W^2)\sum_i\big(\widehat{Y}_{W,i} - \widehat{Y}_{LS,i}\big)^2\\
&=& (p+1)\,\displaystyle\frac{1}{n}\sum_i\left[\frac{\widehat{Y}_{W,i} - \widehat{Y}_{LS,i}}{\widehat{\tau}_W\sqrt{(p+1)/n}}\right]^2\,.
\end{array}$$
Since $h_{ii}$ has the average value $(p+1)/n$, (3.13.8) suggests flagging TDBETAS(LS, W) as
large whenever $\mbox{TDBETAS}(LS, W) > (p+1)\big(2\sqrt{(p+1)/n}\big)^2$; i.e., whenever
$$\mbox{TDBETAS}(LS, W) > \frac{4(p+1)^2}{n}\,. \qquad (3.13.9)$$
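The RBR function fitdiag, illustrated in Example 3.13.2 below, computes these diagnostics. To make the formulas concrete, a direct, purely illustrative R transcription of ( 3.13.4) and ( 3.13.9) is (the function and argument names are ours):

# Sketch: TDBETAS(LS, W) and its benchmark, given the two estimates
# bls and bw (intercept first), the centered design X, and the scale
# estimates taus and tauw.
tdbetas <- function(bls, bw, X, taus, tauw) {
  n  <- nrow(X); p <- ncol(X)
  Aw <- rbind(c(taus^2 / n, rep(0, p)),                 # A_W of (3.13.5)
              cbind(0, tauw^2 * solve(crossprod(X))))
  d  <- bls - bw
  list(td = as.numeric(t(d) %*% solve(Aw) %*% d),       # (3.13.4)
       benchmark = 4 * (p + 1)^2 / n)                   # (3.13.9)
}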
We proceed the same way for diagnostics to indicate differences in fits between the Wilcoxon and HBR fits and between the LS and HBR fits. The asymptotic representation for the HBR estimate of $\boldsymbol{\beta}$, (??), can be used to obtain the asymptotic distribution of the differences between these fits. For data sets where the HBR weights are close to one, though, the covariance matrix of this difference is practically singular, resulting in the diagnostic being quite liberal; see McKean et al. (1996a, 1999) for discussion. So, as we did for the diagnostic between the Wilcoxon and LS fits, we standardize the differences in fits using the asymptotic covariance matrix of the Wilcoxon estimate, i.e., $\widehat{\mathbf{A}}_W$. Hence, the total differences in fits are given by
$$\mbox{TDBETAS}(W, HBR) = \big(\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{HBR}\big)'\widehat{\mathbf{A}}_W^{-1}\big(\widehat{\mathbf{b}}_W - \widehat{\mathbf{b}}_{HBR}\big) \qquad (3.13.10)$$
$$\mbox{TDBETAS}(LS, HBR) = \big(\widehat{\mathbf{b}}_{LS} - \widehat{\mathbf{b}}_{HBR}\big)'\widehat{\mathbf{A}}_W^{-1}\big(\widehat{\mathbf{b}}_{LS} - \widehat{\mathbf{b}}_{HBR}\big)\,. \qquad (3.13.11)$$
We recommend using the benchmark given by (3.13.9) for these diagnostics, also. Likewise, the diagnostics for the casewise differences are given by
$$\mbox{CFITS}_i(W, HBR) = \frac{\widehat{Y}_{W,i} - \widehat{Y}_{HBR,i}}{\mbox{SE}\big(\widehat{Y}_{W,i}\big)} \qquad (3.13.12)$$
$$\mbox{CFITS}_i(LS, HBR) = \frac{\widehat{Y}_{LS,i} - \widehat{Y}_{HBR,i}}{\mbox{SE}\big(\widehat{Y}_{W,i}\big)}\,. \qquad (3.13.13)$$
We recommend the same benchmark, $2\sqrt{(p+1)/n}$, as discussed above for these diagnostics.
Example 3.13.1 (Bonds Data). Siegel (1997) presented a data set, the Bonds data, which we use to illustrate some of these concepts. It was further discussed in Sheather (2009) and McKean and Sheather (2009). The responses are the bid prices for U.S. treasury bonds, while the predictor is the coupon rate (the size of the bond's periodic payment, in percent). The data are shown in Panel A of Figure 3.13.1, overlaid with the LS (solid line) and Wilcoxon (broken line) fits. The fits differ dramatically, and the diagnostic TDBETAS(LS, W) has the value 213.7, which far exceeds the benchmark of 0.457. The three cases yielding the largest values for the casewise diagnostic CFITS are Cases 4, 13, and 35. Panels B and C display the LS and Wilcoxon Studentized residual plots. As can be seen, the Wilcoxon Studentized residual plot highlights Cases 4, 13, and 35 also. Their Studentized residuals exceed 20, and these cases clearly should be labeled outliers. These are the outlying points on the far left in the scatterplot of the data. On the other hand, the LS Studentized residual plot shows only two of them exceeding the benchmark. Further, the bow-tie pattern of the Wilcoxon residual plot indicates heteroscedasticity of the errors. As discussed in Sheather (2009), this heteroscedasticity is to be expected because the bonds have different maturity dates.
Table 3.13.1: Estimates of the regression coefficients for the Hawkins data.

  Fit        α̂ (se)          β̂₁ (se)        β̂₂ (se)         β̂₃ (se)
  LS        -0.387 (0.42)   0.239 (0.26)   -0.334 (0.15)    0.383 (0.13)
  Wilcoxon  -0.776 (0.20)   0.169 (0.11)    0.018 (0.07)    0.269 (0.05)
  HBR       -0.155 (0.22)   0.096 (0.12)    0.038 (0.07)   -0.046 (0.06)
As further discussed in Sheather (2009), the three outlying cases are of a different type of bond than the others. The plot in Panel D is the Studentized residuals versus fitted values for the Wilcoxon fit after removing these three cases. Note that there are still a few outlying data points. The diagnostic TDBETAS(LS, W) now has the value 1.55, which exceeds the benchmark of 0.50, but the difference is far less than the difference based on the original data.
Next consider the differences between the LS and HBR fits. The leverage values corresponding to the three outlying cases exceed the benchmark for leverage points (the smallest leverage value of these three cases is 0.152, which exceeds the benchmark of 0.114). The diagnostic TDBETAS(LS, HBR) has the value 318.8, which far exceeds the benchmark. As discussed above, the Wilcoxon fit is sensitive to outliers in factor space, and in this case TDBETAS(W, HBR) is 10.5. When the outliers are omitted, the value of this statistic is 0.034, which is less than the benchmark.
In this simple regression model, it is obvious that the three outlying cases are on the edge of factor space. As the next example shows, in a multiple regression problem this is generally not as apparent. The diagnostics discussed in this section, though, alert the user to potentially troublesome points in factor space or response space.
Example 3.13.2 (Hawkins Data). This is an artificial data set proposed by Hawkins, Bradu and Kass (1984) involving three independent variables. There are a total of 75 data points in the set, and the first 14 of them are outlying in factor space. The other 61 points follow a linear model. Of the 14 outlying points, the first 10 do not follow the model while points 11 through 14 do; hence, the first ten cases are referred to as bad points of high leverage, while the next four cases are referred to as good points of high leverage.
Panels A, B and C of Figure 3.13.2 are the unstandardized residual plots from the LS, Wilcoxon and HBR fits, respectively. Note that the LS and Wilcoxon fits are fooled by the bad outliers. Their fits are drawn toward the bad points of high leverage, and they both flag the four good points of high leverage. On the other hand, as Panel C indicates, the HBR fit correctly identified the 10 bad points of high leverage and fit the 4 good points of high leverage well. Table 3.13.1 displays the estimates and their standard errors.
The differences-in-fits diagnostics were successful for this data set. As displayed on the plot in Panel D, TDBETAS(W, HBR) = 1324, which far exceeds the benchmark value of 0.853 and indicates that the Wilcoxon and HBR fits differ substantially. The plot in Panel D consists of the diagnostics $\mbox{CFITS}_i(W, HBR)$ versus Case $i$. The 14 largest values of $\mbox{CFITS}_i(W, HBR)$ are for the 14 outlying cases. Recall that the Wilcoxon fit incorrectly fit the 4 good leverage points. So it is reassuring to see that all 14 outlying cases were
correctly identified. Also, in a further investigation of this data set, the gap between these 14 $\mbox{CFITS}_i(W, HBR)$ values and those of the other cases would lead one to consider a fit based on the other 61 cases. Assuming that the matrix x1 is the design matrix (not including the intercept column), the following R code obtains the fits and the diagnostics:
fit.ls = lm(y~x1)
fit.wl = wwest(x1,y)
fit.hbr = wwest(x1,y,bij="HBR")
fdwilhbr = fitdiag(x1,y,est=c("WIL","HBR"))
fdwilhbr$tdbeta
fdwilhbr$bmtd
cfit =fdwilhbr$cfit
fdwills = fitdiag(x1,y,est=c("WIL","LS"))
fdlshbr = fitdiag(x1,y,est=c("LS","HBR"))
3.14 Rank-Based Procedures for Nonlinear Models

In this section, we consider the general nonlinear model
$$Y_i = f_i(\boldsymbol{\theta}_0) + \varepsilon_i\,, \quad i = 1, \ldots, n\,, \qquad (3.14.1)$$
where the $f_i$ are known real-valued functions defined on a compact space $\boldsymbol{\Theta}$ and $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed random errors with probability density function $h(t)$. The asymptotic properties and conditions needed for the numerical stability of the LS estimation procedure were investigated by Jennrich (1969); see, also, Malinvaud (1970) and Wu (1981). LS estimation in nonlinear models is a direct extension of its estimation in linear models: the same (Euclidean) norm is minimized to obtain the LS estimate of $\boldsymbol{\theta}_0$. For the rank-based procedures of this section, we simply replace the Euclidean norm by an R norm, as we did in the linear model case. Hence, the geometry is the same as in the linear model case.
For the nonlinear model ( 3.14.1), Oberhofer (1982) obtained the weak consistency of R estimates based on sign scores, i.e., the $L_1$ estimate. Abebe and McKean (2007) obtained the consistency and asymptotic normality of the nonlinear R estimates of Model ( 3.14.1) based on the Wilcoxon score function. In this section, we briefly discuss the Wilcoxon development.
For the long form of the model, let $\mathbf{Y} = (y_1, \ldots, y_n)^T$ and $\mathbf{f}(\boldsymbol{\theta}) = (f_1(\boldsymbol{\theta}), \ldots, f_n(\boldsymbol{\theta}))^T$. Given a norm $\|\cdot\|$ on $n$-space, a natural estimator of $\boldsymbol{\theta}$ is a value $\widehat{\boldsymbol{\theta}}$ which minimizes the distance between the response vector $\mathbf{y}$ and $\mathbf{f}(\boldsymbol{\theta})$; i.e., $\widehat{\boldsymbol{\theta}} = \mbox{argmin}_{\boldsymbol{\theta}}\|\mathbf{y} - \mathbf{f}(\boldsymbol{\theta})\|$. If the norm is the Euclidean norm, then $\widehat{\boldsymbol{\theta}}$ is the LS estimate. Our interest, though, is in the Wilcoxon norm given in expression (??), where the score function is $\varphi(u) = \sqrt{12}[u - (1/2)]$. Here, we write the norm as $\|\cdot\|_W$, where $W$ denotes the Wilcoxon score function. We define the Wilcoxon estimator of $\boldsymbol{\theta}_0$, denoted hereafter by $\widehat{\boldsymbol{\theta}}_{W,n}$, as
$$\widehat{\boldsymbol{\theta}}_{W,n} = \mbox{argmin}_{\boldsymbol{\theta}}\,\big\|\mathbf{y} - \mathbf{f}(\boldsymbol{\theta})\big\|_W\,. \qquad (3.14.2)$$
We assume that $f_i(\boldsymbol{\theta})$ is defined and continuous for all $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, for all $i$. It then follows that the dispersion function is a continuous function of $\boldsymbol{\theta}$ and, hence, since $\boldsymbol{\Theta}$ is compact, that the Wilcoxon estimate $\widehat{\boldsymbol{\theta}}_{W,n}$ exists.
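Conceptually, then, the estimator is simply the minimizer of the Wilcoxon dispersion of the nonlinear residuals. A minimal R sketch of this definition, using a generic optimizer rather than the algorithm of Section 3.14.1, is as follows; the names wil.disp, wilnl, and fmodel are ours, purely for illustration.

# Sketch: the Wilcoxon nonlinear estimator (3.14.2) by direct
# minimization of the dispersion; fmodel(theta, x) returns f(theta).
wil.disp <- function(theta, x, y, fmodel) {
  e  <- y - fmodel(theta, x)
  sc <- sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)  # Wilcoxon scores
  sum(sc * e)                                         # ||y - f(theta)||_W
}
wilnl <- function(theta0, x, y, fmodel) {
  optim(theta0, wil.disp, x = x, y = y, fmodel = fmodel)$par
}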
To state the asymptotic properties of the LS and R nonlinear estimates, certain assumptions are needed. These are discussed in detail in Abebe and McKean (2007). We do note the analog of Assumption (??) for the linear model; that is, the sequence of matrices
$$n^{-1}\sum_{i=1}^{n}\nabla f_i(\boldsymbol{\theta}_0)\,\nabla f_i(\boldsymbol{\theta}_0)'$$
converges to a positive definite matrix $\boldsymbol{\Sigma}(\boldsymbol{\theta}_0)$, where $\nabla f_i(\boldsymbol{\theta})$ is the $p \times 1$ derivative vector of $f_i(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$. Under these assumptions, Abebe and McKean (2007) showed that $\widehat{\boldsymbol{\theta}}_{W,n}$ converges in probability to $\boldsymbol{\theta}_0$. They then derived the asymptotic distribution of $\widehat{\boldsymbol{\theta}}_{W,n}$. Similar to the derivation in the linear model case of Section ??, this involves a pseudo-linear model.
Consider the local linear model given by
$$y_i^* = \mathbf{x}_i^{*T}\boldsymbol{\theta}_0 + \varepsilon_i\,, \qquad (3.14.3)$$
where, for $i = 1, \ldots, n$,
$$y_i^* \equiv y_i - f_i(\boldsymbol{\theta}_0) + \nabla f_i(\boldsymbol{\theta}_0)^T\boldsymbol{\theta}_0 \quad \mbox{and} \quad \mathbf{x}_i^* = \nabla f_i(\boldsymbol{\theta}_0)\,.$$
Note that the probability density function of the errors of Model ( 3.14.3) is $h(t)$, i.e., the density function of $\varepsilon_i$. Define the corresponding Wilcoxon dispersion function as
$$D_n^*(\boldsymbol{\theta}) \equiv [2n(n+1)]^{-1}\sum_{i<j}\big|e_i^*(\boldsymbol{\theta}) - e_j^*(\boldsymbol{\theta})\big|\,. \qquad (3.14.4)$$
Furthermore, let
$$\widetilde{\boldsymbol{\theta}}_n = \mbox{argmin}\, D_n^*(\boldsymbol{\theta})\,. \qquad (3.14.5)$$
It then follows (see Exercise 3.16.42) that
$$\sqrt{n}\big(\widetilde{\boldsymbol{\theta}}_n - \boldsymbol{\theta}_0\big) \stackrel{D}{\rightarrow} N_p\big(\mathbf{0}, \tau^2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}_0)\big)\,, \qquad (3.14.6)$$
where $\tau$ is as given in (??). Abebe and McKean (2007) show that $\sqrt{n}\big(\widehat{\boldsymbol{\theta}}_{W,n} - \widetilde{\boldsymbol{\theta}}_n\big) \rightarrow 0$, in probability; hence, we have the asymptotic distribution of the Wilcoxon estimator, which we state in the following theorem.
Theorem 3.14.1. Under the assumptions in Abebe and McKean (2007),
$$\sqrt{n}\big(\widehat{\boldsymbol{\theta}}_{W,n} - \boldsymbol{\theta}_0\big) \stackrel{D}{\rightarrow} N_p\big(\mathbf{0}, \tau^2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}_0)\big)\,. \qquad (3.14.7)$$
Let $\widehat{\boldsymbol{\theta}}_{LS,n}$ denote the LS estimator of $\boldsymbol{\theta}_0$. Under suitable regularity conditions, the asymptotic distribution of the LS estimator is given by
$$\sqrt{n}\big(\widehat{\boldsymbol{\theta}}_{LS,n} - \boldsymbol{\theta}_0\big) \stackrel{D}{\rightarrow} N_p\big(\mathbf{0}, \sigma^2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}_0)\big)\,, \qquad (3.14.8)$$
where $\sigma^2$ is the variance of the random error $\varepsilon_i$. It follows immediately, from expressions (3.14.7) and (3.14.8), that, for any component of $\boldsymbol{\theta}_0$, the asymptotic relative efficiency (ARE) between the Wilcoxon estimator and the LS estimator of that component is given by the ratio $\sigma^2/\tau^2$. This, of course, is the ARE between the Wilcoxon and LS estimators in linear models. If the error distribution is normal, then this ratio is the well-known number 0.955. Hence, there is only a loss of 5% efficiency if one uses the Wilcoxon estimator instead of the LS estimator when the errors are normally distributed. In contrast, the $L_1$ estimator has an asymptotic relative efficiency of 63% relative to the LS estimator at normal errors, and the ARE between the Wilcoxon and $L_1$ estimators at normal errors is 150%. Hence, as in the linear model case, the Wilcoxon estimator is a highly efficient estimator for nonlinear models. For heavier-tailed error distributions the Wilcoxon estimator is generally much more efficient than the LS estimator. A discussion of such results, along with a Monte Carlo verification for a family of contaminated normal error distributions, can be found in Abebe and McKean (2007).
Using the pseudo-linear model and the results of Section 3.5, a useful asymptotic representation of the Wilcoxon estimate is given by
$$\sqrt{n}\big(\widehat{\boldsymbol{\theta}}_{W,n} - \boldsymbol{\theta}_0\big) = \tau\big(n^{-1}\mathbf{X}^{*T}\mathbf{X}^*\big)^{-1} n^{-1/2}\mathbf{X}^{*T}\varphi\big[H(\mathbf{y}^* - \mathbf{X}^*\boldsymbol{\theta}_0)\big] + o_p(1)\,, \qquad (3.14.9)$$
where $\mathbf{X}^*$ is the $n \times p$ matrix whose $i$th row is $\nabla f_i(\boldsymbol{\theta}_0)^T$ and $\mathbf{y}^*$ is the $n \times 1$ vector with $i$th component $y_i - f_i(\boldsymbol{\theta}_0) + \nabla f_i(\boldsymbol{\theta}_0)^T\boldsymbol{\theta}_0$.
Based on ( 3.14.9), we can obtain the influence function of the Wilcoxon estimate. Assume that $f_i$ depends on a set of predictors $\mathbf{z}_i \in \mathcal{Z} \subset \mathbb{R}^m$ through $f_i(\boldsymbol{\theta}) = f(\mathbf{z}_i, \boldsymbol{\theta})$. Assume also that $f$ is a continuous function of $\boldsymbol{\theta}$ for each $\mathbf{z} \in \mathcal{Z}$ and is a measurable function of $\mathbf{z}$ for each $\boldsymbol{\theta}$ with respect to a $\sigma$-finite measure. Under these assumptions, the representation above gives the local influence function of the Wilcoxon estimate at the point $(\mathbf{z}_0, y_0)$:
$$\mbox{IF}\big(\mathbf{z}_0, y_0; \widehat{\boldsymbol{\theta}}_{W,n}\big) = \tau\,\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}_0)\,\varphi\big[H\big(y_0 - f(\mathbf{z}_0, \boldsymbol{\theta}_0)\big)\big]\,\nabla f(\mathbf{z}_0, \boldsymbol{\theta}_0)\,.$$
Note that the influence function is unbounded if the tangent plane of the surface $\mathbf{f}(\boldsymbol{\theta})$ at $\boldsymbol{\theta}_0$ is unbounded. This phenomenon corresponds to the existence of high leverage points in linear regression. The HBR estimators, however, can be extended to the nonlinear model also; see ??. These are robust in such cases.
3.14.1 Implementation

To implement the asymptotic inference based on the Wilcoxon estimate, we need a consistent estimator of its variance-covariance matrix. Define the statistic $\boldsymbol{\Sigma}\big(\widehat{\boldsymbol{\theta}}_{W,n}\big)$ to be $\boldsymbol{\Sigma}(\boldsymbol{\theta}_0)$ of expression (N2) with $\boldsymbol{\theta}_0$ replaced by $\widehat{\boldsymbol{\theta}}_{W,n}$. By Assumption (N2) and the consistency of $\widehat{\boldsymbol{\theta}}_{W,n}$ for $\boldsymbol{\theta}_0$, $\boldsymbol{\Sigma}\big(\widehat{\boldsymbol{\theta}}_{W,n}\big)$ converges in probability to $\boldsymbol{\Sigma}(\boldsymbol{\theta}_0)$. Next, it follows from the asymptotic representation ( 3.14.9) that the estimator of $\tau$ proposed by Koul et al. [?] for linear models is also a consistent estimator of $\tau$ for our nonlinear model. We denote this estimator by $\widehat{\tau}$. Thus $\widehat{\tau}^2\boldsymbol{\Sigma}^{-1}\big(\widehat{\boldsymbol{\theta}}_{W,n}\big)$ is a consistent estimator of the asymptotic variance-covariance matrix of $\widehat{\boldsymbol{\theta}}_{W,n}$.
Estimation Algorithm

Similar to the LS estimates for nonlinear models, a Gauss-Newton type of algorithm can be used to obtain the Wilcoxon fit. Recall that this is an iterative algorithm which uses the Taylor series expansion of the function $\mathbf{f}(\boldsymbol{\theta})$, evaluated at the current estimate, to obtain the estimate at the next iteration. Thus each iteration consists of fitting a linear model. Abebe and McKean [?] show that this algorithm for obtaining the Wilcoxon fit converges in a probability sense. Using this algorithm, all that is required to compute the Wilcoxon nonlinear estimate is a computational procedure for Wilcoxon linear model estimates; see Exercise 3.16.43 for further discussion. A sketch of such an iteration is given below.
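The following is a hedged R sketch of one such iteration, using the rank-based linear fitter rfit from the Rfit package of Kloke and McKean for the linearized step; the wrapper wilnl.gn and the gradient function gradf are ours, purely for illustration.

# Sketch: Gauss-Newton iteration for the Wilcoxon nonlinear fit.
# fmodel(theta, x) returns the n-vector f(theta); gradf(theta, x)
# returns the n x p matrix whose ith row is the gradient of f_i.
library(Rfit)
wilnl.gn <- function(theta0, x, y, fmodel, gradf, maxit = 20, tol = 1e-6) {
  theta <- theta0
  for (k in 1:maxit) {
    Xs   <- gradf(theta, x)                # linearized design at theta
    ys   <- y - fmodel(theta, x)           # current residuals
    step <- coef(rfit(ys ~ Xs))[-1]        # Wilcoxon fit; drop the intercept
    theta <- theta + step
    if (sqrt(sum(step^2)) < tol) break     # stop when the step is small
  }
  theta
}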
We next consider an example that demonstrates the robustness and efficiency properties of the rank estimators in comparison to the least squares (LS) estimator in practical situations.

Example 3.14.1 (Chwirut's Data). These data are taken from the ultrasonic block reference study by Chwirut [?]. The response variable is ultrasonic response and the predictor variable is metal distance. The study involved 214 observations. The model under consideration is
$$f_i(\boldsymbol{\theta}) \equiv f(x_i; \theta_1, \theta_2, \theta_3) \equiv \frac{\exp[-\theta_1 x_i]}{\theta_2 + \theta_3 x_i}\,, \quad i = 1, \ldots, 214\,.$$
Using the Wilcoxon and LS fitting procedures, we fit the (original) data and then a data set with one observation replaced by an outlier. Figure 3.14.1 displays the results of the fits. For the original data, as shown in the figure and by the estimates given in Table 3.14.1, the LS and Wilcoxon fits are quite similar. As shown in the residual plots of Figure 3.14.1, there are several moderate outliers in the original data set. These outliers have an impact on the LS estimate of scale, the square root of MSE, which has the value $\widehat{\sigma}_{LS} = 3.36$. In contrast, the Wilcoxon estimate of $\tau$ is $\widehat{\tau} = 2.45$, which explains the Wilcoxon fit's smaller standard errors relative to those of LS in the table of estimates.
For robustness considerations, we introduced a gross outlier in the response space (observation 17 was changed from 8.025 to 5000). The Wilcoxon and LS fits were then obtained. As shown in Figure 3.14.1, the LS estimate essentially did not converge. From the plot of
Table 3.14.1: Wilcoxon and LS estimates, with standard errors (SE), based on the original data, and the Wilcoxon estimates based on the data with the substituted gross outlier.

              Original Data Set               Outlier Data Set
        Wil. Est.    SE      LS Est.    SE     Wil. Est.    SE
  θ₁     0.1902    0.0161    0.1903   0.0219    0.1897    0.0161
  θ₂     0.0061    0.0002    0.0061   0.0003    0.0061    0.0002
  θ₃     0.0197    0.0006    0.0105   0.0008    0.0107    0.0006
the fitted models and residual plots, it is clear that the Wilcoxon fit performs dramatically better than its LS counterpart. In Table 3.14.1 the Wilcoxon estimates for this data set are displayed with their standard errors. There is basically little difference between the Wilcoxon fits for the original data set and the data set with the gross outlier.
3.15 Exercises

3.15.1. Show that the function ( 3.12.1) is a pseudo-norm.

3.15.2. Consider the hypotheses $H_0\!: \boldsymbol{\beta} = \mathbf{0}$ versus $H_A\!: \boldsymbol{\beta} \neq \mathbf{0}$.
(a). Using Theorem ??, derive the Wald-type test based on the GR estimate of $\boldsymbol{\beta}$.
(b). Using Theorem ??, derive the gradient test.

3.15.3. In the derivation of Lemma ??, show the results for Cases 2-6.

3.15.4. Show that the second standardized term of the variance of $n^{-3/2}S(0)$ in expression ( ??) goes to 0 as $n \rightarrow \infty$.

3.15.5. Consider Theorem ??.
(a). If the weights are identically equal to 1, show that the random variable $\sqrt{12}\,\widehat{\tau}_n RD_{GR}$ reduces to the random variable $RD/(\widehat{\tau}/2)$ of Theorem 3.6.1, provided $\varphi(u) = \sqrt{12}(u - 1/2)$, (i.e., Wilcoxon scores).
(b). Complete the proof of Theorem ??.

3.15.6. Keeping in mind that the weights for the GR estimates depend only on the $x$'s, discuss how the bootstrap described in Section 3.8 for the $L_1$ estimate can be used to bootstrap the $p$-value of the test based on the statistic $F_{GR}$.
3.15.7. (a). Write an algorithm which obtains the simulated distribution of $\sum_{i=1}^{q}\lambda_i\chi_i^2(1)$, where $\lambda_1, \ldots, \lambda_q$ are specified and the $\chi_i^2(1)$ are iid $\chi^2$ random variables with one degree of freedom.
(b). Write a second algorithm which uses the algorithm in (a) to obtain the $p$-value of the test statistic $qF_{GR}$.

3.15.8. For general linear hypotheses, $H_0\!: \mathbf{M}\boldsymbol{\beta} = \mathbf{0}$ versus $H_A\!: \mathbf{M}\boldsymbol{\beta} \neq \mathbf{0}$, discuss the Wald-type test based on the GR estimates.

3.15.9. Show that the second term in expression ( ??) is less than or equal to zero.

3.15.10. For the data in Example ??, consider polynomial models of degree $p$,
$$y_i = \alpha + \sum_{j=1}^{p}\beta_j(x_i - \bar{x})^j + e_i\,.$$
Once a good fit has been established, a hypothesis of interest is $H_0\!: \beta_p = 0$. Suppose for the GR estimates we use the weights given by expression ( ??), parameterized by the exponent $\alpha_1$. For $\alpha_1 = .5$, 1, and 2, obtain the asymptotic relative efficiency between the GR estimate and the Wilcoxon estimate of $\beta_p$ for polynomial models of degree $p = 1, \ldots, 4$. If software is available, obtain these efficiencies for the weights given by expression ( ??) using the tuning constants recommended.
3.15.11. Obtain the details of the derivation of $\mbox{Var}(\widehat{e}_{GR})$ given in expression ( ??).

3.15.12. Obtain the projection that is used in the proof of Theorem ??.

3.15.13. In the proof of Theorem ??, show that $\sqrt{n}\big(\widehat{\boldsymbol{\beta}}_{GR} - \boldsymbol{\beta}\big)$ is asymptotically normal.

3.15.14. In the proof of Theorem ??, show that $E[F_c(e)S'(\boldsymbol{\beta}_0)] = (n/6)\mathbf{W}\mathbf{X}$.
3.15.15. By completing the following steps, obtain the asymptotic variance-covariance matrix of $\widehat{\mathbf{b}}_{GR}^*$, where $\widehat{\mathbf{b}}_R^* = \big(\widehat{\alpha}_{S,R}, \widehat{\boldsymbol{\beta}}_R'\big)'$ and $\widehat{\mathbf{b}}_{GR}^* = \big(\widehat{\alpha}_{S,GR}, \widehat{\boldsymbol{\beta}}_{GR}'\big)'$.

(a). Show that $\widehat{\mathbf{b}}_R^* = \mathbf{T}\widehat{\mathbf{b}}_R$, where
$$\mathbf{T} = \begin{pmatrix}1 & -\bar{\mathbf{x}}'\\ \mathbf{0} & \mathbf{I}\end{pmatrix}\,.$$

(b). Show that the following holds (where AV denotes asymptotic variance):
$$\mbox{AV}\left[\begin{pmatrix}\widehat{\alpha}_{S,R}\\ \widehat{\boldsymbol{\beta}}_R\end{pmatrix} - \begin{pmatrix}\widehat{\alpha}_{S,GR}\\ \widehat{\boldsymbol{\beta}}_{GR}\end{pmatrix}\right] = \mathbf{T}\,\mbox{AV}\left[\begin{pmatrix}\widehat{\alpha}_{S,R} - \widehat{\alpha}_{S,GR}\\ \mathbf{0}\end{pmatrix} + \begin{pmatrix}0\\ \widehat{\boldsymbol{\beta}}_R - \widehat{\boldsymbol{\beta}}_{GR}\end{pmatrix}\right]\mathbf{T}'\,.$$

(c). Since the asymptotic representation ( 3.5.23) holds for both $\widehat{\alpha}_{S,R}$ and $\widehat{\alpha}_{S,GR}$, and since the centered intercept estimate is asymptotically independent of the regression estimates, use ( ??) to conclude that
$$\mbox{AV}\left[\begin{pmatrix}\widehat{\alpha}_{S,R}\\ \widehat{\boldsymbol{\beta}}_R\end{pmatrix} - \begin{pmatrix}\widehat{\alpha}_{S,GR}\\ \widehat{\boldsymbol{\beta}}_{GR}\end{pmatrix}\right] = \tau^2\begin{pmatrix}\bar{\mathbf{x}}'\mathbf{D}\bar{\mathbf{x}} & -\bar{\mathbf{x}}'\mathbf{D}\\ -\mathbf{D}\bar{\mathbf{x}} & \mathbf{D}\end{pmatrix}\,,$$
where $\mathbf{D} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}^2\mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} - (\mathbf{X}'\mathbf{X})^{-1}$.
3.15.16. Show that the asymptotic variance-covariance matrix derived in Exercise 3.15.15 is singular, by showing that the vector $(1, \bar{\mathbf{x}}')'$ is in its kernel.

3.15.17. Show that $RD_{GR}(\boldsymbol{\beta}) = D_{GRy} - D_{GRe}$, ( ??) and ( ??), is a strictly convex function of $\boldsymbol{\beta}$ with a minimum value of 0 at $\boldsymbol{\beta} = \mathbf{0}$.

3.15.18. Using the influence function, ( ??), of the GR estimate, obtain the asymptotic distribution of $\widehat{\boldsymbol{\beta}}_{GR}$.

3.15.19. The influence function of the test statistic $T_{GR}$ is given in Theorem ??. Use it to obtain the asymptotic distribution of $T_{GR}$.

3.15.20. Obtain the approximate distribution of $\widehat{\boldsymbol{\beta}}_{HBR} - \widehat{\boldsymbol{\beta}}_R$, where $\widehat{\boldsymbol{\beta}}_R$ is the Wilcoxon estimate. Use it to obtain HBR analogues of the diagnostic statistics $\mbox{TDBETAS}_R$, ( ??), and $\mbox{CFITS}_{R,i}$, ( ??).

3.15.21. Show that the $p \times p$ matrix $\mathbf{C}_n$ defined in expression ( 3.12.7) can be written alternately as $\mathbf{C}_n = \sum_{i<j}\gamma_{ij}b_{ij}(\mathbf{x}_j - \mathbf{x}_i)(\mathbf{x}_j - \mathbf{x}_i)'$.
3.15.22. Consider the influence function of the HBR estimate given in expression ( A.5.24).
(a). If the weights for the residuals are set at 1, show that the influence function of the HBR estimate simplifies to the influence function of the GR estimate given in ( ??).
(b). If the weights for the residuals and the $x$'s are both set at 1, show that the influence function of the HBR estimate simplifies to the influence function of the Wilcoxon estimate given in ( 3.5.17).
3.16 Exercises

3.16.1. For the baseball data in Example 3.3.2, explore other transformations of the predictor Years in order to obtain a better fitting model than the one discussed in the example.

3.16.2. Consider the linear model ( 3.2.2).
(a). Show that the ranks of the residuals can only change values at the $\binom{n}{2}$ equations defined by $y_i - \mathbf{x}_i'\boldsymbol{\beta} = y_j - \mathbf{x}_j'\boldsymbol{\beta}$.
(b). Determine the change in dispersion as $\boldsymbol{\beta}$ moves across one of these defining planes.
(c). For the telephone data, Example 3.3.1, obtain the plot shown in Panel D of Figure 3.3.1; i.e., a plot of the dispersion function $D(\beta)$ for a set of values $\beta$ in the interval $(-.2, .6)$. Locate the estimate of slope on the plot.
(d). Plot the gradient function $S(\beta)$ for the same set of values $\beta$ in the interval $(-.2, .6)$. Locate the estimate of slope on the plot.
3.16.3. In Section 2.2 of Chapter 2, the two sample location problem was modeled as a regression problem; see ( 2.2.2). Consider fitting this model using Wilcoxon scores.
(a). Show that the gradient test statistic ( 3.5.8) simplifies to the square of the standardized MWW test statistic ( 2.2.21).
(b). Show that the regression estimate of the slope parameter is the Hodges-Lehmann estimator given by expression ( 2.2.18).
(c). Verify Parts (a) and (b) by fitting the data in the two sample problem of Exercise 2.13.48 as a regression model.

3.16.4. For the simple linear regression problem, if the values of the independent variable $x$ are distinct and equally spaced, show that the Wilcoxon test statistic is equivalent to the test for correlation based on Spearman's $r_s$, where
$$r_s = \frac{\sum_i\big(R(x_i) - \frac{n+1}{2}\big)\big(R(y_i) - \frac{n+1}{2}\big)}{\sqrt{\sum_i\big(R(x_i) - \frac{n+1}{2}\big)^2\,\sum_i\big(R(y_i) - \frac{n+1}{2}\big)^2}}\,.$$
Note that the denominator of $r_s$ is a constant. Obtain its value.
3.16.5. For the simple linear regression model, consider the process
$$T(\beta) = \sum_{i=1}^{n}\sum_{j=1}^{n}\mbox{sgn}(x_i - x_j)\,\mbox{sgn}\big((Y_i - x_i\beta) - (Y_j - x_j\beta)\big)\,.$$
(a). Show, under the null hypothesis $H_0\!: \beta = 0$, that $E(T(0)) = 0$ and that $\mbox{Var}(T(0)) = 2(n-1)n(2n+5)/9$.
(b). Determine the estimate of $\beta$ based on inverting the test statistic $T(0)$; i.e., the value of $\beta$ which solves
$$T(\beta) \doteq 0\,.$$
(c). Show that when the two sample problem is written as a regression model, ( 2.2.2), this estimate of $\beta$ is the Hodges-Lehmann estimate ( 2.2.18).
Note: Kendall's $\tau$ is a measure of association between $x_i$ and $Y_i$ given by $\tau = T(0)/(n(n-1))$; see Chapter 4 of Hettmansperger (1984) for further discussion.
3.16.6. Show that the R estimate $\widehat{\boldsymbol{\beta}}$ is an equivariant estimator; that is, $\widehat{\boldsymbol{\beta}}(\mathbf{Y} + \mathbf{X}\boldsymbol{\delta}) = \widehat{\boldsymbol{\beta}}(\mathbf{Y}) + \boldsymbol{\delta}$ and $\widehat{\boldsymbol{\beta}}(k\mathbf{Y}) = k\widehat{\boldsymbol{\beta}}(\mathbf{Y})$.

3.16.7. Consider Model ( 3.2.1) and the hypotheses ( 3.2.5). Let $\Omega_F$ denote the column space of the full model design matrix $\mathbf{X}$ and let $\omega$ denote the subspace of $\Omega_F$ subject to $H_0$. Show that $\omega$ is a subspace of $\Omega_F$ and determine its dimension. Hint: One way of establishing the dimension is to show that $\mathbf{C} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{M}'$ is a basis matrix for $\Omega_F \cap \omega^c$.
3.16.8. Show that Assumptions ( 3.4.9) and ( 3.4.8) imply Assumption ( 3.4.7).
3.16.9. For the special case of Wilcoxon scores, obtain the proof of Theorem 3.5.2 by first obtaining the projection of the statistic S(0).
3.16.10. Assume that the errors $e_i$ in Model ( 3.2.2) have finite variance $\sigma^2$. Let $\widehat{\boldsymbol{\beta}}_{LS}$ denote the least squares estimate of $\boldsymbol{\beta}$. Show that $\sqrt{n}\big(\widehat{\boldsymbol{\beta}}_{LS} - \boldsymbol{\beta}\big) \stackrel{D}{\rightarrow} N_p(\mathbf{0}, \sigma^2\boldsymbol{\Sigma}^{-1})$. Hint: First show that the LS estimate is location and scale equivariant. Then, without loss of generality, we can assume that the true $\boldsymbol{\beta}$ is $\mathbf{0}$.

3.16.11. Under the additional assumption that the errors have a symmetric distribution, show that R estimates are unbiased for all sample sizes.
3.16.12. Let $\varphi_f(u) = -f'(F^{-1}(u))/f(F^{-1}(u))$ denote the optimal scores for the density $f(x)$, and suppose that $f$ is symmetric. Show that $\varphi_f(1 - u) = -\varphi_f(u)$; that is, the optimal scores are odd about $1/2$.

3.16.13. Suppose the errors $e_i$ are double exponentially distributed. Show that the $L_1$ estimate, i.e., the R estimate based on sign scores, is the maximum likelihood estimate.
3.16.14. Using Theorem 3.5.7, show that $\big(\widehat{\alpha}, \widehat{\boldsymbol{\beta}}'\big)'$ is approximately
$$N_{p+1}\left[\begin{pmatrix}\alpha\\ \boldsymbol{\beta}\end{pmatrix}, \begin{pmatrix}\kappa_n & -\tau^2\bar{\mathbf{x}}'(\mathbf{X}'\mathbf{X})^{-1}\\ -\tau^2(\mathbf{X}'\mathbf{X})^{-1}\bar{\mathbf{x}} & \tau^2(\mathbf{X}'\mathbf{X})^{-1}\end{pmatrix}\right]\,, \qquad (3.16.1)$$
where $\kappa_n = n^{-1}\tau_S^2 + \tau^2\bar{\mathbf{x}}'(\mathbf{X}'\mathbf{X})^{-1}\bar{\mathbf{x}}$, and $\tau_S$ and $\tau$ are given respectively by ( 3.4.6) and ( 3.4.4).
3.16.15. Show that the random vector within the brackets in the proof of Lemma 3.6.2 is bounded in probability.

3.16.16. Show that the difference between the numerators of the two F-statistics, ( 3.6.12) and ( 3.6.14), converges to 0 in probability under the null hypothesis.

3.16.17. Show that the difference between $F_{\varphi}$, ( 3.6.12), and $A_{\varphi}$, ( 3.6.17), converges to 0 in probability under the null hypothesis.
3.16.18. By showing the following results, establish the asymptotic distribution of the least squares test statistic, $F_{LS}$, under the sequence of models ( 3.6.24), with the additional assumption that the random errors have finite variance $\sigma^2$.

(a). First show that
$$\frac{1}{\sqrt{n}}\mathbf{X}'\mathbf{Y} \stackrel{D}{\rightarrow} N\left[\begin{pmatrix}\mathbf{B}\\ \mathbf{A}_2\end{pmatrix}\boldsymbol{\theta},\ \sigma^2\boldsymbol{\Sigma}\right]\,, \qquad (3.16.2)$$
where the matrices $\mathbf{A}_2$ and $\mathbf{B}$ are defined in the proof of Theorem 3.6.1. This can be established by using the Lindeberg-Feller CLT, Theorem A.1.1 of the Appendix, to show that an arbitrary linear combination of the components of the random vector on the left side converges in distribution to a random variable with a normal distribution.

(b). Based on Part (a), show that
$$\left[-\mathbf{B}'\mathbf{A}_1^{-1}\,\vdots\,\mathbf{I}\right]\frac{1}{\sqrt{n}}\mathbf{X}'\mathbf{Y} \stackrel{D}{\rightarrow} N\big(\mathbf{W}\boldsymbol{\theta},\ \sigma^2\mathbf{W}\big)\,, \qquad (3.16.3)$$
where the matrices $\mathbf{A}_1$ and $\mathbf{W}$ are defined in the proof of Theorem 3.6.1.

(c). Let $F_{LS}(\sigma^2)$ denote the LS F-test statistic with the true value of $\sigma^2$ replacing the estimate $\widehat{\sigma}^2$. Show that
$$F_{LS}(\sigma^2) = \left\{\left[-\mathbf{B}'\mathbf{A}_1^{-1}\,\vdots\,\mathbf{I}\right]\frac{1}{\sqrt{n}}\mathbf{X}'\mathbf{Y}\right\}'\big[\sigma^2\mathbf{W}\big]^{-1}\left\{\left[-\mathbf{B}'\mathbf{A}_1^{-1}\,\vdots\,\mathbf{I}\right]\frac{1}{\sqrt{n}}\mathbf{X}'\mathbf{Y}\right\}\,. \qquad (3.16.4)$$

(d). Based on ( 3.16.3) and ( 3.16.4), show that $F_{LS}(\sigma^2)$ has a limiting noncentral $\chi^2$ distribution with noncentrality parameter given by expression ( 3.6.29).

(e). Obtain the final result by showing that $\widehat{\sigma}^2$ is a consistent estimate of $\sigma^2$ under the sequence of models ( 3.6.24).
3.16.19. Show that $D_e$, ( 3.6.30), is a scale parameter; i.e., $D_e(F_{ae+b}) = |a|D_e(F_e)$.

3.16.20. Establish expression ( 3.6.35).

3.16.21. Suppose Wilcoxon scores are used.
(a). Establish the expressions ( 3.6.36) and ( 3.6.37).
(b). Similarly for sign scores establish ( 3.6.38) and ( 3.6.39).
3.16.22. Consider the model ( 3.2.1) and hypotheses ( 3.6.9). Suppose the errors have a
double exponential distribution with density f(t) = (2b)
1
exp [t[/b. Assume b is known.
Show that the likelihood ratio test is equivalent to the drop in dispersion test based on sign
scores.
3.16.23. Establish expressions ( 3.9.8) and ( 3.9.9).
3.16.24. Let $X$ be a random variable with distribution function $F_X(x)$ and let $Y = aX + b$. Define the quantile function of $X$ as $q_X(p) = F_X^{-1}(p)$. Show that $q_X(p)$ is a linear function of $q_Y(p)$.
3.16.25. Verify expression ( 3.9.17).
3.16.26. Assume that the errors have a normal distribution. Show that $\hat{K}^2$, (3.9.25), converges in probability to 1.
3.16.27. Verify expression ( 3.9.34).
3.16.28. Proceeding as in Theorem 3.9.3, show that the first order representation of the fitted value $\hat{\mathbf{Y}}_R$ is given by expression (3.9.36). Next show that the approximate variance of the $i$th fitted case is given by expression (3.9.38).
3.16.29. Consider the mean shift model, (3.9.32). Show that the estimator of $\theta_i$ given by the numerator of expression (3.9.35) is based on the inversion of an aligned rank statistic to test the hypotheses (3.9.33).
3.16.30. Assume that the errors have a symmetric distribution. Verify expressions ( 3.9.41)
and ( 3.9.42).
3.16.31. Assume that the errors have the distribution $GF(2m_1, 2m_2)$.
(a). Show that the optimal rank score function is given by expression (3.10.6).
(b). Show that the asymptotic relative efficiency between the Wilcoxon analysis and the rank-based analysis based on the optimal scores for the distribution $GF(2m_1, 2m_2)$ is given by expression (3.10.8).
3.16.32. Suppose the errors have density function
$$f_{m_2}(x) = e^{-x}\left(1 + m_2^{-1}e^{-x}\right)^{-(m_2+1)}, \quad m_2 > 0,\ -\infty < x < \infty. \quad (3.16.5)$$
(a). Show that the optimal scores are given by expression (3.10.7).
(b). Show that the asymptotic relative efficiency of the Wilcoxon analysis to the rank analysis based on the optimal rank score function for the density (3.16.5) is given by expression (3.10.9).
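As a quick numerical aside (ours, not the text's): substituting $u = m_2^{-1}e^{-x}$ reduces $\int f_{m_2}$ to $\int_0^\infty m_2(1+u)^{-(m_2+1)}\,du = 1$, so (3.16.5) is indeed a density. A one-line check in Python (the log-scale evaluation is our choice, to avoid overflow in the left tail):

import numpy as np
from scipy.integrate import quad

def f(x, m2):
    # f_{m2}(x) evaluated on the log scale for numerical stability
    return np.exp(-x - (m2 + 1) * np.logaddexp(0.0, -x - np.log(m2)))

val, _ = quad(f, -np.inf, np.inf, args=(3.0,))
print(val)   # approximately 1.0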
3.16.33. The definition of the modulus of a matrix $\mathbf{A}$ is given in expression (3.11.6). Verify the three properties concerning the modulus of a matrix listed in the text following this definition.
3.16.34. Consider Example 3.11.1. If Wilcoxon scores are used, show that $D_y = \sqrt{3/4}\,E|Y_1 - Y_2|$, where $Y_1, Y_2$ are iid with distribution function $G$, and that $D_e = \sqrt{3/4}\,E|e_1 - e_2|$, where $e_1, e_2$ are iid with distribution function $F$. Next assume that sign scores are used. Show that $D_y = E|Y - \operatorname{med} Y|$, where $\operatorname{med} Y$ denotes the median of $Y$. Likewise $D_e = E|e - \operatorname{med} e|$.
3.16.35. In Example 3.11.3, show that the coefficients of multiple determination $R_1$ and $R_2$, given by expressions (3.11.27) and (3.11.28), respectively, are one-to-one functions of $R^2$ given by expression (3.11.11).
3.16.36. At the end of Example 3.11.3, verify, for Wilcoxon scores and sign scores, that $1/(2T^2) = \pi/6$ and $1/(2T^2) = \pi/4$, respectively.
3.16.37. In Example 3.11.4, show that the density of $Y$ is given by
$$g(y) = (1-\epsilon)\,\phi(y) + \frac{\epsilon}{\sqrt{1+\sigma_c^2}}\,\phi\!\left(\frac{y}{\sqrt{1+\sigma_c^2}}\right).$$
Using this, verify the expressions for $D_y$, $D_e$, and $\tau_\varphi$ found in the example.
3.16.38. For the baseball data given in Exercise 1.12.32, consider the variables height and weight.
(a). Obtain the scatterplot of height versus weight.
(b). Obtain the CMDs: $R^2$, $R_1$, $R_2$, $R_1^2$, and $R_2^2$.
3.16.39. Consider a linear model of the form
$$\mathbf{Y} = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{e}, \quad (3.16.6)$$
where $\mathbf{X}^*$ is $n \times p$ whose column space $\Omega_F$ does not include $\mathbf{1}$. This model is often called regression through the origin. Note for the pseudo-norm $\|\cdot\|_\varphi$ that
$$\|\mathbf{Y} - \mathbf{X}^*\boldsymbol{\beta}\|_\varphi = \sum_{i=1}^n a(R(y_i - \mathbf{x}_i^{*\prime}\boldsymbol{\beta}))(y_i - \mathbf{x}_i^{*\prime}\boldsymbol{\beta}) = \sum_{i=1}^n a(R(y_i - (\mathbf{x}_i^* - \bar{\mathbf{x}}^*)'\boldsymbol{\beta}))(y_i - (\mathbf{x}_i^* - \bar{\mathbf{x}}^*)'\boldsymbol{\beta}), \quad (3.16.7)$$
where $\mathbf{x}_i^{*\prime}$ is the $i$th row of $\mathbf{X}^*$ and $\bar{\mathbf{x}}^*$ is the vector of column averages of $\mathbf{X}^*$. Based on this result, the estimate of the regression coefficients based on the R-fit of Model (3.16.6) is estimating the regression coefficients of the centered model, i.e., the model with the design matrix $\mathbf{X} = \mathbf{X}^* - \mathbf{H}_1\mathbf{X}^*$. Hence, in general, the parameter $\boldsymbol{\beta}$ is not estimated. This problem also occurs in a weighted regression model. Dixon and McKean (1996) proposed the following solution. Assume that (3.16.6) is the true model, but obtain the R-fit of the model:
$$\mathbf{Y} = \mathbf{1}\alpha_1 + \mathbf{X}^*\boldsymbol{\beta}_1 + \mathbf{e} = [\mathbf{1}\ \mathbf{X}^*]\begin{pmatrix} \alpha_1 \\ \boldsymbol{\beta}_1 \end{pmatrix} + \mathbf{e}, \quad (3.16.8)$$
where the true $\alpha_1$ is 0. Let $\mathbf{X}_1 = [\mathbf{1}\ \mathbf{X}^*]$ and let $\Omega_1$ denote the column space of $\mathbf{X}_1$. Let $\hat{\mathbf{Y}}_1 = \hat{\alpha}_1\mathbf{1} + \mathbf{X}^*\hat{\boldsymbol{\beta}}_1$ denote the R-fitted value based on the fit of Model (3.16.8). Note that $\hat{\mathbf{Y}}_1 \in \Omega_1$. Let $\hat{\mathbf{Y}}^* = \mathbf{H}_{\Omega_F}\hat{\mathbf{Y}}_1$ be the projection of this fitted value onto the desired space $\Omega_F$. Finally, estimate $\boldsymbol{\beta}^*$ by solving the equation
$$\mathbf{X}^*\hat{\boldsymbol{\beta}}^* = \hat{\mathbf{Y}}^*. \quad (3.16.9)$$

(a). Show that $\hat{\boldsymbol{\beta}}^* = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\hat{\mathbf{Y}}_1$ is the solution of (3.16.9).

(b). Assume that the density function of the errors is symmetric, that the R-score function is odd about $1/2$, and that the intercept $\alpha_1$ is estimated by solving the equation $T^+(\hat{\mathbf{e}}_R - \alpha) \doteq 0$, as discussed in Section 3.5.2. Under these assumptions show that
$$\hat{\boldsymbol{\beta}}^* \text{ has an approximate } N(\boldsymbol{\beta}^*,\ \tau_\varphi^2(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}) \text{ distribution}. \quad (3.16.10)$$

(c). Next, suppose that the intercept is estimated by the median of the residuals from the R-fit of (3.16.8). Using the asymptotic representations (3.5.24) and (3.5.23), show that the asymptotic representation of $\hat{\boldsymbol{\beta}}^*$ is given by
$$\hat{\boldsymbol{\beta}}^* = \tau_S(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{H}_{\mathbf{1}}\operatorname{sgn}(\mathbf{e}) + \tau_\varphi(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{H}_{\mathbf{X}}\,\varphi[F(\mathbf{e})] + o_p(1/\sqrt{n}). \quad (3.16.11)$$
Use this result to show that the asymptotic variance of $\hat{\boldsymbol{\beta}}^*$ is given by
$$\operatorname{AsyVar}(\hat{\boldsymbol{\beta}}^*) = \tau_S^2(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{H}_{\mathbf{1}}\mathbf{X}^*(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1} + \tau_\varphi^2(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{H}_{\mathbf{X}}\mathbf{X}^*(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}. \quad (3.16.12)$$

(d). Show that the invariance to $\bar{\mathbf{x}}^*$ shown in (3.16.7) is true for any pseudo-norm.
3.16.40. The data in Table 3.16.1 are presented in Graybill and Iyer (1994). The dependent variable is the weight (in grams) of a crystalline form of a certain chemical compound and the independent variable is the length of time (in hours) that the crystal was allowed to grow. A model of interest is the regression through the origin model (3.16.6). Obtain the R-estimate of $\boldsymbol{\beta}^*$ for these data using the procedure described in (3.16.9). Compare this fit with the R-fit of the intercept model.
Table 3.16.1: Crystal Data for Exercise 3.16.40
Time (hours) 2 4 6 8 10 12 14
Weight (grams) 0.08 1.12 4.43 4.98 4.92 7.18 5.57
Time (hours) 16 18 20 22 24 26 28
Weight (grams) 8.40 8.881 10.81 11.16 10.12 13.12 15.04
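A sketch of the computation in Python follows (our own construction; the function names, the bounded one-dimensional search, and the use of a general-purpose optimizer are our choices for illustration, not the book's algorithms of Section 3.7). Step 1 obtains the Wilcoxon R-fit of the intercept model (3.16.8); step 2 projects the fitted values onto the column space of the single predictor and solves (3.16.9).

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import rankdata

time = np.arange(2, 30, 2)                       # 2, 4, ..., 28 hours
weight = np.array([0.08, 1.12, 4.43, 4.98, 4.92, 7.18, 5.57,
                   8.40, 8.881, 10.81, 11.16, 10.12, 13.12, 15.04])
n = len(weight)
a = lambda r: np.sqrt(12.0) * (r / (n + 1) - 0.5)    # standardized Wilcoxon scores

def dispersion(beta):
    # Jaeckel's dispersion for the slope; it is intercept-free
    e = weight - beta * time
    return np.sum(a(rankdata(e)) * e)

# Step 1: R-fit of the intercept model (3.16.8)
beta1 = minimize_scalar(dispersion, bounds=(0.0, 2.0), method="bounded").x
alpha1 = np.median(weight - beta1 * time)            # intercept = median of residuals
fit1 = alpha1 + beta1 * time                         # R-fitted value Yhat_1

# Step 2: project fit1 onto the column space of X* and solve (3.16.9)
beta_star = np.sum(time * fit1) / np.sum(time ** 2)
print(beta_star)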
3.16.41. The following lemma is helpful in a comparison of the efficiency between the R- and the GR-estimates.

Lemma 3.16.1. The matrix
$$\mathbf{B} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}^2\mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1} - (\mathbf{X}'\mathbf{X})^{-1}$$
is positive semi-definite.

Proof: Let $\mathbf{v}$ be any vector in $\mathbb{R}^p$. Since $\mathbf{X}'\mathbf{W}\mathbf{X}$ is non-singular, there exists a vector $\mathbf{u}$ such that $\mathbf{v} = \mathbf{X}'\mathbf{W}\mathbf{X}\mathbf{u}$. Hence, by the Pythagorean Theorem,
$$\mathbf{v}'\mathbf{B}\mathbf{v} = \|\mathbf{W}\mathbf{X}\mathbf{u}\|^2 - \|\mathbf{H}\mathbf{W}\mathbf{X}\mathbf{u}\|^2 \geq 0,$$
where $\mathbf{H}$ is the projection matrix onto the column space of $\mathbf{X}$.

Based on this lemma, it is easy to see that there is always a loss of efficiency when using the GR-estimates. As the examples below show, this loss can be severe. If the design matrix, though, has clusters of outlying points then this downweighting may be necessary.
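A quick numerical illustration of the lemma (ours): generate a design $\mathbf{X}$ and a diagonal matrix $\mathbf{W}$ of positive GR-type downweights, and confirm that $\mathbf{B}$ has no negative eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
W = np.diag(rng.uniform(0.1, 1.0, 20))           # positive downweights

XtWX_inv = np.linalg.inv(X.T @ W @ X)
B = XtWX_inv @ X.T @ W @ W @ X @ XtWX_inv - np.linalg.inv(X.T @ X)
print(np.linalg.eigvalsh(B).min() >= -1e-10)     # True: B is positive semi-definite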
3.16.42. Consider the pseudo-linear model (3.14.3) of Section 3.14. For the Wilcoxon pseudo-estimator, obtain the asymptotic result (3.14.6).
3.16.43. By filling in the brief sketch below, write out the Gauss-Newton algorithm for the Wilcoxon estimate of the nonlinear model (3.14.1).

Let $\hat{\boldsymbol{\beta}}^0$ be an initial estimate of $\boldsymbol{\beta}$. Let $\mathbf{f}^0 = \mathbf{f}(\hat{\boldsymbol{\beta}}^0)$. Write the norm to be minimized as $\|\mathbf{Y} - \mathbf{f}\|_W = \|\mathbf{Y} - \mathbf{f}^0 + [\mathbf{f}^0 - \mathbf{f}]\|_W$. Then use a Taylor series of order 1 to approximate the term in brackets. The increment for the next step estimate is the Wilcoxon estimator of this approximate linear model with $\mathbf{Y} - \mathbf{f}^0$ as the dependent variable. For actual implementation, discuss why the regression through the origin algorithm of Exercise 3.16.39 is usually necessary here.
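A minimal sketch of the iteration in Python (ours; we assume, purely for illustration, a one-parameter exponential model $y_i = \exp(-\beta x_i) + e_i$ on simulated data, and with a single parameter we sidestep the through-origin subtlety by direct one-dimensional dispersion minimization):

import numpy as np
from scipy.stats import rankdata
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = np.linspace(0.1, 5, 40)
y = np.exp(-0.7 * x) + 0.05 * rng.standard_normal(40)

def wilcoxon_slope(z, t):
    # Wilcoxon R-estimate of delta in z = t*delta + error
    n = len(z)
    a = lambda r: np.sqrt(12.0) * (r / (n + 1) - 0.5)
    D = lambda d: np.sum(a(rankdata(z - d * t)) * (z - d * t))
    return minimize_scalar(D, bounds=(-5, 5), method="bounded").x

beta = 0.5                                   # initial estimate beta^(0)
for _ in range(10):                          # a few steps; no convergence test, for brevity
    f0 = np.exp(-beta * x)                   # f(beta^(0))
    grad = -x * np.exp(-beta * x)            # order-1 Taylor term df/dbeta
    beta += wilcoxon_slope(y - f0, grad)     # Wilcoxon fit of the linearized model
print(beta)                                  # should settle near the true 0.7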
Figure 3.10.3: Panel A: Scatterplot of the insulating fluid data, Example 3.10.1 (log survival time versus log voltage), overlaid with the GF(2, 10) and LS fits; Panel B: Comparison boxplots of log breakdown times over the levels of voltage stress; Panel C: q-q plot of Wilcoxon-fit residuals versus logistic population quantiles for the full model (oneway layout); Panel D: q-q plot of GF(2, 10)-fit residuals versus GF(2, 10) population quantiles for the full model (oneway layout). [Figure not reproduced.]
[Figure, caption not recovered: Panel A (Linear Model) and Panel B (Quadratic Model) plot log light intensity versus log temperature with Wilcoxon (Wil), LS, HBR, and GR fits overlaid; Panel C shows the GR residuals and Panel D the HBR residuals from the quadratic model.]
Figure 3.12.2: Panel A: Influence function for the HBR estimate; Panel B: Influence function for the Wilcoxon estimate; Panel C: Influence function for the GR estimate. [Figure not reproduced.]
Figure 3.12.1: Panel A: For the quadratic data of Example 3.12.4, scatterplot of the data overlaid by the Wilcoxon, HBR, and LMS fits; Panel B: Studentized residual plot based on the Wilcoxon fit; Panel C: Studentized residual plot based on the HBR fit; Panel D: Residual plot based on the LMS fit. [Figure not reproduced.]
Figure 3.13.1: Plots for the bonds data. Panel A: Bid price versus coupon rate; Panel B: LS studentized residuals versus the LS fit; Panels C and D: Wilcoxon studentized residuals versus the Wilcoxon fit. [Figure not reproduced.]
Figure 3.13.2: Panel A: LS residual plot of the Hawkins data; Panel B: Wilcoxon residual plot; Panel C: HBR residual plot; Panel D: CFITS(Wilcoxon, HBR) by case, with TDBETAS(W, HBR) = 1324. [Figure not reproduced.]
Figure 3.14.1: Analysis of Chwirut's data. Panel (a): Wilcoxon residuals versus predictor; Panel (b): LS residuals versus predictor; Panel (c): Wilcoxon and LS fits, original data; Panel (d): Wilcoxon residuals versus predictor, with outlier; Panel (e): LS residuals versus predictor, with outlier; Panel (f): Wilcoxon and LS fits, with outlier. [Figure not reproduced.]
Chapter 4
Experimental Designs: Fixed Effects
4.1 Introduction
In this chapter we discuss rank-based inference for experimental designs based on the theory developed in Chapter 3. We concentrate on factorial-type designs and analysis of covariance designs but, based on our discussion, it will be clear how to extend the rank-based analysis to any fixed effects design. For example, based on this rank-based inference, Vidmar and McKean (1996) developed a response surface methodology which is quite analogous to the traditional response surface methods. We discuss estimation of effects, tests of linear hypotheses concerning effects, and multiple comparison procedures. We illustrate this rank-based inference with numerous examples. One purpose of our discussion is to show how this rank-based analysis is analogous to the traditional analysis based on least squares. In Section 4.2.5 we introduce pseudo-observations, which are based on an R-fit of the full model. We show that the rank-based analysis (Wald type) can be obtained by substituting these pseudo-observations in place of the responses in a package that obtains the traditional analysis. We begin with the oneway design.

In our development we apply rank scores to residuals. In this sense our methods are not pure rank statistics; but they do provide consistent and highly efficient tests for traditional linear hypotheses. The rank transform method is a pure rank test, and it is discussed in Section 4.7, where we describe various drawbacks to the approach for testing traditional linear hypotheses in linear models. Brunner and his colleagues have successfully developed a general approach to testing in designed experiments based on pure ranks, although the hypotheses of their approach are generally not linear hypotheses. Brunner and Puri (1996) provide an excellent survey of these pure rank tests. We will not pursue them further in this book.

While we only consider linear models in this chapter, there have been extensions of robust inference to other models. For example, Vidmar, McKean and Hettmansperger (1992) extended this robust inference to generalized linear models for quantal responses in the context of drug combination problems, and Li (1991) discussed rank procedures for a logistic model. Stefanski, Carroll and Ruppert (1986) discussed generalized M-estimates for
generalized linear models.
4.2 Oneway Design
Suppose we want to determine the effect that a single factor $A$ has on a response of interest over a specified population. Assume that $A$ has $k$ levels, each level being referred to as a treatment group. In this situation, the completely randomized design is often used to investigate the effect of $A$. For this design, $n$ subjects are selected at random from the population of interest and $n_i$ of these subjects are randomly assigned to level $i$ of $A$, for $i = 1, \ldots, k$. Let $Y_{ij}$ denote the response of the $j$th subject in the $i$th level of $A$. We will assume that the responses are independent of one another and that the distributions among levels differ by at most shifts in location. Although the randomization gives some credence to the assumption of independence, after fitting the model a residual analysis should be conducted to check this assumption and the assumption that the level distributions differ by at most a shift in locations.
Under these assumptions, the full model can be written as
$$Y_{ij} = \mu_i + e_{ij}, \quad j = 1, \ldots, n_i,\ i = 1, \ldots, k, \quad (4.2.1)$$
where the $e_{ij}$'s are iid random variables with density $f(x)$ and distribution function $F(x)$, and the parameter $\mu_i$ is a convenient location parameter (for example, the mean or median). Let $T(F)$ denote the location functional. Assume, without loss of generality, that $T(F) = 0$. Let $\Delta_{ii'}$ denote the shift between the distributions of $Y_{ij}$ and $Y_{i'l}$. Recall from Chapter 2 that the parameters $\Delta_{ii'}$ are invariant to the choice of location functional and that $\Delta_{ii'} = \mu_i - \mu_{i'}$. If $\mu_i$ is the mean of the $Y_{ij}$ then Hocking (1985) calls this the means model. If $\mu_i$ is the median of the $Y_{ij}$ then we will call it the medians model; see Section 4.2.4 below.
Observational studies can also be modeled this way. Suppose $k$ independent samples are drawn from $k$ different populations. If we assume further that the distributions for the different populations differ by at most a shift in locations, then Model (4.2.1) is appropriate. But as in all observational studies, care must be taken in the interpretation of the results of the analyses.
While the parameters $\mu_i$ fix the locations, the parameters of interest in this chapter are contrasts of the form $h = \sum_{i=1}^k c_i\mu_i$ where $\sum_{i=1}^k c_i = 0$. Similar to the shift parameters, contrasts are invariant to the choice of location functional. In fact, contrasts are linear functions of these shifts; i.e.,
$$h = \sum_{i=1}^k c_i\mu_i = \sum_{i=1}^k c_i(\mu_i - \mu_1) = \sum_{i=2}^k c_i\Delta_{i1} = \mathbf{c}_1'\boldsymbol{\Delta}_1, \quad (4.2.2)$$
where $\mathbf{c}_1' = (c_2, \ldots, c_k)$ and
$$\boldsymbol{\Delta}_1' = (\Delta_{21}, \ldots, \Delta_{k1}) \quad (4.2.3)$$
is the vector of location shifts from the first cell. In order to easily reference the theory of Chapter 3, we will often use $\boldsymbol{\Delta}_1$, which references cell 1. But picking cell 1 is only for convenience, and similar results hold for the selection of any other cell.

As in Chapter 2, we can write this model in terms of a linear model as follows. Let $\mathbf{Z}' = (Y_{11}, \ldots, Y_{1n_1}, \ldots, Y_{k1}, \ldots, Y_{kn_k})$ denote the vector of all observations, $\boldsymbol{\mu}' = (\mu_1, \ldots, \mu_k)$ denote the vector of locations, and $n = \sum n_i$ denote the total sample size. The model can then be expressed as a linear model of the form
$$\mathbf{Z} = \mathbf{W}\boldsymbol{\mu} + \mathbf{e}, \quad (4.2.4)$$
where $\mathbf{e}$ denotes the $n \times 1$ vector of random errors $e_{ij}$ and the $n \times k$ design matrix $\mathbf{W}$ is the appropriate incidence matrix of 0s and 1s; i.e.,
$$\mathbf{W} = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0} \\ \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_k} \end{pmatrix}. \quad (4.2.5)$$
Note that the vector $\mathbf{1}_n$ is in the column space of $\mathbf{W}$; hence, the theory derived in Chapter 3 is valid for this model.
At times it will be convenient to reparameterize the model in terms of a vector of shift parameters. For the vector $\boldsymbol{\Delta}_1$, let $\mathbf{W}_1$ denote the last $k-1$ columns of $\mathbf{W}$ and let $\mathbf{X}$ be the centered $\mathbf{W}_1$; i.e., $\mathbf{X} = (\mathbf{I} - \mathbf{H}_1)\mathbf{W}_1$, where $\mathbf{H}_1 = \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' = n^{-1}\mathbf{1}\mathbf{1}'$ and $\mathbf{1}' = (1, \ldots, 1)$. Then we can write Model (4.2.4) as
$$\mathbf{Z} = \alpha\mathbf{1} + \mathbf{X}\boldsymbol{\Delta}_1 + \mathbf{e}, \quad (4.2.6)$$
where $\boldsymbol{\Delta}_1$ is given in expression (4.2.3). It is easy to show, for any matrix $[\mathbf{1}\ \mathbf{X}^*]$ having the same column space as $\mathbf{W}$, that its corresponding non-intercept parameters are linear functions of the shifts and, hence, are invariant to the selected location functional. The relationship between Models (4.2.4) and (4.2.6) will be explored further in Section 4.2.4.
4.2.1 R-Fit of the Oneway Design
Note that the sum of all the column vectors of $\mathbf{W}$ equals the vector of ones $\mathbf{1}_n$. Thus $\mathbf{1}_n$ is in the column space of $\mathbf{W}$ and we can fit Model (4.2.4) by using the R-estimates discussed in Chapter 3. In this chapter we assume that a specified score function, $a(i) = \varphi(i/(n+1))$, has been chosen which, without loss of generality, has been standardized; recall (S.1), (3.4.10). A convenient way of obtaining the fit is by the QR-decomposition algorithm on the incidence matrix $\mathbf{W}$; see Section 3.7.3.

For the R fits used in the examples of this chapter, we will use the cell median model; that is, Model (4.2.4) with $T(F) = 0$, where $T$ denotes the median functional and $F$ denotes
the distribution of the random errors $e_i$. We will use the score function $\varphi(u)$ to obtain the R fit of this model. Let $\mathbf{X}\hat{\boldsymbol{\Delta}}_1$ denote the fitted value. As discussed in Chapter 3, $\mathbf{X}\hat{\boldsymbol{\Delta}}_1$ lies in the column space of the centered matrix $\mathbf{X} = (\mathbf{I} - \mathbf{H}_1)\mathbf{W}_1$. We then estimate the intercept as
$$\hat{\alpha}_S = \operatorname{med}_{1 \le i \le n}\left\{Z_i - \mathbf{x}_i'\hat{\boldsymbol{\Delta}}_1\right\}, \quad (4.2.7)$$
where $\mathbf{x}_i'$ is the $i$th row of $\mathbf{X}$. The final fitted value and the residuals are, respectively,
$$\hat{\mathbf{Z}} = \hat{\alpha}_S\mathbf{1} + \mathbf{X}\hat{\boldsymbol{\Delta}}_1 \quad (4.2.8)$$
$$\hat{\mathbf{e}} = \mathbf{Z} - \hat{\mathbf{Z}}. \quad (4.2.9)$$
Note that $\hat{\mathbf{Z}}$ lies in the column space of $\mathbf{W}$ and that, further, $T(\hat{F}_n) = 0$, where $\hat{F}_n$ denotes the empirical distribution function of the residuals and $T$ is the median location functional.
Denote the fitted value of the response $Y_{ij}$ as $\hat{Y}_{ij}$. Given $\hat{\mathbf{Z}}$, we find from (4.2.4) that $\hat{\boldsymbol{\mu}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\hat{\mathbf{Z}}$. Because $\mathbf{W}$ is an incidence matrix, the estimate of $\mu_i$ is the common fitted value of the $i$th cell which, for future reference, is given by
$$\hat{\mu}_i = \hat{Y}_{ij}, \quad (4.2.10)$$
for any $j = 1, \ldots, n_i$. In the examples below, we shall denote the R fit described in this paragraph by stating that the model was fit using Wilcoxon scores and the residuals were adjusted to have median zero.
It follows from Section 3.5.2 that $\hat{\boldsymbol{\mu}}$ is asymptotically normal with mean $\boldsymbol{\mu}$. To do inference based on these estimates of $\mu_i$ we need their standard errors, but these can be obtained immediately from the variance of the fitted values given by expression (3.9.38). First note that the leverage value for an observation in the $i$th cell is $1/n_i$ and, hence, the leverage value for the centered design is $h_{c,i} = h_i - n^{-1} = (n - n_i)/(nn_i)$. Therefore, by (3.9.38), the approximate variance of $\hat{\mu}_i$ is given by
$$\operatorname{Var}(\hat{\mu}_i) \doteq \frac{1}{n_i}\tau_\varphi^2 + \frac{1}{n}\left(\tau_S^2 - \tau_\varphi^2\right), \quad i = 1, \ldots, k; \quad (4.2.11)$$
see Exercise 4.8.18.
Let $\hat{\tau}_\varphi$ and $\hat{\tau}_S$ denote, respectively, the estimates of $\tau_\varphi$ and $\tau_S$ presented in Section 3.7.1. The estimated approximate variance of $\hat{\mu}_i$ is expression (4.2.11) with these estimates in place of $\tau_\varphi$ and $\tau_S$. Define the minimum value of the dispersion function as DE; i.e.,
$$DE = D(\hat{\mathbf{e}}) = \sum_{i=1}^k\sum_{j=1}^{n_i} a(R(\hat{e}_{ij}))\hat{e}_{ij}. \quad (4.2.12)$$
The symbol DE stands for the dispersion of the errors and is analogous to the LS sum of squared errors, SSE. Upon fitting such a model, a residual analysis as discussed in Section 3.9 should be conducted to assess the goodness of fit of the model.
Example 4.2.1. LDL Cholesterol of Quail
Table 4.2.1: Data for Example 4.2.1.
Drug LDL Cholesterol
I 52 67 54 69 116 79 68 47 120 73
II 36 34 47 125 30 31 30 59 33 98
III 52 55 66 50 58 176 91 66 61 63
IV 62 71 41 118 48 82 65 72 49
Table 4.2.2: Estimates of Location Levels for the Quail Data
Drug Wilcoxon Fit LS Fit
Compound Est. SE Est. SE
I 67.0 6.3 74.5 9.6
II 42.0 6.3 52.3 9.6
III 63.0 6.3 73.8 9.6
IV 62.0 6.6 67.6 10.1
Thirty-nine quail were randomly assigned to four diets, each diet containing a different drug compound which, hopefully, would reduce low density lipid (LDL) cholesterol. The drug compounds are labeled I, II, III, and IV. At the end of the prescribed experimental time the LDL cholesterol of each quail was measured. The data are displayed in Table 4.2.1. From the boxplots, Panel A of Figure 4.2.1, it appears that Drug Compound II was more effective than the other three in lowering LDL. The data appear to be positively skewed with a long right tail. We fitted the data using Wilcoxon scores, $\varphi(u) = \sqrt{12}(u - 1/2)$, and adjusted the residuals to have median 0. Panel B of Figure 4.2.1 displays the Wilcoxon residuals versus fitted values. The long right tail of the error distribution is apparent from this plot. The lower panels of Figure 4.2.1 involve the internal R-studentized residuals, (3.9.31), with the benchmarks $\pm 2$. The internal R-studentized residuals detected six outlying data points, while the normal q-q plot of these residuals clearly shows the skewness.

The estimates of $\tau_\varphi$ and $\tau_S$ are 19.19 and 21.96, respectively. For comparison, the LS estimate of $\sigma$ was 30.49. Table 4.2.2 displays the Wilcoxon and LS estimates of the cell locations along with their standard errors. The Wilcoxon and LS estimates of the location levels are quite different, as they should be, since they estimate different functionals under asymmetric errors. The long right tail has drawn out the LS estimates. The standard errors of the Wilcoxon estimates are much smaller than their LS counterparts.

This data set was taken from a much larger study discussed in McKean, Vidmar and Sievers (1989). Most of the data in that study exhibited long right tails. The left tails were also long; hence, transformations such as logarithms were not effective. Scores more appropriate for positively skewed data were used with considerable success in this study. These scores are briefly discussed in Example 2.5.1.
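A sketch of the Section 4.2.1 R-fit in Python (our own construction, not the book's software): Jaeckel's dispersion is minimized over $\boldsymbol{\Delta}_1$ with a crude general-purpose optimizer, the intercept is the median of the residuals, per (4.2.7), and the cell estimates follow from (4.2.10). Applied to the quail data of Table 4.2.1 with Wilcoxon scores, the printed estimates should land near the Wilcoxon column of Table 4.2.2.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

y = np.array([52,67,54,69,116,79,68,47,120,73,      # drug I
              36,34,47,125,30,31,30,59,33,98,       # drug II
              52,55,66,50,58,176,91,66,61,63,       # drug III
              62,71,41,118,48,82,65,72,49])         # drug IV
level = np.repeat([0, 1, 2, 3], [10, 10, 10, 9])
n, k = len(y), 4
W1 = (level[:, None] == np.arange(1, k)).astype(float)   # last k-1 columns of W
X = W1 - W1.mean(axis=0)                                 # centered design

def dispersion(delta):
    e = y - X @ delta
    a = np.sqrt(12.0) * (rankdata(e) / (n + 1) - 0.5)    # Wilcoxon scores
    return np.sum(a * e)

delta = minimize(dispersion, np.zeros(k - 1), method="Nelder-Mead").x
alpha = np.median(y - X @ delta)                         # (4.2.7)
fitted = alpha + X @ delta                               # (4.2.8)
mu_hat = [fitted[level == i][0] for i in range(k)]       # (4.2.10)
print(np.round(mu_hat, 1))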
Figure 4.2.1: Panel A: Comparison boxplots for the data of Example 4.2.1 (LDL cholesterol by drug compound); Panel B: Wilcoxon residual plot; Panel C: Wilcoxon internal R-studentized residual plot; Panel D: Wilcoxon internal R-studentized residual normal q-q plot. [Figure not reproduced.]
4.2.2 Rank-Based Tests of $H_0\colon \mu_1 = \cdots = \mu_k$

Consider Model (4.2.4). A hypothesis of interest in the oneway design is that there are no differences in the levels of $A$; i.e.,
$$H_0\colon \mu_1 = \cdots = \mu_k \quad \text{versus} \quad H_1\colon \mu_i \neq \mu_{i'} \text{ for some } i \neq i'. \quad (4.2.13)$$
Define the $(k-1) \times k$ matrix $\mathbf{M}$ as
$$\mathbf{M} = \begin{pmatrix} -1 & 1 & 0 & 0 & \cdots & 0 \\ -1 & 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ -1 & 0 & 0 & 0 & \cdots & 1 \end{pmatrix}. \quad (4.2.14)$$
Then $\mathbf{M}\boldsymbol{\mu} = \boldsymbol{\Delta}_1$, (4.2.3), and, hence, $H_0$ is equivalent to $\mathbf{M}\boldsymbol{\mu} = \mathbf{0}$. Note that the rows of $\mathbf{M}$ form $k-1$ linearly independent contrasts in the vector $\boldsymbol{\mu}$. If the design matrix given in (4.2.6) is used, then the null hypothesis is simply $H_0\colon \mathbf{I}_{k-1}\boldsymbol{\Delta}_1 = \mathbf{0}$; that is, all the regression coefficients are zero. We shall discuss two rank-based tests for this hypothesis.
One appropriate test statistic is the gradient test statistic, (3.5.8), which is given by
$$T_\varphi = \hat{\sigma}_a^{-2}\,\mathbf{S}(\mathbf{Z})'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{S}(\mathbf{Z}), \quad (4.2.15)$$
where $\mathbf{S}(\mathbf{Z})' = (S_2(\mathbf{Z}), \ldots, S_k(\mathbf{Z}))$ for
$$S_i(\mathbf{Z}) = \sum_{j=1}^{n_i} a(R(Z_{ij})), \quad (4.2.16)$$
and, as defined in Theorem 3.5.1,
$$\hat{\sigma}_a^2 = (n-1)^{-1}\sum_{i=1}^n a^2(i). \quad (4.2.17)$$
Based on Theorem 3.5.2, a level $\alpha$ test for $H_0$ versus $H_1$ is:
$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } T_\varphi \geq \chi^2(\alpha, k-1), \quad (4.2.18)$$
where $\chi^2(\alpha, k-1)$ denotes the upper level $\alpha$ critical value of the $\chi^2$-distribution with $k-1$ degrees of freedom. Because the design matrix $\mathbf{X}$ of Model (4.2.6) is an incidence matrix, the gradient test simplifies. First note that
the gradient test simplies. First note that
(X

X)
1
=
1
n
1
J + diag
_
1
n
2
, . . . ,
1
n
k
_
, (4.2.19)
where J is a (k 1) (k 1) matrix of ones; see Exercise 4.8.1. Since the scores sum to 0,
we have that S(Z)

1
k1
= S
1
(Z). Upon combining these results, the gradient test statistic
simplies to
T

=
2
a
S(Z)

(X

X)
1
S(Z) =
2
a
k

i=1
1
n
i
S
2
i
(Z) . (4.2.20)
Table 4.2.3: Analysis of Dispersion Table for the Hypotheses (4.2.13)

Source   D = Dispersion   df      MD           F
A        RD               $k-1$   RD/$(k-1)$   $F_\varphi$
Error                     $n-k$   $\hat{\tau}_\varphi/2$
For Wilcoxon scores further simplification is possible. In this case
$$S_i(\mathbf{Y}) = \sum_{j=1}^{n_i}\sqrt{12}\left(\frac{R(Y_{ij})}{n+1} - \frac{1}{2}\right) = \frac{\sqrt{12}}{n+1}\,n_i\left(\overline{R}_i - \frac{n+1}{2}\right), \quad (4.2.21)$$
where $\overline{R}_i$ denotes the average of the ranks from sample $i$. Also, for Wilcoxon scores, $\hat{\sigma}_a^2 = n/(n+1)$. Thus the test statistic for Wilcoxon scores is given by
$$H_W = \frac{12}{n(n+1)}\sum_{i=1}^k n_i\left(\overline{R}_i - \frac{n+1}{2}\right)^2. \quad (4.2.22)$$
This is the Kruskal-Wallis (1952) test statistic. It is distribution free under $H_0$. In the case of two levels, the Kruskal-Wallis test is equivalent to the MWW test discussed in Chapter 2; see Exercise 4.8.2. From the discussion on efficiency, Section 3.5, the efficiency results for the Kruskal-Wallis test are the same as for the MWW.
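As a check on (4.2.22), here is a direct computation of $H_W$ in Python (ours) on the quail data of Example 4.2.1, compared against scipy's implementation, which differs only by its tie correction:

import numpy as np
from scipy.stats import rankdata, kruskal

samples = [np.array([52,67,54,69,116,79,68,47,120,73]),
           np.array([36,34,47,125,30,31,30,59,33,98]),
           np.array([52,55,66,50,58,176,91,66,61,63]),
           np.array([62,71,41,118,48,82,65,72,49])]
y = np.concatenate(samples)
n = len(y)
ranks = rankdata(y)                      # combined (mid)ranks
HW, start = 0.0, 0
for s in samples:                        # sum of n_i (Rbar_i - (n+1)/2)^2 over levels
    ni = len(s)
    Rbar = ranks[start:start + ni].mean()
    HW += ni * (Rbar - (n + 1) / 2) ** 2
    start += ni
HW *= 12 / (n * (n + 1))
print(HW, kruskal(*samples).statistic)   # agree up to scipy's tie correction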
As a second rank-based test, we briefly discuss the drop in dispersion test for $H_0$ versus $H_1$ given by expression (3.6.12). Under the null hypothesis, the underlying distributions of the $k$ levels of $A$ are the same; hence, the reduced model is
$$Y_{ij} = \alpha + e_{ij}, \quad (4.2.23)$$
where $\alpha$ is a common location functional. Thus there are no parameters to fit in this case and the reduced model dispersion is
$$DT = D(\mathbf{Y}) = \sum_{i=1}^k\sum_{j=1}^{n_i} a(R(Y_{ij}))Y_{ij}. \quad (4.2.24)$$
The symbol DT denotes the total dispersion in the problem, which is analogous to the classical LS total variation, SST. Hence the reduction in dispersion is $RD = DT - DE$, where DE is defined in expression (4.2.12), and the drop in dispersion test is given by $F_\varphi = (RD/(k-1))/(\hat{\tau}_\varphi/2)$. As discussed in Section 3.6, this should be compared with F-critical values having $k-1$ and $n-k$ degrees of freedom. The analysis can be summarized in an analysis of dispersion table of the form given in Table 4.2.3.

Because the Kruskal-Wallis test is a gradient test, the drop in dispersion test and the Kruskal-Wallis test have the same asymptotic efficiency; see Section 3.6. The third test discussed in that section, the Wald type test, will be discussed below, for this hypothesis, in Section 4.2.5.
Example 4.2.2. LDL Cholesterol of Quail, Example 4.2.1 continued

For the hypothesis of no difference among the locations of the cholesterol levels of the drug compounds, hypotheses (4.2.13), the results of the LS F-test, the Kruskal-Wallis test, and the drop in dispersion test can be found in Table 4.2.4. The long right tail of the errors spoiled the LS test statistic. Using it, one would conclude that there is no significant difference among the drug compounds, which is inconsistent with the boxplots in Figure 4.2.1. On the other hand, both robust procedures detect the differences among the drug compounds, especially the drop in dispersion test statistic.
Table 4.2.4: Tests of Hypotheses (4.2.13) for the Quail Data

Procedure                 Test Statistic   Scale ($\hat{\sigma}$ or $\hat{\tau}_\varphi$)   df        p-value
LS, $F_{LS}$              1.14             30.5                                             (3, 35)   .35
Drop Disp., $F_\varphi$   3.77             19.2                                             (3, 35)   .02
Kruskal-Wallis            7.18                                                              3         .067
4.2.3 Tests of General Contrasts

As discussed above, the parameters and hypotheses of interest for Model (4.2.4) can usually be defined in terms of contrasts. In this section we discuss R-estimates and tests of contrasts. We will apply these results to more complicated designs in the remainder of the chapter.

For Model (4.2.4), consider general linear hypotheses of the form
$$H_0\colon \mathbf{M}\boldsymbol{\mu} = \mathbf{0} \quad \text{versus} \quad H_A\colon \mathbf{M}\boldsymbol{\mu} \neq \mathbf{0}, \quad (4.2.25)$$
where $\mathbf{M}$ is a $q \times k$ matrix of contrasts (rows sum to 0) of full row rank. Since $\mathbf{M}$ is a matrix of contrasts, the hypothesis $H_0$ is invariant to the intercept and, hence, can be tested by the R-test statistic discussed in Section 3.6. To obtain the test based on the reduction of dispersion, $F_\varphi$, discussed in Section 3.6, we need to fit the reduced model, which is Model (4.2.4) subject to $H_0$. Let the minimum value of the dispersion function for the reduced-model fit be denoted $D_{red}$, and let $RD = D_{red} - DE$ denote the reduction in dispersion. Note that RD is analogous to the reduction in sums of squares of the traditional LS analysis. The test statistic is given by $F_\varphi = (RD/q)/(\hat{\tau}_\varphi/2)$. As discussed in Chapter 3, this statistic should be compared with F-critical values having $q$ and $n-k$ degrees of freedom. The test can be summarized in the analysis of dispersion table found in Table 4.2.5, which is analogous to the traditional analysis of variance table for summarizing a LS analysis.
Example 4.2.3. Poland China Pigs

This data set, presented on page 87 of Scheffé (1959), concerns the birth weights of Poland China pigs in eight litters. For convenience we have tabled the data in Table 4.2.6. There are 56 pigs in the eight litters. The sample sizes of the litters vary from 4 to 10.
Table 4.2.5: Analysis of Dispersion Table for $H_0\colon \mathbf{M}\boldsymbol{\mu} = \mathbf{0}$

Source                                      D = Dispersion   df      MD             F
$\mathbf{M}\boldsymbol{\mu} = \mathbf{0}$   RD               $q$     MRD = RD/$q$   $F_\varphi$
Error                                                        $n-k$   $\hat{\tau}_\varphi/2$
Table 4.2.6: Birth weights of Poland China Pigs by litter.
Litter Birth Weight
1 2.0 2.8 3.3 3.2 4.4 3.6 1.9 3.3 2.8 1.1
2 3.5 2.8 3.2 3.5 2.3 2.4 2.0 1.6
3 3.3 3.6 2.6 3.1 3.2 3.3 2.9 3.4 3.2 3.2
4 3.2 3.3 3.2 2.9 3.3 2.5 2.6 2.8
5 2.6 2.6 2.9 2.0 2.0 2.1
6 3.1 2.9 3.1 2.5
7 2.6 2.2 2.2 2.5 1.2 1.2
8 2.5 2.4 3.0 1.5
In Exercise 4.8.3 a residual analysis is conducted of this data set and the hypothesis (4.2.13) is tested. Here we are only concerned with the following contrast suggested by Scheffé. Assume that litters 1, 3, and 4 were sired by one boar while the other litters were sired by another boar. The contrast of interest is that the average litter birthweights of the pigs sired by the boars are the same; i.e., $H_0\colon h = 0$, where
$$h = \frac{1}{3}(\mu_1 + \mu_3 + \mu_4) - \frac{1}{5}(\mu_2 + \mu_5 + \mu_6 + \mu_7 + \mu_8). \quad (4.2.26)$$
For this hypothesis, the matrix $\mathbf{M}$ of expression (4.2.25) is given by $[5\ {-3}\ 5\ 5\ {-3}\ {-3}\ {-3}\ {-3}]$. The value of the LS F-test statistic is 11.19, while $F_\varphi = 15.65$. There are 1 and 48 degrees of freedom for this hypothesis, so both tests are highly significant. Hence both tests indicate a difference in the average litter birthweights of the boars. The reason $F_\varphi$ is more significant than $F_{LS}$ is clear from the residual analysis found in Exercise 4.8.3.
4.2.4 More on Estimation of Contrasts and Location

In this section we further explore the relationship between Models (4.2.4) and (4.2.6). This will enable us to formulate the contrast procedure based on pseudo-observations discussed in Section 4.2.5. Recall that the design matrix $\mathbf{X}$ of expression (4.2.6) is a centered design matrix based on the last $k-1$ columns of the design matrix $\mathbf{W}$ of expression (4.2.4). To determine the relationship between the parameters of these models, we simply match them by location parameter for each level. For Model (4.2.4), the location parameter for level $i$ is of course $\mu_i$. In terms of Model (4.2.6), the location parameter for the first level is $\alpha - \sum_{j=2}^k \frac{n_j}{n}\Delta_{j1}$ and that of the $i$th level is $\alpha - \sum_{j=2}^k \frac{n_j}{n}\Delta_{j1} + \Delta_{i1}$. Hence, letting $\bar{\Delta} = \sum_{j=2}^k \frac{n_j}{n}\Delta_{j1}$,
Table 4.2.7: All Pairwise 95% Confidence Intervals for the Quail Data Based on the Wilcoxon Fit

Difference        Estimate   Confidence Interval
$\mu_2 - \mu_1$   -25.0      (-42.7, -7.8)
$\mu_2 - \mu_3$   -21.0      (-38.6, -3.8)
$\mu_2 - \mu_4$   -20.0      (-37.8, -2.0)
$\mu_1 - \mu_3$     4.0      (-13.41, 21.41)
$\mu_1 - \mu_4$     5.0      (-12.89, 22.89)
$\mu_3 - \mu_4$     1.0      (-16.89, 18.89)
we can write the vector of level locations as
$$\boldsymbol{\mu} = (\alpha - \bar{\Delta})\mathbf{1} + (0, \boldsymbol{\Delta}_1')', \quad (4.2.27)$$
where $\boldsymbol{\Delta}_1$ is defined in expression (4.2.3).
Let $\mathbf{h} = \mathbf{M}\boldsymbol{\mu}$ be a $q \times 1$ vector of contrasts of interest (i.e., the rows of $\mathbf{M}$ sum to 0). Write $\mathbf{M}$ as $[\mathbf{m}\ \mathbf{M}_1]$. Then by (4.2.27) we have
$$\mathbf{h} = \mathbf{M}\boldsymbol{\mu} = \mathbf{M}_1\boldsymbol{\Delta}_1. \quad (4.2.28)$$
By Corollary 3.5.1, $\hat{\boldsymbol{\Delta}}_1$ has an asymptotic $N(\boldsymbol{\Delta}_1, \tau_\varphi^2(\mathbf{X}'\mathbf{X})^{-1})$ distribution. Hence, based on expression (4.2.28), the asymptotic variance-covariance matrix of the estimate $\mathbf{M}\hat{\boldsymbol{\mu}}$ is
$$\boldsymbol{\Sigma}_{\mathbf{h}} = \tau_\varphi^2\,\mathbf{M}_1(\mathbf{X}'\mathbf{X})^{-1}\mathbf{M}_1'. \quad (4.2.29)$$
Note that the only difference for the LS-fit is that $\sigma^2$ would be substituted for $\tau_\varphi^2$. Expressions (4.2.28) and (4.2.29) are the basic relationships used by the pseudo-observations discussed in Section 4.2.5.
Section 4.2.5.
To illustrate these relationships, suppose we want a condence interval for
i

i
. Based
on expression ( 4.2.29), an asymptotic (1 )100% condence interval is given by,

i

i
t
(/2,nk)

_
1
n
i
+
1
n
i

; (4.2.30)
i.e., same as LS except

replaces .
Example 4.2.4. LDL Cholesterol of Quail, Example 4.2.1 continued

To illustrate the above confidence intervals, Table 4.2.7 displays the six pairwise confidence intervals among the four drug compounds. On the basis of these intervals, Drug Compound 2 seems best. This conclusion, though, is based on six simultaneous confidence intervals, and the problem of overall confidence in these intervals needs to be addressed. This is discussed in some detail in Section 4.3, at which time we will return to this example.
Medians Model

Suppose we are interested in estimates of the level locations themselves. We first need to select a location functional. For the discussion we will use the median; although, for any other functional, only a change of the scale parameter $\tau_S$ is necessary. Assume then that the R-residuals have been adjusted so that their median is zero. As discussed above, (4.2.10), the estimate of $\mu_i$ is $\hat{Y}_{ij}$, for any $j = 1, \ldots, n_i$, where $\hat{Y}_{ij}$ is the fitted value of $Y_{ij}$. Let $\hat{\boldsymbol{\mu}} = (\hat{\mu}_1, \ldots, \hat{\mu}_k)'$. Further, $\hat{\boldsymbol{\mu}}$ is asymptotically normal with mean $\boldsymbol{\mu}$, and the asymptotic variance of $\hat{\mu}_i$ is given in expression (4.2.11). As Exercise 4.8.4 shows, the asymptotic covariance between estimates of location levels is
$$\operatorname{cov}(\hat{\mu}_i, \hat{\mu}_{i'}) = (\tau_S^2 - \tau_\varphi^2)/n, \quad (4.2.31)$$
for $i \neq i'$. As Exercises 4.8.4 and 4.8.18 show, expressions (3.9.38) and (4.2.31) lead to a verification of the confidence interval (4.2.30).

Note that if the scale parameters are the same, say $\tau_S = \tau_\varphi = \tau$, then the approximate variance reduces to $\tau^2/n_i$ and the covariances are 0. Hence, in this case, the estimates $\hat{\mu}_i$ are asymptotically independent. This occurs in the following two ways:

1. For the fit of Model (4.2.4) use a score function $\varphi(u)$ which satisfies (S2) and use the location functional based on the corresponding signed-rank score function $\varphi^+(u) = \varphi((u+1)/2)$. The asymptotic theory, though, requires the assumption of symmetric errors. If the Wilcoxon score function is used, then the location functional would result in the residuals being adjusted so that the median of the Walsh averages of the adjusted residuals is 0.

2. Use the $l_1$ score function $\varphi_S(u) = \operatorname{sgn}(u - (1/2))$ to fit Model (4.2.4) and use the median as the location functional. This of course is equivalent to using an $l_1$ fit on Model (4.2.4). The estimate of $\mu_i$ is then the cell median.
4.2.5 Pseudo-observations

We next discuss a convenient way to estimate and test contrasts once an R-fit of Model (4.2.4) is obtained. Let $\hat{\mathbf{Z}}$ denote the R-fit of this model, let $\hat{\mathbf{e}}$ denote the vector of residuals, let $a(R(\hat{\mathbf{e}}))$ denote the vector of scored residuals, and let $\hat{\tau}_\varphi$ be the estimate of $\tau_\varphi$. Let $\mathbf{H}_W$ denote the projection matrix onto the column space of the incidence matrix $\mathbf{W}$. Because of (3.2.13), the fact that $\mathbf{1}_n$ is in the column space of $\mathbf{W}$, and that the scores sum to 0, we get
$$\mathbf{H}_W\, a(R(\hat{\mathbf{e}})) = \mathbf{0}. \quad (4.2.32)$$
Define the constant $\hat{\zeta}$ by
$$\hat{\zeta} = \left[\frac{n-k}{\sum_{i=1}^n a^2(i)}\right]^{1/2}. \quad (4.2.33)$$
Because $n^{-1}\sum a^2(i) \doteq 1$, $\hat{\zeta}^2 \doteq 1 - (k/n)$. Then the vector of pseudo-observations is defined by
$$\widetilde{\mathbf{Z}} = \hat{\mathbf{Z}} + \hat{\tau}_\varphi\hat{\zeta}\, a(R(\hat{\mathbf{e}})); \quad (4.2.34)$$
see Bickel (1976) for a discussion of pseudo-observations. For Wilcoxon scores, we get
$$\hat{\zeta}_W^2 = \frac{(n-k)(n+1)}{n(n-1)}. \quad (4.2.35)$$
Let $\widehat{\widetilde{\mathbf{Z}}}$ and $\widetilde{\mathbf{e}}$ denote the LS fit and residuals, respectively, of the pseudo-observations, (4.2.34). By (4.2.32) we have
$$\widehat{\widetilde{\mathbf{Z}}} = \hat{\mathbf{Z}}, \quad (4.2.36)$$
and, hence,
$$\widetilde{\mathbf{e}} = \hat{\tau}_\varphi\hat{\zeta}\, a(R(\hat{\mathbf{e}})). \quad (4.2.37)$$
From this last expression and the definition of $\hat{\zeta}$, (4.2.33), we get
$$\frac{1}{n-k}\|\widetilde{\mathbf{e}}\|^2 = \hat{\tau}_\varphi^2. \quad (4.2.38)$$
Therefore the LS-fit of the pseudo-observations results in the R-fit of Model (4.2.4) and, further, the LS estimator MSE is $\hat{\tau}_\varphi^2$.
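In Python the construction (4.2.34) is only a few lines (a sketch under our own names; fitted, resid, and tau_phi are assumed to come from an R-fit such as the one sketched after Example 4.2.1, with Wilcoxon scores):

import numpy as np
from scipy.stats import rankdata

def pseudo_obs(fitted, resid, tau_phi, k):
    """Pseudo-observations (4.2.34) from a rank-based fit with Wilcoxon scores."""
    n = len(resid)
    scores = np.sqrt(12.0) * (np.arange(1, n + 1) / (n + 1) - 0.5)
    zeta = np.sqrt((n - k) / np.sum(scores ** 2))           # (4.2.33)
    a = np.sqrt(12.0) * (rankdata(resid) / (n + 1) - 0.5)   # scored residuals
    return fitted + tau_phi * zeta * a                      # (4.2.34)

Feeding the returned vector to any LS routine reproduces the R-fit, by (4.2.36), with MSE equal to $\hat{\tau}_\varphi^2$, by (4.2.38).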
The pseudo-observations can be used to compute the R-inference on a given contrast, say $\mathbf{h} = \mathbf{M}\boldsymbol{\mu}$. If the pseudo-observations are used in place of the observations in a LS algorithm, then, based on the variance-covariance of $\hat{\mathbf{h}}$, (4.2.29), expressions (4.2.36) and (4.2.38) imply that the resulting LS estimate of $\mathbf{h}$ and the LS estimate of the corresponding variance-covariance matrix of $\hat{\mathbf{h}}$ will be the R-estimate of $\mathbf{h}$ and the R-estimate of the corresponding variance-covariance matrix of $\hat{\mathbf{h}}$. Similarly, for testing the hypotheses (4.2.25), the LS test using the pseudo-observations will result in the Wald type R-test, $F_{\varphi,Q}$, of these hypotheses given by expression (3.6.14). Pseudo-observations will be used in many of the subsequent examples of this chapter.

The pseudo-observations are easy to obtain. For example, the package rglm returns the pseudo-observations directly in the outputted data set of fits and residuals. These pseudo-observations can then be read into Minitab or another package for further analyses. In Minitab itself, for Wilcoxon scores the robust regression command, RREGR, has the subcommand PSEUDO, which returns the pseudo-observations. Then the pseudo-observations can be used in place of the observations in Minitab commands to obtain the R-inference on contrasts.
Example 4.2.5. LDL Cholesterol of Quail, Example 4.2.1 continued

To demonstrate how easy it is to use the pseudo-observations with Minitab, reconsider Example 4.2.1 concerning LDL cholesterol levels of quail under the treatment of 4 different drug compounds. Suppose we want the Wald type R-test of the hypothesis that there is no
effect due to the different drug compounds. The pseudo-observations were obtained based on the full model R-fit and placed in column 10, and the corresponding levels were placed in column 11. The Wald $F_{\varphi,Q}$ statistic is obtained by using the following Minitab command:

oneway c10 c11

The execution of this command returned the value $F_{\varphi,Q} = 3.45$ with a p-value of .027, which is close to the result based on the $F_\varphi$-statistic.
4.3 Multiple Comparison Procedures

Our basic model for this section is Model (4.2.4); although, much of what we do here pertains to the rest of this chapter also. We will discuss methods based on the R-fit of this model as described in Section 4.2.1. In particular, we shall use the same notation to describe the fit; i.e., the R-residuals and fitted values are, respectively, $\hat{\mathbf{e}}$ and $\hat{\mathbf{Z}}$, the estimates of $\boldsymbol{\mu}$ and $\tau_\varphi$ are $\hat{\boldsymbol{\mu}}$ and $\hat{\tau}_\varphi$, and the vector of pseudo-observations is $\widetilde{\mathbf{Z}}$. We also denote the pseudo-observation corresponding to the observation $Y_{ij}$ as $\widetilde{Z}_{ij}$.

Besides tests of contrasts of level locations, often we want to make comparisons among the location levels, for instance, all pairwise comparisons among the levels. With so many comparisons to make, overall confidence becomes a problem. Multiple comparison procedures, MCP, have been developed to offset this problem. In this section we will explore several of these methods in terms of robust estimation. These procedures can often be directly robustified. It is our intent to show this for several popular methods, including the Tukey T-method. We will also discuss simultaneous, rank-based tests among levels. We will show how simple Minitab code, based on the pseudo-observations, suffices to compute these procedures. It is not our purpose to give a full discussion of MCPs. Such discussions can be found, for example, in Miller (1981) and Hsu (1996).
We will focus on the problem of simultaneous inference for all $\binom{k}{2}$ comparisons $\mu_i - \mu_{i'}$ based on an R-fit of Model (4.2.4). Recall, (4.2.30), that a $(1-\alpha)100\%$ asymptotic confidence interval for $\mu_i - \mu_{i'}$ based on the R-fit of Model (4.2.4) is given by
$$\hat{\mu}_i - \hat{\mu}_{i'} \pm t_{(\alpha/2,\, n-k)}\,\hat{\tau}_\varphi\sqrt{\frac{1}{n_i} + \frac{1}{n_{i'}}}\,. \quad (4.3.1)$$
In this section we say that this confidence interval has experiment error rate $\alpha$. As Exercise 4.8.8 illustrates, simultaneous confidence for several such intervals can easily slip well below $1-\alpha$. The error rate for a simultaneous confidence procedure will be called its family error rate.

We next describe six robust multiple comparison procedures for the problem of all pairwise comparisons. The error rates for them are based on asymptotics. But note that the same is true for MCPs based on least squares when the normality assumption is not valid. Sufficient Minitab code is given to demonstrate how easily these procedures can be performed.
1. Bonferroni Procedure. This is the simplest of all the MCPs. Suppose we are interested in making $l$ comparisons of the form $\mu_i - \mu_{i'}$. If each individual confidence interval, (4.3.1), has confidence $1 - \frac{\alpha}{l}$, then the family error rate for these $l$ simultaneous confidence intervals is at most $\alpha$; see Exercise 4.8.8. To do all comparisons just select $l = \binom{k}{2}$. Hence the R-Bonferroni procedure declares
$$\text{levels } i \text{ and } i' \text{ differ if } |\hat{\mu}_i - \hat{\mu}_{i'}| \geq t_{(\alpha/(2\binom{k}{2}),\, n-k)}\,\hat{\tau}_\varphi\sqrt{\frac{1}{n_i} + \frac{1}{n_{i'}}}\,. \quad (4.3.2)$$
The asymptotic family error rate for this procedure is at most $\alpha$.

To obtain these Bonferroni intervals by Minitab, assume that the pseudo-observations, $\widetilde{Z}_{ij}$, are in column 10, the corresponding levels, $i$, are in column 11, and the constant $\alpha/\binom{k}{2}$ is in k1. Then the following two lines of Minitab code will obtain the intervals (a Python sketch of this and the related interval computations appears after this list of procedures):

oneway c10 c11;
bonferroni k1.
2. Protected LSD Procedure of Fisher. First use the test statistic $F_\varphi$ to test the hypothesis that all the level locations are the same, (4.2.13), at level $\alpha$. If $H_0$ is rejected, then the usual level $1-\alpha$ confidence intervals, (4.3.1), are used to make the comparisons. If we fail to reject $H_0$, then either no comparisons are made or the comparisons are made using the Bonferroni procedure. In summary, this procedure declares
$$\text{levels } i \text{ and } i' \text{ differ if } F_\varphi \geq F_{\alpha,k-1,n-k} \text{ and } |\hat{\mu}_i - \hat{\mu}_{i'}| \geq t_{(\alpha/2,\, n-k)}\,\hat{\tau}_\varphi\sqrt{\frac{1}{n_i} + \frac{1}{n_{i'}}}\,. \quad (4.3.3)$$
This MCP has no family error rate, but the initial test does offer protection. In a large simulation study conducted by Carmer and Swanson (1973), this procedure based on LS estimates performed quite well in terms of power and level. In fact, it was one of the two procedures recommended. In a moderate sized simulation study conducted by McKean, Vidmar and Sievers (1989), the robust version of the protected LSD discussed here performed similarly to the analogous LS procedure on normal errors and had a considerable gain in power over LS for error distributions with heavy tails.

Upon rejection of the hypotheses (4.2.13) at level $\alpha$, the following Minitab code will obtain the comparison confidence intervals. Assume that the pseudo-observations, $\widetilde{Z}_{ij}$, are in column 10, the corresponding levels, $i$, are in column 11, and the constant $\alpha$ is in k1.

oneway c10 c11;
fisher k1.
The F-test that appears in the AOV table upon execution of these commands is Wald's test statistic $F_{\varphi,Q}$ for the hypotheses (4.2.13). Recall from Chapter 3 that it is asymptotically equivalent to $F_\varphi$ under the null and local hypotheses.
3. Tukey's T Procedure. This is a multiple comparison procedure for the set of all contrasts, $h = \sum_{i=1}^k c_i\mu_i$ where $\sum_{i=1}^k c_i = 0$. Assume that the sample sizes for the levels are the same, say, $n_1 = \cdots = n_k = m$. The basic geometric fact for this procedure is the following equivalence due to Tukey (see Miller, 1981): for $t > 0$,
$$\max_{1\le i,i'\le k}\left|(\hat{\mu}_i - \hat{\mu}_{i'}) - (\mu_i - \mu_{i'})\right| \le t \iff \sum_{i=1}^k c_i\hat{\mu}_i - \frac{1}{2}t\sum_{i=1}^k|c_i| \le \sum_{i=1}^k c_i\mu_i \le \sum_{i=1}^k c_i\hat{\mu}_i + \frac{1}{2}t\sum_{i=1}^k|c_i|, \quad (4.3.4)$$
for all contrasts $\sum_{i=1}^k c_i\mu_i$ where $\sum_{i=1}^k c_i = 0$. Hence, to obtain simultaneous confidence intervals for the set of all contrasts, we need the distribution of the left side of this inequality. But first note that
$$(\hat{\mu}_i - \hat{\mu}_{i'}) - (\mu_i - \mu_{i'}) = \left[(\hat{\mu}_i - \hat{\mu}_1) - (\mu_i - \mu_1)\right] - \left[(\hat{\mu}_{i'} - \hat{\mu}_1) - (\mu_{i'} - \mu_1)\right] = (\hat{\Delta}_{i1} - \Delta_{i1}) - (\hat{\Delta}_{i'1} - \Delta_{i'1}).$$
Hence, we need only consider the asymptotic distribution of $\hat{\boldsymbol{\Delta}}_1$, which by (4.2.19) is $N_{k-1}\!\left(\boldsymbol{\Delta}_1, \frac{\tau_\varphi^2}{m}[\mathbf{I} + \mathbf{J}]\right)$.
Recall, if $v_1, \ldots, v_k$ are iid $N(0, \sigma^2)$, then $\max_{1\le i,i'\le k}|v_i - v_{i'}|/\sigma$ has the Studentized range distribution with $k$ and $\infty$ degrees of freedom. But we can write this random variable as
$$\max_{1\le i,i'\le k}|v_i - v_{i'}| = \max_{1\le i,i'\le k}|(v_i - v_1) - (v_{i'} - v_1)|.$$
Hence we need only consider the random vector of shifts $\mathbf{v}_1' = (v_2 - v_1, \ldots, v_k - v_1)$ to determine the distribution. But $\mathbf{v}_1$ has distribution $N_{k-1}(\mathbf{0}, \sigma^2[\mathbf{I} + \mathbf{J}])$. Based on this, it follows from the asymptotic distribution of $\hat{\boldsymbol{\Delta}}_1$ that if we substitute $q_{\alpha;k,\infty}\,\tau_\varphi/\sqrt{m}$ for $t$ in expression (4.3.4), where $q_{\alpha;k,\nu}$ is the upper $\alpha$ critical value of a Studentized range distribution with $k$ and $\nu$ degrees of freedom, then the asymptotic probability of the resulting expression will be $1 - \alpha$.
The parameter $\tau_\varphi$, though, is unknown and must be replaced by an estimate. In the Tukey T procedure for LS, the parameter is $\sigma$. The usual estimate $s$ of $\sigma$ is such that, if the errors are normally distributed, then the random variable $(n-k)s^2/\sigma^2$ has a $\chi^2$ distribution and is independent of the LS location estimates. In this case the Studentized range distribution with $k$ and $n-k$ degrees of freedom is used. If the errors are not normally distributed, then this distribution leads to an approximate simultaneous confidence procedure. We proceed similarly for the procedure based on the robust estimates. Replacing $t$ in expression (4.3.4) by $q_{\alpha;k,n-k}\,\hat{\tau}_\varphi/\sqrt{m}$, where $q_{\alpha;k,n-k}$ is the upper $\alpha$ critical value of a Studentized range distribution with $k$ and $n-k$ degrees of freedom, yields an approximate simultaneous confidence procedure for
the set of all contrasts. As discussed before, though, small sample studies have shown
that the Student t-distribution works well for inference based on the robust estimates.
Hopefully these small sample properties carry over to the approximation based on the
Studentized range distribution. Further research is needed in this area.
Tukey's procedure requires that the level sample sizes are the same, which is frequently not the case in practice. A simple adjustment due to Kramer (1956) results in the simultaneous confidence intervals
$$\hat{\mu}_i - \hat{\mu}_{i'} \pm \frac{1}{\sqrt{2}}\,q_{\alpha;k,n-k}\,\hat{\tau}_\varphi\sqrt{\frac{1}{n_i} + \frac{1}{n_{i'}}}\,. \quad (4.3.5)$$
These intervals have approximate family error rate $\alpha$. This approximation is often called the Tukey-Kramer procedure.

In summary, the R-Tukey-Kramer procedure declares
$$\text{levels } i \text{ and } i' \text{ differ if } |\hat{\mu}_i - \hat{\mu}_{i'}| \geq \frac{1}{\sqrt{2}}\,q_{\alpha;k,n-k}\,\hat{\tau}_\varphi\sqrt{\frac{1}{n_i} + \frac{1}{n_{i'}}}\,. \quad (4.3.6)$$
The asymptotic family error rate for this procedure is approximately $\alpha$.

To obtain these R-Tukey intervals by Minitab, assume that the pseudo-observations, $\widetilde{Z}_{ij}$, are in column 10, the corresponding levels, $i$, are in column 11, and the constant $\alpha$ is in k1. Then the following two lines of Minitab code will obtain the intervals (this rule is also included in the Python sketch after this list of procedures):

oneway c10 c11;
tukey k1.
4. Pairwise Tests Based on Joint Rankings. The above methods were concerned with estimation and simultaneous confidence intervals for effects. Traditionally, simultaneous nonparametric inference has dealt with comparison tests. The first such procedure we will discuss is based on the combined rankings of all levels; i.e., the rankings that are used by the Kruskal-Wallis test. We will discuss this procedure using the Wilcoxon score function; see Exercise 4.8.10 for the analogous procedure based on a selected score function. Assume a common level sample size $m$. Denote the average of the ranks for the $i$th level by $\overline{R}_i$ and let $\overline{\mathbf{R}}_1' = (\overline{R}_2 - \overline{R}_1, \ldots, \overline{R}_k - \overline{R}_1)$. Using the results of Chapter 3, under $H_0\colon \mu_1 = \cdots = \mu_k$, $\overline{\mathbf{R}}_1$ is asymptotically $N_{k-1}\!\left(\mathbf{0}, \frac{k(n+1)}{12}(\mathbf{I}_{k-1} + \mathbf{J}_{k-1})\right)$; see Exercise 4.8.9. Hence, as in the development of the Tukey procedure above, we have the asymptotic result
$$P_{H_0}\left[\max_{1\le i,i'\le k}|\overline{R}_i - \overline{R}_{i'}| \le \sqrt{\frac{k(n+1)}{12}}\,q_{\alpha;k,\infty}\right] \doteq 1 - \alpha. \quad (4.3.7)$$
Hence the joint ranking procedure declares
$$\text{levels } i \text{ and } i' \text{ differ if } |\overline{R}_i - \overline{R}_{i'}| \geq \sqrt{\frac{k(n+1)}{12}}\,q_{\alpha;k,\infty}. \quad (4.3.8)$$
This procedure has an approximate family error rate of $\alpha$. This procedure is not easy to invert for simultaneous confidence intervals for the effects. We would recommend the Tukey procedure, (3), with Wilcoxon scores for corresponding simultaneous inference on the effects.

An approximate level $\alpha$ test of the hypotheses (4.2.13) is given by
$$\text{Reject } H_0 \text{ if } \max_{1\le i,i'\le k}|\overline{R}_i - \overline{R}_{i'}| \geq \sqrt{\frac{k(n+1)}{12}}\,q_{\alpha;k,\infty}, \quad (4.3.9)$$
although the Kruskal-Wallis test is the usual choice in practice.

The joint ranking procedure, (4.3.9), is approximate for the unequal sample size case. Miller (1981, p. 166) describes a procedure similar to the Scheffé procedure in LS which is valid for the unequal sample size case, but which is also much more conservative; see Exercise 4.8.6. A Tukey-Kramer type rule, (4.3.6), for the procedure (4.3.9) is
$$\text{levels } i \text{ and } i' \text{ differ if } |\overline{R}_i - \overline{R}_{i'}| \geq \sqrt{\frac{n(n+1)}{24}\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)}\,q_{\alpha;k,\infty}. \quad (4.3.10)$$
The small sample properties of this approximation need to be studied.
5. Pairwise Tests Based on Separate Rankings. For this procedure we compare levels $i$ and $i'$ by ranking the combined $i$th and $i'$th samples. Let $R_i^{(i')}$ denote the sum of the ranks of the $i$th level when it is compared with the $i'$th level. Assume that the sample sizes are the same, $n_1 = \cdots = n_k = m$. For $0 < \alpha < 1$, define the critical value $c_{\alpha;m,k}$ by
$$P_{H_0}\left[\max_{1\le i,i'\le k} R_i^{(i')} \geq c_{\alpha;m,k}\right] = \alpha. \quad (4.3.11)$$
Tables for this critical value at the 5% and 1% levels are provided in Miller (1981). The separate ranking procedure declares
$$\text{levels } i \text{ and } i' \text{ differ if } R_i^{(i')} \geq c_{\alpha;m,k} \text{ or } R_{i'}^{(i)} \geq c_{\alpha;m,k}. \quad (4.3.12)$$
This procedure has an approximate family error rate of $\alpha$ and was developed independently by Steel (1960) and Dwass (1960).
dently by Steel (1960) and Dwass (1960).
An approximate level test of the hypotheses ( 4.2.13) is given by
Reject H
0
if max
1i,i

k
R
(i

)
i
c
;m,k
, (4.3.13)
although as noted for the last procedure the Kruskal-Wallis test is the usual choice in
practice.
Corresponding simultaneous condence intervals can be constructed similar to the
condence intervals developed in Chapter 2 for a shift in locations based the MWW
statistic. For the confidence interval for the $i$th and $i'$th samples corresponding to the test (4.3.12), first form the differences between the two samples, say,
$$D_{kl}^{ii'} = Y_{ik} - Y_{i'l}, \quad 1 \le k, l \le m.$$
Let $D_{(1)}, \ldots, D_{(m^2)}$ denote the ordered differences. Note here that the critical value $c_{\alpha;m,k}$ is for the sum of the ranks and not statistics of the form $S_R^+$, (2.4.2). But recall that these versions of the Wilcoxon statistic differ by the constant $m(m+1)/2$. Hence the confidence interval is
$$\left(D_{(c_{\alpha;m,k} - \frac{m(m+1)}{2} + 1)},\; D_{(m^2 - c_{\alpha;m,k} + \frac{m(m+1)}{2})}\right). \quad (4.3.14)$$
It follows that this set of confidence intervals, over all pairs of levels $i$ and $i'$, forms a set of simultaneous $1-\alpha$ confidence intervals. Using the iterative algorithm discussed in Section 3.7.2, the differences need not be formed.
6. Procedures Based on Pairwise Distribution Free Confidence Intervals. Simple pairwise (separate ranking) multiple comparison procedures can be easily formulated based on the MWW confidence intervals discussed in Section 2.4.2. Such procedures do not depend on equal sample sizes. As an illustration, we describe a Bonferroni type procedure for the situation of all $l = \binom{k}{2}$ comparisons. For the levels $(i, i')$, let $[D^{ii'}_{(c_{\alpha/(2l)}+1)}, D^{ii'}_{(n_in_{i'} - c_{\alpha/(2l)})})$ denote the $(1 - (\alpha/l))100\%$ confidence interval discussed in Section 2.4.2, based on the $n_in_{i'}$ differences between the $i$th and $i'$th samples. This procedure declares
$$\text{levels } i \text{ and } i' \text{ differ if } 0 \text{ is not in } [D^{ii'}_{(c_{\alpha/(2l)}+1)}, D^{ii'}_{(n_in_{i'} - c_{\alpha/(2l)})}). \quad (4.3.15)$$
This Bonferroni type procedure has family error rate at most $\alpha$. Note that the asymptotic value for $c_{\alpha/(2l)}$ is given by
$$c_{\alpha/(2l)} \doteq \frac{n_in_{i'}}{2} - z_{\alpha/(2l)}\sqrt{\frac{n_in_{i'}(n_i + n_{i'} + 1)}{12}} - .5\,; \quad (4.3.16)$$
see (2.4.13). A Protected LSD type procedure can be constructed in the same way, using as the overall test either the Kruskal-Wallis test or the test based on $F_\varphi$; see Exercise 4.8.12.
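For readers without Minitab, the following is a sketch in Python (our own helper names and structure, not the authors' code) of the interval computations behind procedures (1), (3), and (6) above. The Bonferroni and Tukey-Kramer rules operate on the pseudo-observations of Section 4.2.5, so that the cell means and root-MSE below are, by (4.2.36) and (4.2.38), the R-estimates $\hat{\mu}_i$ and $\hat{\tau}_\varphi$; the MWW interval uses the normal approximation (4.3.16), and the Tukey-Kramer cut-off uses scipy's studentized range distribution (available in recent scipy).

import numpy as np
from itertools import combinations
from scipy.stats import t, studentized_range, norm

def _cell_stats(z, level):
    # cell means, cell sizes, and sqrt(MSE) of the pseudo-observations;
    # by (4.2.36) and (4.2.38) these are mu_hat_i and tau_hat_phi
    cells = np.unique(level)
    means = {i: z[level == i].mean() for i in cells}
    ns = {i: int((level == i).sum()) for i in cells}
    sse = sum(((z[level == i] - means[i]) ** 2).sum() for i in cells)
    tau = np.sqrt(sse / (len(z) - len(cells)))
    return cells, means, ns, tau

def bonferroni_pairwise(z, level, alpha=0.05):
    """Simultaneous intervals (4.3.2), family error rate at most alpha."""
    cells, means, ns, tau = _cell_stats(z, level)
    k, n = len(cells), len(z)
    l = k * (k - 1) // 2
    tc = t.ppf(1 - alpha / (2 * l), n - k)
    return {(i, j): (means[i] - means[j] - tc * tau * np.sqrt(1/ns[i] + 1/ns[j]),
                     means[i] - means[j] + tc * tau * np.sqrt(1/ns[i] + 1/ns[j]))
            for i, j in combinations(cells, 2)}

def tukey_kramer(z, level, alpha=0.05):
    """Declarations from rule (4.3.6), approximate family error rate alpha."""
    cells, means, ns, tau = _cell_stats(z, level)
    k, n = len(cells), len(z)
    q = studentized_range.ppf(1 - alpha, k, n - k)    # q_{alpha; k, n-k}
    return {(i, j): abs(means[i] - means[j]) >=
                    (q / np.sqrt(2)) * tau * np.sqrt(1/ns[i] + 1/ns[j])
            for i, j in combinations(cells, 2)}

def mww_interval(x, y, gamma):
    """(1 - gamma)100% MWW shift interval via ordered differences; for the
    Bonferroni type rule (4.3.15) call with gamma = alpha / (k choose 2)."""
    x, y = np.asarray(x), np.asarray(y)
    n1, n2 = len(x), len(y)
    d = np.sort((x[:, None] - y[None, :]).ravel())
    c = int(n1 * n2 / 2 - norm.ppf(1 - gamma / 2)
            * np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0) - 0.5)   # (4.3.16)
    c = max(c, 0)
    return d[c], d[n1 * n2 - c - 1]

For instance, with the pseudo-observations in z and the level labels in level, bonferroni_pairwise(z, level) returns the analogue of the intervals produced by the Minitab code of procedure (1).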
Example 4.3.1. LDL Cholesterol of Quail, Example 4.2.1 continued

Reconsider the data on the LDL levels of quail subject to four different drug compounds. The full model fit returned the estimate $\hat{\boldsymbol{\mu}} = (67, 42, 63, 62)'$. We set $\alpha = .05$ and ran the first five MCPs on this data set. We used the Minitab code based on pseudo-observations to compute the first four procedures and we obtained the latter two by Minitab commands. A table that helps for the separate rank procedure can be found on page 242 of Lehmann
Table 4.3.1: Drug Compounds Declared Significantly Different by MCPs

Procedure          Compounds Declared Different   Respective Confidence Intervals
Bonferroni         (1, 2)                         (1.25, 49.23)
Fisher             (1, 2), (2, 3), (2, 4)         (7.83, 42.65), (-38.57, -3.75), (-37.79, -2.01)
Tukey-Kramer       (1, 2)                         (2.13, 48.35)
Joint Ranking      None
Separate Ranking   None
(1975), which links the tables in Miller (1981) with a table of family error $\alpha$ values for this procedure. Based on these values, the Minitab MANN command can then be used to obtain the confidence intervals (4.3.14). For each procedure, Table 4.3.1 displays the drug compounds that were declared significantly different by the procedure. The first three procedures, based on effects, declared drug compounds 1 and 2 different. Fisher's PLSD also declared drug compound 2 different from drug compounds 3 and 4. The usual summary schematic based on Fisher's is

2   4 3 1

with compounds 4, 3, and 1 joined by a common underline, which shows the separation of the second drug compound from the other three compounds. On the other hand, the schematic for either the Bonferroni or Tukey-Kramer procedures is

2 4 3 1

with one underline joining 2, 4, and 3 and another joining 4, 3, and 1, which shows that, though Treatment 2 is significantly different from Treatment 1, it does not differ significantly from either Treatment 4 or 3. The joint ranking procedure came close to declaring drug compounds 1 and 2 different because the difference in average rankings between these levels was 12.85, slightly less than the critical value of 13.10. The separate ranking procedure declared none different. Its interval, (4.3.14), for compounds 1 and 2 is $(-29, 68.99)$. In comparison, the corresponding confidence interval for the Tukey procedure based on LS is $(-14.5, 58.9)$. Hence, the separate ranking procedure was impaired more by the outliers than least squares.
4.3.1 Discussion

We have presented robust analogues to three of the most popular multiple comparison procedures: the Bonferroni, Fisher's protected least significant difference, and the Tukey T method. These procedures provide the user with estimates of the most interesting parameters in these experiments, namely the simple contrasts between treatment effects, and estimates of standard errors with which to assess these contrasts. The robust analogues are straightforward: replace the LS estimates of the effects by the robust estimates and replace the estimate of $\sigma$ by the estimate of $\tau_\varphi$. Furthermore, these robust procedures can
easily be obtained by using the pseudo-observations as discussed in Section 4.2.5. Hence,
the asymptotic relative eciency between the LS based MCP and its robust analogue is the
same as the ARE between the LS estimator and robust estimator, as discussed in Chapters
1-3. In particular if Wilcoxon scores are used, then the ARE of the Wilcoxon MCP to that
of the LS MCP is .955 provided the errors are normally distributed. For error distributions
with longer tails than the normal, the Wilcoxon MCP is generally much more ecient than
its LS MCP counterpart.
The theory behind the robust MCPs is asymptotic; hence, the error rates are approximate.
But this is true also for the LS MCPs when the errors are not normally distributed.
Verification of the validity and power of both LS and robust MCPs is based on small sample
studies. The small sample study by McKean et al. (1989) demonstrated that the Wilcoxon
Fisher PLSD had the same validity as its LS counterpart over a variety of error distributions
for a oneway design. For normal errors, the LS MCP had slightly more empirical power
than the Wilcoxon. Under error distributions with heavier tails than the normal, though,
the empirical power of the Wilcoxon MCP was larger than the empirical power of the LS
MCP.
The decision as to which MCP to use has long been debated in the literature. It is not
our purpose here to discuss these issues; we refer the reader to books devoted to MCPs
for discussions on this topic; see, for example, Miller (1981) and Hsu (1996). We do note
that, besides $\hat{\tau}_{\varphi}$ replacing $\hat{\sigma}$, the error part of the robust MCP is the same as that of LS;
hence, arguments that one procedure dominates another in a certain situation will hold for
the robust MCP as well as for LS.
There has been some controversy on the two simultaneous rank-based testing procedures
that we presented: pairwise tests based on joint rankings and pairwise tests based on separate
rankings. Miller (1981) and Hsu (1996) both favor the tests based on separate rankings
because in the separate rankings procedure the comparison between two levels is not influenced
by any information from the other levels, which is not the case for the procedure
based on joint rankings. They point out that this is true of the LS procedure, also, since
the comparison between two levels is based only on the difference in sample means for those
two levels, except for the estimate of scale. However, Lehmann (1975) points out that the
joint ranking makes use of all the information in the experiment while the separate ranking
procedure does not. The spacings between all the points are information that is utilized by the
joint ranking procedure and that is lost in the separate ranking procedure. The quail data,
Example 4.3.1, is illustrative. The separate ranking procedure did quite poorly on this data
set. The sample sizes are moderate and, in the comparisons, when half of the information
is lost, the outliers impaired the procedure. In contrast, the joint ranking procedure came
close to declaring drug compounds 1 and 2 different. Consider also the LS procedure on this
data set. It is true that the outliers impaired the sample means, but the estimated variance,
being a weighted average of the level sample variances, was drawn down some over all the
information; for example, instead of using $s_3 = 37.7$ in the comparisons with the third level,
the LS procedure uses a pooled standard deviation $s = 30.5$. There is no way to make a
similar correction to the separate ranking procedure. Also, the separate rankings procedure
can lead to inconsistencies in that it could declare Treatment A superior to Treatment B and
Treatment B superior to Treatment C, while not declaring Treatment A superior to Treatment C; see
page 245 of Lehmann (1975) for a simple illustration.
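To make the distinction concrete, the following Python sketch (with hypothetical data, not the quail data) computes both kinds of rankings for a oneway layout: the joint-ranking procedure compares average ranks from one ranking of all observations, while the separate-ranking procedure re-ranks each pair of levels by itself.

    import numpy as np
    from itertools import combinations
    from scipy.stats import rankdata

    def joint_ranking_diffs(samples):
        # differences in average ranks from a single ranking of all observations
        ranks = rankdata(np.concatenate(samples))
        idx = np.cumsum([len(s) for s in samples])[:-1]
        rbar = [r.mean() for r in np.split(ranks, idx)]
        return {(i, j): rbar[i] - rbar[j]
                for i, j in combinations(range(len(samples)), 2)}

    def separate_ranking_stats(samples):
        # Mann-Whitney counts from re-ranking each pair of levels by itself
        return {(i, j): int(np.sum(samples[j][:, None] > samples[i][None, :]))
                for i, j in combinations(range(len(samples)), 2)}

    rng = np.random.default_rng(1)                  # toy data only
    samples = [rng.normal(loc, 1.0, 10) for loc in (0.0, 1.0, 0.2, 0.1)]
    print(joint_ranking_diffs(samples))
    print(separate_ranking_stats(samples))

Note how the joint-ranking differences for a given pair change if the data in the remaining levels change, while the separate-ranking counts do not; this is precisely the point at issue in the discussion above.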
4.4 Twoway Crossed Factorial
For this design we have two factors, say, A at $a$ levels and B at $b$ levels, that may have an effect
on the response. Each combination of the $ab$ factor settings is a treatment. For a completely
randomized design, $n$ subjects are selected at random from the reference population and then
$n_{ij}$ of these subjects are randomly assigned to the $(i,j)$th treatment combination; hence,
$n = \sum n_{ij}$. Let $Y_{ijk}$ denote the response for the $k$th subject at the $(i,j)$th treatment
combination, let $F_{ij}$ denote the distribution function of $Y_{ijk}$, and let $\mu_{ij} = T(F_{ij})$. Then the
unstructured or full model is

$$ Y_{ijk} = \mu_{ij} + e_{ijk} \;, \qquad (4.4.1) $$

where $e_{ijk}$ are iid with distribution and density functions $F$ and $f$, respectively. Let $T$ denote
the location functional of interest and assume without loss of generality that $T(F) = 0$. The
submodels described below utilize the two-way structure of the design.

Model (4.4.1) is the same as the oneway design model (4.2.1) of Section 4.2. Using
the scores $a(i) = \varphi(i/(n+1))$, the R-fit of this model can be obtained as described in that
section. We will use the same notation as in Section 4.2: i.e., $\hat{e}$ denotes the residuals from
the fit adjusted so that $T(F_n) = 0$, where $F_n$ is the empirical distribution function of the
residuals; $\hat{\mu}$ will denote the R-estimate of the $ab \times 1$ vector of the $\mu_{ij}$'s; and $\hat{\tau}_{\varphi}$ denotes the
estimate of $\tau_{\varphi}$. For the examples discussed in this section, Wilcoxon scores are used and the
residuals are adjusted so that their median is 0.

An interesting submodel is the additive model, which is given by

$$ \mu_{ij} = \bar{\mu} + (\bar{\mu}_{i\cdot} - \bar{\mu}) + (\bar{\mu}_{\cdot j} - \bar{\mu}) \;. \qquad (4.4.2) $$

For the additive model, the profile plots ($\mu_{ij}$ versus $i$ or $j$) are parallel. A diagnostic
check for the additive model is to plot the sample profile plots ($\hat{\mu}_{ij}$ versus $i$ or $j$) and
see how close the profiles are to parallel. The null hypotheses of interest for this model are
the main effect hypotheses given by

$$ H_{0A}: \bar{\mu}_{i\cdot} = \bar{\mu}_{i'\cdot} \text{ for all } i, i' = 1, \ldots, a \;, \qquad (4.4.3) $$

$$ H_{0B}: \bar{\mu}_{\cdot j} = \bar{\mu}_{\cdot j'} \text{ for all } j, j' = 1, \ldots, b \;. \qquad (4.4.4) $$

Note that there are $a-1$ and $b-1$ free constraints for $H_{0A}$ and $H_{0B}$, respectively. Under
$H_{0A}$, the levels of A have no effect on the response.
The interaction parameters are defined as the differences between the full model
parameters and the additive model parameters; i.e.,

$$ \gamma_{ij} = \mu_{ij} - [\bar{\mu} + (\bar{\mu}_{i\cdot} - \bar{\mu}) + (\bar{\mu}_{\cdot j} - \bar{\mu})] = \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu} \;. \qquad (4.4.5) $$

The hypothesis of no interaction is given by

$$ H_{0AB}: \gamma_{ij} = 0 \;, \quad i = 1, \ldots, a \,,\; j = 1, \ldots, b \;. \qquad (4.4.6) $$

Note that there are $(a-1)(b-1)$ free constraints for $H_{0AB}$. Under $H_{0AB}$ the additive model holds.
Historically, nonparametric tests for interaction were developed in an ad hoc fashion.
They generally do not appear in nonparametric texts, and this has been a shortcoming of
the area. Sawilowsky (1990) provides an excellent review of nonparametric approaches to
testing for interaction. The methods we present are simply part of the general R-theory for
testing general linear hypotheses in linear models, and they are analogous to the traditional
LS tests for interactions.

All these hypotheses are contrasts in the parameters $\mu_{ij}$ of the oneway model (4.4.1);
hence they can easily be tested with the rank-based analysis as described in Section 4.2.3.
Usually the interaction hypothesis is tested first. If $H_{0AB}$ is rejected then there is difficulty
in interpreting the main effect hypotheses, $H_{0A}$ and $H_{0B}$. In the presence of interaction,
$H_{0A}$ concerns the cell mean averaged over Factor B, which may have little practical
significance. In this case multiple comparisons (see below) between cells may be of more
practical significance. If $H_{0AB}$ is not rejected then there are two schools of thought. The
pooling school would take the additive model (4.4.2) as the new full model to test main
effects. The non-poolers would stick with the unstructured model (4.4.1) as the full model.
In either case, with little evidence of interaction present, the main effect hypotheses are more
interpretable.
Since Model (4.4.1) is a oneway design, the multiple comparison procedures discussed in
Section 4.3 can be used. The crossed structure of the design makes for several interesting
families of contrasts. When interaction is present in the model, it is often of interest to
consider simple contrasts between cell locations. Here, we will only mention all $\binom{ab}{2}$ pairwise
comparisons. Among others, the Bonferroni, Fisher, and Tukey T procedures described in
Section 4.3 can be used. The rule for the Tukey-Kramer procedure is:

$$ \text{cells } (i,j) \text{ and } (i',j') \text{ differ if } |\hat{\mu}_{ij} - \hat{\mu}_{i'j'}| \ge \frac{1}{\sqrt{2}}\, q_{\alpha;\,ab,\,n-ab}\; \hat{\tau}_{\varphi} \sqrt{\frac{1}{n_{ij}} + \frac{1}{n_{i'j'}}} \;. \qquad (4.4.7) $$

The asymptotic family error rate for this procedure is approximately $\alpha$.
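As a computational note, the critical value $q_{\alpha;ab,n-ab}$ in (4.4.7) is a studentized range quantile. The following Python sketch evaluates the right-hand side of the rule; it assumes scipy (version 1.7 or later) for the studentized range distribution, and the value of $\hat{\tau}_{\varphi}$ used below is illustrative only, not a value quoted in the text.

    import numpy as np
    from scipy.stats import studentized_range

    def tukey_kramer_cutoff(alpha, a, b, n, n_ij, n_kl, tau_hat):
        # right-hand side of rule (4.4.7) for cells (i,j) and (i',j')
        q = studentized_range.ppf(1 - alpha, k=a * b, df=n - a * b)
        return q / np.sqrt(2.0) * tau_hat * np.sqrt(1.0 / n_ij + 1.0 / n_kl)

    # e.g., a 3 x 3 design with n = 39 and cells of sizes 3 and 5, as in
    # Example 4.4.1; tau_hat = 0.19 is a hypothetical value
    print(tukey_kramer_cutoff(0.05, 3, 3, 39, 3, 5, tau_hat=0.19))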
The pseudo-observations discussed in Section 4.2.5 can be used to easily obtain the Wald
test statistic, $F_{\varphi,Q}$, (3.6.14), for tests of hypotheses, and similarly they can be used to obtain
multiple comparison procedures for families of contrasts. Simply obtain the R-fit of Model
(4.4.1), form the pseudo-observations, (4.2.34), and input these pseudo-observations into a LS
package. The analysis of variance table output will contain the Wald-type R-tests of the
main effect hypotheses ($H_{0A}$ and $H_{0B}$) and the interaction hypothesis ($H_{0AB}$). As with a
LS analysis, one has to know what main hypotheses are being tested by the LS package. For
instance, the main effect hypothesis $H_{0A}$, (4.4.3), is a Type III sums of squares hypothesis
in SAS; see Speed, Hocking, and Hackney (1978).
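The recipe of the preceding paragraph can be sketched in a few lines of Python. The form of the pseudo-observations used here, fitted value plus $\hat{\tau}_{\varphi}$ times the scored rank of the residual, is our reading of (4.2.34); the rank-based fit itself (fitted values, residuals, and $\hat{\tau}_{\varphi}$) is assumed to come from some other routine, and statsmodels stands in for the LS package.

    import numpy as np
    import pandas as pd
    from scipy.stats import rankdata
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    def wilcoxon_scores(n):
        # standardized Wilcoxon scores a(i) = sqrt(12) * (i/(n+1) - 1/2)
        i = np.arange(1, n + 1)
        return np.sqrt(12.0) * (i / (n + 1.0) - 0.5)

    def pseudo_observations(fitted, resid, tau_hat):
        # assumed form of (4.2.34): fitted value plus tau_hat times the
        # scored rank of the residual (ties broken by order for simplicity)
        a = wilcoxon_scores(len(resid))
        ranks = rankdata(resid, method="ordinal").astype(int)
        return fitted + tau_hat * a[ranks - 1]

    def rank_based_anova(df, fitted, resid, tau_hat):
        # df has factor columns A and B; fitted, resid, tau_hat come from
        # an R-fit of the full cell model, obtained elsewhere
        df = df.assign(ytilde=pseudo_observations(fitted, resid, tau_hat))
        lm = smf.ols("ytilde ~ C(A, Sum) * C(B, Sum)", data=df).fit()
        return anova_lm(lm, typ=3)   # Type III tests, as for H_0A in SAS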
Table 4.4.1: Data for Example 4.4.1, Lifetimes of Motors (hours)

                     Insulation
  Temp.          1        2        3
  200°F       1176     2856     3528
              1512     3192     3528
              1512     2520     3528
                       1512     3192
                       3528     3528
  225°F        624      816      720
               624      912     1296
               624     1296     1488
                        816     1392
                       1296     1488
  250°F        204      300      252
               228      324      300
               252      372      324
                        300      372
                        324      444
Example 4.4.1. Lifetime of Motors

This problem is an unbalanced twoway design which is discussed on page 471 of Nelson
(1982); see, also, McKean and Sievers (1989) for a discussion on R-analyses of this data
set. The responses are lifetimes of three motor insulations (1, 2, and 3), which were tested
at three different temperatures (200°F, 225°F, and 250°F). The design is an unbalanced
$3 \times 3$ factorial with 5 replicates in 6 of the cells and 3 replicates in the others. The data are
displayed in Table 4.4.1. Following Nelson, as the response variable we considered the logs
of the lifetimes. Let $Y_{ijk}$ denote the log of the lifetime of the $k$th replicate at temperature
level $i$ which used motor insulation $j$. As a full model we will use Model (4.4.1). The
results found in Tables 4.4.2 and 4.4.3 are for the R-analysis based on Wilcoxon scores
with the intercept estimated by the median of the residuals. Hence the R-estimates of $\mu_{ij}$
estimate the true cell medians.

The cell median profile plot based on the Wilcoxon estimates, Panel A of Figure 4.4.1,
indicates that some interaction is present. Panel B of Figure 4.4.1 is a plot of the internal
Wilcoxon studentized residuals, (3.9.31), versus fitted values. It indicates randomness but
also shows several outlying data points which are also quite evident in the q-q plot, Panel
C of Figure 4.4.1, of the Wilcoxon studentized residuals versus logistic population quantiles.
This plot indicates that score functions for distributions with heavier right tails than the
logistic would be more appropriate for this data; see McKean and Sievers (1989) for more
discussion on score selection for this example. Panel D of Figure 4.4.1, the casewise plot of the
Wilcoxon studentized residuals, readily identifies the outliers as the fifth observation in cell
(1,1), the fifth observation in cell (2,1), and the first observation in cell (2,3).
Figure 4.4.1: Panel A: Cell median profile plot for the data of Example 4.4.1, cell medians based
on the Wilcoxon fit (cell median estimates versus motor insulation, with profiles for 200, 225,
and 250 degrees); Panel B: Internal Wilcoxon studentized residual plot (residuals versus the
Wilcoxon fit); Panel C: Logistic q-q plot based on the internal Wilcoxon studentized residuals;
Panel D: Casewise plot of the Wilcoxon studentized residuals. [Figure not reproduced.]
Table 4.4.2: Analysis of Dispersion Table for the Lifetime of Motors Data

  Source                    RD      df    MRD      F_φ
  Temperature (T)          26.40     2    13.20    121.7
  Motor Insulation (I)      3.72     2     1.86     17.2
  T × I                     1.24     4     .310      2.86
  Error                              30     .108

Table 4.4.3: Contrasts for Differences in Insulations at Temperature 200°

  Contrast                    Estimate    Confidence Interval
  $\mu_{11} - \mu_{12}$         -.76        (-1.22, -.30)
  $\mu_{11} - \mu_{13}$         -.84        (-1.37, -.32)
  $\mu_{12} - \mu_{13}$         -.09        (-.62, .44)
Table 4.4.2 is an ANOVA table for the R-analysis. Since $F(.05, 4, 30) = 2.69$, the test
of interaction is significant at the .05 level. This confirms the profile plot, Panel A. It is
interesting to note that the least squares F-test statistic for interaction was 1.30 and, hence,
was not significant. The LS analysis was impaired because of the outliers. The row effect
hypothesis is that the average row effects are the same. The column effect hypothesis is
similarly defined. Both main effects are significant. In the presence of interaction, though,
we have interpretation difficulties with main effects.

In Nelson's discussion of this problem it was of interest to estimate the simple contrasts
of mean lifetimes of insulations at the temperature setting of 200°. Since this is the first
temperature setting, these contrasts are $\mu_{1j} - \mu_{1j'}$. Table 4.4.3 displays the estimates
of these contrasts along with corresponding confidence intervals formed under the Tukey-Kramer
procedure as discussed above, (4.4.7). It seems that insulations 2 and 3 are better
than insulation 1 at the temperature of 200°, but between insulations 2 and 3 there is no
discernible difference.

In this example, the number of observations per parameter was less than five. To offset
uneasiness over the use of the rank analysis for such small samples, McKean and Sievers
(1989) conducted a Monte Carlo study on this design. The empirical levels and powers of
the R-analysis were good over situations similar to those suggested by this data.
4.5 Analysis of Covariance
Often there are extraneous variables available besides the response variable. Hopefully these
variables explain some of the noise in the data. These variables are called covariates or
concomitant variables, and the traditional analysis of such data is called analysis of
covariance.
As an example, consider the oneway model (4.2.1), with $k$ levels, and suppose we have a
single covariate, say, $x_{ij}$. A first order model is $y_{ij} = \alpha_i + \beta x_{ij} + e_{ij}$. This model, however,
assumes that the covariate behaves the same within each treatment combination. A more
general model is

$$ y_{ij} = \alpha_i + \beta x_{ij} + \gamma_i x_{ij} + e_{ij} \;, \quad j = 1, \ldots, n_i \,,\; i = 1, \ldots, k \;. \qquad (4.5.1) $$

Hence the slope at the $i$th level is $\beta_i = \beta + \gamma_i$ and, thus, each treatment combination has its
own linear model. There are two natural hypotheses for this model: $H_{0C}: \beta_1 = \cdots = \beta_k$
and $H_{0L}: \alpha_1 = \cdots = \alpha_k$. If $H_{0C}$ is true, then the differences between the levels of Factor A
are just the differences in the location parameters $\alpha_i$ for a given value of the covariate. In
this case, contrasts in these parameters are often of interest, as is the hypothesis $H_{0L}$.
If $H_{0C}$ is not true, then the covariate and the treatment combinations interact. For example,
whether one treatment combination is better than another may depend on where in factor
space the responses are measured. Thus, as in crossed factorial designs, the interpretation
of main effect hypotheses may not be clear; for more discussion on this point see Huitema
(1980).
The above example is easily generalized. Consider a designed experiment with $k$ treatment
combinations. This may be a oneway model with a factor at $k$ levels, a twoway crossed
factorial design model with $k = ab$ treatment combinations, or some other design. Suppose
we have $n_i$ observations at treatment level $i$. Let $n = \sum n_i$ denote the total sample size.
Denote by $W$ the full model incidence matrix and by $\mu$ the $k \times 1$ vector of location parameters.
Suppose we have $p$ covariates. Let $U$ be the $n \times p$ matrix of covariates and let $Z$ denote
the $n \times 1$ vector of responses. Let $\beta$ denote the corresponding $p \times 1$ vector of regression
coefficients. Then the general covariate model is given by

$$ Z = W\mu + U\beta + V\gamma + e \;, \qquad (4.5.2) $$

where $V$ is the $n \times pk$ matrix consisting of all column products of $W$ and $U$, and the $pk \times 1$
vector $\gamma$ is the vector of interaction parameters between the design and the covariates.

The first hypothesis of interest is

$$ H_{0C}: \gamma_{11} = \cdots = \gamma_{pk} \;\text{ versus }\; H_{AC}: \gamma_{ij} \ne \gamma_{i'j'} \text{ for some } (i,j) \ne (i',j') \;. \qquad (4.5.3) $$

Other hypotheses of interest consist of contrasts in the $\mu_i$. In general, let $M$ be a $q \times k$
matrix of contrasts and consider the hypotheses

$$ H_0: M\mu = 0 \;\text{ versus }\; H_A: M\mu \ne 0 \;. \qquad (4.5.4) $$

Matrices $M$ of interest are related to the design. For a oneway design, $M$ may be a $(k-1) \times k$
matrix that tests all the location levels to be the same, while for a twoway design it may
be used to test that all interactions between the two factors are zero. But as noted above,
the hypothesis $H_{0C}$ concerns interaction between the covariate and design spaces. While the
interpretation of these latter hypotheses, (4.5.4), is clear under $H_{0C}$, it may not be if $H_{0C}$
is false.
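The matrix $V$ of column products is simple to form directly; a minimal sketch, assuming numpy and the dimensions above:

    import numpy as np

    def covariate_interaction_columns(W, U):
        """Form V, the n x (p*k) matrix of all columnwise products of the
        incidence matrix W (n x k) and the covariate matrix U (n x p),
        as in model (4.5.2)."""
        n, k = W.shape
        _, p = U.shape
        # elementwise product of every column of U with every column of W
        return (U[:, :, None] * W[:, None, :]).reshape(n, p * k)

    # toy check: a oneway design with k = 2 levels and one covariate
    W = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
    U = np.array([[0.5], [1.0], [1.5], [2.0]])
    print(covariate_interaction_columns(W, U))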
Table 4.5.1: Snake Data
Placebo Treatment 2 Treatment 3 Treatment 4
Initial Final Initial Final Initial Final Initial Final
Dist. Dist. Dist. Dist. Dist. Dist. Dist. Dist.
25 25 17 11 32 24 10 8
13 25 9 9 30 18 29 17
10 12 19 16 12 2 7 8
25 30 25 17 30 24 17 12
10 37 6 1 10 2 8 7
17 25 23 12 8 0 30 26
9 31 7 4 5 0 5 8
18 26 5 3 11 1 29 29
27 28 30 26 5 1 5 29
17 29 19 20 25 10 13 9
The rank-based fit of the full Model (4.5.2) proceeds as described in Chapter 3, after a
score function is chosen. Once the fitted values and residuals have been obtained, the diagnostic
procedures described in Section 3.9 can be used to assess the fit. With a good fit, the
model estimates of the parameters and their standard errors can be used to form confidence
intervals and regions, and multiple comparison procedures can be used for simultaneous
inference. Reduced models appropriate for the hypotheses of interest can be obtained, and the
values of the test statistic $F_{\varphi}$ can be used to test them. This analysis can be conducted
by the package rglm. It can also be conducted by fitting the full model and obtaining the
pseudo-observations. These in turn can be substituted for the responses in a package which
performs the traditional LS analysis of covariance in order to obtain the R-analysis.
Example 4.5.1. Snake Data

As an example of an analysis of covariance problem, consider the data set discussed by Afifi
and Azen (1972). The data are reproduced in Table 4.5.1. The study involves four methods,
three of which are intended to reduce a human's fear of snakes. Forty subjects were given
a behavior approach test to determine how close they could walk to a snake without feeling
uncomfortable. This score was taken as the covariate. Next they were randomly assigned to
one of the four treatments, with ten subjects assigned to each treatment. The first treatment
was a control (placebo), while the other three treatments were different methods intended to
reduce a human's fear of snakes. The response was a subject's score on the behavior approach
test after treatment. Hence, the sample size is 40 and the number of independent variables
in Model (4.5.2) is 8. Wilcoxon scores were used to conduct the analysis of covariance
described above, with the residuals adjusted to have median 0.
The plots of the response variable versus the covariate for each treatment are found
in Panels A - D of Figure 4.5.1. It is clear from the plots that the relationship between
the response and the covariate varies with the treatment, from virtually no relationship for
Figure 4.5.1: Panels A-D: For the Snake Data, scatterplots of final distance versus initial
distance for the placebo and Treatments 2-4, overlaid with the Wilcoxon fit (solid line)
and the LS fit (dashed line); Panel E: Internal Wilcoxon studentized residual plot; Panel F:
Wilcoxon studentized logistic q-q plot. [Figure not reproduced.]
the first treatment (placebo) to a fairly strong linear relationship for the third treatment.
Outliers are apparent in these plots also. These plots are overlaid with the Wilcoxon and LS
fits of the full model, Model (4.5.1). Panels E and F of Figure 4.5.1 are, respectively, the
internal Wilcoxon studentized residual plot and the internal Wilcoxon studentized logistic
q-q plot. The outliers stand out in these plots. From the residual plot, the data appear
to be heteroscedastic and, as Exercise 4.8.14 shows, the square root transformation of the
response does lead to a better fit.

Table 4.5.2 displays the Wilcoxon and LS estimates of the linear models for each treatment.
As this table and Figure 4.5.1 show, the larger discrepancies between the Wilcoxon
and LS estimates occur for those treatments which have large outliers. The estimates of $\tau_{\varphi}$
and $\sigma$ are 3.92 and 5.82, respectively; hence, as the table shows, the estimated standard
errors of the Wilcoxon estimates are lower than their LS counterparts.
Table 4.5.2: Wilcoxon and LS Estimates of the Linear Models for Each Treatment

                 Wilcoxon Estimates               LS Estimates
  Treatment    Int. (SE)      Slope (SE)      Int. (SE)      Slope (SE)
  1             27.3 (3.6)    -.02 (.20)       25.6 (5.3)     .07 (.29)
  2            -1.78 (2.8)     .83 (.15)      -1.39 (4.0)     .83 (.22)
  3             -6.7 (2.4)     .87 (.12)       -6.4 (3.5)     .87 (.17)
  4              2.9 (2.4)     .66 (.13)        7.8 (3.4)     .49 (.19)
Table 4.5.3: Analysis of Dispersion (Wilcoxon) for the Snake Data

  Source        D = Dispersion    df     MD       F_φ
  H_{0C}             24.06         3     8.021    4.09
  Treatment          74.89         3    24.96    12.7
  Error                           32     1.96
Table 4.5.3 displays the analysis of dispersion table for this data. Note that $F_{\varphi}$ strongly
rejects $H_{0C}$ (the p-value is .015). This confirms the discussion above based on Figure 4.5.1. The
second hypothesis tested is no treatment effect, $H_0: \alpha_1 = \cdots = \alpha_4$. Although $F_{\varphi}$ strongly
rejects this hypothesis also, in light of the results for $H_{0C}$, the practical interpretation of such
a decision is not obvious. The value of the LS F-test for $H_{0C}$ is 2.34 (p-value .078). If
$H_{0C}$ is not rejected, then the LS analysis could lead to an invalid interpretation. The outliers
spoiled the LS analysis of this data set. As shown in Exercise 4.8.15, both the R-analysis
and the LS analysis strongly reject $H_{0C}$ for the square root transformation of the response.
4.6 Further Examples
In this section we present two further data examples. Our main purpose in this section is
to show how easy it is to use the rank-based analysis on more complicated models. Each
example is a three-way crossed factorial design. The first has replicates while the second
involves a covariate. Besides displaying tests of the effects, we also consider estimates and
standard errors of contrasts of interest.
Example 4.6.1. Marketing Data

This data set is drawn from an exercise on page 953 of Neter et al. (1996). A marketing
firm research consultant studied the effects that three factors have on the quality of work
performed under contract by independent marketing research agencies. The three factors
and their levels are: Fee level ((1) High, (2) Average, and (3) Low); Scope ((1) All contract
work performed in house, (2) Some subcontracted out); Supervision ((1) Local supervision,
(2) Traveling supervisors). The response was the quality of the work performed, as measured
by an index. Four agencies were chosen for each level combination. For convenience, the data
are displayed in Table 4.6.1.
Table 4.6.1: Marketing Data for Example 4.6.1

                  Local Supervision            Traveling Supervision
  Fee Level     In House     Sub-out          In House     Sub-out
  High            124.3       115.1             112.7        88.2
                  120.6       119.9             110.2        96.0
                  120.7       115.4             113.5        96.4
                  122.6       117.3             108.6        90.1
  Average         119.3       117.2             113.6        92.7
                  188.9       114.4             109.1        91.1
                  125.3       113.4             108.9        90.7
                  121.4       120.0             112.3        87.9
  Low              90.9        89.9              78.6        58.6
                   95.3        83.0              80.6        63.5
                   88.8        86.5              83.5        59.8
                   92.0        82.7              77.1        62.3
Hence, the design is a $3 \times 2 \times 2$ crossed factorial with 4 replications, which we shall write
as

$$ y_{ijkl} = \mu_{ijk} + e_{ijkl} \;, \quad i = 1, \ldots, 3;\; j, k = 1, 2;\; l = 1, \ldots, 4 \;, \qquad (4.6.1) $$

where $y_{ijkl}$ denotes the response for the $l$th replicate at Fee $i$, Scope $j$, and Supervision $k$.
Wilcoxon scores were selected for the fit, with residuals adjusted to have median 0. Panels A
and B of Figure 4.6.1 show, respectively, the residual and normal q-q plots for the internal
R-studentized residuals, (3.9.31), based on this fit. The scatter in the residual plot is fairly
random and flat. There do not appear to be any outliers. The main trend in the normal
q-q plot indicates tails lighter than those of a normal distribution. Hence, the fit is good
and we proceed with the analysis.

Table 4.6.2 displays the tests of the effects based on the LS and Wilcoxon fits. The Wald-type
$F_{\varphi,Q}$ statistic based on the pseudo-observations is also given. The LS and Wilcoxon
analyses agree, which is not surprising based on the residual plot. The main effects are
highly significant and the only significant interaction is the interaction between Scope and
Supervision.
As a subsequent analysis, we shall consider nine contrasts of interest. We will use the
Bonferroni method based on the pseudo-observations, as discussed in Section 4.3. We used
Minitab to obtain the results that follow. Because the factor Fee does not interact with
the other two factors, the contrasts of interest for this factor are $\bar{\mu}_{1\cdot\cdot} - \bar{\mu}_{2\cdot\cdot}$, $\bar{\mu}_{1\cdot\cdot} - \bar{\mu}_{3\cdot\cdot}$, and
$\bar{\mu}_{2\cdot\cdot} - \bar{\mu}_{3\cdot\cdot}$. Table 4.6.3 presents the estimates of these contrasts and the 95% Bonferroni
confidence intervals, which are given by the estimate of the contrast $\pm\, t_{.05/18;\,36}\, \hat{\tau}_{\varphi} \sqrt{2/16} \doteq$
estimate $\pm\, 2.64$. From these results, quality of work significantly improves for either high or average
fees over low fees. The difference between high and average fees is insignificant.
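The Bonferroni margin quoted above is easy to reproduce; the Python sketch below uses $\hat{\tau}_{\varphi} = 2.53$ from Table 4.6.2 and the fact that each Fee-level mean averages 16 observations.

    import numpy as np
    from scipy.stats import t

    def bonferroni_margin(alpha, n_intervals, df, tau_hat, n_per_mean):
        # half-width of a simultaneous CI for a difference of two level
        # means, each mean averaging n_per_mean observations
        tcrit = t.ppf(1.0 - alpha / (2.0 * n_intervals), df)
        return tcrit * tau_hat * np.sqrt(2.0 / n_per_mean)

    # nine intervals, error df 36, tau_hat = 2.53 from Table 4.6.2
    print(bonferroni_margin(0.05, 9, 36, 2.53, 16))   # approximately 2.64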
Since the factors Scope and Supervision interact, but do not interact separately or jointly
Figure 4.6.1: Panel A: Wilcoxon studentized residual plot for the data of Example 4.6.1
(residuals versus the Wilcoxon fit); Panel B: Normal q-q plot of the Wilcoxon studentized
residuals. [Figure not reproduced.]
Table 4.6.2: Tests of Effects for the Market Data

  Effect                      df     F_LS     F_φ      F_φ,Q
  Fee                          2     679.     207.     793.
  Scope                        1     248.     160.     290.
  Supervision                  1     518.     252.     596.
  Fee × Scope                  2     .108     .098     .103
  Fee × Super.                 2     .053     .004     .002
  Scope × Super.               1     77.7     70.2     89.6
  Fee × Scope × Super.         2     .266     .532     .362
  $\hat\sigma$ or $\hat\tau_\varphi$   36     2.72     2.53     2.53
Table 4.6.3: Contrasts of Interest for the Market Data

  Contrast                                      Estimate    Confidence Interval
  $\bar{\mu}_{1\cdot\cdot} - \bar{\mu}_{2\cdot\cdot}$       1.05      (-1.59, 3.69)
  $\bar{\mu}_{1\cdot\cdot} - \bar{\mu}_{3\cdot\cdot}$      31.34      (28.70, 33.98)
  $\bar{\mu}_{2\cdot\cdot} - \bar{\mu}_{3\cdot\cdot}$      30.28      (27.64, 32.92)

Table 4.6.4: Parameters of Interest for the Market Data

  Parameter   $\bar{\mu}_{\cdot 11}$   $\bar{\mu}_{\cdot 12}$   $\bar{\mu}_{\cdot 21}$   $\bar{\mu}_{\cdot 22}$
  Estimate          111.9                  101.0                  106.4                  81.64
with the factor Fee, the parameters of interest are the simple contrasts among $\bar{\mu}_{\cdot 11}$, $\bar{\mu}_{\cdot 12}$,
$\bar{\mu}_{\cdot 21}$, and $\bar{\mu}_{\cdot 22}$. Table 4.6.4 displays the estimates of these parameters. Using $\alpha = .05$, the
Bonferroni bound for a simple contrast here is $t_{.05/18;\,36}\, \hat{\tau}_{\varphi} \sqrt{2/12} \doteq 3.04$. Hence all 6 simple
pairwise contrasts among these parameters are significantly different from 0. In particular,
averaging over fees, the best quality of work occurs when all contract work is done in house
and under local supervision. The source of the interaction between the factors Scope and
Supervision is also clear from these estimates.
Example 4.6.2. Pigs and Diets

This data set is discussed on page 291 of Rao (1973). It concerns the effect of diets on the
growth rate of pigs. There are three diets, called A, B and C. Besides the diet classification,
the pigs were classified according to their pens (5 levels) and sex (2 levels). Their initial
weight was also recorded as a covariate. The data are displayed in Table 4.6.5.

The design is a $5 \times 3 \times 2$ crossed factorial with only one replication. For comparison
purposes, we will use the same model that Rao used, which is a fixed effects model with main
effects and the two-way interaction between the factors Diet and Sex. Letting $y_{ijk}$ and $x_{ijk}$
denote, respectively, the growth rate in pounds per week and the initial weight of the pig in
pen $i$, on diet $j$, and of sex $k$, this model is given by:

$$ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\beta\gamma)_{jk} + \delta x_{ijk} + e_{ijk} \;, \quad i = 1, \ldots, 5;\; j = 1, 2, 3;\; k = 1, 2 \;. \qquad (4.6.2) $$

For convenience we have written the model as an overparameterized model, although we
could have expressed it as a cell means model with constraints for the interaction effects
which are assumed to be 0. The effects of interest are the diet effects, $\beta_j$.

We fit the model using Wilcoxon scores. The analysis could also be carried out
using pseudo-observations and Minitab. Panels A and B of Figure 4.6.2 display the residual
plot and normal q-q plot of the internal R-studentized residuals based on the Wilcoxon
fit. The residual plot shows the three outliers. The outliers are prominent in the q-q plot, but
note that even the remaining plotted points indicate an error distribution with heavier tails
than the normal. Not surprisingly, the estimate of $\tau_{\varphi}$ is smaller than that of $\sigma$: .413 and
.506, respectively. The largest outlier corresponds to the 6th pig, which had the lowest initial
weight (recall that the internal R-studentized residuals account for position in factor space)
but whose response was above the first quartile. The second largest outlier corresponds to
the pig which had the lowest response.
Table 4.6.5: Data for Example 4.6.2

  Pen   Diet   Sex   Initial Wt.   Growth Rate
   1     A      G        48            9.94
         B      G        48           10.00
         C      G        48            9.75
         C      H        48            9.11
         B      H        39            8.51
         A      H        38            9.52
   2     B      G        32            9.24
         C      G        28            8.66
         A      G        32            9.48
         C      H        37            8.50
         A      H        35            8.21
         B      H        38            9.95
   3     C      G        33            7.63
         A      G        35            9.32
         B      G        41            9.34
         B      H        46            8.43
         C      H        42            8.90
         A      H        41            9.32
   4     C      G        50           10.37
         A      H        48           10.56
         B      G        46            9.68
         A      G        46           10.98
         B      H        40            8.86
         C      H        42            9.51
   5     B      G        37            9.67
         A      G        32            8.82
         C      G        30            8.57
         B      H        40            9.20
         C      H        40            8.76
         A      H        43           10.42
Figure 4.6.2: Panel A: Internal Wilcoxon studentized residual plot for the data of Example 4.6.2
(residuals versus the Wilcoxon fit); Panel B: Normal q-q plot of the internal Wilcoxon
studentized residuals. [Figure not reproduced.]
Table 4.6.6 displays the results of the tests of the effects for the LS and Wilcoxon fits.
The pseudo-observations were obtained based on the Wilcoxon fit and were input as the
responses in SAS to obtain $F_{\varphi,Q}$ using Type III sums of squares. The Wilcoxon analyses
based on $F_{\varphi}$ and $F_{\varphi,Q}$ are quite similar. All three tests indicate no interaction between the
factors Diet and Sex, which clarifies the interpretation of the main effects. Also, all three
agree on the need for the covariate. Diet has a significant effect on weight gain, as does sex.
The robust analyses indicate that pen is also a contributing factor.

Table 4.6.7 displays the results of the analyses when the covariate is not taken into
account. It is interesting to note here that the factor Diet is not significant based on the LS
fit while it is for the Wilcoxon analyses. The heavy tails of the error distribution, as evident
in the residual plots, have foiled the LS analysis.
Table 4.6.6: Test Statistics for the Effects of the Pigs and Diets Data with Initial Weight as a
Covariate

  Effect         df     F_LS      F_φ       F_φ,Q
  Pen             4     2.35      3.65*     3.48*
  Diet            2     4.67*     7.98*     8.70*
  Sex             1     5.05*     8.08*     8.02*
  Diet × Sex      2     0.17      1.12       .81
  Initial Wt.     1    13.7*     19.2*     19.6*
  $\hat\sigma$ or $\hat\tau_\varphi$   19    .507     .413     .413

  * Denotes significance at the .05 level
Table 4.6.7: Test Statistics for the Effects of the Pigs and Diets Data with No Covariate

  Effect         df     F_LS      F_φ       F_φ,Q
  Pen             4     2.95*     4.20*     5.87*
  Diet            2     2.77      4.80*     5.54*
  Sex             1     1.08      3.01      3.83
  Diet × Sex      2     0.55      1.28      1.46
  $\hat\sigma$ or $\hat\tau_\varphi$   20    .648     .499     .501

  * Denotes significance at the .05 level
4.7 Rank Transform
In this section we present a short comparison of the rank-based analysis of this chapter
with the rank transform (RT) analysis. Much of this discussion is drawn from McKean and
Vidmar (1994). The main point of this section is to show that the RT test often does
not work well for testing hypotheses in factorial designs and more complicated models.
Hence, we do not recommend using RT methods. On the other hand, Akritas, Arnold
and Brunner (1997) develop a unique approach in which factorial hypotheses are replaced
by corresponding nonparametric hypotheses based on cdfs. They then show that RT-type
methods are appropriate for these nonparametric hypotheses.

As we have pointed out, the rank-based analysis is quite analogous to the traditional
LS-based analysis. It is based on R-estimates while the traditional analysis is based on LS
estimates. The only difference in the geometry of estimation is that the R-estimates
are based on the pseudo-norm (3.2.6) while the LS estimates are based on the Euclidean
pseudo-norm. The rank-based analysis produces confidence intervals and regions and tests of
general linear hypotheses. The diagnostic procedures of Chapter 3 can be used to check the
adequacy of the fit of the model and to determine outliers and influential points. Furthermore,
the efficiency properties discussed for the simple location nonparametric procedures carry
over to this analysis. The rank-based analysis offers the user a complete and highly efficient
analysis of a linear model as an alternative to the traditional analysis. Further, there are
computational algorithms available for these procedures.
Proposed by Conover and Iman (1981), the rank transform (RT) has become a very
popular procedure. The RT test of a linear hypothesis consists generally of ranking the
dependent variable and then performing the LS test on these ranks. Although in general the
RT offers no estimation, and hence no model checking, it is a simple procedure for testing.

Some basic differences between the rank-based analysis and the RT are readily apparent.
In linear models the $Y_i$'s are independent but not identically distributed. Hence, when the
RT is applied indiscriminately to a linear model, the ranking is performed on non-identically
distributed items. The rankings in the RT are not free of the $x$'s. In contrast, the residuals
based on the R-estimates, under Wilcoxon scores, satisfy

$$ \sum_{i=1}^{n} x_{ij}\, R(Y_i - \mathbf{x}_i' \hat{\boldsymbol{\beta}}_R) \doteq 0 \;, \quad j = 1, \ldots, p \;. \qquad (4.7.1) $$

Hence the R-residuals have been adjusted by the fit so that the ranks are orthogonal to the
$x$-space, i.e., the ranks are free of the $x$'s. These are the ranks that are used in the R-test
statistic $F_{\varphi}$ at the full model. Under $H_0$ this would also be true of the expected ranks of
the residuals in the R-fit of the reduced model. Note, also, that the statistic $F_{\varphi}$ is invariant
to the values of the parameters of the reduced model.

Unlike the rank-based analysis, there is no general supporting theory for the RT. Hora
and Conover (1984) presented asymptotic null theory on the RT for treatment effect in a
randomized block design with no interaction. Thompson and Ammann (1989) explored the
efficiency of this RT, showing, however, that this efficiency depends on the block parameters.
RT theory for repeated measures designs has been developed by Akritas (1991, 1993) and
Thompson (1991b). These extensions also have the unpleasant trait that their efficiencies
depend on nuisance parameters.

Many of these theoretical studies on the RT have raised serious questions concerning
the validity of the RT for simple two-way and more complicated designs. For a two-way
crossed factorial design, Brunner and Neumann (1986) showed that the RT statistics are
not reasonable for testing main effects in the presence of interaction for designs larger than
$2 \times 2$ designs. This was echoed by Akritas (1990), who stated further that RT statistics are
not reasonable test statistics for interaction nor for most other common hypotheses in either
two-way crossed or nested classifications. In several of these articles (see Akritas, 1990, and
Thompson, 1991a, 1993), the nonlinear nature of the RT is faulted. For a given model the
hypotheses of interest are linear contrasts in model parameters. The rank transform, though,
is nonlinear; hence often the original hypothesis is no longer tested by the rank transformed
data. The same issue was raised earlier by Fligner (1981) in a discussion of the article by
Conover and Iman (1981).

In terms of small sample properties, initial simulations of the RT analysis on certain
models (see, for example, Iman, 1974) did appear promising. Now there has been ample
evidence based on simulation studies questioning the wisdom of doing RTs on designs as
simple as two-way factorial designs with interaction; see, for example, Blair, Sawilowsky and
Higgins (1987). We discuss one such study next and then present an analysis of covariance
example where the use of the RT results in a poor analysis.
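The orthogonality property (4.7.1) is easy to check numerically. The Python sketch below is an illustration only, not the book's software: it obtains a Wilcoxon R-fit by minimizing Jaeckel's rank dispersion with a general-purpose optimizer and then evaluates the left side of (4.7.1) on simulated data.

    import numpy as np
    from scipy.stats import rankdata
    from scipy.optimize import minimize

    def wilcoxon_dispersion(beta, X, y):
        # Jaeckel's dispersion: residuals weighted by standardized rank scores
        e = y - X @ beta
        n = len(e)
        a = np.sqrt(12.0) * (rankdata(e) / (n + 1.0) - 0.5)
        return np.sum(a * e)

    rng = np.random.default_rng(0)
    n, p = 60, 2
    X = rng.normal(size=(n, p))
    X -= X.mean(axis=0)                          # centered design
    y = X @ np.array([1.0, -0.5]) + rng.standard_t(3, size=n)

    fit = minimize(wilcoxon_dispersion, np.zeros(p), args=(X, y),
                   method="Nelder-Mead")
    resid = y - X @ fit.x
    # left side of (4.7.1): ranks of the R-residuals versus the x-space
    print(X.T @ rankdata(resid))                 # each entry near 0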
4.7.1 Monte Carlo Study
Another major Monte Carlo study on the RT was performed by Sawilowsky, Blair and
Higgins (1989), which investigated the behavior of the RT over a three-way factorial design
with interaction. In many of their situations, the RT gave severely inflated empirical levels
and severely deflated empirical powers. We present the results of a small Monte Carlo study
discussed in McKean and Vidmar (1994), which is based on the study of Sawilowsky et al.
The model for the study is a $2 \times 2 \times 2$ three-way factorial design. The shortcomings of the RT
as discussed in the two-way models above seem to become worse for such models. Letting
A, B, and C denote the factors, the model is

$$ Y_{ijkl} = \mu + a_i + b_j + c_k + (ab)_{ij} + (ac)_{ik} + (bc)_{jk} + (abc)_{ijk} + e_{ijkl} \;, \quad i, j, k = 1, 2 \,,\; l = 1, \ldots, r \;, $$

where $r$ is the number of replicates per cell. In the study by Sawilowsky et al., $r$ was set at
2, 5, 10 or 20. Several distributions were considered for the errors $e_{ijkl}$, including the normal.
They considered the usual seven hypotheses (3 main effects, 3 two-ways, and 1 three-way)
and 8 patterns of alternatives. The nonnull effects were set at $c\sigma$, a multiple of
$\sigma$; see, also, McKean and Vidmar (1992) for further discussion. The study of Sawilowsky et
al. found that the RT test for interaction "... is dramatically nonrobust at times" and that it
has poor power properties in many cases.

In order to compare the behavior of the rank-based analysis and the RT on this design,
we performed part of their simulation study. We considered standard normal errors
and contaminated normal errors, which had 10% contamination from a normal distribution
with mean 0 and standard deviation 8. The normal variates were generated as discussed in
Marsaglia and Bray (1964) using uniform variates which were generated by a portable FORTRAN
generator written by Kahaner, Moler and Nash (1989). There were 5 replications per
cell and the nonnull constant of proportionality $c$ was set at .75. The simulation size was
1000.

Tables 4.7.1 and 4.7.2 summarize the results of our study for the following two situations:
the two-way interaction A × C and the three-way interaction effect A × B × C. The
alternative for the A × C situation had all main effects and all two-way interactions in, while
the alternative for the A × B × C situation had all main effects and two-way interactions, besides
the three-way alternative, in. These were poor situations for the RT in the study conducted
by Sawilowsky et al., and as Tables 4.7.1 and 4.7.2 indicate, the RT behaves poorly for
these situations in our study also. Its empirical levels are deplorable. For instance, at
the nominal .10 level for the three-way interaction test under normal errors, the RT has
an empirical level of .777, while the level is .511 at the contaminated normal. In contrast,
the levels of the rank-based analysis were quite close to the nominal levels under normal
errors and slightly conservative under the contaminated normal errors. In terms of power,
note that the empirical power of the rank-based analysis is slightly less than the empirical
power of LS under normal errors, while it is substantially greater than the power of LS under
contaminated normal errors. For the three-way interaction test, the empirical power of the
RT falls below its empirical level.
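For reference, the contaminated normal errors used in this study are a simple two-component mixture; a minimal Python generator, with the 10% contamination and standard deviation 8 quoted above as defaults, is:

    import numpy as np

    def contaminated_normal(rng, size, eps=0.10, sigma_c=8.0):
        # N(0,1) with probability 1 - eps, N(0, sigma_c**2) with probability eps
        z = rng.normal(size=size)
        contam = rng.random(size) < eps
        return np.where(contam, sigma_c * z, z)

    errors = contaminated_normal(np.random.default_rng(3), 8 * 5)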
Table 4.7.1: Empirical Levels and Power for the Test of A × C

                           Normal Errors                         Contaminated Normal Errors
                 Null Model          Alternative Model     Null Model          Alternative Model
                 (Nominal level)     (Nominal level)       (Nominal level)     (Nominal level)
  Method       .10   .05   .01     .10   .05   .01       .10   .05   .01     .10   .05   .01
  LS          .095  .040  .009    .998  .995  .977      .087  .029  .001    .602  .505  .336
  Wilcoxon    .104  .060  .006    .997  .992  .970      .079  .032  .004    .934  .887  .713
  RT          .369  .243  .076    .847  .770  .521      .221  .128  .039    .677  .576  .319

Table 4.7.2: Empirical Levels and Power for the Test of A × B × C

                           Normal Errors                         Contaminated Normal Errors
                 Null Model          Alternative Model     Null Model          Alternative Model
                 (Nominal level)     (Nominal level)       (Nominal level)     (Nominal level)
  Method       .10   .05   .01     .10   .05   .01       .10   .05   .01     .10   .05   .01
  LS          .094  .050  .005    1.00  .998  .980      .102  .041  .001    .598  .485  .301
  Wilcoxon    .101  .060  .004    .997  .992  .970      .085  .039  .006    .948  .887  .713
  RT          .777  .644  .381    .484  .343  .144      .511  .377  .174    .398  .276  .105
Table 4.7.3: Shirley Data
Group 1 Group 2 Group 3
Initial Final Initial Final Initial Final
time time time time time time
1.8 79.1 1.6 10.2 1.3 14.8
1.3 47.6 0.9 3.4 2.3 30.7
1.8 64.4 1.5 9.9 0.9 7.7
1.1 68.7 1.6 3.7 1.9 63.9
2.5 180.0 2.6 39.3 1.2 3.5
1.0 27.3 1.4 34.0 1.3 10.0
1.1 56.4 2.0 40.7 1.2 6.9
2.3 163.3 0.9 10.5 2.4 22.5
2.4 180.0 1.6 0.8 1.4 11.4
2.8 132.4 1.2 4.9 0.8 3.3
Example 4.7.1. The Rat Data

The following example, taken from Shirley (1981), contrasts the rank-based methods, the
rank transformed methods, and least squares methods in an analysis of covariance setting.
The response is the time it takes a rat to enter a chamber after receiving a treatment designed
to delay the time of entry. There were 30 rats in the experiment and they were divided evenly
into three groups. The rats in Groups 2 and 3 received an antidote to the treatment. The
covariate is the time taken by the rat to enter the chamber prior to its treatment. The data
are presented in Table 4.7.3 and are displayed in Panel A of Figure 4.7.1.

As a full model, we considered the model

$$ y_{ij} = \alpha_j + \beta_j x_{ij} + e_{ij} \;, \quad j = 1, \ldots, 3 \,,\; i = 1, \ldots, 10 \;, \qquad (4.7.2) $$

where $y_{ij}$ denotes the response for the $i$th rat in Group $j$ and $x_{ij}$ denotes the corresponding
covariate. There is a slight quadratic aspect to the Wilcoxon residual plot, Panel B of Figure
4.7.1, which is investigated in Exercise 4.8.16.

Panel C of Figure 4.7.1 displays a plot of the internal Wilcoxon studentized residuals by
case. Note that there are several outliers. These also can be seen in the plots of the data
for Groups 2 and 3, Panels E and F of Figure 4.7.1. Note that the outliers have an effect on
the LS fits, drawing the fits toward the outliers in each group. In particular, for Group 3, it
only took one outlier to spoil the LS fit. On the other hand, the Wilcoxon fit is not affected
by the outliers. The estimates are given in Table 4.7.4. As the plots indicate, the LS and
Wilcoxon estimates differ numerically. Further evidence of the more precise R-fits relative
to the LS fits is given by the estimates of the scale parameters $\sigma$ and $\tau_{\varphi}$ found in Table
4.7.4.

We first test for homogeneity of slopes for the groups; i.e., $H_0: \beta_1 = \beta_2 = \beta_3$. As clearly
shown in Panel A of Figure 4.7.1, this does not appear to be true for this data.
Figure 4.7.1: Panel A: Wilcoxon fits of all groups (time after treatment versus time before
treatment, with separate lines for Groups 1-3); Panel B: Internal Wilcoxon studentized
residual plot; Panel C: Internal Wilcoxon studentized residuals by case; Panel D: LS (solid
line) and Wilcoxon (dashed line) fits for Group 1; Panel E: the same for Group 2; Panel F:
the same for Group 3. [Figure not reproduced.]
Table 4.7.4: LS and Wilcoxon Estimates (standard errors) for the Rat Data

                    Group 1                    Group 2                    Group 3
  Procedure    Int. (SE)    Slope (SE)    Int. (SE)    Slope (SE)    Int. (SE)    Slope (SE)   $\hat\sigma$ or $\hat\tau_\varphi$
  LS          -39.1 (20.)   76.8 (10.)   -15.6 (22.)   20.5 (14.)   -14.7 (19.)   21.9 (12.)       20.5
  Wilcoxon    -54.3 (16.)   84.2 (8.6)   -19.3 (18.)   21.0 (11.)   -11.6 (16.)   17.4 (10.)       17.0
While the slopes for Groups 2 and 3 seem to be about the same (the Wilcoxon 95% confidence
interval for $\beta_2 - \beta_3$ is $3.9 \pm 27.2$), the slope for Group 1 appears to differ from the other two. To
confirm this statistically, the value of the $F_{\varphi}$ statistic for testing homogeneity of slopes, $H_0$, is
9.88 with 2 and 24 degrees of freedom, which is highly significant ($p < .001$). This
says that Group 1, the group that did not receive the antidote, does differ significantly from
the other two groups in terms of how the groups interact with the covariate. In particular,
the estimated slope of post-treatment time on pre-treatment time for the rats in Group 1 is
about 4 times as large as the slope for the rats in the two groups which received the antidote.
Because there is interaction between the groups and the covariate, we did not proceed with
the second test on average group effects; i.e., testing $\alpha_1 = \alpha_2 = \alpha_3$.
Shirley (1981) performed a rank transform on this data by ranking the response and then
applying a standard least squares analysis. It is clear from Panel A of Figure 4.7.1 that this
nonlinear transform will result in homogeneous slopes for the ranked problem, as confirmed
by Shirley's analysis. But the rank transform is a nonlinear transform, and the subsequent
analysis based on the rank transformed data does not test homogeneity of slopes in Model
(4.7.2). The RT analysis is misleading in this case.

Note that using the rank-based analysis we performed an overall analysis of this data set,
including a residual analysis for model criticism. Hypotheses of interest were readily tested,
and estimates of contrasts, along with standard errors, were easily obtained.
4.8 Exercises
4.8.1. Derive expression (4.2.19).

4.8.2. In Section 4.2.2, when we have only two levels, show that the Kruskal-Wallis test is
equivalent to the MWW test discussed in Chapter 2.

4.8.3. Consider a oneway design for the data in Example 4.2.3. Fit the model using
Wilcoxon estimates and conduct a residual analysis, including residual and q-q plots of
standardized residuals. Identify any outliers. Next test the hypothesis (4.2.13) using the
Kruskal-Wallis test and the test based on $F_{\varphi}$.

4.8.4. Using the notation of Section 4.2.4, show that the asymptotic covariance between
$\hat{\mu}_i$ and $\hat{\mu}_{i'}$ is given by expression (4.2.31). Next show that expressions (3.9.38) and (4.2.31)
lead to a verification of the confidence interval (4.2.30).

4.8.5. Show that the asymptotic covariance between estimates of location levels is given by
expression (4.2.31).
4.8.6. Suppose $D$ is a symmetric, positive definite matrix. Prove that

$$ \sup_{h \ne 0} \frac{h'y}{\sqrt{h' D^{-1} h}} = \sqrt{y' D y} \;. \qquad (4.8.1) $$

Refer to the Kruskal-Wallis statistic $H_W$, given in expression (4.2.22). Let $y' = (\bar{R}_1 - \frac{n+1}{2}, \ldots, \bar{R}_k - \frac{n+1}{2})$ and $D = \frac{12}{n(n+1)} \mathrm{diag}(n_1, \ldots, n_k)$. Then, using (4.8.1), show that
$H_W \ge \chi^2_{\alpha}(k-1)$ if and only if

$$ \frac{\sum_{i=1}^{k} h_i \left( \bar{R}_i - \frac{n+1}{2} \right)}{\sqrt{\frac{n(n+1)}{12} \sum_{j=1}^{k} \frac{h_j^2}{n_j}}} \ge \sqrt{\chi^2_{\alpha}(k-1)} \;, $$

for some vector $h$ such that $\sum h_i = 0$.

Hence, if the Kruskal-Wallis test rejects $H_0$ at level $\alpha$, then there must be at least one
contrast in the rank averages that exceeds the critical value $\sqrt{\chi^2_{\alpha}(k-1)}$. This provides
Scheffé-type multiple contrast tests with family error rate approximately equal to $\alpha$.
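A hedged Python sketch of this Scheffé-type procedure: given the samples and a user-supplied contrast $h$ with $\sum h_i = 0$, it computes the contrast statistic displayed above and the chi-square critical value.

    import numpy as np
    from scipy.stats import rankdata, chi2

    def scheffe_rank_contrast(samples, h, alpha=0.10):
        # Scheffe-type test for one contrast h in the rank averages
        n_i = np.array([len(s) for s in samples])
        n = n_i.sum()
        ranks = rankdata(np.concatenate(samples))
        rbar = np.array([r.mean()
                         for r in np.split(ranks, np.cumsum(n_i)[:-1])])
        h = np.asarray(h, dtype=float)
        num = np.sum(h * (rbar - (n + 1) / 2.0))
        den = np.sqrt(n * (n + 1) / 12.0 * np.sum(h**2 / n_i))
        crit = np.sqrt(chi2.ppf(1 - alpha, df=len(samples) - 1))
        return num / den, crit   # declare the contrast if the first exceeds the second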
4.8.7. Apply the procedure presented in Exercise 4.8.6 to the quail data of Example 4.2.1.
Use $\alpha = .10$.
4.8.8. Let $I_1$ and $I_2$ be $(1-\alpha)100\%$ confidence intervals for parameters $\theta_1$ and $\theta_2$, respectively.
Show that

$$ P[\theta_1 \in I_1 \cap \theta_2 \in I_2] \ge 1 - 2\alpha \;. \qquad (4.8.2) $$

(a). Suppose the confidence intervals $I_1$ and $I_2$ are independent. Show that

$$ 1 - 2\alpha \le P[\theta_1 \in I_1 \cap \theta_2 \in I_2] \le 1 - \alpha \;. $$

(b). Generalize expression (4.8.2) to $k$ confidence intervals and derive the Bonferroni procedure
described in (4.3.2).
4.8.9. In the notation of the Pairwise Tests Based on Joint Rankings procedure of
Section 4.3, show that $\mathbf{R}_1$ is asymptotically $N_{k-1}\left(\mathbf{0}, \frac{k(n+1)}{12}(I_{k-1} + J_{k-1})\right)$ under $H_0: \mu_1 =
\cdots = \mu_k$. (Hint: The asymptotic normality follows as in Theorem 3.5.2. In order to determine
the covariance matrix of $\mathbf{R}_1$, first obtain the covariance matrix of the random vector
$\overline{\mathbf{R}}' = (\overline{R}_1, \ldots, \overline{R}_k)$ and then obtain the covariance matrix of $\mathbf{R}_1$ by using the transformation
$[-\mathbf{1}_{k-1} \; I_{k-1}]$.)
4.8.10. In Section 4.3, the Pairwise Tests Based on Joint Rankings procedure was
discussed based on Wilcoxon scores. Generalize this procedure for an arbitrary score function
$\varphi(u)$.
4.8.11. For the baseball data in Exercise 1.12.32, consider the following oneway problem.
The response of interest is the hitter's average and the three groups are left-handed hitters,
right-handed hitters, and switch hitters. Using either Minitab or rglm, obtain the following
analyses based on Wilcoxon scores:

(a.) Using the test statistic $F_{\varphi}$, test for an overall group effect. Obtain the p-value and
conclude at the 5% level.

(b.) Use the protected LSD procedure of Fisher to compare the groups at the 5% level.
4.8.12. Consider the Bonferroni-type procedure described in item (6) of Section 4.3. Formulate
a similar protected LSD-type procedure based on the test statistic $F_{\varphi}$. Use these
procedures to make the comparisons discussed in Exercise 4.8.11.
4.8.13. Consider the baseball data in Exercise 1.12.32. In Exercise 3.16.38, we investigated
the linear relationship between a player's height and his weight. For this problem, consider
the simple linear model

$$ \text{height} = \alpha + \beta \cdot \text{weight} + e \;. $$

Using Wilcoxon scores and either Minitab or rglm, investigate whether or not the same
simple linear model can be used for both the pitchers and the hitters. Obtain the p-value for
the test of this hypothesis based on the statistic $F_{\varphi}$.
4.8.14. In Example 4.5.1, obtain the square root of the response and fit it to the full
model. Perform a residual analysis on the resulting fit. In particular, identify any outliers
and compare the heteroscedasticity in the plot of the residuals versus the fitted values with
the analogous plot in Example 4.5.1.

4.8.15. For Example 4.5.1, overlay the Wilcoxon and LS fits for the four treatments based
on the square root transformation of the response. Then obtain an analysis of covariance for
both the Wilcoxon and LS analyses for the transformed data. Comment on the plots and
the results of the analyses.

4.8.16. Consider Example 4.7.1. Investigate whether a model which also includes quadratic
terms in the covariates is more appropriate for the Rat Data than Model (4.7.2).

4.8.17. Consider Example 4.7.1. Eliminate the placebo group, Group 1, and perform an
analysis of covariance on Groups 2 and 3. Use the linear model (4.7.2). Is there any
difference between these groups?
4.8.18. Let $H_W = W(W'W)^{-1}W'$ be the projection matrix based on the incidence matrix,
(4.2.5). Show that $H_W$ is a block diagonal matrix whose $i$th block is the $n_i \times n_i$ matrix
with all entries equal to $n_i^{-1}$. Recall that $X = (I - H_1)W_1$ in Section 4.2.1. Let $H_X =
X(X'X)^{-1}X'$ be the corresponding projection matrix. Then argue that $H_W = H_1 + H_X$
and, hence, that $H_X = H_W - H_1$ is easy to find. Using (4.2.8), show that, for the oneway
design, $\mathrm{cov}(\hat{Z}) \doteq \tau_S^2 H_1 + \tau_{\varphi}^2 H_X$ and, hence, show that $\mathrm{var}(\hat{\mu}_i)$ is given by (4.2.11)
and that $\mathrm{cov}(\hat{\mu}_i, \hat{\mu}_{i'})$ is given by (4.2.31).
4.8.19. Suppose we have $k$ treatments of interest and we employ a block design consisting
of $a$ blocks. Within each block, we randomly assign $mk$ subjects to the treatments so that
each treatment receives $m$ subjects. Suppose we model the responses $Y_{ijl}$ as

$$ Y_{ijl} = \mu + \alpha_i + \beta_j + e_{ijl} \;; \quad i = 1, \ldots, a \,,\; j = 1, \ldots, k \,,\; l = 1, \ldots, m \;, $$

where $e_{ijl}$ are iid with cdf $F(t)$. We want to test

$$ H_0: \beta_1 = \cdots = \beta_k \;\text{ versus }\; H_A: \beta_j \ne \beta_{j'} \text{ for some } j \ne j' \;. $$

Suppose we rank the data in the $i$th block from 1 to $mk$, for $i = 1, \ldots, a$. Let $R_j$ be the sum
of the ranks for the $j$th treatment. Show that

$$ E(R_j) = \frac{am(mk+1)}{2} \;, \quad \mathrm{Var}(R_j) = \frac{am^2(mk+1)(k-1)}{12} \;, \quad \mathrm{Cov}(R_j, R_l) = -\frac{am^2(mk+1)}{12} \;. $$

Further, argue that

$$ K_m = \sum_{j=1}^{k} \frac{k-1}{k} \left[ \frac{R_j - E(R_j)}{\sqrt{\mathrm{Var}(R_j)}} \right]^2 = \frac{12}{akm^2(mk+1)} \sum_{j=1}^{k} R_j^2 - 3a(mk+1) $$

is asymptotically $\chi^2$ with $k-1$ degrees of freedom. Note that if $m = 1$ then $K_1$ is the Friedman
statistic. Show that the efficiency of the Friedman test relative to the twoway LS F-test is
$12\sigma^2 \left[ \int f^2(x)\, dx \right]^2 (k/(k+1))$. Plot the efficiency as a function of $k$ when $f$ is $N(0,1)$.
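As a numerical companion (a Python sketch under the exercise's notation), the following computes $K_m$ from within-block rankings and evaluates the Friedman efficiency at the normal, where $\int \phi^2(x)\,dx = 1/(2\sqrt{\pi})$ so that the efficiency reduces to $(3/\pi)\, k/(k+1)$.

    import numpy as np
    from scipy.stats import rankdata

    def K_m(Y):
        """K_m for a blocked design; Y has shape (a, k, m):
        a blocks, k treatments, m subjects per treatment per block."""
        a, k, m = Y.shape
        # rank all m*k observations within each block
        ranks = np.stack([rankdata(Y[i].ravel()).reshape(k, m)
                          for i in range(a)])
        R = ranks.sum(axis=(0, 2))        # rank sum per treatment
        return (12.0 / (a * k * m**2 * (m * k + 1)) * np.sum(R**2)
                - 3 * a * (m * k + 1))

    def friedman_efficiency(k):
        # ARE of the Friedman test to the LS F-test at the normal
        return (3.0 / np.pi) * k / (k + 1.0)

    print([round(friedman_efficiency(k), 3) for k in range(2, 11)])

With $m = 1$ the function returns the usual Friedman statistic, as the exercise notes.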
Table 4.8.1: Box-Cox Data, Exercise 4.8.20

                    Treatments
  Poisons      1      2      3      4
     1       0.31   0.82   0.43   0.45
             0.45   1.10   0.45   0.71
             0.46   0.88   0.63   0.66
             0.43   0.72   0.76   0.62
     2       0.36   0.92   0.44   0.56
             0.29   0.61   0.35   1.02
             0.40   0.49   0.31   0.71
             0.23   1.24   0.40   0.38
     3       0.22   0.30   0.23   0.30
             0.21   0.37   0.25   0.36
             0.18   0.38   0.24   0.31
             0.23   0.29   0.22   0.33
4.8.20. The data in Table 4.8.1 are the results of a $3 \times 4$ design discussed in Box and
Cox (1964). Forty-eight animals were exposed to three different poisons and four different
treatments. The response was the survival time of the animal. The design was balanced.
Use (4.4.1) as the full model to answer the questions below.

(a.) Using Wilcoxon scores, obtain the fit of the full model. Sketch the cell median profile plot
based on this fit and discuss whether or not interaction between poisons and treatments
is present.

(b.) Based on the Wilcoxon fit, plot the residuals versus the fitted values. Comment on the
appropriateness of the model. Also obtain the internal Wilcoxon studentized residuals
and identify any outliers.

(c.) Using the statistic $F_{\varphi}$, obtain the robust ANOVA table (main effects and interaction)
for this data. Conclude in terms of the p-values.

(d.) Note that the hypothesis matrix for interaction defines six interaction contrasts. Use the
Bonferroni and protected LSD multiple comparison procedures, (4.3.2) and (4.3.3),
to investigate these contrasts. Determine which, if any, are significant.

(e.) Repeat the analysis in Parts (c) and (d) (Bonferroni analysis) using LS. Compare the
Wilcoxon and LS results.
4.8.21. For testing the ordered alternative

$$ H_0: \mu_1 = \cdots = \mu_k \;\text{ versus }\; H_A: \mu_1 \le \cdots \le \mu_k \;, $$

with at least one strict inequality, let

$$ J = \sum_{s<t} S^{+}_{st} \;, $$

where $S^{+}_{st} = \#(Y_{tj} > Y_{si})$ for $i = 1, \ldots, n_s$ and $j = 1, \ldots, n_t$; see (2.2.20). This test for
ordered alternatives was proposed independently by Jonckheere (1954) and Terpstra (1952).
Under $H_0$, show the following:

(a.) $E(J) = \dfrac{n^2 - \sum n_t^2}{4}$.

(b.) $V(J) = \dfrac{n^2(2n+3) - \sum n_t^2(2n_t+3)}{72}$.

(c.) $z = (J - E(J))/\sqrt{V(J)}$ is approximately $N(0,1)$.

Hence, based on (a)-(c), an asymptotic test for $H_0$ versus $H_A$ is to reject $H_0$ if $z \ge z_{\alpha}$.
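A small self-contained Python implementation of this Jonckheere-Terpstra test (ties are not handled, matching the null moments in (a) and (b)):

    import numpy as np
    from scipy.stats import norm

    def jonckheere(samples):
        """Jonckheere-Terpstra test; `samples` is a list of 1-d arrays
        in the hypothesized increasing order."""
        J = 0
        for s in range(len(samples)):
            for t in range(s + 1, len(samples)):
                # S+_st: pairs with an observation in sample t above one in s
                J += np.sum(samples[t][:, None] > samples[s][None, :])
        n_t = np.array([len(x) for x in samples])
        n = n_t.sum()
        EJ = (n**2 - np.sum(n_t**2)) / 4.0
        VJ = (n**2 * (2 * n + 3) - np.sum(n_t**2 * (2 * n_t + 3))) / 72.0
        z = (J - EJ) / np.sqrt(VJ)
        return J, z, 1 - norm.cdf(z)    # one-sided p-value

    rng = np.random.default_rng(2)      # toy data with increasing locations
    samples = [rng.normal(mu, 1, 8) for mu in (0.0, 0.3, 0.8)]
    print(jonckheere(samples))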
Chapter 5

Models with Dependent Error Structure
5.1 General Mixed Models
Consider an experiment done over $m$ blocks (clusters), where block $k$ has $n_k$ observations.
Within block $k$, let $\mathbf{Y}_k$, $\mathbf{X}_k$, and $\mathbf{e}_k$ denote, respectively, the $n_k \times 1$ vector of responses, the
$n_k \times p$ design matrix, and the $n_k \times 1$ vector of errors. Let $\mathbf{1}_{n_k}$ denote a vector of $n_k$ ones.
Then the general mixed model for $\mathbf{Y}_k$ is

$$ \mathbf{Y}_k = \alpha \mathbf{1}_{n_k} + \mathbf{X}_k \boldsymbol{\beta} + \mathbf{e}_k \;, \quad k = 1, \ldots, m \;, \qquad (5.1.1) $$

where $\boldsymbol{\beta}$ is the $p \times 1$ vector of regression coefficients and $\alpha$ is the intercept parameter. The
components of the random error vector $\mathbf{e}_k$ are generally dependent random variables.
Later, for the theory, we make certain assumptions on the distribution of $\mathbf{e}_k$. Alternately, the
model can be written in the long form as

$$ \mathbf{Y} = \alpha \mathbf{1}_n + \mathbf{X} \boldsymbol{\beta} + \mathbf{e} \;, \qquad (5.1.2) $$

where $n = \sum_{k=1}^{m} n_k$ denotes the total sample size, $\mathbf{Y} = (\mathbf{Y}_1', \ldots, \mathbf{Y}_m')'$, $\mathbf{X} = (\mathbf{X}_1', \ldots, \mathbf{X}_m')'$,
and $\mathbf{e} = (\mathbf{e}_1', \ldots, \mathbf{e}_m')'$. Because an intercept parameter is in the model, we can assume that
$\mathbf{X}$ is centered and that the true median of $e_{kj}$ is zero. Since we can always reparameterize,
assume that $\mathbf{X}$ has full column rank. It is important to note that the design matrices $\mathbf{X}_k$
for the clusters need not have full column rank. For example, incomplete block designs can
be considered. To distinguish this general mixed model from the linear model of Chapter 3,
in this chapter we call the model of Chapter 3 the independent error, or independent case, model.
This general mixed model often occurs in the applied sciences. Examples include data
from clinical designs carried out over centers, repeated measures type data on individuals,
data from randomized block designs, and clusters of correlated data. As in Chapters 3 and
4, for inference the primary focus concerns the regression coecients (xed eects), but
the dependent structure must be taken into account in order to obtain valid inference for
the fixed effects. Liang and Zeger (1986) discuss these models in some detail, developing a
weighted LS inference for them.
The fixed effects part of the model is, of course, the linear model of Chapters 3 and 4. So
in this section we proceed to discuss the R fit developed in Chapter 3 for Model (5.1.1). As
we show, the asymptotic variance of the R estimator is a function of the dependence structure
in the model.
Let $\varphi(u)$, $0 < u < 1$, be a specified score function which satisfies Assumption (S.1) of
Section 3.4. Then, as in Chapter 3, the R estimator of $\beta$ is
$$ \widehat{\beta}_\varphi = \mathop{\rm Argmin} \|Y - X\beta\|_\varphi. \quad\quad (5.1.3) $$
For Model (5.1.1), properties of this estimator were developed by Kloke, McKean and Rashid
(2009). They refer to it as the JR estimator (for joint ranking); however, we use the terminology
of Chapter 3 and call it an R estimator. As in Chapter 3, equivalently, $\widehat{\beta}_\varphi$ is a solution to
$S_\varphi(Y - X\beta) = 0$, where $S_\varphi(Y - X\beta)$ is the negative of the gradient of $\|Y - X\beta\|_\varphi$ given in
(3.2.12). Once $\beta$ is estimated, we estimate the intercept by the median of the residuals;
that is,
$$ \widehat{\alpha}_S = \mathop{\rm med}_{kj} \{ y_{kj} - x'_{kj} \widehat{\beta}_\varphi \}. \quad\quad (5.1.4) $$
As in Chapter 3, both estimators are regression and scale equivariant.
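As a computational aside, the estimates (5.1.3) and (5.1.4) can be obtained by directly minimizing Jaeckel's dispersion function. The sketch below assumes Wilcoxon scores and a centered design matrix; it is illustrative only (in practice one would use a dedicated package such as Rfit, and the function names here are ours).

```r
# A minimal sketch of the R fit (5.1.3)-(5.1.4) with Wilcoxon scores.
# General-purpose optimization stands in for the specialized algorithms
# used in practice.
wil.disp <- function(beta, y, X) {
  e <- y - X %*% beta
  a <- sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)  # standardized Wilcoxon scores
  sum(a * e)                                         # Jaeckel's dispersion ||Y - X beta||
}
rfit.sketch <- function(y, X) {
  beta0 <- coef(lm(y ~ X - 1))                  # LS starting value
  beta  <- optim(beta0, wil.disp, y = y, X = X)$par
  alpha <- median(y - X %*% beta)               # intercept estimate (5.1.4)
  list(alphahat = alpha, betahat = beta)
}
```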
For the theory discussed next, certain conditions are needed. Assume that the random
vectors $e_1, e_2, \ldots, e_m$ are independent; i.e., the responses drawn from different blocks or
clusters are independent. Assume further that the univariate marginal distributions of $e_k$
are the same for all $k$. As discussed at the end of this section (see Subsection 5.1.1), this holds
for many models of practical interest; however, in Section 5.4, we do discuss more general
rank-based estimators which do not require this assumption. Let $F(x)$ and $f(x)$ denote the
common univariate marginal distribution function and density function. Assume that $f(x)$
follows Assumption (E.1) of Section 3.4 and that the usual regularity (likelihood) conditions
hold; see, for example, Section 6.5 of Lehmann and Casella (1998). For the design matrix
$X$, assume that Huber's condition (D.2) of Section 3.4 holds. As with the asymptotic theory
for the traditional estimates (see, e.g., Liang and Zeger, 1986), assume that the number of
clusters goes to $\infty$, i.e., $m \to \infty$, and that $n_k \leq M$, for all $k$, for some constant $M$.
Because of the invariances, without loss of generality, assume that the true regression
parameters are zero in Model (5.1.1). As in Chapter 3, asymptotic theory for the fixed
effects estimator involves establishing the distribution of the gradient and the asymptotic
quadraticity of the dispersion function.
Consider Model (5.1.1) and assume the above conditions. It then follows from Brunner
and Denker (1994) that the projection of the gradient $S_\varphi(Y - X\beta)$ is the random vector
$X' \varphi[F(Y - X\beta)]$, where $\varphi[F(Y - X\beta)] = (\varphi[F(Y_{11} - x'_{11}\beta)], \ldots, \varphi[F(Y_{mn_m} - x'_{mn_m}\beta)])'$.
We need to assume that the covariance structure of this projection is asymptotically stable;
that is, the following limit exists and is positive definite:
$$ \Sigma_\varphi = \lim_{m \to \infty} n^{-1} \sum_{k=1}^m X'_k \Sigma_{\varphi,k} X_k, \quad\quad (5.1.5) $$
where $\Sigma_{\varphi,k} = {\rm Cov}\{\varphi[F(e_k)]\}$. (In likelihood methods, a similar assumption is made on the
covariance structure of the errors.)
As discussed by Kloke et al. (2009), under these assumptions, it follows from Theorem
3.1 of Brunner and Denker (1994) that
$$ \frac{1}{\sqrt{n}}\, S_X(0) \stackrel{D}{\to} N_p(0, \Sigma_\varphi), \quad\quad (5.1.6) $$
where $\Sigma_\varphi$ is defined in expression (5.1.5). The linearity and quadraticity results obtained
in Chapter 3 for the linear model can be extended to our model. The linearity result is
$S_X(\beta) = S_X(0) - \tau_\varphi^{-1} X'X\beta + o_p(\sqrt{n})$, uniformly for $\sqrt{n}\|\beta\|_2 \leq c$, for $c > 0$, where $\tau_\varphi$ is
the same scale parameter as in Chapter 3; i.e., defined in expression (3.4.4). From this we
obtain the asymptotic representation of the R estimator given by
$$ \sqrt{n}\, \widehat{\beta}_\varphi = \tau_\varphi \sqrt{n}\, (X'X)^{-1} X' \varphi[F(e)] + o_p(1). \quad\quad (5.1.7) $$
Based on (5.1.6) and (5.1.7), it follows that the distribution of $\widehat{\beta}_\varphi$ is approximately normal
with mean $\beta$ and covariance matrix
$$ V_\varphi = \tau_\varphi^2 (X'X)^{-1} \left[ \sum_{k=1}^m X'_k \Sigma_{\varphi,k} X_k \right] (X'X)^{-1}. \quad\quad (5.1.8) $$
Letting $\tau_S = 1/[2f(0)]$, $\widehat{\alpha}_S$ is approximately normal with mean $\alpha$ and variance
$$ \sigma^2_1(0) = \tau_S^2\, n^{-1} \sum_{k=1}^m \left[ \sum_{j=1}^{n_k} {\rm var}({\rm sgn}(e_{kj})) + \sum_{j \neq j'} {\rm cov}({\rm sgn}(e_{kj}), {\rm sgn}(e_{kj'})) \right]. \quad\quad (5.1.9) $$
In this section, we have kept the model general; i.e., we have not specified the covariance
structure. To conduct inference, we need an estimate of the covariance matrix of $\widehat{\beta}_\varphi$. Define
the residuals of the R fit by
$$ \widehat{e}_R = Y - \widehat{\alpha}_S 1_n - X \widehat{\beta}_\varphi. \quad\quad (5.1.10) $$
Using these residuals, we estimate the parameter $\tau_\varphi$ as discussed in Section 3.7.1. Next, a
nonparametric estimate of $\Sigma_{\varphi,k}$, (5.1.5), is obtained by replacing the distribution function
$F(t)$ in its definition by the empirical distribution function of the residuals. Based on these
results, for a specified vector $h \in R^p$, an approximate $(1-\alpha)100\%$ confidence interval for
$h'\beta$ is given by
$$ h'\widehat{\beta}_\varphi \pm z_{\alpha/2} \sqrt{h' \widehat{V}_\varphi h}. \quad\quad (5.1.11) $$
Consider general linear hypotheses of the form $H_0: M\beta = 0$ versus $H_A: M\beta \neq 0$,
where $M$ is a $q \times p$ matrix of rank $q$. We offer two test statistics. First, the asymptotic
distribution of $\widehat{\beta}_\varphi$ suggests a Wald type test of $H_0$ based on the test statistic
$$ T_{W,\varphi} = (M\widehat{\beta}_\varphi)^T [M \widehat{V}_\varphi M^T]^{-1} (M\widehat{\beta}_\varphi). \quad\quad (5.1.12) $$
Under $H_0$, $T_{W,\varphi}$ has an asymptotic $\chi^2_q$ distribution with $q$ degrees of freedom. Hence, a
nominal level $\alpha$ test is to reject $H_0$ if $T_{W,\varphi} \geq \chi^2_\alpha(q)$. As in the independent error case, this
test is consistent for all alternatives of the form $M\beta \neq 0$. For efficiency results, consider
a sequence of local alternatives of the form $H_{An}: \beta_n = \theta/\sqrt{n}$, where $M\theta \neq 0$. Under this
sequence of alternatives, $T_{W,\varphi}$ has an asymptotic noncentral $\chi^2_q$-distribution with noncentrality
parameter
$$ \eta = (M\theta)^T [M V_\varphi M^T]^{-1} M\theta. \quad\quad (5.1.13) $$
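Given the estimate and its estimated covariance matrix from (5.1.8), computing the Wald statistic (5.1.12) and its asymptotic chi-square $p$-value takes only a few lines of R; in this sketch the object names betahat and Vhat are ours.

```r
# T_W,phi of (5.1.12) with its asymptotic chi-square(q) p-value.
wald.test <- function(M, betahat, Vhat) {
  Mb <- M %*% betahat
  TW <- drop(t(Mb) %*% solve(M %*% Vhat %*% t(M)) %*% Mb)
  c(TW = TW, p.value = 1 - pchisq(TW, df = nrow(M)))
}
```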
A second test utilizes the reduction in dispersion, $RD_\varphi = D({\rm Red}) - D({\rm Full})$, where
$D({\rm Full})$ and $D({\rm Red})$ are respectively the minimum values of the dispersion function under
the full and reduced (full model constrained by $H_0$) models. The asymptotically correct
standardization depends on the dependence structure of the errors; see Exercises 5.6.5 and
5.6.6 for discussion of this test and also of the aligned rank test of Chapter 3.
Our discussion has been for general scores. If we have knowledge of the distribution of
the errors, then we can optimize the analysis by selecting a suitable score function. From
expression (5.1.8), although the dependence structure appears in the approximate covariance
of $\widehat{\beta}_\varphi$, as in Chapters 2 and 3 the constant of proportionality is $\tau_\varphi$. Hence, the discussion in
Chapters 2 and 3 concerning score selection based on minimizing $\tau_\varphi$ is still pertinent for the
rank-based analysis of this section. Example 5.2.1 of the next section illustrates such score
selection.
If the score function is bounded, then based on their asymptotic representation, (5.1.7),
these R estimators have bounded influence in response space but not in factor space. However,
for outliers in factor space, the high breakdown HBR estimators, (3.12.2), can be
extended in the same way as the R estimates.
5.1.1 Applications

In many applications the form of the covariance structure of the random vector of errors $e_k$
of Model (5.1.1) is known. This can result in a simplified asymptotic covariance structure for
$\widehat{\beta}_\varphi$. We discuss several such cases in the next few sections. In Section 5.2, we consider a
simple mixed model with block as a random effect. Here, besides an estimate of $\tau_\varphi$, only
one additional covariance parameter is required to estimate $V_\varphi$. In Section 5.3.1, we discuss
a transformed procedure for a simple mixed model, provided the block design matrices,
$X_k$'s, have full column rank. Another rich class of such models is the repeated measures
designs, where block is synonymous with subject. Two common types of covariance structure
for these designs are: (i) the covariance of the errors for a subject has compound symmetric
structure, i.e., a simple random effect model, or (ii) the errors follow a stationary time
series model, for instance an autoregressive model. For Case (ii), the univariate marginals
would have the same distribution and, hence, the above assumptions hold for our rank-based
estimates. Using the residuals from the rank-based fit, R estimators of the autoregressive
parameters of the error distribution can be obtained. These estimates could then be used
in the usual way to transform the observations, and then a second (generalized) R estimate
could be obtained based on these transformed observations; see Exercise 5.6.7 for details.
This is a robust analogue of the two-stage estimation procedure discussed for cluster samples
in Rao, Sutradhar and Yue (1993). Generalized R estimators based on transformations are
discussed in Sections 5.3 and 5.4.
5.2 Simple Mixed Models

In this section, we discuss a simple mixed model with block or cluster as a random effect.
Consider Model (5.1.1), but for each block $k$, model the error vector $e_k$ as $e_k = 1_{n_k} b_k + \varepsilon_k$,
where the components of $\varepsilon_k$ are independent and identically distributed and $b_k$ is a continuous
random variable which is independent of $\varepsilon_k$. Hence, we write the model as
$$ Y_k = \alpha 1_{n_k} + X_k \beta + 1_{n_k} b_k + \varepsilon_k, \quad k = 1, \ldots, m. \quad\quad (5.2.1) $$
Assume that the random effects $b_1, \ldots, b_m$ are independent and identically distributed random
variables. It follows that the distribution of $e_k$ is exchangeable. In particular, all
marginal distributions of $e_k$ are the same; so, the theory of Section 5.1 holds. This family
of models contains the randomized block designs, but, as in Section 5.1, the blocks can be
incomplete.
For this model, the asymptotic variance-covariance matrix of $\widehat{\beta}_\varphi$, (5.1.8), simplifies to
$$ \tau_\varphi^2 (X'X)^{-1} \sum_{k=1}^m X'_k \Sigma_{\varphi,k} X_k\, (X'X)^{-1}, \quad
\Sigma_{\varphi,k} = (1 - \rho_\varphi) I_{n_k} + \rho_\varphi J_{n_k}, \quad\quad (5.2.2) $$
where $\rho_\varphi = {\rm cov}\{\varphi[F(e_{11})], \varphi[F(e_{12})]\} = E\{\varphi[F(e_{11})]\,\varphi[F(e_{12})]\}$. Also, the asymptotic
variance of the intercept, (5.1.9), simplifies to $n^{-1}\tau_S^2(1 + n^*\rho_S)$, for $\rho_S = {\rm cov}[{\rm sgn}(e_{11}), {\rm sgn}(e_{12})]$
and $n^* = n^{-1}\sum_{k=1}^m n_k(n_k - 1)$. As with LS, for positive definiteness, we need to assume that
each of $\rho_\varphi$ and $\rho_S$ exceeds $-\max_k\{1/(n_k - 1)\}$. Let $M = \sum_{k=1}^m \binom{n_k}{2} - p$ (the subtraction
of $p$, the dimension of the vector $\beta$, is a degrees of freedom correction). A simple moment
estimator of $\rho_\varphi$ is
$$ \widehat{\rho}_\varphi = M^{-1} \sum_{k=1}^m \sum_{i > j} a[R(\widehat{e}_{ki})]\, a[R(\widehat{e}_{kj})]. \quad\quad (5.2.3) $$
Plugging this into (5.2.2) and using the estimate of $\tau_\varphi$ discussed earlier, we have an estimate
of the asymptotic covariance matrix of the R estimators.
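The moment estimator (5.2.3) is simple to compute from the residuals of the R fit. The sketch below assumes Wilcoxon scores; ehat (our name) is a list containing the residual vector of each block, and the scores $a[\cdot]$ are evaluated at the overall ranks.

```r
# A sketch of rho-hat_phi of (5.2.3); 'p' is the number of regression
# coefficients.
rho.phi.hat <- function(ehat, p) {
  e <- unlist(ehat)
  a <- sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)      # a[R(e_kj)]
  ablk <- split(a, rep(seq_along(ehat), sapply(ehat, length)))
  M <- sum(sapply(ehat, function(u) choose(length(u), 2))) - p
  # within a block, sum_{i>j} a_i a_j = ((sum a)^2 - sum a^2)/2
  sum(sapply(ablk, function(s) (sum(s)^2 - sum(s^2)) / 2)) / M
}
```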
For the general mixed model (5.1.1) of Section 5.1, the AREs for the rank-based procedures
are difficult to obtain; however, for the simple mixed model, (5.2.1), the ARE can be
obtained in closed form provided the design is centered within each block; see Kloke et al.
(2009). The reader is asked to show in Exercise 5.6.2 that for Wilcoxon scores, this ARE is
$$ {\rm ARE}(F_{W,\varphi}, F_{LS}) = [(1 - \rho)/(1 - \rho_\varphi)]\; 12\sigma^2 \left[ \int f^2(t)\, dt \right]^2, \quad\quad (5.2.4) $$
where $\rho_\varphi$ is defined under expression (5.2.2) and $\rho$ is the correlation coefficient within a
block. If the random vectors in a block follow the multivariate normal distribution, then this
ARE lies in the interval $[0.8660, 0.9549]$ when $0 < \rho < 1$. The lower bound is attained as
$\rho \to 1$. The upper bound is attained when $\rho = 0$ (the independent case), which is the usual
high efficiency of the Wilcoxon to LS at the normal distribution. When $-1 < \rho < 0$, this
ARE lies in $[0.9549, 0.9662]$; the upper bound is attained when $\rho = -0.52$ and the lower
bound is attained as $\rho \to -1$. Generally, the high efficiency properties of the Wilcoxon
analysis relative to the LS analysis in the independent errors case extend to the Wilcoxon analysis for
this mixed model design. See Kloke et al. (2009) for details.
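These ARE values are easy to verify numerically. The following sketch approximates $\rho_\varphi$ by Monte Carlo for an exchangeable standard normal pair with nonnegative correlation $\rho$ and evaluates (5.2.4); at the standard normal, $12\sigma^2[\int f^2(t)\,dt]^2 = 3/\pi \doteq 0.9549$.

```r
# A Monte Carlo check (a sketch) of (5.2.4) at the normal, Wilcoxon scores,
# for 0 <= rho < 1.
are.normal <- function(rho, nsim = 2e5) {
  b   <- rnorm(nsim)
  eps <- matrix(rnorm(2 * nsim), ncol = 2)
  e   <- sqrt(rho) * b + sqrt(1 - rho) * eps  # standard normal pair, corr rho
  u   <- pnorm(e)                             # F(e_11), F(e_12)
  rho.phi <- 12 * cov(u[, 1], u[, 2])         # phi(u) = sqrt(12)(u - 1/2)
  (1 - rho) / (1 - rho.phi) * 3 / pi
}
sapply(c(0, 0.3, 0.6, 0.9), are.normal)  # decreases from 0.9549 toward 0.8660
```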
5.2.1 Variance Component Estimators

In this section, we assume that the variances of the errors exist. Let $\Sigma_{e_k}$ denote the variance-covariance
matrix of $e_k$. Under the model of this section, the variance-covariance matrix of
$e_k$ is compound symmetric, having the form $\Sigma_{e_k} = \sigma^2 A_k(\rho) = \sigma^2[(1-\rho)I_{n_k} + \rho J_{n_k}]$, where
$\sigma^2 = {\rm Var}(e_{ki})$, $I_{n_k}$ is the identity matrix of order $n_k$, and $J_{n_k}$ is a $n_k \times n_k$ matrix of ones.
Letting $\sigma^2_b$ and $\sigma^2_\varepsilon$ denote respectively the variances of the random effect $b_k$ and the error
$\varepsilon_{kj}$, the total variance is given by $\sigma^2 = \sigma^2_\varepsilon + \sigma^2_b$. The intraclass correlation coefficient is
$\rho = \sigma^2_b/(\sigma^2_\varepsilon + \sigma^2_b)$. These parameters, $(\sigma^2_\varepsilon, \sigma^2_b, \sigma^2)$, are referred to as the variance components.
To estimate these variance components, we use the estimates discussed in Kloke et al.
(2009); see also Rashid and Nandram (1998) and Gerand and Schucany (2007). In block
$k$, rewrite model (5.2.1) as $y_{kj} - [\alpha + x'_{kj}\beta] = b_k + \varepsilon_{kj}$, $j = 1, \ldots, n_k$. The left side of this
expression is estimated by the residual
$$ \widehat{e}_{R,kj} = y_{kj} - [\widehat{\alpha} + x'_{kj}\widehat{\beta}_\varphi], \quad k = 1, \ldots, m;\ j = 1, \ldots, n_k. \quad\quad (5.2.5) $$
Hence, a predictor (estimate) of $b_k$ is given by $\widehat{b}_k = {\rm med}_{1 \leq j \leq n_k}\{\widehat{e}_{R,kj}\}$, and a robust
estimator of the variance of $b_k$ is MAD, (3.9.27); that is,
$$ \widehat{\sigma}^2_b = [{\rm MAD}_{1 \leq k \leq m}(\widehat{b}_k)]^2 = \left[ 1.483\, {\rm med}_{1 \leq k \leq m} |\widehat{b}_k - {\rm med}_{1 \leq j \leq m}\widehat{b}_j| \right]^2. \quad\quad (5.2.6) $$
In this simple mixed model, the residuals $\widehat{e}_{R,kj}$, (5.2.5), are often called the marginal residuals.
In addition, though, we have the conditional residuals for the errors $\varepsilon_{kj}$, which are
defined by
$$ \widehat{\varepsilon}_{kj} = \widehat{e}_{R,kj} - \widehat{b}_k, \quad j = 1, \ldots, n_k,\ k = 1, \ldots, m. \quad\quad (5.2.7) $$
A robust estimate of $\sigma^2_\varepsilon$ is then
$$ \widehat{\sigma}^2_\varepsilon = [{\rm MAD}_{1 \leq j \leq n_k,\, 1 \leq k \leq m}(\widehat{\varepsilon}_{kj})]^2. \quad\quad (5.2.8) $$
Hence, robust estimates of the total variance $\sigma^2$ and the intraclass correlation coefficient are
$$ \widehat{\sigma}^2 = \widehat{\sigma}^2_\varepsilon + \widehat{\sigma}^2_b \quad \mbox{and} \quad \widehat{\rho} = \widehat{\sigma}^2_b / \widehat{\sigma}^2. \quad\quad (5.2.9) $$
Thus, our robust estimates of the variance components are given in expressions (5.2.6),
(5.2.8), and (5.2.9).
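In code, the chain (5.2.5)-(5.2.9) amounts to medians and MADs; R's mad function uses the constant 1.4826. A sketch, with ehat (our name) a list of the marginal residual vectors by block:

```r
# A sketch of the robust variance component estimates (5.2.6)-(5.2.9).
vc.est <- function(ehat) {
  bhat <- sapply(ehat, median)                # predictors b-hat_k
  sig2.b <- mad(bhat)^2                       # (5.2.6)
  epshat <- unlist(mapply(function(e, b) e - b, ehat, bhat,
                          SIMPLIFY = FALSE))  # conditional residuals (5.2.7)
  sig2.eps <- mad(epshat)^2                   # (5.2.8)
  sig2 <- sig2.eps + sig2.b                   # (5.2.9)
  c(sig2.b = sig2.b, sig2.eps = sig2.eps, sig2 = sig2, rho = sig2.b / sig2)
}
```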
5.2.2 Studentized Residuals

In Chapter 3, we presented Studentized residuals for R and HBR fits. These residuals are
fundamental for diagnostic analyses of linear models. They correct for both the model (factor
space) and the underlying covariance structure and allow for a simple benchmark rule for
designating potential outliers. In this section, we present Studentized residuals based on the
R fit of the simple mixed model, (5.2.1). Because the marginal residuals $\widehat{e}_{R,kj}$, (5.2.5), are
used to check the quality of fit, these are the appropriate residuals for standardizing.
Because the block sample sizes $n_k$ are not necessarily the same, some additional notation
simplifies the presentation. Let $\theta_1$ and $\theta_2$ be two parameters and define the block-diagonal
matrix $B(\theta_1, \theta_2) = {\rm diag}\{B_1(\theta_1, \theta_2), \ldots, B_m(\theta_1, \theta_2)\}$, where $B_k(\theta_1, \theta_2) = (\theta_1 - \theta_2)I_{n_k} + \theta_2 J_{n_k}$,
$k = 1, \ldots, m$. Hence, for Model (5.2.1), we can write ${\rm Var}(e) = \sigma^2 B(1, \rho)$.
Using the asymptotic representation for $\widehat{\beta}_\varphi$ given in expression (5.1.7), a tedious calculation,
similar to that in Section 3.9.2, shows that the approximate covariance matrix of $\widehat{e}_R$
is given by
$$ \begin{array}{rcl}
C_R & = & \sigma^2 B(1, \rho) + \dfrac{\tau_S^2}{n^2}\, J_n B(1, \rho_S) J_n + \tau_\varphi^2\, H_c B(1, \rho_\varphi) H_c
 - \dfrac{\tau_S}{n}\, B(\delta_{11}, \delta_{12}) J_n - \tau_\varphi\, B(\gamma_{11}, \gamma_{12}) H_c \\[1ex]
 & & -\ \dfrac{\tau_S}{n}\, J_n B(\delta_{11}, \delta_{12}) + \dfrac{\tau_S \tau_\varphi}{n}\, J_n B(\kappa_{11}, \kappa_{12}) H_c
 - \tau_\varphi\, H_c B(\gamma_{11}, \gamma_{12}) + \dfrac{\tau_\varphi \tau_S}{n}\, H_c B(\kappa_{11}, \kappa_{12}) J_n,
\end{array} \quad (5.2.10) $$
where $H_c$ is the projection matrix onto the column space of the centered design matrix $X_c$,
$J_n$ is the $n \times n$ matrix of all ones, and
$$ \begin{array}{ll}
\delta_{11} = E[e_{11}\,{\rm sgn}(e_{11})], & \delta_{12} = E[e_{11}\,{\rm sgn}(e_{12})], \\
\gamma_{11} = E[e_{11}\,\varphi(F(e_{11}))], & \gamma_{12} = E[e_{11}\,\varphi(F(e_{12}))], \\
\kappa_{11} = E[{\rm sgn}(e_{11})\,\varphi(F(e_{11}))], & \kappa_{12} = E[{\rm sgn}(e_{11})\,\varphi(F(e_{12}))],
\end{array} $$
and the scale and correlation parameters $\tau_\varphi$, $\tau_S$, $\rho_\varphi$, and $\rho_S$ are as defined in Sections 5.1 and 5.2.
To compute the Studentized residuals, estimates of the parameters in $C_R$, (5.2.10), are
required. First, consider the matrix $\sigma^2 B(1, \rho)$. In Section 5.2.1, we obtained the robust estimators
$\widehat{\sigma}^2$ and $\widehat{\rho}$ given in expression (5.2.9). Substituting these estimators for $\sigma^2$ and $\rho$ into
$\sigma^2 B(1, \rho)$, we have the robust estimator $\widehat{\sigma}^2 B(1, \widehat{\rho})$. Expression (5.2.3)
gives a simple moment estimator of $\rho_\varphi$. The parameters $\rho_S$, $\delta_{11}$, $\delta_{12}$, $\gamma_{11}$, $\gamma_{12}$, $\kappa_{11}$, and $\kappa_{12}$
can be estimated in the same way. Substituting these estimators into the matrix $C_R$, let
$\widehat{C}_R$ denote the resulting estimator.
For $t = 1, \ldots, n$, let $\widehat{c}_{tt}$ denote the $t$th diagonal entry of the matrix $\widehat{C}_R$. Then the $t$th
Studentized marginal residual based on the R fit is
$$ e^*_{R,t} = \widehat{e}_{R,t} / \sqrt{\widehat{c}_{tt}}. \quad\quad (5.2.11) $$
As in Chapter 3, the traditional benchmarks used with these Studentized residuals are the
limits $\pm 2$.
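Given the fitted marginal residuals and the estimate $\widehat{C}_R$, the computation of (5.2.11) and the benchmark rule is immediate; a two-line sketch (the object names ehat.R and C.R.hat are ours):

```r
# Studentized marginal residuals (5.2.11) and the +/-2 benchmark rule.
estar   <- ehat.R / sqrt(diag(C.R.hat))
flagged <- which(abs(estar) > 2)   # potential outliers
```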
5.2.3 Example and Simulation Studies

In this section we present an example of a randomized block design. It consists of only two
blocks, so we also summarize simulation studies which confirm the validity of the rank-based
analysis. For the example and the simulation studies, we computed the rank-based analysis
using Rfit, a collection of R functions for rank-based fitting. By the traditional fit, we mean the
maximum likelihood fit based on multivariate normality of the error random vectors. This fit
and subsequent analysis were obtained using the R function lme, as discussed in Pinheiro and
Bates (2000).
Example 5.2.1 (Crab Grass Data). Cobb (1998) presented an example of a complete
block design concerning the weight of crab grass. Much of our discussion is drawn from Kloke
et al. (2009). There are four fixed factors in the experiment: the density of the crabgrass
at four levels, the nitrogen content of the crabgrass at two levels, the phosphorus content
of the crabgrass at two levels, and the potassium content of the crabgrass at two levels.
Two complete blocks of the experiment were carried out, so altogether there are $n = 64$
observations. Here block is a random factor and we assume the simple mixed model, (5.2.1),
of this section. Under each set of experimental conditions, crab grass was grown in a cup.
The response is the dry weight of a unit (cup) of crab grass, in milligrams. The data are
presented in Table A.0.1 of Appendix B.
We consider the rank-based analysis of this section based on Wilcoxon scores. For the
main effects model, Table 5.2.1 displays the estimated effects (contrasts) and standard errors
for the Wilcoxon and traditional analyses. For the nutrients, these effects are the differences
between the high and low levels, while for the factor density the three contrasts reference the
highest density level. There are major differences between the Wilcoxon and the traditional
estimates. For the Wilcoxon estimates, the nutrients nitrogen and phosphorus are significant
and the contrast between the low and high levels of density is highly significant. Nitrogen
is the only significant effect for the traditional analysis. The Wilcoxon statistic to test the
density effects has the value $T_{W,\varphi} = 20.55$ with $p = 0.002$, while the traditional test statistic
is $F_{lme} = 0.82$ with $p = 0.490$. The robust estimates of the variance components are
$\widehat{\sigma}^2 = 206.33$, $\widehat{\sigma}^2_b = 20.28$, and $\widehat{\rho} = 0.098$.
An outlier accounts for much of this dramatic difference between the robust and traditional
analyses. Originally, one of the responses was mistyped; instead of the correct value
97.25, the response was typed as 972.5. As Cobb (1998) notes, this outlier was more difficult
to spot in the original units. Upon replacing the outlier with its correct value, the Wilcoxon
and traditional analyses are similar, although the Wilcoxon analysis is still more precise;
see the discussion below on the other outliers in this data set. This is true too of the test for
the factor density: $T_{W,\varphi} = 23.23$ ($p = 0.001$) and $F_{lme} = 6.33$ with $p = 0.001$. The robust
estimates of the variance components are $\widehat{\sigma}^2 = 209.20$, $\widehat{\sigma}^2_b = 20.45$, and $\widehat{\rho} = 0.098$. These
Table 5.2.1: Wilcoxon and Traditional Estimates and SEs of Effects for the Crabgrass Data.

                Wilcoxon        Traditional
Contrast      Est.     SE      Est.     SE
Nit          39.90    4.08    69.76    28.7
Pho          10.95    4.08    11.52    28.7
Pot           1.60    4.08    28.04    28.7
D34           3.26    5.76    57.74    40.6
D24           7.95    5.76     8.36    40.6
D14          24.05    5.76    31.90    40.6
are essentially unchanged from their values on the original data. If, on the original data, the
experimenter had run the robust fit and compared it with the traditional fit, then the outlier
would have been discovered immediately.
Figure 5.2.1 contains the Wilcoxon Studentized residual plot and qq plot for the original
data. We have removed the large outlier from the plots, so that we can focus on the remaining
data. The vacant middle in the residual plot is an indication that interaction may be
present. For the hypothesis of interaction between the nutrients, the value of the Wald type
test statistic is $T_{W,\varphi} = 30.61$, with $p = 0.000$. Hence, the R analysis strongly confirms that
interaction is present. On the other hand, the traditional likelihood ratio test statistic for this
interaction is 2.92, with $p = 0.404$. In the presence of interaction, many statisticians would
consider interaction contrasts instead of a main effects analysis. Hence, for such statisticians,
the robust and traditional analyses would have different practical interpretations.
5.2.4 Simulation Studies of Validity

In this data set, the number of blocks is two. Hence, to answer questions concerning the
validity of the Wilcoxon analysis, Kloke et al. (2009) conducted a small simulation study.
Table 5.2.2 summarizes the empirical confidences and AREs of this study for two situations:
normal errors and contaminated normal errors (20% contamination and the ratio of the
contaminated variance to the uncontaminated variance at 25). For each situation, the same
randomized block design as in the Crab Grass example was used, with the correlation
structure as estimated by the Wilcoxon analysis. The empirical confidences of the asymptotic
95% confidence intervals were recorded. These intervals are of the form Estimate $\pm\, 1.96\,$SE,
where SE denotes the standard error of the estimate. The number of simulations was 10,000
for each situation; therefore, the error in the table, based on the usual 95% confidence interval
for a proportion, is 0.004. The empirical confidences for the Wilcoxon are quite good, with
the target of 0.95 usually within range of error. They were perhaps a little conservative at
the contaminated normal situation. Hence, the Wilcoxon analysis appears to be valid
for this design. The intervals based on the traditional fit are slightly liberal. The empirical
AREs between the two estimators displayed in Table 5.2.2 are the ratios of the empirical mean
squared errors of the two estimators. As the table shows, the traditional fit is more efficient
at the normal, but the efficiencies are close to the value 0.95 of the independent error case.
The Wilcoxon analysis is much more efficient over the contaminated normal situation.

[Figure 5.2.1: Studentized residual plot (Studentized Wilcoxon residuals versus the Wilcoxon fit) and normal qq plot, with the large outlier deleted.]
Does this rank-based analysis differ from the independent error analysis of Chapter 3?
As a tentative answer to this question, Kloke et al. (2009) ran 10,000 simulations using the
model for the Crab Grass example. Wilcoxon scores were used for both analyses. To avoid
confusion, call the analysis of Chapter 3 the IR analysis (I for independent errors) and the
analysis of this section the R analysis. They considered normal error distributions, setting
the variance components at the values of the robust estimates. Because the R and IR fits
are the same, they considered the differences in their inferences for the six effects listed in
Table 5.2.1. For 95% nominal confidence, the average empirical confidences over these six
Table 5.2.2: Validity of Inference (Empirical Confidence Sizes and AREs)

                Normal Errors            Contaminated Normal Errors
Contrast   Wilc.   Traditional   ARE    Wilc.   Traditional   ARE
Nit        0.948      0.932     0.938   0.964      0.933      7.73
Pho        0.953      0.934     0.941   0.964      0.930      7.82
Pot        0.948      0.927     0.940   0.966      0.934      7.72
D34        0.950      0.929     0.936   0.964      0.931      7.75
D24        0.951      0.929     0.943   0.960      0.931      7.57
D14        0.952      0.930     0.944   0.960      0.929      7.92
contrasts are 95.32% and 96.12%, respectively, for the R and IR procedures. Hence, both
procedures appear valid. For a measure of efficiency, they averaged, across the contrasts, the
averages of the squared lengths of the confidence intervals. The ratio of the R to the IR averages
is 0.914; hence, for this simulation, the R inference is about 9% more efficient than the IR
inference. Similar results for the traditional analyses are reported in Rao et al. (1993).
5.2.5 Simulation Study of Other Score Functions

Besides the large outlier, there are six other potential outliers in the Cobb data. This
quantity of outliers suggests the use of score functions which are preferable to
the Wilcoxon score function for very heavy-tailed error structure. To investigate this,
we turned to the family of Winsorized Wilcoxon score functions. Recall that this family
was discussed for skewed data in Example 2.5.1. Here, though, asymmetry does not
appear to be warranted. We selected the score function which is linear over the interval
$(0.2, 0.8)$, i.e., 20% Winsorizing on both sides. We denote it by WW$_2$. For the parameters
as in Table 5.2.1, the WW$_2$ estimates and standard errors (in parentheses) are:
39.16 (3.78), 10.13 (3.78), 2.26 (3.78), 2.55 (5.35), 7.68 (5.35), and 23.28 (5.35). The estimate
of the scale parameter $\tau_\varphi$ is 14.97, compared to the Wilcoxon estimate of 15.56. This
indicates that an analysis based on the WW$_2$ fit has more precision than one based on the
Wilcoxon fit.
To investigate this gain in precision, we ran a small simulation study. We used the same
model and the same correlation structure as estimated by the Wilcoxon fit. We considered
normal and contaminated normal errors, with the percent of contamination at 20% and the
relative variance of the contaminated part at 25. For each situation 10,000 simulations were
run. The AREs were very similar for all six parameters, so we only report their averages.
For the normal situation the average ARE between the WW$_2$ and Wilcoxon estimates was
0.90; hence, the WW$_2$ estimate was 10% less efficient at the normal situation. For the
contaminated normal situation, though, this average was 1.21; hence, the WW$_2$ estimate
was 21% more efficient than the Wilcoxon estimate for the contaminated normal situation.
There are families of score functions besides the Winsorized Wilcoxon scores. Gastwirth
(1966) presents several families of score functions appropriate for classes of distributions with
tails heavier than the exponential distribution. For certain cases, he selects a score based on
a maxi-min strategy.
5.3 Rank-Based Procedures Based on Arnold Transformations

In this section, we apply a linear transformation to the mixed model, (5.1.1), and then obtain
the R fits. We begin with a brief but necessary discussion of the intercept parameter.
Write the mixed model in the long form (5.1.2), $Y = \alpha 1_n + X\beta + e$. Suppose the
transformation matrix is $A$. Multiplying both sides of the model by $A$, the transformed
model is of the form
$$ Y^* = X^* b + e^*, \quad\quad (5.3.1) $$
where $v^*$ denotes the vector $Av$ and the vector of parameters is $b = (\alpha, \beta')'$. While the
original model has an intercept parameter, in general, the transformed model does not. As
discussed in Exercise 3.16.39 of Chapter 3, the R fit of Model (5.3.1) is actually the R fit of
the model $Y^* = \widetilde{X}^* b + e^*$, where $\widetilde{X}^* = (I - H_1)X^*$ and $H_1$ is the projection matrix onto
the space spanned by $1$; i.e., $\widetilde{X}^*$ is the centered design matrix based on $X^*$.
As proposed in Exercise 3.16.39, to obtain an R fit of Model (5.3.1), we use the following
algorithm:
(1.) Fit the model
$$ Y^* = \alpha_1 1 + \widetilde{X}^* b + e^*. \quad\quad (5.3.2) $$
By fit we mean: obtain the R estimate of $b$ and then estimate $\alpha_1$ by the median
of the residuals. Let $\widehat{Y}^*_1$ denote the R fit.
(2.) Project $\widehat{Y}^*_1$ to the right space; i.e., obtain
$$ \widehat{Y}^* = H_{X^*} \widehat{Y}^*_1. \quad\quad (5.3.3) $$
(3.) Solve $X^* b = \widehat{Y}^*$; i.e., our estimator is
$$ \widehat{b}^* = (X^{*\prime} X^*)^{-1} X^{*\prime} \widehat{Y}^*_1. \quad\quad (5.3.4) $$
As developed in Exercise 3.16.39, $\widehat{b}^*$ is asymptotically normal with the asymptotic representation
given by (3.16.11) and asymptotic variance given by (3.16.12). We use these results
in the remainder of this chapter.
5.3.1 R Fit Based on Arnold Transformed Data

As in the previous sections, consider an experiment done over $m$ blocks (clusters, centers),
and let $Y_k$ denote the vector of $n_k$ observations for the $k$th block, $k = 1, \ldots, m$. In this
section, we consider the simple mixed model of Section 5.2. Using the notation of expression
(5.2.1), $Y_k$ follows the model $Y_k = \alpha 1_{n_k} + X_k\beta + 1_{n_k} b_k + \varepsilon_k$, where $b_k$ is a random effect
and $\beta$ denotes the fixed effects of interest. As in Section 5.2, assume that the blocks are
independent and that $b_k$ and $\varepsilon_k$ are independent. Let $e_k = 1_{n_k} b_k + \varepsilon_k$. As in expression (5.1.2),
the long form of the model is useful, i.e., $Y = \alpha 1_n + X\beta + e$. Because there is an intercept
parameter in the model, we may assume that $X$ is centered. Let $n = \sum_{k=1}^m n_k$ denote the
total sample size. For this section, in addition we assume that for all blocks $X_k$ has full
column rank $p$.
If the variances of the error variables exist, denote them by ${\rm Var}[b_k] = \sigma^2_b$ and ${\rm Var}[\varepsilon_{kj}] = \sigma^2_\varepsilon$.
In this case, the variance-covariance structure for the $k$th block is compound symmetric, which
we denote as
$$ {\rm Var}[e_k] = \sigma^2 A_k(\rho) = \sigma^2[(1 - \rho)I_{n_k} + \rho J_{n_k}], \quad\quad (5.3.5) $$
where $\sigma^2 = \sigma^2_\varepsilon + \sigma^2_b$ and $\rho = \sigma^2_b/(\sigma^2_b + \sigma^2_\varepsilon)$.
Arnold Transformation

Arnold (1981, Chapters 14 and 15) discusses a Helmert transformation for these types of
models for traditional (least squares) analyses of balanced designs, i.e., all $n_k$'s the same.
Kloke and McKean (2010) generalized Arnold's results to unbalanced designs and developed
the properties of the R fit for the transformed data. Consider the $n_k \times n_k$ orthogonal matrix
$$ \Gamma_k = \left[ \begin{array}{c} \frac{1}{\sqrt{n_k}}\, 1'_{n_k} \\[1ex] C'_k \end{array} \right], \quad\quad (5.3.6) $$
where the columns of $C_k$ form an orthonormal basis for $1^\perp_{n_k}$ ($C'_k 1_{n_k} = 0$). We call $\Gamma_k$ an
Arnold transformation of size $n_k$.
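A sketch constructing $\Gamma_k$ for a given block size: the first row is $1'_{n_k}/\sqrt{n_k}$, and the remaining rows, $C'_k$, come from completing this vector to an orthonormal basis, here via a QR decomposition (the function name is ours).

```r
# A minimal sketch of the Arnold transformation (5.3.6).
arnold <- function(nk) {
  v1 <- rep(1, nk) / sqrt(nk)
  Q  <- qr.Q(qr(v1), complete = TRUE)  # orthogonal; first column = +/- v1
  Q[, 1] <- v1                         # fix the sign; columns 2..nk span 1-perp
  t(Q)                                 # rows: 1'/sqrt(nk), then C_k'
}
G <- arnold(4)
round(G %*% t(G), 12)                  # identity: Gamma_k is orthogonal
round(G[-1, ] %*% rep(1, 4), 12)       # C_k' 1_{n_k} = 0
```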
Now, apply an Arnold transformation of size $n_k$ to the response vector for the $k$th
block:
$$ Y^*_k = \Gamma_k Y_k = \left[ \begin{array}{c} Y^*_{k1} \\ Y^*_{k2} \end{array} \right], $$
where the mean component is $Y^*_{k1} = \alpha^* + b^*_k + \sqrt{n_k}\, \bar{x}'_k \beta + e^*_{k1}$, with $\alpha^* = \sqrt{n_k}\,\alpha$ and
$b^*_k = \sqrt{n_k}\, b_k$; the contrast component is $Y^*_{k2} = X^*_k \beta + e^*_{k2}$; and the other quantities are
$$ \bar{x}'_k = \frac{1}{n_k} 1'_{n_k} X_k, \quad e^*_{k1} = \frac{1}{\sqrt{n_k}} 1'_{n_k} \varepsilon_k, \quad
X^*_k = C'_k X_k, \quad e^*_{k2} = C'_k e_k = b_k C'_k 1_{n_k} + C'_k \varepsilon_k = C'_k \varepsilon_k. $$
In particular, note that the contrast component contains, as a linear model, the fixed effects
of interest and, moreover, it is free of the random block effect.
Furthermore, notice that all the information on $\beta$ is in the contrast component if $\bar{x}_k = 0$.
This occurs when the experimental design is replicated at least once in each of the blocks and
the covariate does not change. Also, all of the information on $\beta$ is in the mean component if
the covariates are constant within a block. More often, however, there is information on $\beta$
in both of the components. If this is the case, then for balanced designs, one can put both
pieces back together and obtain an estimator using all of the information. For unbalanced
designs this is not possible. The approach we take is to ignore the information in the mean
component and use the contrast component for inference.
Let $n^* = n - m$. Then the long form of the Arnold transformation (AT) is $Y^*_2 = C' Y$,
where $C' = {\rm diag}[C'_1, \ldots, C'_m]$. So we can model $Y^*_2$ as
$$ Y^*_2 = X^* \beta + e^*_2, \quad\quad (5.3.7) $$
where $e^*_2 = C'e$ and, provided variances exist, ${\rm Var}[e^*_2] = \sigma^2_2 I_{n^*}$, $\sigma^2_2 = \sigma^2(1 - \rho)$, and
$X^* = C'X$.
LS Fit on Arnold Transformed Data

For the traditional least squares procedure, suppose the variances of the errors exist. Under
the additional assumption of normality, the transformed errors are independent. The
traditional estimator is then the usual LS estimator
$$ \widehat{\beta}_{ATLS} = \mathop{\rm Argmin} \|y^*_2 - X^*\beta\|_{LS}; $$
i.e., $\widehat{\beta}_{ATLS} = (X^{*\prime}X^*)^{-1} X^{*\prime} Y^*_2$. This is the extension of Arnold's (1981) solution that was
proposed by Kloke and McKean (2010) for the unbalanced case of Model (5.3.7). As usual,
estimate the intercept based on the mean of the residuals,
$$ \widehat{\alpha}_{LS} = \frac{1}{n} 1'(y - \widehat{y}) = \frac{1}{n} 1'(I_n - X(X'X)^{-1}X')y = \bar{y}. $$
As Exercise 5.6.3 shows, the joint asymptotic distribution is
$$ \left[ \begin{array}{c} \widehat{\alpha}_{LS} \\ \widehat{\beta}_{ATLS} \end{array} \right]
\ \dot{\sim}\ N_{p+1} \left( \left[ \begin{array}{c} \alpha \\ \beta \end{array} \right],
\left[ \begin{array}{cc} \sigma^2_1 & 0' \\ 0 & \sigma^2_2 (X^{*\prime}X^*)^{-1} \end{array} \right] \right), \quad\quad (5.3.8) $$
where $\sigma^2_1 = (\sigma^2/n^2) \sum_{k=1}^m [(1-\rho)n_k + \rho n_k^2]$ and $\sigma^2_2 = \sigma^2(1-\rho)$. Notice that if inference
is to be on $\beta$, then we avoid explicit estimation of $\rho$. To estimate $\sigma^2_2$ we may use
$\widehat{\sigma}^2_2 = \sum_{k=1}^m \sum_{j=1}^{n_k} \widehat{e}^{*2}_{kj}/(n^* - p)$, where $\widehat{e}^*_{kj} = y^*_{kj} - x^{*\prime}_{kj} \widehat{\beta}_{ATLS}$.
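Putting the pieces together, the ATLS fit is ordinary LS on the stacked contrast components. A sketch using the arnold() function given earlier; Yk and Xk (our names) are lists of the block responses and block design matrices.

```r
# A sketch of the ATLS fit of Model (5.3.7).
at.transform <- function(Yk, Xk) {
  Y2 <- unlist(lapply(Yk, function(y) (arnold(length(y)) %*% y)[-1]))
  Xs <- do.call(rbind, lapply(Xk, function(X)
          (arnold(nrow(X)) %*% X)[-1, , drop = FALSE]))
  list(Y2 = Y2, Xstar = Xs)
}
atls <- function(Yk, Xk) {
  d   <- at.transform(Yk, Xk)
  fit <- lm(d$Y2 ~ d$Xstar - 1)   # no intercept: beta = (X*'X*)^{-1} X*' Y*_2
  list(beta = coef(fit),
       sig2.2 = sum(resid(fit)^2) / (length(d$Y2) - ncol(d$Xstar)))
}
```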
R Fit on Arnold Transformed Data

For the R fit of Model (5.3.7), we briefly sketch the development in Kloke and McKean (2010).
Assume that we have selected a score function $\varphi(u)$. We define the Arnold transformation
rank-based (ATR) estimator of $\beta$ as the regression through the origin rank estimator defined
by the steps (5.3.2) - (5.3.4) of the last section; that is, the rank-based estimator is given by
$$ \widehat{\beta}_{ATR} = \mathop{\rm Argmin} \|y^*_2 - X^*\beta\|_\varphi. \quad\quad (5.3.9) $$
The results of Section 5.1 ensure that the ATR estimates are consistent and asymptotically
normal. The reason for doing an Arnold transformation, though, is that the transformed
error variables are uncorrelated. While this does not necessarily mean that they are
independent, in the literature they are usually treated as if they are. This is called working
independence. The asymptotic distributions discussed next are formulated under working
independence. The simulation results reported in Kloke and McKean (2010) support
the validity of the asymptotic distributions over normal and contaminated normal error
distributions.
Recall from the regression through the origin algorithm that the asymptotic distribution
of $\widehat{\beta}_{ATR}$ depends on the choice of the estimate of the intercept $\alpha_1$. For the first case, suppose
the median of the residuals is used as the estimate of the intercept, $\widehat{\alpha}_{ATR} = {\rm med}\{y^*_{kj2} - x^{*\prime}_{kj} \widehat{\beta}_{ATR}\}$.
Then, under working independence, the joint approximate distribution of the
regression parameters is
$$ \left[ \begin{array}{c} \widehat{\alpha}_{ATR} \\ \widehat{\beta}_{ATR} \end{array} \right]
\ \dot{\sim}\ N_{p+1} \left( \left[ \begin{array}{c} \alpha \\ \beta \end{array} \right],
\left[ \begin{array}{cc} \tau^2_S \sigma^2_{s,e}/n & 0' \\ 0 & V \end{array} \right] \right), \quad\quad (5.3.10) $$
where $V$ is given in expression (3.16.12) of Chapter 3, $\sigma^2_{s,e} = 1 + t^* \rho_s$, $t^* = n^{-1}\sum_{k=1}^m n_k(n_k - 1)$,
and $\rho_s = {\rm cov}[{\rm sgn}(e_{11}), {\rm sgn}(e_{12})]$.
For the second case, assume that the score function $\varphi(u)$ is odd about $1/2$; $\varphi(1-u) = -\varphi(u)$.
Let $\widehat{\alpha}^+_{ATR}$ denote the signed-rank estimator of the intercept; see expression (3.5.32)
of Chapter 3. Then, under working independence, the joint approximate distribution of the
rank-based estimator is
$$ \left[ \begin{array}{c} \widehat{\alpha}^+_{ATR} \\ \widehat{\beta}_{ATR} \end{array} \right]
\ \dot{\sim}\ N_{p+1} \left( \left[ \begin{array}{c} \alpha \\ \beta \end{array} \right],
\left[ \begin{array}{cc} \tau^2_S \sigma^2_{s,e}/n & 0' \\ 0 & V \end{array} \right] \right), \quad\quad (5.3.11) $$
where $V = \tau^2_\varphi (X^{*\prime}X^*)^{-1}$. In comparing expressions (5.3.8) and (5.3.11), we see that the
asymptotic relative efficiency (ARE) between the ATLS and the ATR estimators is the same as
that of the LS and R estimates in ordinary linear models. In particular, when Wilcoxon scores are
used and the errors have a normal distribution, the ARE between the ATLS and ATR (Wilcoxon)
estimators is the usual 0.95. Hence, for this second case, the ATR estimators are efficiently robust.
To complete the practical inference, the scale parameters $\tau_\varphi$ and $\tau_S$ are based on the
distribution of $e^*_{2kj}$ and can be estimated as discussed in Chapter 3. From this, an inference
is readily formed for the parameters of the model. Validity of the resulting confidence
intervals is confirmed in the simulation study of Kloke and McKean (2010). Studentized
residuals are also discussed in this article. A matrix expression such as (5.2.10) for the
simple mixed model is derived by the authors; however, unlike the situation in Section 5.2.2,
some of the necessary correlations are not straightforward to estimate. Kloke and McKean
recommend a bootstrap to estimate the standard error of a residual. We use these in the
following example.
Example and Discussion

The following example is drawn from the article of Kloke and McKean (2010). Although
simple, the following data set demonstrates some of the nice features of Arnold's transformation,
particularly for balanced data.

Example 5.3.1 (Milliken and Johnson Data). The data in Table 5.3.1 are from an example
found on page 260 of Milliken and Johnson (2002). Each row represents a block of length two.
There is one covariate, and each of the responses was a measurement on a different treatment.
The model for these data is
$$ Y_k = \alpha 1_2 + \left[ \begin{array}{c} -0.5 \\ 0.5 \end{array} \right] \Delta + \beta x_k 1_2 + \varepsilon_k. $$
Table 5.3.1: Data for Example 5.3.1.

   x      y1     y2
 23.2    60.4   76.0
 26.9    59.9   76.3
 29.4    64.4   77.8
 22.7    63.5   75.6
 30.6    80.6   94.6
 36.9    75.9   96.1
 17.6    53.7   62.3
 28.5    66.3   81.6
Table 5.3.2: ATR and ATLS estimates and standard errors for Example 5.3.1.

                  ATR              ATLS
Parameter     Est     SE       Est     SE
$\alpha$     70.80   3.54     72.80   8.98
$\Delta$     14.45   1.61     14.45   1.19
$\beta$       1.43   0.65      1.46   0.33
The Arnold transformation for this model is
$$ \Gamma_k = \frac{1}{\sqrt{2}} \left[ \begin{array}{rr} 1 & 1 \\ -1 & 1 \end{array} \right]. $$
The transformed responses are $Y^*_k = \Gamma_k Y_k = [Y^*_{k1}, Y^*_{k2}]'$, where
$$ Y^*_{k1} = \alpha^* + \beta^* x_k + \varepsilon^*_{k1}, \quad Y^*_{k2} = \Delta^* + \varepsilon^*_{k2}, $$
with $\alpha^* = \sqrt{2}\,\alpha$, $\beta^* = \sqrt{2}\,\beta$, and $\Delta^* = \Delta/\sqrt{2}$. We treat the transformed errors $\varepsilon^*_{k1}$, for $k = 1, \ldots, m$,
and $\varepsilon^*_{k2}$, for $k = 1, \ldots, m$, as iid. Notice that the first component is a simple linear regression
model and the second component is a simple location model. For this example, we use
signed-rank estimates for both of the intercept terms. The estimates and standard errors of
the parameters are given in Table 5.3.2.
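For $n_k = 2$ the transformation is explicit, so the estimates in Table 5.3.2 can be checked with a few lines of R. Here is a sketch with the data of Table 5.3.1, using the Hodges-Lehmann (signed-rank) estimate for the location component and, for brevity, LS for the regression component.

```r
# A sketch checking Example 5.3.1 (data of Table 5.3.1).
x  <- c(23.2, 26.9, 29.4, 22.7, 30.6, 36.9, 17.6, 28.5)
y1 <- c(60.4, 59.9, 64.4, 63.5, 80.6, 75.9, 53.7, 66.3)
y2 <- c(76.0, 76.3, 77.8, 75.6, 94.6, 96.1, 62.3, 81.6)
Ys1 <- (y1 + y2) / sqrt(2)   # mean components: sqrt(2)(alpha + beta x) + error
Ys2 <- (y2 - y1) / sqrt(2)   # contrast components: Delta/sqrt(2) + error
Dstar <- wilcox.test(Ys2, conf.int = TRUE)$estimate  # Hodges-Lehmann estimate
sqrt(2) * Dstar                 # Delta-hat: approximately 14.45
coef(lm(Ys1 ~ x))[2] / sqrt(2)  # beta-hat (LS version): about 1.46
```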
Kloke and McKean (2010) plotted bootstrap Studentized residuals for the least squares
(top) and Wilcoxon fits. These plots show no serious outliers.
To demonstrate the robustness of the ATR estimates in the example, Kloke and McKean
(2010) conducted a small sensitivity analysis. They set the second data point to
$y^{(i)}_{12} = y_{11} + \Delta y$, where $\Delta y$ varied from $-30$ to $30$. Then the parameters $\widehat{\Delta}^{(i)}$ were estimated based on
the data set with the outlier. The figure below displays the relative change of the estimate,
$(\widehat{\Delta}^{(i)} - \widehat{\Delta})/\widehat{\Delta}$, as a function of $\Delta y$.
[Figure: relative change in the ATR estimate as a function of $\Delta y$.]

Over this range of y, the relative changes in the ATR estimate is between 0.042 to 0.062.
In contrast, as the reader is asked to show in Exercise 5.6.4, the relative change in ATLS
over this range is between 0.125 to 0.394. Hence, the relative change in the ATR estimates
is small, which indicates the robustness of the ATR estimates.
5.4 General Estimating Equations (GEE)

For longitudinal data, Liang and Zeger (1986) presented an elegant, general iterated reweighted
least squares (IRLS) fit of a generalized longitudinal model. As we note below, their fit solves
a set of general estimating equations (GEE). Their model is more general than Model (5.1.1).
Abebe, McKean and Kloke (2010) developed a rank-based fit of this general model which
we present in this section. While analogous to Liang and Zeger's fit, it is robust in response
space. Further, the procedure can easily be generalized to be robust in factor space, also.
Consider a longitudinal set of observations over $m$ subjects. Let $y_{it}$ denote the $t$th
response for the $i$th subject, for $t = 1, 2, \ldots, n_i$ and $i = 1, 2, \ldots, m$. Assume that $x_{it}$ is a $p \times 1$
vector of corresponding covariates. Let $n = \sum_{i=1}^m n_i$ denote the total sample size. Assume
that the marginal distribution of $y_{it}$ is of the exponential class of distributions and is given
by
$$ f(y_{it}) = \exp\{[y_{it}\theta_{it} - a(\theta_{it}) + b(y_{it})]\phi\}, \quad\quad (5.4.1) $$
where $\phi > 0$, $\theta_{it} = h(\eta_{it})$, $\eta_{it} = x^T_{it}\beta$, and $h(\cdot)$ is a specified function. Thus the mean and
variance of $y_{it}$ are given by
$$ E(y_{it}) = a'(\theta_{it}) \quad \mbox{and} \quad {\rm Var}(y_{it}) = a''(\theta_{it})/\phi, \quad\quad (5.4.2) $$
where the prime denotes differentiation. In this notation, the link function is $h^{-1} \circ (a')^{-1}$. More
assumptions are stated later for the theory.
Let $Y_i = (y_{i1}, \ldots, y_{in_i})^T$ and $X_i = (x_{i1}, \ldots, x_{in_i})^T$ denote the $n_i \times 1$ vector of responses
and the $n_i \times p$ matrix of covariates, respectively, for the $i$th individual. We consider the
general case where the components of the vector of responses for the $i$th subject, $Y_i$, are
dependent. Let $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{in_i})^T$, so that $E(Y_i) = a'(\theta_i) = (a'(\theta_{i1}), \ldots, a'(\theta_{in_i}))^T$.
For an $s \times 1$ vector of unknown correlation parameters $\alpha$, let $C_i = C_i(\alpha)$ denote a $n_i \times n_i$
correlation matrix. Define the matrix
$$ V_i = A_i^{1/2} C_i(\alpha) A_i^{1/2} / \phi, \quad\quad (5.4.3) $$
where $A_i = {\rm diag}\{a''(\theta_{i1}), \ldots, a''(\theta_{in_i})\}$. The matrix $V_i$ need not be the covariance matrix of
$Y_i$. In any case, we refer to $C_i$ as the working correlation matrix. For estimation, let $\widehat{V}_i$ be
an estimate of $V_i$. This, in general, requires estimation of $\alpha$ and often an initial estimate of
$\beta$. In general, we denote the estimator of $\beta$ by $\widehat{\beta}(\widehat{\alpha}, \widehat{\phi})$ to reflect its dependence on $\widehat{\alpha}$ and
$\widehat{\phi}$.
Liang and Zeger (1986) defined their estimate in terms of general estimating equations
(GEE). Define the $n_i \times p$ Hessian matrix
$$ D_i = \frac{\partial a'(\theta_i)}{\partial \beta}, \quad i = 1, \ldots, m. \quad\quad (5.4.4) $$
Then their GEE estimator $\widehat{\beta}_{LS}$ is the solution to the equations
$$ \sum_{i=1}^m D_i^T \widehat{V}_i^{-1} [Y_i - a'(\theta_i)] = 0. \quad\quad (5.4.5) $$
To motivate our estimator, it is convenient to write this in terms of the Euclidean norm.
Define the dispersion function
$$ \begin{array}{rcl}
D_{LS}(\beta) & = & \displaystyle\sum_{i=1}^m [Y_i - a'(\theta_i)]^T \widehat{V}_i^{-1} [Y_i - a'(\theta_i)] \\[2ex]
 & = & \displaystyle\sum_{i=1}^m [\widehat{V}_i^{-1/2} Y_i - \widehat{V}_i^{-1/2} a'(\theta_i)]^T [\widehat{V}_i^{-1/2} Y_i - \widehat{V}_i^{-1/2} a'(\theta_i)] \\[2ex]
 & = & \displaystyle\sum_{i=1}^m \sum_{t=1}^{n_i} [y^*_{it} - d_{it}(\beta)]^2, \quad\quad (5.4.6)
\end{array} $$
where $Y^*_i = \widehat{V}_i^{-1/2} Y_i = (y^*_{i1}, \ldots, y^*_{in_i})^T$, $d_{it}(\beta) = c^T_t a'(\theta_i)$, and $c^T_t$ is the $t$th row of $\widehat{V}_i^{-1/2}$.
The gradient of $D_{LS}(\beta)$ is
$$ \nabla D_{LS}(\beta) = -2 \sum_{i=1}^m D_i^T \widehat{V}_i^{-1} [Y_i - a'(\theta_i)]. \quad\quad (5.4.7) $$
Thus the solution to the GEE equations (5.4.5) can also be expressed as
$$ \widehat{\beta}_{LS} = \mathop{\rm Argmin}\, D_{LS}(\beta). \quad\quad (5.4.8) $$
From this point of view, $\widehat{\beta}_{LS}$ is a nonlinear least squares (LS) estimator. We refer to it as
the GEEWL2 estimator.
Consider, then, the robust rank-based nonlinear estimators discussed in Section 3.14.
For nonnegative weights (see expression (5.4.10) below), we assume for now that the score
function $\varphi(u)$ is odd about $1/2$, i.e., satisfies (2.5.9). In situations where this assumption is
unwarranted, we can adjust the weights to accommodate scores appropriate for skewed error
distributions; see the discussion in Section 5.4.3.
Next consider the general model defined by expressions (5.4.1) and (5.4.2). As in the LS
development, let $Y^*_i = \widehat{V}_i^{-1/2} Y_i = (y^*_{i1}, \ldots, y^*_{in_i})^T$ and $g_{it}(\beta) = c^T_t a'(\theta_i)$, where $c^T_t$ is the $t$th row
of $\widehat{V}_i^{-1/2}$, and let $G^*_i = [g_{it}]$. The rank-based dispersion function is given by
$$ D_R(\beta) = \sum_{i=1}^m \sum_{t=1}^{n_i} \varphi[R(y^*_{it} - g_{it}(\beta))/(n+1)]\, [y^*_{it} - g_{it}(\beta)]. \quad\quad (5.4.9) $$
We next write the R estimator as a weighted LS estimator. From this representation the
asymptotic theory of the R estimator can be derived. Furthermore, it naturally suggests
an IRLS algorithm. Let $e_{it}(\beta) = y^*_{it} - g_{it}(\beta)$ denote the $(i,t)$th residual and let
$\widehat{m}(\beta) = {\rm med}_{(i,t)}\{e_{it}(\beta)\}$ denote the median of all the residuals. Then, because the scores sum to 0, we
have the identity
$$ \begin{array}{rcl}
D_R(\beta) & = & \displaystyle\sum_{i=1}^m \sum_{t=1}^{n_i} \varphi[R(e_{it}(\beta))/(n+1)]\, [e_{it}(\beta) - \widehat{m}(\beta)] \\[2ex]
 & = & \displaystyle\sum_{i=1}^m \sum_{t=1}^{n_i} \frac{\varphi[R(e_{it}(\beta))/(n+1)]}{e_{it}(\beta) - \widehat{m}(\beta)}\, [e_{it}(\beta) - \widehat{m}(\beta)]^2 \\[2ex]
 & = & \displaystyle\sum_{i=1}^m \sum_{t=1}^{n_i} w_{it}(\beta)\, [e_{it}(\beta) - \widehat{m}(\beta)]^2, \quad\quad (5.4.10)
\end{array} $$
where $w_{it}(\beta) = \varphi[R(e_{it}(\beta))/(n+1)]/[e_{it}(\beta) - \widehat{m}(\beta)]$ is a weight function. As usual, we
take $w_{it}(\beta) = 0$ if $e_{it}(\beta) - \widehat{m}(\beta) = 0$. Note that by using the median of the residuals in
conjunction with property (2.5.9), the weights are positive. To accommodate score
functions other than those that satisfy (2.5.9), quantiles other than the median can be used; see
Example 5.4.3 and Sievers and Abebe (2004) for discussion.
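For concreteness, here is a sketch of the weights in (5.4.10) under Wilcoxon scores; e (our name) holds the current residuals $e_{it}(\beta)$ stacked into one vector.

```r
# Weights w_it(beta) of (5.4.10): score over centered residual, set to 0
# when the residual equals the median.
gee.weights <- function(e) {
  a <- sqrt(12) * (rank(e) / (length(e) + 1) - 0.5)  # phi[R(e)/(n+1)]
  d <- e - median(e)
  ifelse(d == 0, 0, a / d)
}
```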
For the initial estimator of $\beta$, we recommend the rank-based estimator of Chapter 3 based
on the score function $\varphi(u)$. Denote this estimator by $\widehat{\beta}^{(0)}_R$. As estimates of the weights, we
use $w_{it}\bigl(\widehat{\beta}^{(0)}_R\bigr)$; i.e., the weight function evaluated at $\widehat{\beta}^{(0)}_R$. Expression (5.4.10) leads to the
dispersion function
$$ \begin{array}{rcl}
D^*_R\bigl(\beta \,|\, \widehat{\beta}^{(0)}_R\bigr) & = & \displaystyle\sum_{i=1}^m \sum_{t=1}^{n_i}
 w_{it}\bigl(\widehat{\beta}^{(0)}_R\bigr) \bigl[e_{it}(\beta) - \widehat{m}\bigl(\widehat{\beta}^{(0)}_R\bigr)\bigr]^2 \\[2ex]
 & = & \displaystyle\sum_{i=1}^m \sum_{t=1}^{n_i} \left[ \sqrt{w_{it}\bigl(\widehat{\beta}^{(0)}_R\bigr)}\; e_{it}(\beta)
 - \sqrt{w_{it}\bigl(\widehat{\beta}^{(0)}_R\bigr)}\; \widehat{m}\bigl(\widehat{\beta}^{(0)}_R\bigr) \right]^2. \quad\quad (5.4.11)
\end{array} $$
Let
$$ \widehat{\beta}^{(1)}_R = \mathop{\rm Argmin}\, D^*_R\bigl(\beta \,|\, \widehat{\beta}^{(0)}_R\bigr). \quad\quad (5.4.12) $$
This establishes a sequence of IRLS estimates $\bigl\{\widehat{\beta}^{(k)}_R\bigr\}$, $k = 1, 2, \ldots$.
After some algebraic simplification, we obtain the gradient
$$ \nabla D^*_R\bigl(\beta \,|\, \widehat{\beta}^{(k)}_R\bigr) = -2 \sum_{i=1}^m D_i^T \widehat{V}_i^{-1/2} \widehat{W}_i \widehat{V}_i^{-1/2}
 \bigl[Y_i - a'(\theta_i) - \widehat{m}^*_i\bigl(\widehat{\beta}^{(k)}_R\bigr)\bigr], \quad\quad (5.4.13) $$
where $\widehat{m}^*_i\bigl(\widehat{\beta}^{(k)}_R\bigr) = \widehat{V}_i^{1/2}\, \widehat{m}\bigl(\widehat{\beta}^{(k)}_R\bigr)\, 1$, $1$ denotes a $n_i \times 1$ vector all of whose elements are 1,
and $\widehat{W}_i = {\rm diag}\{w_{i1}, \ldots, w_{in_i}\}$ is the diagonal matrix of weights for the $i$th subject. Hence,
$\widehat{\beta}^{(k+1)}_R$ satisfies the general estimating equations (GEE) given by
$$ \sum_{i=1}^m D_i^T \widehat{V}_i^{-1/2} \widehat{W}_i \widehat{V}_i^{-1/2} \bigl[Y_i - a'(\theta_i) - \widehat{m}^*_i\bigl(\widehat{\beta}^{(k)}_R\bigr)\bigr] = 0. \quad\quad (5.4.14) $$
We refer to this weighted, general estimating equations estimator as the GEEWR estimator.
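One step of the resulting IRLS scheme is just a weighted LS fit. The sketch below is for the linear-model case ($a'(\theta_i) = X_i\beta$, $D_i = X_i$) under working independence ($\widehat{V}_i = I$), using the gee.weights sketch given above; it is illustrative, not the authors' implementation.

```r
# A sketch of one GEEWR iteration, (5.4.11)-(5.4.12): with the weights and
# median fixed at the current iterate, the minimizer is weighted LS of
# (y - m-hat) on X.
geewr.step <- function(y, X, beta) {
  e <- y - X %*% beta
  w <- gee.weights(e)
  m <- median(e)
  lm.wfit(X, y - m, w)$coefficients   # next iterate beta^(k+1)
}
```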
5.4.1 Asymptotic Theory

Recall that both the GEEWL2 and GEEWR estimators were defined in terms of the univariate
variables $y^*_{it}$. These of course are transformations of the original observations by
the estimates of the covariance matrix $V_i$ and the weight matrix $W_i$. For the theory, we
need to consider similar transformed variables using the matrices $V_i$ and $W_i$, where this
notation means that $V_i$ and $W_i$ are evaluated at the true parameters. For $i = 1, \ldots, m$ and
$t = 1, \ldots, n_i$, let
$$ \begin{array}{rcl}
Y^*_i & = & V_i^{-1/2} Y_i = (y^*_{i1}, \ldots, y^*_{in_i})^T, \\
G^*_i(\beta) & = & V_i^{-1/2} a'(\theta_i) = [g^*_{it}], \\
e^*_{it} & = & y^*_{it} - g^*_{it}(\beta). \quad\quad (5.4.15)
\end{array} $$
To obtain asymptotic distribution theory for a GEE procedure, assumptions concerning these
errors $e^*_{it}$ must be made. Regularity conditions for the GEEWL2 estimates are discussed in
Liang and Zeger (1986). For the GEEWR estimator, assume these conditions and, further,
that the marginal pdf of $e^*_{it}$ is continuous and the variance-covariance matrix given in (5.4.16)
is positive definite. Under these conditions, Abebe et al. (2010) derived the asymptotic
distribution of the GEEWR estimator. The proof involves a Taylor series expansion, as in
Liang and Zeger's proof, and the rank-based theory found in Brunner and Denker
(1994) for dependent observations. We state the result in the next theorem.
Theorem 5.4.1. Assume that the initial estimate satisfies $\sqrt{m}\,(\widehat{\beta}^{(0)}_R - \beta) = O_p(1)$. Then,
under the above assumptions, for $k \geq 1$, $\sqrt{m}\,(\widehat{\beta}^{(k)}_R - \beta)$ has an asymptotic normal distribution
with mean $0$ and covariance matrix
$$ \lim_{m \to \infty} m \left( \sum_{i=1}^m D_i^T V_i^{-1/2} W_i V_i^{-1/2} D_i \right)^{-1}
\left( \sum_{i=1}^m D_i^T V_i^{-1/2}\, {\rm Var}(\varphi^*_i)\, V_i^{-1/2} D_i \right)
\left( \sum_{i=1}^m D_i^T V_i^{-1/2} W_i V_i^{-1/2} D_i \right)^{-1}, \quad\quad (5.4.16) $$
where $\varphi^*_i$ denotes the $n_i \times 1$ vector $(\varphi[R(e^*_{i1})/(n+1)], \ldots, \varphi[R(e^*_{in_i})/(n+1)])^T$.
5.4.2 Implementation and a Monte Carlo Study

For practical use of the GEEWR estimate, the asymptotic covariance matrix (5.4.16) requires
estimation. This is true even in the case where percentile bootstrap confidence intervals are
employed for inference, because appropriately standardized bootstrap estimates are generally
used. We present a nonparametric estimator of the covariance structure and then an
approximation to it. We compare these in a small simulation study.

Nonparametric (NP) Estimator of Covariance

The covariance structure suggests a simple moment estimator. Let $\widehat{\beta}^{(k)}$ and (for the $i$th
subject) $\widehat{V}^{(k)}_i$ denote the final estimates of $\beta$ and $V_i$, respectively. Then the residuals which
estimate $e^*_i = (e^*_{i1}, \ldots, e^*_{in_i})^T$ are given by
$$ \widehat{e}^*_i = \bigl(\widehat{V}^{(k)}_i\bigr)^{-1/2} Y_i - \widehat{G}^{(k)}_i(\widehat{\beta}^{(k)}), \quad i = 1, \ldots, m, \quad\quad (5.4.17) $$
where $\widehat{G}^{(k)}_i = \bigl(\widehat{V}^{(k)}_i\bigr)^{-1/2} a'\bigl(\widehat{\theta}^{(k)}_i\bigr)$ and $\widehat{\theta}^{(k)}_{it} = h\bigl(x^T_{it}\widehat{\beta}^{(k)}\bigr)$. Let $R(\widehat{e}^*_{it})$ denote the rank of
$\widehat{e}^*_{it}$ among $\{\widehat{e}^*_{it}\}$, $t = 1, \ldots, n_i$; $i = 1, \ldots, m$. Let
$\widehat{\varphi}_i = (\varphi[R(\widehat{e}^*_{i1})/(n+1)], \ldots, \varphi[R(\widehat{e}^*_{in_i})/(n+1)])^T$ and let $\widehat{S}_i = \widehat{\varphi}_i - \bar{\widehat{\varphi}}\, 1_{n_i}$, where
$\bar{\widehat{\varphi}}$ denotes the average of the scores. Then a moment estimator of the covariance matrix (5.4.16) is
that expression with ${\rm Var}(\varphi^*_i)$ estimated by
$$ \widehat{\rm Var}(\varphi^*_i) = \widehat{S}_i \widehat{S}^T_i, \quad\quad (5.4.18) $$
and, of course, the final estimates of $D_i$ and $V_i$. We label this estimator (NP). Although this
is a simple nonparametric estimate of the covariance structure, in a simulation study Abebe
et al. (2010) showed that this estimate often leads to a very liberal inference. Werner and
Brunner (2007) discovered this in a corresponding rank testing problem.
Approximation (AP) of the Nonparametric Estimator

The form of the weights, though, suggests a simple approximation, which is based on certain
ideal conditions. Suppose the model is correct. Assume that the true transformed errors
are independent. Then, because the scores have been standardized, ${\rm Var}(\varphi^*_i)$ asymptotically
converges to $I_{n_i}$, so replace it with $I_{n_i}$. This is the first part of the approximation.
Next consider the weights. The functional for the weights is of the form $\varphi[F(e)]/e$.
Assuming that $F(0) = 1/2$, a simple application of the Mean Value Theorem gives the
approximation $\varphi[F(e)]/e \doteq \varphi'[F(e)]f(e)$. The expected value of this approximation can be
expressed as
$$ \tau^{-1} = \int \varphi'[F(t)] f^2(t)\, dt = \int_0^1 \varphi(u) \left\{ -\frac{f'[F^{-1}(u)]}{f[F^{-1}(u)]} \right\} du, \quad\quad (5.4.19) $$
where the second integral is derived from the first by integration by parts followed by a
substitution. The parameter $\tau$ is of course the usual scale parameter for the R estimates in
the linear model based on the score function $\varphi(u)$. The second part of the approximation is
to replace the weight matrix by $(1/\widehat{\tau})I$. We label this estimator of the covariance matrix of
$\widehat{\beta}^{(k)}$ by (AP).
Monte Carlo Study

We report the results of a small simulation study in Abebe et al. (2010) which compares the
estimators (NP) and (AP). It also provides empirical information on the relative efficiency
between $\widehat{\beta}^{(k)}$ and the maximum likelihood estimator (mle) under assumed normality.
The simulated model is a randomized block design with the fixed factor at five levels and
the random (block) factor at seven levels. The distribution of the random effect was taken to
be normal. Two error distributions were considered: a normal and a contaminated normal
with the contamination rate at 20% and ratio of the contaminated standard deviation to the
noncontaminated at five. For the normal error model, the intraclass correlation coefficient
was set at 0.5. For each distribution, 10,000 simulations were run.
We consider the GEEWR estimator based on a working independence covariance structure.
We compared it with the maximum likelihood estimator (mle) for a randomized block
design. This yields the traditional analysis used in practice. We used the R function lme
(Pinheiro et al., 2007) to compute it.
Table 5.4.1 records the results of the empirical efficiencies and empirical confidences
between the GEEWR estimator and mle estimator for the fixed effect contrasts between
level 1 and the other four levels. The empirical confidence coefficients are for nominal 95%
confidence intervals based on the asymptotic distribution of the GEEWR estimator using the
nonparametric (NP) estimate of the covariance structure, the approximation (AP) discussed
above, and the mle inference.
Table 5.4.1: Empirical Efficiencies and Confidence Coefficients

                               Contrast
Dist.   Method      2-1      3-1      4-1      5-1
Empirical Efficiency
Norm               0.974    0.974    0.972    0.973
CN                 2.065    2.102    2.050    2.055
Empirical Conf. Coeff.
Norm    mle        0.916    0.915    0.914    0.914
        NP         0.546    0.551    0.564    0.549
        AP         0.951    0.955    0.954    0.951
CN      mle        0.919    0.923    0.916    0.915
        NP         0.434    0.445    0.438    0.441
        AP         0.890    0.803    0.893    0.889
At the normal distribution, the loss in empirical efficiency of the GEEWR estimates over
the mle estimates is only about 3%, while for the contaminated normal distribution the
gain in efficiency of the GEEWR estimates over the maximum likelihood estimates is about
200%. Hence, for these situations the GEEWR estimator possesses efficiency robustness.
In terms of empirical confidence coefficients, the nonparametric procedure is quite liberal.
In contrast, the approximate procedure's confidences are quite close to the nominal confidence
(95%) for the normal situation and similar to those of the mle for the contaminated
normal situation.
5.4.3 Example

As an example, we selected part of a study by Plaisance et al. (2007) concerning the effect of
a single session of high intensity aerobic exercise on inflammatory markers of subjects taken
over time. One purpose of the study was to see if these markers differed depending on the
fitness level of the subject. Subjects were placed into one of two groups (High Fitness and
Moderate Fitness) depending on the level of their peak oxygen uptake. The response we
consider here is C-reactive protein (CRP). Elevated CRP levels are a marker of low-grade
chronic inflammation and may predict a higher risk for cardiovascular disease (Ridker et al.,
2002). The effect of interest is the difference in CRP between the two groups, which we
denote by $\Delta$. Hence, a one-sided hypothesis of interest is
$$ H_0: \Delta \geq 0 \quad \mbox{versus} \quad H_A: \Delta < 0. \quad\quad (5.4.20) $$
Out of the 21 subjects in the study, three were removed due to noncompliance or incomplete
information. Thus, we consider the remaining 18 individuals, 9 in each group.
CRP level was obtained 24 hours and immediately prior to the acute bout of exercise and
subsequently 24, 72, and 120 hours following exercise, giving 90 data points in all. The data
are displayed in Table A.0.2 of Appendix B. The top left comparison boxplot of Figure
5.4.1 shows the effect based on the raw responses. An estimate of the effect based on the
raw data is the difference in medians, which is $-0.54$. Note that the responses are skewed with
outliers in each group. We took the time of measurement as a covariate. Let $y_i$ and $x_i$
denote respectively the $5 \times 1$ vectors of observations and times of measurement for subject
$i$, and let $c_i$ denote his/her indicator variable for group, i.e., its components are either 0 (for
Moderate Fitness) or 1 (for High Fitness). Then our model is
$$ y_i = \alpha 1_5 + \Delta c_i + \beta x_i + e_i, \quad i = 1, \ldots, 18, \quad\quad (5.4.21) $$
where $e_i$ denotes the vector of errors for the $i$th individual. We present the results for
three covariance structures of $e_i$: working independence (WI), compound symmetry (CS),
and autoregressive-one (AR(1)). We fit the GEEWR estimate for each of these covariance
structures using Wilcoxon scores.
The error model for compound symmetry is the simple mixed model; i.e., $e_i = b_i 1_{n_i} + a_i$,
where $b_i$ is the random effect for subject $i$ and the components of $a_i$ are iid and independent
of $b_i$. Let $\sigma^2_b$ and $\sigma^2_a$ denote the variances of $b_i$ and $a_{ij}$, respectively. Let $\sigma^2_t = \sigma^2_b + \sigma^2_a$
denote the total variance and $\rho = \sigma^2_b/\sigma^2_t$ denote the intraclass coefficient. In this case,
the covariance matrix of $e_i$ is of the form $\sigma^2_t[(1-\rho)I + \rho J]$. We estimated the variance
component parameters $\sigma^2_t$ and $\rho$ at each step of the fit of Model (5.4.21) using the robust
estimators discussed in Section 5.2.1.
The error model for the AR(1) is $e_{ij} = \rho_1 e_{i,j-1} + a_{ij}$, $j = 2, \ldots, n_i$, where the $a_{ij}$'s are
iid for the $i$th subject. The $(s,t)$ entry in the covariance matrix of $e_i$ is $\sigma^2 \rho_1^{|s-t|}$, where
$\sigma^2 = \sigma^2_a/(1 - \rho^2_1)$. To estimate the covariance structure at step $k$, for each subject, we fit
this autoregressive model using the current residuals. For each subject, we then estimate
$\rho_1$ using the Wilcoxon regression estimate of Chapter 3. As our estimate of $\rho_1$, we take the
median over subjects of these Wilcoxon regression estimates. Likewise, as our estimate of
$\sigma^2_a$, we take the median over subjects of ${\rm MAD}^2$ of the residuals based on the AR(1) fits.
Note that there are only 18 experimental units in this problem, nine for each treatment.
So it is a small sample problem. Accordingly, we used a bootstrap to standardize the
GEEWR estimates. Our bootstrap consisted of resampling the 18 experimental units, nine
from each group. This keeps the covariance structure intact. Then for each bootstrap sample,
the GEEWR estimate was computed and recorded. We used 3000 bootstrap samples. With
these small samples, the outliers had an effect on the bootstrap, also. Hence, we used the
MAD of the bootstrap estimates of $\Delta$ as our standard error of $\widehat{\Delta}$.
Table 5.4.2 summarizes the three GEEWR estimates of $\Delta$ and $\beta$, along with the estimates
of the variance components for the CS and AR(1) models. As the comparison boxplots of
residuals show in Figure 5.4.1, the three fits are similar. The WI and AR(1) estimates of
the effect are quite similar, including their bootstrap standard errors. The CS estimate of
$\Delta$, though, is more precise, and it is closer to the difference (based on the raw data) in medians,
$-0.54$. The traditional fit of the simple mixed model (under CS covariance structure) would
[Figure 5.4.1: Plots for CRP data: group comparison boxplots of CRP (High Fit versus Mod. Fit); boxplots of residuals for the AR(1), CS, and WI fits; residual plot of the CS fit; and normal qq plot of the CS residuals.]
be the maximum likelihood fit based on normality. We obtained this fit by using the lme
function in R. Its estimate of $\Delta$ is $-0.319$ with standard error 0.297. For the hypotheses
of interest, (5.4.20), based on asymptotic normality, the CS GEEWR estimate is marginally
significant with $p = 0.064$, while the mle estimate is insignificant with $p = 0.141$.
Note that the residual and qq plots of the CS GEEWR fit, bottom plots of Figure
5.4.1, show that the error distribution is right skewed with a heavy right tail. This suggests
using scores more appropriate for skewed error distributions than the Wilcoxon scores. We
considered a simple score from the class of Winsorized Wilcoxon scores. The Wilcoxon score
function is linear. For this data, a suitable Winsorizing score function is the piecewise linear
function which is linear on the interval $(0, c)$ and then constant on the interval $(c, 1)$. As
discussed in Example 2.5.1 of Chapter 2, these scores are optimal for a skewed distribution
with a logistic left tail and an exponential right tail. We obtained the GEEWR fit of this
data using this score function with $c = 0.80$; i.e., the bend is at 0.80. To ensure positive
weights, we used the 47th percentile as the location estimator $\widehat{m}(\beta)$ in the definition of
Table 5.4.2: Summary of Estimates and Bootstrap Standard Errors (BSE).

Wilcoxon Scores
COV.     Delta-hat  BSE    beta-hat  BSE     Cov. Parameters
WI       0.291      0.293  0.0007    0.0007  NA, NA
CS       0.370      0.244  0.0010    0.0007  sigma_a^2 = 0.013, rho = 0.968
AR(1)    0.303      0.297  0.0008    0.0015  phi_1 = 0.023, sigma_a^2 = 0.032

Winsorized Wilcoxon Scores with Bend at 0.8
CS       0.442      0.282  0.008     0.0008  sigma_a^2 = 0.017, rho = 0.966
the weights; see the discussion around expression (5.4.10). The computed estimates and their bootstrap standard errors are given in the last row of Table 5.4.2 for the compound symmetry case. The estimate of $\Delta$ is 0.442, which is closer than the Wilcoxon estimate to the difference in medians based on the raw data. Using the bootstrap standard error, the corresponding $z$-test for hypotheses (5.4.20) is 1.57 with a $p$-value of 0.059, which is more significant than the test based on Wilcoxon scores. Computationally, the iterated reweighted GEEWR algorithm remains the same except that the Wilcoxon scores are replaced by these Winsorized Wilcoxon scores.
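A sketch of this bend-at-$c$ score function in R (our notation; any standardization constants used in the text are omitted):

    ## Sketch: Winsorized Wilcoxon score, linear on (0, c), constant on (c, 1),
    ## centered so that it integrates to zero; c = 0.80 puts the bend at 0.80.
    phi_wins <- function(u, c = 0.80) pmin(u, c) - (c - c^2 / 2)
    curve(phi_wins(x), from = 0, to = 1)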
As a final note, the residual plot of the GEEWR fit for the compound symmetric dependence structure also shows some heteroscedasticity: the variability of the residuals is directly proportional to the fitted values. This scale trend can be modeled robustly using the rank-based procedures discussed in Exercise 3.16.39.
5.5 Time Series
5.6 Exercises
5.6.1. Assume the simple mixed model (5.2.1). Show that expression (5.2.2) is true.
5.6.2. Obtain the ARE between the R and traditional estimates found in expression (5.2.4),
for Wilcoxon scores when the random error vector has a multivariate normal distribution.
5.6.3. Show that the asymptotic distribution of the LS estimator for the Arnold transformed
model is given by expression (5.3.8).
5.6.4. Consider Example 5.3.1.
(a.) Verify the ATR and ATLS estimates in Table 5.3.2.
(b.) Over the range of y used in the example, verify the relative changes in the ATR and
ATLS estimates as shown in the example.
5.6.5. Consider the discussion of test statistics around expression (5.1.12). Explore the
asymptotic distributions of the drop in dispersion and aligned rank test statistics under the
null and contiguous alternatives for the general mixed model.
5.6.6. Continuing with the last exercise, suppose that the simple mixed model (5.2.1) is true. Suppose further that the design is centered within each block; i.e., $X_k' 1_{n_k} = 0_p$. For example, this is true for an ANOVA design in which all subjects have all treatment combinations such as the Plasma Example of Section 4.

(a.) Under this assumption, show that expression (5.2.2) simplifies to $V_\varphi = \sigma_\varphi^2 (1-\rho_\varphi)(X'X)^{-1}$.

(b.) Show that the noncentrality parameter $\theta$, (5.1.13), simplifies to
$$ \theta = \frac{1}{\sigma_\varphi^2 (1-\rho_\varphi)}\, \beta' M' [M(X'X)^{-1}M']^{-1} M\beta. $$
(c.) Consider as a test statistic the standardized version of the reduction in dispersion,
$$ F_{RD,\varphi} = \frac{RD_\varphi/q}{(1-\hat{\rho}_\varphi)(\hat{\tau}_\varphi/2)}. $$
Show that under the null hypothesis $H_0$, $qF_{RD,\varphi} \stackrel{D}{\to} \chi^2(q)$ and that under the sequence of alternatives $H_{An}$, $qF_{RD,\varphi} \stackrel{D}{\to} \chi^2(q, \theta)$, where the noncentrality parameter $\theta$ is given in Part (b).

(d.) Show that $F_{W,\varphi}$, (5.1.12), and $F_{RD,\varphi}$ are asymptotically equivalent under the null and local alternative models.

(e.) Explore the asymptotic distribution of the aligned rank test under the conditions of this exercise.
5.6.7. AR(1) generalized exercise.
Chapter 6
Multivariate
6.1 Multivariate Location Model
We now consider a statistical model in which we observe vectors of observations. For example,
we may record both the SAT verbal and math scores on students. We then wish to investigate
the bivariate distribution of scores. We may wish to test the hypothesis that the vector of
population locations has changed over time or to estimate the vector of locations. The
framework in which we carry out the statistical inference is the multivariate location model
which is similar to the location model of Chapter 1.
For simplicity and convenience, we will often discuss the bivariate case. The k-dimensional
results will often be obvious changes in notation. Suppose that $X_1, \ldots, X_n$ are iid random vectors with $X_i^T = (X_{i1}, X_{i2})$. In this chapter, $T$ denotes transpose and we reserve prime for differentiation. We assume that $X$ has an absolutely continuous distribution with cdf $F(s - \theta_1, t - \theta_2)$ and pdf $f(s - \theta_1, t - \theta_2)$. We also assume that the marginal distributions are absolutely continuous. The vector $\theta = (\theta_1, \theta_2)^T$ is the location vector.

Definition 6.1.1. Distribution models for bivariate data. Let $F(s,t)$ be a prototype cdf; then the underlying model will be a shifted version: $H(s,t) = F(s - \theta_1, t - \theta_2)$.
The following models will be used throughout this chapter.

1. We say the distribution is symmetric when $X$ and $-X$ have the same distribution or $f(s,t) = f(-s,-t)$. This is sometimes called diagonal symmetry. The vector $(0,0)^T$ is the center of symmetry of $F$ and the location functionals all equal the center of symmetry. Unless stated otherwise, we will assume symmetry throughout this chapter.

2. The distribution has spherical symmetry when $\Gamma X$ and $X$ have the same distribution, where $\Gamma$ is an orthogonal matrix. The pdf has the form $g(\|x\|)$, where $\|x\| = (x^T x)^{1/2}$ is the Euclidean norm of $x$. The contours of the density are circular.

3. In an elliptical model the pdf has the form $|\det \Sigma|^{-1/2} g(x^T \Sigma^{-1} x)$, where $\det$ denotes determinant and $\Sigma$ is a symmetric, positive definite matrix. The contours of the density are ellipses.

4. A distribution is directionally symmetric if $X/\|X\|$ and $-X/\|X\|$ have the same distribution.
Note that elliptical symmetry implies symmetry, which in turn implies directional symmetry. In an elliptical model, the contours of the density are elliptical and if $\Sigma$ is the identity matrix then we have a spherically symmetric distribution. An elliptical distribution can be transformed into a spherical one by a transformation of the form $Y = DX$, where $D$ is a nonsingular matrix. Along with various models, we will encounter various transformations in this chapter. The following definition summarizes the transformations.
Definition 6.1.2. Data transformations.

(a) $Y = \Gamma X$ is an orthogonal transformation when the matrix $\Gamma$ is orthogonal. These transformations include rotations and reflections of the data.

(b) $Y = AX + b$ is called an affine transformation when $A$ is a nonsingular matrix and $b$ is any vector of real numbers.

(c) When the matrix $A$ in (b) is diagonal, we have a special affine transformation called a scale and location transformation.

(d) Suppose $t(X)$ represents one of the above transformations of the data. Let $\hat{\theta}(t(X))$ denote the estimator computed from the transformed data. Then we say the estimator is equivariant if $\hat{\theta}(t(X)) = t(\hat{\theta}(X))$. Let $V(t(X))$ denote a test statistic computed from the transformed data. We say the test statistic is invariant when $V(t(X)) = V(X)$.
Recall that Hotelling's $T^2$ statistic is given by
$$ T^2 = n \bar{X}^T S^{-1} \bar{X}, $$
where $S$ is the sample covariance matrix. In Exercise 6.8.1, the reader is asked to show that the vector of sample means is affine equivariant and Hotelling's $T^2$ test statistic is affine invariant.
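As a quick sketch (ours, not from the text), $T^2$ and a normal-theory $F$-based $p$-value can be computed for an $n \times k$ data matrix as follows; the $F$ scaling $[(n-k)/((n-1)k)]T^2$ is the one used later in Example 6.2.1.

    ## Sketch: Hotelling's T^2 for H_0: theta = 0, with its F-based p-value.
    hotelling_T2 <- function(X) {
      n <- nrow(X); k <- ncol(X)
      xbar <- colMeans(X)
      T2 <- n * drop(t(xbar) %*% solve(cov(X)) %*% xbar)
      Fstat <- (n - k) / ((n - 1) * k) * T2
      c(T2 = T2, p.value = 1 - pf(Fstat, k, n - k))
    }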
As in the earlier chapters, we begin with a criterion function or with a set of estimating equations. To fix the ideas, suppose that we wish to estimate $\theta$ or test the hypothesis $H_0: \theta = 0$ and we are given a pair of estimating equations:
$$ S(\theta) = \begin{pmatrix} S_1(\theta) \\ S_2(\theta) \end{pmatrix} = 0; \qquad (6.1.1) $$
see Example 6.1.1 for three criterion functions. We now list the usual set of assumptions that we have been using throughout the book. These assumptions guarantee that the estimating equations are Pitman regular in the sense of Definition 1.5.3, so that we can define the estimate and test and develop the necessary asymptotic distribution theory. It will often be convenient to suppose that the true value of $\theta$ is $0$, which we can do without loss of generality.
Definition 6.1.3. Pitman Regularity conditions.

(a) The components of $S(\theta)$ should be nonincreasing functions of $\theta_1$ and $\theta_2$.

(b) $E_0(S(0)) = 0$.

(c) $\frac{1}{\sqrt{n}} S(0) \stackrel{D_0}{\to} Z \sim N_2(0, A)$.

(d) $\sup_{\|b\| \le B} \left\| \frac{1}{\sqrt{n}} S\!\left(\frac{1}{\sqrt{n}} b\right) - \frac{1}{\sqrt{n}} S(0) + \mathbf{B} b \right\| \stackrel{P}{\to} 0$.
The matrix $A$ in (c) is the asymptotic covariance matrix of $\frac{1}{\sqrt{n}} S(0)$ and the matrix $\mathbf{B}$ in (d) can be computed in various ways, depending on when differentiation and expectation can be interchanged. We list the various computations of $\mathbf{B}$ for completeness. Note that $\nabla$ denotes differentiation with respect to the components of $\theta$:
$$ \mathbf{B} = -E_0\Big[\nabla \tfrac{1}{n} S(\theta)\Big]\Big|_{\theta=0} = \nabla E_\theta\Big[\tfrac{1}{n} S(0)\Big]\Big|_{\theta=0} = E_0\big[(-\nabla \log f(X))\,\Psi^T(X)\big], \qquad (6.1.2) $$
where $\nabla \log f(X)$ denotes the vector of partial derivatives of $\log f(X)$ and $\Psi(\cdot)$ is such that
$$ \frac{1}{\sqrt{n}} S(\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \Psi(X_i - \theta) + o_p(1). $$
Brown (1985) proved a multivariate counterpart to Theorem 1.5.6. We state it next and
refer the reader to the paper for the proof.
Theorem 6.1.1. Suppose conditions (a)-(c) of Definition 6.1.3 hold. Suppose further that $\mathbf{B}$ is given by the second expression in (6.1.2) and is positive definite. If, for any $b$,
$$ \mathrm{trace}\left\{ n\,\mathrm{cov}\left[ \frac{1}{n} S\!\left(\frac{1}{\sqrt{n}} b\right) - \frac{1}{n} S(0) \right]\right\} \to 0, $$
then (d) of Definition 6.1.3 also holds.
The estimate of $\theta$ is, of course, the solution of the estimating equations, denoted $\hat{\theta}$. Conditions (a) and (b) make this reasonable. To test the hypothesis $H_0: \theta = 0$ versus $H_A: \theta \ne 0$, we reject the null hypothesis when
$$ \frac{1}{n} S^T(0)\hat{A}^{-1}S(0) \ge \chi_\alpha^2(2), $$
where $\chi_\alpha^2(2)$ is the upper $\alpha$ percentile of a chisquare distribution with 2 degrees of freedom. Note that $\hat{A} \to A$, in probability, and typically $\hat{A}$ will be a simple moment estimator of $A$. Condition (c) implies that this is an asymptotically size $\alpha$ test.
With condition (d) we can determine the asymptotic distribution of the estimate and the asymptotic local power of the test; hence, asymptotic efficiencies can be computed. We can determine the quantity that corresponds to the efficacy in the univariate case described in Section 1.5.2 of Chapter 1. We do this next before discussing specific estimating equations. The following proposition follows at once from the assumptions.
Theorem 6.1.2. Suppose conditions (a)-(d) in Definition 6.1.3 are satisfied, $\theta = 0$ is the true parameter value, and $\theta_n = \gamma/\sqrt{n}$ for some fixed vector $\gamma$. Further, $\hat{\theta}$ is the solution of the estimating equations. Then

1. $\sqrt{n}\,\hat{\theta} = \mathbf{B}^{-1}\frac{1}{\sqrt{n}}S(0) + o_p(1) \stackrel{D_0}{\to} Z \sim MVN(0, \mathbf{B}^{-1}A\mathbf{B}^{-1})$;

2. $\frac{1}{n}S^T(0)A^{-1}S(0) \stackrel{D_{\theta_n}}{\to} \chi^2(2, \gamma^T \mathbf{B} A^{-1} \mathbf{B}\gamma)$,

where $\chi^2(a, b)$ is noncentral chisquare with $a$ degrees of freedom and noncentrality parameter $b$.
Proof: Part 1 follows immediately from condition (d) and the fact that $\hat{\theta} \to 0$ in probability; see Theorem 1.5.7. Part 2 follows by observing (see Theorem 1.5.8) that
$$ P_{\theta_n}\!\left[\frac{1}{n}S^T(0)A^{-1}S(0) \ge \chi_\alpha^2(2)\right] = P_0\!\left[\frac{1}{n}S^T(-\theta_n)A^{-1}S(-\theta_n) \ge \chi_\alpha^2(2)\right] $$
and, from (d),
$$ \frac{1}{\sqrt{n}}S(-\theta_n) = \frac{1}{\sqrt{n}}S(0) + \mathbf{B}\gamma + o_p(1) \stackrel{D_0}{\to} Z \sim MVN(\mathbf{B}\gamma, A). $$
Hence, we have a noncentral chisquare limiting distribution for the quadratic form. Note that the influence function of $\hat{\theta}$ is $\Omega(x) = \mathbf{B}^{-1}\Psi(x)$ and we say $\hat{\theta}$ has bounded influence provided $\|\Omega(x)\|$ is bounded.
Definition 6.1.4. Estimation Efficiency. The efficiency of a bivariate estimator can be measured using the Wilks generalized variance, defined to be the determinant of the covariance matrix of the estimator: $\sigma_1^2\sigma_2^2(1-\rho_{12}^2)$, where $((\rho_{ij}\sigma_i\sigma_j))$ is the covariance matrix of the bivariate vector of estimates. The estimation efficiency of $\hat{\theta}_1$ relative to $\hat{\theta}_2$ is the square root of the reciprocal ratio of the generalized variances.

This means that the asymptotic covariance matrix, given by $B^{-1}AB^{-1}$, of the more efficient estimator will be small in the sense of generalized variance. See Bickel (1964) for further discussion of efficiency in the multivariate case.
Definition 6.1.5. Test Efficiency. When comparing two tests based on $S_1$ and $S_2$, since the asymptotic local power is an increasing function of the noncentrality parameter, we define the test efficiency as the ratio of the respective noncentrality parameters.
In the bivariate case, we have $\gamma^T B_1 A_1^{-1} B_1 \gamma$ divided by $\gamma^T B_2 A_2^{-1} B_2 \gamma$ and, unlike the estimation case, the test efficiency may depend on the direction $\gamma$ along which we approach the origin; see Theorem 6.1.2. Hence, we note that, unlike the univariate case, the testing and estimation efficiencies are not necessarily equal. Bickel (1965) shows that the ratio of noncentrality parameters can be interpreted as the limiting ratio of sample sizes needed for the same asymptotic level and same asymptotic power along the same sequence of alternatives, as in the Pitman efficiency used throughout this book. We can see that $BA^{-1}B$ should be large just as $B^{-1}AB^{-1}$ should be small. In the next section we consider how to set up the estimating equations and consider what sort of estimates and tests result. We will be in a position to compute the efficiency of the estimates and tests relative to the traditional least squares estimates and tests. First we list three important criterion functions and their associated estimating equations. Other criterion functions will be introduced in later sections.
Example 6.1.1. Three criterion functions.

We now introduce three criterion functions that, in turn, produce estimating equations through differentiation. One of the criterion functions will generate the vector of means, the $L_2$ or least squares estimates. The other two criterion functions will generate different versions of what may be considered $L_1$ estimates or bivariate medians. The two types of medians differ in their equivariance properties. See Small (1990) for an excellent review of multidimensional medians. The vector of means is equivariant under affine transformations of the data; see Exercise 6.8.1.
The three criterion functions are:
$$ D_1(\theta) = \sqrt{\sum_{i=1}^n \big[(x_{i1}-\theta_1)^2 + (x_{i2}-\theta_2)^2\big]} \qquad (6.1.3) $$
$$ D_2(\theta) = \sum_{i=1}^n \sqrt{(x_{i1}-\theta_1)^2 + (x_{i2}-\theta_2)^2} \qquad (6.1.4) $$
$$ D_3(\theta) = \sum_{i=1}^n \big[\,|x_{i1}-\theta_1| + |x_{i2}-\theta_2|\,\big] \qquad (6.1.5) $$
In each of these criterion functions we have pushed the square root operation deeper into the expression. As we will see, this produces very different types of estimates. We now take the gradients of these criterion functions and display the corresponding estimating functions. The computation of these gradients is given in Exercise 6.8.2.
$$ S_1(\theta) = [D_1(\theta)]^{-1} \begin{pmatrix} \sum (x_{i1}-\theta_1) \\ \sum (x_{i2}-\theta_2) \end{pmatrix} \qquad (6.1.6) $$
$$ S_2(\theta) = \sum_{i=1}^n \|x_i - \theta\|^{-1} \begin{pmatrix} x_{i1}-\theta_1 \\ x_{i2}-\theta_2 \end{pmatrix} \qquad (6.1.7) $$
$$ S_3(\theta) = \begin{pmatrix} \sum \mathrm{sgn}(x_{i1}-\theta_1) \\ \sum \mathrm{sgn}(x_{i2}-\theta_2) \end{pmatrix} \qquad (6.1.8) $$
In (6.1.7), if the vector $x_i - \theta$ is zero, then we take the corresponding term in the summation to be zero also. In Exercise 6.8.3 the reader is asked to verify that $S_2(\theta) = S_3(\theta)$ in the univariate case; hence, we already see something new in the structure of the bivariate location model over the univariate location model. On the other hand, $S_1(\theta)$ and $S_3(\theta)$ are componentwise equations, unlike $S_2(\theta)$ in which the two components are entangled. The solution to (6.1.8) is the vector of medians, and the solution to (6.1.7) is the spatial median, which is discussed in Section 6.3. We will begin with an analysis of componentwise estimating equations and then consider other types.

Sections 6.2.3 through 6.4.4 deal with one sample estimates and tests based on vector signs and ranks. Both rotational and affine invariant/equivariant methods are developed. Two and several sample models are treated in Section 6.6 as examples of location models. In Section 6.6 we will be primarily concerned with componentwise methods.
6.2 Componentwise Methods
Note that $S_1(\theta)$ and $S_3(\theta)$ are of the general form
$$ S(\theta) = \begin{pmatrix} \sum \psi(x_{i1}-\theta_1) \\ \sum \psi(x_{i2}-\theta_2) \end{pmatrix}, \qquad (6.2.1) $$
where $\psi(t) = t$ or $\mathrm{sgn}(t)$ for (6.1.6) and (6.1.8), respectively. We need to find the matrices $A$ and $B$ in Definition 6.1.3. It is straightforward to verify that, when the true value of $\theta$ is $0$,
$$ A = \begin{pmatrix} E\psi^2(X_{11}) & E\psi(X_{11})\psi(X_{12}) \\ E\psi(X_{11})\psi(X_{12}) & E\psi^2(X_{12}) \end{pmatrix}, \qquad (6.2.2) $$
and, from (6.1.2),
$$ B = \begin{pmatrix} E\psi'(X_{11}) & 0 \\ 0 & E\psi'(X_{12}) \end{pmatrix}. \qquad (6.2.3) $$
Provided that $A$ is positive definite, the multivariate central limit theorem implies that condition (c) in Definition 6.1.3 is satisfied for the componentwise estimating functions. In the case that $\psi(t) = \mathrm{sgn}(t)$, we use the second representation in (6.1.2). The estimating functions in (6.2.1) are examples of M-estimating functions; see Maronna, Martin and Yohai (2006).
Example 6.2.1. Pulmonary Measurements on Workers Exposed to Cotton Dust.

In this example we extend the discussion to $k = 3$ dimensions. The data consist of $n = 12$ trivariate ($k = 3$) observations on workers exposed to cotton dust. The measurements in Table 6.2.1 are changes in measurements of pulmonary functions: FVC (forced vital capacity), FEV$_3$ (forced expiratory volume), and CC (closing capacity); see Merchant et al. (1975).
Table 6.2.1: Changes in Pulmonary Function after Six Hours of Exposure to Cotton Dust
Subject   FVC    FEV_3   CC
1 .11 .12 4.3
2 .02 .08 4.4
3 .02 .03 7.5
4 .07 .19 .30
5 .16 .36 5.8
6 .42 .49 14.5
7 .32 .48 1.9
8 .35 .30 17.3
9 .10 .04 2.5
10 .01 .02 5.6
11 .10 .17 2.2
12 .26 .30 5.5
Let $\theta^T = (\theta_1, \theta_2, \theta_3)$ and consider $H_0: \theta = 0$ versus $H_A: \theta \ne 0$. First we compute the componentwise sign test. In (6.2.1) take $\psi(x) = \mathrm{sgn}(x)$; then $n^{-1/2}S_3^T = n^{-1/2}(6, 6, 2)$ and the estimate of $A = \mathrm{Cov}(n^{-1/2}S_3)$ is
$$ \hat{A} = \frac{1}{n}\begin{pmatrix} n & \sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i2} & \sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i3} \\ \sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i2} & n & \sum \mathrm{sgn}\,x_{i2}\,\mathrm{sgn}\,x_{i3} \\ \sum \mathrm{sgn}\,x_{i1}\,\mathrm{sgn}\,x_{i3} & \sum \mathrm{sgn}\,x_{i2}\,\mathrm{sgn}\,x_{i3} & n \end{pmatrix} = \frac{1}{12}\begin{pmatrix} 12 & 8 & 4 \\ 8 & 12 & 0 \\ 4 & 0 & 12 \end{pmatrix}. $$
Here the diagonal elements are $\sum_i \mathrm{sgn}^2(X_{is}) = n$ and the off-diagonal elements are values of the statistics $\sum_i \mathrm{sgn}(X_{is})\mathrm{sgn}(X_{it})$. Hence, the test statistic is $n^{-1}S_3^T\hat{A}^{-1}S_3 = 3.667$, and using $\chi^2(3)$, the approximate $p$-value is 0.299; see Section 6.2.2.
We can also consider the finite sample conditional distribution in which sign changes are generated with a binomial with $n = 12$ and $p = .5$; see the discussion in Section 6.2.2. Again note that the signs of all components of the observation vector are either changed or not. The matrix $\hat{A}$ remains unchanged, so it is simple to generate many values of $n^{-1}S_3^T\hat{A}^{-1}S_3$. Out of 2500 values we found 704 greater than or equal to 3.667; hence, the randomization or sign change $p$-value is approximately $704/2500 = 0.282$, quite close to the asymptotic approximation. At any rate, we fail to reject $H_0: \theta = 0$ at any reasonable level.
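These computations are simple to reproduce; the following R sketch (ours, not from the text) carries out both the asymptotic sign test and the sign-change randomization just described, with sign changes attached to whole observation vectors so that $\hat{A}$ stays fixed.

    ## Sketch: componentwise sign test with asymptotic and sign-change p-values;
    ## X is the n x k matrix of observations (here n = 12, k = 3).
    sign_test <- function(X, nsim = 2500) {
      n <- nrow(X); S <- sign(X)
      S3 <- colSums(S)
      Ahat <- crossprod(S) / n                 # diag = 1, off-diag as in text
      stat <- drop(t(S3) %*% solve(Ahat) %*% S3) / n
      sims <- replicate(nsim, {
        b <- sample(c(-1, 1), n, replace = TRUE)   # flip entire vectors
        Sb <- colSums(b * S)
        drop(t(Sb) %*% solve(Ahat) %*% Sb) / n
      })
      c(stat = stat, p.asympt = 1 - pchisq(stat, ncol(X)),
        p.sign.change = mean(sims >= stat))
    }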
Further, Hotelling's $T^2 = n\bar{X}^T S^{-1}\bar{X} = 14.02$ with a $p$-value of 0.051, based on the $F$-distribution for $[(n-p)/((n-1)p)]T^2$ with 3 and 9 degrees of freedom. Hence, Hotelling's $T^2$ is significant at approximately 0.05.

Figure 6.2.1 provides boxplots for the data and componentwise normal $q$-$q$ plots. These boxplots suggest that any differences will be due to the upward shift in the CC distribution. The normal $q$-$q$ plot of the component CC shows two outlying values on the right side. In the case of the componentwise Wilcoxon test, Section 6.2.3, we consider $(n+1)S_4(0)$ in (6.2.14) along with $(n+1)^2\hat{A}$, essentially in (6.2.15). For the pulmonary function data
Figure 6.2.1: Panel A: Boxplots of the changes in pulmonary function for the Cotton Dust Data. Note that the responses have been standardized by componentwise standard deviations; Panel B: normal $q$-$q$ plot for the component FVC, original scale; Panel C: normal $q$-$q$ plot for the component FEV3, original scale; Panel D: normal $q$-$q$ plot for the component CC, original scale.
$(n+1)S_4^T(0) = (63, 52, 28)$ and
$$ (n+1)^2\hat{A} = \frac{1}{n}\begin{pmatrix} 649 & 620.5 & 260.5 \\ 620.5 & 649.5 & 141.5 \\ 260.5 & 141.5 & 650 \end{pmatrix}. $$
The diagonal elements are $\sum_i R^2(|X_{is}|)$, which should be $\sum_i i^2 = 650$ but differ for the first two components due to ties among the absolute values. The off-diagonal elements are $\sum_i R(|X_{is}|)R(|X_{it}|)\,\mathrm{sgn}(X_{is})\,\mathrm{sgn}(X_{it})$. The test statistic is then $n^{-1}S_4^T(0)\hat{A}^{-1}S_4(0) = 7.82$. From the $\chi^2(3)$ distribution, the approximate $p$-value is 0.0498. Hence, the Wilcoxon test rejects the null hypothesis at essentially the same level as Hotelling's $T^2$ test.
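For completeness, here is a sketch (ours) of the componentwise Wilcoxon signed-rank computation just illustrated; ranks of absolute values carry the signs, and the $(n+1)$ factors cancel in the quadratic form.

    ## Sketch: componentwise Wilcoxon signed-rank test for an n x k matrix X.
    wilcoxon_sr_test <- function(X) {
      n <- nrow(X)
      SR <- apply(X, 2, function(x) rank(abs(x)) * sign(x))  # R(|x|) sgn(x)
      S4 <- colSums(SR)                # equals (n + 1) S_4(0)
      Ahat <- crossprod(SR) / n        # equals (n + 1)^2 Ahat
      stat <- drop(t(S4) %*% solve(Ahat) %*% S4) / n
      c(stat = stat, p.value = 1 - pchisq(stat, ncol(X)))
    }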
In the construction of tests we generally must estimate the matrix $A$. When testing $H_0: \theta = 0$ the question arises as to whether or not we should center the data using $\hat{\theta}$. If we do not center, then we are using a reduced model estimate of $A$; otherwise, it is a full model estimate. Reduced model estimates are generally used in randomization tests. In this case, generally, $\hat{A}$ must only be computed once in the process of randomizing and recomputing the test statistic $n^{-1}S^T\hat{A}^{-1}S$. Note also that when $H_0: \theta = 0$ is true, $\hat{\theta} \stackrel{P}{\to} 0$. Hence, the centered $\hat{A}$ is valid under $H_0$. When estimating the asymptotic $\mathrm{Cov}(\hat{\theta})$, $B^{-1}AB^{-1}$, we should center $\hat{A}$ because we no longer assume that $H_0$ is true.
6.2.1 Estimation
Let $\theta = (\theta_1, \theta_2)^T$ denote the true vector of location parameters. Then, when (6.1.2) holds, the asymptotic covariance matrix in Theorem 6.1.2 is
$$ B^{-1}AB^{-1} = \begin{pmatrix} \dfrac{E\psi^2(X_{11}-\theta_1)}{[E\psi'(X_{11}-\theta_1)]^2} & \dfrac{E\psi(X_{11}-\theta_1)\psi(X_{12}-\theta_2)}{E\psi'(X_{11}-\theta_1)\,E\psi'(X_{12}-\theta_2)} \\[2ex] \dfrac{E\psi(X_{11}-\theta_1)\psi(X_{12}-\theta_2)}{E\psi'(X_{11}-\theta_1)\,E\psi'(X_{12}-\theta_2)} & \dfrac{E\psi^2(X_{12}-\theta_2)}{[E\psi'(X_{12}-\theta_2)]^2} \end{pmatrix} \qquad (6.2.4) $$
Now Theorem 6.1.2 can be applied for various M-estimates to establish asymptotic normality. Our interest is in the comparison of $L_2$ and $L_1$ estimates and we now turn to that discussion. In the case of $L_2$ estimates, corresponding to $S_1(\theta)$, we take $\psi(t) = t$. In this case, $\theta$ in expression (6.2.4) is the vector of means. Then it is easy to see that $B^{-1}AB^{-1}$ is equal to the covariance matrix of the underlying model, say $\Sigma_f$. In applications, $\theta$ is estimated by the vector of component sample means. For the standard errors of these estimates, the vector of componentwise sample means replaces $\theta$ in expression (6.2.4) and the expected values are replaced by the corresponding sample moments. Then it is easy to see that the estimate of $B^{-1}AB^{-1}$ is equal to the traditional sample covariance matrix.
In the first $L_1$ case, corresponding to $S_3(\theta)$, we take $\psi(t) = \mathrm{sgn}(t)$ and find, using the second representation in (6.1.2), that
$$ B^{-1}AB^{-1} = \begin{pmatrix} \dfrac{1}{4f_1^2(0)} & \dfrac{E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)}{4f_1(0)f_2(0)} \\[2ex] \dfrac{E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)}{4f_1(0)f_2(0)} & \dfrac{1}{4f_2^2(0)} \end{pmatrix}, \qquad (6.2.5) $$
where $f_1$ and $f_2$ denote the marginal pdfs of the joint pdf $f(s,t)$ and $\theta_1$ and $\theta_2$ denote the componentwise medians. In applications, the estimate of $\theta$ is the vector of componentwise sample medians, which we denote by $(\hat{\theta}_1, \hat{\theta}_2)^T$. For inference, an estimate of the asymptotic covariance matrix (6.2.5) is required. An estimate of $E\,\mathrm{sgn}(X_{11}-\theta_1)\,\mathrm{sgn}(X_{12}-\theta_2)$ is the simple moment estimator $n^{-1}\sum \mathrm{sgn}(x_{i1}-\hat{\theta}_1)\,\mathrm{sgn}(x_{i2}-\hat{\theta}_2)$. The estimators discussed in Section 1.5.5, (1.5.28), can be used to estimate the scale parameters $1/2f_1(0)$ and $1/2f_2(0)$.
We now turn to the efficiency of the vector of sample medians with respect to the vector of sample means. Assume for each component that the median and mean are the same and that, without loss of generality, their common value is 0. Let $\pi = \det(B^{-1}AB^{-1}) = \det(A)/[\det(B)]^2$ be the Wilks generalized variance of $\sqrt{n}\,\hat{\theta}$ in Definition 6.1.4. For the vector of means we have $\pi = \sigma_1^2\sigma_2^2(1-\rho^2)$, the determinant of the underlying variance-covariance matrix. For the vector of sample medians we have
$$ \pi = \frac{1 - (E\,\mathrm{sgn}X_{11}\,\mathrm{sgn}X_{12})^2}{16 f_1^2(0) f_2^2(0)} $$
and the efficiency of the vector of medians with respect to the vector of means is given by
$$ e(\mathrm{med}, \mathrm{mean}) = 4\sigma_1\sigma_2 f_1(0) f_2(0) \sqrt{\frac{1-\rho^2}{1 - [E\,\mathrm{sgn}X_{11}\,\mathrm{sgn}X_{12}]^2}}. \qquad (6.2.6) $$
Note that $E\,\mathrm{sgn}X_{11}\,\mathrm{sgn}X_{12} = 4P(X_{11} < 0, X_{12} < 0) - 1$. When the underlying distribution is bivariate normal with means 0, variances 1, and correlation $\rho$, Exercise 6.8.4 shows that
$$ P(X_{11} < 0, X_{12} < 0) = \frac{1}{4} + \frac{1}{2\pi}\sin^{-1}\rho. \qquad (6.2.7) $$
Further, the marginal distributions are standard normal; hence, (6.2.6) becomes
$$ e(\mathrm{med}, \mathrm{mean}) = \frac{2}{\pi}\sqrt{\frac{1-\rho^2}{1 - [(2/\pi)\sin^{-1}\rho]^2}}. \qquad (6.2.8) $$
The first factor $2/\pi \doteq .637$ is the univariate efficiency of the median relative to the mean when the underlying distribution is normal, and also the efficiency of the vector of medians relative to the vector of means when the correlation in the underlying model is zero. The second factor accounts for the bivariate structure of the model and, in general, depends on the correlation $\rho$. Some values of the efficiency are given in Table 6.2.2.
Table 6.2.2: Efficiency (6.2.8) of the vector of medians relative to the vector of means when the underlying distribution is bivariate normal.

rho   0    .1   .2   .3   .4   .5   .6   .7   .8   .9   .99
e     .64  .63  .63  .62  .60  .58  .56  .52  .47  .40  .22
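The entries of Table 6.2.2 are easy to reproduce numerically from (6.2.8); a one-line check in R (our sketch):

    ## Sketch: evaluate (6.2.8) at the rho values of Table 6.2.2.
    e_med_mean <- function(rho)
      (2 / pi) * sqrt((1 - rho^2) / (1 - ((2 / pi) * asin(rho))^2))
    round(e_med_mean(c(0, .1, .2, .3, .4, .5, .6, .7, .8, .9, .99)), 2)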
Clearly, as the elliptical contours of the underlying normal distribution flatten out, the efficiency of the vector of medians decreases. This is the first indication that the vector of medians is not affine (or even rotation) equivariant. The vector of means is affine equivariant and hence the dependency of the efficiency on $\rho$ must be due to the vector of medians. Indeed, Exercise 6.8.5 asks the reader to construct an example showing that when the axes are rotated the vector of means rotates into the new vector of means while the vector of medians fails to do so.
6.2.2 Testing
We now consider the properties of bivariate tests. Recall that we assume the underlying bivariate distribution is symmetric. In addition, we would generally use an odd $\psi$-function, so that $\psi(-t) = -\psi(t)$. This implies that $\psi(t) = \psi(|t|)\,\mathrm{sgn}(t)$, which will be useful shortly. Now referring to Theorem 6.1.2 along with the corresponding matrix $A$, the test of $H_0: \theta = 0$ vs $H_A: \theta \ne 0$ rejects the null hypothesis when $\frac{1}{n}S^T(0)A^{-1}S(0) \ge \chi_\alpha^2(2)$.
Note that the covariance term in $A$ is $E\psi(X_{11})\psi(X_{12}) = \iint \psi(s)\psi(t) f(s,t)\,ds\,dt$ and it depends upon the underlying bivariate distribution $f$. Hence, even the sign test based on the componentwise sign statistics $S_3(0)$ is not distribution free under the null hypothesis, as it is in the univariate case. In this case, $E\psi(X_{11})\psi(X_{12}) = 4P(X_{11} < 0, X_{12} < 0) - 1$, as we saw in the discussion of estimation.
To make the test operational we must estimate the components of $A$. Since they are expectations, we use moment estimates, under the null hypothesis. Now condition (c) in Definition 6.1.3 guarantees that the test with the estimated $A$ is asymptotically distribution free, since it has a limiting chisquare distribution, independent of the underlying distribution. What can we say about finite samples?

First note that
$$ S(0) = \begin{pmatrix} \sum \psi(|x_{i1}|)\,\mathrm{sgn}(x_{i1}) \\ \sum \psi(|x_{i2}|)\,\mathrm{sgn}(x_{i2}) \end{pmatrix}. \qquad (6.2.9) $$
Under the assumption of symmetry, $(x_1, \ldots, x_n)$ is a realization of $(s_1 x_1, \ldots, s_n x_n)$, where $(s_1, \ldots, s_n)$ is a vector of independent random variables each equalling $\pm 1$ with probabilities $1/2, 1/2$. Hence $Es_i = 0$ and $Es_i^2 = 1$. Condition on $(x_1, \ldots, x_n)$; then, under the null hypothesis, there are $2^n$ equally likely sign combinations associated with these vectors. Note that the sign changes attach to the entire vector. From (6.2.9), we see that, conditionally, the scores are not affected by the sign changes and $S(0)$ depends on the sign changes only through the signs of the components of the observation vectors. It follows at once that the conditional mean of $S(0)$ under the null hypothesis is $0$. Further, the conditional covariance matrix is given by
$$ \begin{pmatrix} \sum \psi^2(|x_{i1}|) & \sum \psi(|x_{i1}|)\psi(|x_{i2}|)\,\mathrm{sgn}(x_{i1})\,\mathrm{sgn}(x_{i2}) \\ \sum \psi(|x_{i1}|)\psi(|x_{i2}|)\,\mathrm{sgn}(x_{i1})\,\mathrm{sgn}(x_{i2}) & \sum \psi^2(|x_{i2}|) \end{pmatrix}. \qquad (6.2.10) $$
Note that, conditionally, $n^{-1}$ times this matrix is an estimate of the matrix $A$ above. Thus we have a conditionally distribution free sign change distribution. For small to moderate $n$ the test statistic (quadratic form) can be computed for each combination of signs, and a conditional $p$-value of the test is the number of values (divided by $2^n$) of the test statistic at least as large as the observed value of the test statistic. In the first chapter on univariate methods this argument also leads to unconditionally distribution free tests in the case of the univariate sign and rank tests, since in those cases the signs and the ranks do not depend on the values of the conditioning variables. Again, the situation is different in the bivariate case due to the matrix $A$, which must be estimated since it depends on the unknown underlying distribution. In Exercise 6.8.6 the reader is asked to construct the sign change distributions for some examples.
We now turn to a more detailed analysis of the tests based on $S_1 = S_1(0)$ and $S_3 = S_3(0)$. Recall that $S_1$ is the vector of sample means. The matrix $A$ is the covariance matrix of the underlying distribution and we take the sample covariance matrix as the natural estimate. The resulting test statistic is $n\bar{X}^T\hat{A}^{-1}\bar{X}$, which is Hotelling's $T^2$ statistic. Note for $T^2$, we typically use a centered estimate of $A$. If we want the randomization distribution, then we use the uncentered estimate. Since $BA^{-1}B = \Sigma_f^{-1}$, the inverse of the covariance matrix of the underlying distribution, the asymptotic noncentrality parameter for Hotelling's test is $\gamma^T \Sigma_f^{-1} \gamma$. The vector $S_3$ is the vector of component sign statistics. By inverting (6.2.5) we can write down the noncentrality parameter for the bivariate componentwise sign test.
To illustrate the efficiency of the bivariate sign test relative to Hotelling's test we simplify the structure as follows: assume that the marginal distributions are identical. Let $\delta = 4P(X_{11} < 0, X_{12} < 0) - 1$ and let $\rho$ denote the underlying correlation, as usual. Then Hotelling's noncentrality parameter is
$$ \frac{1}{\sigma^2(1-\rho^2)}\,\gamma^T \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}\gamma = \frac{\gamma_1^2 - 2\rho\gamma_1\gamma_2 + \gamma_2^2}{\sigma^2(1-\rho^2)}. \qquad (6.2.11) $$
Likewise, the noncentrality parameter for the bivariate sign test is
$$ \frac{4f^2(0)}{1-\delta^2}\,\gamma^T \begin{pmatrix} 1 & -\delta \\ -\delta & 1 \end{pmatrix}\gamma = \frac{4f^2(0)(\gamma_1^2 - 2\delta\gamma_1\gamma_2 + \gamma_2^2)}{1-\delta^2}. \qquad (6.2.12) $$
The efficiency of the bivariate sign test relative to Hotelling's test is the ratio of their respective noncentrality parameters:
$$ \frac{4f^2(0)\,\sigma^2(1-\rho^2)(\gamma_1^2 - 2\delta\gamma_1\gamma_2 + \gamma_2^2)}{(1-\delta^2)(\gamma_1^2 - 2\rho\gamma_1\gamma_2 + \gamma_2^2)}. \qquad (6.2.13) $$
Table 6.2.3: Minimum and maximum efficiencies of the bivariate sign test relative to Hotelling's $T^2$ when the underlying distribution is bivariate normal.

rho   0    .2   .4   .6   .8   .9   .99
min   .64  .58  .52  .43  .31  .22  .07
max   .64  .68  .71  .72  .72  .71  .66
There are three contributing factors in this efficiency: $4f^2(0)\sigma^2$, which is the univariate efficiency of the sign test relative to the $t$ test; $(1-\rho^2)/(1-\delta^2)$, due to the dependence structure in the bivariate distribution; and the final factor, which reflects the direction of approach of the sequence of alternatives. It is this last factor which separates the testing efficiency from the estimation efficiency. In order to see the effect of direction on the efficiency we will use the following result from matrix theory; see Graybill (1983).
Lemma 6.2.1. Suppose $D$ is a nonsingular, square matrix and $C$ is any square matrix, and suppose $\lambda_1$ and $\lambda_2$ are the minimum and maximum eigenvalues of $CD^{-1}$. Then
$$ \lambda_1 \le \frac{\gamma^T C\gamma}{\gamma^T D\gamma} \le \lambda_2. $$
The following proposition is left as Exercise 6.8.7.

Theorem 6.2.1. The efficiency $e(S_3, S_1)$ is bounded between the minimum and maximum of $4f^2(0)\sigma^2(1-\rho)/(1-\delta)$ and $4f^2(0)\sigma^2(1+\rho)/(1+\delta)$.
In Table 6.2.3 we give some values of the maximum and minimum efficiencies when the underlying distribution is bivariate normal with means 0, variances 1, and correlation $\rho$. This table can be compared to Table 6.2.2, which contains the corresponding estimation efficiencies. We have $f^2(0) = (2\pi)^{-1}$ and $\delta = (2/\pi)\sin^{-1}\rho$. Hence, the dependence of the efficiency on the direction determined by $\gamma$ is apparent. The examples involving the bivariate normal distribution also show the superiority of the vector of means over the vector of medians and of Hotelling's test over the bivariate sign test, as expected. Bickel (1964, 1965) gives a more thorough analysis of the efficiency for general models. He points out that when heavy tailed models are expected, then the medians and sign test will be much better, provided $\rho$ is not too close to $\pm 1$.
In the exercises the reader is asked to show that Hotelling's $T^2$ statistic is affine invariant. Thus the efficiency properties of this statistic do not depend on $\rho$. This means that the bivariate sign test cannot be affine invariant; again, this is developed in the exercises. It is now natural to inquire about the properties of the estimate and test based on $S_2$. This estimating function cannot be written in the componentwise form that we have been considering. Before we turn to this statistic, we consider estimates and tests based on componentwise ranking.
6.2.3 Componentwise Rank Methods
In this part we will sketch the results for the vector of Wilcoxon signed rank statistics discussed in Section 1.7 for each component. See Example 6.2.1 for an illustration of the calculations. In Section 6.6 we provide a full development of componentwise rank-based methods for location and regression models with examples. We let
$$ S_4(\theta) = \begin{pmatrix} \sum \dfrac{R(|x_{i1}-\theta_1|)}{n+1}\,\mathrm{sgn}(x_{i1}-\theta_1) \\[1.5ex] \sum \dfrac{R(|x_{i2}-\theta_2|)}{n+1}\,\mathrm{sgn}(x_{i2}-\theta_2) \end{pmatrix}. \qquad (6.2.14) $$
Using the projection method, Theorem 2.4.6, we have from Exercise 6.8.8, for the case $\theta = 0$,
$$ S_4(0) = \begin{pmatrix} \sum F_1^+(|x_{i1}|)\,\mathrm{sgn}(x_{i1}) \\ \sum F_2^+(|x_{i2}|)\,\mathrm{sgn}(x_{i2}) \end{pmatrix} + o_p(1) = \begin{pmatrix} 2\sum [F_1(x_{i1}) - 1/2] \\ 2\sum [F_2(x_{i2}) - 1/2] \end{pmatrix} + o_p(1), $$
where $F_j^+$ is the marginal distribution of $|X_{1j}|$ for $j = 1, 2$ and $F_j$ is the marginal distribution of $X_{1j}$ for $j = 1, 2$; see, also, Section A.2.3 of the Appendix. Symmetry of the marginal distributions is used in the computation of the projections. The conditions (a)-(d) of Definition 6.1.3 can now be verified for the projection, and then we note that the vector of rank statistics has the same asymptotic properties. We must identify the matrices $A$ and $B$ for the purposes of constructing the quadratic form test statistic, the asymptotic distribution of the vector of estimates, and the noncentrality parameter.
The first two conditions, (a) and (b), are easy to check, since the multivariate central limit theorem can be applied to the projection. Since, under the null hypothesis that $\theta = 0$, $F(X_{i1})$ has a uniform distribution on $(0,1)$, and introducing $\theta$ and differentiating with respect to $\theta_1$ and $\theta_2$, the matrices $A$ and $B$ are
$$ \frac{1}{n}A = \begin{pmatrix} \frac{1}{3} & \delta \\ \delta & \frac{1}{3} \end{pmatrix} \quad \mbox{and} \quad B = \begin{pmatrix} 2\int f_1^2(t)\,dt & 0 \\ 0 & 2\int f_2^2(t)\,dt \end{pmatrix}, \qquad (6.2.15) $$
where $\delta = 4\iint F_1(s)F_2(t)\,dF(s,t) - 1$. Hence, similar to the vector of sign statistics, the vector of Wilcoxon signed rank statistics also has a covariance which depends on the underlying bivariate distribution. We could construct a conditionally distribution free test but not an unconditionally distribution free one. Of course, the test is asymptotically distribution free.
A consistent estimate of the parameter $\delta$ in $A$ is given by
$$ \hat{\delta} = \frac{1}{n}\sum_{i=1}^n \frac{R_{i1}R_{i2}}{(n+1)(n+1)}\,\mathrm{sgn}(X_{i1})\,\mathrm{sgn}(X_{i2}), \qquad (6.2.16) $$
where $R_{it}$ is the rank of $|X_{it}|$ in the $t$th component among $|X_{1t}|, \ldots, |X_{nt}|$. This estimate is the conditional covariance and can be used in estimating $A$ in the construction of an asymptotically distribution free test; when we estimate the asymptotic covariance matrix of $\hat{\theta}$, we first center the data and then compute (6.2.16).
Table 6.2.4: Efficiencies of componentwise Wilcoxon methods relative to $L_2$ methods when the underlying distribution is bivariate normal.

rho   0    .2   .4   .6   .8   .9   .99
min   .96  .94  .93  .91  .89  .88  .87
max   .96  .96  .97  .97  .96  .96  .96
est   .96  .96  .95  .94  .93  .92  .91
The estimator that solves $S_4(\theta) = 0$ is the vector of Hodges-Lehmann estimates for the two components; that is, the vector of medians of Walsh averages for each component. Like the vector of medians, the vector of HL estimates is not equivariant under orthogonal transformations and the test is not invariant under these transformations. This will show up in the efficiency with respect to the $L_2$ methods, which are an equivariant estimate and an invariant test. Theorem 6.1.2 provides the asymptotic distribution of the estimator and the asymptotic local power of the test.
Suppose the underlying distribution is bivariate normal with means 0, variances 1, and correlation $\rho$; then the estimation and testing efficiencies are given by
$$ e(\mathrm{HL}, \mathrm{mean}) = \frac{3}{\pi}\sqrt{\frac{1-\rho^2}{1-9\delta^2}} \qquad (6.2.17) $$
$$ e(\mathrm{Wilcoxon}, \mathrm{Hotelling}) = \frac{3}{\pi}\,\frac{(1-\rho^2)}{(1-9\delta^2)}\,\frac{\gamma_1^2 - 6\delta\gamma_1\gamma_2 + \gamma_2^2}{\gamma_1^2 - 2\rho\gamma_1\gamma_2 + \gamma_2^2}. \qquad (6.2.18) $$
Exercise 6.8.9 asks the reader to apply Lemma 6.2.1 and show the testing efficiency is bounded between
$$ \frac{3(1+\rho)}{2[2\pi - 3\cos^{-1}(\rho/2)]} \quad \mbox{and} \quad \frac{3(1-\rho)}{2[3\cos^{-1}(\rho/2) - \pi]}. \qquad (6.2.19) $$
In Table 6.2.4 we provide some values of the minimum and maximum efficiencies as well as the estimation efficiency. Note how much more stable the rank methods are than the sign methods. Bickel (1964) points out, however, that when there is heavy contamination and $\rho$ is close to $\pm 1$, the estimation efficiency can be arbitrarily close to 0. Further, this efficiency can be arbitrarily large. This behavior is due to the fact that the sign and rank methods are not invariant and equivariant under orthogonal transformations, unlike the $L_2$ methods. Hence, we now turn to an analysis of the methods generated by $S_2(\theta)$. Additional material on the componentwise methods can be found in the papers of Bickel (1964, 1965) and the monograph by Puri and Sen (1971). The extension of the results to dimensions higher than two is straightforward and the formulas are obvious. One interesting question is how the efficiencies of the sign or rank methods relative to the $L_2$ methods depend on the dimension. See Section 6.6 and Davis and McKean (1993) for componentwise linear model rank-based methods.
6.3 Spatial Methods
6.3.1 Spatial Sign Methods
We are now ready to consider the estimate and test generated by $S_2(\theta)$; recall (6.1.4) and (6.1.7). This estimating function cannot be written in componentwise fashion because $\|x_i - \theta\|$ appears in both components. Note that $S_2(\theta) = \sum \|x_i - \theta\|^{-1}(x_i - \theta)$, a sum of unit vectors, so that the estimating function depends on the data only through the directions and not on the magnitudes of $x_i - \theta$, $i = 1, \ldots, n$. The vector $\|x\|^{-1}x$ is also called the spatial sign of $x$. It generalizes the notion of univariate sign: $\mathrm{sgn}(x) = |x|^{-1}x$. Hence, the test is sometimes called the angle test or spatial sign test and the estimate is called the spatial median; see Brown (1983). Milasevic and Ducharme (1987) show that the spatial median is always unique, unlike the univariate median. We will see that the test is invariant under orthogonal transformations and the estimate is equivariant under these transformations. Hence, the methods are rotation invariant and equivariant, properties suitable for methods used on spatial data. However, applications do not have to be confined to spatial data and we will consider these methods competitors to the other methods already discussed.
Following our pattern above, we first consider the matrices $A$ and $B$ in Definition 6.1.3. Suppose $\theta = 0$; then, since $S_2(0)$ is a sum of independent random variables, condition (c) is immediate with $A = E\|X\|^{-2}XX^T$ and the obvious estimate of $A$, under $H_0$, is
$$ \hat{A} = \frac{1}{n}\sum_{i=1}^n \|x_i\|^{-2} x_i x_i^T, \qquad (6.3.1) $$
which can be used to construct the spatial sign test statistic with
$$ \frac{1}{\sqrt{n}}S_2(0) \stackrel{D}{\to} N_2(0, A) \quad \mbox{and} \quad \frac{1}{n}S_2^T(0)\hat{A}^{-1}S_2(0) \stackrel{D}{\to} \chi^2(2). \qquad (6.3.2) $$
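In R the spatial sign test is only a few lines; this is a sketch of (6.3.1) and (6.3.2) (ours, not the SpatialNP implementation):

    ## Sketch: spatial sign test statistic n^{-1} S_2'(0) Ahat^{-1} S_2(0)
    ## for an n x k data matrix X.
    spatial_sign_test <- function(X) {
      n <- nrow(X)
      U <- X / sqrt(rowSums(X^2))     # spatial signs ||x_i||^{-1} x_i
      S2 <- colSums(U)
      Ahat <- crossprod(U) / n        # moment estimate (6.3.1)
      stat <- drop(t(S2) %*% solve(Ahat) %*% S2) / n
      c(stat = stat, p.value = 1 - pchisq(stat, ncol(X)))
    }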
In order to compute $B$, we first compute the partial derivatives; then we take the expectation. This yields
$$ B = E\left[\frac{1}{\|X\|}\left(I - \frac{1}{\|X\|^2}XX^T\right)\right], \qquad (6.3.3) $$
where $I$ is the identity matrix. Use a moment estimate for $B$, similar to the estimate of $A$.
where I is the identity matrix. Use a moment estimate for B similar to the estimate of A.
The spatial median is determined by

= Argmin
n

i=1
|x
i
| (6.3.4)
or as the solution to the estimating equations
S
2
() =
n

i=1
x
i

|x
i
|
= 0. (6.3.5)
The R package SpatialNP provides routines to compute the spatial median. Gower (1974)
calls the estimate the mediancentre and provides a Fortran program for its computation.
See Bedall and Zimmerman (1979) for a program in dimensions higher than 2. Further, for
higher dimensions see Mottonen and Oja (1995).
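A simple way to solve (6.3.5) is a Weiszfeld-type iteration: each step is a weighted mean with weights $1/\|x_i - \theta\|$. The sketch below is ours (the packages and programs cited above provide production routines) and assumes no data point coincides exactly with the current iterate.

    ## Sketch: Weiszfeld iteration for the spatial median of an n x k matrix X.
    spatial_median <- function(X, tol = 1e-8, maxit = 200) {
      theta <- colMeans(X)              # starting value
      for (it in 1:maxit) {
        w <- 1 / sqrt(rowSums(sweep(X, 2, theta)^2))   # 1 / ||x_i - theta||
        theta_new <- colSums(X * w) / sum(w)           # weighted mean step
        if (sqrt(sum((theta_new - theta)^2)) < tol) break
        theta <- theta_new
      }
      theta
    }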
We have the asymptotic representation
$$ \sqrt{n}\,\hat{\theta} = B^{-1}\frac{1}{\sqrt{n}}S_2(0) + o_p(1) \stackrel{D}{\to} N_2(0, B^{-1}AB^{-1}). \qquad (6.3.6) $$
Chaudhuri (1992) provides a sharper analysis for the remainder term in his Theorem 3.2. The consistency of the moment estimates of $A$ and $B$ is established rigorously in the linear model setting by Bai, Chen, Miao, and Rao (1990). Hence, we would use $\hat{A}$ and $\hat{B}$ computed from the residuals. Bose and Chaudhuri (1993) develop estimates of $A$ and $B$ that converge more quickly than the moment estimates. Bose and Chaudhuri provide a very interesting analysis of why it is easier to estimate the asymptotic covariance matrix of $\hat{\theta}$ than to estimate the asymptotic variance of the univariate median. Essentially, unlike the univariate case, we do not need to estimate the multivariate density at a point. It is left as an exercise to show that the estimate is equivariant and the test is invariant under orthogonal transformations of the data; see Exercise 6.8.13.
Example 6.3.1. Cork Borings Data

We consider a well known example due to Rao (1948) of testing whether the weight of cork borings on trees is independent of the directions: North, South, East and West. In this case we have four measurements on each tree and we wish to test the equality of marginal locations: $H_0: \theta_N = \theta_S = \theta_E = \theta_W$. This is a common hypothesis in repeated measures designs. See Jan and Randles (1996) for an excellent discussion of issues in repeated measures designs. We reduce the data to trivariate vectors via $N-E$, $E-S$, $S-W$. Then we test $\theta = 0$ where $\theta^T = (\theta_N - \theta_E, \theta_E - \theta_S, \theta_S - \theta_W)$. Table 6.3.1 displays the original $n = 28$ four component data vectors.
In Table 6.3.2 we display the data differences $N-E$, $E-S$, and $S-W$, along with the unit spatial sign vectors $\|x\|^{-1}x$ for each data point. Note that, except for rounding error, the sum of squares in each row is 1 for the spatial sign vectors.
We compute the spatial sign statistic to be $S_2^T = (7.78, -4.99, 6.65)$ and, from (6.3.1),
$$ \hat{A} = \begin{pmatrix} .2809 & -.1321 & -.0539 \\ -.1321 & .3706 & -.0648 \\ -.0539 & -.0648 & .3484 \end{pmatrix}. $$
Then $n^{-1}S_2^T(0)\hat{A}^{-1}S_2(0) = 14.74$, which yields an asymptotic $p$-value of .002, using a $\chi^2$ approximation with 3 degrees of freedom. Hence, we easily reject $H_0: \theta = 0$ and conclude that boring size depends on direction.
Table 6.3.1: Weight of Cork Borings (in Centigrams) in Four Directions for 28 Trees
N E S W N E S W
72 66 76 77 91 79 100 75
60 53 66 63 56 68 47 50
56 57 64 58 79 65 70 61
41 29 36 38 81 80 68 58
32 32 35 36 78 55 67 60
30 35 34 26 46 38 37 38
39 39 31 27 39 35 34 37
42 43 31 25 32 30 30 32
37 40 31 25 60 50 67 54
33 29 27 36 35 37 48 39
32 30 34 28 39 36 39 31
63 45 74 63 50 34 37 40
54 46 60 52 43 37 39 50
47 51 52 43 48 54 57 43
For estimation we return to the original component data. Since we have rejected the null hypothesis of equality of locations, we want to estimate the four components of the location vector: $\theta^T = (\theta_1, \theta_2, \theta_3, \theta_4)$. The spatial median solves $S_2(\theta) = 0$, and we find $\hat{\theta}^T = (45.38, 41.54, 43.91, 41.03)$. For comparison, the mean vector is $(50.54, 46.18, 49.68, 45.18)^T$. These computations can be performed using the R package SpatialNP. The issue of how to apply rank methods in repeated measures designs has an extensive literature. In addition to Jan and Randles (1996), Kepner and Robinson (1988) and Akritas and Arnold (1994) discuss the use of rank transforms and pure ranks for testing hypotheses in repeated measures designs. The Friedman test, Exercise 4.8.19, can also be used for repeated measures designs.
Efficiency for Spherical Distributions
Expressions for $A$ and $B$ can be simplified and the computation of efficiencies made easier if we transform to polar coordinates. We write
$$ x = r\begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix} = r s\begin{pmatrix} \cos\eta \\ \sin\eta \end{pmatrix}, \qquad (6.3.7) $$
where $r = \|x\| \ge 0$, $0 \le \varphi < 2\pi$, and $s = \pm 1$ depending on whether $x$ is above or below the horizontal axis, with $0 \le \eta < \pi$. The second representation is similar to (6.2.9) and is useful in the development of the conditional distribution of the test under the null hypothesis. Hence
$$ S_2(0) = \sum s_i\begin{pmatrix} \cos\eta_i \\ \sin\eta_i \end{pmatrix}, \qquad (6.3.8) $$
Table 6.3.2: Each row is a data vector for $N-E$, $E-S$, $S-W$ along with the components of the spatial sign vector.

Row   N-E   E-S   S-W   S_1    S_2    S_3
1 6 -10 -1 0.51 -0.85 -0.09
2 7 -13 3 0.46 -0.86 0.20
3 -1 -7 6 -0.11 -0.75 0.65
4 12 -7 -2 0.85 -0.50 -0.14
5 0 -3 -1 0.00 -0.95 -0.32
6 -5 1 8 -0.53 0.11 0.84
7 0 8 4 0.00 0.89 0.45
8 -1 12 6 -0.07 0.89 0.45
9 -3 9 6 -0.27 0.80 0.53
10 4 2 -9 0.40 0.19 -0.90
11 2 -4 6 0.27 -0.53 0.80
12 18 -29 11 0.50 -0.80 0.31
13 8 -14 8 0.44 -0.78 0.44
14 -4 -1 9 -0.40 -0.10 0.91
15 12 -21 25 0.34 -0.60 0.71
16 -12 21 -3 -0.49 0.86 -0.12
17 14 -5 9 0.81 -0.29 0.52
18 1 12 10 0.06 0.77 0.64
19 23 -12 7 0.86 -0.44 0.26
20 8 1 -1 0.98 0.12 -0.12
21 4 1 -3 0.78 0.20 -0.59
22 2 0 -2 0.71 0.00 -0.71
23 10 -17 13 0.42 -0.72 0.55
24 -2 -11 9 -0.14 -0.77 0.63
25 3 -3 8 0.33 -0.33 0.88
26 16 -3 -3 0.97 -0.18 -0.18
27 6 -2 -11 0.47 -0.16 -0.87
28 -6 -3 14 -0.39 -0.19 0.90
where $\eta_i$ is the angle measured counterclockwise between the positive horizontal axis and the line through $x_i$ extending indefinitely through the origin, and $s_i$ indicates whether the observation is above or below the axis. Under the null hypothesis $\theta = 0$, $s_i = \pm 1$ with probabilities $1/2, 1/2$, and $s_1, \ldots, s_n$ are independent. Thus, we can condition on $\eta_1, \ldots, \eta_n$ to get a conditionally distribution free test. The conditional covariance matrix is
$$ \sum_{i=1}^n \begin{pmatrix} \cos^2\eta_i & \cos\eta_i\sin\eta_i \\ \cos\eta_i\sin\eta_i & \sin^2\eta_i \end{pmatrix}, \qquad (6.3.9) $$
and this is used in the quadratic form with $S_2(0)$ to construct the test statistic; see Mottonen and Oja (1995, Section 2.1).
To consider the asymptotically distribution free version of this test we use the form
$$ S_2(0) = \sum \begin{pmatrix} \cos\varphi_i \\ \sin\varphi_i \end{pmatrix}, \qquad (6.3.10) $$
where, recall, $0 \le \varphi_i < 2\pi$, and the multivariate central limit theorem implies that $\frac{1}{\sqrt{n}}S_2(0)$ has a limiting bivariate normal distribution with mean $0$ and covariance matrix $A$. We now translate $A$ and its estimate into polar coordinates:
$$ A = E\begin{pmatrix} \cos^2\varphi & \cos\varphi\sin\varphi \\ \cos\varphi\sin\varphi & \sin^2\varphi \end{pmatrix} \quad \mbox{and} \quad \hat{A} = \frac{1}{n}\sum_{i=1}^n \begin{pmatrix} \cos^2\varphi_i & \cos\varphi_i\sin\varphi_i \\ \cos\varphi_i\sin\varphi_i & \sin^2\varphi_i \end{pmatrix}. \qquad (6.3.11) $$
Hence, $\frac{1}{n}S_2^T(0)\hat{A}^{-1}S_2(0) \ge \chi_\alpha^2(2)$ is an asymptotically size $\alpha$ test.
We next express $B$ in terms of polar coordinates:
$$ B = E\,r^{-1}\left[I - \begin{pmatrix} \cos^2\varphi & \cos\varphi\sin\varphi \\ \cos\varphi\sin\varphi & \sin^2\varphi \end{pmatrix}\right] = E\,r^{-1}\begin{pmatrix} \sin^2\varphi & -\cos\varphi\sin\varphi \\ -\cos\varphi\sin\varphi & \cos^2\varphi \end{pmatrix}. \qquad (6.3.12) $$
Hence, $\sqrt{n}$ times the spatial median is limiting bivariate normal with asymptotic covariance matrix equal to $B^{-1}AB^{-1}$. The corresponding noncentrality parameter of the noncentral chisquare limiting distribution of the test is $\gamma^T BA^{-1}B\gamma$. We are now in a position to evaluate the efficiency of the spatial median and the spatial sign test with respect to the mean vector and Hotelling's test under various model assumptions. The following result is basic and is derived in Exercise 6.8.10.
Theorem 6.3.1. Suppose the underlying distribution is spherically symmetric so that the joint density is of the form $f(x) = h(\|x\|)$. Let $(r, \varphi)$ be the polar coordinates. Then $r$ and $\varphi$ are stochastically independent, the pdf of $\varphi$ is uniform on $(0, 2\pi]$, and the pdf of $r$ is $g(r) = 2\pi r f(r)$, for $r > 0$.

Theorem 6.3.2. If the underlying distribution is spherically symmetric, then the matrices $A = (1/2)I$ and $B = [(E\,r^{-1})/2]I$. Hence, under the null hypothesis, the test statistic $n^{-1}S_2^T(0)A^{-1}S_2(0)$ is distribution free over the class of spherically symmetric distributions.
Proof. First note that
$$ E\cos\varphi\sin\varphi = \frac{1}{2\pi}\int \cos\varphi\sin\varphi\,d\varphi = 0. $$
Then note that
$$ E\,r^{-1}\cos\varphi\sin\varphi = E\,r^{-1}\,E\cos\varphi\sin\varphi = 0. $$
Finally note, $E\cos^2\varphi = E\sin^2\varphi = 1/2$.

We can then compute $B^{-1}AB^{-1} = [2/(E\,r^{-1})^2]I$ and $BA^{-1}B = [(E\,r^{-1})^2/2]I$. This implies that the generalized variance of the spatial median and the noncentrality parameter of the angle sign test are given by $\det B^{-1}AB^{-1} = [2/(E\,r^{-1})^2]^2$ and $[(E\,r^{-1})^2/2]\gamma^T\gamma$. Notice that the efficiencies relative to the mean and Hotelling's test are now equal and independent of the direction. Recall, for the mean vector and $T^2$, that $A = 2^{-1}E(r^2)I$, $\det B^{-1}AB^{-1} = [2^{-1}E(r^2)]^2$, and $\gamma^T BA^{-1}B\gamma = [2/E(r^2)]\gamma^T\gamma$. This is because both the spatial $L_1$ methods and the $L_2$ methods are equivariant and invariant with respect to orthogonal (rotations and reflections) transformations. Hence, we see that the efficiency
$$ e(\mathrm{spatial}\ L_1, L_2) = \frac{1}{4}E(r^2)\,[E(r^{-1})]^2. \qquad (6.3.13) $$
If, in addition, we assume the underlying distribution is spherical normal (bivariate normal with means 0 and identity covariance matrix), then $E\,r^{-1} = \sqrt{\pi/2}$, $E\,r^2 = 2$, and $e(\mathrm{spatial}\ L_1, L_2) = \pi/4 \doteq .785$. Hence, the spatial $L_1$ methods based on $S_2(\theta)$ are more efficient relative to the $L_2$ methods at the spherical normal model than the componentwise $L_1$ methods (.637) discussed in Section 6.2.3.
In Exercise 6.8.12 the reader is asked to show that the efficiency of the spatial $L_1$ methods relative to the $L_2$ methods with a $k$-variate spherical model is given by
$$ e_k(\mathrm{spatial}\ L_1, L_2) = \left(\frac{k-1}{k}\right)^2 E(r^2)\,[E(r^{-1})]^2. \qquad (6.3.14) $$
When the $k$-variate spherical model is normal, the exercise shows that $E\,r^{-1} = \Gamma[(k-1)/2]/[\sqrt{2}\,\Gamma(k/2)]$, with $\Gamma(1/2) = \sqrt{\pi}$. Table 6.3.3 gives some values for this efficiency as a function of dimension. Hence, we see that the efficiency increases with dimension. This suggests that the spatial methods are superior to the componentwise $L_1$ methods, at least for spherical models.
Efficiency for Elliptical Distributions

We need to consider what happens to the efficiency when the model is elliptical but not spherical. Since the methods that we are considering are equivariant and invariant to rotations, we can eliminate the correlation from the elliptical model with a rotation, but then the variances are typically not equal. Hence, we study, without loss of generality, the efficiency
Table 6.3.3: Efficiency as a function of dimension for a $k$-variate spherical normal model.

k                      2      4      6
e(spatial L_1, L_2)    0.785  0.884  0.920
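These entries follow directly from (6.3.14), using $E(r^2) = k$ at the $k$-variate spherical normal; a numerical check in R (our sketch):

    ## Sketch: evaluate (6.3.14) at the spherical normal, where E(r^2) = k and
    ## E(r^{-1}) = Gamma((k-1)/2) / (sqrt(2) Gamma(k/2)).
    e_k <- function(k) {
      Erinv <- gamma((k - 1) / 2) / (sqrt(2) * gamma(k / 2))
      ((k - 1) / k)^2 * k * Erinv^2
    }
    round(sapply(c(2, 4, 6), e_k), 3)   # 0.785 0.884 0.920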
when the underlying model has unequal variances but covariance 0. Now the $L_2$ methods are affine equivariant and invariant, but the spatial $L_1$ methods are not scale equivariant and invariant (hence not affine equivariant and invariant); hence, the efficiency will be a function of the underlying variances.

The computations are now more difficult. To fix the ideas, suppose the underlying model is bivariate normal with means 0, variances 1 and $\lambda^2$, and covariance 0. If we let $X$ and $Z$ denote iid $N(0,1)$ random variables, then the model distribution is that of $X$ and $Y = \lambda Z$. Note that $W = Z/X$ has a standard Cauchy distribution. Now we are ready to determine the matrices $A$ and $B$.
First, by symmetry, we have $E\cos\varphi\sin\varphi = E[XY/(X^2+Y^2)] = 0$ and $E\,r^{-1}\cos\varphi\sin\varphi = E[XY/(X^2+Y^2)^{3/2}] = 0$; hence, the matrices $A$ and $B$ are diagonal. Next, $\cos^2\varphi = X^2/[X^2+\lambda^2 Z^2] = 1/[1+\lambda^2 W^2]$, so we can use the Cauchy density to compute the expectation. Using the method of partial fractions:
$$ E\cos^2\varphi = \int \frac{1}{(1+\lambda^2 w^2)}\,\frac{1}{\pi(1+w^2)}\,dw = \frac{1}{1+\lambda}. $$
Hence, $E\sin^2\varphi = \lambda/(1+\lambda)$. The next two formulas are given by Brown (1983) and are derivable by several steps of partial integration:
$$ E\,r^{-1} = \sqrt{\frac{\pi}{2}}\,\sum_{j=0}^{\infty}\left[\frac{(2j)!}{2^{2j}(j!)^2}\right]^2 (1-\lambda^2)^j, $$
$$ E\,r^{-1}\cos^2\varphi = \frac{1}{2}\sqrt{\frac{\pi}{2}}\,\sum_{j=0}^{\infty}\frac{(2j+2)!\,(2j)!}{2^{4j+1}(j!)^2[(j+1)!]^2}\,(1-\lambda^2)^j, $$
and
$$ E\,r^{-1}\sin^2\varphi = E\,r^{-1} - E\,r^{-1}\cos^2\varphi. $$
Thus $A = \mathrm{diag}[(1+\lambda)^{-1}, \lambda(1+\lambda)^{-1}]$ and the distribution of the test statistic, even under the normal model, depends on $\lambda$. The formulas can be used to compute the efficiency of the spatial $L_1$ methods relative to the $L_2$ methods; numerical values are given in Table 6.3.4. The dependency of the efficiency on $\lambda$ reflects the dependency of the efficiency on the underlying correlation which is present prior to rotation.

Hence, just as the componentwise $L_1$ methods have decreasing efficiency as a function of the underlying correlation, the spatial $L_1$ methods have decreasing efficiency as a function of the ratio of underlying variances. It should be emphasized that the spatial methods are most appropriate for spherical models, where they have equivariance and invariance properties. The
Table 6.3.4: Efficiencies of spatial $L_1$ methods relative to the $L_2$ methods for a bivariate normal model with means 0, variances 1 and $\lambda^2$, and 0 correlation, the elliptical case.

lambda                 1      .8     .6     .4     .2     .05    .01
e(spatial L_1, L_2)    0.785  0.783  0.773  0.747  0.678  0.593  0.321
componentwise methods, although equivariant and invariant under scale transformations of the components, cannot tolerate changes in correlation. See Mardia (1972) and Fisher (1987, 1993) for further discussion of spatial methods. In higher dimensions, Mardia refers to the angle test as Rayleigh's test; see Section 9.3.1 of Mardia (1972). Mottonen and Oja (1995) extend the spatial median and the spatial sign test to higher dimensions. See Table 6.3.6 below for efficiencies relative to Hotelling's test for higher dimensions and for a multivariate $t$ underlying distribution. Note that for higher dimensions and lower degrees of freedom, the spatial sign test is superior to Hotelling's $T^2$.
6.3.2 Spatial Rank Methods
Spatial Signed Rank Test
Mottonen and Oja (1995) develop the concept of an orthogonally invariant rank vector. Hence, rather than use the univariate concept of rank in the construction of a test, they define a spatial rank vector that has both magnitude and direction. This problem is delicate since there is no inherently natural way to order or rank vectors.
We must first review the relationship between sign, rank, and signed rank. Recall the norm, (1.3.17) and (1.3.21), that was used to generate the Wilcoxon signed rank statistic. Further, recall that the second term in the norm was the basis, in Section 2.2.2, for the Mann-Whitney-Wilcoxon rank sum statistic. We reverse this approach here and show how the one sample signed rank statistic based on ranks of the absolute values can be developed from the ranks of the data. This will provide the motivation for a one sample spatial signed rank statistic.
Let $x_1, \ldots, x_n$ be a univariate sample. Then $2[R_n(x_i) - (n+1)/2] = \sum_j \mathrm{sgn}(x_i - x_j)$. Thus the centered rank is constructed from the signs of the differences. Now, to construct a one sample statistic, we introduce the reflections $-x_1, \ldots, -x_n$ and consider the centered rank of $x_i$ among the $2n$ combined observations and their reflections. The subscript $2n$ indicates that the reflections are included in the ranking:
$$ 2\left[R_{2n}(x_i) - \frac{2n+1}{2}\right] = \sum_j \mathrm{sgn}(x_i - x_j) + \sum_j \mathrm{sgn}(x_i + x_j) = [2R_n(|x_i|) - 1]\,\mathrm{sgn}(x_i); \qquad (6.3.15) $$
see Exercise 6.8.14. Hence, ranking observations in the combined observations and reflections is essentially equivalent to ranking the absolute values $|x_1|, \ldots, |x_n|$. In this way, one sample methods can be developed from two sample methods.
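Identity (6.3.15) is easily checked numerically; a quick sketch in R (ours, not from the text), for a sample without ties:

    ## Sketch: numerical check of (6.3.15).
    set.seed(1)
    x <- rnorm(8); n <- length(x)
    lhs <- 2 * (rank(c(x, -x))[1:n] - (2 * n + 1) / 2)
    rhs <- (2 * rank(abs(x)) - 1) * sign(x)
    all.equal(lhs, rhs)   # TRUE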
Mottonen and Oja (1995) use this approach to develop a one sample spatial signed rank statistic. The key is the expression $\mathrm{sgn}(x_i - x_j) + \mathrm{sgn}(x_i + x_j)$, which requires only the concept of sign, not rank. Hence, we must find the appropriate extension of sign to two dimensions. In one dimension, $\mathrm{sgn}(x) = |x|^{-1}x$ can be thought of as a unit vector pointing in the positive or negative direction toward $x$.

Likewise, $u(x) = \|x\|^{-1}x$ is a unit vector in the direction of $x$. Hence, as in the previous section, we take $u(x)$ to be the vector spatial sign. The vector centered spatial rank of $x_i$ is then $R(x_i) = \sum_j u(x_i - x_j)$. Thus, the vector spatial signed rank statistic is
$$ S_5(0) = \sum_i \sum_j \big[u(x_i - x_j) + u(x_i + x_j)\big]. \qquad (6.3.16) $$
This is also the sum of the centered spatial ranks of the observations when ranked in the combined observations and their reflections. Note that $u(x_i - x_j) = -u(x_j - x_i)$, so that $\sum\sum u(x_i - x_j) = 0$ and the statistic can be computed from
$$ S_5(0) = \sum_i \sum_j u(x_i + x_j), \qquad (6.3.17) $$
which is the direct analog of (1.3.24).
We now develop a conditional test by conditioning on the data $x_1, \ldots, x_n$. From (6.3.16) we can write
$$ S_5(0) = \sum_i r^+(x_i), \qquad (6.3.18) $$
where $r^+(x) = \sum_j [u(x - x_j) + u(x + x_j)]$. Now it is easy to see that $r^+(-x) = -r^+(x)$. Under the null hypothesis of symmetry about $0$, we can think of $S_5(0)$ as a realization of $\sum_i b_i r^+(x_i)$, where $b_1, \ldots, b_n$ are iid variables with $P(b_i = +1) = P(b_i = -1) = 1/2$. Hence, $Eb_i = 0$ and $\mathrm{var}(b_i) = 1$. This means that, conditional on the data,
i
= 0 and var(b
i
) = 1. This means that, conditional on the data,
ES
5
(0) = 0 and

A =

Cov
_
1
n
3/2
S
5
(0)
_
=
1
n
3
n

i=1
(r
+
(x
i
))(r
+
(x
i
))
T
. (6.3.19)
The approximate size conditional test of H
0
: = 0 versus H
A
: ,= 0 rejects H
0
when
1
n
3
S
T
5

A
1
S
5

2

(2) , (6.3.20)
where
2

(2) is the upper percentile from a chisquare distribution with 2 degrees of freedom.
Note that the extension to higher dimensions is done in exactly the same way. See Chaudhuri
(1992) for rigorous asymptotics.
Example 6.3.2. Cork Borings, Example 6.3.1 continued
6.3. SPATIAL METHODS 375
Table 6.3.5: Each row is a spatial signed rank vector for the data dierences in Table 6.3.2.
Row SR1 SR2 SR3 Row SR1 SR2 SR3
1 0.28 -0.49 -0.07 15 0.30 -0.54 0.69
2 0.28 -0.58 0.12 16 -0.40 0.73 -0.07
3 -0.09 -0.39 0.31 17 0.60 -0.14 0.39
4 0.58 -0.29 -0.11 18 0.10 0.56 0.49
5 -0.03 -0.20 -0.07 19 0.77 -0.34 0.22
6 -0.28 0.07 0.43 20 0.48 0.10 -0.03
7 0.07 0.43 0.23 21 0.26 0.08 -0.16
8 0.01 0.60 0.32 22 0.12 0.00 -0.11
9 -0.13 0.46 0.34 23 0.32 -0.58 0.48
10 0.23 0.13 -0.49 24 -0.14 -0.53 0.42
11 0.12 -0.20 0.33 25 0.19 -0.12 0.45
12 0.46 -0.76 0.28 26 0.73 -0.07 -0.14
13 0.30 -0.56 0.34 27 0.31 -0.12 -0.58
14 -0.22 -0.05 0.49 28 -0.30 -0.14 0.67
We use the spatial signed-rank method ( 6.3.20) to test the hypothesis. Table 6.3.5 provides
the vector signed-ranks, r
+
(x
i
) dened in expression ( 6.3.18).
Then S
T
5
(0) = (4.94, 2.90, 5.17),
n
3

A
1
=
_
_
.1231 .0655 .0050
.0655 .1611 .0373
.0050 .0373 .1338
_
_
,
and n
1
S
T
5
(0)

A
1
S
5
(0) = 11.19 with an approximate p-value of 0.011 based on a
2
-
distribution with 3 degrees of freedom. The Hodges-Lehmann estimate of , which solves
S
5
()
.
= 0, is computed to be

T
= (49.30, 45.07, 48.90, 44.59).
Eciency
The test in ( 6.3.20) can be developed from the point of view of asymptotic theory and
the eciency can be computed. The computations are quite involved. The multivariate t
distributiuons provide both a range of tailweights and a range of dimensions. A summary of
these eciencies is found in Table 6.3.6; see Mottonen, Oja and Tienari (1997) for details.
The Mottonen and Oja (1995) test eciency increases with the dimension; see especially,
the circular normal case. The eciency begins at .95 and increases! The eciency also
increases with tailweight, as expected. This strongly suggests that the Mottonen and Oja
approach is an excellent way to extend the idea of signed rank from the univariate case. See
Example 6.6.2 for a discussion of the two sample spatial rank test.
376 CHAPTER 6. MULTIVARIATE
Table 6.3.6: The row labeled Spatial SR are the asymptotic eciencies of multivariate spatial
signed-rank test, ( 6.3.20), relative to Hotellings test under the multivariate t distribution;
the eciencies for the spatial sign test, ( 6.3.2), are given in the rows labeled Spatial Sign.
Degress of Freedom
Dimension Test 3 4 6 8 10 15 20
1 Spatial SR 1.90 1.40 1.16 1.09 1.05 1.01 1.00 0.95
Spatial Sign 1.62 1.13 0.88 0.80 0.76 0.71 0.70 0.64
2 Spatial SR 1.95 1.43 1.19 1.11 1.07 1.03 1.01 0.97
Spatial Sign 2.00 1.39 1.08 0.98 0.93 0.88 0.85 0.79
3 Spatial SR 1.98 1.45 1.20 1.12 1.08 1.04 1.02 0.97
Spatial Sign 2.16 1.50 1.17 1.06 1.01 0.95 0.92 0.85
4 Spatial SR 2.00 1.46 1.21 1.13 1.09 1.04 1.025 0.98
Spatial Sign 2.25 1.56 1.22 1.11 1.05 0.99 0.96 0.88
6 Spatial SR 2.02 1.48 1.22 1.14 1.10 1.05 1.03 0.98
Spatial Sign 2.34 1.63 1.27 1.15 1.09 1.03 1.00 0.92
10 Spatial SR 2.05 1.49 1.23 1.14 1.10 1.06 1.04 0.99
Spatial Sign 2.42 1.68 1.31 1.19 1.13 1.06 1.03 0.95
Hodges-Lehmann Estimator
The estimator derived fromS
5
()
.
= 0 is the spatial median of the pairwise averages, a spatial
Hodges-Lehmann (1963) estimator. This estimator is studied in great detail by Chaudhuri
(1992). His paper contains a thorough review of multidimensional location estimates. He de-
velops a Bahadur representation for the estimate. From his Theorem 3.2, we can immediately
conclude that
1

= B
1
2

n
n(n 1)
n

i=1
n

j=1
u
_
1
2
(x
i
+x
j
)
_
+o
p
(1) (6.3.21)
where B
2
= E|x

|
1
(I |x

|
2
x

(x

)
T
) and x

=
1
2
(x
1
+ x
2
). Hence, the asymptotic
distribution of
1

is determined by that of n
3/2
S
5
(0). This leads to
1

D
N
2
(0, B
1
2
A
2
B
1
2
) , (6.3.22)
where A
2
= Eu(x
1
+x
2
)(u(x
1
+x
2
))
T
. Moment estimates of A
2
and B
2
can be used. In
fact the estimator

A, dened in expression ( 6.3.19), is a consistent estimate of A
2
. Bose
and Chaudhuri (1993) and Chaudhuri (1993) discuss renements in the estimation of A
2
and B
2
.
Choi and Marden (1997) extend these spatial rank methods to the two-sample model and
the one-way layout. They also consider tests for ordered alternatives; see, also, Oja (2010).
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 377
6.4 Ane Equivariant and Invariant Methods
6.4.1 Blumens Bivariate Sign Test
It is clear from Tables 6.3.4 and 6.3.6 of eciencies in the previous section that is desirable to
have robust sign and rank methods that are ane invariant and equivariant to compete with
LS methods. We begin with yet another representation of the estimating function S
2
(),
( 6.1.7). Let the ordered angles be given by 0
(1)
<
(2)
< . . . <
(n)
< and let
s
(i)
= 1 when the observation corresponding to
(i)
is above or below the horizontal axis.
Then we can write, as in expression ( 6.3.8),
S
2
() =
n

i=1
s
(i)
_
cos
(i)
sin
(i)
_
(6.4.1)
Now under the assumption of spherical symmetry,
(i)
is distributed as the ith order statistic
from the uniform distribution on [0, ) and, hence, E
(i)
= i/(n + 1), i = 1, . . . , n. Recall,
in the univariate case, if we believe that the underlying distribution is normal then we could
replace the data by the normal scores (expected values of the order statistics from a normal
distribution) in a signed rank statistic. The result is the distribution free normal scores test.
We will do the same thing here. We replace
(i)
by its expected value to construct a scores
statistic. Let
S
6
() =
n

i=1
s
(i)
_
cos
i
n+1
sin
i
n+1
_
=
n

i=1
s
i
_
cos
R
i
n+1
sin
R
i
n+1
_
(6.4.2)
where R
1
, . . . , R
n
are the ranks of the unordered angles
1
, . . . ,
n
. Note that s
1
, . . . , s
n
are
iid with P(s
i
= 1) = P(s
i
= 1) = 1/2 even if the underlying model is elliptical rather than
spherical. Since we now have constant vectors in S
6
(), it follows that the sign test based
on S
6
() is distribution free over the class of elliptical models. We look at the test in more
detail and consider the eciency of this sign test relative to Hotellings test. First, we have
immediately, under the null hypothesis, from the distribution of s
1
, . . . , s
n
that
cov
_
1

n
S
6
(0)
_
=
_
P
cos
2
[i/(n+1)]
n
P
cos[i/(n+1)] sin[i/(n+1)]
n P
cos[i/(n+1)] sin[i/(n+1)]
n
P
sin
2
[i/(n+1)]
n
_
A ,
where
A =
_
_
1
0
cos
2
tdt
_
1
0
cos t sin tdt
_
1
0
cos t sin tdt
_
1
0
sin
2
tdt
_
=
1
2
I ,
as n . So reject H
0
: = 0 if
2
n
S

6
(0)S
6
(0)
2

(2) for the asymptotic size distribution


free version of the test where,
2
n
S

6
(0)S
6
(0) =
2
n
_
_

s
(i)
cos
i
n + 1
_
2
+
_

s
(i)
sin
i
n + 1
_
2
_
. (6.4.3)
378 CHAPTER 6. MULTIVARIATE
This test is not ane invariant. Blumen (1958) created an asymptotically equivariant test
that is ane invariant. We can think of Blumens statistic as an elliptical scores version
of the angle statistic of Brown (1983). In (6.4.3) i/(n +1) is replaced by (i 1)/n. Blumen
rotated the axes so that
(1)
is equal to zero and the data point is on the horizontal axis.
Then the remaining scores are uniformly spaced. In this case, (i 1)/n is the conditional
expectation of
(i)
given
(1)
= 0. Estimation methods corresponding to Blumens test,
however, have not yet been developed.
To compute the eciency of Blumens test relative to Hotellings test we must com-
pute the noncentrality parameter of the limiting chisquare distribution. Hence, we must
compute BA
1
B and this leads us to B. Theorem 6.3.2 provides the matrices A and B for
the angle sign statistic when the underlying distribution is spherically symmetric. The fol-
lowing theorem shows that the ane invariant sign statistic has the same A and B matrices
as in Theorem 6.3.2 and they hold for all elliptical distributions. We discuss the implications
after the proof of the proposition.
Theorem 6.4.1. If the underlying distribution is elliptical, then corresponding to S
6
(0)
we have A =
1
2
I and B = (Er
1
/2)I. Hence, the eciency of Blumens test relative to
Hotellings test is e(S
6
, Hotelling) = E(r
2
)[E(r
1
]
2
/4 which is the same for all elliptical
models.
Proof. To prove this we show that under a spherical model the angle statistic S
2
(0) and
scores statistic S
6
(0) are asymptotically equivalent. Then S
6
(0) will have the same A and
B matrices as in Theorem 6.3.2. But since S
6
(0) leads to an ane invariant test statistic,
it follows that the same A and B continue to apply for elliptical models.
Recall that under the spherical model, s
(1)
, . . . , s
(n)
are iid with P(s
i
= 1) = P(s
i
=
1) = 1/2 random variables. Then we consider
1

n
n

i=1
s
(i)
_
cos
i
n+1
sin
i
n+1
_

n
n

i=1
s
(i)
_
cos
i
sin
i
_
=
1

n
n

i=1
s
(i)
_
cos
i
n+1
cos
(i)
sin
i
n+1
sin
(i)
_
We treat the two components separately. First

s
(i)
_
cos(
i
n + 1
_
cos
(i)
)

max
i

cos
_
i
n + 1
_
cos
(i)

s
(i)

The cdf of the uniform distribution on [0, ) is equal to t/ for 0 t < . Let G
n
(t) be the
empirical cdf of the angles
i
, i = 1, . . . , n. Then G
1
n
(
i
n+1
) =
(i)
and max
i
[
i
n+1

(i)
[
sup
t
[G
1
n
(t) t[ = sup
t
[G
n
(t) t[ 0 wp1 by the Glivenko-Cantelli Lemma. The result
now follows by using a linear approximation to cos(
i
n+1
) cos
(i)
and noting that the cos
and sin are bounded. The same argument applies to the second component. Hence, the
dierence of the two statistics are o
p
(1) and are asymptotically equivalent. The results for
the angle statistic now apply to S
6
(0) for a spherical model. The ane invariance extends
the result to an elliptical model.
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 379
The main implication of this proposition is that the eciency of the test based on S
6
(0)
relative to Hotellings test is /4 .785 for all bivariate normal models, not just the spherical
normal model. Recall that the test based on S
2
(0), the angle sign test, has eciency /4
only for the spherical normal and declining eciency for elliptical normal models. Hence,
we not only gain ane invariance but also have a constant, nondecreasing eciency.
Oja and Nyblom (1989) study a class of sign tests for the bivariate location problem. They
show that Blumens test is locally most powerful invariant for the entire class of elliptical
models. Ducharme and Milasevic (1987) dene a normalized spatial median as an estimate
of location of a spherical distribution. They construct a condence region for the modal
direction. These methods are resistant to outliers.
6.4.2 Ane Invariant Sign Tests in the Multivariate Case
Ane Invariant Sign Tests
Ane invariance is determined in the Blumen test statistic by rearranging the data axes to
be uniformly spaced scores. Further, note that the asymptotic covariance matrix A is (1/2)I,
where I is the identity. This the covariance matrix for a random vector that is uniformly
distributed on the unit circle. The equally spaced scores cannot be constructed in higher
dimensions. The approach taken here is due to Randles (2000) in which we seek a linear
transformation of the data that makes the data axes roughly equally spaced and the resulting
direction vectors will be roughly uniformly distributed on the unit sphere. We choose the
transformation so that the sample covariance matrix of the unit vectors of the transformed
data is that of a random vector uniformly distributed on the unit sphere. We then compute
the spatial sign test (6.3.2) on the transformed data. The result is an ane invariant test.
Let x
1
, ..., x
n
be a random sample of size n from a k-variate multivariate symmetric
distribution with symmetry center 0. Suppose for the moment that a nonsingular matrix
U
x
determined by the data, exists and satises
1
n
n

i=1
_
U
x
x
i
|U
x
x
i
|
__
U
x
x
i
|U
x
x
i
|
_
T
=
1
k
I. (6.4.4)
Hence, the unit vectors of the transformed data have covariance matrix equal to that of
a random vector uniformly distributed on the unit k sphere. Below we describe a simple
and fast way to compute U
x
for any dimension k. The test statistic in (6.4.4) computed on
the transformed data becomes
1
n
S
T
7

A
1
S
7
=
k
n
S
T
7
S
7
(6.4.5)
where
S
7
=
n

i=1
U
x
x
i
|U
x
x
i
|
(6.4.6)
380 CHAPTER 6. MULTIVARIATE
and

A in (6.3.1) becomes k
1
I because of the denition of U
x
in (6.4.4).
Theorem: Suppose n > k(k 1) and the underlying distribution is symmetric about
0. Then
k
n
S
T
7
S
7
in (6.4.5) is ane invariant and the limiting distribution, as n , is
chisquare with k degrees of freedom.
The following lemma will be helpful in the proof of the theorem. The lemmas proof
depends on a uniqueness result from Tyler (1987).
Lemma: Suppose n > k(k 1) and D is a xed, nonsingular transformation matrix.
Suppose U
x
and U
Dx
are dened by (6.4.4). Then
a. D
T
U
T
Dx
U
Dx
D = c
0
U
T
x
U
x
for some positive constant c
0
that may depend on D and
the data and
b. there exists an orthogonal matrix G such that

c
0
GU
x
= U
Dx
D.
Proof: Dene D

= U
x
D
1
then
1
n
n

i=1
_
U

Dx
i
|U

Dx
i
|
__
U

Dx
i
|U

Dx
i
|
_
T
=
1
n
n

i=1
_
U
x
x
i
|U
x
x
i
|
__
U
x
x
i
|U
x
x
i
|
_
T
=
1
k
I.
Tyler (1987) showed that the matrix U
Dx
dened from Dx
1
, ..., Dx
n
is unique up to a
positive constant. Hence, U
Dx
= aU

for some positive constant a. Hence,


U
T
Dx
U
Dx
= a
2
U
T
U

= a
2
(D
T
)
1
U
T
x
U
x
D
1
and D
T
U
t
Dx
U
Dx
D = a
2
U
T
x
U
x
which completes the proof of part a with c
0
= a
2
.
Dene G = c
1/2
0
U
Dx
DU
1
x
where c
0
comes from the lemma. Then, using part a, it
follows that G
T
G = I and G is orthogonal. Hence,
c
0
GU
x
= c
0
c
1/2
0
U
Dx
DU
1
x
U
x
= c
1/2
0
U
Dx
D
and part b follows.
Proof of ane invariance: Given D is a xed, nonsingular matrix, let y
i
= Dx
i
for
i = 1, ..., n. Then (6.4.6) becomes
S
D
7
=
n

i=1
U
Dx
Dx
i
|U
Dx
Dx
i
|
.
We will show that S
DT
7
S
D
7
= S
T
7
S
7
and hence does not depend on D.
Now, from the lemma,
U
Dx
Dx
|U
Dx
Dx|
=
c
1/2
0
GU
x
x
| c
1/2
0
GU
x
x |
= G
U
x
x
|U
x
x|
and
S
D
7
= G
n

i=1
U
x
x
i
|U
x
x
i
|
= GS
7
.
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 381
Hence, S
DT
7
S
D
7
= S
T
7
S
7
and the ane invariance follows from the orthogonal invariance
of S
T
7
S
7
.
Sketch of argument that the asymptotic distribution is chisquared with k
degrees of freedom. Tyler (1987) showed that there exists a unique upper triangular
matrix U

with upper left diagonal element equal to 1 and such that


E
_
_
U

X
|U

X|
__
U

X
|U

X|
_
T
_
=
1
k
I
and

n(U
x
U

) = O
p
(1). Theorem 6.1.2 implies that (k/n)S
T
7
S

7
is asymptotically
chisquared with k degrees of freedom where U

replaces U
x
in S
7
. But since U
x
and U

are close, (k/n)S


T
7
S

7
(k/n)S
T
7
S
7
= o
p
(1), the asymptotic distribution follows. See the
appendix in Randles (2000) for details.
We have assumed symmetry of the underlying multivariate distribution. The results
continue to hold with the weaker assumption of directional symmetry about 0 in which
X/|X| and X/|X| have the same distribution. In addition to the asymptotic distribution,
we can compute or approximate the conditional distribution (given the direction axes of the
data) of
k
n
S
T
7
S
7
under the assumption of directional symmetry by listing or sampling the 2
n
equi-likely values of
k
n
_
n

i=1

i
U
x
x
i
|U
x
x
i
|
_
T
_
n

i=1

i
U
x
x
i
|U
x
x
i
|
_
where
i
= 1 for i = 1, ..., n. Hence, it is straight forward to approximate the k value
of the test.
Computation of U
x
It remains to compute U
x
from the data x
1
, ..., x
n
. The following ecient iterative procedure
is due to Tyler (1987) who also shows the sequence of iterates converges when n > k(k 1).
We begin with
V
0
=
1
n
n

i=1
_
x
i
|x
i
|
__
x
i
|x
i
|
_
T
.
and U
0
= Chol (V
1
0
), where Chol (M) is the upper triangular Cholesky decomposition of the
positive denite matrix M divided by the upper left diagonal element of the upper triangular
matrix. This places a 1 as the rst element of the main diagonal and makes Chol (M) unique.
If |V
0
k
1
I| is suciently small (a prespecied tolerance) stop and take U
x
= U
0
. If
|V
0
k
1
I| is large, compute
V
1
=
1
n
n

i=1
_
U
0
x
i
|U
0
x
i
|
__
U
0
x
i
|U
0
x
i
|
_
T
.
382 CHAPTER 6. MULTIVARIATE
and compute U
1
= Chol (V
1
1
).
If |V
1
k
1
I| is suciently small stop and take U
x
= U
1
U
0.
If |V
1
k
1
I| is large
compute
V
2
=
1
n
n

i=1
_
U
1
U
0
x
i
|U
1
U
0
x
i
|
__
U
1
U
0
x
i
|U
1
U
0
x
i
|
_
T
.
and U
2
= Chol (V
1
2
). If |V
2
k
1
I| is suciently small, stop and take U
x
= U
2
U
1
U
0
.
If |V
2
k
1
I| is large compute V
3
and U
3
and proceed until |V
j
0
k
1
I| is suciently
small and take U
x
= U
j
0
U
jo2
...U
0
Ane Equivariant Median. We now turn to the problem of constructing an ane
equivariant estimate of the center of symmetry of the underlying distribution. Our goal is to
produce an estimate that is computationally ecient for large samples in any dimension, a
problem that plagued some earlier attempts; see Small (1990) for an overview of multivariate
medians. The estimate described below was proposed by Hettmansperger and Randles (2002)
and we refer to it as the HR estimate. The estimator

is chosen to be the solution of
1
n
n

i=1
U
x
(x
i
)
|U
x
(x
i
)|
= 0 (6.4.7)
in which U
x
is the k k upper triangular positive denite matrix, with a one in the
upper left position on the diagonal, chosen to satisfy
1
n
n

i=1
_
U
x
(x
i
)
|U
x
(x
i
)|
__
U
x
(x
i
)
|U
x
(x
i
)|
_
T
=
1
k
I. (6.4.8)
This is a transform-retransform estimate; see, for example, Chakraborty, Chaudhuri and
Oja (1998). The data are transformed using U
x
, and the estimate = U
x

is computed.
Then the estimate is retransformed back to the original scale

= U
1
x
. The simultaneous
solutions of (6.4.7) and (6.4.8) are M-estimates; see Section 6.5.4 for the explicit representa-
tion. It follows from this that the estimate is ane equivariant. It is also possible to directly
verify the ane equivariance.
The calculation of (U
x
,

) involves two routines. The rst routine nds the value that
solves (6.4.7) with U
x
xed. This is done by letting y
i
= U
x
x
i
and nding that solves
(y
i
)/ | y
i
|= 0. Hence, is the spatial median of y
1
, . . . , y
n
; see Section 6.3.1. The
solution to (6.4.7) is

= U
1
x
. The second routine then nds U
x
in (6.4.8) as described
above for the computation of U
x
for a xed value of with x
i
replaced by x
i

.
The calculation of (U
x
,

) alternates between these two routines until convergence. To
obtain starting values, let
0j
= x
j
. Use the second routine to obtain U
0j
for this value of
. The starting (
0j
, U
0j
) is the pair that minimizes, for j = 1, ..., n, the inner product
_
n

i=1
U
0j
(x
i

0j
)
|U
0j
(x
i

0j
)|
_
T
_
n

i=1
U
0j
(x
i

0j
)
|U
0j
(x
i

0j
)|
_
.
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 383
This starting procedure is used, since starting values need to be ane invariant and equiv-
ariant.
For a xed U
x
there exists a unique solution for , and for xed there exists a unique U
x
up to multiplicative constant. In simulations and calculations described in Hettmansperger
and Randles (2002) the alternating algorithm did not fail to converge. However, the equations
dening the simultaneous solution (U
x
,

) do not fully satisfy all conditions stated in the
literature for existence and uniqueness; see Maronna (1976), Tyler (1988), Kent and Tyler
(1991).
The asymptotic distribution theory developed in Hettmansperger and Randles (2002)
show that

is approximately multivariate normally distributed under the assumption of di-
rectional symmetry and, hence, symmetry. The asymptotic covariance matrix is complicated
and we recommend a bootstrap estimate of the covariance matrix of

.
The approach taken above is more general. If we begin with the orthogonally invariant
statistic in (6.3.2) and use a matrix U that satises the invariance property in part b of the
Lemma then the resulting statistic is ane invariant. For example we could take U to be the
inverse of the sample covariance matrix. This results in a test statistic studied by Hossjer
and Croux (1995). We prefer the more robust matrix U
x
proposed by Tyler (1987).
Example 6.4.1. Mathematics and Statistics Exam Scores
We now illustrate the one-sample ane invariant spatial sign test (6.4.5) and the ane
equivariant spatial median on a small data set. A major advantage of this method is the
speed of computation which allows for bootstrap estimates of the covariance matrix and
standard errors for the estimator. The data consists of 20 vectors, chosen at random from
a larger data set published in Mardia, Kent, and Bibby (1979). Each vector consists of four
components and records test scores in Mechanics, Vectors, Analysis, and Statistics. We wish
to test the hypothesis that there are no dierences among the examination topics. This
is a traditional hypothesis in repeated measures designs; see Jan and Randles (1996) for
a thorough discussion of this problem. Similar to our ndings above on eciencies, they
found that mulitivariate sign and signed rank tests were often superior to least squares in
robustness of level and eciency .
Table 6.4.1 provides the original quadrivariate data along with the trivariate data that re-
sult when the Statistics score is subtracted from the other three. We suppose that the trivari-
ate data are a sample of size 20 from a symmetric distribution with center = (
1
,
2
,
3
)
T
and we wish to test H
0
: = 0 versus H
A
: ,= 0. In Table 6.4.1 we have the HR estimates
(standard errors) and the tests for the ane spatial methods, Hotellings T
2
, and Ojas ane
methods described later in Section 6.4.3. The standard errors of the HR estimate are ob-
tained from a boostrap estimate of the covariance matrix. The following estimates are based
on 500 bootstrap resamples.
Cov (

) =
_
_
33.88 10.53 21.05
10.53 17.03 12.49
21.05 12.49 32.71
_
_
.
384 CHAPTER 6. MULTIVARIATE
Table 6.4.1: Test Score Data: Mechanics (M), Vectors (V ), Analysis (A), Statistics (S) and
dierences when Statistics is subtracted from the other three.
M V A S M S V S A S
59 70 62 56 3 14 6
52 64 63 54 -2 10 9
44 61 62 46 -2 15 16
44 56 61 36 8 20 25
30 69 52 45 -15 24 7
46 49 59 37 9 12 22
31 42 54 68 -37 -26 -14
42 60 49 33 9 27 16
46 52 41 40 6 12 1
49 49 48 39 10 10 9
17 53 43 51 -34 2 -8
37 56 28 45 -8 11 -17
40 43 21 61 -21 -18 -40
35 36 48 29 6 7 19
31 52 27 40 -9 12 -13
17 51 35 31 -14 20 4
49 50 23 9 40 41 14
8 42 26 40 -32 2 -14
15 38 28 17 -2 21 11
0 40 9 14 -14 26 -5
The standard errors in Table 6.4.1 are the squareroots of the main diagonal of this matrix.
The ane sign methods suggest that the major source of statistical signicance is the V
S dierence. In particular, Vector scores are higher than Statistics scores. A more convenient
comparision is achieved by estimating the locations in the four dimensional problem. We nd
the ane equivariant spatial median for M, V, A, S to be (along with bootstrap standard
errors) 36.54 (8.41), 53.04 (5.09), 44.28 (8.39), and 39.65 (7.06). This again reects the
signicant dierences between Vector scores and Statistics. In fact, it appears the Vector
exam was easiest while the other subjects are roughly equivalent.
An outlier was created in V by replacing the 70 (rst observation) by 0. The results
are shown in the lower part of Table 6.4.2. Note, in particular, unlike the robust methods,
the p-value for Hotellings T
2
test has shifted above 0.05 and, hence, would no longer be
considered signicant.
An ane Invariant Signed Rank Test and Ane Equivariant Estimate
The test statistic can be constructed in the same way that the ane invariant sign test was
constructed. We will sketch this development below. For a detailed and rigorous development
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 385
Table 6.4.2: Results for the original and contaminated test score data: mean of signed-rank
vectors, usual mean vectors, the Hodges-Lehmann estimate of ; results for the signed-rank
test ( 6.4.16) and Hotellings T
2
test
Test Asymp.
M S V S AS Statistic p-value
Original Data
HR Estimate 2.12 13.85 6.21
SE HR 5.82 4.13 5.72
Mean -4.95 12.10 2.40
SE Mean 4.07 3.33 3.62
Oja HL-est. -3.05 14.06 4.96
Ane Sign Test (6.4.5) 14.19 0.0027
Hotellings T
2
13.47 0.0037
Oja Signed rank (6.4.16) 14.07 0.0028
Contaminated Data
HR Estimate 2.92 12.83 6.90
SE HR 5.58 8.27 6.60
Mean Vector -4.95 8.60 2.40
Oja HL-estimate -3.90 12.69 4.64
Ane Sign Test (6.4.5) 10.76 0.0131
Hotellings T
2
6.95 .0736
Oja Signed rank (6.4.16) 10.09 0.0178
386 CHAPTER 6. MULTIVARIATE
see Oja (2010, Chapter 7) or Oja and Randles (2004). The spatial signed rank statistic is
given by S
5
in (6.3.19) along with the spatial signed rank covariance matrix, given in this
case by
1
n
n

i=1
r
+
(x
i
)r
+
(x
i
)
T
. (6.4.9)
Now suppose we can construct a matrix V
x
such that when x
i
is replaced by V
x
x
i
in
(6.4.9) we have
1
1
n

r
+
(V
x
x
i
)
T
r
+
(V
x
x
i
)
1
n

r
+
(V
x
x
i
)r
+
(V
x
x
i
)
T
=
1
k
I. (6.4.10)
The divisor in 6.4.10 is the average squared length of the signed rank vectors and is
needed to normalize (on average) the signed rank vectors. In the simpler sign vector case
n
1

[x
T
i
x
i
/ | x
i
|
2
] = 1. The normalized signed rank vectors now have roughly the same
covariance structure as vectors uniformly distributed on the unit k sphere. It is straight
forward to develop an iterative routine to compute V
x
in the same way we computed U
x
for
the sign statistic.
The signed rank test statistic developed from (6.3.22) is then
k
n
S
T
8
S
8
, (6.4.11)
where S
8
=

r
+
(V
x
x
i
). Again, it can be veried directly that this test statistic is ane
invariant. In addition, the p value of the test can be approximated using the chisquare
distribution with k degrees of freedom or by simulation, conditionally using the 2
n
equally
likely values of
k
n
_
n

i=1

T
i
r
+
(V
x
x
i
)
T
__
n

i=1

i
r
+
(V
x
x
i
)
_
with
i
= 1.
Recall that the Hodges-Lehmann estimate related to the spatial signed rank statistic is
the spatial median of the pairwise averages of the data vectors. This estimate is orthogonally
equivariant but not ane equivariant. We use the transformation-retransformation method.
We transform the data using V
x
to get y
i
= V
x
x
i
i = 1, ..., n and then compute the spatial
median of the pairwise averages (y
i
+y
j
)/2 which we denote by . Then we retransform it
back:

= V
1
x
. This estimate is now ane equivariant. Because of the complexity of the
asymptotic covariance matrix we recommend a bootstrap estimate of the covariance matrix
of

.
Eciency
Recall Table 6.3.6 which provides eciency values for either the spatial sign test or the spatial
signed rank test relative to Hotellings T
2
test. The calculations were made for the spherical
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 387
t distribution for various degrees of freedom and nally for the spherical normal distribution.
Now that we have ane invariant sign and signed rank tests and ane equivariant estimates
we can apply these eciency results to elliptical t and normal distributions. Hence, we again
see the superiority of the sign and signed rank methods over Hotellings test and the sample
mean. The ane invariant tests and ane equivariant estimates are ecient and robust
alternatives to the traditional least squares methods.
In the case of the ane invariant sign test, Randles (2000) presents a power sensitivity
simulation comparing his test to Hotellings T
2
test, Blumens test, and Ojas sign test
(6.4.14). In addition to the multivariate normal distribution, he included t distributions
and a skewed distribution. Randles ane invariant sign test performed extremely well.
Although Ojas sign test performed comparably, it is much more computationally intensive
than Randles test.
6.4.3 The Oja Criterion Function
This method provides a direct approach to ane invariance equivariance and does not require
a transform-retransform technique. It is, however, much more computationally intensive.
We will only sketch the results in this section and give references where the more detailed
derivations can be found. Recall from the univariate location model that L
1
and L
2
are
special cases of methods that are derived from minimizing

[x
i
[
m
, for m = 1 and
m = 2. Oja (1983) proposed the bivariate objective function: D
8
() =

i<j
A
m
(x
i
, x
j
, )
where A(x
i
, x
j
, ) is the area of the triangle formed by the three vectors x
i
, x
j
, . When
m = 2 Wilks (1960) showed that D
8
() is proportional to the determinant of the classical
scatter matrix and the sample mean vector minimizes this criterion. Thus, by analogy with
the univariate case, the m = 1 case will be called the L
1
case. The same results carry over to
dimensions greater than 2 in which the triangles are replaced by simplices. For the remainder
of the section, m = 1.
We introduce the following notation:
A
ij
=
1
2
_
_
1 1 1

1
x
i1
x
j1

2
x
i2
x
j2
_
_
.
Then D
8
() =
1
2

i<j
absdetA
ij
where det stands for determinant and abs stands for
absolute value. Now if we dierentiate this criterion function with respect to
1
and
2
we
get a new set of estimating equations:
S
8
() =
1
2
n1

i=1
n

j=i+1
sgndetA
ij
(x

j
x

i
) = 0 , (6.4.12)
where x

i
is the vector x
i
rotated counterclockwise by

2
radians, hence, x

i
= (x
i2
, x
i1
)
T
.
Note that enters only through the A
ij
. The expression in (6.4.12) is found as follows:
sgn(x

j
x

i
)
T
( x
i
)(x

j
x

i
) =
_
x

j
x

i
if x

i
x

j
is counterclockwise
(x

j
x

i
) if x

i
x

j
is clockwise
388 CHAPTER 6. MULTIVARIATE
The estimator that solves 6.4.12 is called the Oja median and we will be interested in
its properties. This estimator minimizes the sum of triangular areas formed by all pairs of
observations along with . Niinimaa, Nyblom, and Oja (1992) provide a fortran program for
computing the Oja median and discuss further aspects of its computation; see, also, the R
package OjaNP. Brown and Hettmansperger (1987a) present a geometric description of the
determination of the Oja median. The statistic S
8
(0) forms the basis of a sign type statistic
for testing H
0
: = 0. We will refer to this test as the Oja sign test. In order to study the
Oja median and the Oja sign test we need once again to determine the matrices A and B.
Before doing this we will rewrite (6.4.12) in a more convenient form, a form that expresses
it as a function of s
1
, . . . , s
n
. Recall the polar form of x, ( 6.3.7), that we have been using
and at the same time introduce the vector y as follows:
x = r
_
cos
sin
_
= rs
_
cos
sin
_
= sy
As usual 0 < , s indicates whether x is above or below the horizontal axis, and r is
the length of x. Hence, if s = 1 then y = x, and if s = 1 then y = x, so y is always
above the horizontal axis.
Theorem 6.4.2. The following string of equalities is true:
1
n
S
8
(0) =
1
2n
n1

i=1
n

j=i+1
sgndet
_
x
i1
x
j1
x
i2
x
j2
_
(x

j
x

i
)
=
1
2n
n1

i=1
n

j=i+1
s
i
s
j
(s
j
y

j
s
i
y

i
)
=
1
2
n

i=1
s
i
z
i
where
z
i
=
1
n
n1

j=1
y

i+j
and y
n+i
= y
i
Proof: The rst formula follows at once from (6.4.12). In the second formula we need to
recall the operation. It entails a counterclockwise rotation of 90 degrees. Suppose, without
loss of generality, that 0
1
. . .
n
. Then
sgn
_
det
_
x
i1
x
j1
x
i2
x
j2
__
= sgn
_
det
_
s
i
r
i
cos
i
s
j
r
j
cos
j
s
i
r
i
sin
i
s
j
r
j
sin
j
__
= sgns
i
s
j
r
i
r
j
cos
i
sin
j
sin
i
cos
j

= s
i
s
j
sgnsin(
j

i
)
= s
i
s
j
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 389
Now if x
i
is in the rst or second quadrants then y

i
= x

i
= s
i
x

i
and if x
i
is in the third or
fourth quadrant then y

i
= x

i
= s
i
x

i
. Hence, in all cases we have x

i
= s
i
y

i
. The second
formula now follows.
The third formula follows by straightforward algebraic manipulations. We leave those details
to the reader, Exercise 6.8.15, and instead point out the following helpful facts:
z
i
=
n

j=i+1
y

j

i1

j=1
y

j
, i = 2, . . . , n 1, z
1
=
n

j=2
y

j
, z
n
=
n1

j=1
y

j
(6.4.13)
The third formula shows that we have a sign statistic similar to the ones that we have
been studying. Under the null hypothesis (s
1
, . . . , s
n
) and (z
1
, . . . , z
n
) are independent.
Hence conditionally on z
1
, . . . , z
n
(or equivalently conditionally on y
1
, . . . , y
n
) the conditional
covariance matrix of S
8
(0) is

A =
1
4

i
z
i
z
T
i
. A conditional distribution free test is
reject H
0
: = 0 when S
T
8
(0)

A
1
S
8
(0)
2

(2) . (6.4.14)
Theorem 6.4.2 shows that conditionally on the data, the
2
-approximation is appropriate.
The next theorem shows that the approximation is appropriate unconditionally as well. For
additional discussion of this test see Brown and Hettmansperger (1989). We want to describe
the asymptotically distribution free version of the Oja sign test. Then we will show that,
for elliptical models, the Oja sign test and Blumens test are equivalent. It is left to the
exercises to show that the Oja median is ane equivariant and the Oja sign test is ane
invariant so they compete with Blumens invariant test, the ane spatial methods in Section
6.3, and with the L
2
methods (vector of means and Hotellings test); see Exercise 6.8.16.
Since the Oja sign test is ane invariant, we will consider the behavior under spherical
models, without loss of generality. The elliptical models can be reduced to spherical models
by ane transformations. The next proposition shows that z
i
has a useful limiting value.
Theorem 6.4.3. Suppose that we sample from a spherical distribution centered at the origin.
Let
z(t) =
2

E(r)
_
cos t
sin t
_
then
1
n
3/2
S
8
(0) =
1
2

n
n

i=1
s
i
z
_
i
n
_
+o
p
(1)
Proof. We sketch the argument. A more general result and a rigorous argument can be
found in Brown et al. (1992). We begin by referring to formula (6.4.13). Recall that
1
n
n

i=1
y

i
=
1
n
n

i=1
r
i
_
sin
i
cos
i
_
390 CHAPTER 6. MULTIVARIATE
Consider the second component and let

= mean that the approximation is valid up to o
p
(1)
terms. From the discussion of max
i
[i/(n + 1)
(i)
[ in Theorem 6.4.1, we have
1
n
[nt]

i=1
r
i
cos
i

= Er
1
n
[nt]

i=1
cos
i

=
_
Er

n
[nt]

i=1
cos
i
n + 1

=
_
Er

__
t
0
cos udu =
_
Er

_
sin t
Furthermore,
1
n
n

i=[nt]
r
i
cos
i

=
_
Er

__

t
cos udu =
_
Er
n
_
sin t
Hence the formula holds for the second component. The rst component formula follows in
a similar way.
This proposition is important since it shows that the Oja sign test is asymptotically
equivalent to Blumens test under elliptical models since they are both invariant under ane
transformations. Hence, the eciency results for Blumens test carry over for spherical and
elliptical models to the Oja sign test. Also recall that Blumens test is locally most powerful
invariant for the class of elliptical models so the Oja sign test should be quite good for
elliptical models in general. The two tests are not equivalent for nonelliptical models. In
Brown et. al. (1992) the eciency of the Oja sign test relative to Blumens test was computed
for a class of symmetric densities with contours of the form [x
1
[
m
+ [x
2
[
m
. When m = 2
we have spherical densities, and when m = 1 we have Laplacian densities with independent
marginals. Table 1 of Brown et. al.(1992) shows that the Oja sign test is more ecient than
Blumens test except when m = 2 where, of course, the eciency is 1. Hettmansperger,
Nyblom, and Oja (1994) extend the Oja methods to dimensions higher than 2 in the one
sample case and Hettmansperger and Oja (1994) extend the methods to higher dimensions
for the multisample problem.
In Brown and Hettmansperger (1987a), the idea of an ane invariant rank vector
is introduced. The approach is similar to that of Mottonen and Oja (1995) for the spatial
rank vector discussed earlier; see Section 6.3.2. The Oja criterion D
8
() with m = 1 in
Section 6.4.3 is a multivariate extension of the univariate L
1
criterion function and we take
its gradient to be the centered rank vector. Recall in the univariate case D() =

[x
j
[
and the derivative D

() =

sgn( x
j
). Hence, D

(x
i
) is the centered rank of x
i
. Likewise
the vector centered rank of x
k
is dened to be:
R
n
(x
k
) = D
8
(x
k
) =
1
2

i<j
sgn
_
_
_
det
_
_
1 1 1
x
k1
x
i1
x
j1
x
k2
x
i2
x
j2
_
_
_
_
_
(x

j
x

i
) (6.4.15)
6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS 391
Again we use the idea of ane invariant vector rank to dene the Oja signed rank statis-
tic. Let R
2n
(x
k
) be the rank vector when x
k
is ranked among the observation vectors
x
1
, . . . , x
n
and their reections x
1
, . . . , x
n
. Then the test statistic is S
9
(0) =

R
2n
(x
j
).
Now R
2n
(x
j
) = R
2n
(x
j
) so that the conditional covariance matrix (conditioning on the
observed data) is

A =
n

j=1
R
2n
(x
j
)R
T
2n
(x
j
)
The approximate size test of H
0
: = 0 is:
Reject H
0
if S
T
9
(0)

A
1
S
9
(0)
2

(2) . (6.4.16)
In addition, the Hodges-Lehmann estimate of based on S
9
()
.
= 0 is the Oja median
of a set of linked pairwise averages; see Brown and Hettmansperger (1987a) for details.
Hettmansperger, Mottonen and Oja (1997a, 1997b) extend the ane invariant one and
two sample rank tests to dimensions greater than 2. Because of ane invariance, Table
6.3.6 provides the eciencies relative to Hotellings test for a multivariate t distribution; see
Mottonen, Hettmansperger and Tienari (1997). Note that the eciency is quite high even
for the multivariate normal distribution. Further, note that this eciency is the same for all
elliptical normal distributions as well since the test is ane invariant.
Example 6.4.2. Mathematics and Statistics Exam Scores, Example 6.4.1 continued
We apply the Oja signed-rank test and the Oja HL-estimate to the data in Table 6.4.1.
The numerical results are similar to the results of the ane spatial methods; see Table
6.4.2 for the results. Note that due to computational complexity it is not possible to boot-
strap the covariance matrix of the Oja HL-estimate. The R-library OjaNP can be used for
computations.
6.4.4 Additional Remarks
Many authors have worked on the problem of developing multidimensional sign tests under
various invariance conditions. The sign statistics are important for dening medians, and
further in dening the concept of centered rank. Oja and Nyblom (1989) propose a family of
locally most powerful sign tests that are ane invariant and show that the Blumen (1958)
test is optimal for elliptical alternatives. Using a dierent approach that involves data based
coordinate systems, Chaudhuri and Sengupta (1993) introduce a family of ane invariant
sign tests. See also Dietz (1982) for a development of ane invariant sign and rank pro-
cedures based on rotations of the coordinate systems. Another interesting approach to the
construction of a multivariate median and rank is based on the idea of data depth due to
Liu (1990). In this case, the median is a point contained in a maximum number of triangles
formed by the
_
n
3
_
dierent choices of 3 data vectors. See, also, Liu and Singh (1993).
392 CHAPTER 6. MULTIVARIATE
Hence, we conclude that if we are fairly certain that we have a spherical model, in a
spatial statistics context, for example, then the spatial median and the spatial sign test are
quite good. If the model is likely to be elliptical with heavy tails then either Blumens test
or the ane invariant spatial sign or spatial signed-rank tests along with the corresponding
equivariant estimators are both statistically and computationally quite ecient. If we suspect
that the model is nonelliptical then the methods of Oja are preferable. On the other hand, if
invariance and equivariance considerations are not relevant then the componentwise methods
should work quite well. Finally, departures from bivariate normality should be considered.
The L
1
type methods are good when there is a heavy tailed model. However, the eciency
can be improved by rank type methods when the tail weight is more moderate and perhaps
close to normality. Even at the bivariate normal model the rank methods loose very little
eciency when invariance is taken into account. Oja and Randles (2004) discuss ane
invariant rank tests for several samples and, further, discuss tests of independence.
6.5 Robustness of Multivariate Estimates of Location
In this section we sketch some results on the inuence and breakdown points for the esti-
mators derived from the various estimating equations. Recall from Theorem 6.1.2 that the
vector inuence is proportional to the vector (x). Typically (x) is a projection and re-
duces the problem of nding the asymptotic distribution of the estimating function
1

n
S()
to a central limit problem. To determine whether an estimator has bounded inuence or
not, it is only necessary to check that the norm of (x) is bounded. Further, recall that the
breakdown point is the smallest proportion of contamination needed to carry the estimator
beyond all bounds. We now briey consider the dierent invariance models:
6.5.1 Location and Scale Invariance: Componentwise Methods
In the case of component medians, the inuence function is given by

T
(x) (sgn(x
11
), sgn(x
21
)) .
The norm is clearly bounded. Further, the breakdown point is 50% as it is in the uni-
variate case. Likewise, for the Hodges-Lehmann component estimates
T
(x) (F
1
(x
11
)
1/2, F
2
(x
21
) 1/2), where F
i
( ) is the ith marginal cdf. Hence, the inuence is bounded in
this case as well. The breakdown point is 29%, the same as the univariate case. Note, how-
ever, that the componentwise methods are neither rotation nor ane invariant/equivariant.
6.5.2 Rotation Invariance: Spatial Methods
We assume in this subsection that the underlying distribution is spherical. For the spatial
median, we have (x) = u(x), the unit vector in the x direction. Hence, again we have
bounded inuence. Lopuhaa and Rousseeuw (1991) were the rst to point out that the
6.5. ROBUSTNESS OF MULTIVARIATE ESTIMATES OF LOCATION 393
spatial median has 50% breakdown point. The proof is given in the following theorem.
First note from Exercise 6.8.17 that the maximum breakdown point for any translation
equivariant estimator is
[(n+1)/2]
n
and the spatial median is translation equivariant.
Theorem 6.5.1. The spatial median

has breakdown point

=
[(n+1)/2]
n
for every dimen-
sion.
Proof. In view of the preceding remarks, we only need to show


[(n+1)/2]
n
. Let
X = (x
1
, . . . , x
n
) be a collection of n observations in k dimensions. Let Y
m
= (y
1
, . . . , y
n
)
be formed from X by corrupting any m observations. Then

(Y
m
) minimizes

|y
i
|.
Assume, without loss of generality, that

(X) = 0. (Use translation equivariance.) We
suggest that the reader follow the argument with a picture in two dimensions.
Let M = max
i
|x
i
| and let B(0, 2M) be the sphere of radius 2M centered at the origin.
Suppose the number of corrupted observations m [
n1
2
]. We will show that sup|

(Y
m
)|
over Y
m
is nite. Hence,


(n1)/2+1
n
=
(n+1)/2
n
and we will be nished.
Let d
m
= inf|

(Y
m
) | : in B(0, 2M), the distance of

(Y
m
) from B(0, 2M).
Then the distance of

(Y
m
) from the origin is |

(Y
m
)| d
m
+ 2M. Now
|y
j

(Y
m
)| |y
j
| |

(Y
m
)| |y
j
| (d
m
+ 2M) (6.5.1)
Suppose the contamination has pushed

(Y
m
) far outside B(0, 2M). In particular, suppose
d
m
> 2M[(n+1)/2]. We will show this leads to a contradiction. We know that X B(0, M)
and if x
k
is not contaminated,
|x
k

(Y
m
)| M +|x
k
| + d
m
(6.5.2)
Next split the following sum up over contaminated and not contaminated observations using
( 6.5.1) and ( 6.5.2).
n

i=1
|y
i

(Y
m
)|

contam
(|y
i
| (d
m
+ 2M)) +

not
(|y
i
| d
m
)
=

|y
i
|
_
n 1
2
_
(d
m
+ 2M) + (n
_
n 1
2
_
)d
m
=

|y
i
| 2M(
_
n 1
2
_
) + d
m
(n 2(
_
n 1
2
_
))
>

|y
i
| 2M(
_
n 1
2
_
) + 2M(
_
n 1
2
_
)(n 2(
_
n 1
2
_
))
=

|y
i
| + 2M(
_
n 1
2
_
)(n 1 2(
_
n 1
2
_
))

|y
i
|
But, recall that

(Y
m
) minimizes

|y
i
|, hence we have a contradiction. Thus d
m

2M(
_
n1
2

).
But then

must be at least
[(n1)/2]+1
n
and
_
n1
2

+1 =
_
n+1
2

and the proof is complete.


394 CHAPTER 6. MULTIVARIATE
6.5.3 The Spatial Hodges-Lehmann Estimate
This estimate is the spatial median of the pairwise averages:
1
2
(x
i
+x
j
). It was rst studied
in detail by Chaudhuri (1992) and it is the estimate corresponding to the spatial signed rank
statistic (6.3.16) of Mottonen and Oja (1995).
From ( 6.3.21) it is clear that the inuence function is bounded. Further, since it is the
spatial median of the pairwise averages, the argument that shows that the breakdown of
the univariate Hodges-Lehmann estimate is 1
1

2
.29 works in the multivariate case; see
Exercise 1.12.13 in Chapter 1.
6.5.4 Ane Equivariant Spatial Median
We can represent the ane equivariant spatial median as an M-estimate; see Maronna (1976)
or Maronna et. al (2006). Our multivariate estimators

and U
x
are the solutions of the
following M-estimating equations:
1
n
n

i=1
u
1
(d
i
)U
x
(x
i
) = 0,
1
n
n

i=1
u
2
(d
i
)U
x
(x
i
)(x
i
)
T
U
T
x
= I (6.5.3)
where d
i
= |U
x
(x
i
)|
2
with u
1
(d) = d
1/2
and u
2
(d) = kd
1
. Because they are M-
estimators, the breakdown value for

is between (k +1)
1
and k
1
where k is the dimension
of the underlying population. The asymptotic theory for

, developed in the appendix of
Hettmansperger and Randles (2002), shows that the inuence function for

is
B
1
U
x
(x )
|U
x
(x
i
)|
where B = E
_
1
|U
x
(I x
i
)|
_
I
U
x
(x )
|U
x
(x
i
)|
(x )
T
U
T
x
|U
x
(x
i
)|
__
;
(6.5.4)
recall (6.3.3). Hence, we see that the inuence function is bounded with a positive breakdown.
Note however that the breakdown decreases as the dimension of the underlying distribution
increases.
6.5.5 Ane Equivariant Oja Median
This estimator is ane equivariant and solves the equation ( 6.4.12). From the projection
representation of the statistic in Theorem 6.4.2 notice that the vector z(t) is bounded. It
then follows that, for spherical models (with nite rst moment), the inuence function is
bounded. See Niinimaa and Oja (1995) for a rigorous derivation of the inuence function.
The breakdown properties of the Oja median are more interesting. As shown by Niinimaa,
Oja, and Tableman (1990), even though the inuence function is bounded, the estimate can
be broken down with just two contaminated points; that is, they showed that the breakdown
of Ojas median is 2/n. Further, Niinimaa and Oja (1995) show that the breakdown point of
the Oja median depends on the dispersion of the contaminated data. When the dispersion
of the contaminated data is less than the dispersion of the original data then the asymptotic
6.6. LINEAR MODEL 395
breakdown point is positive. If, for example, the contaminated points are all at a single
point, then the breakdown point is
1
3
.
6.6 Linear Model
We consider the bivariate linear model. As examples of the linear model, we will nd bivariate
estimates and tests for a general regression eect as well as shifts in the bivariate two-sample
location model and multisample location models. We will focus primarily on compontentwise
rank methods; however, we will discuss some other methods for the multiple sample location
model in the examples of Section 6.6.1. Spatial and ane invariant/equivariant methods for
the general linear model are currently under development in the research literature. See Davis
and McKean (1993) for a thorough development of the linear model rank-based methods.
In Chapter 3, Section 3.2, we present the notation for the univariate linear model. Here,
we will think of the multivariate linear model as a series of concatenations of the univariate
models. Hence, we introduce
Y
n2
=
_
_
_
Y
11
Y
12
.
.
.
.
.
.
Y
n1
Y
n2
_
_
_
= (Y
(1)
, Y
(2)
) =
_
_
_
Y
T
1
.
.
.
Y
T
n
_
_
_
(6.6.1)
The superscript indicates a column, a subscript a row, and, as usual in this chapter, T
denotes transpose. Now the multivariate linear model is
Y = 1
T
+X + , (6.6.2)
where 1 is n1 vector of ones,
T
= (
(1)
,
(2)
), X is np full rank, centered design matrix,
is p 2 matrix of unknown regression constants, and is n 2 matrix of errors. The
rows of , and hence, Y, are independent and the rows of are identically distributed with
a continuous bivariate cdf F(s, t).
Model 6.6.2 is the concatenation of two univariate linear models: Y
(i)
= 1
(i)
+X
(i)
+

(i)
for i = 1, 2. We have restricted attention to the bivariate case to simplify the presentation.
In most cases the general multivariate results are obvious.
We rank within components or columns. Hence, the rank-score of the ith item in the jth
column is:
a
ij
= a(R
ij
) = a(R(Y
ij
x
T
i

(j)
)) (6.6.3)
where R
ij
is the rank of Y
ij
x
T
i

(j)
when ranked among Y
1j
x
T
1

(j)
, . . . , Y
nj
x
T
n

(j)
. The
rank scores are generated by a(i) = (
i
n+1
), 0 < (u) < 1,
_
(u)du = 0, and
_

2
(u)du = 1;
see Section 3.4. Let the score matrix A be dened as follows:
A =
_
_
_
a
11
a
12
.
.
.
.
.
.
a
n1
a
n2
_
_
_
= (a
(1)
, a
(2)
) (6.6.4)
396 CHAPTER 6. MULTIVARIATE
so that each column is the set of rank scores within the column.
The criterion function is
D() =
n

i=1
a
T
i
r
i
(6.6.5)
where a
T
i
= (a
i1
a
i2
) = (a(R(Y
i1
x
T
i

(1)
), a(R(Y
i2
x
T
i

(2)
)) and r
T
i
= (Y
i1
x
T
i

(1)
, Y
i2

x
T
i

(2)
). Note at once that this is an analog, using inner products, of the univariate criterion
in Section 3.2.1. In fact, D() is the sum of the corresponding univariate criterion functions.
The matrix of the negatives of the partial derivatives is:
L() = X
T
A =
_
_
_

x
i1
a
i1

x
i1
a
i2
.
.
.
.
.
.

x
ip
a
i1

x
ip
a
i2
_
_
_
=
_

a
i1
x
i
,

a
i2
x
i
_
; (6.6.6)
see Exercise 6.8.18 and equation ( 3.2.11). Again, note that the two columns in ( 6.6.6) are
the estimating equations for the two concatenated univariate linear models and x
i
is the ith
row of X written as a column.
Hence, the componentwise multivariate R-estimator of is

that minimizes ( 6.6.5) or
solves L()
.
= 0. Further, L(0) is the basic quantity that we will use to test H
0
: = 0. We
must statistically assess the size of L(0) and reject H
0
and claim the presence of a regression
eect when L(0) is too large or too far from the zero matrix.
We rst consider testing H
0
: = 0 since the distribution theory of the test statistic will
be useful later for the asymptotic distribution theory of the estimate.
For the linear model we need some results on direct products; see Magnus and Neudedker
(1988) for a complete discussion. We list here the results that we need:
1. Let A and B be mn and p q matrices. The mp nq matrix AB dened by
AB =
_

_
a
11
B a
1n
B
.
.
.
.
.
.
a
m1
B a
mn
B
_

_
(6.6.7)
is called the direct product or Kronecker product of A and B.
2.
(AB)
T
= A
T
B
T
, (6.6.8)
(AB)
1
= A
1
B
1
, (6.6.9)
(AB)(CD) = (ACBD) . (6.6.10)
3. Let D be a m n matrix. Then D
col
is the mn 1 vector formed by stacking the
columns of D. We then have
tr(ABCD) = (D
T
col
)
T
(C
T
A)B
col
= D
T
col
(AC
T
)(B
T
)
col
. (6.6.11)
6.6. LINEAR MODEL 397
4.
(AB)
col
= (B
T
I)A
col
= (I A)B
col
. (6.6.12)
These facts are used in the proofs of the theorems in the rest of this chapter.
6.6.1 Test for Regression Eect
As mentioned above, we will base the test of H
0
: = 0 on the size of the random matrix
L(0). We deal with this random matrix by rolling out the matrix by columns. Note from
6.6.4 and ( 6.6.6) that L(0) = X

A = (X

a
(1)
, X

a
(2)
). Then we dene the vector
L
col
=
_
X
T
a
(1)
X
T
a
(2)
_
=
_
X
T
0
0 X
T
__
a
(1)
a
(2)
_
. (6.6.13)
Now from the discussion in Section 3.5.1, let the column variances and covariances be

2
a
(i)
=
1
n 1
n

j=1
a
2
ji

2
i
=
_

2
(u) du = 1

2
a
(1)
a
(2)
=
1
n 1
n

j=1
a
j1
a
j2

12
=
_
(F
1
(s))(F
2
(t)) dF(s, t) , (6.6.14)
where F
1
(s) and F
2
(t) are the marginal cdfs of F(s, t). Since the ranks are centered and
using the same argument as in Theorem 3.5.1, E(L
col
) = 0 and
V = Cov(L
col
) =
_

2
a
(1)
X
T
X
a
(1)
a
(2) X
T
X

a
(1)
a
(2) X
T
X
2
a
(2)
X
T
X
_
=
_
1
n 1
A
T
A
_

_
X
T
X
_
. (6.6.15)
Further,
1
n
V
_
1
12

12
1
_
, (6.6.16)
where n
1
X
T
X and is positive denite.
The test statistic for H
0
: = 0 is the quadratic form
A
R
= L
T
col
V
1
L
col
= (n 1)L
T
col
_
(A
T
A)
1
(X
T
X)
1

L
col
(6.6.17)
where we use a basic formula for nding the inverse of a direct product; see ( 6.6.9). Before
discussing the distribution theory we record one nal result from traditional multivariate
analysis:
A
R
= (n 1)traceL
T
(X
T
X)
1
L(A
T
A)
1
; (6.6.18)
see Exercise 6.8.19. This result is useful in translating a quadratic form involving a direct
product into a trace involving ordinary matrix products. Expression ( 6.6.18) corresponds
to the Lawley-Hotelling trace statistic based on ranks within the components. The
following theorem summarizes the distribution theory needed to carry out the test.
398 CHAPTER 6. MULTIVARIATE
Theorem 6.6.1. Suppose H
0
: = 0 is true and the conditions in Section 3.4 hold. Then
P
0
(A
R

2

(2p)) as n
where
2

(2p) is the upper percentile from a chisquared distribution with 2 degrees of free-
dom.
Proof: This theorem follows along the same lines as Theorem 3.5.2. Use a projection
to establish that
1

n
L
col
is asymptotically normally distributed and then A
R
will be asymp-
totically chisquared. The details are left as an Exercise 6.8.20; however, the projection is
provided below for use with the estimator.
1

n
L
col
=
1

n
_
X
T

(1)
X
T

(2)
_
+o
p
(1) (6.6.19)
where
(i)
T
= ((F
i
(
1i
) . . . (F
i
(
ni
)) i = 1, 2 and F
1
, F
2
are the marginal cdfs. Recall also
that a(i) = (
i
n+1
) where ( ) is the score generating function. The asymptotic covariance
matrix is given in ( 6.6.16).
Example 6.6.1. Multivariate Mann-Whitney-Wilcoxon Test
We now specialize to the Wilcoxon score function a(i) =

12(
i
n+1
.5) and consider the
two sample model. The test is a multivariate version of the Mann-Whitney-Wilcoxon test.
Note that

a(i) = 0,
2
a
=
1
n1

a
2
(i) =
n
n+1
1, and

a
(1)
a
(2) =
12
n 1
n

i=1
_
R
i1
n + 1

1
2
__
R
i2
n + 1

1
2
_
where R
11
, . . . , R
n1
are the ranks of the combined samples in the rst component and sim-
ilarly for R
12
, . . . , R
n2
for the second component. Note that
a
(1)
a
(2) =
n
n+1
r
s
, where r
s
is
Spearmans Rank Correlation Coecient. Hence,
1
n 1
A
T
A =
_
n
n+1

a
(1)
a
(2)

a
(1)
a
(2)
n
n+1
_
=
n
n + 1
_
1 r
s
r
s
1
_

_
1
12

12
1
_
where

12
= 12
_ _ _
F
1
(r)
1
2
__
F
2
(s)
1
2
_
dF(r, s)
depends on the underlying bivariate distribution.
Next , we must consider the design matrix X for the two sample model. Recall ( 2.2.1)
and ( 2.2.2) which cast the two sample model as a linear model in the univariate case. The
design matrix (or vector in this case) is not centered. For convenience we modify C in
6.6. LINEAR MODEL 399
( 2.2.1) to have 1 in the rst n
1
places and 0 elsewhere. Note that the mean of C is
n
1
n
and
subtracting this from the elements of C yields the centered design:
X =
1
n
_
_
_
_
_
_
_
_
_
n
2
.
.
.
n
2
n
1
.
.
.
n
1
_
_
_
_
_
_
_
_
_
where n
2
appears n
1
times. Then X
T
X =
n
1
n
2
n
and
1
n
X
T
X =
n
1
n
2
n
2

1

2
. We assume as
usual that 0 <
i
< 1, i = 1, 2.
Now L = L(0) = (l
1
, l
2
) and l
i
= X
T
a
(i)
. It is easy to see that
l
i
=
n
1

j=1
a
ji
=

12
n
1

j=1
_
R
ji
n + 1

1
2
_
, i = 1, 2
So l
i
is the centered and scaled sum of ranks of the rst sample in the ith component.
Now L
col
= (l
1
, l
2
)
T
has an approximate bivariate normal distribution with covariance
matrix:
Cov(L
col
) =
1
n 1
(A
T
A) (X
T
X) =
n
1
n
2
n(n 1)
A
T
A =
n
1
n
2
n + 1
_
1 r
s
r
s
1
_
.
Note that
12
is unknown but estimated by Spearmans rank correlation coecient r
s
(see
above discussion). Hence the test is based on A
R
in ( 6.6.18). It is easy to invert Cov(L
col
)
and we have, (see Exercise 6.8.20),
A
R
=
n + 1
n
1
n
2
(1 r
2
s
)
l
2
1
+ l
2
2
2r
s
l
1
l
2
=
1
1 r
2
s
_
l
2
1
+ l
2
2
2r
s
l

1
l

2
_
,
where l

1
and l

2
are the standardized MWW statistics. We reject H
0
: = 0 at approximately
level when A
R

2

(2). The test statistic A


R
is a quadratic form in the component Mann-
Whitney-Wilcoxon rank statistics and r
s
provides the adjustment for the correlation between
the components.
Example 6.6.2. Brains of Mice Data
In Table 6.6.1 we provide bivariate data on levels of certain biochemical components in the
brains of mice. The treatment group received a drug which was hypothesized to alter these
levels. The control group received a placebo.
The ranks of the combined treatment and control data for each component are given in
the table, under component ranks. The Spearman rank correlation coecient is r
s
= .149,
400 CHAPTER 6. MULTIVARIATE
Table 6.6.1: Levels of Biochemical Components in the Brains of Mice
Data Component Ranks Centered Ane Ranks
Control Treatment Control Treatment Control Treatment
(1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2)
1.21 .61 1.40 1.50 16 22 22 18.5 .90 18.53 8.28 8.06
.92 .43 1.17 .39 3 12 18 13 -9.02 3.55 2.06 -7.76
.80 .35 1.23 .44 1 4.5 18 13 -8.81 -6.26 4.90 1.05
.85 .48 1.19 .37 2 17 15 6 -9.40 9.37 4.15 -11.78
.98 .42 1.38 .42 4 10.5 21 10.5 -6.80 .74 9.79 -4.63
1.15 .52 1.17 .45 11 20 12.5 14.5 -.53 15.51 .55 5.96
1.10 .50 1.31 .41 9 18.5 20 9 -3.25 13.63 8.15 -7.05
1.02 .53 1.30 .47 6 21 19 16 -6.23 15.97 6.72 6.71
1.18 .45 1.22 .29 14 14.5 17 2 2.28 5.27 5.52 -16.78
1.09 .4 1.00 .30 7.5 8 5 3 -3.04 -4.56 -4.84 -15.02
1.12 .27 10 1 .95 -18.31
1.09 .35 7.5 4.5 -2.03 -12.54
the standardized MWW statistics are l

1
= 2.74 and l

2
= 2.17. Hence A
R
= 14.31 with
the p-value of .0008 based on a
2
-distribution with 2 degrees of freedom. Panels A and
B of Figure 6.6.1 show, respectively, plots of the bivariate data and the component ranks.
The strong separation between treatment and control is clear from the plot. The treatment
group contains an outlier which is brought in by the component ranking.
We have discussed the multivariate version of the MWW test based on the centered ranks
of the combined data where the ranking mechanism is represented by the matrix A. Given A
and the design matrix X, the test statistic A
R
can be computed. Recall from Section 6.3.2
that Mottoen and Oja (1995) introduced the vector spatial rank R(x
i
) =

j
u(x
i
x
j
),
where u(x) = |x|
1
x is a unit vector in the direction of x. In Section ??, an ane rank
vector R(x
i
) is given by ( 6.4.15). Both spatial and ane rank vectors are centered. Let
R(x
i
) be the ith row of A. Note that in these two cases that the columns of A are not of
length 1. Nevertheless, from ( 6.6.18), we have
A
R
=
n(n 1)
n
1
n
2
[l
1
l
2
](A
T
A)
1
[l
1
l
2
]
T
=
n(n 1)
n
1
n
2
1
1 r
2
_
l
2
1
|a
(1)
|
2
+
l
2
2
|a
(2)
|
2
2r
l
1
l
2
|a
(1)
|a
(2)
|
_
,
where r is the correlation coecient between the two columns of A; see Brown and Hett-
mansperger (1987b). Table 6.6.1 contains the ane rank vectors and the corresponding
ane invariant MWW test is A
R
= 15.69 with an approximate p-value of .0004 based on a

2
(2)-distribution. See Exercise 6.8.21.
6.6. LINEAR MODEL 401
Figure 6.6.1: Panel A: Plot of the data for the Brains of Mice Data; Panel B: Plot of the
corresponding ranks of the Brains of Mice Data.
C
C
C
C
C
C
C
C
C
C
Component 1 responses
C
o
m
p
o
n
e
n
t

2

r
e
s
p
o
n
s
e
s
0.8 1.0 1.2 1.4
0
.
4
0
.
8
1
.
2
T
T
T
T
T
T
T
T
T T
T
T
Panel A
C
C
C
C
C
C
C
C
C
C
Component 1 ranks
C
o
m
p
o
n
e
n
t

2

r
a
n
k
s
5 10 15 20
5
1
0
1
5
2
0
T
T
T
T
T
T
T
T
T
T
T
T
Panel B
402 CHAPTER 6. MULTIVARIATE
Example 6.6.3. Multivariate Kruskal-Wallis Test
In this example we develop the multivariate version of the Kruskal-Wallis statistic for use
in a multivariate one-way layout; see Section 4.2.2 for the univariate case. We suppose we
have k samples from k independent distributions. The n (k 1) design matrix is given by
C =
_
_
_
_
_
_
_
0
n
1
0
n
1
. . . 0
n
1
1
n
2
0
n
2
. . . 0
n
2
0
n
3
1
n
3
. . . 0
n
3
.
.
.
.
.
.
.
.
.
.
.
.
0
n
k
0
n
k
. . . 1
n
k
_
_
_
_
_
_
_
and the column means are c

= (
2
, . . . ,
k
) where
i
=
n
i
n
. The centered design is X =
C1c

and has full column rank k 1.


In this design the rst of the k populations is taken as the reference population with
location (
1
,
2
)
T
. The ith row of the matrix is the vector of shift parameters for the
(i + 1)st population relative to the rst population. We wish to test H
0
: = 0 that all
populations have the same (unknown) location vector.
The matrix A = (a
(1)
, a
(2)
) has the centered and scaled Wilcoxon scores of the previous
example. Hence, a
(1)
is the vector of rank scores for the combined k samples in the rst
component. Since the rank scores are centered, we have
X
T
a
(i)
= (C1c
T
)
T
a
(i)
= C
T
a
(i)
and the second version is easier to compute. Now L(0) = (L
(1)
, L
(2)
) and the hth component
of column i is
l
hi
=

12

jS
h
_
R
ji
n + 1

1
2
_
=

12
n + 1
n
h
_
R
hi

n + 1
2
_
where S
h
is the index set corresponding to the hth sample and R
hi
is the average rank of
the hth sample in the ith component.
As in the previous example, we replace
1
n1
A
T
A by its limit with 1 on the main diagonal
and
12
o the diagonal. Then let ((
ij
)) be the inverse matrix. This is easy to compute
and will be useful below. The test statistic is then, from ( 6.6.17)
A
R
(L
(1)
T
, L
(2)
T
)
_

11
(X
T
X)
1

12
(X
T
X)
1

12
(X
T
X)
1

22
(X
T
X)
1
__
L
(1)
L
(2)
_
=
11
L
(1)
T
(X
T
X)
1
L
(1)
+ 2
12
L
(1)
T
(X
T
X)
1
L
(2)
+
22
L
(2)
T
(X
T
X)
1
L
(2)
6.6. LINEAR MODEL 403
The indicates that the right side contains asymptotic quantities which must be estimated
in practice. Now
L
(1)
T
(X
T
X)
1
L
(1)
=
k

j=1
n
1
j
l
2
j1
=
12
(n + 1)
2
k

j=1
n
j
_
R
j1

n + 1
2
_
=
n
n + 1
H
1
where H
1
is the Kruskal-Wallis statistic computed on the rst component. Similarly,
L
(1)
T
(X
T
X)
1
L
(2)
=
12
(n + 1)
2
k

j=1
n
j
_
R
j1

n + 1
2
__
R
j2

n + 1
2
_
=
n
n + 1
H
12
and H
12
is a cross component statistic. Using Spearmans rank correlation coecient r
s
to
estimate
12
, we get
A
R
=
1
1 r
2
s
H
1
2r
s
H
12
+ H
2

The test rejects the null hypothesis of equal location vectors at approximately level when
A
R

2

(2(k 1)).
In order to compute the test, rst compute componentwise rankings. We can display the
means of the rankings in a 2 k table as follows:
Treatment
1 2 k
Component 1 R
11
R
21
R
1k
Component 2 R
12
R
22
R
2k
Then use Minitab or some other package to nd the two Kruskal-Wallis statistics. To com-
pute H
12
either use the formula above or use
H
12
=
k

j=1
_
1
n
j
n
_
Z
j1
Z
j2
, (6.6.20)
where Z
ji
= (R
ji
(n+1)/2)/
_
VarR
ji
and VarR
ji
= (nn
j
)(n+1)/n
j
; see Exercise 6.8.22.
The package Minitab lists Z
ji
in its output.
The last example shows that in the general regression problem with Wilcoxon scores, if
we wish to test H
0
: = 0, the test statistic ( 6.6.17) can be written as
A
R
=
1
1
12
2
L
(1)
T
(X
T
X)
1
L
(1)
2
12
L
(1)
T
(X
T
X)
1
L
(2)
+L
(2)
T
(X
T
X)
1
L
(2)
(6.6.21)
where the estimate of
12
can be taken to be r
s
or
n
n+1
r
s
and r
s
is Spearmans rank correlation
coecient and
l
hi
=

12
n + 1
n

j=1
_
R
ji

n + 1
2
_
x
jh
=
n

j=1
a(R(Y
ji
))x
jh
Then reject H
0
: = 0 when A
R

2

(2p).
404 CHAPTER 6. MULTIVARIATE
6.6.2 The Estimate of the Regression Eect
In the introduction to Section 6.6, we pointed out that the R-estimate

solves L()
.
= 0,
( 6.6.7). Recall the representation of the R-estimate in the univariate case given in Corollary
3.5.2. This immediately extends to the multivariate case as

n(


0
) =
_
1
n
X
T
X
_
1
1

n
(
1
X
T

(1)
,
2
X
T

(2)
) +o
p
(1) (6.6.22)
where
(i)
T
= ((F
i
(
1i
)), . . . , (F
i
(
ni
))), i = 1, 2 Further,
i
is given by ( 3.4.4) and we
dene the matrix by = diag
1
,
2
. To investigate the asymptotic distribution of the
random matrix ( 6.6.22), we again roll it out by columns. We need only consider the linear
approximation on the right side.
Theorem 6.6.2. Assume the regularity conditions in Section 3.4. Then, if is the true
matrix,

n(

col

col
)
D
N
2p
_
0,
_

_
1
12

12
1
_

1
_
where

12
=
_ _
(F
1
(s))(F
2
(t))dF(s, t) , = diag
1
,
2

and
i
is given by ( 3.4.4), and
1
n
X

X , positive denite.
Proof. We will sketch the argument based on ( 6.6.1), (6.6.13), and Theorem 3.5.2. Consider,
with
1
replaced by (
1
n
X

X)
1
,
1

n
_

1

1
X
T

(1)

1
X
T

(2)
_
=
_

1

1
0
0
2

1
_
_
1

n
X
T

(1)
1

n
X
T

(2)
_
The multivariate central limit theorem establishes the asymptotic multivariate normality.
From the discussion after ( 6.6.1), we have E
(i)

(i)
T
= I, i = 1, 2 and E
(1)

(2)
T
=
12
I.
Hence, the covariance matrix of the above vector is:
_

1

1
0
0
2

1
__

12

12

__

1

1
0
0
2

1
_
=
_

2
1

12

12

2
2

1
_
=
_

2
1

1

12

12

2
2
_

1
=
_

_
1
12

12
1
_

1
and this is the asymptotic covariance matrix for

n(

col

col
).
We remind the reader that when we use the Wilcoxon score (u) =

12(u
1
2
), then

1
i
=

12
_
f
2
i
(x)dx, f
i
the marginal pdf i = 1, 2 and
12
=
n
n+1
r
s
, where r
s
is Spearmans
rank correlation coecient. See Section 3.7.1 for a discussion of the estimation of
i
.
6.6. LINEAR MODEL 405
6.6.3 Tests of General Hypotheses
Recall the model ( 6.6.1) and let the matrix M be r p of rank r and the matrix K be
2 s of rank s. The matrices M and K are fully specied by the researcher. We consider a
test of H
0
: MK = 0. For example, when K = I
2
, and M = (O I
r
) where O denotes the
r (p r) matrix of zeros, we have H
0
: M = 0 and this null hypothesis species that the
last r parameters in both components are 0. This is the usual subhypothesis in the linear
model applied to both components. Alternatively we might let M = I
p
, p p, and
K =
_
1
1
_
.
Then H
0
: K = 0 and we test the null hypothesis that the parameters of the two con-
catenated linear models are equal:
i1
=
i2
for i = 1, . . . , p. This could be appropriate for
a pre-post test model. Thus, we generalize ( 3.6.1) to the multivariate linear model. The
development will preceed in steps beginning with H
0
: M = 0, i.e. K = I
2
.
Theorem 6.6.3. Under H
0
: M = 0

n(M

)
col
D
N
2r
_
0,
_

_
1
12

12
1
_

_
[M
1
M
T
]
_
here ,
12
, are given in Theorem 6.6.2. Let V denote the asymptotic covariance matrix
then
n(M

)
T
col

V
1
(M

)
col
=

T
col
[
1
n 1
A
T
A ]
1
M
T
(M(X
T
X)
1
M
T
)
1
M

col
= trace(M

)
T
(M(X
T
X)
1
M
T
)
1
(M

)
[
1
n 1
A
T
A ]
1
(6.6.23)
is asymptotically
2
(2r). Note that we have estimated unknown parameters in the asymptotic
covariance matrix V.
Proof. First note that
(M

)
col
=
_
M 0
0 M
_

col
Using Theorem 6.6.2, the asymptotic covariance is, with = diag
1
,
2
,
V =
_
M 0
0 M
___

_
1
12

12
1
_

1
__
M
T
0
0 M
T
_
=
_
M 0
0 M
__

1

12

12

1
__
M
T
0
0 M
T
_
=
_

1
M
1
M
T

12
M
1
M
T

12
M
1
M
T

2
M
1
M
T
_
=
_

_
1
12

12
1
_

_
M
1
M
T
406 CHAPTER 6. MULTIVARIATE
Hence, by the same argument,

T
col
_
M
T
0
0 M
T
_
V
1
_
M 0
0 M
_

col
=

T
col
[
_
1
12

12
1
_
]
1
M
T
(M
1
M
T
)
1
M

col
=
trace(M

)
T
(M
1
M
T
)
1
(M

)[
_
1
12

12
1
_
]
1

Denote the test statistic, ( 6.6.23), dened in the last theorem to be


Q
MV R
= trace(M

)
T
(M(X
T
X)
1
M
T
)
1
(M

)[
1
n 1
A
T
A ]
1
. (6.6.24)
Then the corresponding level asymptotic decision rule is:
Reject H
0
: M = 0 in favor of H
A
: M ,= 0 if Q
MV R

(2r) . (6.6.25)
The next theorem describes the test when only K is involved. After that we put the two
results together for the general statement.
Theorem 6.6.4. Under H
0
: K = 0, where K is a 2 s matrix,

n(

K)
col
is asymptot-
ically
N
ps
_
0,
_
K
T

_
1
12

12
1
_
K
_

1
_
where ,
12
, and are given in Theorem 6.6.2. Let V denote the asymptotic covariance
matrix. Then
n(

K)
T
col

V
1
(

K)
col
= trace(

K)
T
(X
T
X)(

K)[K
1
n 1
A
T
A K]
1

is asymptotically
2
(ps).
Proof: First note that (

T
K)
col
=
_
K
T
I
_

T
col
. Then from Theorem 6.6.2, the asymp-
totic covariance matrix of

n

T
col
is
AsyCov(

T
col
) =
_

_
1
12

12
1
_

1
.
Hence, the asymptotic covariance matrix of

n(

K)
col
is,
AsyCov(

n(

K)
col
) =
_
K
T
I
_
__

_
1
12

12
1
_

1
_
_
K
T
I
_
T
=
_
K
T

_
1
12

12
1
_

1
_
(KI)
=
_
K
T

_
1
12

12
1
_
K
_

1
,
6.6. LINEAR MODEL 407
which is the desired result. The asymptotic normality and chisquare distribution follow from
Theorem 6.6.2.
The previous two theorems can be combined to yield the general case.
Theorem 6.6.5. Under H
0
: MK = 0,

n(M

K)
col
D
N
rs
_
0, [K
T

_
1
12

12
1
_
K] MM
T1
_
.
If V is the asymptotic covariance matrix with estimate

V then
n(M

K)
T
col

V
1
(M

K)
col
=
(M

K)
T
col
[K
T

1
n 1
A
T
A K]
1
[M(X
T
X)
1
M
T
]
1
(M

K)
col
=
trace(M

K)
T
[M(X
T
X)
1
M
T
]
1
(M

K)[K
T

1
n 1
A
T
A K]
1
(6.6.26)
has an asymptotic
2
(rs) distribution.
The last theorem provides great exibility in composing and testing hypotheses in the
multivariate linear model. We must estimate the matrix along with the other parameters
familiar in the linear model. However, once we have these estimates it is a simple series of
matrix multiplications and the trace operation to yield the test statistic.
Denote the test statistic, ( 6.6.26), dened in the last theorem to be
Q
MV RK
=
trace(M

K)
T
[M(X
T
X)
1
M
T
]
1
(M

K)[K
T

1
n 1
A
T
A K]
1
(6.6.27)
Then the corresponding level asymptotic decision rule is:
Reject H
0
: MK = 0 in favor of H
A
: MK ,= 0 if Q
MV RK

(rs) . (6.6.28)
The test statistics Q
MV R
and Q
MV RK
are extensions to the multivariate linear model
of the quadratic form test statistic F
,Q
, ( 3.6.14). The score or aligned test and the drop
in dispersion test are also available. Davis and McKean (1993) develop these in detail and
provide the rigorous development of the asymptotic theory. See also Puri and Sen (1985) for
a development of rank methods in the multivariate linear model.
In traditional analysis, based on the least squares estimate of the matrix of regression
coecients, there are several tests of the hypothesis H
0
: MK = 0. The test statistic
Q
MV RK
, ( 6.6.26), is an analogue of the Lawley (1938) and Hotelling (1951) trace criterion.
This traditional test statistic is given by
Q
LH
= trace
_
(M

LS
K)
T
[M(X
T
X)
1
M
T
]
1
M

LS
K(K
T

K)
1
_
, (6.6.29)
408 CHAPTER 6. MULTIVARIATE
where

LS
= (X

X)
1
X

Yis the least squares estimate of the matrix of regression coecients


and

is the usual estimate of , the covariance matrix of the matrix of errors , given by

= (Y X

LS
)

(Y X

LS
)/(n p 1) . (6.6.30)
Under the above assumptions and the assumption that is positive denite and assuming
H
0
: MK = 0 is true, Q
LH
has an asymptotic
2
distribution with rs degrees of freedom.
This type of hypothesis arises in prole analysis; see Chinchilli and Sen (1982) for this
application.
In order to illustrate these tests, we complete this section with an example.
Example 6.6.4. Tablet Potency Data
The data are the results from a pharmaceutical experiment on the eects of four factors
on ve measurements of a tablet. There are n = 34 data cases. The ve responses are:
(POT2), potency of the tablet at the end of 2 weeks; (POT4), potency of the tablet at the
end of 4 weeks; the third and fourth responses are measures of the tablets purity (RSDCU)
and hardness (HARD); and the fth response is its water content (H
2
O); hence, we have a
5-dimensional response rather than the bivariate responses discussed so far. This means that
the degrees of freedom are 5r rather than 2r in Theorem 6.6.3. The factors are: SAI, the
amount of intragranular steric acid, which was set at the three levels 1, 0 and 1; SAE, the
amount of extragranular steric acid, which was set at the three levels 1, 0 and 1; ADS, the
amount of cross carmellose sodium, which was set at the three levels 1, 0 and 1; and TYPE
of steric acid which was set at two levels 1 and 1. The initial potency of the compound,
POT0, served as a covariate. The data are displayed in Table 6.6.2. It was used as an
example in the article by Davis and McKean (1993) and much of our discussion below is
taken from this article.
This data set was treated as a univariate model for the response POT2 in Chapter 3;
see Examples 3.3.3 and 3.9.2. As our full model we choose the same model described in
expression ( 3.3.1) of Example 3.3.3. It includes: the linear eects of the four factors; six
simple two-way interactions between the factors; the three quadractic terms of the factors
SAI, SAE, and ADS; and the covariate for a total of fteen terms. The need for the quadratic
terms was discussed in the diagnostic analysis of this model for the response POT2; see
Example 3.9.2. Hence, Y is 34 5, X is 34 14, and is 14 5,
Table 6.6.3 displays the results for the test statistic Q
MV R
, ( 6.6.24) for the usual ANOVA
hypotheses of interest: main eects, interaction eects broken down as simple two-way and
quadratic, and covariate. Also listed are the hypothesis matrices M for each eect where
the notation O
tu
represents a t u matrix of 0s and I
t
is the t t identity matrix. Also
given for comparison purposes are the results of the traditional Lawley-Hotelling test, based
on the statistic ( 6.6.29) with K = I
5
. For example, M = [I
4
O
410
] yields a test of the
hypothesis:
H
0
:
11
= =
41
= 0,
12
= =
42
= 0, . . . ,
15
= =
45
= 0 ;
6.6. LINEAR MODEL 409
Table 6.6.2: Responses and Levels of the Factors for the Potency Data
Responses Factors Covariate
POT2 POT4 RSDCU HARD H
2
O SAE SAI ADS TYPE POT0
7.94 3.15 1.20 8.50 0.188 1 1 1 1 9.38
8.13 3.00 0.90 6.80 0.250 1 1 1 -1 9.67
8.11 2.70 2.00 9.50 0.107 1 1 -1 1 9.91
7.96 4.05 2.30 6.00 0.125 1 1 -1 -1 9.77
7.83 1.90 0.50 9.80 0.142 -1 1 1 1 9.50
7.91 2.30 0.90 6.60 0.229 -1 1 1 -1 9.35
7.82 1.40 1.10 8.43 0.112 -1 1 -1 1 9.58
7.42 2.60 2.60 8.50 0.093 -1 1 -1 -1 9.69
8.06 2.00 1.90 6.17 0.207 1 -1 1 1 9.62
8.51 2.80 1.70 7.20 0.184 1 -1 1 -1 9.89
7.88 3.35 4.70 9.30 0.107 1 -1 -1 1 9.80
7.58 3.05 4.00 8.10 0.102 1 -1 -1 -1 9.73
8.14 1.20 0.80 7.17 0.202 -1 -1 1 1 9.51
8.06 2.95 2.50 7.80 0.027 -1 -1 1 -1 9.82
7.31 1.85 2.10 8.70 0.116 -1 -1 -1 1 9.20
8.66 4.10 3.60 6.40 0.114 -1 -1 -1 -1 9.53
8.16 3.95 2.00 8.00 0.183 0 0 0 1 9.67
8.02 2.85 1.10 6.61 0.139 0 0 0 -1 9.41
8.03 3.20 3.60 9.80 0.171 0 1 0 1 9.62
7.93 3.20 6.10 7.33 0.152 0 1 0 -1 9.49
7.84 3.95 2.00 7.70 0.165 0 -1 0 1 9.96
7.59 1.15 2.10 7.03 0.149 0 -1 0 -1 9.79
8.28 3.95 0.70 8.40 0.195 1 0 0 1 9.46
7.75 3.35 2.20 6.37 0.168 1 0 0 -1 9.78
7.95 3.85 7.20 9.30 0.158 -1 0 0 1 9.48
8.69 2.80 1.30 6.57 0.169 -1 0 0 -1 9.46
8.38 3.50 1.70 8.00 0.249 0 0 1 1 9.73
8.15 2.00 2.30 6.80 0.189 0 0 1 -1 9.67
8.12 3.85 2.50 7.90 0.116 0 0 -1 1 9.84
7.72 3.50 2.20 5.60 0.110 0 0 -1 -1 9.84
7.96 3.55 1.80 7.85 0.135 0 0 0 1 9.50
8.20 2.75 0.60 7.20 0.161 0 0 0 -1 9.78
8.10 3.30 0.97 8.73 0.152 0 0 0 1 9.71
8.16 3.90 2.40 7.50 0.155 0 0 0 -1 9.57
410 CHAPTER 6. MULTIVARIATE
Table 6.6.3: Tests of the Eects for the Potency Data
Eects M-matrix df Q
MV R
p-value Q
LH
p-value
Main [I
4
O
410
] 20 179.6 .00 91.8 .00
Higher Order [O
94
I
9
0
91
] 45 102.1 .00 70.7 .01
Interaction [O
64
I
6
O
64
] 30 70.2 .00 52.2 .01
Quadratic [O
310
I
3
0
31
] 15 34.5 .00 18.7 .23
Covariate [O
113
1] 5 3.88 .57 4.34 .50
that is, the linear terms vanish in all 5 components. Note that M is 414 so r = 4 and hence
we have 4 5 = 20 degrees of freedom in Theorem 6.6.3. The other hypothesis matrices are
developed similarly. The robust analysis indicates that all eects are signicant except the
covariate eect. In particular the quadratic eect is signicant for the robust analysis but
not for the Lawley-Hotelling test. This conrms the discussion on LS and robust residual
plots for this data set given in Example 3.9.2.
Are the eects of the factors dierent on potencies of the tablet after 2 weeks, POT2, or 4
weeks, POT4? This question can be answered by evaluating the statistic Q
MV RK
, ( 6.6.27),
for hypotheses of the form MK, for the matrices M given in Table 6.6.3 and the 5 1
matrix K where K

= [1 1 0 0 0]. For example,


11
, . . . ,
41
are the linear eects of SAE,
SAI, ADS, and TYPE on PO2 and
12
, . . . ,
42
are the linear eects on PO4. We may want
to test the hypothesis
H
0
:
11
=
12
, . . . ,
41
=
42
.
The M matrix picks the appropriate s within a component and the K matrix compares the
results across components. From Table 6.6.3, choose M = [I
4
O
410
]. Then
M =
_

11

12

15
.
.
.
.
.
.
.
.
.

41

42

45
_

_
Next choose K
T
= [1 1 0 0 0] so that
MK =
_

11

12

21

22
.
.
.

41

42
_

_
.
Then the null hypothesis is H
0
: MK = 0. In this example r = 4 and s = 1 so the
test has rs = 4 degrees of freedom. The test is illustrated in column 3 of Table 6.6.4.
Other comparisons are also given. Once again, for comparison purposes the results for the
Lawley-Hotelling test based on the test statistic, ( 6.6.29), are given also. The robust and
traditional analyses seem to agree on the contrasts. Although there is some overall dierence
the factors behave somewhat the same on the responses.
6.6. LINEAR MODEL 411
Table 6.6.4: Contrast analyses between responses POT2 and POT4
All Terms Covariate main eect Higher order Interaction Quadratic
except mean terms terms terms terms terms
df 14 1 4 9 6 3
Q
MV RK
21.93 3.00 5.07 12.20 8.22 4.77
p-value .08 .08 .28 .20 .22 .19
Q
LH
22.28 2.67 6.36 11.73 6.99 5.48
p-value .07 .10 .17 .23 .32 .14
Suppose we have the linear model 6.6.2 along with a matrix of scores that sum to zero.
The criterion function and the matrix of partial derivatives are given by ( 6.6.5) and ( 6.6.6).
Then the test statistic for a general regression eect is given by ( 6.6.17) or ( 6.6.18). Special
cases yield the two sample and k-sample tests discussed in Examples 6.6.1 and 6.6.3.
The componentwise rank case uses chisquare critical values. The computation of the tests
require the score matrix A along with the design matrix X. For example, we could use the
L
1
norm componentwise and produce multivariate sign tests that extend Moods test to the
multivariate model.
This approach can be extended to the spatial rank and ane rank cases; recall the
discussion in Example 6.3.2. In the spatial case the criterion function is D(, ) =

|y
T
i

T
x

i
|, ( 6.1.4). Let u(x) = |x|
1
x and r
T
i
=
T
x

i
, then D(, ) =

u
T
(r
i
)r
i
and hence,
A =
_
_
_
u
T
(r
1
)
.
.
.
u
T
(r
n
)
_
_
_
Further, let R
c
(r
i
) =

j
u(r
i
r
j
) be the centered spatial rank vector. Then the criterion
function is D(, ) =

R
T
c
(r
i
)r
i
and
A

=
_
_
_
R
T
(r
1
)
.
.
.
R
T
(r
n
)
_
_
_
The tests then can be carried out using the chisquare critical values. See Brown and Hett-
mansperger (1987b) and Mottonen and Oja (1995) for details. For the details in the ane
invariant sign or rank vector cases see Brown and Hettmansperger (1987b), Hettmansperger,
Nyblom, and Oja (1994), and Hettmansperger, Mottonen and Oja (1997a,b).
Rao (1988) and Bai, Chen, Miao, and Rao (1990) consider a dierent formulation of a
linear model. Suppose, for i = 1, . . . , n, Y
i
= X
i
+
i
where Y
i
is a 21 vector, X
i
is a q2
matrix of known values, is a 2 1 vector of unknown parameters. Further,
1
, . . . ,
n
is an
iid set of random vectors from a distribution with median vector 0. The criterion function is

|Y
i
X
i
|, the spatial criterion function. Estimates, tests, and the asymptotic theory
are developed in the above references.
412 CHAPTER 6. MULTIVARIATE
6.7 Experimental Designs
Recall that in Chapter 4 we developed rank-based procedures for experimental designs based
on the general R-estimation and testing theory of Chapter 3. Analogously in the multivariate
case, rank-based procedures for experimental designs can be based on the R-estimation and
testing theory of the last section. In this short section we show how this development can
proceed. In particular, we use the cell median model, (the basic model of Chapter 4), and
show how the test ( 6.6.28) can be used to test general linear hypotheses involving contrasts
in these cell medians. This allows the testing of MANOVA type hypotheses as well as, for
instance, prole analyses for multivariate data.
Suppose we have k groups and within the jth group, j = 1, . . . , k, we have a sample of size
n
j
. For each subject a d-dimensional vector of variables has been recorded. Let y
ijl
denote
the response for the ith subject in Group j for the lth variable and let y
ij
= (y
ij1
, . . . , y
ij2
)
T
denote the vector of responses for this subject. Consider the model,
y
ij
=
j
+e
ij
, j = 1, . . . , k , i = 1, . . . , n
k
,
where the e
ij
are independent and identically distributed. Let n =

n
j
denote the total
sample size. Let Y
nd
denote the matrix of responses, (the y
ij
s are stacked sequentially by
group and let be the corresponding nd matrix of e
ij
. Let = (
1
, . . . ,
k
)
T
be the k d
matrix of parameters. We can then write the model as
Y = W + , (6.7.1)
where W is the incidence matrix in expression ( 4.2.5). This is our full model and it is the
multivariate analog of the basic model of Chapter 4, ( 4.2.1). If
j
is the vector of medians
then this is the multivariate medians model. On the other hand, if
j
is the vector of
means then this is the multivariate means model.
We are interested in the following general hypotheses:
H
0
: MK = O versus H
A
: MK ,= O , (6.7.2)
where M is an r k contrast matrix (the rows of M sum to zero) of rank r and K is a d s
matrix of rank s.
In order to use the theory of Section 6.6 we need to transform Model ( 6.7.1) into a
model of the form ( 6.6.2). Consider the k k elementary column matrix E which replaces
the rst column of a matrix by the sum of all columns of the matrix; i.e,
[c
1
c
2
c
k
]E =
_
k

i=1
c
i
c
2
c
k
_
, (6.7.3)
for any matrix [c
1
c
2
c
k
]. Note that E is nonsingular. Hence we can write Model ( 6.7.1)
as
Y = W + = WEE
1
+ = [1 W
1
]
_

T

_
+ , (6.7.4)
6.7. EXPERIMENTAL DESIGNS 413
where W
1
is the last k 1 columns of W and E
1
= [
T
]
T
. This is a model of the form
( 6.6.2). Since M is a contrast matrix, its rows sum to zero. Hence the hypothesis simplies
to:
MK = MEE
1
K = [0 M
1
]
_

T

_
K = M
1
K . (6.7.5)
Therefore the hypotheses ( 6.7.2) can be tested by the procedure ( 6.6.28) based on the t
of Model ( 6.7.4).
Most of the interesting hypotheses in MANOVA can be written in the form ( 6.7.2)
for some specied contrast matrix M. Therefore based on the theory developed in Section
6.6, a robust rank-based methodology can be developed for MANOVA type models. This
methodology is demonstrated in Example 6.7.1, which follows, and Exercise 6.8.23.
For the multivariate setting, Davis and McKean (1993) developed an analog of Theorem
3.5.7 which gives the joint asymptotic distribution of [

T
]
T
. They further developed a test
of the hypothesis H
0
: MK = O, where M is any full row rank matrix, not necessarily
a contrast matrix. Hence, this provides a robust rank-based analysis for any multivariate
linear model.
Example 6.7.1. Paspalum Grass
This data set, discussed on page 460 of Seber (1984), concerns the eect on growth of
paspalum grass due to a fungal infection. The experiment was a 4 2 twoway design. Half
of the forty-eight pots of paspalum grass in the experiment were inoculated with a fungal
infection and half were left as controls. The second factor was the temperature (14, 18, 22,
26
o
C) at which the inoculation was applied. The design was balanced so that 6 plants were
used for each combination of treatment and temperature. After a specied amount of time,
the following three measurements were made on each plant:
y
1
= the fresh weight of the roots of the plant (gm)
y
2
= the maximum root length of the plant (mm)
y
3
= the fresh weight of the tops of the plant (gm) .
The data are are displayed in Table 6.7.1.
As a full model we t Model 6.7.1. Based on the residual analysis found in Exercise
6.8.24, though, the t clearly shows heteroscedasticity and suggests the log transformation.
The subsequent analysis is based on the transformed data. Table 6.7.2 displays the estimates
of Model 6.7.1 based on the Wilcoxon score function and LS. Note the ts are very similar.
The estimates of the vector and the matrix A
T
A are also displayed.
The hypotheses of interest concern the average main eects and interaction. For Model
414 CHAPTER 6. MULTIVARIATE
Table 6.7.1: Responses for the Paspalum Grass Data, Example 6.7.1
Treatments
Temperature Control Inoculated
2.2 23.5 1.7 2.3 23.5 2.0
3.0 27.0 2.3 3.0 21.0 2.7
14
o
C 3.3 24.5 3.2 2.3 22.0 1.8
2.2 20.5 1.5 2.5 22.5 2.4
2.0 19.0 2.0 2.4 21.5 1.1
3.5 23.5 2.9 2.7 25.0 2.6
21.8 41.5 23.0 10.1 43.5 14.2
11.0 32.5 15.4 7.6 27.0 14.7
18
o
C 16.4 46.5 22.8 19.7 32.5 21.4
13.1 31.0 21.5 4.3 28.5 9.7
15.4 41.5 20.8 5.2 33.5 12.2
14.5 46.0 20.3 3.9 24.5 8.2
13.6 29.5 30.8 10.0 21.0 23.6
6.2 23.5 14.6 12.3 49.0 28.1
22
o
C 16.7 58.5 36.0 4.9 28.5 13.3
12.2 40.5 23.9 9.6 27.0 24.6
8.7 37.0 20.3 6.5 29.0 19.3
12.3 41.5 27.7 13.6 30.5 31.5
3.0 24.0 10.2 4.2 25.5 13.3
5.3 26.5 15.6 2.2 23.5 8.5
26
o
C 3.1 24.5 14.7 2.8 19.5 11.8
4.8 34.0 20.5 1.3 21.5 7.8
3.4 22.5 14.3 4.2 28.5 15.1
7.4 32.0 23.2 3.0 25.0 11.8
6.7. EXPERIMENTAL DESIGNS 415
Table 6.7.2: Estimates based on the Wilcoxon and LS ts for the Paspalum Grass Data,
Example 6.7.1. V is the variance-covariance matrix of vector of random errors .
Wilcoxon Fit LS Fit
Components Components
Parameter (1) (2) (3) (1) (2) (3)

11
1.04 3.14 .82 .97 3.12 .78

21
2.74 3.70 3.05 2.71 3.67 3.02

31
2.47 3.63 3.25 2.40 3.61 3.20

41
1.49 3.29 2.79 1.45 3.29 2.76

12
.94 3.12 .77 .92 3.12 .70

22
1.95 3.43 2.58 1.96 3.43 2.55

32
2.26 3.36 3.19 2.19 3.39 3.11

42
1.09 3.18 2.45 1.01 3.17 2.41
or .376 .188 .333 .370 .197 .292
1.04 .62 .92 .14 .04 .09
A
T
A or V .62 1.04 .57 .04 .04 .03
.92 .57 1.04 .09 .03 .09
6.7.1, matrices for treatment eects, temperature eects and interaction are given by
M
Treat.
= [ 1 1 1 1 1 1 1 1 ]
M
Temp.
=
_
_
1 1 0 0 1 1 0 0
1 0 1 0 1 0 1 0
1 0 0 1 1 0 0 1
_
_
M
Treat.Temp.
=
_
_
1 1 0 0 1 1 0 0
0 1 1 0 0 1 1 0
0 0 1 1 0 0 1 1
_
_
.
Take the matrix K to be I
3
. Then the hypotheses of intertest can be expressed as MK = O
for the above M-matrices. Using the summary statistics in Table 6.7.2 and the elemenatry
column matrix E, as dened above expression ( 6.7.3), we obtained the test statistics Q
MV RK
,
( 6.6.27) based on the Wilcoxon t. For comparison we also obtain the LS test statistics
Q
LH
, ( 6.6.29). The values of these statistics for the hypotheses of interest are summarized
in Table 6.7.3. The test for interaction is not signicant while both main eects, Treatment
and Temperature, are signicant. The results are quite similar for the traditional test also.
We also tabulated the marginal test statistics, F

. The results for each component are the


similar to the multivariate result.
416 CHAPTER 6. MULTIVARIATE
Table 6.7.3: Test statistics Q
MV RK
and Q
LH
based on the Wilcoxon and LS ts, respec-
tively, for the Paspalum Grass Data, Example 6.7.1. Marginal F-tests are also given. The
numerator degrees of freedom are given. Note that the denominator degrees of freedom for
the marginal F-tests is 40.
Wilcoxon LS
MVAR Marginal F

MVAR Marginal F
LS
Eect df Q
MV RK
df (1) (2) (3) df Q
LH
df (1) (2) (3)
Treat. 3 14.9 1 9.19 7.07 11.6 3 12.2 1 11.4 6.72 8.66
Temp. 9 819 3 32.5 13.4 61.4 9 980 3 45.2 13.4 162
Treat. Temp. 9 11.2 3 2.27 1.49 1.35 9 7.98 3 2.01 .79 1.36
6.8 Exercises
6.8.1. Show that the vector of sample means of the components is ane equivariant. See
Denition 6.1.1.
6.8.2. Compute the gradients of the three criterion functions ( 6.1.3)-( 6.1.5).
6.8.3. Show that in the univariate case S
2
() = S
3
(), ( 6.1.7) and ( 6.1.8).
6.8.4. Establish ( 6.2.7).
6.8.5. Construct an example in the bivariate case for which the mean vector rotates into
the new mean vector but the vector of componentwise medians does not rotate into the new
vector of medians.
6.8.6. Students were given a math aptitude and reading comprehension test before starting
an intensive study skills workshop. At the end of the program they were given the test again.
The following data represents the change in the math and reading tests for the 5 students
in the program.
Math Reading
11 7
20 40
-10 -4
10 12
16 5
We would like to test the hypothesis H
0
: = 0 vs H
A
: ,= 0. Following the discussion
at the beginning of Section 6.2.2, nd the sign change distribution of the componentwise
sign test and nd the conditional p-value.
6.8.7. Prove Theorem 6.2.1.
6.8. EXERCISES 417
6.8.8. Using the projection method discussed in Chapter 2, derive the projection of the
statistic given in ( 6.2.14).
6.8.9. Apply Lemma 6.2.1 and show that ( 6.2.19) provides the bounds on the testing
eciency of the Wilcoxon test relative to Hotellings test in the case of a bivariate normal
distribution.
6.8.10. Prove Theorem 6.3.1.
6.8.11. Show that ( 6.3.13) can be generalized to k dimensions.
6.8.12. Consider the spatial L
1
methods.
(a). Show that the eciency of the spatial L
1
methods relative to the L
2
methods with a
k-variate spherical model is given by
e
k
(spatial L
1
, L
2
) =
_
k 1
k
_
2
E(r
2
)[E(r
1
)]
2
(b). Next assume that the kvariate spherical model is normal. Show that Er
1
=
[(k1)/2)]

2(k/2)
with (1/2) =

.
6.8.13. . Show that the spatial median is equivariant and that the spatial sign test is
invariant under orthogonal transformations of the data;.
6.8.14. Verify ( 6.3.15).
6.8.15. Complete the proof of Theorem 6.4.2 by establishing the third formula for S
8
(0).
6.8.16. Show that the Oja median and Oja sign test are ane equivariant and ane invari-
ant, respectively. See Section 6.4.3.
6.8.17. Show that the maximum breakdown point for a translation equivariant estimator is
(n+1)/(2n). An estimator is translation equivariant if T(X + a1) = T(X) + a1, for every
real a. Note that 1 is the vector of all ones.
6.8.18. Verify ( 6.6.6).
6.8.19. Show that ( 6.6.18) can be derived from ( 6.6.17).
6.8.20. Fill in the details of the proof of Theorem 6.6.1.
6.8.21. Show that A
R
= 15.69 in Example 6.6.2, using Table 6.6.1.
6.8.22. Verify formula ( 6.6.20).
418 CHAPTER 6. MULTIVARIATE
6.8.23. Consider Model ( 6.7.1) for a repeated measures design in which the responses are
recorded on the same variable over time; i.e., y
ijl
is response for the ith subject in Group
j at time period l. In this model the vector
j
is the prole vector for the jth group and
the plot of
ij
versus i is called the prole plot for Group j. Let
j
denote the estimate of

j
based on the R-t of Model ( 6.7.1). The plot of
ij
versus j is called the sample prole
plot of Group j. These group plots are overlaid and are called the sample proles. A
hypothesis of interest is whether or not the population proles are parallel.
(a.) Let A
t1
be the (t 1) t matrix given by
A
t1
=
_

_
1 1 0 0
0 1 1 0
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 1
_

_
.
Show that parallel proles are equivalent to the null hypothesis H
0
dened by:
H
0
: A
k1
A
d1
= O versus H
A
: A
k1
A
d1
,= O , (6.8.1)
where is dened in Model 6.7.1. Hence show that a test of parallel proles can be
based on the test ( 6.6.28).
(b.) The data in Table 6.8.1 are the times (in seconds) it took three dierent species (A,
B, and C) of rats to run a maze at four dierent times (I, II, III, and IV). Each row
contains the scores of a single rat. Compare the sample prole plots based on Wilcoxon
and LS estimates.
(c.) Test the hypotheses ( 6.8.1) using the procedure ( 6.6.28) based on Wilcoxon scores.
Repeat using the LS test procedure ( 6.6.29).
(d.) Repeat items (b) and (c) if the 13th rat at time period 2 took 80 seconds to run the
maze instead of 34. Note that p-value of the LS procedure changes from .77 to .15
while the p-value of the Wilcoxon procedure changes from .95 to .85.
6.8.24. Consider the data of Example 6.7.1.
(a.) Using the Wilcoxon scores, t Model ( 6.7.4) to the original data displayed in Table
6.7.1. Obtain the marginal residual plots which show heteroscedasticity. Reason that
the log transformation is appropriate. Show that the residual plots based on the
transformed remove much of the heteroscedasticity. For both the transformed and
original data obtain the internal Wilcoxon studentized residuals. Identify the outliers.
(b.) In order to see the eect of the transformation, obtain the Wilcoxon and LS analyses
of Example 6.7.1 based on the original data. Discuss your ndings.
6.8. EXERCISES 419
Table 6.8.1: Data for Exercise 6.8.23
Group A Group B Group C
Times Times Times
Rat I II III IV Rat I II III IV Rat I II III IV
1 47 53 51 28 6 44 57 46 27 11 45 33 30 18
2 35 66 38 39 7 47 29 21 30 12 30 50 21 25
3 43 40 34 40 8 28 76 29 39 13 33 32 32 24
4 49 60 44 32 9 57 63 60 15 14 44 62 38 22
5 41 61 38 32 10 34 62 41 27 15 40 42 33 24
420 CHAPTER 6. MULTIVARIATE
Appendix A
Asymptotic Results
A.1 Central Limit Theorems
The following version of the Lindeberg-Feller Central Limit Theorem will be useful. A proof
of it can be found in Arnold (1981).
Theorem A.1.1. Consider the sequence of independent random variables W
1n
, . . . , W
nn
, for
n = 1, 2 . . . . Suppose E(W
in
) = 0, Var(W
in
) =
2
in
< , and
max
1in

2
in
0 , as n , (A.1.1)
n

i=1

2
in

2
, 0 <
2
< , as n , (A.1.2)
and
lim
n
n

i=1
E(W
2
in
I

([W
in
[) = 0 , (A.1.3)
for all > 0, where I
a
([x[) is 0 or 1 when [x[ > a or [x[ a, respectively. Then
n

i=1
W
in
D
N(0,
2
) .
A useful corollary to this theorem is given next; see, also, page 153 of Hajek and

Sidak
(1967).
Corollary A.1.1. Suppose that the sequence of random variables X
1
, . . . , X
n
are iid with
E(X
i
) = 0 and Var(X
i
) =
2
< . Suppose the sequence of constants a
1n
, . . . , a
nn
are such
that
n

i=1
a
2
in

2
a
, as n , 0 <
2
a
< , (A.1.4)
max
1in
[a
in
[ 0 , as n . (A.1.5)
421
422 APPENDIX A. ASYMPTOTIC RESULTS
Then
n

i=1
a
in
X
i
D
N(0,
2

2
a
) .
Proof: Take W
in
of Theorem A.1.1 to be a
in
X
i
. Then the mean of W
in
is 0 and its
variance is
2
in
= a
2
in

2
. By ( A.1.5), max
2
in
0 and by ( A.1.4),

2
in

2

2
a
. Hence we
need only show that condition ( A.1.3) is true. For i = 1, . . . , n, dene
W

in
= max
1jn
[a
jn
[[X
i
[ .
Then [W

in
[ [W
in
[; hence, I

([W
in
[) I

([W

in
[), for > 0. Therefore,
n

i=1
E
_
W
2
in
I

([W
in
[)

i=1
E
_
W
2
in
I

([W

in
[)

=
_
n

i=1
a
2
in
_
E
_
X
2
1
I

([W

1n
[)

. (A.1.6)
Note that the sum in braces converges to
2

2
a
. Because X
2
1
I

([W

1n
[) converges to 0 pointwise
and it is bounded above by the integrable function X
2
1
, it then follows that by Lebesgues
Dominated Convergence Theorem that the right side of ( A.1.6) converges to 0. Thus
condition ( A.1.3) of Theorem A.1.1 is true and we are nished.
Note that the simple Central Limit Theorem follows from this corollary by taking a
in
=
n
1/2
, so that ( A.1.4) and ( A.1.5) hold.
A.2 Simple Linear Rank Statistics
In the next two subsections, we present the asymptotic distribution theory for a simple
linear rank statistic under the null and local alternative hypotheses. This theory is used in
Chapters 1 and 2 for location models and, also, in Section A.3, it will be useful in establishing
asymptotic linearity and quadraticity results for Chapters 3 and 5. The theory for a simple
linear rank statistic is presented in detail in Chapters 5 and 6 of the book by Hajek and

Sidak (1967); hence, here we will only present a heuristic development with appropriate
references to Hajek and

Sidak. Also, Chapter 8 of Randles and Wolfe (1979) presents a
detailed development of the null asymptotic theory of a simple linear rank statistic.
In this section we assume that the sequence of random variables Y
1
, . . . , Y
n
are iid with
common density function f(y) which follows Assumption E.1, ( 3.4.1). Let x
1
, . . . , x
n
denote
a sequence of centered, (x = 0), regression coecients and assume that they follow as-
sumptions D.2, ( 3.4.7), and D.3, ( 3.4.8). For this one-dimensional case, these assumptions
simplify to:
max x
2
i

n
i=1
x
2
i
0 (A.2.1)
1
n
n

i=1
x
2
i

2
x
,
2
x
> 0 , (A.2.2)
A.2. SIMPLE LINEAR RANK STATISTICS 423
for some constant
2
x
. It follows from these assumptions that max
i
[x
i
[/

n 0, a fact that
we will nd useful. Assume that the score function (u) is dened on the interval (0, 1) and
that it satises (S.1), ( 3.4.10); in particular,
_
1
0
(u) du = 0 and
_
1
0

2
(u) du = 1 . (A.2.3)
Consider then the linear rank statistics,
S =
n

i=1
x
i
a(R(Y
i
)) , (A.2.4)
where the scores are generated as a(i) = (i/(n + 1)).
A.2.1 Null Asymptotic Distribution Theory
It follows immediately that the mean and variance of S are given by
E(S) = 0 and Var(S) =

n
i=1
x
2
i
_
1
n1

n
i=1
a
2
(i)
_
.
=

n
i=1
x
2
i
, (A.2.5)
where the approximation is due to the fact that the quantity in braces is a Riemann sum of
_
1
0

2
(u) du = 1.
Note that we can write S as
S =
n

i=1
x
i

_
n
n + 1
F
n
(Y
i
)
_
, (A.2.6)
where F
n
is the empirical distribution function of Y
1
, . . . , Y
n
. This suggests the approxima-
tion,
T =
n

i=1
x
i
(F(Y
i
)) . (A.2.7)
We have immediately from ( A.2.3) that the mean and variance of T are
E(T) = 0 and Var(T) =

n
i=1
x
2
i
. (A.2.8)
Furthermore, by assumptions ( A.2.1) and ( A.2.2), we can apply Corollary A.1.1 to show
that
1

n
T is asymptotically distributed as N(0,
2
x
) . (A.2.9)
Because the means of S and T are the same, it will follow that S has the same asymptotic
distribution as T provided the second moment of their dierence goes to 0. But this follows
from the string of inequalities:
E
_
_
1

n
S
1

n
T
_
2
_
=
1
n
E
_
_
_
n

i=1
x
i
_

_
n
n + 1
F
n
(Y
i
)
_
(F(Y
i
))
_
_
2
_
_

n
n 1
_
1
n
n

i=1
x
2
i
_
E
_
_

_
n
n + 1
F
n
(Y
1
)
_
(F(Y
1
))
_
2
_

2
x
0 ,
424 APPENDIX A. ASYMPTOTIC RESULTS
where the inequality and the derivation of the limit is given on page 160 of Hajek and

Sidak
(1967). This results in the following theorem,
Theorem A.2.1. Under the above assumptions,
1

n
(T S)
P
0 , (A.2.10)
and
1

n
S
D
N(0,
2
x
) . (A.2.11)
Hence we have established the null asymptotic distribution theory of a simple linear rank
statistic.
A.2.2 Local Asymptotic Distribution Theory
We rst need the denition of contiguity between two sequences of densities.
Denition A.2.1. A sequence of densities q
n
is contiguous to another sequence of den-
sities p
n
, if for any sequence of events A
n
,
_
An
p
n
0
_
An
q
n
0 .
This concept is discussed in some detail in Hajek and

Sidak (1967).
The following fact follows immediately from this denition. Suppose the sequence of
densities q
n
is contiguous to the sequence of densities p
n
. Let X
n
be a sequence of
random variables. If X
n
P
0 under p
n
then X
n
P
0 under q
n
.
Then according to LeCams First Lemma, if log(q
n
/p
n
) is asymptotically N(
2
/2,
2
)
under p
n
, then q
n
is contiguous to p
n
. Further by LeCams Third Lemma, if (S
n
, log(q
n
/p
n
))
is asymptotically bivariate normal (
1
,
2
,
2
1
,
2
2
,
1

2
) with
2
=
2
2
/2 under p
n
, then S
n
is asymptotically N(
1
+
1

2
,
2
1
) under q
n
; see pages 202-209 in Hajek and

Sidak (1967).
In this section, we assume that the random variables Y
1
, . . . , Y
n
and the regression coe-
cients x
1
, . . . , x
n
follow the same assumptions that we made in the last section; see expressions
( A.2.1) and ( A.2.2). We denote the likelihood function of Y
1
, . . . , Y
n
by
p
y
=
n
i=1
f(y
i
) . (A.2.12)
In the last section we derived the asymptotic distribution of S under p
y
. In this section we
are further concerned with the likelihood function
q
d
=
n
i=1
f(y
i
+ d
i
) , (A.2.13)
A.2. SIMPLE LINEAR RANK STATISTICS 425
for a sequence of constants d
1
, . . . , d
n
which satises the conditions
n

i=1
d
i
= 0 (A.2.14)
n

i=1
d
2
i

2
d
> 0 , as n (A.2.15)
max
1in
d
2
i
0 , as n (A.2.16)
1

n
n

i=1
x
i
d
i

xd
, as n . (A.2.17)
In applications (eg. power in simple linear models) we take d
i
= x
i
/

n. For x
i
s following
assumptions ( A.2.1) and ( A.2.2), the above assumptions would hold for these d
i
s.
In this section, we establish the asymptotic distribution of S under q
d
. Consider the log
of the ratio of the likehood functions q
d
and p
y
given by
l() =
n

i=1
log
f(Y
i
+ d
i
)
f(Y
i
)
. (A.2.18)
Expanding l() about 0 and evaluating the resulting expression at = 1 results in
l =
n

i=1
d
i
f

(Y
i
)
f(Y
i
)
+
1
2
n

i=1
d
2
i
f(Y
i
)f

(Y
i
) (f

(Y
i
))
2
f
2
(Y
i
)
+ o
p
(1) , (A.2.19)
provided that the third derivative of the log-ratio, evaluated at 0, is square integrable.
Under p
y
, the middle term converges in probability to I(f)
2
d
/2, provided that the second
derivative of the log-ratio, evaluated at 0, is square integrable.
Hence, under p
y
and some further regularity conditions we can write,
l =
n

i=1
d
i
f

(Y
i
)
f(Y
i
)

I(f)
2
d
2
+ o
p
(1) . (A.2.20)
The random variables in the rst term, f

(Y
i
)/f(Y
i
) are iid with mean 0 and variance I(f).
Because the sequence d
1
, . . . , d
n
satises ( A.2.14) - ( A.2.16), we can use Corollary A.1.1 to
show that, under p
y
, l converges in distribution to N(I(f)
2
d
/2, I(f)
2
d
). By the denition
of contiguity ( A.2.1) and the immediate following discussion of LeCams rst lemma, we
have the result
the densities q
d
=
n
i=1
f(y
i
+ d
i
) are contiguous to p
y
=
n
i=1
f(y
i
) ; (A.2.21)
see, also, page 204 of Hajek and

Sidak (1967).
We next establish the key result:
426 APPENDIX A. ASYMPTOTIC RESULTS
Theorem A.2.2. For T given by ( A.2.7) and under p
y
and the assumptions ( 3.4.1),
( A.2.1), ( A.2.2), ( A.2.14) - ( A.2.17),
_
1

n
T
l
_
D
N
2
__
0

I(f)
2
d
2
_
,
_

2
x

xd

xd

f
I(f)
2
d
_
_
. (A.2.22)
Proof: Consider the random vector V = (T/

n, l)

, where T is dened in expression ( A.2.7).


To show that V is asymptotically normal under p
n
it suces to show that for t 1
2
, t ,= 0,
t

V is asymptotically univariate normal. By the above discussion, for the second component
of V, we need only be concerned with the rst term in expression ( A.2.19); hence, for
t = (t
1
, t
2
)

, dene the random variables W


in
by
n

i=1
_
1

n
x
i
t
1
(F(Y
i
)) +t
2
d
i
f

(Y
i
)
f(Y
i
)
_
=
n

i=1
W
in
. (A.2.23)
We want to apply Theorem A.1.1. The random variables W
in
are independent and have
mean 0. After some simplication, we can show that the variance of W
in
is

2
in
=
1
n
x
2
i
t
2
1
+ t
2
2
d
2
i
I(f) 2t
1
t
2
d
i
x
i

f
, (A.2.24)
where
f
is given by

f
=
_
1
0
(u)
_

(F
1
(u))
f(F
1
(u))
_
du . (A.2.25)
Note by assumptions ( A.2.1), ( A.2.2), and ( A.2.15) - ( A.2.17) that
n

i=1

2
in
t
2
1

2
x
+ t
2
2

2
d
I(f) 2t
1
t
2

xd
> 0 , (A.2.26)
and that
max
1in

2
in
max
1in
1
n
x
2
i
t
2
1
+ t
2
2
I(f) max
1in
d
2
i
+ 2[t
1
t
2
[
f
max
1in
1

n
[x
i
[ max
1in
[d
i
[ 0 ; (A.2.27)
hence conditions ( A.1.2) and ( A.1.1) are true. Thus to obtain the result we need to show
lim
n
n

i=1
E
_
W
2
in
I

([W
in
[)

= 0 , (A.2.28)
for > 0. But [W
in
[ W

in
where
W

in
= [t
1
[ max
1jn
1

n
[x
j
[[(F(Y
i
))[ +[t
2
[ max
1jn
[d
j
[

(Y
i
)
f(Y
i
)

.
A.2. SIMPLE LINEAR RANK STATISTICS 427
Hence,
n

i=1
E
_
W
2
in
I

([W
in
[)

i=1
E
_
W
2
in
I

(W

in
)

=
n

i=1
E
__
1
n
t
2
1
x
2
i

2
(F(Y
1
))
+t
2
2
d
2
i
_
f

(Y
1
)
f(Y
1
)
_
2
+ 2t
1
t
2
1

n
x
i
d
i
(F(Y
1
))
_

(Y
1
)
f(Y
1
)
_
_
I

(W

1n
)
_
= t
2
1
_
n

i=1
1
n
x
2
i
_
E
_

2
(F(Y
1
))I

(W

1n
)

+t
2
2
_
n

i=1
d
2
i
_
E
_
_
f

(Y
i
)
f(Y
i
)
_
2
I

(W

1n
)
_
+2t
1
t
2
_
1

n
n

i=1
x
i
d
i
_
E
_
(F(Y
1
))
_

(Y
1
)
f(Y
1
)
_
2
I

(W

1n
)
_
. (A.2.29)
Because I

(W

1n
) 0 pointwise and each of the other random variables in the expectations of
( A.2.29) are absolutely integrable, the Lebesgue Dominated Convergence Theorem implies
that each of these expectations converge to 0. The desired limit in expression ( A.2.28),
then follows from assumptions ( A.2.1), ( A.2.2) and ( A.2.15) - ( A.2.17). Hence V is
asymptotically bivariate normal. We can obtain its asymptotic variance-covariance matrix
from expression ( A.2.26), which completes the proof.
Based on Theorem A.2.2, an application of LeCams third lemma leads to the asymptotic
distribution of T/

n under local alternatives which we state in the following theorem.


Theorem A.2.3. Under the sequence of densities q
d
=
n
i=1
f(y
i
+d
i
), and the assumptions
( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) - ( A.2.17),
1

n
T
D
N(
xd

f
,
2
x
) , (A.2.30)
1

n
S
D
N(
xd

f
,
2
x
) . (A.2.31)
The result for S/

n follows because (T S)/

n 0 in probability under the densities


p
y
; hence, due to the contiguity cited in expression ( A.2.21), (T S)/

n 0, also, under
the densities q
d
. A proof of the asymptotic power lemma, Theorem 2.4.13, follows from this
result.
We now investigate the relationship between S and the shifted process given by
S
d
=
n

i=1
x
i
a(R(Y
i
+ d
i
)) . (A.2.32)
Consider the analogous process,
T
d
=
n

i=1
x
i
(F(Y
i
+ d
i
)) . (A.2.33)
428 APPENDIX A. ASYMPTOTIC RESULTS
We next establish the connection between T and T
d
; see Theorem 1.3.1, also.
Theorem A.2.4. Under the likelihoods q
d
and p
y
, we have the following identity:
P
q
d
_
1

n
T t
_
= P
py
_
1

n
T
d
t
_
. (A.2.34)
Proof: The proof follows from the following string of equalities.
P
q
d
_
1

n
T t
_
= P
q
d
_
1

n
n

i=1
x
i
(F(Y
i
)) t
_
= P
q
d
_
1

n
n

i=1
x
i
(F((Y
i
d
i
) + d
i
)) t
_
= P
py
_
1

n
n

i=1
x
i
(F(Z
i
+ d
i
)) t
_
= P
py
_
1

n
T
d
t
_
,
(A.2.35)
where in line three the sequence of random variables Z
1
, . . . , Z
n
follows the likelihood p
y
.
We next establish an asymptotic relationship between T and T
d
.
Theorem A.2.5. Under p
y
and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) -
( A.2.17),
_
T [T
d
E
py
(T
d
)]

n
_
P
0 .
Proof: Since E(T) = 0, it suces to show that the V [(T T
d
)/

n] 0. We have,
V
_
T T
d

n
_
=
1
n
n

i=1
x
2
i
V [(F(Y
i
)) (F(Y
i
+ d
i
))]

1
n
n

i=1
x
2
i
E [(F(Y
i
)) (F(Y
i
+ d
i
)]
2
=
1
n
n

i=1
x
2
i
_

[(F(y)) (F(y + d
i
)]
2
f(y) dy

_
1
n
n

i=1
x
2
i
_
__

max
1in
[(F(y)) (F(y + d
i
)]
2
f(y) dy
_
.
The rst factor in the last expression converges to
2
x
; hence, it suces to show that the lim
of the second factor is 0. Fix y. Let > 0 be given. Then since (u) is continuous a.e. we can
A.2. SIMPLE LINEAR RANK STATISTICS 429
assume it is continuous at F(y). Hence there exists a
1
> 0 such that [(z) (F(y))[ <
for [zF(y)[ <
1
. By the uniform continuity of F, choose
2
> 0 such that [F(t)F(s)[ <
1
for [s t[ <
2
. By ( A.2.16) choose N
0
so that for n > N
0
implies
max
1in
[d
i
[ <
2
.
Thus for n > N
0
,
[F(y) F (y + d
i
)[ <
1
, for i = 1, . . . , n ,
and, hence,
[(F(y)) (F (y + d
i
))[ < , for i = 1, . . . , n .
Thus for n > N
0
,
max
1in
[(F(y)) (F(y + d
i
)]
2
<
2
,
Therefore,
lim
__

max
1in
[(F(y)) (F(y + d
i
))]
2
f(y) dy
_

2
,
and we are nished.
The next result yields the asymptotic mean of T
d
.
Theorem A.2.6. Under p
y
and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) -
( A.2.17),
E
py
_
1

n
T
d
_

xd
.
Proof: By Theorem A.2.3,
1

n
T
f

xd

x
D
N(0, 1) , under q
d
.
Hence by the transformation Theorem A.2.4,
1

n
T
d

xd

x
D
N(0, 1) , under p
y
. (A.2.36)
By ( A.2.9),
1

n
T

x
D
N(0, 1) , under p
y
;
hence by Theorem A.2.5, we must have
1

n
T
d
E
_
1

n
T
d
_

x
D
N(0, 1) , under p
y
. (A.2.37)
The conclusion follows from the results ( A.2.36) and ( A.2.37).
430 APPENDIX A. ASYMPTOTIC RESULTS
By the last two theorems we have under p
y
1

n
T
d
=
1

n
T +
f

xd
+ o
p
(1) .
We need to express these results for the random variables S, ( A.2.4), and S
d
, ( A.2.32).
Because the densities q
d
are contiguous to p
y
and (T S)/

n 0 in probability under p
y
,
it follows that (T S)/

n 0 in probability under q
d
. By a change of variable this means
(T
d
S
d
)/

n 0 in probability under p
y
. This discussion leads to the following two results
which we state in a theorem.
Theorem A.2.7. Under p
y
and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) -
( A.2.17),
1

n
S
d
=
1

n
S +
f

xd
+ o
p
(1) (A.2.38)
1

n
S
d
=
1

n
T +
f

xd
+ o
p
(1) . (A.2.39)
Next we relate the result Theorem A.2.7 to ( 2.5.27), the asymptotic linearity of the
general scores statistic in the two sample problem. Recall in the two sample problem that
c
i
= 0 for 1 i n
1
and c
i
= 1 for n
1
+ 1 i n
1
+n
2
= n, ( 2.2.1). Hence, x
i
= c
i
c =
n
2
/n for 1 i n
1
and x
i
= n
1
/n for n
1
+ 1 i n. Dening d
i
= x
i
/

n, it is easy
to check that conditions ( A.2.14) - ( A.2.17) hold with
xd
=
1

2
. Further ( A.2.32)
becomes S

(/

n) =

x
i
a(R(Y
i
x
i
/

n)) and ( A.2.4) becomes S

(0) =

x
i
a(R(Y
i
)),
where a(i) = (i/(n + 1)),
_
= 0 and
_

2
= 1. Hence ( A.2.38) becomes
1

n
S

(/

n) =
1

n
S

(0)
1

f
+ o
p
(1) .
Finally using the usual partition argument, Theorem 1.5.6, and the monotonicity of S

(/

n)
we have:
Theorem A.2.8. Assuming Finite Fisher information, nondecreasing and square integrable
(u), and n
i
/n
i
, 0 <
i
< 1, i = 1, 2,
P
px
_
sup

n||c

n
S

n
_

n
S

(0) +
1


_
0 , (A.2.40)
for all > 0 and for all c > 0.
This theorem establishes ( 2.5.27). As a nal note from ( A.2.11), n
1/2
S

(0) is asymp-
totically N(0,
2
x
), where
2
x
=
2
(0) = limn
1

x
2
i
=
1

2
. Hence to determine the ecacy
using this approach, we have
c

=

1

f
(0)
=
_

, (A.2.41)
see ( 2.5.28).
A.2. SIMPLE LINEAR RANK STATISTICS 431
A.2.3 Signed-Rank Statistics
In this section we develop the asymptotic local behavior for the general signed rank statistics
dened in Section 1.8. Assume that X
1
, . . . X
n
are a random sample having distribution
function H(x) with density h(x) which is symmetric about 0. Recall that general signed
rank statistics are given by
T

+ =

a
+
(R([X
i
[))sgn(X
i
) , (A.2.42)
where the scores are generated as a
+
(i) =
+
(i/(n + 1)) for a nonnegative and square
integrable function
+
(u) which is standardized such that
_
(
+
(u))
2
du = 1.
The null asymptotic distribution of T

+ was derived in Section 1.8 so here we will be


concerned with its behavior under local alternatives. Also the derivations here are similar
to those for simple linear rank statistics, Section A.2.2; hence, we will be brief.
Note that we can write T

+ as
T

+ =

+
_
n
n + 1
H
+
n
([X
i
[)
_
sgn(X
i
) ,
where H
+
n
denotes the empirical distribution function of [X
1
[, . . . , [X
n
[. This suggests the
approximation
T

+ =

+
(H
+
([X
i
[))sgn(X
i
) , (A.2.43)
where H
+
(x) is the distribution function of [X
i
[.
Denote the likelihood of the sample X
1
, . . . X
n
by
p
x
=
n
i=1
h(x
i
) . (A.2.44)
A result that we will need is,
1

n
_
T

+ T

+
_
P
0 , under p
x
. (A.2.45)
This result is shown on page 167 of Hajek and

Sidak (1967).
For the sequence of local alternatives, b/

n with b 1, (here we are taking d


i
= b/

n),
we denote the likelihood by
q
b
=
n
i=1
h
_
x
i

n
_
. (A.2.46)
For b 1, consider the log of the likelihoods given by,
l() =
n

i=1
log
h(X
i

n
)
h(X
i
)
. (A.2.47)
If we expand l() about 0 and evaluate it at = 1, similar to the expansion ( A.2.19), we
obtain
l =
b

n
n

i=1
h

(X
i
)
h(X
i
)
+
b
2
2n
n

i=1
h(X
i
)h

(X
i
) (h

(X
i
))
2
h
2
(X
i
)
+ o
p
(1) , (A.2.48)
432 APPENDIX A. ASYMPTOTIC RESULTS
provided that the third derivative of the log-ratio, evaluated at 0, is square integrable.
Under p
x
, the middle term converges in probability to I(h)b
2
/2, provided that the second
derivative of the log-ratio, evaluated at 0, is square integrable. An application of Theorem
A.1.1 shows that l converges in distribution to a N(
I(h)b
2
2
, I(h)b
2
). Hence, by LeCams rst
lemma,
the densities q
b
=
n
i=1
h
_
x
i

n
_
are contiguous to p
x
=
n
i=1
h(x
i
) ; (A.2.49)
Similar to Section A.2.2, by using Theorem A.1.1 we can derive the asymptotic distri-
bution of the random vector (T

+
/

n, l), which we record as:


Theorem A.2.9. Under p
x
and some regularity conditions on h,
_
1

n
T

+
l
_
D
N
2
__
0

I(h)b
2
2
_
,
_
1 b
h
b
h
I(h)b
2
__
, (A.2.50)
where
h
= 1/

+ and

+ is given in expression ( 1.8.24).


By this last theorem and LeCams third lemma, we have
1

n
T

+
D
N(b
h
, 1) , under q
b
. (A.2.51)
By the result on contiguity, ( A.2.49), the test statistic T

+/

n has the same distribution


under q
b
. A proof of the asymptotic power lemma, Theorem 1.8.1, follows from this result.
Next consider a shifted version of T

+
given by
T

b
+ =
n

i=1

+
_
H
+
_

X
i
+
b

__
sgn
_
X
i
+
b

n
_
. (A.2.52)
The following identity is readily established:
P
q
b
[T

+ t] = P
px
[T

b
+ t] ; (A.2.53)
see, also, Theorem 1.3.1. We need the following theorem:
Theorem A.2.10. Under p
x
,
_
T

+
[T

b
+
E
px
(T

b
+
)]

n
_
P
0 .
Proof: As in Theorem A.2.5, it suces to show that V [(T

+
T

b
+
)/

n] 0. But this
variance reduces to
V
_
T

+
T

b
+

n
_
=
_

+
_
H
+
([x[)
_
sgn(x)
+
_
H
+
_

x +
b

__
sgn
_
x +
b

n
__
2
h(x) dx .
A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 433
Since
+
(u) is square integrable, the quantity in braces is dominated by an integrable func-
tion. Since it converges pointwise to 0, a.e., an application of the Lebesgue Dominated
Convergence Theorem establishes the result.
Using the above results, we can proceed as we did for Theorem A.2.6 to show that under
p
x
,
E
px
_
1

n
T

b
+
_
b
h
. (A.2.54)
Hence,
1

n
T

b
+ =
1

n
T

+ + b
h
+ o
p
(1) . (A.2.55)
A similar result holds for the signed-rank statistic.
For the results needed in Chapter 1, however, it is convenient to change the notation to:
T

+(b) =
n

i=1
a
+
(R[X
i
b[)sgn(X
i
b) . (A.2.56)
The above results imply that
1

n
T

+() =
1

n
T

+(0)
h
+ o
p
(1) , (A.2.57)
for

n[[ B, for B > 0.
The general signed-rank statistics found in Chapter 1 are based on norms. In this case,
since the scores are nondecreasing, we can strengthen our results to include uniformity; that
is,
Theorem A.2.11. Assuming Finite Fisher information, nondecreasing and square inte-
grable
+
(u),
P
px
[ sup

n||B
[
1

n
T

+()
1

n
T

+(0) +
h
[ ] 0 , (A.2.58)
for all > 0 and all B > 0.
A proof can be obtained by the usual partitioning type of argument on the interval
[B, B]; see the proof of Theorem 1.5.6. Hence, since
_
(
+
(u))
2
du = 1, the ecacy is
given by c

+ =
h
; see ( 1.8.21).
A.3 Results for Rank-Based Analysis of Linear Models
In this section we consider the linear model dened by ( 3.2.3) in Chapter 3. The distribution
of the errors satises Assumption E.1, ( 3.4.1). The design matrix satises conditions D.2,
( 3.4.7), and D.3, ( 3.4.8). We shall assume without loss of generality that the true vector
of parameters is 0.
434 APPENDIX A. ASYMPTOTIC RESULTS
It will be easier to work with the following transformation of the design matrix and
parameters. We consider such that

n = O(1). Note that we will suppress the notation


indicating that depends on n. Let,
= (X

X)
1/2
, (A.3.1)
C = X(X

X)
1/2
, (A.3.2)
d
i
= c

i
, (A.3.3)
where c
i
is the ith row of C and note that = O(1) because n
1
X

X > 0 and

n = O(1). Then C

C = I
p
and H
C
= H
X
, where H
C
is the projection matrix onto the
column space of C. Note that since X is centered, C is also. Also |c
i
|
2
= h
2
nii
where h
2
nii
is the ith diagonal entry of H
X
. It is straightforward to show that c

i
= x

i
. Using the
conditions (D.2) and (D.3), the following conditions are readily established:
d = 0 (A.3.4)
n

i=1
d
2
i

n

i=1
|c
i
|
2
||
2
= p||
2
, for all n (A.3.5)
max
1in
d
2
i
||
2
max
1in
|c
i
|
2
(A.3.6)
= ||
2
max
1in
h
2
nii
0 as n ,
since || is bounded.
For j = 1, . . . , p dene
S
nj
() =
n

i=1
c
ij
a(R(Y
i
c

i
)) , (A.3.7)
where the scores are generated by a function which staises (S.1), ( 3.4.10). We now show
that the theory established in the Section A.2 for simple linear rank statistics holds for S
nj
,
for each j.
Fix j, then the regression coecients x
i
of Section A.2 are given by x
i
=

nc
ij
. Note
from ( A.3.2) that

x
2
i
/n =

c
2
ij
= 1; hence, condition ( A.2.2) is true. Further by ( A.3.6),
max
1in
x
2
i

n
i=1
x
2
i
= max
1in
c
2
ij
0 ;
hence, condition ( A.2.1) is true.
For the sequence d
i
= c

i
, conditions ( A.3.4) - ( A.3.6) imply conditions ( A.2.14) -
( A.2.16), (the upper bound in condition ( A.3.6) was actually all that was needed in the
proofs of Section A.2). Finally for ( A.2.17), because C is orthogonal,
xd
is given by

xd
=
1

n
n

i=1
x
i
d
i
=
n

i=1
c
ij
c

i
=
p

k=1
_
n

i=1
c
ij
c
ik
_

k
=
j
. (A.3.8)
A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 435
Thus by Theorem A.2.7, for j = 1, . . . , p, we have the results,
S
nj
() = S
nj
(0)
f

j
+ o
p
(1) (A.3.9)
S
nj
() = T
nj
(0)
f

j
+ o
p
(1) , (A.3.10)
where
T
nj
(0) =
n

i=1
c
ij
(F(Y
i
)) . (A.3.11)
Let S
n
()

= (S
n1
(), . . . , S
np
()). Because component-wise convergence in probability
implies that the corresponding vector converges, we have shown that the following theorem
is true:
Theorem A.3.1. Under the above assumptions, for > 0 and for all
lim
n
P (|S
n
() (S
n
(0) ) | ) = 0 . (A.3.12)
The conditions we want are asymptotic linearity and quadraticity. Asymptotic linear-
ity is the condition
lim
n
P
_
sup
c
|S
n
() (S
n
(0) ) |
_
= 0 , (A.3.13)
for arbitrary c > 0 and > 0. This result was rst shown by Jureckova (1971) under more
stringent conditions on the design matrix.
Consider the dispersion function discussed in Chapter 2. In terms of the above notation
D
n
() =
n

i=1
a(R(Y
i
c
i
))(Y
i
c
i
) . (A.3.14)
An approximation of D
n
() is the quadratic function
Q
n
() =

/2

S
n
(0) + D
n
(0) . (A.3.15)
Using Jureckovas conditions, Jaeckel (1972) extended the result ( A.3.13) to asymptotic
quadraticity which is given by
lim
n
P
_
sup
c
[D
n
() Q
n
()[
_
= 0 , (A.3.16)
for arbitrary c > 0 and > 0. Our main result of this section shows that ( A.3.12),
( A.3.13), and ( A.3.16) are equivalent. The proof proceeds as in Heiler and Willers (1988)
who established their results based on convex function theory. Before proceeding with the
proof, for the readers convenience, we present some notes on convex functions.
436 APPENDIX A. ASYMPTOTIC RESULTS
A.3.1 Convex Functions
Let f be a real valued function dened on 1
p
. Recall the denition of a convex function:
Denition A.3.1. The function f is convex if
f(x + (1 )y) f(x) + (1 )f(y) , (A.3.17)
for 0 < < 1. Further, a convex function f is called proper if it is dened on an open set
C 1
p
and is everywhere nite on C.
The convex functions of interest in this appendix are proper with C = R
p
.
The proof of the following theorem can be found in Rockafellar (1970); see pages 82 and
246.
Theorem A.3.2. Suppose f is convex and proper on an open subset C of 1
p
. Then f is
continuous on C and is dierentiable almost everywhere on C.
We will nd it useful to dene a subgradient:
Denition A.3.2. The vector D(x
0
) is called a subgradient of f at x
0
if
f(x) f(x
0
) D(x
0
)

(x x
0
) , for all x C . (A.3.18)
As shown on page 217 of Rockafellar (1970), a proper convex function which is dened
on an open set C has a subgradient at each point in C. Furthermore, at the points of
dierentiability, the subgradient is unique and it agrees with the gradient. This is a theorem
proved on page 242 of Rockafellar which we next state.
Theorem A.3.3. Let f be convex. If f is dierentiable at x
0
then f(x
0
), the gradient of
f at x
0
, is the unique subgradient of f at x
0
.
Hence combining Theorems A.3.2 and A.3.3, we see that for proper convex functions
the subgradient is the gradient almost everywhere; hence if f is a proper convex function we
have,
f(x) f(x
0
) f(x
0
)

(x x
0
) , a.e. x C . (A.3.19)
The next theorem can be found on page 90 of Rockafellar (1970).
Theorem A.3.4. Let the sequence of convex functions f
n
be proper on C and suppose the
sequence converges for all x C

where C

is dense in C. Then the functions f


n
converge
on the whole set C to a proper and convex function f and, furthermore, the convergence is
uniform on each compact subset of C.
The following theorem is a modication by Heiler and Willers (1988) of a theorem found
on page 248 of Rockafellar (1970).
A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 437
Theorem A.3.5. Suppose in addition to the assumptions of the last theorem the limit func-
tion f is dierentiable, then
lim
n
f
n
(x) = f(x) , for all x C . (A.3.20)
Furthermore the convergence is uniform on each compact subset of C.
The following result is proved in Heiler and Willers (1988).
Theorem A.3.6. Suppose the hypotheses of Theorem A.3.4 hold. Assume, also, that the
limit function f is dierentiable. Then
lim
n
f
n
(x) = f(x) , for all x C

(A.3.21)
and
lim
n
f
n
(x
0
) = f(x
0
) , for at least one x
0
C

(A.3.22)
where C

is dense in C, imply that


lim
n
f
n
(x) = f(x) , for all x C (A.3.23)
and the convergence is uniform on each compact subset of C.
A.3.2 Asymptotic Linearity and Quadraticity
We now proceed with Heiler and Willers (1988) proof of the equivalence of ( A.3.12),
( A.3.13), and ( A.3.16).
Theorem A.3.7. Under Model ( 3.2.3) and the assumptions ( 3.4.7), ( 3.4.8), and ( 3.4.1),
the expressions ( A.3.12), ( A.3.13), and ( A.3.16) are equivalent.
Proof:
( A.3.12) ( A.3.16). Both functions D
n
() and Q
n
() are proper convex functions
for 1
p
. Their gradients are given by,
Q
n
() = S
n
(0) (A.3.24)
D
n
() = S
n
() , a.e. 1
p
. (A.3.25)
By Theorem A.3.2 the gradient of D exists almost everwhere. Where the derivative of
D
n
() is not dened, we will use the subgradient of D
n
(), ( A.3.2), which, in the case
of proper convex functions, exists everwhere and which agrees uniquely with the gra-
dient at points where D() is dierentiable; see Theorem A.3.3 and the surrounding
discussion. Combining these results we have,
(D
n
() Q
n
()) = [S
n
() S
n
(0) + ] (A.3.26)
438 APPENDIX A. ASYMPTOTIC RESULTS
Let N denote the set of positive integers. Let
(1)
,
(2)
, . . . be a listing of the vectors
in p-space with rational components. By ( A.3.12) the right side of ( A.3.26) goes to 0
in probability for
(1)
. Hence, for every innite index set N

N there exists another


innite index set N

1
N

such that
[S
n
(
(1)
) S
n
(0) +
(1)
]
a.s.
0 , (A.3.27)
for n N

1
. Since the right side of ( A.3.26) goes to 0 in probability for
(2)
and N

1
is an innite index set, there exists another innite index set N

2
N

1
such that
[S
n
(
(i)
) S
n
(0) +
(i)
]
a.s.
0 , (A.3.28)
for n N

2
and i 2. We continue and, hence, get a sequence of nested innite index
sets N

1
N

2
N

i
such that
[S
n
(
(j)
) S
n
(0) +
(j)
]
a.s.
0 , (A.3.29)
for n N

i
N

i+1
and j i. Let

N be a diagonal innite index set of the
sequence N

1
N

2
N

i
. Then
[S
n
() S
n
(0) + ]
a.s.
0 , (A.3.30)
for n

N and for all rational .
Dene the convex function H
n
() = D
n
() D
n
(0) +

S
n
(0). Then
D
n
() Q
n
() = H
n
()

/2 (A.3.31)
(D
n
() Q
n
()) = H
n
() . (A.3.32)
Hence by ( A.3.30) we have
H
n
()
a.s.
=

/2 , (A.3.33)
for n

N and for all rational . Also note
H
n
(0) = 0 =

/2[
=0
. (A.3.34)
Since H
n
is convex and ( A.3.33) and ( A.3.34) hold, we have by Theorem A.3.6 that
H
n
()
n
e
N
converges to

/2 a.s., uniformly on each compact subset of R


p
. That
is by ( A.3.31), D
n
() Q
n
() 0 a.s., uniformly on each compact subset of 1
p
.
Since N

is arbitrary, we can conclude, (see Theorem 4, page 103 of Tucker, 1967),


that D
n
() Q
n
()
P
0 uniformly on each compact subset of 1
p
.
( A.3.16) ( A.3.13). Let c > 0 be given and let C = : || c. By ( A.3.16) we
know that D
n
() Q
n
()
P
0 on C. Using the same diagonal argument as above,
for any innite index set N

N there exists an innite index set



N N

such that
A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 439
D
n
() Q
n
()
a.s.
0 for n

N and for all rational . As in the last part, introduce
the function H
n
as
D
n
() Q
n
() = H
n
()

/2 . (A.3.35)
Hence,
H
n
()
a.s.

/2 , (A.3.36)
for n

N and for all rational . By ( A.3.36) and the fact that the function

/2
is dierentiable we have by Theorem A.3.5,
H
n
()
a.s.
, (A.3.37)
for n

N and uniformly on C. This leads to the following string of convergences,
(D
n
() Q
n
())
a.s.
0
S
n
() (S
n
(0) )
a.s.
0 , (A.3.38)
where both convergences are for n

N and uniformly on C. Since N

was arbitrary
we can conclude that
S
n
() (S
n
(0) )
P
0 , (A.3.39)
uniformly on C. Hence ( A.3.13) holds.
( A.3.13) ( A.3.12). This is trivial.
These are the results we wanted. For convenience we summarize asymptotic linearity
and asymptotic quadraticity in the following theorem:
Theorem A.3.8. Under Model ( 3.2.3) and the assumptions ( 3.4.7), ( 3.4.8), and ( 3.4.1),
lim
n
P
_
sup
c
|S
n
() (S
n
(0) ) |
_
= 0 , (A.3.40)
lim
n
P
_
sup
c
[D
n
() Q
n
()[
_
= 0 , (A.3.41)
for all > 0 and all c > 0.
Proof: This follows from the Theorems A.3.1 and A.3.7.
A.3.3 Asymptotic Distance Between

and

This section contains a proof of Theorem 3.5.5. It shows that the R-estimate in Chapter 3 is
close to the value which minimizes the quadratic approximation to the dispersion function.
The proof is due to Jaeckel (1972). For convenience, we have restated the theorem.
440 APPENDIX A. ASYMPTOTIC RESULTS
Theorem A.3.9. Under the Model ( 3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

n(

)
P
0 .
Proof: Choose > 0 and > 0. Since

n

converges in distribution, there exists a c


0
such that
P
_
|

| c
0
/

n
_
< /2 , (A.3.42)
for n suciently large. Let
T = min
_
Q(Y X) : |

| = /

n
_
Q(Y X

) . (A.3.43)
Since

is the unique minimizer of Q, T > 0; hence, by asymptotic quadraticity we have
P
_
max
<(c
0
+)/

n
[D(YX) Q(Y X)[ T/2
_
/2 , (A.3.44)
for suciently large n. By ( A.3.42) and ( A.3.44) we can assert with probability greater
than 1 that for suciently large n, [Q(YX

)D(YX

)[ < (T/2) and |

| < c
0
/

n.
This implies with probability greater than 1 that for suciently large n,
D(YX

) < Q(Y X

) + T/2 and |

| < c
0
/

n . (A.3.45)
Next suppose is arbitrary and on the ring |

| = /

n. For |

| < c
0
/

n it then
follows that || (c
0
+ )/

n. Arguing as above, we have with probability greater than


1 that D(Y X) > Q(Y X) T/2, for suciently large n. From this, ( A.3.43),
and ( A.3.45) we get the following string of inequalities
D(YX) > Q(Y X) T/2
min
_
Q(YX) : |

| = /

n
_
T/2
= T + Q(YX

) T/2
= T/2 +Q(YX

) > D(YX

) (A.3.46)
Thus, D(YX) > D(YX

), for |

| = /

n. Since D is convex, we must also have


D(Y X) > D(Y X

), for |

| /

n. But D(Y X

) min D(Y X) =
D(YX

). Hence

must lie inside the disk |

| = /

n with probability of at least


1 2; that is, P
_
|

| < /

n
_
> 1 2. This yields the result.
A.3.4 Consistency of the Test Statistic F

This section contains a proof of the consistency of the test statistic F

, Theorem 3.6.2. We
begin with a lemma.
A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS 441
Lemma A.3.1. Let a > 0 be given and let t
n
= min
n
e
=a
(Q() Q(

)). Then
t
n
= (2)
1
a
2

n,1
where
n,1
is the minimum eigenvalue of n
1
X

X.
Proof: After some computation, we have
Q() Q(

) = (2)
1

n(

n
1
X

n(

) .
Let 0 <
n,1

n,p
be the eigenvalues of n
1
X

Xand let
n,1
, . . . ,
n,p
be a correspond-
ing set of orthonormal eigenvectors. The spectral decomposition of n
1
X

X is n
1
X

X =

p
i=1

n,i

n,i

n,i
. From this we can show for any vector that

n
1
X

X
n,1
||
2
and,
that further, the minimum is achieved over all vectors of unit length when =
n,1
. It then
follows that
min
=a

n
1
X

X =
n,1
a
2
,
which yields the conclusion.
Note that by (D.2) of Section 3.4,
n,1

1
, for some
1
> 0. The following is a
restatement and a proof of Theorem 3.6.2.
Theorem A.3.10. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold.
The test statistic F

is consistent for the hypotheses ( 3.2.5).


Proof: By the above discussion we need only show that ( 3.6.23) is true. Let > 0 be
given. Let c
0
= (2)
1

2
,q
. By Lemma A.3.1, choose a > 0 so large that (2)
1
a
2

1
> 3c
0
+.
Next choose n
0
so large that (2)
1
a
2

n,1
> 3c
0
, for n n
0
. Since

n|


0
| is bounded
in probability, there exits a c > 0 and a n
1
such that for n n
1
P

0
(C
1,n
) 1 (/2) , (A.3.47)
where we dene the event C
1,n
=

n|

0
| < c. Since t > 0 by asymptotic quadraticity,
Theorem A.3.8, there exits an n
2
such that for n > n
2
,
P

0
(C
2,n
) 1 (/2) , (A.3.48)
where C
2,n
= max

n
0
c+a
[Q() D()[ < (t/3). For the remainder of the proof
assume that n maxn
0
, n
1
, n
2
= n

. Next suppose is such that



n|

| = a. Then
on C
1,n
it follows that

n|

| c + a. Hence on both C
1,n
and C
2,n
we have
D() > Q() (t/3)
Q(

) + t (t/3)
= Q(

) + 2(t/3)
> D(

) + (t/3) .
Therefore, for all such that

n|

| = a, D() D(

) > (t/3) > c


0
. But D is convex;
hence on C
1,n
C
2,n
, for all such that

n|

| a, D() D(

) > (t/3) > c


0
.
442 APPENDIX A. ASYMPTOTIC RESULTS
Finally choose n
3
such that for n n
3
, > (c + a)/

n where is the positive distance


between
0
and 1
r
. Now assume that n maxn

, n
3
and C
1,n
C
2,n
is true. Recall that
the reduced model R-estimate is

r
= (

r,1
, 0

where

r,1
lies in 1
r
; hence,

n|

n|

0
|

n|


0
|

n c > a .
Thus on C
1,n
C
2,n
, D(

r
) D(

) > c
0
. Thus for n suciently large we have,
P[D(

r
) D(

) > (2)
1

2
,q
] 1 .
Because was arbitrary ( 3.6.23) is true and consistency of F

follows.
A.3.5 Proof of Lemma 3.5.1
The following lemma was used to establish the asymptotic linearity for the sign process for
linear models in Chapter 3. The proof of this lemma was rst given by Jureckova (1971) for
general scores. We restate the lemma and give its proof for sign scores.
Lemma A.3.2. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4.
For any > 0 and for any a 1,
lim
n
P[[S
1
(Y an
1/2
X

R
) S
1
(Y an
1/2
)[

n] = 0 .
Proof: Let a be arbitrary but xed and let c > [a[. After matching notation, Theorem
A.4.3 leads to the result,
max
(X

X)
1/2
c

n
S
1
(Y an
1/2
X)
1

n
S
1
(Y) + (2f(0))a

= o
p
(1) . (A.3.49)
Obviously the above result holds for = 0. Hence for any > 0,
P
_
max
(X

X)
1/2
c

n
S
1
(Y an
1/2
X)
1

n
S
1
(Y an
1/2
)


_
P
_
max
(X

X)
1/2
c

n
S
1
(Y an
1/2
X)
1

n
S
1
(Y) + (2f(0)a


2
_
+P
_

n
S
1
(Yan
1/2
)
1

n
S
1
(Y) + (2f(0)a


2
_
.
By ( A.3.49), for n suciently large, the two terms on the right side are arbitrarily small.
The desired result follows from this since (X

X)
1/2

is bounded in probability.
A.4. ASYMPTOTIC LINEARITY FOR THE L
1
ANALYSIS 443
A.4 Asymptotic Linearity for the L
1
Analysis
In this section we obtain a linearity result for the L
1
analysis of a linear model. Recall from
Section 3.6 that the L
1
-estimates are equivalent to the R-estimates when the rank scores
are generated by the sign function; hence, the distribution theory for the L
1
-estimates is
derived in Section 3.4. The linearity result derived below oers another way to obtain this
result. More importantly though, we need the linearity result for the proof of Lemma 3.5.6
of Section 3.5. As we next show, this result is a corollary to the linearity results derived in
the last section.
We will assume the same linear model and use the same notation as in Section 3.2. Recall
that the L
1
estimate of minimizes the dispersion function,
D
1
(, ) =
n

i=1
[Y
i
x
i
[ .
The corresponding gradient function is the (p + 1) 1 vector whose components are

j
D
1
=
_

n
i=1
sgn(Y
i
x
i
) if j = 0

n
i=1
x
ij
sgn(Y
i
x
i
) if j = 1, . . . , p
,
where j = 0 denotes the partial of D
1
with respect to . The parameter will denote the
location functional med(Y
i
x
i
), i.e., the median of the errors. Without loss of generality
we will assume that the true parameters are 0.
We rst consider the simple linear model. Consider then the notation of Section A.3;
see ( A.3.1) - ( A.3.7). We will derive the analogue of Theorem A.3.8 for the processes
U
0
(, ) =
n

i=1
sgn(Y
i

n
c
i
) (A.4.1)
U
1
(, ) =
n

i=1
c
i
sgn(Y
i

n
c
i
) . (A.4.2)
Let p
d
=
n
i=1
f
0
(y
i
) denote the likelihood for the iid observations Y
1
, . . . , Y
n
and let q
d
=

n
i=1
f
0
(y
i
+ /

n + c
i
) denote the liklihood of the variables Y
i

n
c
i
. We assume
throughout that f(0) > 0. Similar to Section A.2.2, the sequence of densities q
d
is contiguous
to the sequence p
d
. Note that the processes U
0
and U
1
are already sums of independent
variables; hence, projections are unnecessary.
We rst work with the process U
1
.
Lemma A.4.1. Under the above assumptions and as n ,
E
0
(U
1
(, )) 2f
0
(0) .
444 APPENDIX A. ASYMPTOTIC RESULTS
Proof: After some simplication we get
E
0
(U
1
(, )) = 2
n

i=1
c
i
_
F
0
(0) F
0
(/

n + c
i
)

= 2
n

i=1
c
i
(c
i
/

n)f
0
(
in
) ,
where, by the mean value theorem,
in
is between 0 and [/

n + c
i
[. Since the c
i
s are
centered, we further obtain
E
0
(U
1
(, )) = 2
n

i=1
c
2
i
[f
0
(
in
) f
0
(0)] 2
n

i=1
c
2
i
f
0
(0) .
By assumptions of Section A.2.2, it follows that max
i
[/

n + c
i
[ 0 as n . Since

n
i=1
c
2
i
= 1 and the assumptions that f
0
continuous and positive at 0 the desired result
easily follows.
This leads us to our main result for U
1
(, ):
Theorem A.4.1. Under the above assumptions, for all and
U
1
(, ) [U
1
(0, 0) 2f
0
(0)]
P
0 ,
as n .
Because the c
i
s are centered it follows that E
p
d
(U
1
(0, 0)) = 0. Thus by the last lemma,
we need only show that Var(U
1
(, ) U
1
(0, 0)) 0. By considering the variance of the
sign of a random variable, simplication leads to the bound:
Var((U
1
(, ) U
1
(0, 0)) 4
n

i=1
c
2
i
[F
0
(/

n + c
i
) F
0
(0)[ .
By our assumptions, max
i
[c
i
+/

n[ 0 as n . From this and the continuity of F


0
at 0, it follows that Var(U
1
(, ) U
1
(0, 0)) 0.
We need analogous results for the process U
0
(, ).
Lemma A.4.2. Under the above assumptions,
E
0
[U
0
(, )] 2f
0
(0) ,
as n .
A.4. ASYMPTOTIC LINEARITY FOR THE L
1
ANALYSIS 445
Proof: Upon simplication and an application of the mean value theorem,
E
0
[U
0
(, )] =
2

n
n

i=1
_
F
0
(0) F
0
_

n
+ c
i

__
=
2

n
n

i=1
_

n
+ c
i

_
f
0
(
in
)
=
2
n
n

i=1
[f
0
(
in
) f
0
(0)] 2f
0
(0) ,
where we have used the fact that the c
i
s are centered. Note that [
in
[ is between 0 and
[/

n +c
i
[ and that max [/

n +c
i
[ 0 as n . By the continuity of f
0
at 0, the
desired result follows.
Theorem A.4.2. Under the above assumptions, for all and
U
0
(, ) [U
0
(0, 0) 2f
0
(0)]
P
0 ,
as n .
Because the medY
i
is 0, E
0
[U
0
(0, 0)] = 0. Hence by the last lemma it then suces to
show that Var(U
0
(, ) U
0
(0, 0)) 0. But,
Var(U
0
(, ) U
0
(0, 0))
4
n
n

i=1

F
0
_

n
+ c
i

_
F
0
(0)

Because max [/

n + c
i
[ 0 and F
0
is continuous at 0, Var(U
0
(, ) U
0
(0, 0)) 0.
Next consider the multiple regression model as discussed in Section A.3. The only
dierence in notation is that here we have the intercept parameter included. Let =
(,
1
, . . . ,
p
)

denote the vector of all regression parameters. Take X = [1


n
: X
c
], where
X
c
denotes a centered design matrix and as in ( A.3.2) take C = X(X

X)
1/2
. Note that
the rst column of C is (1/

n)1
n
. Let U() = (U
0
(), . . . , U
p
())

denote the vector of


processes. Similar to the discussion prior to Theorem A.3.1, the last two theorems imply
that
U() [U(0) 2f
0
(0)]
P
0 ,
for all real in 1
p+1
.
As in Section A.3, we dene the approximation quadratic to D
1
as
Q
1n
() = (2f
0
(0))

/2

U(0) + D
1
(0) .
The asymptotic linearity of U and the asymptotic quadraticity of D
1
then follow as in the
last section. We state the result for reference:
446 APPENDIX A. ASYMPTOTIC RESULTS
Theorem A.4.3. Under conditions ( 3.4.1), ( 3.4.3), ( 3.4.7) and ( 3.4.8),
lim
n
P
_
max
c
|U() (U(0) (2f
0
(0))) |
_
= 0 , (A.4.3)
lim
n
P
_
max
c
[D
1
() Q
1
()[
_
= 0 , (A.4.4)
for all > 0 and all c > 0.
A.5 Inuence Functions
In this section we derive the inuence functions found in Chapters 1-3 and Chapter 5.
Discussions of the inuence function can be found in Staudte and Sheather (1990), Hampel
et al. (1986) and Huber (1981). For the inuence functions of Chapter 3, we will nd the
Gateux derivative of a convenient functional; see Fernholz (1983) and Huber (1981) for
rigourous discussions of functionals and derivatives.
Denition A.5.1. Let T be a statistical functional dened on a space of distribution func-
tions and let H denote a distribution function in the domain of T. We say T is Gateux
dierentiable at H if for any distribution function W, such that the distribution functions
(1 s)H + sW lie in the domain of T, the following limit exists:
lim
s0
T[(1 s)H + sW] T[H]
s
=
_

H
dW , (A.5.1)
for some function
H
.
Note by taking W to be H in the above denition we have,
_

H
dH = 0 . (A.5.2)
The usual denition of the inuence function is obtained by taking the distribution
function W to be a point mass distribution. Denote the point mass distribution function at
t by
t
(x). Letting W(x) =
t
(x), the Gateux derivative of T(H) is
lim
s0
T[(1 s)H + s
s
(x)] T[H]
s
=
H
(x); . (A.5.3)
The function
H
(x) is called the inuence function of T(H). Note that this is the deriva-
tive of the functional T[(1 s)H + s
s
(x)] at s = 0. It measures the rate of change of the
functional T(H) at H in the direction of
s
. A functional is said to be robust when this
derivative is bounded.
A.5. INFLUENCE FUNCTIONS 447
A.5.1 Inuence Function for Estimates Based on Signed-Rank
Statistics
In this section we derive the inuence function for the one-sample location estimate

+,
( 1.8.5), discussed in Chapter 1. We will assume that we are sampling from a symmetric
density h(x) with distribution function H(x), as in Section 1.8. As in Chapter 2, we will
assume that the one sample score function
+
(u) is dened by

+
(u) =
_
u + 1
2
_
, (A.5.4)
where (u) is a nondecreasing, dierentiable function dened on the interval (0, 1) satisfying
(1 u) = (u) . (A.5.5)
Recall from Chapter 2 that this assumption is appropriate for scores for samples from sym-
metrical distributions. For convenience we extend
+
(u) to the interval (1, 0) by

+
(u) =
+
(u) . (A.5.6)
Our functional T (H) is dened implicitly by the equation ( 1.8.5). Using the symmetry of
h(x), ( A.5.5), and ( A.5.6) we can write the dening equation for = T (H) as
0 =
_

+
(H(x) H(2 x))h(x) dx
0 =
_

(1 H(2 x))h(x) dx . (A.5.7)


For the derivation, we will proceed as discussed above; see the discussion around expres-
sion ( A.5.3). Consider the contaminated distribution of H(x) given by
H
t,
(x) = (1 )H(x) +
t
(x) , (A.5.8)
where 0 < < 1 is the proportion of contamination and
t
(x) is the distribution function
for a point mass at t. By ( A.5.3) the inuence function is the derivative of the functional
at = 0. To obtain this derivative we implicitly dierentiate the dening equation ( A.5.7)
at H
t,
(x); i.e., at
0 = (1 )
_

(1 (1 )H(2 x)
t
(2 x))h(x) dx
=
_

(1 (1 )H(2 x)
t
(2 x)) d
t
(x)
Let

denote the derivative of the functional. Implicitly dierentiating this equation and
then setting = 0 and without loss of generality = 0, we get
0 =
_

(H(x))h(x) dx +
_

(H(x))H(x)h(x) dx
= 2

(H(x))h
2
(x) dx
_

(H(x))
t
(x)h(x) dx + (H(t)) .
448 APPENDIX A. ASYMPTOTIC RESULTS
Label the four integrals in the above equation as I
1
, . . . , I
4
. Since
_
(u) du = 0, I
1
= 0. For
I
2
we get
I
2
=
_

(H(x))h(x) dx
_

(H(x))H(x)h(x) dx
=
_
1
0

(u) du
_
1
0

(u)u du = (0) .
Next I
4
reduces to

_
t

(H(x))h(x) dx =
_
H(t)
0

(u) du = (H(t)) +(0) .


Combining these results and solving for

leads to the inuence function which we can write
in either of the following two ways,
(t,

+) =
(H(t))
_

(H(x))h
2
(x) dx
=

+
(2H(t) 1)
4
_

0

+
(2H(x) 1)h
2
(x) dx
. (A.5.9)
A.5.2 Inuence Functions for Chapter 3
In this section, we derive the inuence functions which were presented in Chapter 3. Much
of this work was developed in Witt (1989) and Witt, McKean and Naranjo (1995). The
correlation model of Section 3.11 is the underlying model for the inuence functions derived
in this section. Recall that the joint distribution function of x and Y is H, the distribution
functions of x, Y and e are M, G and F, respectively, and is the variance-covariance
martix of x.
Let

denote the R-estimate of for a specied score function (u). In this section we
are interested in deriving the inuence functions of this R-estimate and of the corresponding
R-test statistic for the general linear hypotheses. We will obtain these inuence functions
by using the denition of the Gateux derivative of a functional, ( A.5.1). The inuence
functions are then obtained by taking W to be the point mass distribution function
(x
0
,y
0
)
;
see expression ( A.5.3). If T is Gateux dierentiable at H then by setting W =
(x
0
,y
0
)
we
see that the inuence function of T is given by
(x
0
, y
0
; T) =
_

H
d
(x
0
,y
0
)
=
H
(x
0
, y
0
) . (A.5.10)
As a simple example, we will obtain the inuence function of the statistic D(0) =

a(R(Y
i
))Y
i
. Since G is the distribution function of Y , the corresponding functional is
A.5. INFLUENCE FUNCTIONS 449
T[G] =
_
(G(y))ydG(y). Hence for a given distribution function W,
T[(1 s)G+ sW] = (1 s)
_
[(1 s)G(y) + sW(y)]ydG(y)
+s
_
[(1 s)G(y) + sW(y)]ydW(y) .
Taking the partial derivative of the right side with respect to s, setting s = 0, and substituting

y
0
for W leads to the inuence function
(y
0
; D(0)) =
_
(G(y))ydG(y)
_

(G(y))G(y)ydG(y)
+
_

y
0

(G(y))ydG(y) + (G(y
0
))y
0
. (A.5.11)
Note that this is not bounded in the Y -space and, hence, the statistic D(0) is not robust.
Thus, as noted in Section 3.11, the coecient of multiple determination R
1
, ( 3.11.16), is
not robust. A similar development establishes the inuence function for the denominator of
LS coecient of multiple determination R
2
, showing too that it is not bounded. Hence R
2
is not a robust statistic.
Another example is the the inuence function of the least squares estimate of which is
given by,
(x
0
, y
0
;

LS
) =
1
y
0

1
x
0
. (A.5.12)
The inuence function of the least squares estimate is, thus, unbounded in both the Y - and
x-spaces.
Inuence Function of

Recall that H is the joint distribution function of x and Y . Let the p1 vector T(H) denote
the functional corresponding to

. Assume without loss of generality that the true = 0,


= 0, and that the Ex = 0. Hence the distribution function of Y is F(y) and Y and x are
independent; i.e., H(x, y) = M(x)F(y).
Recall that the R-estimate satises the equations
n

i=1
x
i
a(R(Y
i
x

i
))
.
= 0 .
Let

G

n
denote the empirical distribution function of Y
i
x

i
. Then we can rewrite the above
equations as
n
n

i=1
x
i

_
n
n + 1

n
(Y
i
x

i
)
_
1
n
.
= 0 .
450 APPENDIX A. ASYMPTOTIC RESULTS
Let G

denote the distribution function of Y x

T(H). Then the functional T(H) satises


_
(G

(y x

T(H))xdH(x, y) = 0 . (A.5.13)
We can show that,
G

(t) =
_ _
uv

T(H)+t
dH(v, u) . (A.5.14)
Let H
s
= (1 s)H + sW for an arbitrary distribution function W. Then the functional
T(H) evaluated at H
s
satises the equation
(1 s)
_
(G

s
(y x

T(H
s
))xdH(x, y) + s
_
(G

s
(y x

T(H
s
))xdW(x, y) = 0 ,
where G

s
is the distribution function of Y x

T(H
s
). We will obtain T/s by implicit
dierentiation. Then upon substituting
x
0
,y
0
for W the inuence function is given by
(T/s) [
s=0
, which we will denote by

T. Implicit dierentiation leads to
0 =
_
(G

s
(y x

T(H
s
))xdH(x, y) (1 s)
_

(G

s
(y x

T(H
s
))
G

s
s
xdH(x, y)
+
_
(G

s
(y x

T(H
s
))xdW(x, y) + sB
1
, (A.5.15)
where B
1
is irrelevant since we will be setting s to 0. We rst get the partial derivative of
G

s
with respect to s. By ( A.5.14) and the independence between Y and x at H, we have
G

s
(y x

T(H
s
)) =
_ _
uyT(Hs)

(xv)
dH
s
(v, u)
= (1 s)
_
F[y T(H
s
)

(x v)]dM(v) + s
_ _
uyT(Hs)

(xv)
dW(v, u) .
Thus,
G

s
(y x

T(H
s
))
s
=
_
F[y T(H
s
)

(x v)]dM(v)
+(1 s)
_
F

[y T(H
s
)

(x v)](v x)

T
s
dM(v)
+
_ _
uyT(Hs)

(xv)
dW(v, u) + sB
2
,
where B
2
is irrelevant since we are setting s to 0. Therefore using the independence between
Y and x at H, T(H) = 0, and Ex = 0, we get
G

s
(y x

T(H
s
))
s
[
s=0
= F(y) f(y)x


T + W
Y
(y) , (A.5.16)
A.5. INFLUENCE FUNCTIONS 451
where W
Y
denotes the marginal (second variable) distribution function of W.
Upon evaluating expression ( A.5.15) at s = 0 and substituting into it expression ( A.5.16)
we have
0 =
_
x(F(y))dH(x, y) +
_
x

(F(y))[F(y) f(y)x


T + W
Y
(y)]dH(x, y)
+
_
x(F(y))dW(x, y)
=
_

(F(y))f(y)xx


TdH(x, y) +
_
x(F(y))dW(x, y)
Substituting
x
0
,y
0
in for W, we get
0 =

T +x
0
(F(y
0
)) .
Solving this last expression for

T, the inuence function of

is given by
(x
0
, y
0
;

) =
1
(F(y
0
))x
0
. (A.5.17)
Hence the inuence function of

is bounded in the Y -space but not in the x-space. The


estimate is thus bias robust. In Chapter 5 we presented R-estimates whose inuence functions
are bounded in both spaces; see Theorems ?? and 3.12.4. Note that the asymptotic
representation of

in Corollary 3.5.24 can be written in terms of this inuence function


as

= n
1/2
n

i=1
(x
i
, Y
i
;

) + o
p
(1) .
Inuence Function of F

Rewrite the correlation model as


Y = +x

1
+x

2
+ e
and consider testing the general linear hypotheses
H
0
:
2
= 0 versus H
A
:
2
,= 0 , (A.5.18)
where
1
and
2
are q 1 and (pq)1 vectors of parameters, respectively. Let

1,
denote
the reduced model estimate. Recall that the R-test based upon the drop in dispersion is
given by
F

=
RD/q
/2
,
where RD = D(

1,
) D(

) is the reduction in dispersion. In this section we want to


derive the inuence function of the test statistic.
452 APPENDIX A. ASYMPTOTIC RESULTS
Let RD(H) denote the functional for the statistic RD. Then
RD(H) = D
1
(H) D
2
(H) ,
where D
1
(H) and D
2
(H) are the reduced and full model functionals given by
D
1
(H) =
_
[G

(y x

1
T
1
(H))](y x

1
T
1
(H))dH(x, y)
D
2
(H) =
_
[G

(y x

T(H))](y x

T(H))dH(x, y) , (A.5.19)
and T
1
(H) and T
2
(H) denote the reduced and full model functionals for
1
and , respec-
tively. Let
r
= (

1
, 0

denote the true vector of parameters under H


0
. Then the random
variables Y x

r
and x are independent. Next write as
=
_

11

12

21

22
_
.
It will be convenient to dene the matrices
r
and
+
r
as

r
=
_

11
0
0 0
_
and
+
r
=
_

1
11
0
0 0
_
.
As above, let H
s
= (1 s)H + sW. We begin with a lemma,
Lemma A.5.1. Under the correlation model,
(a) RD(0) = 0
(b)
RD(H
s
)
s
[
s=0
= 0
(c)

2
RD(H
s
)
s
2
[
s=0
=
2
[F(y x

r
)]x

x . (A.5.20)
Proof: Part (a) is immediate. For Part (b), it follows from ( A.5.19) that
D
2
(H
s
)
s
=
_
[G

s
(y x

T(H
s
))](y x

T(H
s
))dH
+(1 s)
_

[G

s
(y x

T(H
s
))](y x

T(H
s
))
G

s
s
dH
+(1 s)
_
[G

s
(y x

T(H
s
))](x

T
s
)dH
+
_
[G

s
(y x

T(H
s
))](y x

T(H
s
))dW(y) + sB , (A.5.21)
A.5. INFLUENCE FUNCTIONS 453
where B is irrelevant because we are setting s to 0. Evaluating this at s = 0 and using the
independence of Y x

r
and x, and E(x) = 0 we get after some simplication
D
2
(H
s
)
s
[
s=0
=
_
[F(y x
r
)](y x
r
)dH
_

[F(y x
r
)]F(y x
r
)(y x
r
)dH
+
_

[F(y x
r
)]W
Y
(y x
r
)(y x
r
)dH + [F(y
0
x
0

r
)](y
0
x
0

r
) .
Dierentiating as above and using x

r
= x

1
, we get the same expression for
D
1
s
[
s=0
.
Hence Part (b) is true. Taking the second partial derivatives of D
1
(H) and D
2
(H) with
respect to s, the result for Part (c) can be obtained. This is a tedious derivation and details
of it can be found in Witt (1989) and Witt et al. (1995).
Since F

is nonnegative, there is no loss in generality in deriving the inuence function


of
_
qF

. Letting Q
2
= 2
1
RD we have
(x
0
, y
0
;
_
qF

) = lim
s0
Q[(1 s)H + s
x
0
,y
0
] Q[H]
s
.
But Q[H] = 0 by Part (a) of Lemma A.5.1. Hence we can rewrite the above limit as
(x
0
, y
0
;
_
qF

) =
_
lim
s0
Q
2
[(1 s)H + s
x
0
,y
0
]
s
2
_
1/2
.
Using Parts (a) and (b) of Lemma A.5.1, we can apply Lhospitals rule twice to evaluate
this limit. Thus
(x
0
, y
0
;
_
qF

) =
_
lim
s0
1
2

2
Q
2
s
2
_
1/2
=
_
2
1

2
RD
s
2
_
1/2
= [[F(y x

r
)][
_
x

[
1

+
] x (A.5.22)
Hence, the inuence function of the rank-based test statistic F

is bounded in the Y -space


as long as the score function is bounded. It can be shown that the inuence function of the
least squares test statistic is not bounded in Y -space. It is clear from the above argument
that the coecient of multiple determination R
2
is also robust. Hence, for R-ts R
2
is the
preferred coecient of determination.
However, F

is not bounded in the x-space. In Chapter 5 we present statistics whose


inuence function are bounded in both spaces; although, they are less ecient.
The asymptotic distribution of qF

was derived in Section 3.6; however, we can use the


above result on the inuence function to immediately display it. If we expand Q
2
into a
454 APPENDIX A. ASYMPTOTIC RESULTS
vonMises expansion at H, we have
Q
2
(H
s
) = Q
2
(H) +
Q
2
s
[
s=0
+
1
2

2
Q
2
s
2
[
s=0
+R
=
__
(F(y x

r
)x

d
x
0
,y
0
(x, y)
_
_

__
(F(y x

r
)xd
x
0
,y
0
(x, y)
_
+ R . (A.5.23)
Upon substituting the empirical distribution function for
x
0
,y
0
in expression ( A.5.23), we
have at the sample
nQ
2
(H
s
) =
_
1

n
n

i=1
x

_
1
n
R(Y
i
x

r
)
_
_
_

_
1

n
n

i=1
x
i

_
1
n
R(Y
i
x

r
)
_
_
+o
p
(1) .
This expression is equivalent to the expression ( 3.6.11) which yields the asymptotic distri-
bution of the test statistic in Section 3.6.
A.5.3 Inuence Function of

HBR
of Chapter 5
The inuence function of the high breakdown estimator

HBR
is discussed in Section 3.12.4.
In this section, we restate Theorem A.5.24 and then derive a proof of it.
Theorem A.5.1. The inuence function for the estimate

HBR
is given by
(x
0
, y
0
,

HBR
) = C
1
H
1
2
_ _
(x
0
x
1
)b(x
1
, x
0
, y
1
, y
0
)sgny
0
y
1
dF(y
1
)dM(x
1
) , (A.5.24)
where C
H
is given by expression ( 3.12.22).
Proof: Let
0
(x, y) denote the distribution function of the point mass at the point (x
0
, y
0
)
and consider the contaminated distribution H
t
= (1 t)H + t
0
for 0 < t < 1. Let (H
t
)
denote the functional at H
t
. Then (H
t
) satises
0 =
_ _
x
1
b(x
1
, x
2
, y
1
, y
2
)
_
I(y
2
y
1
< (x
2
x
1
)

(H
t
))
1
2
_
dH
t
(x
1
, y
1
)dH
t
(x
2
, y
2
) .
(A.5.25)
We next implicitly dierentiate ( A.5.25) with respect to t to obtain the derivative of the
functional. The value of this derivative at t = 0 is the inuence function. Without loss of
generality, we can assume that the true parameter = 0. Under this assumption x and y
are independent. Substituting the value of H
t
into ( A.5.25) and expanding we obtain the
A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 455
four terms:
0 = (1 t)
2
_ _ _
x
1
_
_
y
1
+(x
2
x
1
)

(Ht)

b(x
1
, x
2
, y
1
, y
2
)dF(y
2
)
1
2
_
dM(x
2
)dM(x
1
)dF(y
1
)
+(1 t)t
_ _ _ _
x
1
b(x
1
, x
2
, y
1
, y
2
)
_
I(y
2
y
1
< (x
2
x
1
)

(H))
1
2
_
dM(x
2
)dF(y
2
)d
0
(x
1
, y
1
)
+(1 t)t
_ _ _ _
x
1
b(x
1
, x
2
, y
1
, y
2
)
_
I(y
2
y
1
< (x
2
x
1
)

(H))
1
2
_
d
0
(x
2
, y
2
)dM(x
1
)dF(y
1
)
+t
2
_ _ _ _
x
1
b(x
1
, x
2
, y
1
, y
2
)
_
I(y
2
y
1
< (x
2
x
1
)

(H))
1
2
_
d
0
(x
2
, y
2
)d
0
(x
1
, y
1
) .
Let

denote the derivative of the functional evaluted at 0. Proceeding to implicitly dier-
entiate this equation and evaluating the derivative at 0, we get, after some derivation,
0 =
_ _ _
x
1
b(x
1
, x
2
, y
1
, y
1
)f
2
(y
1
)(x
2
x
1
)

dy
1
dM(x
1
)dM(x
2
)

+
_ _
x
0
b(x
0
, x
2
, y
0
, y
2
)
_
I(y
2
< y
0
)
1
2
_
dF(y
2
)dM(x
2
)
+
_ _
x
1
b(x
1
, x
0
, y
1
, y
0
)
_
I(y
0
< y
1
)
1
2
_
dF(y
1
)dM(x
1
)
Once again using the symmetry in the x arguments and y arguments of the function b, we
can simplify this expression to
0 =
_
1
2
_ _ _
(x
2
x
1
)b(x
1
, x
2
, y
1
, y
1
)(x
2
x
1
)

f
2
(y
1
) dy
1
dM(x
1
)dM(x
2
)
_

+
_ _
(x
0
x
1
)b(x
1
, x
0
, y
1
, y
0
)
_
I(y
1
< y
0
)
1
2
_
dF(y
1
)dM(x
1
) .
Using the relationship between the indicator function and the sign function and the denition
of C
H
,( 3.12.22), we can rewrite this last expression as
0 = C
H

+
1
2
_ _
(x
0
x
1
)b(x
1
, x
0
, y
1
, y
0
)sgny
0
y
1
dF(y
1
)dM(x
1
) .
Solving for

leads to the desired result.
A.6 Asymptotic Theory for Chapter 5
In this section we derive the results that are needed in Section 3.12.3 of Chapter 5. These
results were rst derived by Chang (1995). Our development is taken from the article by
Chang, McKean, Naranjo and Sheather (1996). The main goal is to prove Theorem 3.12.2
which we restate here:
456 APPENDIX A. ASYMPTOTIC RESULTS
Theorem A.6.1. Under assumptions (E.1), ( 3.4.1), and (H.1) - (H.4), ( 3.12.10) -
( 3.12.13),

n(

HBR
)
d
N( 0, (1/4)C
1

H
C
1
).
Besides the notation of Chapter 5, we need:
1. W
ij
() = (1/2)[sgn(z
j
z
i
) sgn(y
j
y
i
)],
where z
j
= y
j
x

j
/

n . (A.6.1)
2. t
ij
() = (x
j
x
i
)

n . (A.6.2)
3. B
ij
(t) = E[b
ij
I(0 < y
i
y
j
< t)] . (A.6.3)
4.
ij
= B

ij
(0)/E(b
ij
) . (A.6.4)
5. C
n
=

i<j

ij
b
ij
(x
j
x
i
)(x
j
x
i
)

. (A.6.5)
6. R() = n
3/2
_

i<j
b
ij
(x
j
x
i
)W
ij
() +C
n
/

n
_
. (A.6.6)
Without loss of generality we will assume that the true
0
is 0. We begin with the
following lemma.
Lemma A.6.1. Under assumptions (E.1), ( 3.4.1), and (H.1), ( 3.12.13),
B

ij
(t) =
_

b(x
i
, x
j
, y
j
+ t, y
j
,

0
) f(y
j
+ t) f(y
j
)

k=i,j
f(y
k
) dy
1
dy
n
is continuous in t.
Proof: This result follows from ( 3.4.1), ( 3.12.13) and an application of Leibnitzs rule on
dierentiation of denite integrals.
Let be arbitrary but xed. Denote W
ij
() by W
ij
, suppressing dependence on .
Lemma A.6.2. Under assumptions (E.1), ( 3.4.1), and (H.4), ( 3.12.13), there exist con-
stants [
ij
[ < [t
ij
[ such that E(b
ij
W
ij
) = t
ij
B

ij
(
ij
).
Proof: Since W
ij
= 1, 1, or 0 according as t
ij
< y
j
y
i
< 0, 0 < y
j
y
i
< t
ij
, or otherwise,
we have
E

0
(b
ij
W
ij
) =
_
t
ij
<y
j
y
i
<0
b
ij
f
Y
(y) dy
_
0<y
j
y
i
<t
ij
b
ij
f
Y
(y) dy .
When t
ij
> 0, E(b
ij
W
ij
) = B
ij
(t
ij
) = B
ij
(0) B
ij
(t
ij
) = t
ij
B

ij
(
ij
) by Lemma A.6.1
and the Mean Value Theorem. The same result holds for t
ij
< 0, which proves the lemma.
Lemma A.6.3. Under assumptions (H.3), ( 3.12.12), and (H.4), ( 3.12.13), we have
b
ij
= g
ij
(

0
) = g
ij
(0) + [g
ij
()]

0
= g
ij
(0) + O
p
(1/

n),
uniformly over all i and j, where || |

0
|.
A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 457
Proof: Follows from a multivariate Mean Value Theorem (see e.g. page 355 of Apostol, 1974),
and by ( 3.12.12) and ( 3.12.13).
Lemma A.6.4. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),
(i) E(g
ij
(0)g
ik
(0)W
ij
W
ik
) 0, as n
(ii) E(g
ij
(0)W
ij
) 0, as n
uniformly over i and j.
Proof: Without loss of generality, let t
ij
> 0 and t
ik
> 0, where the indices i, j and k are all
dierent. Then
E(g
ij
(0)g
ik
(0)W
ij
W
ik
) = E[g
ij
g
ik
I(0 < y
j
y
i
< t
ij
) I(0 < y
k
y
i
< t
ik
)]
=

_
y
i
+t
ik
y
i
_
y
i
+t
ij
y
i
g
ij
g
ik
f
i
f
j
f
k
dy
j
dy
k
dy
i

.
Assumptions ( 3.4.7) and ( 3.4.8) imply (1/n)max
i
(x
ik
x
k
)
2
0 for all k, or equivalently
(1/

n)max
i
[x
ik
x
k
[ 0 for all k, which implies that t
ij
0. Since the integrand is
bounded, this proves (i).
Similarly, E(g
ij
(0)W
ij
) =
_

_
y
i
+t
ij
y
i
g
ij
f
i
f
j
dy
j
dy
i
0, which proves (ii).
Lemma A.6.5. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),
(i) Cov(b
12
W
12
, b
34
W
34
) = o(n
1
).
(ii) Cov(b
12
W
12
, b
34
) = o(n
1
).
(iii) Cov(b
12
W
12
, b
13
W
13
) = o(1).
(iv) Cov(b
12
W
12
, b
13
) = o(1).
Proof: To prove (i), recall that b
12
= g
12
(0) + [g
12
()]

0
. Thus
Cov(b
12
W
12
, b
34
W
34
) = Cov(g
12
(0)W
12
, g
34
(0)W
34
)
+2 Cov([g
12
()]

0
W
12
, g
34
(0)W
34
)
+Cov([g
12
()]

0
W
12
, [g
34
()]

0
W
34
).
Let I
1
, I
2
and I
3
denote the three terms on the right hand side. The term I
1
is 0, by
independence. Now,
I
2
= 2E
_
[g
12
()]

0
W
12
g
34
(0)W
34
_
2E
_
[g
12
()]

0
W
12
_
E g
34
(0)W
34

= I
21
I
22
.
Write the rst term above as
I
21
= 2(1/n)E
_
[g
12
()]

0
g
34
(0)(

nW
12
)(

nW
34
)
_
.
458 APPENDIX A. ASYMPTOTIC RESULTS
The term [g
12
()]

0
= b
12
g
12
(0) is bounded and of magnitude o
p
(1). If we can show
that

nW
12
is integrable, then it follows using standard arguments that I
21
= o(1/n). Let
F

denote the cdf of y


2
y
1
and f

denote its pdf. Using the mean value theorem,


E[

nW
12
()] =

n(1/2)E[sgn(y
2
y
1
(x
2
x
1
)

)/

n) sgn(y
2
y
1
)]
=

n(1/2)[2F

((x
2
x
1
)

)/

n) 2F

(0)]
=

nf

)(x
2
x
1
)

n f

)[(x
2
x
1
)

[ ,
for [ [ < [(x
2
x
1
)

n[. The right side of the inequality in expression ( A.6.7) is


bounded. This proves that I
21
= o(1/n). Similarly,
I
22
= 2(1/n)E
_
[g
12
()]

0
(

nW
12
)
_
E
_
g
34
(0)(

nW
34
)
_
= o(1/n),
which proves I
2
= 0.
The term I
3
can be shown to be o(n
1
) similarly, which proves (i). The proof of (ii) is
analogous to (i). To prove (iii), note that
Cov(b
12
W
12
, b
13
W
13
) = Cov(g
12
(0)W
12
, g
13
(0)W
13
)
+ 2 Cov([g
12
()]

0
W
12
, g
13
(0)W
13
)
+ Cov([g
12
()]

0
W
12
, [g
13
()]

0
W
13
).
The rst term is o(1) by Lemma A.6.4. The second and third terms are clearly o(1). This
proves (iii). Result (iv) is analogously proved.
We are now ready to state and prove asymptotic linearity. Consider the negative gradient
function
S() = D() =

i<j
b
ij
sgn(z
j
z
i
)(x
j
x
i
). (A.6.7)
Theorem A.6.2. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8).
Then
sup

nC
n
3/2
[S() S(0) + 2 C
n
]
p
0.
Proof: Write R() =
_
S(n
1/2
) S(0) + 2n
1/2
C
n

. We will show that


sup
C
R() = 2 sup
C
_
n
3/2

i<j
b
ij
(x
j
x
i
) W
ij
() + n
1/2
C
n

_
p
0.
It will suce to show that each component converges to 0. Consider the kth component
R
k
() = 2
_
n
3/2

i<j
b
ij
(x
jk
x
ik
) W
ij
() +

i<j

ij
b
ij
(x
jk
x
ik
) t
ij
_
= 2 n
3/2

i<j
(x
jk
x
ik
)(b
ij
W
ij
+
ij
t
ij
b
ij
).
A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 459
We will show that E(R
k
()) 0 and V ar(R
k
()) 0. By Lemma A.6.2 and the
denition of
ij
,
E(R
k
) = 2 n
3/2

i<j
(x
jk
x
ik
)[E(b
ij
W
ij
) +
ij
t
ij
E(b
ij
)]
= 2 n
3/2

i<j
(x
jk
x
ik
)t
ij
[B

ij
(0) B

ij
(
ij
)]
2 n
3/2
_

i<j
(x
jk
x
ik
)
2
_
1/2
_

i<j
t
2
ij
_
1/2
sup
i,j
[B

ij
(0) B

ij
(
ij
)[
= 2
_
(1/n
2
)

i<j
(x
jk
x
ik
)
2
_
1/2
_
(1/n)

i<j
t
2
ij
_
1/2
sup
i,j
[B

ij
(0) B

ij
(
ij
)[ 0
since (1/n)

i<j
t
2
ij
= (1/n)

X = O(1) and sup


i,j
[B

ij
(0) B

ij
(
ij
)[ 0 by Lemma
A.6.1.
Next, we will show that V ar(R
k
) 0.
V ar(R
k
) = V ar
_
2 n
3/2

i<j
(x
jk
x
ik
)(b
ij
W
ij
+
ij
t
ij
b
ij
)
_
= V ar
_
2 n
3/2
n

i=1
n

j=1
(x
jk
x
k
)(b
ij
W
ij
+
ij
t
ij
b
ij
)
_
= 4 n
3
n

i=1
n

j=1
(x
jk
x
k
)
2
V ar(b
ij
W
ij
+
ij
t
ij
b
ij
)
+4 n
3

(i,j)=(l,m)
(x
jk
x
k
)(x
mk
x
k
)Cov(b
ij
W
ij
+
ij
t
ij
b
ij
, b
lm
W
lm
+
lm
t
lm
b
lm
).
The double sum term above goes to 0, since there there n
2
bounded terms in the double
sum, multiplied by n
3
. There are two types of covariance terms in the quadruple sum,
covariance terms with all four indices dierent, e.g. ((i, j), (l, m)) = ((1, 2), (3, 4)), and
covariance terms with one index of the rst pair equal to one index of the second pair, e.g.
((i, j), (l, m)) = ((1, 2), (1, 3)). Since there are of magnitude n
4
terms with all four indices
dierent, we need to show that each covariance term is o(n
1
). This immediately follows
from Lemma A.6.5. Finally there are of magnitude n
3
covariance terms with one shared
index, and we need to show each term is o(1). Again, this immediately follows from Lemma
A.6.5. Hence, we have established the desired result.
Next dene the approximating quadratic process,
Q() = D(0)

i<j
b
ij
sgn(y
j
y
i
)(x
j
x
i
)

C
n
. (A.6.8)
460 APPENDIX A. ASYMPTOTIC RESULTS
Let
D

() = n
1
D(n
1/2
) (A.6.9)
and
Q

() = n
1
Q(n
1/2
) . (A.6.10)
Note that minimizing D

() and Q

() is equivalent to minimizing D(n


1/2
) and Q(n
1/2
),
respectively.
The next result is asymptotic quadraticity.
Theorem A.6.3. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),
for a xed constant C and for any > 0,
P
_
sup
<C
[Q

() D

()[
_
0 . (A.6.11)
Proof: Since
Q

= 2n
3/2
_

i<j
b
ij
(x
j
x
i
)W
ij
+C(n
1/2
)
_
= R(), it follows
from Theorem A.6.2 that for > 0 and C > 0,
P
_
sup
<C
|
Q

| /C
_
0.
For 0 t 1, let
t
= t . Then

d
dt
[Q

(
t
) D

(
t
)]

k=1

k
(
Q

tk

tk
)

| | sup
<C
|
Q

|<| | (/C) <


with probability approaching 1. Now, let h(t) = Q

(
t
) D

(
t
). By the previous result,
we have [h

(t)[ < with high probability. Thus


[h(1)[ = [h(1) h(0)[ =

_
1
0
h

(t) dt

_
1
0
[h

(t)[ dt < ,
with probability approaching one. This proves the theorem.
The next theorem states asymptotic normality of S(0).
Theorem A.6.4. Under assumptions ( 3.12.10)-( 3.12.13), ( 3.4.1), ( 3.4.7), and ( 3.4.8),
n
3/2
S(0)
D
N(0,
H
) . (A.6.12)
A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 461
Proof: Let S
P
denote the projection of S

(0) = n
3/2
S(0) onto the space of linear combina-
tions of independent random variables. Then
S
P
=
n

k=1
E[S

(0)[y
k
] =
n

k=1
E
_
n
3/2

i<j
(x
j
x
i
)b
ij
sgn(y
j
y
i
) [ y
k
_
=
n

k=1
n
3/2
_
k1

i=1
(x
k
x
i
)E[b
ik
sgn(y
k
y
i
)[y
k
] +
n

j=k+1
(x
j
x
k
)E[b
kj
sgn(y
j
y
k
)[y
k
]
_
= n
3/2
n

k=1
n

j=1
(x
j
x
k
)E[b
kj
sgn(y
j
y
k
)[y
k
]
= (1/

n)
n

k=1
U
k
,
where U
k
is dened in expression ( 3.12.9) of Chapter 4. By assumption (D.3), ( 3.4.8),
and a multivariate extension of the Lindeberg-Feller theorem (Rao, 1973), it follows that
S
P
AN(0,
H
). If we show that E | S
P
S

(0) |
2
0, then it follows from the
Projection Theorem 2.4.6 that S

(0) has the same asymptotic distribution as S


P
, and the
proof will be done. Equivalently, we may show that E(S
P,r
S

r
(0))
2
0 for each component
r = 1, . . . , p. Since for each r we have E(S
P,r
S

r
(0)) = 0, then
E(S
P,r
S

r
(0))
2
= V ar(S
P,r
S

r
(0))
= V ar
_
n
3/2
n

k=1
n

j=1
(x
jr
x
kr
) E[b
kj
sgn(y
j
y
k
)[y
k
] b
kj
sgn(y
j
y
k
)
_
V ar
_
n
3/2
n

k=1
n

j=1
T(y
j
, y
k
)
_
= n
3
n

k=1
n

j=1
V ar(T(y
j
, y
k
)) +n
3

m
Cov[T(y
j
, y
k
), T(y
l
, y
m
)]
where the quadruple sum is taken over (j, k) ,= (l, m). The double sum term goes to 0 since
there are n
2
bounded terms divided by n
3
. There are two types of covariance terms in the
quadruple sum: terms with four dierent indices , and terms with three dierent indices (i.e.,
one shared index). Covariance terms with four dierent indices are zero (this can be shown
by writing out the covariance in terms of expectations, and using symmetry to show that
each covariance term is zero). Thus we only need to consider covariance terms with three
dierent indices and show that the sum goes to 0. Letting k be the shared index (without
loss of generality), and noting that E T(y
j
, y
k
) = 0 for all j, k, we have
462 APPENDIX A. ASYMPTOTIC RESULTS
n
3

j=k

l=k,j
Cov[T(y
j
, y
k
), T(y
l
, y
k
)]
= n
3

j=k

l=k,j
E T(y
j
, y
k
) T(y
l
, y
k
)
= n
3

j=k

l=k,j
E [E(b
kj
sgn(y
j
y
k
)[y
k
) b
kj
sgn(y
j
y
k
)]
[E(b
kl
sgn(y
l
y
k
)[y
k
) b
kl
sgn(y
l
y
k
)]
= n
3

j=k

l=k,j
E [E(g
kj
(0) sgn(y
j
y
k
)[y
k
) g
kj
(0) sgn(y
j
y
k
)]
[E(g
kl
(0) sgn(y
l
y
k
)[y
k
) g
kl
(0) sgn(y
l
y
k
)] + o
p
(1)
where the last equality follows from the relation b
kj
= g
kj
(0) + 0
p
(1/

n). Expanding the


product, each term in the triple sum may be written as
E
_
[E(g
kj
(0) sgn(y
j
y
k
)[y
k
)]
2
_
+ E g
kj
(0) sgn(y
j
y
k
)g
kl
(0) sgn(y
l
y
k
)
2 E g
kj
(0) sgn(y
j
y
k
) [E(g
kl
(0) sgn(y
l
y
k
)[y
k
)]
= (1 + 1 2)
_
[E(g
kj
(0) sgn(y
j
y
k
)[y
k
)]
2
_
= 0
where the rst equality follows by taking conditional expectations with respect to k inside
appropriate terms.
A similar method applies to terms where k is not the shared index. The theorem is
proved.
Proof of Theorem A.6.1: Let

denote the value which minimizes Q(). Then

is the
solution to
0 = S(0) 2C
n

so that

n

= (1/2)n
2
C
1
n
[n
3/2
S(0)] AN(0, (1/4) C
1
C
1
), by Theorem A.6.4 and
Assumption (D.2), ( 3.4.7). It remains to show that

n(



) = o
p
(1). This follows from
Theorem A.6.3 and convexity of D(), using standard arguments as in Jaeckel (1972).
Studentized Residuals: NEEDS EDITED????????????????????????
Assume without loss of generality that = 0, = 0, and med e
i
= 0. In this section we
must further assume that the variance of e
i
is nite, i.e.,
2
< . From the above proof of
Theorem ??, asymptotically

can be expressed as

=
1
2
(n
2
X

A
n
X)
1
1

n
n

i=1
U
k
+ o
p
(1) , (A.6.13)
where U
k
is given in the rst paragraph of Section 5. In this section, we will assume the
weights b
ij
are given. We use the approximation ( ??) in place of U
k
, with a

ij
= b
ij
/(

12),
see ( ??). Hence, U
k
is replaced by,

k
=

12
n
2
n

j=1
(x
j
x
k
)a

kj
(1 2F(e
k
)) . (A.6.14)
A.6. ASYMPTOTIC THEORY FOR CHAPTER 5 463
The estimate of given by ( ??) can be expressed asymptotically as
=
S
n
1/2
n

i=1
sgn(e
i
) + o
p
(1) ; (A.6.15)
see McKean et al. (1990). Using ( A.6.13) and ( A.6.15) we have the following rst order
expression for the residuals e

i
, ( 3.12.7),
e

.
= e
S
1
n
n

i=1
sgn(e
i
)1
1
2

n
X(n
2
X

X)
1
1

n
n

i=1
U

k
, (A.6.16)
where A

= [a

ij
]. Because E[U

k
] = 0 and med e
i
= 0, taking expectations of both sides of
( A.6.16) leads to
E[e

]
.
= E[e
1
]1 . (A.6.17)
Write
Var(e

)
.
= E[(e

E[e
1
]1)(e

E[e
1
]1)

] . (A.6.18)
The approximate variance-covariance matrix of e

can then be obtained by subtituting the


right side of expression ( A.6.16) for e

in expression ( A.6.18) and then expanding and taking


expectations term-by-term. This is a tedious derivation, but by making use of E[U

k
] = 0,
med e
i
= 0,

n
i=1
x
i
= 0, and

n
j=1
a

ij
=

n
j=1
a

ji
= 0 we obtain expression ( 5.4.17) of
Section 5.
*************** cut this part some???????????????????????
Proof of Theorem 3.13.1
Let X
1
= [1, X] denote the matrix of centered explanatory variables with a column of ones.
Recall that

LS
= (X

1
X
1
)
1
X

1
Y . Since 1

X = 0, it can be shown that the vector of slope


parameters

satises

LS
= (X

X)
1
X

Y . Since Y = 1 +X + e, we get the relation

LS
= + (X

X)
1
X

e. (A.6.19)
From McKean et al. (1990), we have the equivalent relation

R
= +

12(X

X)
1
X

F
c
(e) + o
p
(n
1/2
) (A.6.20)
where F
c
(e) = [F
c
(e
1
) 1/2, . . . , F
c
(e
n
) 1/2]

is an n 1 vector of independent random


variables.
Now,
V ar(

LS

R
) = V ar(

LS
) + V ar(

R
) 2E(

LS
)(

R
)

=
2
(X

X)
1
+
2
(X

X)
1
2

12E[(X

X)
1
X

eF

c
(e)X(X

X)
1
]
= (X

X)
1
where =
2
+
2
2

12E[e
1
(F(e
1
) 1/2)].
Finally, note that from expressions ( A.6.19) and ( A.6.20) both

LS
and

R
are both
functions of the errors e
i
. Hence, asymptotic normality follows in the usual way by using
(A2) to show that the Lindeberg condition holds.
464 APPENDIX A. ASYMPTOTIC RESULTS
Proof of Theorem 3.2
Recall that
LS
= (Y ) = (1/n)

n
i=1
y
i
. Dening the R-intercept as the median of
residuals, it can be shown that

R
= (1/n)
s
n

i=1
sgn(y
i
) + o
p
(n
1/2
), (A.6.21)
which gives
V ar(
LS

R
) = (1/n
2
)

n
i=1
V ar(y
i

s
sgn(y
i
))
= (1/n
2
)

n
i=1
[V ar(y
i
) +
2
s
V ar(sgn(y
i
)) 2
s
cov(y
i
, sgn(y
i
))]
= (1/n
2
)

n
i=1
[
2
+
2
s
2
s
E(e
i
sgn(e
i
))
= (1/n)[
2
+
2
s
2
s
E(e
1
sgn(e
1
))].
Next we need to show that the intercept and slope dierences have zero covariance (i.e., that
the o diagonal term of A
D
is 0). This follows from the fact that 1

X = 0. Asymptotic
normality follows as in the proof of Theorem 3.1.
Proof of Theorem 4.2
Since the intercept
GR
is dened as the median of residuals, the linear approximation
is the same as
R
, so that the asymptotic variance is the same.
From Naranjo et al. (1994), we have the following approximation for the slope parameters:

GR
= + (

3/n)(X

WX)
1
S() + o
p
(n
1/2
)
where S() =

i<j
w
i
w
j
(x
i
x
j
)sgn(y
i
y
j
(x
i
x
j
)

) =

i<j
w
i
w
j
(x
i
x
j
)sgn(e
i

e
j
) is a p 1 1 random vector. Now,
V ar(

GR

LS
) = V ar(

GR
) + V ar(

LS
) 2E(

GR
)(

LS
)

=
2
(X

WX)
1
(X

W
2
X)(X

WX)
1
+
2
(X

X)
1
2E[

3/n)(X

WX)
1
S()][e

X(X

X)
1
].
It can be shown using an elementwise argument that E[S()e

] = 2nE[e
1
(F(e
1
) 1/2)]X

W
and the result follows. Asymptotic normality follows as in the proof of Theorem 3.1.
Appendix B
Larger Data Sets
This appendix contains some of the larger data sets discussed in the book.
465
466 APPENDIX B. LARGER DATA SETS
Table A.0.1: data for Example 5.2.1. For each center, the columns are weight of the unit of
crabgrass, Nitrogen level, Phosphorus level, Potassium level, and the density level.
Center 1 Center 2
Wt N Ph Po D Wt N Ph Po D
96.4 1 1 1 20 73.4 1 1 1 20
86.1 1 1 0 20 77.8 1 1 0 20
59.6 1 0 1 20 64.7 1 0 1 20
68.9 1 0 0 20 47.1 1 0 0 20
39.7 0 1 1 20 41.4 0 1 1 20
35.4 0 1 0 20 40.6 0 1 0 20
36.9 0 0 1 20 45.4 0 0 1 20
42.2 0 0 0 20 79.0 0 0 0 20
95.7 1 1 1 15 91.0 1 1 1 15
107.3 1 1 0 15 60.6 1 1 0 15
67.4 1 0 1 15 972.5 1 0 1 15
70.1 1 0 0 15 68.6 1 0 0 15
35.0 0 1 1 15 36.4 0 1 1 15
38.3 0 1 0 15 50.8 0 1 0 15
43.5 0 0 1 15 38.4 0 0 1 15
53.7 0 0 0 15 29.1 0 0 0 15
117.0 1 1 1 10 87.4 1 1 1 10
123.4 1 1 0 10 86.7 1 1 0 10
92.0 1 0 1 10 66.5 1 0 1 10
72.0 1 0 0 10 56.0 1 0 0 10
41.8 0 1 1 10 43.1 0 1 1 10
39.1 0 1 0 10 53.4 0 1 0 10
47.8 0 0 1 10 47.3 0 0 1 10
39.8 0 0 0 10 55.1 0 0 0 10
224.3 1 1 1 5 130.0 1 1 1 5
162.7 1 1 0 5 128.6 1 1 0 5
88.2 1 0 1 5 82.4 1 0 1 5
54.0 1 0 0 5 91.0 1 0 0 5
53.3 0 1 1 5 46.5 0 1 1 5
75.3 0 1 0 5 50.3 0 1 0 5
47.4 0 0 1 5 49.5 0 0 1 5
86.6 0 0 0 5 74.9 0 0 0 5
467
Table A.0.2: Data for CRP Data. Sub denotes subject, Grp denoted group. The covariate
is the same vector for each subject and is given in the rst row. For each subjects, the
responses (CRP) follow row by row in the order of the covariate.
Covariate (Times)
24 0 24 72 120
Sub Grp Responses (CRP)
1 M 1.79 0.78 0.68 0.86 0.83
2 M 0.09 0.14 0.07 0.14 0.51
3 M 1.13 0.99 1.06 0.91 0.92
4 M 2.94 2.07 1.51 1.05 1.03
5 M 1.82 2.28 2.99 1.95 1.78
6 M 0.17 0.08 0.29 0.30 0.22
7 M 0.38 0.19 0.29 0.49 0.26
8 M 1.32 0.96 1.80 0.82 1.68
9 M 0.40 0.11 0.25 0.20 0.31
10 H 0.11 0.19 0.21 0.22 0.08
11 H 0.15 0.17 0.15 0.11 0.21
12 H 0.06 0.05 0.05 0.06 0.16
13 H 0.31 0.20 0.28 0.14 0.24
14 H 0.21 0.28 0.35 0.10 0.20
15 H 0.36 0.37 4.54 1.80 1.10
16 H 0.92 1.23 1.47 0.96 0.81
17 H 0.52 0.52 0.51 0.31 0.35
18 H 2.05 1.61 1.35 0.73 0.67
468 APPENDIX B. LARGER DATA SETS
Index
log-rank scores
computation, 122
accelerated failure time models, 214
added variable plot, 200
affine, 352
transformation, 352
affine invariant rank methods
Oja signed rank statistic, 391
algorithm
tracing, 45
analysis of covariance models, see experimental designs
angle sign test, 366
efficiency relative to Hotelling's T², 371, 376
anti-ranks, 10, 36, 46
argmin, 5
Arnold transformation, 335, 336
asymptotic linearity, 21
L₁, 24
general signed-rank scores, 433
linear model, 168, 439
L₁, 445
two sample scores, 104
asymptotic power, 26
asymptotic power lemma, 26
asymptotic relative efficiency, 26
asymptotic representation
R-estimates for mixed model, 325
R-estimates in linear model, 176
spatial median, 367
Behrens-Fisher problem, 135
Mann-Whitney-Wilcoxon, 135
modified, 140
Mathisen's test, 138
Welch t-test, 141
Blumen's bivariate sign test, 378
efficiency relative to Hotelling's T², 378
bootstrap, 28
bounded in probability, 24
bounded influence, see HBR-estimates
breakdown, see general signed-rank scores, see Mann-Whitney-Wilcoxon
L₁ two sample, 116
acceptance breakdown, 33
asymptotic value, 31
estimation, 30
rejection breakdown, 33
expected rejection breakdown, 34
HBR-estimates, 239
Central Limit Theorem
Lindeberg-Feller, 421
cluster correlated data, see mixed model
componentwise estimates, 359
breakdown, 392
efficiency, 360, 363
Hodges-Lehmann estimate
efficiency relative to the mean vector, 365
influence function, 392
componentwise estimating equations, 356
componentwise tests
sign tests, 361
Wilcoxon test
efficiency relative to Hotelling's T², 365
conditional residuals, 328
confidence interval, 7
comparison of two samples, 60
efficiency, 27
estimate of standard error, 27
interpolated confidence intervals, 57
shift parameter, 62
consistent test, 19
rank-based tests in linear model
reduction in dispersion, 182
contaminated normal distribution, 25, 40
contiguity, 424
convex, 436
correlation model, 220
Huber's condition, 221
R-coefficient of multiple determination, 224
properties of, 225
traditional coefficient of multiple determination, 223
delete i model, 208
difference in means, 77
direct product, 396
dispersion function
linear model, 155
functional representation, 185
one sample, 5
quadratic approximation, 169
two sample, 76
efficacy, 21
efficiency, 25
bivariate, 354
efficiency: L₁ versus L₂, 25
elliptical model, 351
equivariance
scale, 96
translation, 96
equivariant estimator, 115
bivariate, 352
examples, data
correlation model
Hald Data, 230
diagnostics
Cloud Data, 200, 206
Free Fatty Acid Data, 210
experimental design
Box-Cox Data, 320
LDL Cholesterol of Quail, 279, 283, 285, 287, 293
Lifetime of Motors, 298
Marketing Data, 304
Pigs and Diets, 307
Poland China Pigs, 283, 317
Rat Data, 314
Rate Data, 319
Snake Data, 302, 318
linear model
Baseball Salaries Data, 160
Bonds Data, 250
Hawkins Data, 251
Potency Data, 162, 202
Quadratic Data, 246
Stars Data, 234, 235, 246
Telephone Data, 159
log linear model
Insulating Fluid Data, 218
longitudinal model
CRP data, 345
mixed models
crab grass, 330
Milliken and Johnson Data, 337
multivariate experimental design
Paspalum Grass Data, 413
multivariate linear model
Tablet Potency Data, 408
multivariate location
Brains of Mice Data, 399
Cork Borings Data, 367, 375
Cotton Dust Data, 356
Mathematics and Statistics Exam Scores, 383, 391
one sample
Cushney-Peebles Data, 13, 60
Darwin Data, 144
Shoshoni Rectangles Data, 15, 52
proportional hazards
Lifetime of Insulation Fluid, 122
two samples
Hendy and Charles Coin Data, 63, 82, 98
Quail Data, 80, 106, 112, 115
exchangeable, 327
experimental designs
analysis of covariance models, 300
covariates, 300
contrasts, 276
estimation, 285
hypotheses testing, 283
incidence matrix, 277
means model, 276
medians model, 276, 286
multiple comparison procedures, 288
Bonferroni, 289
experiment error rate, 288
family error rate, 288
pairwise confidence intervals, 293
pairwise tests, joint rankings, 291
pairwise tests, separate rankings, 292
Protected LSD Procedure, 289
Tukey, 290
Tukey-Kramer, 291, 297
multivariate
means model, 412
medians model, 412
Oneway design, 277
pseudo-observations, 287
twoway model, 296
additive model, 296
interaction, 296
main effect, 296
profile plots, 296
extreme value distribution, 119
Fisher information, 49, 93
Fligner-Killeen
rank-based R software (RBR), 129
folded aligned samples, 125
Friedman test statistic, 319
full model, 4
Gateaux derivative, 446
GEE estimates, 340
GEEWR estimates, 342
asymptotic distribution, 343
general rank scores, 165, see regression model
piecewise linear, 106
two sample scores, 100
asymptotic linearity, 104
estimate of shift, 101, 105
gradient function, 101
gradient rank test, 101
normal scores, 106
null distribution, 102
pseudo-norm, 99
general scores
computation, 101
general signed-rank scores, 45, 431
asymptotic breakdown, 51
confidence interval, 47
derived from two sample general rank scores, 107
efficacy, 48
functional, 45
gradient function, 45
influence function, 448
influence function of the estimate, 51
linearity, 433
local asymptotic distribution theory, 431, 432
optimal score function, 47
Pitman regular, 48
test statistic
asymptotic power lemma, 48
null distribution, 47
generalized longitudinal model
GEE estimates, 340
GEEWR estimates, 342
longitudinal data, 339
GF(2m₁, 2m₂), 215
GR-estimates, 233
gradient process, 5
gradient test, 6
consistency, 20
hazard function, 118
HBR-estimates, 232
asymptotic distribution, 237, 455
breakdown, 239
dispersion function
quadratic approximation, 460
gradient, 232
asymptotic null distribution, 237, 460
influence functions, 241
intercept, 244
linearity, 458
pseudo-norm, 232
Helmert transformation, see Arnold transformation
high breakdown estimates, see HBR-estimates, see LTS-estimates
Hotelling's T², 352
HR estimate, 382
Huber's condition, 165
independent error model, see linear model
influence function, 32, 446, see R-estimates in linear model, see rank-based tests in linear model
interpolated confidence intervals, 57
intraclass correlation coefficient, 328
invariant test statistic, 352
JR estimates, 324
Kendall's τ, 261
Kronecker product, 396
Kruskal-Wallis, 282, see experimental designs, see multivariate linear model
lack of fit, 219
Lawley-Hotelling trace statistic, 397
least squares, see norm
LeCam's Lemmas, 424
Lehmann alternatives, 118
linear model, 153, see R-estimates in linear model, see rank-based tests in linear model, see multivariate linear model
approximation of R-residual, 204
external R-studentized residual, 208
general linear hypotheses, 154
independent error model, 323
internal R-studentized residuals, 205
least squares, 156
reduction in sums of squares, 157
L₁-estimates, 194
reduction in dispersion, 195
longitudinal data, 323
model misspecification, 196
departure from orthogonality, 198
R pseudo-norm, 154
RDBETAS, 210
RDCOOK, 209
RDFFIT, 209
linear rank statistics, see regression model
linearity, see asymptotic linearity
location functional, 2
center of symmetry, 4
in two sample model, 74
location model
multivariate location model, 351
one sample, 2
symmetric, 36
two sample, 74
location parameter, 2
log linear models, 215
long form, see mixed model
longitudinal data, 323
generalized model, 339
LTS-estimates, 234
Mann's test for trend, 70
Mann-Whitney-Wilcoxon, 78, see multivariate linear model
asymptotic linearity, 95
computation, 80
confidence interval, 79, 92, 97
efficacy, 95
estimate of shift, 78, 96
approximate distribution, 97
breakdown, 116
influence function, 117
relative efficiency to difference in means, 97
gradient, 78
Hodges-Lehmann estimate, 78
Pitman regular, 94
τ, 95
estimate of, 97
test, 78
consistency, 91
efficiency relative to the t-test, 95
general asymptotic distribution, 89
null asymptotic distribution, 90
null distribution theory, 83
power function, 92
projection, 87
sample size determination, 96
unbiased, 93
marginal residuals, 328
Mathisen's two sample test, 112
confidence interval, 114
mean
breakdown, 31
influence function, 32
two sample location, 117
location functional, 3
mean shift model, 206
median, 7
asymptotic distribution, 8
bootstrap, 29
breakdown, 31
confidence interval, 8
estimate of standard error, 28
influence function, 32
location functional, 3
spatial median, 366
standard deviation of, 8
minimum distance, 5
Minitab software, 12
one-sample computation, 12
mixed model, 323
general, 323
long form, 323
R estimates, 324
Mood's median test, 109
confidence interval, 110
efficacy, 111
estimate of shift, 109
influence function, 116
multiple comparison procedures, see experimental designs
multivariate linear model, 395
estimating equations, 396
Kruskal-Wallis, 402
Mann-Whitney-Wilcoxon, 398
means model, 412
medians model, 412
profile analysis, 418
R-estimates, 404
test for regression effect, 397
tests of general hypotheses, 405
MWW, see Mann-Whitney-Wilcoxon
Noether's condition, 165
norm, 5, see pseudo-norm
L₁-norm, 7, see Mathisen's two sample test
asymptotic linearity, 24
confidence interval, 8
dispersion, 7
efficacy, 24
estimating equation, 7
gradient, 7
Pitman regularity, 23
L₂-norm, 8
t-test, 9
efficacy, 24
influence function, 449
Pitman regularity, 24
estimate induced by, 5
Weighted L₁-Norm, 9
gradient function, 10
normal scores, 45
breakdown, 51
efficiency relative to the L₂, 49
empirical ARE, 52
Oja criterion function, 387
Oja median, 388
Oja sign test, 388
one sample location model, see norm
optimal score function
one sample, 47
two sample, 103
ordered alternative, 320
orthogonal transformation, 352
outlier, 206
paired design model, 143
compared with completely randomized design, 143
Pitman regular, 21, 352
power function, 18
asymptotic power lemma, 26
profile analysis, see multivariate linear model
projection theorem, 86
proportional hazards models, 118
linear model, 215
log exponential model, 119
log rank test, 119
Mann-Whitney-Wilcoxon, 118
pseudo-median, 11
location functional, 3
pseudo-norm, 75, see linear model, see HBR-estimates
L₁-pseudo-norm, 108, 109
L₂-pseudo-norm, 77
confidence interval, 77
gradient, 76
gradient test, 76
Mann-Whitney-Wilcoxon, 78
reduction in dispersion, 76
pure error dispersion, 219
QR-decomposition, 192
quadratic approximation
L₁, 445
dispersion function linear model, 169, 439
quadraticity, 169
R rank-based software
one-sample computation, 12
R software, 2
url, 2
R-estimates for Arnold transformed model, 336, see also R-estimates for mixed model
asymptotic distribution, 337
R-estimates for mixed model, 324
asymptotic distribution, 325
R-estimates for simple mixed model, 327, see also R-estimates for mixed model
asymptotic distribution, 327
R-estimates in linear model, 155
asymptotic distribution, 169
influence function, 170, 451
intercept, 171
joint asymptotic distribution, 173, 174, 176
internal R-studentized residuals, 159
Newton type algorithm, 192
R normal equations, 156
randomized block design, see simple mixed
model
randomized block designs, see mixed model
rank scores, see general rank scores
rank transform, 310
rank-based R software (RBR), 2, see Url
interpolated confidence intervals, 60
one sample general scores, 45
one sample sign, 12
one sample t, 13
one sample Wilcoxon, 12
one sample, normal scores, 52
two sample log-rank scores, 122
two sample general scores, 101
two sample Mann-Whitney-Wilcoxon, 80
two sample scale, 129
Winsorized signed-rank Wilcoxon, 72
rank-based software (RBR)
one sample, normal scores, 69
rank-based tests in linear model, see multivariate linear model
aligned rank test, 181
efficiency relative to LS, 184
gradient test, 156
influence function, 181
reduction in dispersion, 157
Fφ, 157, 158
asymptotic distribution, local alternatives, 183
consistency, 182, 441
influence function, 453
null asymptotic distribution, 178
scores test, 158, 166
null asymptotic distribution, 180
Wald type test, 158
null asymptotic distribution, 180
ranked set sampling, 53
Mann-Whitney-Wilcoxon, 124
reduced model, 6
reduction in dispersion, 6
regression model, 422
linear rank statistics, 423
local asymptotic distribution theory, 424, 427
null distribution theory, 423
repeated measures, see mixed model
resolving test, 19
RGLM, 191
robust distance, 233
RSS, 53
sample size determination
one sample, 26
two sample, 96
scale problem
unknown locations, 128
score test, 6
scores, see general signed-rank scores, see general rank scores
selection of predictors, 232
shift parameter, 74
estimate of, 76
sign test, 8
consistency, 20
distribution free, 8
nonparametric, 8
null distribution, 8
Pitman regularity, 23
ranked set sampling, 54
rejection breakdown, 33
signed-rank scores, see general signed-rank scores
signed-rank statistics, see general signed-rank scores
simple mixed model, 327
conditional residuals, 328
intraclass correlation coefficient, 328
marginal residuals, 328
robust estimates of variance components, 328
Studentized residuals, 329
total variance, 328
variance components, 328
spatial median, 366
asymptotic representation, 367
breakdown point, 393
spatial rank methods
rotational invariant rank vectors, 373
spatial signed rank statistic, 374
spatial ranks
spatial signed rank tests efficiency, 376
spatial sign test, 366, see angle sign test
Spearman's rₛ, 260
multivariate linear model, 398
stepwise model building, 232
stochastic ordering, 83
Studentized residuals
for Arnold transformed model, 337
for simple mixed model, 329
HBR fit of linear model, 245
R fit of linear model, 205
simple mixed model, 329
symmetry
diagonal symmetry, 351
t-test
one sample, 9
Pitman regularity, 24
rejection breakdown, 34
two sample, 77
τ+, 48
τ, 103, 164
estimate of, 188
consistency, 190
through the origin, 264
ties
Wilcoxon signed-rank, 13
total variance, 328
two sample location model, see pseudo-norm
two sample scale model, 125
folded aligned samples, 125
general scores, 127
Mood test, 131
Siegel-Tukey test, 151
two sample scale problem
Fligner-Killeen test, 129
unbiased test, 18
url
R package, 2
rank-based R software (RBR), 2
variance components, 328
Wald type tests, 6
Walsh averages, 11
Wilcoxon
efficacy, 37
efficiency to the L₂, 40
Hodges-Lehmann
approximate distribution, 11
confidence interval, 12
estimate of location, 11
Pitman regular, 37
pseudo-median, 11
breakdown, 31, 51
influence function, 42
signed-rank test, 12
null distribution, 36
rejection breakdown, 43
ties, 13
Wilks' generalized variance, 354
Winsorized scores
signed-rank, 70
working independence, 336