Departments of Preventive Medicine and Biometrics and of Physiology
and Biophysics, School of Medicine, University of Colorado Health
Sciences Center, Denver, Colorado 80262
Statistical procedures underpin the process of scientific discovery. As
researchers, one way we use these procedures is to test the validity of
a null hypothesis. Often, we test the validity of more than one null
hypothesis. If we fail to use an appropriate procedure to account for
this multiplicity, then we are more likely to reach a wrong scientific
conclusion: we are more likely to make a mistake. In physiology,
experiments that involve multiple comparisons are common: of the
original articles published in 1997 by the American Physiological
Society, ~40% cite a multiple comparison procedure. In this review,
I demonstrate the statistical issue embedded in multiple comparisons,
and I summarize the philosophies of handling this issue. I also
illustrate the three procedures (Newman-Keuls, Bonferroni, least
significant difference) cited most often in my literature review; each
of these procedures is of limited practical value. Last, I demonstrate
the false discovery rate procedure, a promising development in multiple
comparisons. The false discovery rate procedure may be the best
practical solution to the problems of multiple comparisons that exist
within physiology and other scientific disciplines.
Bonferroni inequality, false discovery rate, least significant
difference, Newman-Keuls, statistics
INTRODUCTION
STATISTICAL PROCEDURES are inherent to
scientific discovery. As researchers, we use these procedures for two
main reasons: to obtain point and interval estimates about the value of
a population parameter, and to test the validity of a null hypothesis
(5). Point and interval estimates emphasize the magnitude
and uncertainty of the experimental results. The test of a null
hypothesis helps guard against an unwarranted scientific conclusion, or
it helps argue for a real experimental effect (18). When
more than one hypothesis is tested (that is, when multiple comparisons are
made), the validity of our scientific conclusions may be weakened if we
fail to use an appropriate multiple comparison procedure (6, 8,
11, 14, 19, 20).
In studies published recently by the American Physiological Society
(APS), the citation of a multiple comparison procedure is common (Table
1). This finding raises an important
question: do physiologists understand the philosophies and assumptions
behind competing multiple comparison procedures? This question is
relevant for three reasons: there are many procedures available,
textbooks of statistics (for example, Refs. 1, 13, and 18) provide little more than a cursory description of the procedures themselves, and there can be several solutions to the problem created by multiple comparisons.
In this paper, I summarize the statistical issue embedded in
multiple comparisons, and I review the philosophies of handling this
issue. Then, I illustrate the three procedures (Newman-Keuls, Bonferroni,
least significant difference) cited most often in my literature review.
Last, I review the false discovery rate, a promising
development in multiple comparisons.
Glossary
| $\alpha$ | Error rate for a single comparison |
| $\alpha_F$ | Error rate for a family of k comparisons |
| H0 | Null hypothesis |
| µ | Population mean |
| P | Achieved significance level |
| Pr{A} | Probability of event A |
| $\bar{y}$ | Sample mean |
| $\Delta\bar{y}^*$ | Critical difference between two sample means |
THE ISSUE EMBEDDED IN MULTIPLE COMPARISONS
To test a null hypothesis, we must formulate the hypothesis
beforehand. Then, using data collected during the experiment, we must
compute the observed value T of some test statistic. Last, we must compare the observed value T to a critical value
T*, chosen from the distribution of the test statistic that
is based on the null hypothesis. If T is more extreme than
T*, then that is surprising if the null hypothesis is true,
and we are entitled to become skeptical about the scientific validity
of the null hypothesis.
Suppose we want to assess renal blood flow in two independent samples.
If our objective is to compare the underlying population means,
µ1 and µ2, then one pair of null and
alternative hypotheses, H0 and H1, is

$$H_0\colon \mu_1 = \mu_2 \qquad H_1\colon \mu_1 \neq \mu_2$$

The probability that we reject H0 given that H0 is true is the error rate
$\alpha$. We can use mathematical notation¹ to write this statement as

$$\Pr\{|T| > T^* \mid H_0 \text{ is true}\} = \alpha \tag{1}$$

Note that the critical value T* is the $100[1 - (\alpha/2)]$th percentile
from the distribution of the test statistic given that the null hypothesis
is true. Equation 1 can be rewritten as

$$\Pr\{\text{reject } H_0 \mid H_0 \text{ is true}\} = \alpha \tag{2}$$
Multiple comparisons.
Suppose we want to assess renal blood flow in three independent
samples.² In this setting, there are three alternative hypotheses,
H1-H3, that correspond to the comparisons among population means:

$$H_1\colon \mu_1 \neq \mu_2 \qquad H_2\colon \mu_1 \neq \mu_3 \qquad H_3\colon \mu_2 \neq \mu_3$$

Associated with each of these comparisons is an error rate of
magnitude $\alpha$. If the three comparisons are considered to be a family,
then the family will have an error rate $\alpha_F$, where
$\alpha_F > \alpha$. As a result, it is more likely that a
true null hypothesis will be rejected erroneously. This is the
statistical issue that lies at the heart of multiple comparison procedures.
To see why this issue warrants our attention, imagine that each of
k independent comparisons is tested at an error rate of $\alpha$.
Assume that the underlying populations are identical and that each of
the k null hypotheses is true. What is $\alpha_F$,
the probability that at least one of the k comparisons will
reject a true null hypothesis? As in Eq. 2, the probability
of rejecting at least one H0 given that all
H0 are true can be written

$$\alpha_F = \Pr\{\text{reject at least one } H_0 \mid \text{all } H_0 \text{ are true}\} = 1 - (1 - \alpha)^k$$

For a single comparison, $\alpha_F = \alpha$. When the
number of comparisons increases, $\alpha$ remains constant, but
$\alpha_F$ increases. For example, if $\alpha = 0.05$, then for
k = 1, 2, 3, 4, 5, ... , 10,

$$\alpha_F = 0.05,\ 0.10,\ 0.14,\ 0.19,\ 0.23,\ \ldots,\ 0.40$$

For k = 10 comparisons, there is a 40% chance
that we will reject erroneously at least one true null hypothesis.
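The growth of the family error rate can be checked numerically. The following is a minimal sketch, assuming k independent comparisons that are each tested at the same single comparison error rate; the function name `family_error_rate` is illustrative and does not come from the article.

```python
def family_error_rate(alpha: float, k: int) -> float:
    """Probability of rejecting at least one true null hypothesis among
    k independent comparisons, each tested at error rate alpha."""
    return 1.0 - (1.0 - alpha) ** k


# Tabulate the family error rate for the values of k used in the text.
for k in (1, 2, 3, 4, 5, 10):
    print(f"k = {k:2d}   family error rate = {family_error_rate(0.05, k):.2f}")
```

The calculation relies on independence among the comparisons; when comparisons are correlated, the expression is only an approximation.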
Misguided multiple comparisons.
In many of the studies tallied in Table 1, a multiple comparison
procedure was used to analyze several groups of observations made on
the same subjects. In general, this use of a multiple comparison
procedure is misguided: most procedures assume that the groups are
independent, but repeated observations on a subject, for example,
observations made during baseline and then during several periods after
some intervention, create correlation among the groups
(9). As a result, the true error variability is underestimated, and the observed values for the standard deviations of
the group means underestimate the true variabilities (9). When most multiple comparison procedures are used to analyze groups of
repeated observations, the outcome will be an inflated number of
statistically significant differences among the group means (see
APPENDIX).
PHILOSOPHIES ABOUT MULTIPLE COMPARISONS
"Would you tell me, please, which way I ought to go from here?" (Alice)
"That depends a good deal on where you want to get to." (The Cat)
L. Carroll in Alice's Adventures in Wonderland (1865)
When we decide the validity of a single comparison, we can make a
mistake: we can reject a true null hypothesis, or we can fail to reject
a false null hypothesis. When we decide the validity of k
comparisons (this happens in most experiments), we are more likely to
reject a true null hypothesis. The challenge for any multiple
comparison procedure is to satisfy two conflicting requirements: reduce
the risk that we reject a true null hypothesis but maintain the
likelihood that we detect an experimental effect if it exists (7,
12, 17). The relative importance assigned to these requirements
has produced opposing philosophies about how to handle the issue of
multiple comparisons.
Focus on individual comparisons.
Proponents of this philosophy argue it is sufficient to control the
single comparison error rate $\alpha$, the probability that we reject a true
null hypothesis. They base this philosophy on the assumption that most
scientific comparisons are preplanned (2, 15, 16). This
assumption is naive and unrealistic: many experimental effects are
discovered only after an investigator explores (rummages through)
the data.
Control for multiple comparisons.
In general, physiologists examine the impact of an intervention on a
set (a family) of related comparisons: for example, the impact of some
drug on renal blood flow and urinary excretion of hormones and
electrolytes, or a series of paired comparisons among several groups of
observations. In these situations, we base our scientific conclusions
on a family of comparisons: that is, multiple comparisons considered as
a single entity. As a result, it is not the single comparison error
rate $\alpha$ that we must control but the family error rate
$\alpha_F$, the probability that we reject at least one true
null hypothesis in the family of comparisons (7, 8, 11-13,
17, 19-20). Multiple comparison procedures provide control
of the family error rate $\alpha_F$.
THE GENERAL STRATEGY
Most multiple comparison procedures use the same basic strategy:
to make inferences about the population means for two groups, $\mu_j$
and $\mu_{j'}$, they compare the magnitude
of the difference between the sample means
$\bar{y}_j$ and $\bar{y}_{j'}$
to a critical difference $\Delta\bar{y}^*$. If

$$|\bar{y}_j - \bar{y}_{j'}| > \Delta\bar{y}^*$$

where

$$\Delta\bar{y}^* = c \cdot \mathrm{SE}\{u\} \tag{3}$$

and where SE{u} is the standard error of the
quantity u, then that is statistical evidence that
$\mu_j \neq \mu_{j'}$. Procedures differ in
the statistics substituted for the coefficient c and the
quantity u. Table 2 lists the
statistics for the Newman-Keuls, Bonferroni, and least significant
difference tests.
SIMULATED SAMPLE OBSERVATIONS
An article published recently in the Journal provides an ideal
framework with which to illustrate multiple comparison procedures. In
the experiment, Koch et al. (10) explored the heritability of running endurance, measured as distance run, in rats. I used the
observed sample statistics from 10 experimental groups (Fig. 1) as the empirical foundation for the
simulated sample observations.³

Fig. 1.
Experimental groups 1-10 associated with the simulated sample observations and derived sample
statistics listed in Table 3. This diagram is based on the selective
breeding procedure described in Ref. 10. The initial generation is
generation 0. In each generation, the 2 female
(♀) and 2 male (♂) rats at the extremes
of observed running endurance were paired and bred to produce the
subsequent generation.
This is how I generated the simulated sample observations (the data).
Let the random variable $Y_j$ represent the
distance run by a rat in group j, where
j = 1, 2, ... , 10. Assume that each $Y_j$ is distributed normally with mean
$\mu_j$ and variance $\sigma_j^2$.
I estimated each $\mu_j$ and $\sigma_j$ using approximate values for the observed
group means and standard deviations (see Ref. 10, Tables 1 and 2). For
simplicity, I limited each sample to 10 observations. One set of 10 simulated samples is listed in Table 3.
For the rest of the review, I use the resulting sample means
and the resulting sample standard deviations
as the basis for my illustration of specific multiple comparison
procedures.
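The sampling scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the population means and standard deviations below are hypothetical placeholders (they are not the values taken from Ref. 10), and the helper names `simulate_groups` and `pooled_variance` are invented for this example.

```python
import random
import statistics

# Hypothetical population means and standard deviations for 10 groups.
# These are placeholders for illustration, NOT the values from Ref. 10.
POP_MEANS = [300, 350, 400, 450, 500, 320, 380, 420, 480, 600]
POP_SDS = [80] * 10


def simulate_groups(means, sds, n=10, seed=1):
    """Draw n normally distributed observations for each group."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for _ in range(n)] for mu, sd in zip(means, sds)]


def pooled_variance(groups):
    """Return the pooled sample variance and its degrees of freedom."""
    df = sum(len(g) - 1 for g in groups)
    ss = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    return ss / df, df


groups = simulate_groups(POP_MEANS, POP_SDS)
sample_means = [statistics.mean(g) for g in groups]
sample_sds = [statistics.stdev(g) for g in groups]
s2, df = pooled_variance(groups)
print(f"pooled variance = {s2:.0f} with {df} degrees of freedom")
```

With 10 groups of 10 observations, the pooled variance carries 10 × (10 − 1) = 90 degrees of freedom, which matches the degrees of freedom used in the examples that follow.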
NEWMAN-KEULS PROCEDURE
The Newman-Keuls procedure⁴ is a multiple
range test that compares the underlying population means of
r experimental groups. That is, it evaluates the null
hypothesis

$$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_r \tag{4}$$

The procedure sets the family error rate $\alpha_F$ at
$\alpha$, the single comparison error rate, by using studentized range
distributions to calculate critical differences (see Eq. 5).
Another multiple range test is the Duncan
procedure.⁵ It is only the
specification of $\alpha_F$ that differentiates the method of
Duncan from that of Newman-Keuls. The Duncan family error rate is
$\alpha_F = 1 - (1 - \alpha)^{m-1}$, where m is the number of means
being compared. The Duncan multiple range test is a noted ancestor of
modern multiple comparison procedures, but because $\alpha_F$
grows with m, the test violates a basic tenet of multiple
comparisons: the control of $\alpha_F$ despite a large number
of comparisons (see Ref. 12, p. 87-89).
The example.
To make inferences about the equality of two population means,
$\mu_j$ and $\mu_{j'}$, the Newman-Keuls procedure
uses the critical difference $\Delta\bar{y}^*_m$, defined as

$$\Delta\bar{y}^*_m = q_{m,\nu} \cdot \mathrm{SE}\{\bar{y}\} \tag{5}$$

In Eq. 5, the coefficient $q_{m,\nu}$
is the $100[1 - \alpha]$th percentile from a studentized range
distribution with m means and $\nu$ degrees of freedom, and
SE{$\bar{y}$} is the standard error of the sample mean.
Using the pooled sample variance $s^2 = 6{,}883$
(see Table 3), the standard error of the sample mean is estimated as

$$\mathrm{SE}\{\bar{y}\} = \sqrt{s^2/n} = \sqrt{6{,}883/10} = 26.2$$

Suppose we define $\alpha_F = 0.05$. In this
simulated experiment, there are $\nu = 90$ degrees of freedom (see
Table 3). Because there can be groups of m = 2, 3, ... , 10 consecutive sample means, there are nine critical
differences to be calculated using Eq. 5 (Table 4).
A simple graphical technique can communicate the inferences based on
these critical differences. First, we list the sample means in
ascending order (see Table 3).
Then, for each group of m consecutive means,
progressing from largest to smallest m, we compare the
magnitude of the m-mean range,
$\bar{y}_{\max} - \bar{y}_{\min}$, to
its corresponding critical difference $\Delta\bar{y}^*_m$. If

$$\bar{y}_{\max} - \bar{y}_{\min} \le \Delta\bar{y}^*_m$$

then we underline the group of m means: we are unable
to discriminate among them. If

$$\bar{y}_{\max} - \bar{y}_{\min} > \Delta\bar{y}^*_m$$

then we draw no line: we have identified at least one difference.
At the end of this process, it is only those means that remain
unconnected that we can discriminate statistically.
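The underlining technique can be expressed as a short routine. The sketch below is one possible reading of the steps above, not a definitive implementation: the function name `newman_keuls_groups`, the sample means, and the critical differences are all hypothetical (they are not the values from Tables 3 and 4), and the routine assumes that a group lying wholly inside an already underlined group is not tested again.

```python
def newman_keuls_groups(sorted_means, critical_diff):
    """Return (start, end) index pairs of consecutive means that cannot be
    discriminated (the 'underlined' groups).

    sorted_means  -- sample means listed in ascending order
    critical_diff -- dict mapping m (number of means spanned) to the
                     critical difference for an m-mean range
    """
    r = len(sorted_means)
    underlined = []
    for m in range(r, 1, -1):                      # largest to smallest m
        for start in range(r - m + 1):
            end = start + m - 1
            # A group inside an already underlined group is not tested again.
            if any(s <= start and end <= e for s, e in underlined):
                continue
            if sorted_means[end] - sorted_means[start] <= critical_diff[m]:
                underlined.append((start, end))
    return underlined


# Hypothetical means and critical differences, for illustration only.
means = [10, 12, 13, 30, 31, 50]
crit = {2: 5.0, 3: 6.0, 4: 6.5, 5: 7.0, 6: 7.5}
print(newman_keuls_groups(means, crit))
```

In this hypothetical example, the first three means form one underlined group and the fourth and fifth means form another; only the last mean remains unconnected.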
To illustrate this technique, we begin with m = 10. The
initial step is
In fact, for m = 9, 8, ... , 4,
$\bar{y}_{\max} - \bar{y}_{\min} > \Delta\bar{y}^*_m$; therefore, we draw no lines.
The next step is to evaluate groups of m = 3
consecutive means
The final step is to evaluate pairs (m = 2) of adjacent
means
At this point, we can stop: all remaining pairs of consecutive
means were underlined in the preceding step, when m = 3.
The Newman-Keuls procedure leads to these conclusions about the 10 sample means
These are examples of inferences based on this data graphic:
µ2 resembles µ8 and µ7 but
differs from µ4, µ1, ... ,
µ9; and µ9 differs from all other means.
Table 5 lists the inferences for the 16 preplanned group comparisons.
Practical considerations.
The Newman-Keuls procedure evaluates all $r(r - 1)/2$
paired comparisons among r sample means from a balanced
design. The test assumes the r means are independent and are
based on identical numbers of observations (Ref. 12, p. 86). When it
compares more than three means, the Newman-Keuls procedure no longer
caps the family error rate $\alpha_F$ at $\alpha$; instead,
$\alpha_F > \alpha$ (Ref. 8, p. 127). For this reason, the
Newman-Keuls procedure is of limited value for multiple comparisons.
BONFERRONI PROCEDURE
The Bonferroni inequality is a probability inequality that does
control the family error rate $\alpha_F$. For a family of
k comparisons, the Bonferroni inequality defines the upper
bound of the family error rate to be

$$\alpha_F \le k\,\alpha$$

where $\alpha$ is the error rate for each comparison. In other words,
the inequality assigns an error rate of
$\alpha_F / k$ to each comparison within the
family. Because $\alpha$ can vary among comparisons, the general expression
for the family error rate is

$$\alpha_F \le \sum_{i=1}^{k} \alpha_i$$
The example.
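To make inferences about the equality of two population means,
$\mu_j$ and $\mu_{j'}$, the Bonferroni procedure
relies on the critical difference $\Delta\bar{y}^*$, defined as

$$\Delta\bar{y}^* = t_{\alpha/2,\nu} \cdot \mathrm{SE}\{\bar{y}_j - \bar{y}_{j'}\} \tag{6}$$

In Eq. 6, the coefficient $t_{\alpha/2,\nu}$
is the $100[1 - (\alpha/2)]$th percentile from a t distribution with
$\nu$ degrees of freedom, and SE{$\bar{y}_j - \bar{y}_{j'}$}
is the standard error of the difference between the sample means.
If we define $\alpha_F = 0.05$, then for each of the 16 preplanned comparisons listed in Table 5

$$\alpha = \alpha_F / k = 0.05/16 = 0.003$$

Therefore, because there are $\nu = 90$ degrees of freedom (see
Table 3), $t_{\alpha/2,\nu} = 3.04$. Using the
pooled sample variance $s^2 = 6{,}883$, the standard
error of the difference between sample means is estimated as

$$\mathrm{SE}\{\bar{y}_j - \bar{y}_{j'}\} = \sqrt{2s^2/n} = \sqrt{2(6{,}883)/10} = 37.1 \tag{7}$$

By virtue of Eq. 6, the resulting critical difference
for the Bonferroni procedure is

$$\Delta\bar{y}^* = 3.04 \times 37.1 = 112.8$$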
Therefore, the Bonferroni procedure leads to these conclusions
about the 10 sample means
Table 5 lists the resulting inferences for the 16 preplanned group comparisons.
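The Bonferroni allocation of the family error rate is simple arithmetic. The sketch below is illustrative only: the function names are invented for this example, and the second function assumes independent comparisons solely to show that the resulting family error rate stays at or below the chosen bound (the inequality itself requires no such assumption).

```python
def bonferroni_alpha(family_alpha: float, k: int) -> float:
    """Single comparison error rate that the Bonferroni inequality
    assigns to each of k comparisons."""
    return family_alpha / k


def family_rate_if_independent(alpha: float, k: int) -> float:
    """Family error rate for k independent comparisons, each at alpha."""
    return 1.0 - (1.0 - alpha) ** k


alpha = bonferroni_alpha(0.05, 16)             # 0.05 / 16 = 0.003125
achieved = family_rate_if_independent(alpha, 16)
print(f"per-comparison alpha = {alpha:.6f}")
print(f"family error rate if independent = {achieved:.4f}")
```

The achieved rate is slightly below 0.05, which is one way to see why the procedure is described as conservative.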
Practical considerations.
Although it is not a multiple comparison procedure per se, the
Bonferroni inequality can be used for multiple comparison problems. The
technique is valid regardless of whether the r sample means are independent or correlated (Ref. 12, p. 67). The Bonferroni inequality is appealing because it is versatile and simple.
Unfortunately, its appeal is diminished by the strict protection of the
single comparison error rate $\alpha$. As a consequence, the Bonferroni
inequality is conservative: it will be unable to detect some of the
actual differences among a family of k comparisons (see
Table 5).
LEAST SIGNIFICANT DIFFERENCE PROCEDURE
The least significant difference (LSD) procedure, developed by Sir
R. A. Fisher, preceded the Newman-Keuls multiple range test. Like
the Newman-Keuls test, the LSD procedure compares the underlying
population means of r experimental groups (see Eq. 4), and it sets the family error rate $\alpha_F$ at the
single comparison error rate $\alpha$.
The example.
To make inferences about the equality of two population means,
$\mu_j$ and $\mu_{j'}$, the LSD procedure uses the
critical difference $\Delta\bar{y}^*$, defined as

$$\Delta\bar{y}^* = t_{\alpha/2,\nu} \cdot \mathrm{SE}\{\bar{y}_j - \bar{y}_{j'}\} \tag{8}$$

In Eq. 8, the coefficient $t_{\alpha/2,\nu}$ is the
$100[1 - (\alpha/2)]$th percentile from a
t distribution with $\nu$ degrees of freedom, and
SE{$\bar{y}_j - \bar{y}_{j'}$} is
the standard error of the difference between the sample
means.⁶
If we define $\alpha_F = 0.05$, then because there are
$\nu = 90$ degrees of freedom (see Table 3),
$t_{\alpha/2,\nu} = 1.99$. As shown in Eq. 7,
SE{$\bar{y}_j - \bar{y}_{j'}$} = 37.1. Therefore, by virtue of Eq. 8, the resulting
critical difference for the LSD procedure is

$$\Delta\bar{y}^* = 1.99 \times 37.1 = 73.8$$
The LSD procedure leads to these conclusions about the 10 sample
means
Table 5 lists the resulting inferences for the 16 preplanned group comparisons.
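The LSD and Bonferroni critical differences share one form and differ only in the t coefficient. As a rough check, the sketch below recomputes both from the numbers quoted in the text (pooled variance 6,883, 10 observations per group, and t coefficients 1.99 and 3.04); the function names are illustrative and the coefficients are copied from the text rather than computed from a t distribution.

```python
import math


def se_of_difference(pooled_variance: float, n: int) -> float:
    """Standard error of the difference between two sample means,
    each based on n observations, using a pooled variance."""
    return math.sqrt(2.0 * pooled_variance / n)


def critical_difference(coefficient: float, se: float) -> float:
    """Critical difference: coefficient times standard error."""
    return coefficient * se


se = se_of_difference(6883, 10)
lsd = critical_difference(1.99, se)          # t coefficient for LSD
bonferroni = critical_difference(3.04, se)   # t coefficient for Bonferroni
print(f"SE = {se:.1f}, LSD = {lsd:.1f}, Bonferroni = {bonferroni:.1f}")
```

The larger Bonferroni coefficient produces the larger critical difference, so a difference between sample means that passes the LSD threshold can still fall short of the Bonferroni threshold.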
Practical considerations.
The LSD procedure evaluates all $r(r - 1)/2$ paired
comparisons among r sample means. In its protected form, the
procedure is done only if a preliminary analysis of variance is
statistically significant (18). When it compares more than
three means, the LSD procedure fails to maintain the family error rate
$\alpha_F$ at $\alpha$ (Ref. 8, p. 139). The solution to this
problem is to replace $t_{\alpha/2,\nu}$ in
Eq. 8 with a percentile from a studentized range
distribution: $q_{r-1,\nu}$
(Ref. 8, p. 139) or $q_{r,\nu}$
(Ref. 12, p. 92).⁷
FALSE DISCOVERY RATE PROCEDURE: A RECENT DEVELOPMENT
In most experiments, scientists strive to make a discovery: to
reject a null hypothesis. When an experiment involves a family of
k comparisons, a scientist is more likely to make a mistaken discovery. The false discovery rate
procedure⁸ is a promising
solution to the problem of multiple comparisons. This procedure
controls not the family error rate $\alpha_F$ but the false
discovery rate $f_F$, the expected fraction of
null hypotheses rejected mistakenly.
If all k null hypotheses are
true,⁹ then $f_F = \alpha_F$; if at least one
null hypothesis is not true, then $f_F \le \alpha_F$
(3). When we define the family
error rate $\alpha_F$, we also set an upper bound on the false
discovery rate $f_F$. But if we control
$f_F$ rather than $\alpha_F$, we gain
statistical power, the ability to detect an experimental effect if it
exists (3, 4, 22).
The example.
Unlike the preceding methods, the false discovery rate procedure
operates on achieved significance levels (P values) to make inferences about a family of k comparisons. Let
$P_i$ represent the significance level associated
with comparison i. To execute this procedure, we must
complete three steps:
Step 1. Order the k comparisons by
decreasing magnitude of $P_i$.
Step 2. For i = k, k − 1, ... , 1,
calculate the critical significance level
$d^*_i$ as

$$d^*_i = \frac{i}{k}\, f_F \tag{9}$$

Step 3. If $P_i \le d^*_i$, then reject the null hypotheses
associated with the remaining i
comparisons.¹⁰
In the simulation, we selected k = 16
comparisons of interest. For each comparison, we evaluate the null
hypothesis H0: $\mu_j = \mu_{j'}$
by doing a t test. The P
values associated with the resulting t statistics vary from
0.723 to 0.001 (Table 6).
If we define the false discovery rate $f_F = 0.05$,
the magnitude of the family error rate $\alpha_F$ we
have been using, then the critical significance level
$d^*_i$ varies from 0.050 to 0.003. In
step 3, we declare comparisons 1-14 to be
statistically significant (see Table 6). Table 5 lists the inferences
for all 16 comparisons.
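The three steps can be sketched as a short function. This is a minimal illustration rather than a definitive implementation: the function names are invented for this example, the critical level for rank i is taken to be (i/k) times the chosen false discovery rate (which reproduces the range 0.050 to 0.003 quoted for k = 16), and the P values are hypothetical because the values in Table 6 are not reproduced here.

```python
def critical_levels(k: int, fdr: float = 0.05):
    """Critical significance levels for ranks i = 1, ..., k."""
    return [(i / k) * fdr for i in range(1, k + 1)]


def false_discovery_rate_test(p_values, fdr: float = 0.05):
    """Return a list of booleans, in the order of p_values, that is True
    where the null hypothesis is rejected."""
    k = len(p_values)
    # Indices ordered by increasing P; rank i = 1 has the smallest P.
    order = sorted(range(k), key=lambda idx: p_values[idx])
    n_reject = 0
    for i in range(k, 0, -1):                   # i = k, k - 1, ..., 1
        if p_values[order[i - 1]] <= (i / k) * fdr:
            n_reject = i                        # reject this and all smaller P
            break
    rejected = [False] * k
    for idx in order[:n_reject]:
        rejected[idx] = True
    return rejected


# Hypothetical P values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(false_discovery_rate_test(p))
```

In this hypothetical family of eight comparisons, the first rank (counting down from the largest P value) that meets its critical level is i = 2, so the two smallest P values are declared statistically significant.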
Practical considerations.
Because the false discovery rate procedure operates on actual
P values, it is quite versatile. For example, the procedure can be employed when a family of k comparisons involves
different test statistics such as Student t and Wilcoxon
signed rank statistics (3, 4). The false discovery rate
procedure is valid when the k comparisons are independent (a
sample mean is part of only one comparison) or correlated (a sample
mean is part of more than one comparison, as in the example) (3,
4, 22).
The false discovery rate procedure has two important benefits. First,
it allows us to make an inference, with $100[1 - (f_F/2)]$% confidence,
about the direction
of a statistical difference (4, 22). For example, because
$f_F = 0.05$, we can declare, with 97.5%
confidence, that µ2 < µ8 (see Table
6). This is a stronger inference than the simple declaration
µ2 ≠ µ8 (Ref. 8, p. 27-39).
Second, the statistical results for a set of primary comparisons are
largely consistent despite substantial changes in the number of
secondary comparisons included within the family (22).
SUMMARY
We dare not seek a single multiple comparison procedure for
all experiments.
Adapted from John W. Tukey (1994)
This remark, written by a pioneer in the area of multiple
comparisons, reflects the range of multiple comparison problems that
manifest themselves in scientific research. Over the last 50-60
years, statisticians have explored numerous approaches in an effort to
address these problems (8, 12). In physiology, as in other
disciplines, experiments that involve problems of multiple comparisons
are common.
In this review, I have shown that, as researchers, we are more likely
to reject a true null hypothesis if we fail to use a multiple
comparison procedure when we analyze a family of comparisons. I have
also illustrated the three procedures cited most often in APS journals:
Newman-Keuls, Bonferroni, and LSD. Unfortunately, each of these is of
limited value. In many experimental situations, the Newman-Keuls and
LSD procedures fail to control the family error rate, the probability
that we reject at least one true null hypothesis. In contrast, the
Bonferroni inequality is overly conservative: it fails to detect some
of the actual differences that exist within the family.
Finally, I have reviewed the false discovery rate: a versatile,
simple, and powerful approach to multiple comparisons. As Tukey
suggests, it is perhaps unrealistic to expect that a single multiple
comparison procedure will suffice for all situations: a statistical
procedure designed specifically for a particular experimental situation
will perform better than a general procedure. Nevertheless, there
is growing evidence (4, 22) that the false discovery rate
procedure may be the best practical solution to the problems of
multiple comparisons that exist within science.
To appreciate the impact of correlation on variability, imagine an
investigation in a sample of n subjects. In each subject, some random variable X is measured during two experimental
conditions: a control period and a subsequent intervention period. Let
the random variable measured during the control period be designated X1 and that during the intervention period be
designated X2. Assume that
X1 and X2 are distributed
normally