## Abstract

Statistical procedures underpin the process of scientific discovery. As researchers, one way we use these procedures is to test the validity of a null hypothesis. Often, we test the validity of more than one null hypothesis. If we fail to use an appropriate procedure to account for this multiplicity, then we are more likely to reach a wrong scientific conclusion—we are more likely to make a mistake. In physiology, experiments that involve multiple comparisons are common: of the original articles published in 1997 by the American Physiological Society, ∼40% cite a multiple comparison procedure. In this review, I demonstrate the statistical issue embedded in multiple comparisons, and I summarize the philosophies of handling this issue. I also illustrate the three procedures—Newman-Keuls, Bonferroni, least significant difference—cited most often in my literature review; each of these procedures is of limited practical value. Last, I demonstrate the false discovery rate procedure, a promising development in multiple comparisons. The false discovery rate procedure may be the best practical solution to the problems of multiple comparisons that exist within physiology and other scientific disciplines.

- Bonferroni inequality
- false discovery rate
- least significant difference
- Newman-Keuls
- statistics

statistical procedures are inherent to scientific discovery. As researchers, we use these procedures for two main reasons: to obtain point and interval estimates about the value of a population parameter, and to test the validity of a null hypothesis (5). Point and interval estimates emphasize the magnitude and uncertainty of the experimental results. The test of a null hypothesis helps guard against an unwarranted scientific conclusion, or it helps argue for a real experimental effect (18). When more than one hypothesis is tested—when multiple comparisons are made—the validity of our scientific conclusions may be weakened if we fail to use an appropriate multiple comparison procedure (6, 8, 11, 14, 19, 20).

In studies published recently by the American Physiological Society (APS), the citation of a multiple comparison procedure is common (Table 1). This finding raises an important question: do physiologists understand the philosophies and assumptions behind competing multiple comparison procedures? This question is relevant for three reasons: there are many procedures available, textbooks of statistics (for example, Refs. 1, 13, and 18) provide little more than a cursory description of the procedures themselves, and there can be several solutions to the problem created by multiple comparisons.

In this paper, I summarize the statistical issue embedded in multiple comparisons, and I review the philosophies of handling this issue. Then, I illustrate the three procedures—Newman-Keuls, Bonferroni, least significant difference—cited most often in my literature review. Last, I review the false discovery rate, a promising development in multiple comparisons.

### Glossary

- α: Error rate for a single comparison
- α_{ℱ}: Error rate for a family of k comparisons
- H_{0}: Null hypothesis
- μ: Population mean
- P: Achieved significance level
- Pr {A}: Probability of event A
- ȳ: Sample mean
- Δ*: Critical difference between two sample means

## THE ISSUE EMBEDDED IN MULTIPLE COMPARISONS

To test a null hypothesis, we must formulate the hypothesis beforehand. Then, using data collected during the experiment, we must compute the observed value T of some test statistic. Last, we must compare the observed value T to a critical value T* , chosen from the distribution of the test statistic that is based on the null hypothesis. If T is more extreme than T*, then that is surprising if the null hypothesis is true, and we are entitled to become skeptical about the scientific validity of the null hypothesis.

Suppose we want to assess renal blood flow in two independent samples. If our objective is to compare the underlying population means, μ_{1} and μ_{2}, then one pair of null and alternative hypotheses, H_{0} and H_{1}, is

H_{0}: μ_{1} = μ_{2}  vs.  H_{1}: μ_{1} ≠ μ_{2}

The probability that we reject H_{0} given that H_{0} is true is the error rate α. We can use mathematical notation^{1} to write this statement as

α = Pr {reject H_{0} ∣ H_{0} true} = Pr {|T| ≥ T* ∣ H_{0} true}  (*Eq. 1*)

Note that the critical value T* is the 100[1 − (α / 2)]th percentile from the distribution of the test statistic given that the null hypothesis is true. *Equation 1* can be rewritten as

α = 1 − Pr {|T| < T* ∣ H_{0} true}  (*Eq. 2*)
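A short simulation makes *Eqs. 1* and *2* concrete. This is a sketch only: for simplicity it uses a two-sample z statistic with known unit variance (rather than a t statistic), so the critical value T* is a percentile of the standard normal distribution.

```python
import random
import statistics

random.seed(1)

def z_statistic(a, b):
    # Two-sample z statistic for H0: mu_1 = mu_2, unit population variance
    n = len(a)
    return (statistics.fmean(a) - statistics.fmean(b)) / (2 / n) ** 0.5

trials = 20_000
n = 10
t_star = 1.96   # 100[1 - (alpha/2)]th percentile of N(0, 1) for alpha = 0.05
rejections = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]   # both samples drawn under H0
    b = [random.gauss(0, 1) for _ in range(n)]
    if abs(z_statistic(a, b)) >= t_star:
        rejections += 1

rate = rejections / trials
print(rate)   # close to alpha = 0.05
```

Because both samples always come from the same population, every rejection is a mistake, and the observed rejection rate estimates α.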

#### Multiple comparisons.

Suppose we want to assess renal blood flow in three independent samples.^{2} In this setting, there are three alternative hypotheses, H_{1}–H_{3}, that correspond to the comparisons among population means:

H_{1}: μ_{1} ≠ μ_{2},  H_{2}: μ_{1} ≠ μ_{3},  H_{3}: μ_{2} ≠ μ_{3}

Associated with each of these comparisons is an error rate of magnitude α. If the three comparisons are considered to be a family, then the family will have an error rate α_{ℱ}, where α_{ℱ} > α. As a result, it is more likely that a true null hypothesis will be rejected erroneously. This is the statistical issue that lies at the heart of multiple comparison procedures.

To see why this issue warrants our attention, imagine that each of k independent comparisons is tested at an error rate of α. Assume that the underlying populations are identical and that each of the k null hypotheses is true. What is α_{ℱ}, the probability that at least one of the k comparisons will reject a true null hypothesis? As in *Eq. 2,* the probability of rejecting at least one H_{0} given that all H_{0} are true can be written

α_{ℱ} = 1 − Pr {reject no H_{0} ∣ all H_{0} true} = 1 − (1 − α)^{k}

For a single comparison, α_{ℱ} = α. When the number of comparisons increases, α remains constant, but α_{ℱ} increases. For example, if α = 0.05, then for k = 1, 2, 3, 4, 5, … , 10,

α_{ℱ} = 0.05, 0.10, 0.14, 0.19, 0.23, … , 0.40

For k = 10 comparisons, there is a 40% chance that we will reject erroneously at least one true null hypothesis.
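The arithmetic behind these family error rates is easy to verify with a few lines of Python:

```python
alpha = 0.05
family_rates = {k: 1 - (1 - alpha) ** k for k in range(1, 11)}

for k, rate in family_rates.items():
    print(f"k = {k:2d}   alpha_family = {rate:.2f}")
# k = 1 gives 0.05; k = 10 gives 0.40
```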

#### Misguided multiple comparisons.

In many of the studies tallied in Table 1, a multiple comparison procedure was used to analyze several groups of observations made on the same subjects. In general, this use of a multiple comparison procedure is misguided: most procedures assume that the groups are independent, but repeated observations on a subject, for example, observations made during baseline and then during several periods after some intervention, create correlation among the groups (9). As a result, the true error variability is underestimated, and the observed values for the standard deviations of the group means underestimate the true variabilities (9). When most multiple comparison procedures are used to analyze groups of repeated observations, the outcome will be an inflated number of statistically significant differences among the group means (see *Appendix*).

## PHILOSOPHIES ABOUT MULTIPLE COMPARISONS

*Would you tell me, please, which way I ought to go from here?*—Alice

*That depends a good deal on where you want to get to.*—The Cat

L. Carroll in *Alice's Adventures in Wonderland*(1865)

When we decide the validity of a single comparison, we can make a mistake: we can reject a true null hypothesis, or we can fail to reject a false null hypothesis. When we decide the validity of k comparisons—this happens in most experiments—we are more likely to reject a true null hypothesis. The challenge for any multiple comparison procedure is to satisfy two conflicting requirements: reduce the risk that we reject a true null hypothesis but maintain the likelihood that we detect an experimental effect if it exists (7,12, 17). The relative importance assigned to these requirements has produced opposing philosophies about how to handle the issue of multiple comparisons.

#### Focus on individual comparisons.

Proponents of this philosophy argue it is sufficient to control the single comparison error rate α, the probability that we reject a true null hypothesis. They base this philosophy on the assumption that most scientific comparisons are preplanned (2, 15, 16). This assumption is naive and unrealistic: many experimental effects are discovered only after an investigator explores—rummages through—the data.

#### Control for multiple comparisons.

In general, physiologists examine the impact of an intervention on a set—a family—of related comparisons: for example, the impact of some drug on renal blood flow and urinary excretion of hormones and electrolytes, or a series of paired comparisons among several groups of observations. In these situations, we base our scientific conclusions on a family of comparisons: that is, multiple comparisons considered as a single entity. As a result, it is not the single comparison error rate α that we must control but the family error rate α_{ℱ}, the probability that we reject at least one true null hypothesis in the family of comparisons (7, 8, 11–13, 17, 19, 20). Multiple comparison procedures provide control of the family error rate α_{ℱ}.

## THE GENERAL STRATEGY

Most multiple comparison procedures use the same basic strategy: to make inferences about the population means for two groups, μ_{ℓ} and μ_{ϕ}, they compare the magnitude of the difference between the sample means ȳ_{ℓ} and ȳ_{ϕ} to a critical difference Δ*. If

|ȳ_{ϕ} − ȳ_{ℓ}| ≥ Δ*

where

Δ* = c · SE {u}  (*Eq. 3*)

and where SE {u} is the standard error of the quantity u, then that is statistical evidence that μ_{ℓ} ≠ μ_{ϕ}. Procedures differ in the statistics substituted for the coefficient c and the quantity u. Table 2 lists the statistics for the Newman-Keuls, Bonferroni, and least significant difference tests.
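The strategy of *Eq. 3* amounts to a one-line decision rule; the coefficient c and the standard error are whatever the chosen procedure dictates (Table 2). The numbers in the usage example are illustrative only:

```python
def means_differ(mean_l, mean_p, c, se_u):
    """Declare evidence that mu_l != mu_p when |y_p - y_l| >= Delta* = c * SE{u}."""
    delta_star = c * se_u
    return abs(mean_p - mean_l) >= delta_star

# Illustrative numbers: c = 3.04, SE{u} = 37.1, so Delta* is roughly 113
print(means_differ(500, 650, 3.04, 37.1))   # True: a difference of 150 exceeds ~113
print(means_differ(500, 600, 3.04, 37.1))   # False: a difference of 100 does not
```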

## SIMULATED SAMPLE OBSERVATIONS

An article published recently in the Journal provides an ideal framework with which to illustrate multiple comparison procedures. In the experiment, Koch et al. (10) explored the heritability of running endurance, measured as distance run, in rats. I used the observed sample statistics from 10 experimental groups (Fig. 1) as the empirical foundation for the simulated sample observations.^{3}

This is how I generated the simulated sample observations—the data. Let the random variable Y_{j} represent the distance run by a rat in *group* j, where j = 1, 2, … , 10. Assume that each Y_{j} is distributed normally with mean μ_{j} and variance ς_{j}^{2}:

Y_{j} ∼ N(μ_{j}, ς_{j}^{2})

I estimated each μ_{j} and ς_{j} using approximate values for the observed group means and standard deviations (see Ref. 10, Tables 1 and 2). For simplicity, I limited each sample to 10 observations. One set of 10 simulated samples is listed in Table 3. For the rest of the review, I use the resulting sample means ȳ_{j} and the resulting sample standard deviations s_{j} as the basis for my illustration of specific multiple comparison procedures.
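A simulation along these lines is straightforward to reproduce. The sketch below uses hypothetical values for μ_{j} and ς_{j} (the values actually used in the review came from Ref. 10); only the sampling mechanics matter here:

```python
import random
import statistics

random.seed(10)

# Hypothetical group means and standard deviations; the review's actual
# values were read off Tables 1 and 2 of Ref. 10.
mu    = [300, 340, 360, 390, 410, 450, 480, 520, 580, 650]
sigma = [80] * 10
n = 10   # observations per group, as in the review

samples = {j: [random.gauss(mu[j - 1], sigma[j - 1]) for _ in range(n)]
           for j in range(1, 11)}

for j, y in sorted(samples.items()):
    print(j, round(statistics.fmean(y), 1), round(statistics.stdev(y), 1))
```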

## NEWMAN-KEULS PROCEDURE

The Newman-Keuls procedure^{4} is a multiple range test that compares the underlying population means of r experimental groups. That is, it evaluates the null hypothesis

H_{0}: μ_{1} = μ_{2} = … = μ_{r}  (*Eq. 4*)

The procedure sets the family error rate α_{ℱ} at α, the single comparison error rate, by using studentized range distributions to calculate critical differences (see *Eq. 5*).

Another multiple range test is the Duncan procedure.^{5} It is only the specification of α_{ℱ} that differentiates the method of Duncan from that of Newman-Keuls. The Duncan family error rate is α_{ℱ} = 1 − (1 − α)^{m−1}, where m is the number of means being compared. The Duncan multiple range test is a noted ancestor of modern multiple comparison procedures, but because α_{ℱ} grows with m, the test violates a basic tenet of multiple comparisons: the control of α_{ℱ} despite a large number of comparisons (see Ref. 12, p. 87–89).

#### The example.

To make inferences about the equality of two population means, μ_{ℓ} and μ_{ϕ}, the Newman-Keuls procedure uses the critical difference Δ^{*}_{m}, defined as

Δ^{*}_{m} = q_{m,ν}^{α_{ℱ}} · SE {ȳ}  (*Eq. 5*)

In *Eq. 5,* the coefficient q_{m,ν}^{α_{ℱ}} is the 100[1 − α_{ℱ}]th percentile from a studentized range distribution with m means and ν degrees of freedom, and SE {ȳ} is the standard error of the sample mean. Using the pooled sample variance s^{2} = 6,883 (see Table 3), the standard error of the sample mean is estimated as

SE {ȳ} = √(s^{2}/n) = √(6,883/10) ≈ 26.2

Suppose we define α_{ℱ} = 0.05. In this simulated experiment, there are ν = 90 degrees of freedom (see Table 3). Because there can be groups of m = 2, 3, … , 10 consecutive sample means, there are nine critical differences to be calculated using *Eq. 5* (Table 4).

A simple graphical technique can communicate the inferences based on these critical differences. First, we list the sample means in ascending order (see Table 3). Then, for each group of m consecutive means, progressing from largest to smallest m, we compare the magnitude of the m-mean range, ȳ_{ϕ} − ȳ_{ℓ}, to its corresponding critical difference Δ^{*}_{m}. If

ȳ_{ϕ} − ȳ_{ℓ} ≤ Δ^{*}_{m}

then we underline the group of m means: we are unable to discriminate among them. If

ȳ_{ϕ} − ȳ_{ℓ} > Δ^{*}_{m}

then we draw no line: we have identified at least one difference. At the end of this process, it is only those means that remain unconnected that we can discriminate statistically.

To illustrate this technique, we begin with m = 10: the range of all 10 means exceeds Δ^{*}_{10}, so we draw no line. In fact, for m = 9, 8, … , 4, ȳ_{ϕ} − ȳ_{ℓ} > Δ^{*}_{m}, so we draw no lines.

The next step is to evaluate groups of m = 3 consecutive means; the final step is to evaluate pairs (m = 2) of adjacent means. At this point, we can stop: all remaining pairs of consecutive means were underlined in the preceding step, when m = 3.

The Newman-Keuls procedure leads to these conclusions about the 10 sample means: μ_{2} resembles μ_{8} and μ_{7} but differs from μ_{4}, μ_{1}, … , μ_{9}; and μ_{9} differs from all other means. Table 5 lists the inferences for the 16 preplanned group comparisons.
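The underline logic itself is mechanical and can be sketched in a few lines. Here `delta` maps a stretch length m to its critical difference Δ^{*}_{m}; in the review these values come from studentized range percentiles (Table 4), but the numbers below are made up for illustration:

```python
def newman_keuls_underlines(means, delta):
    """means: sample means sorted ascending.
    delta: dict mapping m (number of consecutive means) -> Delta*_m.
    Returns (start, end) index pairs of underlined (indistinguishable) stretches."""
    r = len(means)
    lines = []
    for m in range(r, 1, -1):                      # largest stretches first
        for i in range(r - m + 1):
            j = i + m - 1
            if any(a <= i and j <= b for a, b in lines):
                continue                           # already inside an underline
            if means[j] - means[i] <= delta[m]:
                lines.append((i, j))               # cannot discriminate these m means
    return lines

# Toy example: the three smallest means group together; the largest stands alone.
result = newman_keuls_underlines([0, 1, 2, 10], {2: 1.5, 3: 2.5, 4: 3.5})
print(result)   # [(0, 2)]
```

Means left outside every underlined stretch (index 3 here) are the ones we can discriminate statistically.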

#### Practical considerations.

The Newman-Keuls procedure evaluates all r (r − 1)/2 paired comparisons among r sample means from a balanced design. The test assumes the r means are independent and are based on identical numbers of observations (Ref. 12, p. 86). When it compares more than three means, the Newman-Keuls procedure no longer caps the family error rate α_{ℱ} at α; instead, α_{ℱ} > α (Ref. 8, p. 127). For this reason, the Newman-Keuls procedure is of limited value for multiple comparisons.

## BONFERRONI PROCEDURE

The Bonferroni inequality is a probability inequality that does control the family error rate α_{ℱ}. For a family of k comparisons, the Bonferroni inequality defines the upper bound of the family error rate to be

α_{ℱ} ≤ kα

where α is the error rate for each comparison. In other words, the inequality assigns an error rate of α_{ℱ} / k to each comparison within the family. Because α can vary among comparisons, the general expression for the family error rate is

α_{ℱ} ≤ α_{1} + α_{2} + … + α_{k}
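The inequality is easy to check numerically: if each of k independent comparisons is tested at α = α_{ℱ}/k, the exact family error rate 1 − (1 − α)^{k} never exceeds the Bonferroni bound kα.

```python
k = 16
alpha_family_target = 0.05
alpha = alpha_family_target / k          # per-comparison error rate, ~0.003

exact = 1 - (1 - alpha) ** k             # family rate if comparisons are independent
bound = k * alpha                        # Bonferroni upper bound

print(exact, bound)                      # exact ~0.0488, bound 0.05
```

The gap between the exact rate and the bound is why the Bonferroni approach is slightly conservative even in the independent case.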

#### The example.

To make inferences about the equality of two population means, μ_{ℓ} and μ_{ϕ}, the Bonferroni procedure relies on the critical difference Δ*, defined as

Δ* = t_{α/2,ν} · SE {ȳ_{ϕ} − ȳ_{ℓ}}  (*Eq. 6*)

In *Eq. 6,* the coefficient t_{α/2,ν} is the 100[1 − (α / 2)]th percentile from a t distribution with ν degrees of freedom, and SE {ȳ_{ϕ} − ȳ_{ℓ}} is the standard error of the difference between the sample means.

If we define α_{ℱ} = 0.05, then for each of the 16 preplanned comparisons listed in Table 5

α = α_{ℱ} / k = 0.05 / 16 ≈ 0.003

Therefore, because there are ν = 90 degrees of freedom (see Table 3), t_{α/2,ν} = 3.04. Using the pooled sample variance s^{2} = 6,883, the standard error of the difference between sample means is estimated as

SE {ȳ_{ϕ} − ȳ_{ℓ}} = √(2s^{2}/n) = √(2 · 6,883/10) ≈ 37.1  (*Eq. 7*)

By virtue of *Eq. 6,* the resulting critical difference for the Bonferroni procedure is

Δ* = 3.04 × 37.1 ≈ 113

The Bonferroni procedure leads to the conclusions about the 10 sample means summarized in Table 5, which lists the resulting inferences for the 16 preplanned group comparisons.

#### Practical considerations.

Although it is not a multiple comparison procedure per se, the Bonferroni inequality can be used for multiple comparison problems. The technique is valid regardless of whether the r sample means are independent or correlated (Ref. 12, p. 67). The Bonferroni inequality is appealing because it is versatile and simple. Unfortunately, its appeal is diminished by the strict protection of the single comparison error rate α. As a consequence, the Bonferroni inequality is conservative: it will be unable to detect some of the actual differences among a family of k comparisons (see Table 5).

## LEAST SIGNIFICANT DIFFERENCE PROCEDURE

The least significant difference (LSD) procedure, developed by Sir R. A. Fisher, preceded the Newman-Keuls multiple range test. Like the Newman-Keuls test, the LSD procedure compares the underlying population means of r experimental groups (see *Eq. 4*), and it sets the family error rate α_{ℱ} at the single comparison error rate α.

#### The example.

To make inferences about the equality of two population means, μ_{ℓ} and μ_{ϕ}, the LSD procedure uses the critical difference Δ*, defined as

Δ* = t_{α_{ℱ}/2,ν} · SE {ȳ_{ϕ} − ȳ_{ℓ}}  (*Eq. 8*)

In *Eq. 8,* the coefficient t_{α_{ℱ}/2,ν} is the 100[1 − (α_{ℱ} / 2)]th percentile from a t distribution with ν degrees of freedom, and SE {ȳ_{ϕ} − ȳ_{ℓ}} is the standard error of the difference between the sample means.^{6}

If we define α_{ℱ} = 0.05, then because there are ν = 90 degrees of freedom (see Table 3), t_{α_{ℱ}/2,ν} = 1.99. As shown in *Eq. 7,* SE {ȳ_{ϕ} − ȳ_{ℓ}} = 37.1. Therefore, by virtue of *Eq. 8,* the resulting critical difference for the LSD procedure is

Δ* = 1.99 × 37.1 ≈ 74

The LSD procedure leads to the conclusions about the 10 sample means summarized in Table 5, which lists the resulting inferences for the 16 preplanned group comparisons.
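Put side by side, the LSD and Bonferroni arithmetic (numbers taken from the worked examples) shows how much more liberal the LSD procedure is:

```python
se_diff = 37.1        # SE of the difference between sample means (Eq. 7)

t_lsd  = 1.99         # t percentile for alpha_family/2 = 0.025 at 90 df (LSD)
t_bonf = 3.04         # t percentile for alpha/2 with alpha = 0.05/16 at 90 df

delta_lsd  = t_lsd * se_diff     # LSD critical difference
delta_bonf = t_bonf * se_diff    # Bonferroni critical difference

print(round(delta_lsd), round(delta_bonf))   # 74 113
```

A pair of sample means whose difference falls between 74 and 113 is declared different by the LSD procedure but not by the Bonferroni procedure.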

#### Practical considerations.

The LSD procedure evaluates all r (r − 1)/2 paired comparisons among r sample means. In its protected form, the procedure is done only if a preliminary analysis of variance is statistically significant (18). When it compares more than three means, the LSD procedure fails to maintain the family error rate α_{ℱ} at α (Ref. 8, p. 139). The solution to this problem is to replace t_{α_{ℱ}/2,ν} in *Eq. 8* with a percentile from a studentized range distribution: q_{r−1,ν}^{α_{ℱ}} (Ref. 8, p. 139) or q_{r,ν}^{α_{ℱ}} (Ref. 12, p. 92).^{7}

## FALSE DISCOVERY RATE PROCEDURE: A RECENT DEVELOPMENT

In most experiments, scientists strive to make a discovery: to reject a null hypothesis. When an experiment involves a family of k comparisons, a scientist is more likely to make a mistaken discovery. The false discovery rate procedure^{8} is a promising solution to the problem of multiple comparisons. This procedure controls not the family error rate α_{ℱ} but the false discovery rate f_{ℱ}, the expected fraction of null hypotheses rejected mistakenly

f_{ℱ} = E {(number of true H_{0} rejected) / (total number of H_{0} rejected)}

If all k null hypotheses are true,^{9} then f_{ℱ} = α_{ℱ}; if at least one null hypothesis is not true, then f_{ℱ} ≤ α_{ℱ} (3). When we define the family error rate α_{ℱ}, we also set an upper bound on the false discovery rate f_{ℱ}. But if we control f_{ℱ} rather than α_{ℱ}, we gain statistical power, the ability to detect an experimental effect if it exists (3, 4, 22).

#### The example.

Unlike the preceding methods, the false discovery rate procedure operates on achieved significance levels (P values) to make inferences about a family of k comparisons. Let P_{i} represent the significance level associated with comparison i. To execute this procedure, we must complete three steps:

*Step 1.* Order the k comparisons by decreasing magnitude of P_{i}.

*Step 2.* For i = k, k − 1, … , 1, calculate the critical significance level d^{*}_{i} as

d^{*}_{i} = (i / k) · f_{ℱ}  (*Eq. 9*)

*Step 3.* If P_{i} ≤ d^{*}_{i}, then reject the null hypotheses associated with the remaining i comparisons.^{10}

In the simulation, we selected k = 16 comparisons of interest. For each comparison, we evaluate the null hypothesis H_{0}: μ_{ℓ} = μ_{ϕ} by doing a t test. The P values associated with the resulting t statistics vary from 0.723 → 0.001^{−} (Table 6). If we define the false discovery rate f_{ℱ} = 0.05, the magnitude of the family error rate α_{ℱ} we have been using, then the critical significance level d^{*}_{i} varies from 0.050 → 0.003. In *step 3,* we declare *comparisons 1–14* to be statistically significant (see Table 6). Table 5 lists the inferences for all 16 comparisons.
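The three steps translate directly into code. This sketch follows the standard Benjamini-Hochberg form of the procedure, in which the i-th smallest of k P values is compared with (i / k) · f_{ℱ} and the i comparisons with the smallest P values are rejected at the first success:

```python
def fdr_rejections(p_values, f_family=0.05):
    """Return the number of comparisons rejected by the false discovery
    rate procedure: with P(1) <= ... <= P(k), find the largest i such that
    P(i) <= (i / k) * f_family, and reject comparisons 1..i."""
    k = len(p_values)
    p = sorted(p_values)                     # ascending: P(1) <= ... <= P(k)
    for i in range(k, 0, -1):                # scan thresholds from f_family down
        if p[i - 1] <= (i / k) * f_family:
            return i
    return 0

# Toy family of k = 4 P values: the three smallest survive the procedure.
n_rejected = fdr_rejections([0.50, 0.03, 0.01, 0.02])
print(n_rejected)   # 3
```

Note that 0.03 is rejected even though it exceeds the Bonferroni threshold 0.05/4 = 0.0125; this is the power gained by controlling f_{ℱ} rather than α_{ℱ}.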

#### Practical considerations.

Because the false discovery rate procedure operates on actual P values, it is quite versatile. For example, the procedure can be employed when a family of k comparisons involves different test statistics such as Student t and Wilcoxon signed rank statistics (3, 4). The false discovery rate procedure is valid when the k comparisons are independent (a sample mean is part of only one comparison) or correlated (a sample mean is part of more than one comparison, as in the example) (3,4, 22).

The false discovery rate procedure has two important benefits. First, it allows us to make an inference, with 100[1 − (f_{ℱ} / 2)]% confidence, about the direction of a statistical difference (4, 22). For example, because f_{ℱ} = 0.05, we can declare, with 97.5% confidence, that μ_{2} < μ_{8} (see Table 6). This is a stronger inference than the simple declaration μ_{2} ≠ μ_{8} (Ref. 8, p. 27–39). Second, the statistical results for a set of primary comparisons are largely consistent despite substantial changes in the number of secondary comparisons included within the family (22).

## SUMMARY

We dare not seek a single multiple comparison procedure for all experiments.

Adapted from John W. Tukey (1994)

This remark, written by a pioneer in the area of multiple comparisons, reflects the range of multiple comparison problems that manifest themselves in scientific research. Over the last 50–60 years, statisticians have explored numerous approaches in an effort to address these problems (8, 12). In physiology, as in other disciplines, experiments that involve problems of multiple comparisons are common.

In this review, I have shown that, as researchers, we are more likely to reject a true null hypothesis if we fail to use a multiple comparison procedure when we analyze a family of comparisons. I have also illustrated the three procedures cited most often in APS journals: Newman-Keuls, Bonferroni, and LSD. Unfortunately, each of these is of limited value. In many experimental situations, the Newman-Keuls and LSD procedures fail to control the family error rate, the probability that we reject at least one true null hypothesis. In contrast, the Bonferroni inequality is overly conservative: it fails to detect some of the actual differences that exist within the family.

Finally, I have reviewed the false discovery rate: a versatile, simple, and powerful approach to multiple comparisons. As Tukey suggests, it is perhaps unrealistic to expect that a single multiple comparison procedure will suffice for all situations: a statistical procedure designed specifically for a particular experimental situation will perform better than a general procedure. Nevertheless, there is growing evidence (4, 22) that the false discovery rate procedure may be the best practical solution to the problems of multiple comparisons that exist within science.

## Acknowledgments

I thank Dr. Steven L. Britton (Department of Physiology and Molecular Medicine, Medical College of Ohio) and colleagues for permission to cite their study.

## Appendix

For all but one of the multiple comparison procedures listed in Table 1, an important assumption is that the r experimental groups are independent (12).^{11} In many studies that use these multiple comparison procedures, however, the r groups are not independent. This happens because investigators make repeated observations on each subject: these observations are correlated by virtue of individual biological makeup (9). Therefore, the true error variability is underestimated, and the observed values for the standard deviations of the group means underestimate the true variabilities (9).

To appreciate the impact of correlation on variability, imagine an investigation in a sample of n subjects. In each subject, some random variable X is measured during two experimental conditions: a control period and a subsequent intervention period. Let the random variable measured during the control period be designated X_{1} and that during the intervention period be designated X_{2}. Assume that X_{1} and X_{2} are distributed normally:

X_{1} ∼ N(μ_{1}, ς_{1}^{2}) and X_{2} ∼ N(μ_{2}, ς_{2}^{2})

If the random variables X_{1} and X_{2} are considered jointly, then the distribution of the variable pair (X_{1}, X_{2}) can be envisioned as a bivariate normal distribution. For this distribution, ς_{2‖1}, the standard deviation of the conditional distribution of X_{2} given that X_{1} equals a specific value, depends on the correlation ρ between X_{1} and X_{2}:

ς_{2‖1} = ς_{2} √(1 − ρ^{2})

Because repeated observations on a subject are correlated, that is, because ρ ≠ 0, the standard deviation of the variable measured during a second condition, given the value of the first measurement, is reduced by a factor of √(1 − ρ^{2}).
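The size of that reduction for a few representative correlations (a quick check of the factor √(1 − ρ^{2})):

```python
import math

sigma = 1.0   # marginal standard deviation of X2, set to 1 for illustration
for rho in (0.0, 0.5, 0.8, 0.95):
    sigma_cond = sigma * math.sqrt(1 - rho ** 2)   # conditional SD of X2 given X1
    print(f"rho = {rho:4.2f}   conditional SD = {sigma_cond:.3f}")
# rho = 0.8 already cuts the conditional SD to 0.6 of its marginal value
```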

## Footnotes

↵1 In comments about my review of statistical concepts (Ref. 5), one referee wrote that my exposition was mathematical and therefore unfriendly. I use mathematics for two reasons: mathematics is one dialect of the language of science, and the precision of mathematical notation simplifies communication and clarifies reasoning. Nevertheless, because I appreciate that readers will have different levels of comfort with mathematics, I integrate the mathematics with text summaries.

↵2 For r experimental groups, there are r (r − 1)/2 paired comparisons possible.

↵3 Statistical calculations and exercises were executed using SAS Release 6.12 (SAS Institute, Cary, NC, 1996).

↵4 This procedure is known also by the name Student-Newman-Keuls.

↵5 Nearly 6% (18 / 321) of the reviewed manuscripts that report a multiple comparison procedure used the Duncan procedure.

↵6 Because α_{ℱ} = α, this critical difference is simply the allowance used to obtain a 100(1 − α)% confidence interval for the difference ȳ_{ϕ} − ȳ_{ℓ} (see Ref. 5, *Eq. A2*).

↵7 When the latter coefficient is used in *Eq. 8,* the method is called the wholly (or honestly) significant difference procedure.

↵8 This procedure is available within SAS Release 6.12 by using the fdr option in Proc MultTest.

↵9 Because of the artificial nature of null hypotheses (5), this is a rare occurrence.

↵10 If i < k when P_{i} ≤ d^{*}_{i}, then there will be k − i null hypotheses that cannot be rejected.

↵11 The lone exception is the Bonferroni inequality, which allows the r experimental groups to be correlated.

Address for reprint requests and other correspondence: D. Curran-Everett, Department of Preventive Medicine and Biometrics, B-195, University of Colorado Health Sciences Center, 4200 East 9th Ave., Denver, CO 80262 (E-mail: dcurran-{at}carbon.cudenver.edu).

- Copyright © 2000 the American Physiological Society