P-Values vs. Patient Values
Andrew Lo is the Charles E. and Susan T. Harris professor at MIT’s Sloan School of Management, director of MIT’s Laboratory for Financial Engineering and chief investment strategist for AlphaSimplex Group LLC. Research support from the MIT Laboratory for Financial Engineering is gratefully acknowledged, as are helpful comments and discussion from Alison Bateman-House, Don Berry, Jayna Cummings, Ilan Ganot and Debra Miller.
Published May 2, 2016.
The process for approving new drugs is a high-stakes activity fraught with scientific, ethical, political and economic challenges. Regulators charged with that responsibility are under intense scrutiny because of the potential for life-and-death consequences from their actions. Therefore, the FDA and other agencies involved must walk a fine line to avoid approving an unsafe or ineffective drug – or, the other side of the coin, rejecting a safe and effective one.
At the root of this challenge is the unavoidable trade-off between two objectives: minimizing the likelihood that an ineffective drug is approved and maximizing the likelihood that an effective drug is approved. These two objectives are at odds with each other. We could reduce the chances of approving an ineffective drug to zero simply by refusing to approve any drug. However, such an approach would also eliminate the possibility of approving any effective drug. Therefore, an unavoidable aspect of the drug-approval process is deciding how to weigh the two possible errors that can be committed: approving an ineffective drug (known as type I error in the jargon of statistical analysis) and rejecting an effective one (type II error).
Within the traditional framework for randomized clinical trials, this trade-off is addressed by setting a fixed value for type I error (typically, a probability of 2.5 percent) and then determining the number of patients needed to detect a positive therapeutic benefit with reasonable probability (typically 80 to 90 percent). Then the candidate drug is administered to a sample of this size, while a matching sample of patients is given a placebo.
If the difference in the outcomes of these two groups is statistically significant at the 2.5 percent level – that is, if an ineffective drug would have no more than a 2.5 percent probability of making patients look this much better off than the placebo by chance alone – the drug is approved. This outcome is often summarized by a "p-value": if this value is .025 or less, the therapeutic benefit is considered "statistically significant."
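The conventional design just described can be sketched numerically. What follows is a minimal illustration rather than any regulator's actual procedure: it assumes a normally distributed outcome with a known standard deviation, a balanced two-arm trial and a one-sided test. The function names and the example effect size are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def patients_per_arm(effect, sd, alpha=0.025, power=0.90):
    """Patients needed in each arm to detect `effect` with the given power.

    Uses the textbook normal-approximation formula for comparing two means
    with a one-sided test at level `alpha`.
    """
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha)  # critical value of the test
    z_power = STANDARD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


def one_sided_p_value(mean_drug, mean_placebo, sd, n_per_arm):
    """Probability of a difference at least this large if the drug does nothing."""
    standard_error = sd * sqrt(2 / n_per_arm)
    z_stat = (mean_drug - mean_placebo) / standard_error
    return 1 - STANDARD_NORMAL.cdf(z_stat)


# Hypothetical trial: the drug is hoped to improve outcomes by half a
# standard deviation, and the observed difference happens to match that hope.
n = patients_per_arm(effect=0.5, sd=1.0, alpha=0.025, power=0.90)
p = one_sided_p_value(mean_drug=0.5, mean_placebo=0.0, sd=1.0, n_per_arm=n)
print(f"patients per arm: {n}, p-value: {p:.4f}, significant: {p <= 0.025}")
```

Note that nothing in this calculation depends on how severe the disease is or how many patients have it. The 2.5 percent threshold is simply an input.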
Note that by fixing the type I error at 2.5 percent across all randomized clinical trials, this approach implicitly assumes that the damage done by approving an ineffective therapy is the same for all diseases. However, for patients with terminal illnesses that have no effective treatment options, it is only common sense that this level of type I error is too stringent. Such patients may be willing to take a much greater risk of being treated with an ineffective therapy, even one that has serious side effects. After all, what's the alternative for these patients? As Prof. Don Berry, a well-known biostatistician and randomized-trial expert at the University of Texas, put it, "shouldn't we be focusing on patient values as well as p-values?"
One concrete illustration of the distinction between patient values and p-values is with Duchenne muscular dystrophy, a rare, devastating disease usually affecting boys that causes progressive muscle degeneration starting in infancy. There is currently no approved therapy for either the symptoms or root cause. Most patients are confined to wheelchairs by age 12 and dead by 25.
In this particular case, the potential conflict between regulatory and patient perspectives has come to a head with the recent FDA decision to turn down a Duchenne drug candidate, drisapersen, because of insufficient statistical evidence of efficacy and potentially serious side effects. Some parents of Duchenne patients and patients' advocates strongly objected to this decision, indicating their willingness to accept a high level of risk for the chance of potential benefits. As Debra Miller, the founder and chief executive officer of the nonprofit advocacy organization CureDuchenne (and the parent of a Duchenne patient), put it in responding to the FDA's decision:
We are disappointed that the FDA did not approve drisapersen, given the significant benefit that many Duchenne boys experienced when they were on an early and consistent treatment protocol of the drug. ... This disease, with its complex genetic roots, is unlikely to be cured by any single drug. Drisapersen was expected to be beneficial to about 13 percent of all patients.
Further complicating matters are the differing perspectives regarding the relative costs and risks of the potential outcomes. A dying patient, for example, may be much less concerned about a potentially dangerous experimental drug than the shareholders of a company that would bear the cost of wrongful-death litigation. Note, too, that there is a utilitarian conundrum in which small benefits for large numbers of patients must be weighed against devastating consequences for an unfortunate few.
Can we do better in how we decide whether a drug is approved or not?
The Limitations of Randomized Clinical Trials
Clinicians, drug regulators and other stakeholders have long recognized that these competing factors must be accounted for in the decision making, both when drug trials are designed and when they come up for approval. In fact, regulatory guidelines do allow for deviation from traditional Phase III clinical trials (testing on humans) for designated breakthrough therapies that address unmet needs for severe or rare diseases. But the relaxation of the rules is inadequate. For one thing, it is often done on an ad hoc basis and may require post-approval studies. For another, many compounds are not eligible for the special programs – fast-track, breakthrough-therapy, accelerated-approval and priority-review – that are designed to expedite the approval process. Consider, too, that the seemingly arbitrary nature of the values for the size and power of the statistical test for efficacy raises questions about their justification.
As Professor Berry has pointed out, the classic approach to designing clinical trials raises an important ethical issue: it ignores the fact that at least half the subjects are exposed to an ineffective (placebo) treatment during the trial (assuming a balanced double-blind randomized clinical trial). Indeed, the fact that half the seriously ill patients in a clinical trial will be stuck receiving the placebo is one key reason that sample sizes are not further increased to achieve greater statistical certainty. Although we have seen the emergence of new hybrid designs, including sequential and adaptive tests, none accounts for the severity of the target disease and the ethics of depriving some of the desperately ill of a last hope for effective treatment.
The classic randomized-trial framework also does not explicitly account for the potential number of patients who will eventually be affected by the outcome of a trial. It stands to reason that patients suffering from the target disease will experience a positive effect from an approved effective drug or a negative effect from an approved ineffective drug or a rejected effective drug. Thus, the sample size of the trial should depend on the size of the population that will be affected by the outcome of the trial.
Enter Thomas Bayes
Professor Berry suggested one approach to incorporating patient perspectives that involves assigning different costs to the different outcomes, and then determining the optimal type I and type II error levels – that is, the levels that minimize the overall expected cost of the decision. In a recent paper I wrote with Leah Isakov (Pfizer) and Vahid Montazerhodjat (MIT), we applied a standard Bayesian decision analysis framework to the drug-approval process in which the costs of type I and type II errors are explicitly specified using historical U.S. burden-of-disease data. Such an analysis is an established decision-making process for determining the optimal action in cases involving multiple uncertain outcomes. Different possible scenarios are considered and the consequences of each are carefully quantified and then combined into a single cost function that can be minimized. The decision that minimizes this function is the Bayesian decision analysis optimal action.
It is important to note that the term "cost" in our context refers to the health consequences of incorrect decisions for current and future patients, not necessarily the financial cost. Moreover, our cost measure is not related to the pharmaco-economic concept of "cost effectiveness." In the Bayesian decision analysis framework, cost is meant to reflect the consequences of a given decision under specific assumptions about disease prevalence, severity and patient preferences.
We first define costs associated with the clinical trial given the null hypothesis (the treatment is ineffective) and alternative hypothesis (the treatment is effective). We assign prior probabilities to these two hypotheses and formulate the expected cost associated with the trial. The optimal sample size and critical value for the test are then jointly determined to minimize the expected cost of the trial.
We applied the Bayesian framework only to traditional fixed-sample randomized clinical trials. But it could also be used for more sophisticated (though still somewhat controversial) adaptive randomized-trial designs in which information gathered during a trial is used to revise the trial midstream.
For illnesses with extremely high costs – as measured by a combination of objective costs, such as the total number of years of life lost from a given disease, as well as subjective costs, such as the preferences of patients and their caregivers – the optimal type I error may be considerably higher than 2.5 percent. That is, it may be worth approving such drugs even if there is, say, a 10 or 20 percent chance that the measured benefits only reflected chance. By the same token, mild diseases may warrant lower levels of type I error.
We found that the current standards of drug approval are weighted more toward avoiding a type I error than a type II error. That's understandable, given the FDA's mandate to protect the public, but it does not necessarily reflect the severity of all diseases or the values of all stakeholders.
Take the example of clinical trials for therapies targeting pancreatic cancer, a disease with a five-year survival rate of just 1 percent. We calculated a Bayesian decision analysis optimal type I error of 23.9 percent to 27.8 percent, depending on the assumed power of an efficacious treatment. On the other hand, for clinical trials for prostate cancer therapies, in which the disease's prognosis is not as grim, we calculated an optimal type I error of 1.2 percent to 1.5 percent. This suggests that the standard 2.5 percent threshold may be far too conservative when applied to potential therapies for terminal diseases and too aggressive in other cases. It depends on the relative costs and severity of the disease.
We categorized the costs associated with a clinical trial into two groups:
In-trial costs, which are independent of the final decision of the clinical trial but depend on the number of patients in the trial.
Post-trial costs, which depend solely on the final outcome of the trial and are assumed to be independent of the number of recruited patients.
In-trial costs are mainly related to patients' exposure to inferior treatment, for example, the exposure of enrolled patients to an ineffective and harmful drug in the treatment arm or the delay in treating all patients (those in the control group and the general population) with an effective drug. If the current treatment or placebo is assumed not to be harmful, the patients in the control arm will experience no extra cost. But if the drug is effective, the situation is quite different. In that case, for every additional patient in the trial, there will likely be an incremental delay in the emergence of the drug in the market, which affects all patients inside and outside the trial.
We assume that there is no post-trial cost associated with making a correct decision (rejecting an ineffective drug or approving an effective one) and count post-trial costs associated with type I and type II errors. Specifying asymmetric costs for type I and type II errors allows us to incorporate the consequences of these two errors with different weights in our formulation.
Take the case of pancreatic cancer, where patients can benefit tremendously from an effective therapy. The type II cost – caused by mistakenly rejecting an effective therapy – must be larger than the type I cost. In addition, the post-trial costs associated with type I and type II errors were assumed to be proportional to the size of the target population, because the greater the prevalence of the disease, the higher the cost associated with a wrong decision.
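The joint choice of sample size and critical value can be sketched with a simple grid search. This is a deliberately stripped-down illustration, not the model from the paper: it assumes a balanced two-arm trial with a normally distributed test statistic, charges in-trial costs only to the patients receiving the inferior arm (ignoring the delay cost described above), and uses made-up values for the effect size, prior, side-effect cost, severity and population. All names in the code are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


@dataclass
class Design:
    n_per_arm: int
    critical_value: float
    alpha: float          # type I error: P(approve | drug ineffective)
    power: float          # 1 - type II error: P(approve | drug effective)
    expected_cost: float


def expected_cost(n, crit, effect_sd, prior_effective,
                  side_effect_cost, severity, population):
    """Prior-weighted cost of running the trial and acting on its result."""
    drift = effect_sd * sqrt(n / 2)   # mean of the z-statistic if the drug works
    alpha = 1 - STANDARD_NORMAL.cdf(crit)
    power = 1 - STANDARD_NORMAL.cdf(crit - drift)
    # Drug ineffective: n treated patients bear side effects during the trial,
    # and a type I error exposes the whole target population to them.
    cost_if_ineffective = side_effect_cost * (n + alpha * population)
    # Drug effective: n control patients go without it during the trial,
    # and a type II error denies it to the whole target population.
    cost_if_effective = severity * (n + (1 - power) * population)
    cost = ((1 - prior_effective) * cost_if_ineffective
            + prior_effective * cost_if_effective)
    return cost, alpha, power


def optimal_design(severity, effect_sd=0.5, prior_effective=0.5,
                   side_effect_cost=0.07, population=10_000, max_n=400):
    """Search sample sizes and critical values for the lowest expected cost."""
    best = None
    for n in range(10, max_n + 1):
        for step in range(0, 401):    # critical values 0.00, 0.01, ..., 4.00
            crit = step / 100
            cost, alpha, power = expected_cost(
                n, crit, effect_sd, prior_effective,
                side_effect_cost, severity, population)
            if best is None or cost < best.expected_cost:
                best = Design(n, crit, alpha, power, cost)
    return best


# Two hypothetical diseases that differ only in severity.
mild = optimal_design(severity=0.07)
severe = optimal_design(severity=0.74)
for label, d in (("mild", mild), ("severe", severe)):
    print(f"{label}: n per arm = {d.n_per_arm}, "
          f"type I error = {d.alpha:.3f}, power = {d.power:.3f}")
```

With these made-up inputs, the search should tolerate a type I error above 2.5 percent for the severe disease and demand one below 2.5 percent for the mild disease. The specific numbers carry no clinical meaning; only the direction of the effect is the point.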
We used the U.S. Burden of Disease Study 2010 to estimate the cost parameters associated with the adverse effects of the medical treatment and the reduction in the severity of the disease to be treated. One of the key factors in quantifying the burden of disease and loss of health due to disease is the years lived with disability attributed to the study population. In general, that number is computed by first specifying the different outcomes for a particular disease and then multiplying the prevalence of each outcome by its disability weight, a measure of severity for each outcome ranging from 0 (no loss of health) to 1 (complete loss of health – that is, death). It should be noted that years lived with disability are computed only from non-fatal outcomes.
To compute the severity of a disease, we added the number of deaths (multiplied by its disability weight, 1) to the number of years lived with disability and divided the result by the number of people afflicted with, or who died from, the disease. Rather than using the absolute numbers for death, disability years and prevalence, we used their age-standardized rates (per 100,000) to get a severity estimate that's more representative of the severity of the disease in the population, based on a standard population distribution proposed by the World Health Organization. To estimate the current cost of adverse effects of medical treatment per patient, we used the corresponding values from the U.S. Burden of Disease Study 2010 mentioned above.
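The severity calculation described in the last two paragraphs amounts to a few lines of arithmetic. The sketch below is one reading of that description, with the denominator taken as prevalence plus deaths. The function names and the example figures are hypothetical and are not drawn from the U.S. Burden of Disease Study.

```python
DEATH_WEIGHT = 1.0  # disability weight for death (complete loss of health)


def years_lived_with_disability(outcomes):
    """Sum prevalence times disability weight over the non-fatal outcomes.

    `outcomes` is an iterable of (prevalence, disability_weight) pairs, with
    each weight between 0 (no loss of health) and 1 (complete loss of health).
    """
    total = 0.0
    for prevalence, weight in outcomes:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("disability weight must lie between 0 and 1")
        total += prevalence * weight
    return total


def severity(death_rate, yld_rate, prevalence_rate):
    """Average loss of health per person afflicted with, or killed by, the disease.

    All three inputs are age-standardized rates per 100,000 people.
    """
    return (DEATH_WEIGHT * death_rate + yld_rate) / (prevalence_rate + death_rate)


# Hypothetical disease with two non-fatal outcomes (rates per 100,000).
outcomes = [(40.0, 0.25), (10.0, 0.50)]
yld = years_lived_with_disability(outcomes)
prevalence = sum(p for p, _ in outcomes)
sev = severity(death_rate=5.0, yld_rate=yld, prevalence_rate=prevalence)
print(f"years lived with disability: {yld}, severity: {sev:.3f}")
```

Because no disability weight exceeds 1, the years lived with disability can never exceed prevalence, so severity computed this way always falls between 0 and 1.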
The Bayesian decision analysis optimal critical values, size and power for the leading disease-related causes of death, for an intermediate value of the treatment effect, come from my research with Isakov and Montazerhodjat (see the table on page 61). The table shows that the optimal type I errors (approving an ineffective therapy) vary considerably, depending on the costs associated with each disease. For example, in the case of respiratory syncytial virus pneumonia, the Bayesian decision analysis optimal size is 2.7 percent – nearly identical to the 2.5 percent threshold used by most randomized clinical trials. However, for pancreatic cancer, the optimal size is 26.4 percent, an order of magnitude bigger than the conventional threshold.
The reason is straightforward: even though the prevalence of pancreatic cancer is in the same ballpark as respiratory syncytial virus pneumonia (23,000 vs. 15,000 cases annually), the severity of pancreatic cancer is an order of magnitude larger (0.74 vs. 0.07). Given the much greater severity of pancreatic cancer, common sense suggests that we ought to be willing to accept a greater risk of approving an ineffective therapy. Bayesian decision analysis provides a formal framework for implementing this intuitively sensible conclusion.
However, some of the entries in the table imply that the conventional 2.5 percent test for type I error may be too aggressive – as in the case of diabetes, where the optimal level is only 1.6 percent. Without providing context by explicitly specifying the burden of disease in terms of costs and benefits for current and future patients, it is virtually impossible to say whether a standard randomized clinical trial is too conservative or too aggressive. Bayesian decision analysis provides a solid framework for making such determinations.
Light at the End of the Tunnel
The FDA is a highly scrutinized agency charged with tremendously important decisions that are often criticized as too aggressive or too conservative. Efforts to expedite the approval process for drugs targeted at serious conditions and rare diseases – by providing for faster reviews and accelerated endpoints to judge efficacy, for example – have garnered criticism for being too aggressive. At the same time, the FDA's recent decision on drisapersen has incited a new round of criticism from patients' families and advocates for being too conservative.
One reason for these criticisms is the lack of transparency with which the FDA makes its decisions and the fact that the drug-approval process has multiple stakeholders who do not necessarily share the same values or objectives. The formal Bayesian decision analysis framework, by contrast, offers a systematic, objective, transparent and repeatable process for making regulatory decisions that reflects differences in both the impact of diseases and stakeholder perspectives.
The analysis presented here is meant to be illustrative only. Practical implementation would require input from scientists, patient advocates and biopharma and health insurance professionals, as well as construction of an appropriate process for determining the required Bayesian decision analysis parameters. This may seem daunting, given that part of the input is qualitative information, such as the impact of muscle degeneration on the quality of life of a Duchenne patient and his family. However, at least two initiatives for incorporating such information into the drug-approval process are already under way.
Last May, Johnson & Johnson announced a newly formed partnership with the Division of Medical Ethics at NYU Medical School. The partnership will create an external advisory committee to review requests made to a Johnson & Johnson subsidiary, Janssen Pharmaceutical Companies, for the use of its investigational drugs for so-called compassionate use. Such deliberations may be viewed as narrower versions of the drug-approval process, since they consist of approving or rejecting the one-time use of a drug candidate for just one patient. According to Johnson & Johnson, the Compassionate-Use Advisory Committee "will make recommendations regarding individual patient requests from anywhere in the world" and Janssen clinicians "will make the final decision."
Meanwhile, last September, the FDA announced the formation of the Patient Engagement Advisory Committee. I'll spare you a recitation of the laundry list of subjects the committee is chartered to examine. Suffice it to say that, although newly launched, both of these committees have mandates that could accommodate the Bayesian decision analysis framework as an alternate method for evaluating the statistical evidence from randomized clinical trials.
Cross your fingers. There's hope that drugs will soon be evaluated in a systematic, consistent, transparent, unbiased, equitable, data-driven manner – one that, in Professor Berry's words, focuses on patient values and not just p-values.