
Science Is Not Neutral, and That’s the Point


In September 2016, the FDA approved eteplirsen for Duchenne muscular dystrophy (DMD). The advisory committee had voted 7 to 6 against accelerated approval. The FDA's own review team recommended against it. The clinical program consisted of 12 boys, and Western blot analysis showed a dystrophin increase of 0.93%, well below the 10% threshold that researchers believed was needed for clinical benefit. The 6-minute walk test showed no significant improvement.

Janet Woodcock, then head of the Center for Drug Evaluation and Research, approved it anyway. She argued that the consequences of rejecting a potentially effective therapy for dying children with no alternatives were too severe. Nature called the decision "railroading." An FDA reviewer warned it would lower evidentiary standards "to an unprecedented nadir."

Both sides had access to the same data. They reached opposite conclusions. Not because one side misread the statistics, but because they disagreed about which mistake was worse: approving a drug that might not work, or denying access to boys who would lose the ability to walk while waiting for better evidence.

That disagreement is not a failure of the regulatory process. It is the regulatory process.


The trial that destroyed its own equipoise

Seven years later, a different kind of value collision played out with sotorasib, the first KRAS-targeting drug ever approved. The FDA granted accelerated approval in 2021 based on a Phase II response rate of 36% in advanced lung cancer. The confirmatory Phase III trial, CodeBreaK 200, was supposed to convert that to full approval.

The trial met its primary endpoint. Median progression-free survival (PFS) was 5.6 months on sotorasib versus 4.5 months on docetaxel (hazard ratio 0.66). By conventional statistical standards, the result was significant.

The FDA's Oncologic Drugs Advisory Committee (ODAC) voted 10 to 2 that the data could not be reliably interpreted.

The problem was not the analysis. It was the behavior the trial generated. Sotorasib was already approved and publicly celebrated as a breakthrough. Patients and investigators knew which arm was which. In the docetaxel arm, 13% of patients withdrew consent and never received treatment. In the sotorasib arm, the figure was 1%. Nineteen patients on docetaxel crossed over to sotorasib before blinded central review could assess whether they had actually progressed. Investigators called progression earlier in the docetaxel arm (69% of cases) than in the sotorasib arm (58%).

The enthusiasm for the drug eroded the very equipoise the trial needed to produce interpretable results. As the ODAC chair put it, the question was not whether sotorasib works. It was whether this trial, "conducted with a highly anticipated agent in a hyper-information age," could answer that question reliably.

A statistically significant result was declared uninterpretable. Not because of a statistical error, but because human behavior, shaped by belief and hope, introduced bias that no analysis plan could fully correct.


What these cases actually reveal

The standard framing of regulatory controversy treats these as failures: the system broke down, the process was politicized, the science was overridden. That framing misses the point.

In both cases, the science was not overridden. The science was doing exactly what it always does: producing uncertain estimates that require judgment to interpret. The disagreement was about how to exercise that judgment.

Eteplirsen forced a choice between the evidentiary standard the system was designed to uphold and the moral weight of withholding treatment from children with a fatal disease. Sotorasib forced a reckoning with the fact that a trial's integrity depends on beliefs that exist outside the protocol. Neither case can be resolved by better statistics. They require confronting questions that statistics was never designed to answer: which errors are tolerable, who bears the cost of uncertainty, and how much we are willing to let hope compromise rigor.

These are value judgments. They enter clinical research long before anyone runs a simulation.

We choose which questions are worth asking. We choose which endpoints count as success. We choose which errors are tolerable. We choose how much uncertainty patients should bear. None of those decisions are neutral. They are judgments about harm, benefit, urgency, and trust.

Calling a method "objective" does not absolve it of those judgments. It only hides them.


Why rigor exists at all

If humans were perfectly rational, we would not need pre-specification. If belief did not precede data, we would not worry about p-hacking, outcome switching, or the kind of equipoise erosion that unraveled CodeBreaK 200. If incentives aligned cleanly with truth, we would not need regulators at all.

Rigor exists because humans are fallible. Pre-specified decision rules are not bureaucratic hurdles. They are commitments made in advance, when optimism has not yet been rewarded and disappointment has not yet been rationalized.

This is where design choices quietly become ethical ones. Bayesian frameworks matter here not because they are philosophically elegant, but because they force sponsors to state what they believe before the evidence arrives and to live with the consequences when reality disagrees. The sotorasib case shows what happens when that discipline breaks down: a trial that met its endpoint but could not be trusted, because the beliefs surrounding the drug reshaped the data before anyone had a chance to analyze it.
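To make "stating what you believe before the evidence arrives" concrete, here is a minimal sketch of a pre-specified Bayesian decision rule for a response rate, using a conjugate Beta-Binomial update. Every number in it is an illustrative assumption, not data from eteplirsen, sotorasib, or any other trial; the point is only that the prior, the clinical threshold, and the decision rule are written down before any results exist.

```python
# A minimal sketch of a pre-specified Bayesian decision rule for a response rate.
# All numbers are illustrative assumptions, not data from any trial discussed above.
from scipy import stats

# Commitments made in advance, before any data exist:
prior_alpha, prior_beta = 2, 8   # skeptical prior, centred near a 20% response rate
threshold = 0.30                 # pre-specified bar for a clinically meaningful rate
decision_prob = 0.90             # posterior probability required to declare success

# Hypothetical results observed later:
responders, n = 18, 40

# Conjugate Beta-Binomial update: Beta(alpha + successes, beta + failures)
posterior = stats.beta(prior_alpha + responders, prior_beta + (n - responders))

# Probability that the true response rate clears the pre-specified threshold
prob_above = 1 - posterior.cdf(threshold)

print(f"Posterior mean response rate: {posterior.mean():.3f}")
print(f"P(response rate > {threshold:.0%}): {prob_above:.3f}")
print("Declare success" if prob_above >= decision_prob else "Do not declare success")
```

The discipline is not in the arithmetic. It is in the fact that the prior, the threshold, and the decision rule are on the record before the responses come in, and cannot be quietly adjusted afterward.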


The fault line

Approving an ineffective therapy and rejecting an effective one are both errors. They are not symmetrical. Their consequences fall on different people, at different times, in different ways.

Patients with DMD valued access over certainty. The FDA's review team valued evidentiary standards over urgency. Woodcock weighed the asymmetry differently. In the sotorasib case, ODAC decided that even a drug with real activity could not be fully approved if the confirmatory evidence was compromised by the enthusiasm it generated.

All of these positions are internally coherent. They cannot all prevail at once. This is where most regulatory controversies actually live: not in the math, but in how different actors weigh false positives against false negatives, who is harmed by delay, and who is harmed by error.
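One way to see that weighing in the abstract, stripped of any specific trial, is a simple expected-loss comparison with asymmetric error costs. The numbers below are invented for illustration only; they show how the same evidence, and the same probability that a drug works, can justify opposite decisions depending on which error is judged worse.

```python
# Illustrative expected-loss comparison with asymmetric error costs.
# All values are invented; they are not estimates from either case above.

def expected_loss(p_effective, cost_false_approval, cost_false_rejection):
    """Expected loss of approving vs. rejecting, given P(drug works)."""
    loss_approve = (1 - p_effective) * cost_false_approval   # approve a drug that does not work
    loss_reject = p_effective * cost_false_rejection         # withhold a drug that does work
    return loss_approve, loss_reject

p = 0.40  # same evidence, same probability the drug is effective

# Reviewer who weighs eroded evidentiary standards most heavily:
approve, reject = expected_loss(p, cost_false_approval=10, cost_false_rejection=4)
print(f"Standards-first weighting: approve={approve:.1f}, reject={reject:.1f} -> "
      + ("approve" if approve < reject else "reject"))

# Reviewer who weighs lost access for patients with no alternatives most heavily:
approve, reject = expected_loss(p, cost_false_approval=4, cost_false_rejection=10)
print(f"Access-first weighting:    approve={approve:.1f}, reject={reject:.1f} -> "
      + ("approve" if approve < reject else "reject"))
```

Same data, same probability, opposite conclusions. That is the structure of the eteplirsen disagreement, and no amount of additional statistics dissolves it.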

Scientific frameworks do not resolve these conflicts. They encode them. Understanding where a design sits on that fault line is more informative than knowing which statistical method was used.


The balance worth defending

Science works when two forces are held in tension: an honest reckoning with what we owe patients and the public, and evidence that resists our preferences and punishes our overconfidence.

Lean too far in either direction and the system breaks. Pure moral conviction without rigor becomes advocacy. Pure rigor without moral seriousness becomes procedure. Eteplirsen tested one boundary. Sotorasib tested the other.

Good science lives in between. Not neutral. Disciplined.

Most debates about trial design never reach this fault line. That is where the interesting work begins.


📬 For readers interested in the fault lines between trial design, regulatory evidence, and statistical decision-making, additional essays are available via the Evidence in the Wild newsletter or the archive.

Maggie Qian

Biostatistician with a decade in oncology clinical trials. Founder of Zetyra. Writing about methods that hold up in practice.