What Randomization Can't Fix

The last post argued that how you randomize is a design decision most biostatisticians treat as settled before the interesting work begins. REMAP-CAP showed what happens when that decision is taken seriously. The carat package showed that even standard covariate-adaptive randomization has inferential consequences most of us aren't accounting for.

But there's a temptation that comes with getting the design right: believing it's enough. Sometimes it is. ISIS-2 randomized 17,000 patients, asked one question, and got a p-value of 0.00001. The effect was so large that subgroup analyses, multiplicity corrections, and Bayesian priors were all beside the point. Most trials are not ISIS-2.

I've reviewed protocols where the randomization was impeccable (stratified, balanced, implemented correctly) and the trial still produced evidence I wouldn't trust. Not because the randomization failed. Because everything that came after it was undisciplined.


The gap between design and conclusion

Randomize properly. Blind properly. Control properly. Power adequately. If you do all of that, the evidence takes care of itself.

Except it doesn't.

Bevacizumab in metastatic breast cancer is the textbook case. The E2100 trial was a well-powered, randomized, open-label Phase 3 study that showed a statistically significant improvement in progression-free survival: median PFS of 11.8 months versus 5.9 months. FDA granted accelerated approval in 2008. Then three confirmatory trials (AVADO, RIBBON-1, RIBBON-2) showed smaller PFS benefits and no overall survival advantage. In 2011, after a contentious ODAC hearing where the committee voted 6-0 to withdraw the indication, FDA revoked the breast cancer approval.

Nobody challenged the randomization in any of these trials. The fight was about what PFS without OS benefit means, whether the E2100 effect size was reproducible, and whether the confirmatory trials measured the same thing the original trial measured. Design was adequate. Inferential discipline was the battleground.

The antidepressant literature tells a different version of the same story. Kirsch et al. (2008) analyzed all clinical trial data submitted to the FDA for four SSRIs and SNRIs: fluoxetine, venlafaxone, nefazodone, and paroxetine. Across 35 double-blind, randomized, placebo-controlled trials, the mean drug-placebo difference was 1.8 points on the Hamilton Depression Rating Scale (a 53-point instrument). NICE's threshold for clinical significance is 3 points. Every one of those trials was well-designed, and each drug produced the statistically significant results FDA requires for approval. But 1.8 points on a 53-point scale, where a 6-point change can come from sleep pattern shifts alone, is a precisely measured nothing.
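To make that gap concrete, here's a minimal sketch of how a 1.8-point difference behaves as samples grow. The 1.8-point difference and the 3-point NICE threshold come from the numbers above; the roughly 8-point SD of HAM-D change scores and the per-arm sample sizes are illustrative assumptions, not values from the Kirsch dataset.

```python
# Minimal sketch: statistical significance vs. clinical meaning.
# diff and nice_threshold come from the discussion above;
# sd and the per-arm sample sizes are illustrative assumptions.
import numpy as np
from scipy import stats

diff = 1.8            # mean drug-placebo difference on the HAM-D
sd = 8.0              # assumed SD of HAM-D change scores (illustrative)
nice_threshold = 3.0  # NICE's bar for clinical significance

for n_per_arm in (100, 500, 2000):
    se = sd * np.sqrt(2 / n_per_arm)   # SE of the between-arm difference
    p = 2 * stats.norm.sf(diff / se)   # two-sided p-value
    print(f"n/arm={n_per_arm:5d}  p={p:.1e}  "
          f"meets NICE threshold: {diff >= nice_threshold}")
```

The p-value collapses toward zero as N grows. The clinical verdict never changes.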

I wrote about a version of this in The Post-Hoc Problem. Pre-specification isn't a bureaucratic requirement. It's the mechanism that prevents a well-powered trial from becoming a machine for producing true but misleading findings.


What large N actually buys you

Decentralized trial platforms and digital health tools have made it cheaper to enroll large samples. When your N is 1,000 instead of 200, you detect smaller effects with tighter confidence intervals. That feels like progress.

But statistical power is indifferent to clinical meaning. A trial powered to detect very small effects will detect very small effects, including effects no patient would notice, no physician would act on, and no regulator should approve. The Kirsch analysis is the clearest example: the trials were adequately designed, the pooled effect was statistically unambiguous, and the drugs "worked" in the only sense the approval standard measures. The question was whether working and succeeding are the same thing.
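The mechanics are worth seeing once. Under the usual two-arm normal approximation, the minimum detectable effect shrinks like 1/sqrt(n); the alpha of 0.05 and 80% power in this sketch are conventional assumptions, and the effect is expressed in SD units.

```python
# Sketch: the minimum detectable effect (MDE) of a two-arm trial shrinks
# with sqrt(n), indifferent to whether the effect would matter to anyone.
# alpha = 0.05 and power = 0.80 are conventional assumptions.
import numpy as np
from scipy.stats import norm

alpha, power = 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 for 5% / 80%

for n_per_arm in (200, 1000, 5000, 20000):
    mde = z * np.sqrt(2 / n_per_arm)  # smallest detectable difference, in SD units
    print(f"n/arm={n_per_arm:6d}  MDE = {mde:.3f} SD")
```

At 200 per arm, you can only see effects of roughly 0.28 SD. At 20,000 per arm, 0.03 SD, far below anything a patient would report noticing.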

The FDA's PRO guidance exists for exactly this reason. A statistically significant change on a patient-reported outcome is only interpretable if the instrument has established content validity, test-retest reliability, and a minimal clinically important difference. The EORTC QLQ-C30 has these anchors. The PSQI has them. The HAM-D, despite its long regulatory history, has been criticized for decades on exactly this basis. Without these anchors, you have a number. You don't have evidence.

Now add multiplicity. Protocols with 15 secondary endpoints and 10 pre-planned subgroups generate 150 statistical tests. Without a pre-specified multiplicity strategy (hierarchical testing, gatekeeping, or at minimum a clear boundary between confirmatory and exploratory analyses), positive findings in secondary endpoints are hypothesis-generating. Not hypothesis-confirming. The cardiovascular outcomes trials for SGLT2 inhibitors got this right: hierarchical testing strategies that specified the exact order in which secondary endpoints would be tested, so that a positive finding on kidney outcomes carried confirmatory weight rather than exploratory suggestion.
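Under independence, 150 tests at alpha = 0.05 carry a 1 − 0.95^150 ≈ 99.95% chance of at least one false positive, which is why the order of testing matters. A fixed-sequence procedure, one common form of the hierarchical testing those trials used, is simple enough to sketch; the endpoint names and p-values below are hypothetical.

```python
# Sketch of fixed-sequence (hierarchical) testing: endpoints are tested at
# full alpha in a pre-specified order, and testing stops at the first miss.
# Everything after the stop is exploratory, not confirmatory.
def fixed_sequence(ordered_results, alpha=0.05):
    """ordered_results: (endpoint, p_value) pairs in pre-specified order."""
    confirmed = []
    for endpoint, p in ordered_results:
        if p > alpha:
            break  # hierarchy halts; later endpoints lose confirmatory status
        confirmed.append(endpoint)
    return confirmed

# hypothetical results, in the order fixed before unblinding
results = [("primary: MACE", 0.011),
           ("secondary: CV death", 0.032),
           ("secondary: renal composite", 0.004),
           ("secondary: hospitalization", 0.190)]
print(fixed_sequence(results))
# -> the first three carry confirmatory weight; hospitalization is exploratory
```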

The rare disease approvals I covered recently illustrate the inverse. Papzimeos succeeded without randomization because its endpoint was surgeries avoided. Zevaskyn succeeded because 81% of treated wounds healed versus 16% of controls. Effect sizes that don't depend on modeling assumptions, validated instruments, or multiplicity corrections. Self-evident outcomes.

A 1.8-point shift on the HAM-D at N=3,000 is not self-evident, no matter how clean the randomization was.


Where "gold standard" stops meaning what it should

In regulatory science, "gold-standard clinical trial" means something specific: adequate and well-controlled evidence of efficacy under 21 CFR 314.126. That definition covers not just design features (randomization, blinding, controls) but the adequacy of outcome measures, the appropriateness of analysis, and the pre-specification of what the trial was testing.

Outside regulatory science, "gold standard" increasingly means just the design features. Randomized? Gold standard. Blinded? Gold standard. Placebo-controlled? Gold standard. The inferential discipline, the part that makes results interpretable, drops out of the definition.

This matters because design features are visible. You can describe them in a press release. You can list them in a slide deck. "Randomized, double-blind, placebo-controlled" reads like a credential.

The inferential discipline is invisible. It lives in the statistical analysis plan, the multiplicity adjustment, the endpoint validation, the pre-specification of what "success" means before enrollment opens. When "gold standard" refers only to the visible part, it becomes available to any trial that checks the design boxes, regardless of whether the analysis supports the conclusions drawn from it.

This is creeping into spaces beyond traditional pharma: decentralized trial platforms, digital therapeutics, real-world evidence studies, direct-to-consumer health research. The design features travel easily. The inferential discipline often doesn't.


What this asks of biostatisticians

The Bayesian framework handles some of this more naturally. Posterior probabilities are less prone to the binary thinking that turns p=0.049 into proof. But even Bayesian trials need pre-specified decision thresholds, justified priors, and validated endpoints. The January 2026 guidance is explicit about this.
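What a pre-specified Bayesian decision rule looks like is easy to sketch. The Beta(1,1) priors, the 0.975 success threshold, and the response counts below are all illustrative assumptions, not values from the guidance or from any trial.

```python
# Sketch: a pre-specified Bayesian success criterion on a binary endpoint.
# Priors, threshold, and counts are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
threshold = 0.975  # decision threshold fixed before enrollment opens

resp_t, n_t = 42, 100  # hypothetical treatment arm: responders / enrolled
resp_c, n_c = 28, 100  # hypothetical control arm

# posterior draws under independent Beta(1, 1) priors
post_t = rng.beta(1 + resp_t, 1 + n_t - resp_t, 100_000)
post_c = rng.beta(1 + resp_c, 1 + n_c - resp_c, 100_000)

p_superior = np.mean(post_t > post_c)  # P(treatment rate > control rate | data)
verdict = "success" if p_superior >= threshold else "not a success"
print(f"P(superiority) = {p_superior:.3f} -> {verdict}")
```

The point isn't the machinery. It's that the threshold, the prior, and the endpoint were all fixed before the first patient enrolled.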

As the cost of running trials drops and the scale of data collection grows, the discipline required to interpret results goes up, not down. More endpoints, more subgroups, more biomarkers, more participants. Each one is an opportunity for a well-designed trial to find something that isn't there.

Randomization is where evidence begins. It isn't where evidence is established. The distance between them is filled by the least visible, least glamorous work we do: deciding what to measure, how to measure it, what would count as an answer, and committing to all of that before anyone is enrolled.

That's the part of "gold standard" that doesn't fit in a press release. It's also the part that matters.


Many evidentiary problems are baked in during trial design, long before any analysis runs. I work with teams to review trial designs and run simulation studies to evaluate operating characteristics before protocols are finalized.

For consulting inquiries: maggie@zetyra.com
For more essays on statistical design and regulatory evidence, subscribe to the Evidence in the Wild newsletter.