Your Randomization Scheme Is a Design Decision, Not a Coin Flip
The Bayesian trial design conversation tends to start after patients are assigned to arms. Priors, borrowing, monitoring, posterior inference. All of it assumes the randomization is settled. Stratified permuted block, 1:1 allocation, done.
But how you randomize affects power, balance, ethical allocation, and regulatory credibility. Most biostatisticians treat it as plumbing. It's architecture.
Three kinds of "adaptive"
The word "adaptive" is doing too much work in clinical trials. When someone says they're running an adaptive trial, they could mean three fundamentally different things.
Covariate-adaptive randomization adjusts who goes where based on baseline characteristics. Pocock-Simon minimization, stratified permuted block, Hu and Hu's general framework. No outcome data are used. The adaptation is to covariates measured before treatment begins.
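To make the minimization idea concrete, here is a minimal sketch of Pocock-Simon-style assignment for two arms. All names, the covariate structure, and the biased-coin probability are illustrative, not taken from any trial protocol: the point is only that each new patient is steered toward whichever arm would reduce marginal imbalance, with a random element retained.

```python
import random

def pocock_simon_assign(patient, counts, covariates, p_best=0.8, rng=random):
    """Assign a new patient to arm 0 or 1 by marginal minimization.

    counts[cov][level] is a two-element list tracking how many prior
    patients at that covariate level went to each arm. The arm that
    would minimize total marginal imbalance is chosen with probability
    p_best (a biased coin, so assignment stays partly random).
    """
    imbalance = []
    for arm in (0, 1):
        total = 0
        for cov in covariates:
            c = counts[cov][patient[cov]][:]  # copy: imbalance if assigned here
            c[arm] += 1
            total += abs(c[0] - c[1])
        imbalance.append(total)
    if imbalance[0] == imbalance[1]:
        arm = rng.randrange(2)
    else:
        best = 0 if imbalance[0] < imbalance[1] else 1
        arm = best if rng.random() < p_best else 1 - best
    for cov in covariates:
        counts[cov][patient[cov]][arm] += 1
    return arm
```

Running this over a few hundred simulated patients keeps every covariate margin within a handful of patients of perfect balance, which is exactly what a stratified design buys you without enumerating every stratum.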
Response-adaptive randomization adjusts allocation probabilities based on accumulating outcome data. If one arm is outperforming another, future patients are more likely to be assigned to the better-performing arm.
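A toy version of that update, using conjugate Beta-Binomial posteriors rather than any real trial's model, shows the mechanism. This is a generic Thompson-sampling-style sketch; the function name and prior choice are my own illustration.

```python
import random

def rar_probability(successes, failures, n_draws=1000, rng=random):
    """Estimate P(arm 1 beats arm 0) under independent Beta(1+s, 1+f)
    posteriors. That probability (often tempered in practice) becomes
    the next allocation probability -- the basic response-adaptive idea."""
    wins = 0
    for _ in range(n_draws):
        p0 = rng.betavariate(1 + successes[0], 1 + failures[0])
        p1 = rng.betavariate(1 + successes[1], 1 + failures[1])
        wins += p1 > p0
    return wins / n_draws
```

With 30/40 responders on arm 1 against 15/40 on arm 0, the posterior probability that arm 1 is better is near 1, so nearly all subsequent patients would be routed there; with identical data it hovers at 0.5 and allocation stays even.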
Bayesian adaptive design adjusts the trial itself: sample size, active arms, doses, stopping rules, all driven by posterior probabilities computed from accumulating data. This is what the FDA's January 2026 guidance primarily addresses.
These categories aren't mutually exclusive. I-SPY 2 uses all three. But conflating them causes confusion in protocols, in regulatory discussions, and in the literature. A trial that adapts randomization based on covariates is solving a completely different problem than one that adapts allocation based on outcomes.
The power you're already losing
Most Phase III oncology trials use either stratified permuted block randomization or Pocock-Simon minimization. Both are covariate-adaptive. Both are reasonable. Neither is free.
The inferential cost is subtle and under-discussed. Under covariate-adaptive randomization, the standard two-sample t-test is conservative: its actual type I error rate falls below the nominal level, so it yields valid but unnecessarily wide confidence intervals, and you lose power you already paid to have. Shao, Yu, and Zhong showed this in Biometrika (2010), and subsequent work has formalized corrected tests that recover that power: bootstrap t-tests, randomization tests, and corrected t-tests that account for the covariate-adaptive structure.
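The conservativeness is easy to see in a simulation. The sketch below is a hypothetical setup of my own, not from the paper: a strongly prognostic binary covariate, stratified permuted blocks of size two, no true treatment effect, and an unadjusted two-sample t-statistic. Because stratification forces the covariate into balance, the naive standard error overstates the variability of the mean difference, and the test rejects far less often than its nominal 5%.

```python
import math
import random
import statistics

def one_trial(n=100, effect=2.0, rng=random):
    """One null trial: prognostic covariate z shifts the outcome, the
    treatment does nothing, and allocation uses stratified permuted
    blocks of size 2 within each z stratum. Returns the unadjusted
    two-sample t-statistic."""
    pending = {0: [], 1: []}   # unassigned arm labels per stratum
    y = {0: [], 1: []}         # outcomes per arm
    for _ in range(n):
        z = rng.randrange(2)
        if not pending[z]:
            block = [0, 1]
            rng.shuffle(block)
            pending[z] = block
        arm = pending[z].pop()
        y[arm].append(effect * z + rng.gauss(0, 1))
    m0, m1 = statistics.mean(y[0]), statistics.mean(y[1])
    v0, v1 = statistics.variance(y[0]), statistics.variance(y[1])
    return (m1 - m0) / math.sqrt(v0 / len(y[0]) + v1 / len(y[1]))

def rejection_rate(reps=1000, rng=random):
    """Fraction of null trials rejected at the nominal 5% level."""
    return sum(abs(one_trial(rng=rng)) > 1.96 for _ in range(reps)) / reps
```

Under this setup the empirical rejection rate comes out well under 5%, which is the conservativeness Shao, Yu, and Zhong formalized: the test is valid but quietly underpowered.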
The carat R package (Ma et al., Journal of Statistical Software 2023) implements these corrections and has accumulated over 68,000 CRAN downloads. It supports six covariate-adaptive procedures and three hypothesis testing methods designed specifically for inference under these designs. Most practicing biostatisticians I talk to haven't heard of it.
This doesn't require a new design philosophy. Just awareness that your testing procedure should match your randomization procedure. If you're using Pocock-Simon minimization but analyzing as though you used simple randomization, you may be leaving power on the table.
Where randomization becomes allocation: REMAP-CAP
I wrote previously about BATTLE as a model for response-adaptive randomization, a trial where the randomization itself was the learning engine. REMAP-CAP takes this further. And it exposes a failure mode nobody anticipated.
REMAP-CAP was designed in 2016 for community-acquired pneumonia. When the pandemic hit, the platform pivoted to enroll critically ill COVID patients across 197 sites in 14 countries. Between March 2020 and June 2021, 4,869 patients were randomized.
What makes REMAP-CAP structurally distinct is the multifactorial design. Each patient was simultaneously randomized across multiple independent treatment domains: corticosteroids, IL-6 receptor antagonists, anticoagulation, convalescent plasma, antivirals, antiplatelets. The original pre-COVID design had 240 possible treatment regimens from the combinatorics alone. Patients weren't routed to a single best arm. They were assigned a combination of interventions, and the Bayesian model learned which components worked, and for whom, across all domains at once.
The analytical engine was a Bayesian cumulative logistic model, re-estimated at prespecified interim analyses. Neutral priors throughout. Posterior probabilities drove two decisions: updated randomization allocations (arms performing better got higher allocation) and trial conclusions (superiority at >99% posterior probability, futility, or equivalence). The primary endpoint, organ support-free days to day 21 with death scored as -1, was fast enough to feed the adaptive machinery.
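The decision layer on top of that model can be caricatured in a few lines. The thresholds and the square-root tempering below are illustrative conventions from the response-adaptive literature, not REMAP-CAP's actual rules, and the function name is my own; the point is the dual role posterior probabilities play, driving both stopping decisions and allocation updates.

```python
def interim_decision(p_superior, p_effect_small,
                     sup_cut=0.99, fut_cut=0.95):
    """Map posterior probabilities to platform actions.

    p_superior:     P(arm beats control) from the posterior
    p_effect_small: P(effect below a minimally important threshold)
    Returns an action string and the next allocation probability
    for the experimental arm. Thresholds are illustrative only.
    """
    if p_superior > sup_cut:
        return "stop: superiority", 1.0
    if p_effect_small > fut_cut:
        return "stop: futility", 0.0
    # Otherwise keep enrolling, tilting allocation toward the
    # better-performing arm (square-root tempering keeps the tilt
    # from swinging too hard on early, noisy data).
    w = p_superior ** 0.5
    return "continue", w / (w + (1 - p_superior) ** 0.5)
```

A posterior superiority probability of 0.995 triggers a superiority stop; a 97% probability that the effect is negligible triggers futility; anything in between continues enrollment with a modest allocation tilt (0.8 superiority tempers to roughly a 2:1 allocation, not a hard switch).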
What the platform found
The IL-6 receptor antagonist domain delivered the headline result. Tocilizumab and sarilumab both showed >99.9% posterior probability of superiority over standard care (median adjusted OR 1.64 for tocilizumab, 1.76 for sarilumab). The domain was stopped for efficacy. Six-month follow-up confirmed: >99.9% probability of improved survival (adjusted HR 0.74, 95% CrI 0.61 to 0.90).
This result was practice-changing, and it only happened because the platform design got three things right simultaneously. It enrolled the right severity population (ICU patients). It allowed concurrent corticosteroid use (93% received dexamethasone after the RECOVERY trial reported). And it had the Bayesian power to distinguish the IL-6 effect on top of steroids. Four prior RCTs of IL-6 antagonists in COVID had shown no significant benefit. Those trials enrolled less severely ill patients. The multifactorial platform, by design, could detect what single-question trials could not.
Across other domains, the model worked as designed. Therapeutic anticoagulation reached futility in critically ill patients (99.9% probability of <20% relative improvement). Convalescent plasma reached futility (99.2%). Hydroxychloroquine showed probable harm (96.9%). Antiplatelet agents showed 95% probability of improved 6-month survival. Each conclusion was generated by the same Bayesian machinery, applied consistently across domains.
The failure mode no one predicted
The corticosteroid domain is where REMAP-CAP becomes instructive for a different reason.
Before COVID, the platform randomized ICU pneumonia patients to fixed-dose hydrocortisone, shock-dependent hydrocortisone, or no hydrocortisone. When COVID patients began enrolling, the domain continued, until June 2020, when the RECOVERY trial reported that dexamethasone reduced mortality in hospitalized COVID patients. Overnight, corticosteroids became standard of care.
REMAP-CAP's internal data were suggestive. Posterior probabilities of superiority were 93% for fixed-dose and 80% for shock-dependent hydrocortisone. Promising, but below the prespecified 99% threshold for a trial conclusion. The DSMB stopped the domain anyway. Once 93% of subsequent patients were receiving corticosteroids as background therapy regardless of randomization, continuing to randomize to "no hydrocortisone" was neither ethical nor informative.
The adaptive machinery worked exactly as designed. It just didn't work fast enough. RECOVERY, a simple, large, rapidly enrolling trial with a conventional design, answered the corticosteroid question before REMAP-CAP's more sophisticated Bayesian engine could reach its own conclusion.
This is a practical failure mode I haven't seen discussed enough. Response-adaptive designs optimize within-trial learning, but they can't outpace external evidence. When the scientific landscape shifts mid-enrollment, as it will in any long-running platform trial, prespecified stopping rules don't have a provision for "a different trial just published in the NEJM." That decision falls to the DSMB, operating outside the statistical model.
The pre-specification discipline I've argued for remains essential. But REMAP-CAP shows that no amount of pre-specification can fully anticipate the information environment a trial will encounter.
The lesson underneath both stories
The IL-6 result and the corticosteroid result came from the same platform, the same Bayesian machinery, the same pre-specification discipline. One was a triumph of adaptive design. The other was a reminder of its limits.
The IL-6 finding emerged because the multifactorial structure could isolate a treatment effect that four conventional trials missed. The corticosteroid finding was preempted because no amount of within-trial sophistication can outpace a 6,000-patient RCT that enrolled faster and asked a simpler question.
The practical takeaway is not "use adaptive randomization" or "don't." It's that randomization design determines what a trial can learn, not just who gets what treatment. A multifactorial platform can detect interaction effects that single-question trials cannot. A response-adaptive design can allocate patients more efficiently within a trial. But neither can control the external evidence landscape, and neither substitutes for the brute statistical force of a large, simple trial when the question is straightforward.
I've reviewed protocols where the randomization scheme was the last decision made, copied from the previous trial in the program without discussion. REMAP-CAP is a reminder that it should be among the first. The choice between covariate-adaptive, response-adaptive, and Bayesian adaptive design is a choice about what you're trying to learn, not just how you're assigning patients. And the cost of choosing wrong, whether too simple or too complex, is measured in answers you never get.
This post follows What BATTLE Got Right That Most Adaptive Trials Get Wrong and The FDA's Bayesian Guidance: Learning in Theory, Pre-Specification in Practice.
📬 For more essays on experimental design, regulatory evidence, and statistical decision-making across domains, subscribe to the Evidence in the Wild newsletter or browse the archive.