
What A/B Testing Teams Can Learn from 40 Years of Oncology Trial Mistakes

Two worlds with remarkably similar challenges—and strikingly similar failures.

Why Tech Should Pay Attention to Cancer Trials

Most product teams believe A/B testing is a modern invention—lean, data-driven, and efficient. But if you step back, experimentation in tech today looks remarkably similar to oncology trials in the 1980s and 1990s: well-intentioned, high-effort, and full of blind spots that took decades to recognize.

The difference is that oncology has spent those decades learning—in painful, expensive, deeply human ways—what happens when experiments are poorly designed. The field transformed after spectacular failures in the early 2000s forced a reckoning with methodological complacency. Tech teams now have the rare privilege of learning those lessons without repeating the same mistakes.

This article walks through the parallels: not as a dry methodology tour, but as the story of how two experimental cultures matured, and how one can borrow wisdom from the other.

1. When Familiar Designs Fool Us: The 3+3 Story

In early cancer trials, clinicians defaulted to a simple dose-finding algorithm, the 3+3 design. It looked clean. It felt objective. It produced neatly structured cohorts of three patients at each dose level, escalating when toxicity stayed low.

But as statisticians later demonstrated, this design was quietly betraying the trials built upon it. Its dose estimates varied wildly. The same drug tested twice could yield Maximum Tolerated Doses two levels apart. Its operating characteristics were abysmal, correctly identifying the optimal dose only 30-40% of the time. Its scientific justification? Essentially nonexistent. It persisted through tradition, not merit.

What survived wasn't the method—it was the comfort of familiarity. Today, the 3+3 design is considered "practically unethical" in many contexts, replaced by model-based approaches like the Continual Reassessment Method that identify optimal doses 20-30% more accurately while exposing fewer patients to subtherapeutic or toxic doses.

The A/B Testing Parallel

Something similar happens every time a tech team defaults to "simple" without questioning whether simple is sufficient. That basic two-variant test with fixed duration and single-point analysis? It's the digital equivalent of the 3+3—familiar, traditional, and often wrong.

The simplest design is not always the most trustworthy. Just as oncology moved to model-based designs that better capture biological reality, modern A/B testing requires designs that match the complexity of the decisions they support. Sequential testing, variance reduction, and stratification aren't complications—they're corrections for the oversimplifications that make traditional tests fail.
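
To make one of those corrections concrete, here is a minimal sketch of CUPED-style variance reduction, which adjusts each user's outcome with a pre-experiment covariate so the same test can detect smaller effects. The data and names are hypothetical; treat it as an illustration of the idea, not a production implementation.

```python
import numpy as np

def cuped_adjust(y, x):
    """Adjust outcome y using a pre-experiment covariate x (e.g., each
    user's activity before assignment). The adjusted metric keeps the
    same mean but has lower variance, so effects are easier to detect."""
    theta = np.cov(y, x)[0, 1] / np.var(x)
    return y - theta * (x - np.mean(x))

# Hypothetical data: pre-period activity and a correlated post-period outcome.
rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=10_000)
post = 0.8 * pre + rng.normal(2, 2, size=10_000)

adjusted = cuped_adjust(post, pre)
print(round(np.var(post), 2), round(np.var(adjusted), 2))  # variance drops sharply
```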

2. The Intention-to-Treat Principle: A Guardrail Against Wishful Thinking

In clinical trials, the Intention-to-Treat (ITT) principle protects us from our own storytelling instincts. It forces us to analyze people as they were assigned—not as we wish they had behaved.

ITT prevents the seductive temptation to say: "Well, the treatment worked great... among the people who actually followed it." This seemingly reasonable statement hides a fatal flaw: the ability to follow treatment is itself often an outcome. The patients who can't tolerate a drug, who drop out from side effects, who never start because of contraindications—they're all part of the treatment's real-world effect.

The A/B Testing Version

Tech teams unknowingly violate ITT constantly. I've seen teams exclude users who didn't engage with the feature, analyze only "active" users who clicked through, or filter out people who bounced in under 10 seconds. Each exclusion seems reasonable in isolation. Together, they create a fantasy dataset.

But engagement is part of the outcome. A feature that users ignore has failed, regardless of how well it performs for the handful who explore it fully. Filtering based on post-treatment behavior creates a tautology: the feature works best among the people for whom it worked.

When you measure only the users who loved your feature enough to engage deeply, you're not measuring your feature—you're measuring the intersection of your feature and pre-existing user enthusiasm. That's not an experiment; it's confirmation bias with a p-value.
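
A minimal sketch of the difference, assuming a hypothetical experiment export with one row per assigned user, a variant label, an engagement flag, and a conversion outcome (all names invented for illustration):

```python
import pandas as pd

# Hypothetical export: one row per *assigned* user.
df = pd.DataFrame({
    "variant":   ["control", "treatment"] * 3,
    "engaged":   [True, True, False, False, True, False],
    "converted": [1, 1, 0, 0, 1, 0],
})

# Intention-to-treat: analyze everyone as assigned, engaged or not.
itt = df.groupby("variant")["converted"].mean()

# The tempting (and biased) cut: only users who engaged with the feature.
# Engagement happens *after* assignment, so conditioning on it breaks
# randomization and measures "the feature among people it worked for".
engaged_only = df[df["engaged"]].groupby("variant")["converted"].mean()

print(itt)           # the assigned population
print(engaged_only)  # the filtered cut can tell a very different story
```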

3. Platform Integrity: When Tiny Cracks Break the Whole Foundation

The Poisson/NSABP misconduct case rocked oncology in 1994—not because the errors were scientifically catastrophic, but because they tore at the very foundation of trust. A single investigator had falsified patient eligibility data in less than 1% of cases across the group's breast cancer trials. Reanalyses excluding his patients left the conclusions intact; the scientific impact was negligible. The reputational damage was nuclear.

Congressional hearings followed. Trials were suspended. Public trust evaporated. Years of work unraveled because one small crack suggested the entire foundation might be rotten.

In A/B Testing

You experience this when you discover a Sample Ratio Mismatch (SRM)—that moment when your 50/50 split comes back 48/52. It seems trivial, just 2% off. But that 2% signals something fundamental is broken: your randomization, your logging, your entire experimental apparatus.
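
A rough sketch of the standard check: a chi-square goodness-of-fit test against the intended allocation. The counts below are invented, but they show why a 48/52 split at realistic traffic volumes is anything but trivial.

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for an intended 50/50 split.
observed = [48_400, 51_600]           # users logged per variant
expected = [sum(observed) / 2] * 2    # what a true 50/50 split would give

stat, p_value = chisquare(observed, f_exp=expected)

# A tiny p-value (teams often use 0.001 or stricter as the alarm threshold)
# means the mismatch is very unlikely to be chance: treat the experiment,
# and possibly the platform, as suspect until the cause is found.
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
```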

I've watched teams dismiss SRM as "probably fine" only to discover weeks later that a deployment bug invalidated months of experiments. These issues compound silently—a 2% mismatch here, a missing event type there, an edge case in your assignment service—until one day a sharp-eyed analyst asks, "Can we trust any of our experiments?"

The answer, once doubt creeps in, is usually no. Every experiment becomes questionable. Every past decision needs re-examination. Trust in the platform, once shaken, takes years to rebuild. Better to treat every tiny crack as an emergency than to discover your foundation was rotten all along.

4. The Myth of the Promising Early Signal

The MMP inhibitor saga reads like a cautionary tale of collective delusion. Matrix metalloproteinases were going to revolutionize cancer treatment—the preclinical data was overwhelming. Tumor shrinkage in mice. Metastasis prevention in animal models. Mechanism of action that made perfect biological sense.

Between 1998 and 2004, pharmaceutical companies poured billions into development. Twenty-plus Phase III trials enrolled tens of thousands of patients. The early signals weren't just promising—they were spectacular.

Not a single trial succeeded.

The problem wasn't execution. It was that the early signals were systematically misleading, conflating biological activity (yes, MMP inhibitors affected tumors) with clinical benefit (no, they didn't help patients live longer or better). Companies kept explaining away failures: wrong dose, wrong cancer type, wrong patient population. The real explanation was simpler: the signal was never real.

A/B Testing's Version of the Same Mistake

Every quarter, I see teams chase their own MMP inhibitors. A feature shows a "soft positive trend" in the first week—maybe a 3% lift that's not quite significant. By week two, someone's found a segment where it's winning by 8%. By week three, there's a PowerPoint deck explaining why this is actually a breakthrough that just needs better targeting.

Six months and three iterations later, the feature is deprecated. The early signal wasn't weak—it was random noise wearing a costume. But once a team falls in love with a "promising" result, they'll find ways to keep the hope alive. Wrong timing, wrong messaging, wrong audience—never wrong idea.

The discipline to kill ideas based on weak signals is what separates mature experimentation cultures from those forever chasing shadows. In my experience, if an effect isn't clear and convincing in your primary analysis, it's probably not there at all.

5. Surrogate Metrics: When the Easy-to-Measure Betrays the Important

Oncology loves surrogate outcomes. Tumor shrinkage photographs well. Biomarker changes graph beautifully. Response rates make compelling PowerPoints. They're fast, objective, and scientific-looking.

And far too often, they mean absolutely nothing for what patients actually care about: survival, quality of life, functional recovery.

I witnessed this firsthand in the EDI study when our biomarker panel showed dramatic changes at 3 months—changes that meant absolutely nothing for 5-year survival. The proteins danced, the tumors shrank, the patients died on schedule. The surrogates had told us a story, but it was fiction.

In Tech

Spotify optimizing for listening minutes while user satisfaction plummets. Instagram maximizing story views while meaningful connections decay. Twitter driving "engagement" while discourse becomes increasingly toxic. These aren't hypothetical—they're documented cases where surrogate metrics led product teams off a cliff.

The problem isn't that click-through rates or session duration are bad metrics. It's that they're easy metrics, and easy metrics become targets, and targets reshape entire products around their optimization. You end up with a perfectly optimized system that's solving the wrong problem.

Short-term metrics behave like funhouse mirrors—they show you a distorted version of reality that becomes more distorted the harder you stare. The only protection is ruthless validation: does optimizing this metric actually drive the outcome we care about? If you haven't proven that connection empirically, you're navigating by stars that might not exist.

6. Heterogeneity: When Not Everyone Should Be Treated the Same

Biology is heterogeneous. The same drug that saves one patient kills another, depending on their genetics, tumor markers, organ function. What works for BRCA-positive breast cancer fails for triple-negative. Oncology handles this through enrichment designs—targeting the population where the signal is clearest, where the biology suggests benefit is most likely.

A/B Testing's Equivalent

Users differ in ways that matter. Platform differences aren't just technical details—iOS users spend differently than Android users. Tenure isn't just a number—new users explore while veterans optimize. Intent isn't just behavior—browsers shop while buyers purchase.

Recognizing heterogeneity is good science. But discovering heterogeneity post-hoc is dangerous science.

Why Subgroup Hunting Misleads Teams

Here's the calculation most teams never do: With 10 segments tested at α=0.05, the probability of at least one false positive isn't 5%—it's 1-(0.95)^10 = 40%. This isn't a subtle effect; it's a massive distortion of evidence.

I've watched teams discover "breakthrough insights" in segment 7 of 10, never realizing they've essentially guaranteed finding something through sheer mathematical inevitability. The feature that "works great for new Android users in Canada" probably doesn't work at all—you just rolled the dice enough times to get snake eyes.
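
The arithmetic, plus the simplest guardrail (a Bonferroni-style correction), fits in a few lines. The segment p-values below are invented for illustration:

```python
# Family-wise error rate across k independent segment tests at alpha = 0.05.
alpha, k = 0.05, 10
fwer = 1 - (1 - alpha) ** k
print(f"P(at least one false positive across {k} segments) = {fwer:.0%}")  # ~40%

# Bonferroni correction: only call a segment a win if p < alpha / k.
segment_p_values = [0.31, 0.04, 0.22, 0.008, 0.47, 0.09, 0.03, 0.61, 0.18, 0.26]
threshold = alpha / k  # 0.005
survivors = [i for i, p in enumerate(segment_p_values) if p < threshold]
print(f"Segments surviving correction: {survivors}")
# Here none survive: the 0.008, 0.03, and 0.04 "wins" are exactly the kind
# of noise the uncorrected analysis would have celebrated.
```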

The protection is pre-specification. Decide your segments before you see data. Power your test for the smallest segment you care about. And when you find that surprising subgroup effect? Treat it as a hypothesis for the next experiment, not a conclusion from this one.

7. The Temptation of Peeking

Oncology learned early that unrestricted interim looks create chaos. Every peek at accumulating data without statistical correction dramatically inflates false positive rates. A trial designed for 5% false positives can jump to 20% or more with repeated looking.

The Daily Dashboard Disease

Tech teams peek constantly. The morning metrics review. The afternoon dashboard refresh. The executive who wants daily updates on the test that's "looking good."

Each uncorrected peek adds another chance to be fooled. A team checking daily for two weeks and acting on the first significant reading turns a nominal 5% false positive rate into something closer to 20-25%, and it keeps climbing the longer the watching continues. It's not a bug; it's basic probability theory. A random walk will eventually cross any fixed threshold if you watch it long enough.
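
A small simulation makes that concrete: even in an A/A comparison with no true difference, checking daily and acting on the first reading that crosses z = 1.96 produces a false positive far more often than the nominal 5%. The traffic numbers and conversion rate below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def peeked_false_positive_rate(n_sims=2_000, days=14, users_per_day=1_000):
    """Simulate an A/A test (no true difference) where a two-proportion
    z-test is run on the cumulative data every day and the experiment is
    declared 'significant' the first time |z| > 1.96."""
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            conv_a += rng.binomial(users_per_day, 0.10)  # control conversions
            conv_b += rng.binomial(users_per_day, 0.10)  # "treatment" conversions
            n_a += users_per_day
            n_b += users_per_day
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if abs(z) > 1.96:        # nominal 5% two-sided threshold
                false_positives += 1
                break                 # the team "ships" or "kills" right here
    return false_positives / n_sims

print(peeked_false_positive_rate())  # typically ~0.20, not 0.05
```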

Every peek without correction is a step away from statistical truth. The irony is that teams peek because they want faster decisions, but peeking makes decisions less trustworthy, requiring longer tests to restore confidence. If you want faster decisions, use faster designs—sequential testing, shorter measurement windows, more aggressive stopping rules. Don't corrupt your existing tests with impatience.

8. Ethics in A/B Testing: Yes, They Exist

Oncology rests on the principle of equipoise—randomization is ethical only when there is genuine uncertainty about which treatment is better. Without equipoise, randomization becomes experimentation without justification.

The Uncomfortable Truth

A/B testing violates equipoise constantly. We knowingly expose some users to experiences we believe are inferior. We test because we have hypotheses, but we randomize even when we're fairly certain which variant will win.

This isn't necessarily unethical, but it requires acknowledgment. We're trading some users' immediate experience for knowledge that benefits future users. That trade is defensible only if we're honest about it and ensure the learning justifies the cost.

Making It Ethical

Experiments must minimize unnecessary harm. If early data shows clear negative impact, stop the test. If the learning won't change decisions, don't run the experiment. If vulnerable users might be harmed, exclude them from exposure.

Ethics isn't about risk magnitude—it's about intentionality and proportionality. A tiny degradation applied carelessly to millions is worse than a large degradation applied thoughtfully to volunteers who can opt out.

9. Build a Culture of Skepticism, Not Hope

Every field that relies on experimentation eventually learns the same truth: most ideas fail. In oncology, Phase I to approval success rates hover around 3-5%. In tech A/B testing, Microsoft found that only one-third of ideas improve metrics at all—and only one-third of those drive meaningful improvement.

The baseline assumption should always be failure.

The Stories We Tell Ourselves

Clinical investigators often explained failures away: wrong dose, wrong population, wrong endpoint, wrong schedule. Just one more trial with one more modification and surely it would work.

Tech teams sing the same song with different lyrics: "Users didn't understand it yet." "Marketing didn't support the test." "The timing was off." "Certain segments loved it—we just need to target better."

These aren't analyses; they're defense mechanisms. Good experimenters start from skepticism and require evidence to move toward belief. Poor experimenters start from hope and explain away evidence that challenges it.

Twyman's Law: A North Star for Experimenters

Any result that looks too good to be true probably is. The 50% conversion lift? Check for a logging error. The massive engagement boost? Look for selection bias. The breakthrough in segment 7? Remember that multiple testing guarantee.

I've learned to celebrate boring results—the 2% improvements, the steady gains, the consistent patterns. They're usually real. The spectacular successes that would make great conference talks? They're usually measurement error wearing a party hat.

Good methods protect you from fooling yourself. Optimism does not.

Final Thoughts: What Makes Experiments Trustworthy

The deepest lesson oncology offers tech is this: Rigor is not a luxury. It is the price of trustworthy learning.

When experimentation cultures mature, they converge on the same principles whether they're testing cancer drugs or checkout flows:

  • Use the right design, not the familiar one
  • Protect against bias—both statistical and human
  • Validate metrics before optimizing them
  • Treat heterogeneity with discipline
  • Root decisions in skepticism, not hope
  • Guard integrity like the fragile resource it is

If oncology can evolve to this standard while dealing with life and death, tech can certainly match it while optimizing conversion funnels.


📬 Want more insights on experimental design across domains? Subscribe to the newsletter or explore the full archive of Evidence in the Wild.