
What A/B Testing Teams Can Learn from 40 Years of Oncology Trial Mistakes

Two worlds with remarkably similar challenges—and strikingly similar failures.

Why Tech Should Pay Attention to Cancer Trials

Most product teams believe A/B testing is a modern invention—lean, data-driven, and efficient. But if you step back, experimentation in tech today looks remarkably similar to oncology trials in the 1980s and 1990s: well-intentioned, high-effort, and full of blind spots that took decades to recognize.

The difference is that oncology has spent those decades learning—in painful, expensive, deeply human ways—what happens when experiments are poorly designed. The field transformed after spectacular failures in the early 2000s forced a reckoning with methodological complacency. Tech teams now have the rare privilege of learning those lessons without repeating the same mistakes.

This article walks through the parallels: not as a dry methodology tour, but as the story of how two experimental cultures matured, and how one can borrow wisdom from the other.

1. When Familiar Designs Fool Us: The 3+3 Story

In early cancer trials, clinicians defaulted to a simple dose-finding algorithm, the 3+3 design. It looked clean. It felt objective. It produced neatly structured cohorts of three patients at each dose level, escalating when toxicity stayed low.

But as statisticians later demonstrated, this design was quietly betraying the trials built upon it. Its dose estimates varied wildly. The same drug tested twice could yield Maximum Tolerated Doses two levels apart. Its operating characteristics were abysmal, correctly identifying the optimal dose only 30-40% of the time. Its scientific justification? Essentially nonexistent. It persisted through tradition, not merit.

What survived wasn't the method—it was the comfort of familiarity. Today, the 3+3 design is considered "practically unethical" in many contexts, replaced by model-based approaches like the Continual Reassessment Method that identify optimal doses 20-30% more accurately while exposing fewer patients to subtherapeutic or toxic doses.

The A/B Testing Parallel

Something similar happens every time a tech team defaults to "simple" without questioning whether simple is sufficient. That basic two-variant test with fixed duration and single-point analysis? It's the digital equivalent of the 3+3—familiar, traditional, and often wrong.

The simplest design is not always the most trustworthy. Just as oncology moved to model-based designs that better capture biological reality, modern A/B testing requires designs that match the complexity of the decisions they support. Sequential testing, variance reduction, and stratification aren't complications—they're corrections for the oversimplifications that make traditional tests fail.
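
To make one of those corrections concrete, here is a minimal sketch of CUPED-style variance reduction, which adjusts each user's outcome with a pre-experiment covariate so the same test can detect smaller effects. The data and names are hypothetical; treat it as an illustration of the idea, not a production implementation.

```python
import numpy as np

def cuped_adjust(y, x):
    """Adjust outcome y using a pre-experiment covariate x (e.g., each
    user's activity before assignment). The adjusted metric keeps the
    same mean but has lower variance, so effects are easier to detect."""
    theta = np.cov(y, x)[0, 1] / np.var(x)
    return y - theta * (x - np.mean(x))

# Hypothetical data: pre-period activity and a correlated post-period outcome.
rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=10_000)
post = 0.8 * pre + rng.normal(2, 2, size=10_000)

adjusted = cuped_adjust(post, pre)
print(round(np.var(post), 2), round(np.var(adjusted), 2))  # variance drops sharply
```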

2. The Intention-to-Treat Principle: A Guardrail Against Wishful Thinking

In clinical trials, the Intention-to-Treat (ITT) principle protects us from our own storytelling instincts. It forces us to analyze people as they were assigned—not as we wish they had behaved.

ITT prevents the seductive temptation to say: "Well, the treatment worked great... among the people who actually followed it." This seemingly reasonable statement hides a fatal flaw: the ability to follow treatment is itself often an outcome. The patients who can't tolerate a drug, who drop out from side effects, who never start because of contraindications—they're all part of the treatment's real-world effect.

The A/B Testing Version

Tech teams unknowingly violate ITT constantly. I've seen teams exclude users who didn't engage with the feature, analyze only "active" users who clicked through, or filter out people who bounced in under 10 seconds. Each exclusion seems reasonable in isolation. Together, they create a fantasy dataset.

But engagement is part of the outcome. A feature that users ignore has failed, regardless of how well it performs for the handful who explore it fully. Filtering based on post-treatment behavior creates a tautology: the feature works best among the people for whom it worked.

When you measure only the users who loved your feature enough to engage deeply, you're not measuring your feature—you're measuring the intersection of your feature and pre-existing user enthusiasm. That's not an experiment; it's confirmation bias with a p-value.
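
A minimal sketch of the difference, assuming a hypothetical experiment export with one row per assigned user, a variant label, an engagement flag, and a conversion outcome (all names invented for illustration):

```python
import pandas as pd

# Hypothetical export: one row per *assigned* user.
df = pd.DataFrame({
    "variant":   ["control", "treatment"] * 3,
    "engaged":   [True, True, False, False, True, False],
    "converted": [1, 1, 0, 0, 1, 0],
})

# Intention-to-treat: analyze everyone as assigned, engaged or not.
itt = df.groupby("variant")["converted"].mean()

# The tempting (and biased) cut: only users who engaged with the feature.
# Engagement happens *after* assignment, so conditioning on it breaks
# randomization and measures "the feature among people it worked for".
engaged_only = df[df["engaged"]].groupby("variant")["converted"].mean()

print(itt)           # the assigned population
print(engaged_only)  # the filtered cut can tell a very different story
```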

3. Platform Integrity: When Tiny Cracks Break the Whole Foundation

The Poisson/NSABP misconduct case rocked oncology in 1994—not because the errors were scientifically catastrophic, but because they tore at the very foundation of trust. A single investigator had falsified patient eligibility data in less than 1% of cases across the group's breast cancer trials. Reanalyses excluding his patients left the conclusions intact; the scientific impact was negligible. The reputational damage was nuclear.

Congressional hearings followed. Trials were suspended. Public trust evaporated. Years of work unraveled because one small crack suggested the entire foundation might be rotten.

In A/B Testing

You experience this when you discover a Sample Ratio Mismatch (SRM)—that moment when your 50/50 split comes back 48/52. It seems trivial, just 2% off. But that 2% signals something fundamental is broken: your randomization, your logging, your entire experimental apparatus.
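
A rough sketch of the standard check: a chi-square goodness-of-fit test against the intended allocation. The counts below are invented, but they show why a 48/52 split at realistic traffic volumes is anything but trivial.

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for an intended 50/50 split.
observed = [48_400, 51_600]           # users logged per variant
expected = [sum(observed) / 2] * 2    # what a true 50/50 split would give

stat, p_value = chisquare(observed, f_exp=expected)

# A tiny p-value (teams often use 0.001 or stricter as the alarm threshold)
# means the mismatch is very unlikely to be chance: treat the experiment,
# and possibly the platform, as suspect until the cause is found.
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
```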

I've watched teams dismiss SRM as "probably fine" only to discover weeks later that a deployment bug invalidated months of experiments. These issues compound silently—a 2% mismatch here, a missing event type there, an edge case in your assignment service—until one day a sharp-eyed analyst asks, "Can we trust any of our experiments?"

The answer, once doubt creeps in, is usually no. Every experiment becomes questionable. Every past decision needs re-examination. Trust in the platform, once shaken, takes years to rebuild. Better to treat every tiny crack as an emergency than to discover your foundation was rotten all along.

4. The Myth of the Promising Early Signal

The MMP inhibitor saga reads like a cautionary tale of collective delusion. Matrix metalloproteinases were going to revolutionize cancer treatment—the preclinical data was overwhelming. Tumor shrinkage in mice. Metastasis prevention in animal models. Mechanism of action that made perfect biological sense.

Between 1998 and 2004, pharmaceutical companies poured billions into development. Twenty-plus Phase III trials enrolled tens of thousands of patients. The early signals weren't just promising—they were spectacular.

Not a single trial succeeded.

The problem wasn't execution. It was that the early signals were systematically misleading, conflating biological activity (yes, MMP inhibitors affected tumors) with clinical benefit (no, they didn't help patients live longer or better). Companies kept explaining away failures: wrong dose, wrong cancer type, wrong patient population. The real explanation was simpler: the signal was never real.

A/B Testing's Version of the Same Mistake

Every quarter, I see teams chase their own MMP inhibitors. A feature shows a "soft positive trend" in the first week—maybe a 3% lift that's not quite significant. By week two, someone's found a segment where it's winning by 8%. By week three, there's a PowerPoint deck explaining why this is actually a breakthrough that just needs better targeting.

Six months and three iterations later, the feature is deprecated. The early signal wasn't weak—it was random noise wearing a costume. But once a team falls in love with a "promising" result, they'll find ways to keep the hope alive. Wrong timing, wrong messaging, wrong audience—never wrong idea.

The discipline to kill ideas based on weak signals is what separates mature experimentation cultures from those forever chasing shadows. In my experience, if an effect isn't clear and convincing in your primary analysis, it's probably not there at all.

5. Surrogate Metrics: When the Easy-to-Measure Betrays the Important

Oncology loves surrogate outcomes. Tumor shrinkage photographs well. Biomarker changes graph beautifully. Response rates make compelling PowerPoints. They're fast, objective, and scientific-looking.

And far too often, they mean absolutely nothing for what patients actually care about: survival, quality of life, functional recovery.

I witnessed this firsthand in the EDI study when our biomarker panel showed dramatic changes at 3 months—changes that meant absolutely nothing for 5-year survival. The proteins danced, the tumors shrank, the patients died on schedule. The surrogates had told us a story, but it was fiction.

In Tech

Spotify optimizing for listening minutes while user satisfaction plummets. Instagram maximizing story views while meaningful connections decay. Twitter driving "engagement" while discourse becomes increasingly toxic. These aren't hypothetical—they're documented cases where surrogate metrics led product teams off a cliff.

The problem isn't that click-through rates or session duration are bad metrics. It's that they're easy metrics, and easy metrics become targets, and targets reshape entire products around their optimization. You end up with a perfectly optimized system that's solving the wrong problem.

Short-term metrics behave like funhouse mirrors—they show you a distorted version of reality that becomes more distorted the harder you stare. The only protection is ruthless validation: does optimizing this metric actually drive the outcome we care about? If you haven't proven that connection empirically, you're navigating by stars that might not exist.

6. Heterogeneity: When Not Everyone Should Be Treated the Same

Biology is heterogeneous. The same drug that saves one patient kills another, depending on their genetics, tumor markers, organ function. What works for BRCA-positive breast cancer fails for triple-negative. Oncology handles this through enrichment designs—targeting the population where the signal is clearest, where the biology suggests benefit is most likely.

A/B Testing's Equivalent

Users differ in ways that matter. Platform differences aren't just technical details—iOS users spend differently than Android users. Tenure isn't just a number—new users explore while veterans optimize. Intent isn't just behavior—browsers shop while buyers purchase.

Recognizing heterogeneity is good science. But discovering heterogeneity post-hoc is dangerous science.

Why Subgroup Hunting Misleads Teams

Here's the calculation most teams never do: With 10 segments tested at α=0.05, the probability of at least one false positive isn't 5%—it's 1-(0.95)^10 = 40%. This isn't a subtle effect; it's a massive distortion of evidence.

I've watched teams discover "breakthrough insights" in segment 7 of 10, never realizing they've essentially guaranteed finding something through sheer mathematical inevitability. The feature that "works great for new Android users in Canada" probably doesn't work at all—you just rolled the dice enough times to get snake eyes.
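
The arithmetic, plus the simplest guardrail (a Bonferroni-style correction), fits in a few lines. The segment p-values below are invented for illustration:

```python
# Family-wise error rate across k independent segment tests at alpha = 0.05.
alpha, k = 0.05, 10
fwer = 1 - (1 - alpha) ** k
print(f"P(at least one false positive across {k} segments) = {fwer:.0%}")  # ~40%

# Bonferroni correction: only call a segment a win if p < alpha / k.
segment_p_values = [0.31, 0.04, 0.22, 0.008, 0.47, 0.09, 0.03, 0.61, 0.18, 0.26]
threshold = alpha / k  # 0.005
survivors = [i for i, p in enumerate(segment_p_values) if p < threshold]
print(f"Segments surviving correction: {survivors}")
# Here none survive: the 0.008, 0.03, and 0.04 "wins" are exactly the kind
# of noise the uncorrected analysis would have celebrated.
```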

The protection is pre-specification. Decide your segments before you see data. Power your test for the smallest segment you care about. And when you find that surprising subgroup effect? Treat it as a hypothesis for the next experiment, not a conclusion from this one.

7. The Temptation of Peeking

Oncology learned early that unrestricted interim looks create chaos. Every peek at accumulating data without statistical correction dramatically inflates false positive rates. A trial designed for 5% false positives can jump to 20% or more with repeated looking.

The Daily Dashboard Disease

Tech teams peek constantly. The morning metrics review. The afternoon dashboard refresh. The executive who wants daily updates on the test that's "looking good."

Each uncorrected peek adds another chance to be fooled. A team checking daily for two weeks and acting on the first significant reading turns a nominal 5% false positive rate into something closer to 20-25%, and it keeps climbing the longer the watching continues. It's not a bug; it's basic probability theory. A random walk will eventually cross any fixed threshold if you watch it long enough.
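
A small simulation makes that concrete: even in an A/A comparison with no true difference, checking daily and acting on the first reading that crosses z = 1.96 produces a false positive far more often than the nominal 5%. The traffic numbers and conversion rate below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def peeked_false_positive_rate(n_sims=2_000, days=14, users_per_day=1_000):
    """Simulate an A/A test (no true difference) where a two-proportion
    z-test is run on the cumulative data every day and the experiment is
    declared 'significant' the first time |z| > 1.96."""
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            conv_a += rng.binomial(users_per_day, 0.10)  # control conversions
            conv_b += rng.binomial(users_per_day, 0.10)  # "treatment" conversions
            n_a += users_per_day
            n_b += users_per_day
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if abs(z) > 1.96:        # nominal 5% two-sided threshold
                false_positives += 1
                break                 # the team "ships" or "kills" right here
    return false_positives / n_sims

print(peeked_false_positive_rate())  # typically ~0.20, not 0.05
```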

Every peek without correction is a step away from statistical truth. The irony is that teams peek because they want faster decisions, but peeking makes decisions less trustworthy, requiring longer tests to restore confidence. If you want faster decisions, use faster designs—sequential testing, shorter measurement windows, more aggressive stopping rules. Don't corrupt your existing tests with impatience.

8. Ethics in A/B Testing: Yes, They Exist

Oncology rests on the principle of equipoise—randomization is ethical only when there is genuine uncertainty about which treatment is better. Without equipoise, randomization becomes experimentation without justification.

The Uncomfortable Truth

A/B testing violates equipoise constantly. We knowingly expose some users to experiences we believe are inferior. We test because we have hypotheses, but we randomize even when we're fairly certain which variant will win.

This isn't necessarily unethical, but it requires acknowledgment. We're trading some users' immediate experience for knowledge that benefits future users. That trade is defensible only if we're honest about it and ensure the learning justifies the cost.

Making It Ethical

Experiments must minimize unnecessary harm. If early data shows clear negative impact, stop the test. If the learning won't change decisions, don't run the experiment. If vulnerable users might be harmed, exclude them from exposure.

Ethics isn't about risk magnitude—it's about intentionality and proportionality. A tiny degradation applied carelessly to millions is worse than a large degradation applied thoughtfully to volunteers who can opt out.

9. Build a Culture of Skepticism, Not Hope

Every field that relies on experimentation eventually learns the same truth: most ideas fail. In oncology, Phase I to approval success rates hover around 3-5%. In tech A/B testing, Microsoft found that only one-third of ideas improve metrics at all—and only one-third of those drive meaningful improvement.

The baseline assumption should always be failure.

The Stories We Tell Ourselves

Clinical investigators often explained failures away: wrong dose, wrong population, wrong endpoint, wrong schedule. Just one more trial with one more modification and surely it would work.

Tech teams sing the same song with different lyrics: "Users didn't understand it yet." "Marketing didn't support the test." "The timing was off." "Certain segments loved it—we just need to target better."

These aren't analyses; they're defense mechanisms. Good experimenters start from skepticism and require evidence to move toward belief. Poor experimenters start from hope and explain away evidence that challenges it.

Twyman's Law: A North Star for Experimenters

Any result that looks too good to be true probably is. The 50% conversion lift? Check for a logging error. The massive engagement boost? Look for selection bias. The breakthrough in segment 7? Remember that multiple testing guarantee.

I've learned to celebrate boring results—the 2% improvements, the steady gains, the consistent patterns. They're usually real. The spectacular successes that would make great conference talks? They're usually measurement error wearing a party hat.

Good methods protect you from fooling yourself. Optimism does not.

Final Thoughts: What Makes Experiments Trustworthy

The deepest lesson oncology offers tech is this: Rigor is not a luxury. It is the price of trustworthy learning.

When experimentation cultures mature, they converge on the same principles whether they're testing cancer drugs or checkout flows:

  • Use the right design, not the familiar one
  • Protect against bias—both statistical and human
  • Validate metrics before optimizing them
  • Treat heterogeneity with discipline
  • Root decisions in skepticism, not hope
  • Guard integrity like the fragile resource it is

If oncology can evolve to this standard while dealing with life and death, tech can certainly match it while optimizing conversion funnels.


📬 Want more insights on experimental design across domains? Subscribe to the newsletter or explore the full archive of Evidence in the Wild.