I Asked an LLM to Design My Clinical Trial
I asked an LLM a question that junior biostatisticians ask senior biostatisticians all the time:
Should I use a Bayesian borrowing design for my Phase 2 single-arm oncology trial?
Here's what I got back.
The LLM Response
Should I use a Bayesian borrowing design for my Phase 2 single-arm oncology trial?
It depends on several factors, but Bayesian borrowing can be a powerful approach for single-arm Phase 2 oncology trials when used appropriately. Here's how to think through it:
When Bayesian borrowing makes sense:
- You have high-quality historical control data. If previous trials in the same indication, with similar patient populations and standard-of-care backgrounds, have well-characterized response rates, you can borrow this information to strengthen inference about your control rate rather than assuming a fixed null.
- Sample size is constrained. In rare tumors or specific molecular subsets, enrolling large numbers is impractical. Borrowing external information can increase effective sample size and improve precision without requiring more patients.
- The historical data is relevant. Same line of therapy, similar eligibility criteria, comparable endpoint definitions (ORR by RECIST 1.1, for instance), and no major changes in standard of care that would make historical rates non-transportable.
Common approaches:
- Power priors: Discount historical data by a weight parameter (α) that controls how much borrowing occurs. α = 1 means full borrowing; α = 0 means no borrowing (a code sketch follows this list).
- Meta-analytic predictive (MAP) priors: Use a hierarchical model across historical studies to derive a prior that accounts for between-study heterogeneity.
- Commensurate priors: Adaptively borrow more when current data aligns with historical data, less when there's conflict.
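For concreteness, here is a minimal sketch of that power prior in the conjugate beta-binomial case. All numbers are hypothetical (a made-up historical control of 18/60 and a made-up current trial of 9/25); the point is only to show how α slides between no borrowing and full borrowing:

```python
# Minimal power-prior sketch for a response rate (hypothetical numbers).
# With a Beta initial prior, raising the historical likelihood to the
# power alpha keeps everything conjugate: the discounted history simply
# adds alpha * x_hist successes and alpha * (n_hist - x_hist) failures.
from scipy import stats

a0, b0 = 1.0, 1.0          # vague initial prior on the response rate
x_hist, n_hist = 18, 60    # hypothetical historical data: 30% ORR
x_cur, n_cur = 9, 25       # hypothetical current trial: 36% ORR

for alpha in (0.0, 0.5, 1.0):   # 0 = no borrowing, 1 = full borrowing
    a = a0 + alpha * x_hist + x_cur
    b = b0 + alpha * (n_hist - x_hist) + (n_cur - x_cur)
    post = stats.beta(a, b)
    lo, hi = post.interval(0.95)
    print(f"alpha={alpha:.1f}: posterior mean {post.mean():.3f}, "
          f"95% CrI ({lo:.3f}, {hi:.3f}), "
          f"historical ESS ~ {alpha * n_hist:.0f} patients")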
Key considerations:
- Prior-data conflict. If your trial population differs from historical controls in unmeasured ways, borrowing can bias your results. Dynamic borrowing methods (commensurate priors, robust mixture priors) help but don't eliminate this risk.
- Regulatory acceptance. FDA has shown increasing openness to Bayesian methods, especially in oncology and rare diseases. The 2019 draft guidance on complex innovative trial designs encourages their use. However, you'll need to pre-specify your borrowing mechanism, justify your historical data sources, and often demonstrate operating characteristics via simulation.
- Type I error control. Borrowing inflates Type I error if historical controls are optimistic (lower response rate than truth). You'll need to show through simulation that your design controls false positive rates under realistic scenarios, including prior-data conflict.
- Transparency. Pre-specify everything: which historical studies, what borrowing mechanism, what discount factors or hyperparameters. Post-hoc decisions about borrowing are a red flag.
My recommendation:
If you have relevant, high-quality historical data and are working in a setting where sample size efficiency matters (rare tumor, accelerated development), Bayesian borrowing is worth serious consideration. Use a dynamic borrowing approach to protect against prior-data conflict, simulate extensively under both favorable and adversarial scenarios, and engage with FDA early (Type B meeting) to align on the approach.
If your historical data is sparse, heterogeneous, or from a different treatment landscape, the risks likely outweigh the benefits. A straightforward frequentist design with a well-justified null hypothesis may be more defensible.
The Dissection
This answer is technically correct. It would pass peer review. It sounds like expertise.
It isn't.
"It depends" — the beige cardigan of openers. The kind of bold commitment I'd expect from someone picking a Netflix show with their parents in the room.
"When Bayesian borrowing makes sense" — Three scenarios that describe literally every case where Bayesian borrowing has ever been used. "Sample size is constrained." Sure, and water is wet.
Here's what's actually useful: If you're shoehorning 20 patients into a single-arm design because you couldn't fund a real trial, borrowing is your only friend—but it's a fake friend who might betray you.
"Common approaches" — Power priors, MAP priors, commensurate priors. Rattled off like a textbook trying to remember its own glossary. Where's the judgment? When should you absolutely not use a MAP prior? The answer doesn't say.
"Prior-data conflict" — This is the most important issue in the entire response, and it's treated like an unfortunate scheduling conflict.
Prior-data conflict isn't a minor inconvenience. It's a statistical identity crisis. When your prior and data disagree, you're modeling a lie and praying for convergence. Saying that dynamic borrowing methods "help but don't eliminate this risk" is the understatement of the decade. In practice, dynamic borrowing often does almost nothing—you get the complexity without the protection.
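You can watch this happen with a toy robust mixture prior. The "dynamic" part is just Bayes updating the mixture weight by each component's marginal likelihood, and with 25 patients the data has very little power to demote the historical component. All numbers here are hypothetical:

```python
# Toy robust mixture prior (hypothetical numbers): how much does the
# "dynamic" weight actually react to prior-data conflict?
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_lik(x, n, a, b):
    # Beta-binomial marginal likelihood of x responders in n patients
    # under a Beta(a, b) prior on the response rate
    return comb(n, x) * exp(log_beta(a + x, b + n - x) - log_beta(a, b))

# 80% weight on an informative Beta(18, 42) built from hypothetical
# historical controls (~30% ORR, ESS ~60), 20% on a vague Beta(1, 1).
w = 0.8
a_inf, b_inf = 18.0, 42.0
a_vag, b_vag = 1.0, 1.0

n_cur = 25
for x_cur, label in [(8, "agreement, 32% observed"),
                     (12, "moderate conflict, 48% observed"),
                     (14, "strong conflict, 56% observed")]:
    m_inf = marginal_lik(x_cur, n_cur, a_inf, b_inf)
    m_vag = marginal_lik(x_cur, n_cur, a_vag, b_vag)
    # Bayes updates the mixture weight via each component's marginal likelihood
    w_post = w * m_inf / (w * m_inf + (1 - w) * m_vag)
    print(f"{label}: weight on historical component {w:.2f} -> {w_post:.2f}")
```

In this toy example the weight goes from 0.80 to roughly 0.79 under moderate conflict: an observed ORR eighteen points above the historical rate, and the prior barely shrugs. Even under strong conflict, about half the weight stays on the wrong history.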
"Engage with FDA early" — Correct but hollow. What actually happens? FDA asks: show me your operating characteristics under prior-data conflict. If you only simulated the happy path, you're not prepared. You're presenting a design that works when it doesn't need to and fails when it matters.
"A straightforward frequentist design may be more defensible" — The LLM giving up on its own argument. Why write 800 words on Bayesian borrowing to end with "Or don't. Whatever."
The Point
The response reads like expertise. Right vocabulary. Right guidance documents. Right hedges. A sponsor reviewing an LLM-drafted protocol might not see anything wrong.
An experienced statistician would.
The difference isn't knowledge—it's judgment. Knowing which risks are real versus theoretical. Which methods work in practice versus in papers. That judgment comes from failure: simulations that behaved badly, trials that surprised you, regulators who pushed back. It's not stored in embeddings. It's earned.
LLMs are excellent at turning established reasoning into clear language. They are poor at generating the reasoning itself. The risk isn't that they'll replace statisticians. The risk is that they'll produce plausible-sounding designs that no one stress-tests because the prose feels authoritative.
If you work at the intersection of AI and clinical research, the most valuable skill is not prompt engineering. It is knowing which parts of the problem should never be automated.
📬 For more essays on experimental design, inference, and judgment in real-world trials, explore the Evidence in the Wild archive or subscribe to the newsletter.