When the AI Statistician Gets It Right (and Why That's the Dangerous Part)
A few months ago, I asked an LLM to design my clinical trial. It recommended a Bayesian borrowing design for a Phase 2 single-arm oncology trial. The response was fluent, well-organized, and cited the right methods. It was also wrong—in ways that would survive a casual review.
I wrote that experience up at the time. That post was about what a general-purpose LLM gets wrong when you ask it a statistical question. This post is about something more consequential: what happens when purpose-built AI tools get it mostly right.
There are now companies selling AI-powered clinical trial design platforms to pharma sponsors. These are not ChatGPT wrappers. They combine LLMs with statistical engines, curated trial databases, and R-based simulation pipelines.
PIs can describe a trial in natural language and receive a complete design: sample size, randomization, endpoint strategy, analysis plan, even draft protocol language.
I've seen demos. In one review, the tool correctly identified a MAP prior as appropriate for a Phase 2 single-arm oncology trial, generated an R-based simulation pipeline, and produced posterior summaries that looked exactly right. The historical data it drew from predated a significant standard-of-care shift in the indication by 18 months. Nothing in the output flagged this.
The output is impressive. That is precisely the problem.
What they get right
Credit where it matters. These tools handle a category of biostatistics work that is genuinely routine.
Standard sample size calculations for well-characterized designs. Literature synthesis across published trials in a given indication. Correct identification of appropriate statistical methods given a set of inputs. R code that runs, produces interpretable output, and follows reasonable programming conventions.
For a PI planning a straightforward Phase 3 superiority trial with a well-established primary endpoint, the AI can generate a design that a competent biostatistician would broadly agree with. The methods are correct. The code works. The regulatory citations are real.
This is not trivial. A substantial portion of what junior biostatisticians spend their first years doing is exactly this kind of work. This is not low-skill work. It is standardized work. If an AI can do this reliably, the efficiency gain is real—and denying it helps no one.
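To make the standardized part concrete, here is the kind of calculation these platforms automate. A minimal sketch in base R, with placeholder response rates and a placeholder dropout assumption rather than recommendations for any real trial:

```r
# Two-arm superiority trial, binary primary endpoint (illustrative rates only):
# control response 40%, experimental 55%, two-sided alpha 0.05, 90% power.
ss <- power.prop.test(p1 = 0.40, p2 = 0.55, sig.level = 0.05, power = 0.90)
n_per_arm <- ceiling(ss$n)

# Inflate for an assumed 10% dropout rate.
dropout <- 0.10
n_enrolled_per_arm <- ceiling(n_per_arm / (1 - dropout))

c(per_arm_evaluable = n_per_arm, per_arm_enrolled = n_enrolled_per_arm)
```

A competent reviewer would not quarrel with this arithmetic. The questions worth asking are about the 40% and 55% assumptions, not the output.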
When correct designs fail
The failures I worry about are not the obvious ones. An LLM that hallucinates a formula is easy to catch. What's harder to catch is a design that is technically correct and contextually wrong.
Four categories.
Regulatory context is not static. An AI tool trained on published guidance and historical submissions will recommend what has been accepted before. But what FDA accepts shifts. The January 2026 Bayesian guidance changed the calculus for sponsors using Bayesian methods in confirmatory trials. It introduced pre-specification requirements for prior–data conflict testing and signaled that calibrated hybrid designs must demonstrate composed operating characteristics—not just component-level ones.
These are not minor updates. They change the analysis plan for any sponsor using MAP or power priors in a confirmatory setting. An AI tool trained before January 2026 will not recommend pre-specifying a conflict test. It will not flag that the gap between component and composed Type I error can exceed ten percentage points under realistic scenarios. It will recommend what has been accepted before—which is no longer sufficient.
The accelerated approval landscape has also tightened since 2022, with FDA now expecting confirmatory trials to be enrolled, not merely planned, at the time of filing. A design that would have been appropriate in 2023 may be insufficient in 2026.
The AI doesn't know any of this unless someone updates its training data—and the lag between regulatory change and model updates is where trials fail.
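None of this requires exotic machinery on the sponsor side. One simple form of the conflict test the guidance expects sponsors to pre-specify is a prior predictive tail-area check: ask how surprising the new trial's result is under the borrowing prior. A minimal sketch for a binary endpoint, with a hypothetical Beta prior and an illustrative flagging threshold (not values from any guidance):

```r
# Prior-data conflict check via the prior predictive distribution (sketch).
# Prior: Beta(a0, b0) summarizing the discounted historical information.
# Observed: y responses out of n in the new trial.
prior_predictive_conflict <- function(y, n, a0, b0, threshold = 0.05) {
  # Beta-binomial pmf of the prior predictive distribution over 0..n responses
  k <- 0:n
  log_pmf <- lchoose(n, k) + lbeta(k + a0, n - k + b0) - lbeta(a0, b0)
  pmf <- exp(log_pmf)
  # Two-sided tail probability of a result at least as extreme as observed
  p_tail <- 2 * min(sum(pmf[k <= y]), sum(pmf[k >= y]))
  list(tail_prob = min(p_tail, 1), conflict = p_tail < threshold)
}

# Illustrative values: prior centered near 20% (ESS ~140), new trial sees 6/50.
prior_predictive_conflict(y = 6, n = 50, a0 = 28, b0 = 112)
```

A check like this does not fix a stale prior, but a pre-specified version of it at least forces the conflict into view before the posterior is reported.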
The tools themselves may not meet FDA's own standards for AI use. This is a different problem from lag. FDA's January 2025 draft guidance on AI for regulatory decision-making introduces a credibility assessment framework for AI models used to produce information supporting regulatory submissions—and it explicitly scopes in tools that affect the reliability of clinical trial results. The framework defines model risk as the intersection of model influence and decision consequence. For a clinical trial design platform, both are high: the output is the design, and a wrong design enrolls patients in a trial that will fail. Under the guidance's risk matrix, these tools sit in the highest-risk category, requiring documentation of training data provenance, performance metrics with confidence intervals, and lifecycle maintenance plans that account for data drift.
Most sponsors using these platforms have none of this. They are deploying high-model-risk AI without the credibility infrastructure the guidance recommends—and without recognizing that the gap between the tool's training data and the current regulatory environment is itself a form of data drift the guidance explicitly flags.
Then, in December 2025, FDA announced the deployment of agentic AI capabilities to all agency staff, including for pre-market reviews and review validation. The sponsor-side tools have no visibility into how an agentic review workflow will process a MAP prior justification that was itself generated by an AI design platform. The regulatory context these tools train on reflects a static review process. That process is no longer static.
Tracking all of this is knowledge held by biostatisticians who attend ODAC meetings, read complete review packages, and have pre-IND conversations with specific review divisions. It is not stored in any database. It updates continuously. And it is the difference between a design that looks regulatory-ready and one that actually is.
Design–inference coherence requires judgment, not pattern matching. I wrote in What Randomization Can't Fix about the gap between getting the design right and producing trustworthy evidence. An AI tool can select an appropriate randomization scheme, specify a reasonable primary endpoint, and generate a valid sample size. What it cannot do is evaluate whether the analysis plan answers the question the protocol claims to ask.
Does the estimand align with the clinical question? Is the multiplicity strategy consistent with the endpoint hierarchy, or bolted on after the fact? Does the interim analysis plan reflect the operational realities of this trial—or a generic information-fraction schedule copied from prior examples?
These are not computational questions. They are judgment calls that depend on therapeutic context, regulatory strategy, and downstream consequences. The SGLT2 inhibitor cardiovascular outcomes trials got their hierarchical testing strategies right not because someone ran a calculator, but because the biostatisticians understood what was at stake.
Assumption stress-testing is where experience lives. Every clinical trial design rests on assumptions: control rate, dropout, treatment effect, enrollment trajectory. A good biostatistician does not just specify these assumptions—they interrogate them.
An AI tool uses the assumptions you give it. A senior biostatistician challenges them.
Here is a concrete example. Suppose the AI recommends a single-arm borrowing design using a 70% power prior discount because five historical trials show stable response rates between 18% and 22%. Technically coherent. But those controls predate a new standard of care introduced eighteen months ago. The true rate may now be 12%, not 20%.
The historical pool contributes an effective prior sample size of roughly 140 patients centered at 20%. The new trial enrolls 50 patients and observes 6 responses, consistent with a 12% rate. Borrowing pulls the posterior mean up to about 18%, and the posterior probability of exceeding a 15% activity threshold is roughly 85%. Under a minimally informative prior, it is about 34%.
The design declares a signal. The drug probably doesn't work.
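The arithmetic behind that verdict fits in a few lines. A minimal sketch, assuming the discounted historical information is summarized as a Beta prior with an effective sample size of 140 centered at 20%; the exact figures shift with how the discount is parameterized:

```r
# Single-arm Bayesian borrowing, binary endpoint (illustrative parameterization).
# Discounted historical information summarized as Beta(ESS * mean, ESS * (1 - mean)).
ess        <- 140
prior_mean <- 0.20
a0 <- ess * prior_mean         # 28
b0 <- ess * (1 - prior_mean)   # 112

# New trial: 6 responses in 50 patients (true rate ~12% under the new standard of care).
y <- 6
n <- 50

# Posterior with borrowing vs. a minimally informative Beta(1, 1) prior.
post_borrow <- c(a0 + y, b0 + n - y)   # Beta(34, 156)
post_vague  <- c(1 + y, 1 + n - y)     # Beta(7, 45)

posterior_mean <- function(p) p[1] / sum(p)
prob_above     <- function(p, t) pbeta(t, p[1], p[2], lower.tail = FALSE)

round(c(mean_borrow    = posterior_mean(post_borrow),      # ~0.18
        mean_vague     = posterior_mean(post_vague),       # ~0.13
        p_gt_15_borrow = prob_above(post_borrow, 0.15),    # ~0.85
        p_gt_15_vague  = prob_above(post_vague, 0.15)),    # ~0.34
      2)
```

Nothing in the computation is wrong. The prior is.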
This is not a theoretical failure mode. It is the kind of error that survives internal review and fails only when exposed to reality.
No simulation engine catches this. A biostatistician who has worked in the indication does.
What this means for the field
The biostatistician's job is changing. The question is what it is changing into.
Work that translates a well-defined statistical question into code and standard calculations is being automated. This is real, and it is happening now.
But the work that evaluates whether the right question was asked, whether the design will survive regulatory review, and whether the assumptions will hold under real-world conditions is not being automated. It is becoming more important.
When a PI walks into a meeting with an AI-generated design that looks complete and professional, someone needs to be in the room who can identify what is missing. That person needs to have seen enough trials to know where they break.
The risk is not that AI replaces biostatisticians. The risk is that AI produces work that looks like it doesn't need one—and sponsors believe it.
A design that passes surface-level review but fails at FDA is more expensive than no design at all. Because failure at FDA is not just statistical. It is a capital event. By the time you discover the problem, you have enrolled patients, spent budget, and lost time that error asymmetry tells us is not recoverable.
I don't know exactly where the equilibrium settles. I do know this: the value of a biostatistician is migrating from generating designs to stress-testing and defending them. From producing the SAP to knowing whether it will work.
The future is not AI replacing biostatisticians. It is biostatisticians who can interrogate AI outputs replacing those who cannot.
The tools will keep getting better. The judgment required to use them will not get easier.
The first post I wrote about this ended with a line I still believe: the risk is not replacement. It is plausible-sounding work that no one stress-tests.
And that is exactly what makes it dangerous.
Many evidentiary problems appear during trial design, not after analysis. I work with teams to review trial designs and run simulation studies to evaluate operating characteristics before protocols are finalized.
For consulting inquiries: maggie@zetyra.com
For more essays on statistical design and regulatory evidence, subscribe to the Evidence in the Wild newsletter.