
What I Submitted to FDA on the Bayesian Guidance

My public comments on Docket No. FDA-2025-D-3217, "Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products."

The FDA's January 2026 draft guidance on Bayesian methodology in pivotal trials is the most significant regulatory document on this topic in a decade. I've been writing about it since the day it dropped: analyzing its philosophical tensions, stress-testing its recommendations against real FDA submissions, and building tools to implement what it asks sponsors to do.

The comment period closes March 13, 2026. Here's what I submitted.


Comment 1: Name the hybrid framework practitioners actually use

The guidance presents two approaches to Bayesian success criteria. Framework 1 calibrates the posterior probability threshold to a one-sided Type I error rate of 0.025. Framework 2 interprets the posterior directly, without frequentist calibration.

In practice, most successful Bayesian submissions do both. The sponsor calibrates the threshold, satisfying the frequentist operating characteristic requirement, and then reports the posterior probability itself to communicate evidence strength to the clinical team, advisory committee, and prescribers. This isn't Framework 1 or Framework 2. It's a hybrid: frequentist in its error-rate discipline, Bayesian in its inferential output. And most reviewers already behave as if this hybrid exists.

REBYOTA illustrates this. The trial used Pr(δ > 0) > 0.975, calibrated through simulation to approximately α = 0.025. The reported result, a posterior probability of 0.991, communicated a Bayesian quantity. The FDA reviewed both the frequentist operating characteristics and the posterior. Bayesian machinery driving frequentist conclusions, with the posterior serving as both a calibrated decision rule and an interpretable evidence summary.

My recommendation: The final guidance should acknowledge this hybrid approach explicitly, either as a third framework or as a recognized variant of Framework 1. Naming what practitioners already do reduces ambiguity for sponsors who currently navigate the guidance's two-framework structure without clear indication that their strategy is recognized.
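The calibration half of this hybrid can be sketched in a few lines. This is a toy two-arm beta-binomial design with made-up parameters (n = 60 per arm, a 60% null response rate, flat Beta(1, 1) priors), not the REBYOTA model; the point is only the mechanics of checking that a 0.975 posterior threshold delivers a one-sided Type I error near 0.025.

```python
import numpy as np

rng = np.random.default_rng(7)

def trial_rejects(n=60, p_null=0.6, threshold=0.975, draws=4000):
    # One simulated trial under the null: both arms share the same rate.
    x_t = rng.binomial(n, p_null)
    x_c = rng.binomial(n, p_null)
    # Flat Beta(1, 1) priors on each arm's response rate.
    post_t = rng.beta(1 + x_t, 1 + n - x_t, draws)
    post_c = rng.beta(1 + x_c, 1 + n - x_c, draws)
    # Success criterion: Pr(delta > 0 | data) > threshold.
    return np.mean(post_t > post_c) > threshold

n_sims = 2000
type1 = sum(trial_rejects() for _ in range(n_sims)) / n_sims
print(f"simulated one-sided Type I error: {type1:.3f}")
```

With flat priors the posterior probability roughly tracks a one-sided p-value, so the simulated error rate should land near 0.025; in a real submission the sponsor would tune the threshold (or the design) until the simulated rate meets the requirement, then report the observed posterior probability alongside it.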


Comment 2: Tell sponsors what to pre-specify about their response to prior-data conflict

The guidance requires sponsors to pre-specify priors and assess prior-data conflict. It describes several dynamic discounting methods (power priors, commensurate priors, mixture priors, elastic priors, hierarchical models) that adjust borrowing based on consistency between external and trial data.

But the guidance doesn't specify what sponsors should pre-specify about what happens when the prior fails.

Consider a sponsor who pre-specifies a dynamic borrowing prior with a conflict detection threshold, say, a posterior predictive tail probability or goodness-of-fit statistic at the 0.10 level. The protocol specifies the prior, the borrowing model, and the threshold, exactly as the guidance recommends. The trial runs. The observed control-arm rate is 21%, and the historical rate was 15%. The conflict statistic is 0.11, just above the threshold. Borrowing is reduced. But the degree of reduction, the fallback prior, and the downstream effect on the success criterion are all consequences of modeling choices that were not required to be pre-specified.

You followed the rules. The rules didn't fully protect the analysis from post-hoc ambiguity.
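The conflict metric in this scenario can be made concrete with a posterior predictive tail probability. A minimal Monte Carlo sketch, with hypothetical counts (22/150 historical events, 21/100 in the current control arm) chosen to mirror the roughly 15% vs. 21% rates above; under the convention assumed here, a small tail probability signals conflict:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical historical control data: 22/150 events (~15% rate),
# summarized as a Beta(1 + 22, 1 + 128) posterior on the control rate.
a, b = 1 + 22, 1 + 128
n_obs, x_obs = 100, 21   # current trial's control arm: 21/100 events

# Monte Carlo posterior predictive check: draw a rate from the historical
# posterior, draw a control-arm count at that rate, and ask how often the
# predicted count is at least as extreme as the observed one.
rates = rng.beta(a, b, 200_000)
pred_counts = rng.binomial(n_obs, rates)
tail = np.mean(pred_counts >= x_obs)
print(f"posterior predictive tail probability: {tail:.3f}")
```

The computed number sits close to a 0.10-style boundary, which is exactly the situation the scenario describes: the pre-specified rulebook tells you a threshold was crossed, but not what the analysis does next.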

My recommendation: The final guidance should require sponsors to pre-specify three additional elements when using dynamic borrowing:

(a) Conflict detection criterion. The specific metric and threshold used to detect prior-data conflict.

(b) Pre-planned response. What happens when conflict is detected, including the fallback prior or the functional form of the borrowing weight reduction, specified with enough precision that the analysis is reproducible without analyst judgment at the time of unblinding.

(c) Operating characteristics under conflict. Simulation results showing how the design performs under realistic conflict scenarios, not just the assumed prior-data agreement scenario.

These three sub-requirements close the gap between pre-specifying what the prior is and pre-specifying what the analysis does when the prior is wrong. The former is already required. The latter is where post-hoc flexibility currently enters.
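One way to make items (a) and (b) concrete is a deterministic rule mapping the pre-specified conflict metric to a fully determined prior. Everything below is hypothetical (the threshold, the pseudo-counts, the linear discounting form, the small-tail-means-conflict convention); a real protocol might use a different functional form, but the test is the same: no analyst judgment remains at unblinding.

```python
# Illustrative pre-specification of items (a) and (b). Every number and
# name here is hypothetical, not drawn from any actual protocol.
CONFLICT_THRESHOLD = 0.10        # (a) conflict metric threshold (tail probability)
HIST_A, HIST_B = 22.0, 128.0     # historical pseudo-counts available to borrow

def borrowing_weight(tail_prob: float) -> float:
    # (b) pre-planned response: full borrowing at or above the threshold,
    # linearly discounted below it (small tail probability = conflict).
    return min(1.0, tail_prob / CONFLICT_THRESHOLD)

def analysis_prior(tail_prob: float) -> tuple[float, float]:
    # Power-prior-style discounting of the historical pseudo-counts,
    # layered on a vague Beta(1, 1) baseline.
    w = borrowing_weight(tail_prob)
    return 1.0 + w * HIST_A, 1.0 + w * HIST_B

print(analysis_prior(0.50))  # no conflict: full borrowing -> (23.0, 129.0)
print(analysis_prior(0.05))  # conflict: half weight -> (12.0, 65.0)
```

A step-function fallback (drop to the vague prior outright) is the simplest reproducible form; graded reductions like this one are equally acceptable so long as the functional form is fixed in the protocol.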


Comment 3: Evaluate the composed pipeline, not just individual components

The guidance addresses Bayesian design components in separate sections: prior specification, effective sample size, success criteria, sequential monitoring, operating characteristics. Each section is thorough. But in practice, sponsors don't deploy these components in isolation.

A Bayesian trial that borrows external data implements a pipeline: the prior is elicited, the borrowing model determines how much external information enters, the ESS informs the sample size calculation, the success threshold governs decisions, and monitoring rules determine when to stop. Each component's behavior depends on the others. The operating characteristics of the composed system are not the product of the individual components' operating characteristics.

Two interactions are particularly consequential:

Prior-data conflict and sequential stopping. A dynamic borrowing prior that downweights external data under conflict changes the effective information at each interim look. If conflict emerges gradually, the early interim analyses operate under a different effective prior than the later ones. A stopping boundary calibrated under the assumption of stable borrowing weight may be miscalibrated when borrowing weight is shifting.

ESS as a random variable under dynamic borrowing. The guidance recommends ESS as the primary metric for prior influence and recommends sample size calculations that account for the prior's contribution. But under dynamic borrowing, ESS isn't fixed; it's a random variable. A sample size calculation based on a single assumed ESS value may silently inflate Type I error or erode power relative to what was simulated under nominal assumptions.

My recommendation: The final guidance should require simulation studies that demonstrate the design's performance under scenarios where the borrowing weight varies across interim looks and where the effective sample size differs from its nominal value. This end-to-end evaluation would reveal interaction effects that component-level analysis cannot detect.
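The ESS point is easy to see in simulation. This sketch ties a power-prior-style weight to prior-data agreement (one illustrative discounting rule among many; all counts and the 0.10 scale are invented) and shows that the realized ESS is a distribution, not a number, and that its center shifts when the true control rate drifts from the historical one:

```python
import numpy as np

rng = np.random.default_rng(3)

n_hist, x_hist = 150, 22          # hypothetical historical control data
n_cur = 100                       # current control arm size

def realized_ess(p_true, sims=5000):
    """Distribution of the historical data's effective sample size when
    the borrowing weight depends on observed prior-data agreement."""
    x_cur = rng.binomial(n_cur, p_true, sims)
    # Agreement-based weight: shrink borrowing linearly as the observed
    # rate drifts from the historical rate (an illustrative rule only).
    gap = np.abs(x_cur / n_cur - x_hist / n_hist)
    w = np.clip(1 - gap / 0.10, 0, 1)
    return w * n_hist             # ESS contributed by the historical data

for p in (0.15, 0.21):
    ess = realized_ess(p)
    print(f"p_true={p:.2f}: mean ESS={ess.mean():5.1f}, "
          f"5th-95th pct=({np.percentile(ess, 5):.0f}, {np.percentile(ess, 95):.0f})")
```

A sample size calculation that plugs in a single nominal ESS ignores both the spread within each scenario and the shift between scenarios, which is precisely how the miscalibration described above enters.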


The common thread

These three comments share a theme: the guidance's individual-component recommendations are sound, but the final version would benefit from specificity about how those components interact in practice. Naming the hybrid framework (Comment 1), specifying what pre-specification means when the prior fails (Comment 2), and requiring end-to-end evaluation of composed designs (Comment 3) would narrow the gap between the guidance's methodological rigor and the operational realities sponsors face.

Bayesian rigor lives in the interactions. That's where design credibility is won or lost.

The comment period is open until March 13, 2026. If you have thoughts on any of these issues, or disagree with my framing, submit them. Docket FDA-2025-D-3217 on regulations.gov.


This post is part of a series on FDA's Bayesian guidance. Previous: The FDA's Bayesian Guidance: Learning in Theory, Pre-Specification in Practice.

📬 For more essays on experimental design, regulatory evidence, and statistical decision-making across domains, subscribe to the Evidence in the Wild newsletter or browse the archive.