The Square Peg Problem: Why FDA’s Bayesian–Frequentist Truce Still Hurts
In January 2026, the FDA released updated guidance on the use of Bayesian methods in clinical trials. The document does not read like a manifesto. It reads like an attempt to reconcile competing statistical cultures under real regulatory constraints. On its surface, it is pragmatic and flexible, welcoming Bayesian designs while emphasizing error control and interpretability.
Underneath, however, the guidance asks sponsors to do something philosophically awkward: use Bayesian inference, but calibrate it to frequentist type I error rates.
This is not a minor technical choice. It reflects a deeper tension between two statistical worldviews that answer fundamentally different questions. Yuan Ji, in a comment on an earlier post of mine, described this tension as fitting a square peg in a round hole. The framing is precise: a Bayesian model will not put a point mass on the null value, so calibration to type I error is intrinsically awkward. The phrase is apt, and increasingly consequential.
The questions don't match
At the core of the Bayesian–frequentist divide is not a disagreement about mathematics but a disagreement about questions: specifically, which questions regulators believe it is appropriate to answer.
Bayesian inference asks: given the data we observed, what should we believe about the treatment effect? Type I error control asks something different: if the null hypothesis were true, how often would we incorrectly reject it across repeated hypothetical trials?
These questions live in different worlds. Bayesian inference is conditional on the data at hand. Type I error is defined under a counterfactual data-generating process that assumes the null is true. Bayesian models typically place no point mass on the null. There is no privileged hypothesis to "reject." Evidence accumulates continuously through the posterior distribution.
When regulators ask Bayesian designs to demonstrate type I error control, they are asking a belief-updating framework to behave like a long-run error budgeting system. The result is friction. Not because either framework is wrong, but because they are answering different questions.
What happens when you force the marriage
The consequences of this friction are not theoretical. They show up in real trials.
The AIR2 trial (Castro et al., AJRCCM 2010) randomized 288 patients with severe asthma 2:1 to bronchial thermoplasty or sham bronchoscopy. It was the pivotal study for the Alair system, and one of the first sham-controlled device trials to use a Bayesian primary analysis. The design specified a flat prior for the treatment effect on Asthma Quality of Life Questionnaire scores and required a posterior probability of superiority of at least 96.4% for the primary endpoint. That threshold was not chosen on inferential grounds. It was inflated from 95% to account for two planned interim analyses and keep the overall type I error rate at 5%.
The interim analyses never occurred. Enrollment was faster than expected, and neither planned look was triggered. But the inflated threshold stayed in place.
At the final analysis, the posterior probability of superiority in the intent-to-treat population was 96.0%. The per-protocol analysis reached 97.9%.
The trial was declared negative on its primary endpoint. It missed the threshold by 0.4 percentage points.
Not because the data were weak, but because the analysis was penalized for protecting against data looks that never happened. As Frank Harrell wrote in his discussion of the case, the study failed solely because of interim analyses that did not occur.
The FDA's Anesthesiology and Respiratory Therapy Devices Panel reviewed the data in October 2009 and voted to recommend the device as approvable with conditions. FDA granted premarket approval in April 2010. Human judgment overrode the statistical conclusion because the calibration penalty, not the evidence, had produced the negative result.
This is what the hybrid compromise looks like when it matters. A posterior probability of 96.0% is strong evidence by any coherent Bayesian standard. It was declared insufficient because the decision rule was designed for a trial that did not happen as planned. The flat prior contributed nothing inferentially useful; a skeptical prior might have controlled error without inflating the threshold. Instead, the penalty was imposed after the fact, on a posterior that had already done its job.
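The threshold mechanics are easy to see in a toy calculation. Here is a minimal sketch, with illustrative numbers chosen only to land near a 0.96 posterior probability (not the actual AIR2 data or analysis): under a flat prior and a normal likelihood, the posterior probability of superiority is a simple tail probability, and the same posterior can clear a 95% threshold while failing an inflated 96.4% one.

```python
from statistics import NormalDist

def posterior_prob_superiority(effect_estimate, standard_error):
    """P(true effect > 0 | data) under a flat prior and a normal likelihood.

    With a flat prior the posterior is N(effect_estimate, standard_error^2),
    so the probability of superiority is just a normal tail probability.
    """
    return NormalDist().cdf(effect_estimate / standard_error)

# Hypothetical effect estimate and standard error, picked so the
# posterior probability of superiority comes out near 0.96.
p = posterior_prob_superiority(effect_estimate=0.19, standard_error=0.108)
print(f"Posterior probability of superiority: {p:.3f}")
print("Clears the 0.95 threshold: ", p > 0.95)
print("Clears the inflated 0.964 threshold:", p > 0.964)
```

The data and the posterior are identical in both comparisons; only the decision rule moves. That is the entire mechanism by which AIR2's primary endpoint flipped from positive to negative.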
The same dynamic plays out less dramatically but more routinely in rare disease, where single-arm designs with historical borrowing are common and calibration pressure is acute. A typical pattern: a sponsor proposes a skeptical prior and a posterior success criterion of 0.975 for a neuromuscular or metabolic disease study borrowing from natural history data. Simulations show a type I error of 4.1% under the null. To hit the 2.5% target, the sponsor raises the posterior threshold to 0.99, dropping power from 82% to 70%, or increases discounting of the historical data, inflating the required sample size in a population where every patient matters.
Both adjustments satisfy calibration. Neither is neutral. They encode choices about what risk matters more: false approval or delayed access. But those choices are presented as technical tuning rather than the value judgments they are.
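The calibration squeeze can be sketched with a small Monte Carlo simulation. This is a hypothetical single-arm design with a normal endpoint and a skeptical normal prior, not any specific sponsor's program; the sample size, effect size, and prior standard deviation are assumptions, and the exact operating characteristics depend on all of them.

```python
import random
from statistics import NormalDist

def run_operating_characteristics(true_effect, threshold, n=40, sigma=1.0,
                                  prior_sd=0.5, n_sims=20000, seed=1):
    """Fraction of simulated trials where posterior P(effect > 0) > threshold."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    # Conjugate normal update: skeptical prior N(0, prior_sd^2)
    # combined with likelihood N(xbar, se^2).
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / se ** 2)
    post_sd = post_var ** 0.5
    successes = 0
    for _ in range(n_sims):
        xbar = rng.gauss(true_effect, se)
        post_mean = post_var * xbar / se ** 2  # prior mean is zero
        if NormalDist().cdf(post_mean / post_sd) > threshold:
            successes += 1
    return successes / n_sims

# Raising the posterior success threshold buys type I error control
# under the null (true_effect=0) at the cost of power under the
# assumed alternative (true_effect=0.4).
for thr in (0.975, 0.99):
    t1e = run_operating_characteristics(true_effect=0.0, threshold=thr)
    power = run_operating_characteristics(true_effect=0.4, threshold=thr)
    print(f"threshold={thr}: type I error={t1e:.3f}, power={power:.3f}")
```

Running this shows the pattern in miniature: the stricter threshold cuts the simulated type I error roughly in half, and takes a double-digit bite out of power with it. Which side of that trade matters more is exactly the value judgment the text argues should be made explicitly.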
The FDA's pragmatic compromise
The guidance does not ignore this tension. It manages it institutionally. I explored the guidance's emphasis on pre-specification in an earlier post. Here, I want to focus on calibration specifically.
The document lays out two broad paths. The default: Bayesian designs calibrated via simulation to demonstrate reasonable control of the type I error rate. The alternative, with sponsor–FDA agreement: non-calibrated Bayesian designs justified through Bayesian operating characteristics and decision-relevant metrics.
This is not philosophical indecision. It is pragmatic pluralism shaped by decades of regulatory precedent, legal expectations, and the need for cross-trial comparability.
The guidance repeatedly emphasizes reasonableness rather than strict guarantees. FDA states explicitly: "We strive for reasonable control of the type I error rate." That word matters. It signals that FDA recognizes the limits of forcing Bayesian inference to conform to frequentist error definitions, while still needing comparability, transparency, and precedent.
Rather than choosing sides, FDA has created a negotiated space where both camps can operate, provided sponsors are explicit about what they are optimizing and why.
Skeptical priors as the better bridge
One way to reduce the incoherence is to move skepticism where it belongs: into the prior.
Instead of inflating posterior probability thresholds to account for interim looks or multiplicity, skeptical priors discount evidence in a Bayesian-coherent way. They encode caution before seeing the data, rather than penalizing the posterior after the fact.
This matters practically. In the rare disease example above, a well-justified skeptical prior might achieve acceptable type I error control without pushing the posterior threshold to 0.99 or gutting the historical borrowing. The design stays interpretable. The posterior probability still means what it's supposed to mean. The internal logic holds.
The FDA guidance explicitly acknowledges skeptical priors as an acceptable and often preferable strategy. This is an important signal: coherence matters, even within a hybrid regulatory framework. The REBYOTA approval, which used Bayesian primary inference with dynamic borrowing, demonstrates that FDA will accept designs where the Bayesian logic and clinical context are well aligned.
The trade-offs you need to name
Hybrid designs are not neutral. They embed value judgments about which mistakes matter most, even when those judgments are hidden behind simulation reports and operating characteristics.
If you choose calibration, understand what behaviors you are locking in and what penalties you may incur. AIR2 is the extreme case, but subtler versions happen constantly: power eroded by a fraction, borrowing weakened by a margin, all in service of a type I error target that the Bayesian framework was never designed to optimize.
Document early why a calibrated or non-calibrated approach is appropriate for your clinical context. Separate the elements that serve epistemic coherence from those that satisfy operating characteristic requirements. Reviewers can handle the distinction. Pretending there isn't one doesn't help anyone.
And consider whether skeptical priors can achieve your regulatory goals more coherently than threshold inflation. A design that controls error through a principled prior is more defensible than one that controls error by moving the goalposts.
The square peg problem does not disappear because we ignore it. But with careful design, transparency, and alignment between statistical logic and clinical stakes, it can be managed.
The real risk is not using Bayesian methods. It is using them without naming the compromises they entail.
The FDA's draft guidance on Bayesian methods is currently open for public comment through March 13. For statisticians and trial designers working in this space, this is a rare opportunity to engage directly with how these compromises are formalized in regulation.
📬 For more essays on experimental design, regulatory evidence, and statistical decision-making across domains, subscribe to the Evidence in the Wild newsletter or browse the archive.