When Response-Adaptive Randomization Is the Right Design: Lessons from PAIN-CONTRoLS
The most common critique of response-adaptive randomization (RAR) is that it reduces power, inflates type I error, and produces unreliable treatment comparisons. The critique is not wrong: for certain RAR procedures, in certain trial contexts, those problems are real. But the critique has calcified into a general prohibition, one that doesn't hold up when you examine the trial settings where RAR is actually defensible.
PAIN-CONTRoLS is one of those settings. It is also one of the cleaner examples of Bayesian RAR in the published literature: a multi-arm comparative effectiveness trial with a composite utility endpoint, prespecified stopping rules, calibrated operating characteristics, and a clinical question that made the ethical case for adaptive allocation explicit from the start.
The clinical problem
Cryptogenic sensory polyneuropathy (CSPN, peripheral neuropathy with no identifiable cause) affects roughly 5 million people in the United States. Neuropathic pain is the dominant symptom in 70% to 80% of patients. Neurologists have been treating it for decades with a rotating cast of medications, all off-label for CSPN: nortriptyline, duloxetine, pregabalin, mexiletine. Duloxetine and pregabalin carry FDA approval for painful diabetic peripheral neuropathy; the others are supported by extrapolation and clinical experience.
What didn't exist before PAIN-CONTRoLS was a head-to-head comparison of all four in the same population, at the same time, under the same conditions. Physicians were choosing between these drugs based on clinical habit, anecdote, and extrapolation from diabetic neuropathy trials that may not generalize to CSPN. The comparative effectiveness question had never been answered.
This matters for the design choice. When there is no existing evidence on relative treatment performance and no strong prior reason to favor any arm, you are not just running a trial to confirm a hypothesis; you are running an experiment to learn which options are worth pursuing. That is exactly the setting where RAR has a legitimate role.
The design
PAIN-CONTRoLS was a multisite, prospective, open-label Bayesian adaptive randomized trial conducted from December 2014 through October 2017, enrolling 402 patients with CSPN across 40 neurology clinics in the United States and Canada. PCORI-funded. Three years. Real-world dosing, real-world prescriptions, real-world insurance coverage barriers included by design.
The four arms: nortriptyline (75 mg/day), duloxetine (60 mg/day), pregabalin (300 mg/day), mexiletine (600 mg/day).
The primary endpoint was a composite utility function combining two measures: the proportion of patients who achieved ≥50% pain reduction at 12 weeks (efficacy) and the proportion who discontinued the study drug for any reason (quit rate). The function was U(E, Q) = 0.75E + 1 − Q, where E and Q are the efficacy and quit proportions respectively. The weights were set in advance through clinical consultation. The highest possible utility is 1.75; the lowest is 0. In effect, the score rewards drugs that both work and are tolerated, penalizing treatments that patients abandon even if they produce meaningful pain relief.
This is worth pausing on. The primary endpoint was not just response rate. It was response rate adjusted for tolerability. A drug that works in 30% of patients but drives 60% to discontinue due to adverse effects is not the same as a drug that works in 25% of patients with a 38% quit rate. The utility function captures that distinction in a single number that is clinically interpretable and directly drives the adaptive allocation.
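To make that arithmetic concrete, here is a minimal sketch of the utility calculation in Python. The function is just the prespecified formula; the two inputs are the hypothetical drugs from the previous paragraph, not trial data.

```python
def utility(efficacy, quit_rate):
    """Composite utility U(E, Q) = 0.75*E + 1 - Q, as prespecified in PAIN-CONTRoLS.

    efficacy:  proportion achieving >=50% pain reduction at 12 weeks
    quit_rate: proportion discontinuing the study drug for any reason
    Range: 0 (nothing works, everyone quits) to 1.75 (everyone responds, nobody quits).
    """
    return 0.75 * efficacy + 1.0 - quit_rate

# The two hypothetical drugs from the paragraph above:
print(utility(0.30, 0.60))  # 0.625  -- works more often, but most patients abandon it
print(utility(0.25, 0.38))  # 0.8075 -- works slightly less often, but patients stay on it
```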
How RAR was implemented: The first 80 patients were allocated 1:1:1:1. After that, the randomization proportions were updated every 13 weeks based on the posterior probability that each arm had the highest utility. Better-performing arms received higher allocation; arms with greater uncertainty (smaller sample sizes) also received a modest boost, to accelerate learning where the data were thinnest. The algorithm therefore balanced exploitation (treating more patients with better-performing drugs) and exploration (reducing uncertainty about under-sampled arms).
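The trial's exact update rule lives in its protocol and statistical analysis plan; what follows is only a plausible reconstruction of the mechanism described above. It assumes independent Beta posteriors for each arm's efficacy and quit proportions, Monte Carlo estimation of the probability that each arm has the highest utility, and a simple sample-size adjustment standing in for the extra weight given to uncertain arms.

```python
import numpy as np

rng = np.random.default_rng(0)

def allocation_probs(successes, failures, quits, stays, n_draws=100_000):
    """Rough sketch of posterior-probability-based allocation for a multi-arm trial.

    For each arm, draw the efficacy and quit proportions from independent
    Beta(1 + events, 1 + non-events) posteriors, compute the composite utility
    U = 0.75*E + 1 - Q for every draw, and estimate the probability that each
    arm has the highest utility. Allocation weights combine that probability
    with an ad hoc bonus for under-sampled arms.
    """
    n_arms = len(successes)
    utilities = np.empty((n_draws, n_arms))
    for k in range(n_arms):
        eff = rng.beta(1 + successes[k], 1 + failures[k], size=n_draws)
        quit_p = rng.beta(1 + quits[k], 1 + stays[k], size=n_draws)
        utilities[:, k] = 0.75 * eff + 1.0 - quit_p
    p_best = np.bincount(utilities.argmax(axis=1), minlength=n_arms) / n_draws

    n_outcomes = np.array(successes) + np.array(failures)  # patients with outcome data
    # Uncertainty bonus: favor arms with fewer observed outcomes (one common heuristic,
    # not necessarily the trial's actual weighting).
    weights = p_best * np.sqrt((n_outcomes.mean() + 1) / (n_outcomes + 1))
    return weights / weights.sum()

# Illustrative interim counts (invented, not the trial's data):
print(allocation_probs(successes=[10, 9, 4, 5], failures=[25, 26, 20, 15],
                       quits=[13, 12, 10, 14], stays=[22, 23, 14, 6]).round(3))
```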
At each interim analysis, a decision was made: continue enrolling up to the prespecified maximum of ~400 patients, or stop early for success. The success rule was based on the posterior probability that the best treatment's utility exceeded 0.925. The design was calibrated to achieve 80% power at approximately 5% type I error.
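How a design gets "calibrated to approximately 5% type I error and 80% power" is, in practice, a simulation exercise: fix a posterior-probability cutoff for the success rule, simulate many trials under a null scenario (all arms equal) and under plausible alternatives, and count how often the design declares success. Below is a sketch of the interim success check only, under the same assumed Beta-posterior machinery as above; the 0.925 utility bar is from the trial description, while the 0.95 posterior-probability cutoff is a placeholder for whatever value the design team actually tuned.

```python
import numpy as np

rng = np.random.default_rng(1)

def stop_for_success(successes, failures, quits, stays,
                     utility_bar=0.925, prob_cutoff=0.95, n_draws=100_000):
    """Simplified interim success check for a utility-based multi-arm design.

    Draws each arm's efficacy and quit proportions from Beta posteriors,
    identifies the arm with the highest posterior mean utility, and asks
    whether the posterior probability that its utility exceeds `utility_bar`
    reaches `prob_cutoff`. The 0.95 cutoff is a placeholder, not the trial's
    actual tuned value.
    """
    n_arms = len(successes)
    utilities = np.empty((n_draws, n_arms))
    for k in range(n_arms):
        eff = rng.beta(1 + successes[k], 1 + failures[k], size=n_draws)
        quit_p = rng.beta(1 + quits[k], 1 + stays[k], size=n_draws)
        utilities[:, k] = 0.75 * eff + 1.0 - quit_p
    best = int(utilities.mean(axis=0).argmax())            # current leading arm
    p_exceeds = float((utilities[:, best] > utility_bar).mean())
    return p_exceeds >= prob_cutoff, p_exceeds

# Illustrative interim counts (invented, not trial data):
print(stop_for_success(successes=[30, 28, 10, 12], failures=[70, 72, 50, 40],
                       quits=[35, 34, 25, 30], stays=[65, 66, 35, 22]))
```

Calibration wraps a check like this in an outer loop over simulated trials, nudging the cutoff until the null-scenario success rate sits near 5% and power under the design alternative reaches 80%.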
What the trial found
Final enrollment: 402 patients. Due to the adaptive allocation, the arms were not balanced at study end:
| Treatment | N | Efficacy (≥50% pain reduction) | Quit rate | Utility (95% CrI) | P(best) |
|---|---|---|---|---|---|
| Nortriptyline | 134 | 25.4% | 38.1% | 0.81 (0.69–0.93) | 0.52 |
| Duloxetine | 126 | 23.0% | 37.3% | 0.80 (0.68–0.92) | 0.43 |
| Pregabalin | 73 | 15.1% | 42.5% | 0.69 (0.55–0.84) | 0.05 |
| Mexiletine | 69 | 20.3% | 58.0% | 0.58 (0.42–0.75) | 0.00 |
No single drug was clearly superior. Nortriptyline had the highest posterior probability of being best (0.52), duloxetine was close behind (0.43), and their 95% credible intervals overlapped. Pregabalin's low efficacy (15.1%) drove its poor utility despite a moderate quit rate. Mexiletine's extraordinary quit rate (58%) effectively eliminated it as a viable option (P(best) < 0.01).
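A quick sanity check on the table: plugging the efficacy and quit columns straight back into the prespecified utility formula reproduces the reported utilities to within rounding (the published values are posterior summaries rather than plug-in estimates, so exact agreement isn't expected).

```python
def utility(efficacy, quit_rate):
    return 0.75 * efficacy + 1.0 - quit_rate

for drug, e, q in [("nortriptyline", 0.254, 0.381), ("duloxetine", 0.230, 0.373),
                   ("pregabalin", 0.151, 0.425), ("mexiletine", 0.203, 0.580)]:
    print(f"{drug:14s} {utility(e, q):.2f}")
# nortriptyline 0.81, duloxetine 0.80, pregabalin 0.69, mexiletine 0.57
# (reported posterior means: 0.81, 0.80, 0.69, 0.58)
```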
The practical takeaway: nortriptyline and duloxetine should be considered first-line for CSPN pain. Pregabalin and mexiletine, both commonly prescribed (and pregabalin FDA-approved for painful diabetic neuropathy), performed measurably worse once tolerability was part of the picture.
Why the design was right for this question
Three features of PAIN-CONTRoLS map directly onto the situations where RAR is methodologically appropriate.
The setting was multi-arm with genuine equipoise. Four drugs, no strong prior evidence on relative performance in CSPN, no placebo control to anchor the comparison. Balanced randomization implicitly assumes each arm is equally promising throughout the study. Once the accumulating data indicate otherwise, continuing to allocate patients equally becomes inefficient: precision keeps accruing for arms the data have already marked as inferior, while fewer patients than an adaptive design would allow receive the treatments most likely to benefit them. By the time 300 patients had been enrolled, the trial had already learned that mexiletine's quit rate was notably higher than the others'. Continuing to allocate patients equally to it would have been both inefficient and arguably unjustifiable.
The endpoint was fast enough to feed the adaptive machinery. Response was measured at 12 weeks. Interim analyses occurred every 13 weeks. The lag between enrollment and available outcome data was short enough for the RAR mechanism to actually function; allocation could update meaningfully before most of the trial's information had been spent. This timing matters: many RAR proposals founder because outcomes arrive too slowly for allocation to adapt before the trial is nearly complete. With survival endpoints and long follow-up, for example, the allocation may never shift in time to have any practical effect on patient benefit.
The question was comparative effectiveness, not confirmatory efficacy. PAIN-CONTRoLS was not trying to establish that any drug beats placebo. It was trying to rank options that clinicians were already prescribing, in a population that those drugs had never been studied in, with an endpoint that captured what actually makes patients stop taking a medication. That is a fundamentally different inferential goal from a Phase III confirmatory trial, and it calls for a fundamentally different design. The RAR machinery was aligned with the goal.
What the design honestly couldn't do
The trial was open-label. Patients and investigators knew which drug was being taken. The utility function's quit rate component is sensitive to this: a patient who knows they are receiving a drug with a poor reputation for side effects may be more likely to discontinue. Mexiletine's 58% quit rate is striking, and adverse events were the primary reason for quitting across all arms, but the absence of blinding makes it impossible to fully separate pharmacological effect from expectation.
The pragmatic design also meant patients were responsible for paying for their prescriptions. Fourteen patients quit because of insurance denial and ten because of cost, specifically in the pregabalin arm (8.2% of its patients) and the mexiletine arm. These quit events count against a drug's utility score in the same way as adverse-event-driven discontinuations, which is arguably appropriate for a real-world decision question but introduces noise that is not pharmacological in origin.
Neither of these is a design failure. They are deliberate choices in a comparative effectiveness frame: the trial was explicitly trying to model real prescribing conditions. But they constrain what conclusions can be drawn about the drugs' inherent tolerability profiles.
The connection to the broader RAR debate
PAIN-CONTRoLS appears in Robertson et al. (2023) as one of the few "vanilla BRAR" examples in recent literature, a trial that used Bayesian RAR as the core allocation mechanism without layering it into a more complex master protocol. Robertson et al. cite it alongside a handful of other trials to push back on the claim that RAR is only used in high-profile oncology settings and therefore not generalizable to standard practice.
The Thall–Evans critique ("Just Say No") is fundamentally an argument about expected value: the patient benefit gains from RAR are typically small, while the statistical costs (reduced power, potential type I error inflation, implementation complexity) are real. That argument has more force in two-arm confirmatory trials, where balanced allocation is near-optimal and the inferential machinery is well-established. It has less force in the PAIN-CONTRoLS scenario, where the multi-arm structure makes balanced allocation actively suboptimal from a patient-benefit standpoint, the endpoint is composite and utility-based rather than a simple binary, and the clinical question is explicitly comparative rather than confirmatory.
The choice of RAR procedure also matters here. This is where much of the field's critique is least careful. PAIN-CONTRoLS used a design calibrated to hold 5% type I error with 80% power. It was not using raw Thompson sampling, which tends to produce the most extreme allocation imbalances and underlies many of the simulation-based critiques in the literature.
The bottom line
PAIN-CONTRoLS answered a question that three decades of individual drug studies had not: when you put nortriptyline, duloxetine, pregabalin, and mexiletine in the same room and ask which one patients actually stay on while experiencing meaningful pain relief, nortriptyline and duloxetine win and mexiletine loses.
That answer has practical value for every neurologist who treats CSPN. It also has statistical value: the design generated it in 402 patients, at 40 sites, over three years, with an endpoint that captured clinical reality rather than regulatory convenience.
Was RAR the only way to get there? No. A fixed balanced design with group sequential stopping rules could have produced similar results. But the adaptive allocation served the stated goals (getting more patients onto better-performing regimens during the trial while maintaining valid inference at the end), and it did so without the operating characteristic failures that dominate the anti-RAR literature.
The lesson is not that RAR is broadly underused. The lesson is that the arguments against it are narrower than they are often presented. In the specific settings where RAR belongs (multi-arm, genuinely uncertain, composite endpoint, comparative effectiveness question) the methodological case holds up.
PAIN-CONTRoLS is one of those settings. It is worth knowing what it looks like.
This post is the first in a series on response-adaptive randomization in clinical trials, drawing on Robertson et al. (2023), "Response-Adaptive Randomization in Clinical Trials: From Myths to Practical Considerations," Statistical Science 38(2), 185–208.
Many evidentiary problems appear during trial design, not after the analysis. I work with teams to review trial designs and run simulation studies to evaluate operating characteristics before protocols are finalized.
For consulting inquiries: maggie@zetyra.com
For more essays on statistical design and regulatory evidence, subscribe to the Evidence in the Wild newsletter.