The Mathematics of A/B Testing That Can Be Stopped at Any Time: Sequential Hypothesis Testing and Always Valid Inference for Solving Peeking Problems with mSPRT
When running A/B tests, you are bound to feel this temptation at least once: "The results look pretty good right now; shouldn't I stop already?" With traditional statistical methods, this behavior is fatal. Simply looking at the results before reaching the predetermined sample size—known as "peeking"—can inflate the Type I error (false positive) rate from the nominal α=0.05 to 0.2 or higher. A change with no effect gets declared the "winner," and the experiment's results become worthless.
Is there, then, no method that is statistically safe to peek at any time? There is. The methods adopted by companies like Optimizely, Netflix, Spotify, and Amplitude to solve this problem are mSPRT (Mixture Sequential Probability Ratio Test) and Always Valid Inference. mSPRT is designed so that the test statistic satisfies the martingale property, mathematically guaranteeing that the Type I error rate stays at α or below, no matter how many times you peek or when you stop the experiment.
In this article, you will learn the mathematical principles, Python implementation, and practical application strategies of mSPRT all at once. After reading, you will understand how to set τ (the prior-distribution standard deviation) to avoid losing power, why an experiment never terminates unless a lower threshold is set, and how the martingale property creates the mathematical guarantee that it is "safe to look at any time."
Key Concepts
SPRT: Sequential Hypothesis Testing
Fixed-sample tests (t-test, z-test) determine the sample size n before the experiment and make a judgment only once after all the data has been collected. On the other hand, the Sequential Probability Ratio Test (SPRT) is a sequential hypothesis testing method proposed by Abraham Wald (1945) that calculates the likelihood ratio for each incoming data point and immediately draws one of three conclusions.
Likelihood Ratio:
$$\Lambda_n = \prod_{i=1}^{n} \frac{f_{\theta_1}(X_i)}{f_{\theta_0}(X_i)}$$
The decision rule is determined by two thresholds A (upper limit) and B (lower limit).
| Condition | Conclusion |
|---|---|
| $\Lambda_n \geq A$ | H₁ Accept → Experiment End |
| $\Lambda_n \leq B$ | H₀ Accept → Experiment End |
| $B < \Lambda_n < A$ | Collect additional data |
Calculate the threshold value using Wald's approximation.
- Upper bound: $A \approx \frac{1-\beta}{\alpha}$
- Lower bound: $B \approx \frac{\beta}{1-\alpha}$
For example, if α=0.05 and β=0.20, then A≈16 and B≈0.211.
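The Wald approximation above can be computed directly. A minimal sketch (the function name is illustrative, not from any library):

```python
# Wald's approximate SPRT thresholds from the target error rates.
def wald_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    upper = (1 - beta) / alpha      # A: accept H1 when the likelihood ratio >= A
    lower = beta / (1 - alpha)      # B: accept H0 when the likelihood ratio <= B
    return upper, lower

A, B = wald_thresholds(alpha=0.05, beta=0.20)
print(round(A, 3), round(B, 3))  # ≈ 16.0 and ≈ 0.211, matching the example above
```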
Type I Error (α): The probability of incorrectly judging that there is a difference when there actually is no difference. It is an error that creates false winners in A/B testing. Type II Error (β): The probability of missing a real difference by reporting "no difference." Power = 1 − β.
Limitations of Classic SPRT: Single Effect Size Assumption
Classical SPRT is based on the simple hypothesis that "the alternative hypothesis has exactly one effect size (θ₁)." In actual A/B testing, it is impossible to know in advance whether the click-through rate will increase by 1% or 3%. If this assumption is broken, Type I error control fails.
mSPRT: Applying a distribution to the alternative hypothesis
The core idea of mSPRT is to assign a prior distribution π(θ) to the parameter θ of the alternative hypothesis instead of a single θ₁, and to integrate (average) the likelihood ratio over that distribution. It became an industry standard after Johari et al. (2015) applied it to online A/B testing.
Mixture Likelihood Ratio:
$$\tilde{\Lambda}_n = \int \prod_{i=1}^{n} \frac{f_{\theta}(X_i)}{f_{\theta_0}(X_i)} \, \pi(\theta) \, d\theta$$
Don't worry if the formula looks complicated — for the normal case this integral reduces to the closed-form expression below, so you can jump straight to the code without following the derivation.
If normal data is used with a normal prior distribution (mean 0, variance τ²), a closed-form solution exists.
$$\tilde{\Lambda}_n = \sqrt{\frac{2\sigma^2}{2\sigma^2 + n\tau^2}} \cdot \exp\left(\frac{n^2\tau^2(\bar{Y}_n - \bar{X}_n)^2}{4\sigma^2(2\sigma^2 + n\tau^2)}\right)$$
- σ²: Population variance (known value or estimate)
- τ²: Variance of the prior distribution — A key parameter reflecting the expected effect size
- n: Number of observations per group
- $\bar{X}_n$, $\bar{Y}_n$: Sample mean of control group/treatment group
Impact of the τ setting: A large τ admits a wider range of effect sizes as plausible alternatives, making the test conservative and requiring more data to reach a conclusion. Conversely, if τ is set far smaller than the actual effect size, power is lost badly. In practice, `tau = sigma * 0.5` is a safe starting point.
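The τ trade-off is easy to see numerically. The sketch below evaluates the closed-form statistic for one fixed, illustrative scenario (n = 200 per group, observed mean difference 0.3, σ = 1 — all assumed numbers) under a too-small, a moderate, and a too-large τ:

```python
import numpy as np

def mixture_lr(mean_diff: float, n: int, sigma: float, tau: float) -> float:
    """Closed-form normal-normal mSPRT statistic from the formula above."""
    s2, t2 = sigma ** 2, tau ** 2
    denom = 2 * s2 + n * t2
    return np.sqrt(2 * s2 / denom) * np.exp(n ** 2 * t2 * mean_diff ** 2 / (4 * s2 * denom))

# Same observed difference, three choices of tau (sigma = 1).
for tau in (0.05, 0.5, 2.0):
    lam = mixture_lr(mean_diff=0.3, n=200, sigma=1.0, tau=tau)
    print(f"tau={tau}: lambda = {lam:.2f}")
```

With these particular numbers the moderate τ = 0.5 yields by far the largest statistic, while both the too-small and too-large settings shrink it — the "lost power" effect described above.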
Always Valid Inference: The Mathematical Basis of "It's Okay to Look At Anytime"
Martingale: The mixture likelihood ratio $\tilde{\Lambda}_n$ satisfies the martingale property under the null hypothesis: the conditional expectation of its next value, given the data so far, equals its current value. From this property, Ville's inequality follows.
From the martingale property, the following inequality holds.
$$P_{H_0}\left(\sup_{n \geq 1} \tilde{\Lambda}_n \geq \frac{1}{\alpha}\right) \leq \alpha$$
When the null hypothesis is true, the probability that the mixture likelihood ratio ever exceeds the critical value 1/α at any point during the experiment is at most α. No matter how many times you look or when you stop, the Type I error rate is controlled. This is Always Valid Inference.
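Ville's inequality can be checked empirically: simulate many A/A experiments (H₀ true), peek after every single observation, and count how often the mixture ratio ever crosses 1/α. A minimal sketch reusing the closed-form statistic above (the simulation sizes 500 and 2000 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma, tau, n_max, n_sims = 0.05, 1.0, 0.5, 2000, 500
false_positives = 0

for _ in range(n_sims):
    # H0 is true: both groups share the same mean.
    x = rng.normal(0.0, sigma, n_max)
    y = rng.normal(0.0, sigma, n_max)
    n = np.arange(1, n_max + 1)
    diff = np.cumsum(y) / n - np.cumsum(x) / n          # running mean difference
    denom = 2 * sigma**2 + n * tau**2
    lam = np.sqrt(2 * sigma**2 / denom) * np.exp(
        n**2 * tau**2 * diff**2 / (4 * sigma**2 * denom)
    )
    if np.any(lam >= 1 / alpha):                        # peeked after every observation
        false_positives += 1

print(f"empirical Type I error: {false_positives / n_sims:.3f}  (guaranteed <= {alpha})")
```

Despite peeking 2,000 times per experiment, the empirical false-positive rate stays below α, as the inequality promises.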
Practical Application
Example 1: Real-time Monitoring of Revenue per User A/B Test
We assume an experiment to improve revenue per user. Assuming the revenue data follows a normal distribution, we implement real-time decision-making using mSPRT.
Caution: This implementation assumes a normal distribution, so it is suitable for continuous data such as revenue per user and time spent. Do not apply it directly to binary conversion rate (0/1 click) data.
```python
import numpy as np
from dataclasses import dataclass
from typing import Literal


@dataclass
class SPRTResult:
    n: int
    lambda_mix: float
    decision: Literal["reject H0", "accept H0", "continue"]
    x_bar: float
    y_bar: float


def compute_msprt(
    x: np.ndarray,
    y: np.ndarray,
    sigma: float,
    tau: float,
    alpha: float = 0.05,
    beta: float = 0.20,
) -> list[SPRTResult]:
    """
    Sequentially compute the mSPRT mixture likelihood ratio.

    Parameters
    ----------
    x : control-group observations
    y : treatment-group observations
    sigma : known (or estimated) common standard deviation
    tau : prior standard deviation — reflects the expected effect size.
          "Effect size around half of sigma" → tau = sigma * 0.5
    alpha : target Type I error rate (default 0.05)
    beta : target Type II error rate (default 0.20, i.e., 80% power)
    """
    threshold = 1.0 / alpha     # upper bound: accept-H1 threshold
    lower = beta / (1 - alpha)  # lower bound: accept-H0 threshold (Wald approximation)
    sigma2 = sigma ** 2
    tau2 = tau ** 2
    results = []
    for n in range(1, len(x) + 1):
        x_bar = np.mean(x[:n])
        y_bar = np.mean(y[:n])
        # Closed-form mSPRT (normal-normal conjugate)
        denom = 2 * sigma2 + n * tau2
        scale = np.sqrt(2 * sigma2 / denom)
        exponent = (n ** 2 * tau2 * (y_bar - x_bar) ** 2) / (4 * sigma2 * denom)
        lambda_mix = scale * np.exp(exponent)
        if lambda_mix >= threshold:
            decision = "reject H0"
        elif lambda_mix <= lower:
            decision = "accept H0"
        else:
            decision = "continue"
        results.append(
            SPRTResult(n=n, lambda_mix=lambda_mix,
                       decision=decision, x_bar=x_bar, y_bar=y_bar)
        )
        if decision != "continue":
            break
    return results


# --- Simulation ---
# Normality assumption: suitable for continuous data such as revenue per user.
np.random.seed(42)
ctrl = np.random.normal(loc=0.0, scale=1.0, size=1000)
treat = np.random.normal(loc=0.3, scale=1.0, size=1000)  # true effect size 0.3
history = compute_msprt(ctrl, treat, sigma=1.0, tau=0.5, alpha=0.05, beta=0.20)
final = history[-1]
print(f"Decision: {final.decision}")
print(f"Observations: {final.n} / 1000")
print(f"Mixture likelihood ratio: {final.lambda_mix:.4f}")
print(f"Control mean: {final.x_bar:.4f}, Treatment mean: {final.y_bar:.4f}")
```

Key Code Explanation:
| Variable/Line | Role |
|---|---|
| `threshold = 1/alpha` | Accept-H₁ threshold. With α=0.05, this is 20 |
| `lower = beta/(1-alpha)` | Accept-H₀ threshold. With α=0.05, β=0.20, ≈ 0.211 (Wald approximation: β/(1−α)) |
| `scale` | Normalization factor from the prior-distribution integral; decreases as n grows |
| `exponent` | Term that amplifies the difference between the two group means; grows rapidly with the effect |
| Early `break` | Stop immediately once a decision is reached — the core of sample efficiency |
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Sample Efficiency | Achieve equivalent power with an average of 40–60% less data compared to fixed-sample tests |
| Real-time Monitoring | Check results at any time without peeking issues (Always Valid) |
| Early Termination | Experiment may be terminated earlier than planned if effectiveness is clear |
| Complex Hypothesis Handling | Can set distribution range rather than single effect size as the alternative hypothesis |
| Mathematical Guarantee | The Type I error rate is always mathematically guaranteed due to the martingale property |
Disadvantages and Precautions
Peeking Problem: In fixed-sample testing, the act of checking intermediate results and stopping the experiment when they look favorable, before reaching the predetermined sample size n. Repeated intermediate checks alone can inflate the actual Type I error rate from α=0.05 to 0.2 or higher. mSPRT solves this problem mathematically.
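The inflation itself is easy to reproduce: run repeated two-sample z-tests at α = 0.05 on A/A data (no true effect) and stop at the first "significant" peek. A minimal sketch (the check interval and horizon are arbitrary illustration values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n_max, check_every, n_sims = 0.05, 2000, 100, 500
z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value, ≈ 1.96
false_positives = 0

for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)    # H0 true: both groups identical
    y = rng.normal(0.0, 1.0, n_max)
    for n in range(check_every, n_max + 1, check_every):
        # Two-sample z statistic with known unit variance.
        z = (y[:n].mean() - x[:n].mean()) / np.sqrt(2 / n)
        if abs(z) >= z_crit:           # stop at the first "significant" peek
            false_positives += 1
            break

print(f"Type I error with peeking every {check_every} obs: {false_positives / n_sims:.3f}")
```

With 20 peeks per experiment, the realized false-positive rate lands well above the nominal 0.05 — the failure mode mSPRT is built to avoid.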
| Item | Content | Response Plan |
|---|---|---|
| Practical significance not distinguished | Detects even minute effects that are statistically significant but practically negligible | Apply truncated mSPRT (arXiv 2509.07892) |
| τ must be pre-specified | The prior variance τ² must be set in advance, and results are sensitive to it | Estimate τ from pilot-experiment data, or set it conservatively large |
| Known-variance assumption | The basic formula assumes σ² is known, limiting practical application | Use the t-distribution-based anytime-valid t-test extension (arXiv 2310.03722) |
| Multiple metrics | Multiple-comparison issues arise when testing several metrics simultaneously | Combine e-values by multiplication, or apply Bonferroni correction |
| Implementation complexity | Higher barrier to entry than traditional tests, both conceptually and in parameter setting | Start with an open-source platform such as GrowthBook |
The Most Common Mistakes in Practice
- Setting τ too small: If τ is much smaller than the actual effect size, power drops sharply. Setting τ too large requires more data, but it is safer than setting it too small. If you are starting without pilot data, use `tau = sigma * 0.5` as your starting point.
- Omitting the lower threshold (H₀ acceptance): If you set only the accept-H₁ threshold (upper bound) and ignore the accept-H₀ threshold (lower bound), experiments with no effect will never terminate. Setting `lower = beta / (1 - alpha)` is mandatory, not optional.
- Ignoring conservatism in batch-based testing: If you evaluate once per day instead of after every observation, the actual Type I error rate falls below α, costing power. This is safe but inefficient. If real-time updates are possible, use them.
Extended Application
Example 2: Real-time Detection of ML Model Drift
mSPRT is not limited to A/B testing. The following is an example of detecting data drift in a production ML model in real time using single-hypothesis SPRT. The key difference is that, unlike the mSPRT in Example 1 which deals with the composite alternative hypothesis (the entire parameter distribution), here a single hypothesis, "a state where drift has occurred (μ₁)", is specified in advance.
```python
import numpy as np
from scipy.stats import norm


def drift_detector_sprt(
    reference_scores: np.ndarray,
    stream_scores: np.ndarray,
    alert_threshold: float = 20.0,
) -> dict:
    """
    Drift detector based on simple-hypothesis SPRT.

    ⚠️ Unlike the mSPRT in Example 1 (composite alternative hypothesis),
    the drift direction and magnitude are specified in advance here.

    reference_scores: baseline model scores (training/validation set)
    stream_scores   : incoming real-time scores
    """
    mu0 = np.mean(reference_scores)
    sigma0 = np.std(reference_scores)
    mu1 = mu0 - 0.5 * sigma0  # drift hypothesis: mean drops by 0.5 sigma

    log_lambda = 0.0  # cumulative log-likelihood ratio
    for i, score in enumerate(stream_scores, 1):
        log_lr = norm.logpdf(score, mu1, sigma0) - norm.logpdf(score, mu0, sigma0)
        log_lambda += log_lr
        if log_lambda >= np.log(alert_threshold):
            return {
                "drift_detected": True,
                "at_sample": i,
                "log_lambda": round(log_lambda, 4),
            }
        if log_lambda <= -np.log(alert_threshold):
            # Reset on reaching the lower bound.
            # ⚠️ This reset borrows the restart idea of Page's CUSUM into SPRT.
            # Pure SPRT assumes no resets, so do not assume that classic SPRT's
            # Type I error guarantee carries over to this code unchanged.
            log_lambda = 0.0
    return {"drift_detected": False, "samples_checked": len(stream_scores)}


# Simulation: drift begins after the 500th observation
np.random.seed(7)
ref = np.random.normal(0.8, 0.1, 1000)
normal = np.random.normal(0.8, 0.1, 500)
drifted = np.random.normal(0.75, 0.1, 300)  # mean drops
stream = np.concatenate([normal, drifted])
result = drift_detector_sprt(ref, stream)
print(result)
# Example output (with the fixed seed; exact values may vary by environment):
# {'drift_detected': True, 'at_sample': 512, 'log_lambda': 3.04...}
```

Use the log-likelihood ratio: it prevents floating-point underflow and turns multiplication into addition, improving numerical stability. It is essential whenever there are many observations.
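The underflow risk is concrete, not hypothetical. The sketch below (5,000 observations is an arbitrary illustration) multiplies raw normal densities — each at most about 0.4, so their product vanishes below float64's smallest representable value — while the sum of log-densities stays perfectly finite:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
obs = rng.normal(0.0, 1.0, 5000)  # 5,000 observations under the reference model

dens = norm.pdf(obs, loc=0.0, scale=1.0)                # each value <= ~0.4
naive_product = np.prod(dens)                           # underflows to exactly 0.0
log_sum = np.sum(norm.logpdf(obs, loc=0.0, scale=1.0))  # stays finite

print(naive_product)          # 0.0
print(np.isfinite(log_sum))   # True
```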
In Conclusion
By reading this article, you have learned three things: how to set τ to maintain power, why ineffective experiments never end without a lower threshold (the H₀ acceptance criterion), and how the martingale property creates the mathematical guarantee that it is "safe to look at any time." These three points are why Optimizely, Netflix, and Amplitude run real-time experiments on this technique.
3 Steps to Start Right Now:
- Copy the code block above exactly and apply it to your current experiment data with `tau=sigma*0.5`, `alpha=0.05`, and `beta=0.20`. You may also use the `msprt` package from PyPI, but since its API may differ from the code in this article, check its official documentation before using it.
- Launch the GrowthBook open source locally, enable the Sequential Testing option, and compare it side-by-side with your existing fixed-sample results.
- If you want to delve deeper, read the original paper "Always Valid Inference" by Johari et al. (2015) and run a simulation with your own experimental data to verify for yourself whether the Type I error rate is actually guaranteed.
Next Article: e-value and e-process — Beyond mSPRT, a General Theory of Modern Sequential Inference Freely Combining Heterogeneous Tests
Reference Materials
- Always Valid Inference: Continuous Monitoring of A/B Tests | Johari et al., arXiv
- Sequential Probability Ratio Test: SPRT and Mixture SPRT (mSPRT) | Medium, Carey Chou
- Sequential Test for Practical Significance: Truncated mSPRT | arXiv 2509.07892
- Choosing a Sequential Testing Framework | Spotify Engineering
- Sequential Testing at Booking.com | Booking.ai
- Mastering the mSPRT for A/B Testing | HackerNoon
- Sequential Probability Ratio Test | Wikipedia
- expectation: Python library for e-values and sequential testing | GitHub
- GrowthBook Sequential Testing Docs
- Statsig Sequential Testing | Statsig Blog
- Hypothesis Testing with E-values | Ramdas & Wang, 2025
- Python msprt package | PyPI
- Anytime-valid t-tests and confidence sequences | arXiv 2310.03722
- Sequential Confidence Intervals for A/B Testing | MDPI Mathematics, 2025