Why It Is Safe to Stop A/B Testing — A Practical Guide to e-value and e-process
Have you ever looked at the results midway through running an A/B test? From the perspective of classical statistics, at that moment, the Type I error rate (false positive rate) is no longer guaranteed. Traditional hypothesis testing is based on the premise that the sample size is predetermined and the test is performed only once. However, in real-world software systems, data flows in real time, decisions must be made at any moment, and it is impossible to know in advance when to end an experiment.
This article is intended for data engineers or backend developers who are familiar with the concepts of p-values and confidence intervals. If you have ever found the constraint that a p-value must be viewed "only once" inconvenient in practice, this article will be particularly useful. After reading this, you will be able to construct your own statistically safe A/B test pipeline that allows you to look into the experiment midway.
The e-value and the e-process form the general theory of modern sequential inference that bridges exactly this gap: they guarantee Type I error control no matter when you stop, and they let heterogeneous tests be combined freely. We cover everything in order, from the mathematical foundations to practical R and Python code, and on to common pitfalls encountered in the field.
Key Concepts
e-value: "How much do you win betting against the null hypothesis?"
The quickest way to understand the e-value is through a betting analogy. If you bet $1 believing the null hypothesis ($H₀: no effect) is false, the amount you receive after seeing the experimental results is $E$. For this game to be "fair"—that is, for there to be no expected return when the null hypothesis is actually true—the expected value of the e-value must be at most 1. This is the mathematical definition of the e-value.
$$E \geq 0, \quad \mathbb{E}_{H_0}[E] \leq 1$$
Betting Score: If $E = 20$, it is strong evidence against the null hypothesis. If $E = 0.3$, it means that you lost more than your principal when betting on the alternative hypothesis, which is not evidence that the null hypothesis is true, but rather that the evidence for the alternative hypothesis is weak.
The most significant difference from the p-value is that the e-value controls the Type I error (the false positive rate at significance level $\alpha$) directly, via Markov's inequality.
$$P_{H_0}(E \geq 1/\alpha) \leq \alpha$$
Therefore, the null hypothesis can be rejected at significance level $\alpha$ the moment $E \geq 1/\alpha$. With $\alpha = 0.05$, the rejection threshold is 20.
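This Markov guarantee is easy to verify numerically. Below is a minimal sketch (our own illustration, not from any package): we simulate likelihood-ratio e-values under a true null and check how often they cross $1/\alpha = 20$.

```python
import numpy as np

# Sketch: empirically check the Markov guarantee P_{H0}(E >= 1/alpha) <= alpha
# for a simple likelihood-ratio e-value. Under H0 the data are N(0,1); the
# e-value is the likelihood ratio against N(delta, 1), which has E_{H0}[E] = 1.
rng = np.random.default_rng(0)
alpha, delta, n_obs, n_sims = 0.05, 0.3, 50, 20_000

x = rng.normal(0.0, 1.0, size=(n_sims, n_obs))    # H0 data
log_e = np.sum(delta * x - delta**2 / 2, axis=1)  # log of the product e-value
false_positive_rate = np.mean(np.exp(log_e) >= 1 / alpha)

print(f"empirical P(E >= 20) = {false_positive_rate:.4f}  (bound: {alpha})")
```

With these settings the empirical crossing rate lands well below 0.05; Markov's inequality guarantees it can never systematically exceed $\alpha$.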
e-process: e-value flowing over time
An e-process is a sequence of e-values $(E_t)_{t \geq 0}$ that remains a valid e-value at any stopping time $\tau$:
$$\mathbb{E}_{H_0}[E_\tau] \leq 1 \quad \text{(for every stopping time } \tau\text{)}$$
In the language of martingale theory, under the null hypothesis an e-process is a nonnegative supermartingale: a process whose expected value stays constant or decreases over time. This property is the mathematical basis for the guarantee that it is "safe to stop at any time."
Sequential update rule: whenever a new observation $x_t$ arrives, the e-process is updated as follows.
$$E_t = E_{t-1} \cdot e_t$$
Here, $e_t$ is the incremental e-value for the $t$-th observation. Thanks to this product structure, data can be updated online in a streaming environment without re-reading the entire dataset.
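In practice the running product can overflow or underflow on long streams, so implementations typically accumulate in log space. A minimal sketch (our own illustration, assuming the Gaussian incremental e-value $e_t = \exp(\delta x_t - \delta^2/2)$ used later in this article):

```python
import math

# Sketch: streaming e-process update done in log space so that very long
# streams cannot overflow or underflow the running product E_t = E_{t-1} * e_t.
# Assumes the Gaussian incremental e-value e_t = exp(delta*x_t - delta**2/2).
class EProcess:
    def __init__(self, delta: float):
        self.delta = delta
        self.log_e = 0.0  # log E_0 = log 1 = 0

    def update(self, x_t: float) -> float:
        """Fold in one observation and return the current e-value E_t."""
        self.log_e += self.delta * x_t - self.delta**2 / 2  # add log e_t
        return math.exp(self.log_e)

ep = EProcess(delta=0.3)
for x in [0.5, 1.2, -0.1, 0.9]:
    e_t = ep.update(x)
print(f"E_4 = {e_t:.3f}")  # → E_4 = 1.768
```

Because each step only adds one term to a running sum, the entire dataset never needs to be re-read, which is exactly the streaming property claimed above.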
Ville's inequality is what guarantees this.
$$P_{H_0}\left(\sup_{t \geq 0} E_t \geq \frac{1}{\alpha}\right) \leq \alpha$$
Ville's Inequality: An inequality that guarantees that the probability of an e-process path crossing the threshold $1/\alpha$ at least once is less than or equal to $\alpha$. It structurally resolves the "multiple testing inflation" problem that occurs in p-value-based continuous monitoring.
From these two properties, the e-process simultaneously guarantees the following.
- Optional stopping: the Type I error stays at or below $\alpha$ no matter when you choose to stop
- Optional continuation: if the test does not yet reject, you can collect more data and test again
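The contrast with naive p-value "peeking" can be made concrete by simulation. The sketch below is our own illustration under an assumed one-sample Gaussian setting (H0: mean 0 vs. alternative mean $\delta$, unit variance), not code from the article's examples:

```python
import math
import numpy as np

# Sketch: under a true null, compare the false positive rate of
# (a) an e-process stopped the moment E_t >= 1/alpha, and
# (b) "peeking" at a one-sided z-test p-value after every observation.
rng = np.random.default_rng(1)
alpha, delta, n_max, n_sims = 0.05, 0.3, 100, 2000

e_rejections = p_rejections = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)              # H0 is true in every run
    log_e = np.cumsum(delta * x - delta**2 / 2)  # log e-process path
    if np.any(log_e >= math.log(1 / alpha)):     # optional stopping at E_t >= 20
        e_rejections += 1
    t = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(t)                # running z statistic
    p = np.array([math.erfc(v / math.sqrt(2)) / 2 for v in z])  # one-sided p
    if np.any(p < alpha):                        # reject if any peek "succeeds"
        p_rejections += 1

print(f"e-process false positive rate: {e_rejections / n_sims:.3f}")
print(f"p-value peeking false positive rate: {p_rejections / n_sims:.3f}")
```

Under continuous peeking the p-value's false positive rate inflates several-fold, while the e-process rate stays below $\alpha$ regardless of the stopping rule.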
Relationship with mSPRT: From Special Cases to General Theory
| Method | Key assumption | Combinable with other tests | Stopping flexibility |
|---|---|---|---|
| Fixed-sample t-test | Normality, fixed $n$ | No | None |
| SPRT | Simple alternative, single parameter | Limited | Some |
| mSPRT | Composite alternative + specific prior | Only within the same design | Some |
| e-value / e-process | None (distribution-free) | Freely, even across heterogeneous tests | Complete |
mSPRT (Mixture Sequential Probability Ratio Test) is a special case of the e-value: the mixture likelihood ratio, obtained by integrating the likelihood ratio over a prior distribution $\pi$, satisfies the e-value condition. mSPRT, adopted in Optimizely's Stats Engine, is a pioneering example of this paradigm, but it requires a single-parameter family and a specific prior. The e-value provides a much wider design space.
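To make the connection concrete, here is a sketch of the Gaussian mSPRT statistic as an e-value (the closed form below follows from a $N(0, \tau^2)$ prior on the mean; the function name is our own illustration):

```python
import numpy as np

# Sketch: the mSPRT mixture likelihood ratio as an e-value. For x_i ~ N(delta, 1)
# and a N(0, tau^2) prior on delta, the mixture integral has the closed form
#   E_n = (1 + n*tau^2)^(-1/2) * exp(tau^2 * S_n^2 / (2*(1 + n*tau^2))),
# with S_n the running sum. It satisfies E_{H0}[E_n] <= 1 at every n.
def msprt_e_value(x: np.ndarray, tau: float = 1.0) -> float:
    n, s = len(x), float(np.sum(x))
    return float(np.exp(tau**2 * s**2 / (2 * (1 + n * tau**2)))
                 / np.sqrt(1 + n * tau**2))

rng = np.random.default_rng(7)
x = rng.normal(0.4, 1.0, 120)  # data with a true effect
print(f"mSPRT e-value: {msprt_e_value(x):.1f}")
```

Because the prior is baked into the statistic, this e-value is only "optimal" for effects near the prior's scale $\tau$, which is exactly the restriction the general e-value framework removes.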
Free Combination of Heterogeneous Tests: The Multiplication Principle
The most powerful property of an e-value is that the product of independently generated e-values is a valid e-value.
$$E_{\text{combined}} = E_1 \times E_2 \times \cdots \times E_k$$
Here, "independently generated" means that each e-value is computed without sharing data with the others; this is the condition that matters, rather than independence in the strictest mathematical sense. When it is satisfied, the following combinations are mathematically valid.
- Evidence generated in different laboratories
- Different data types (continuous + binary)
- e-processes generated under different filtrations
```python
import numpy as np

# e-values from three independent experiments (no data shared between silos)
e1 = 4.2  # RCT A
e2 = 3.1  # observational study B
e3 = 2.8  # distributed silo C

# Combined e-value: guaranteed to be a valid e-value
e_combined = e1 * e2 * e3
print(f"combined e-value: {e_combined:.2f}")  # 36.46

alpha = 0.05
threshold = 1 / alpha  # 20.0
print(f"reject (α=0.05): {e_combined > threshold}")  # True
```

Filtration: a sequence of information sets that grows over time. When different measurement methods or data sources each carry their own filtration, an "adjust-then-combine" framework (a class of adjuster functions) is required to combine their e-processes (Choe et al., 2024).
Key Triangle
| Concept | Role | Classical counterpart |
|---|---|---|
| e-value | Evidence measure at a single point | p-value |
| e-process | e-value path evolving over time (supermartingale) | Test-statistic time series |
| Confidence sequence | Anytime-valid confidence interval built from an e-process | Confidence interval |
Now, let's translate these concepts into actual code. e-value designs do pay a power cost, but in many scenarios the advantages of optional stopping and heterogeneous combination more than offset it. We will work through three practical examples.
Practical Application
Why R first: at the time of writing, the only production-ready e-value implementation is the R safestats package. Python is used for the scenarios that are hard to cover from R (federated learning, streaming).
Example 1: Continuous Monitoring of Online A/B Tests
We are running an experiment to improve the conversion rate. While the classical t-test requires the sample size to be determined in advance, the e-process-based approach allows checking daily and stopping when sufficient evidence is available.
```r
library(safestats)

# Design stage: fix the minimum effect size to detect (deltaMin) and the significance level.
# deltaMin must be decided from business requirements before the experiment starts.
design <- designSafeT(
  deltaMin = 0.3,            # minimum meaningful effect size (Cohen's d)
  alpha = 0.05,
  alternative = "greater",
  testType = "twoSample"
)

# Update the e-value as data accumulates (can be called repeatedly)
set.seed(42)
control <- rnorm(150, mean = 0, sd = 1)
treatment <- rnorm(150, mean = 0.3, sd = 1)

result <- safeTTest(
  x = treatment,
  y = control,
  designObj = design
)

print(result)
cat("e-value:", result$eValue, "\n")
cat("reject:", result$eValue > 1/0.05, "\n")
```

If `eValue` in the output exceeds 20, a conclusion can be drawn immediately without collecting additional data. If it is below 20, the experiment can simply continue; with a p-value, a "continuous monitoring" error would already have been committed at this point.
| Code component | Role |
|---|---|
| `designSafeT()` | Designs the e-value test from the minimum effect size (once, before the experiment) |
| `safeTTest()` | Computes the e-value from the accumulated data (can be called repeatedly, at any time) |
| `result$eValue > 20` | Rejection condition at significance level 0.05 |
Visualizing the e-process path during the experiment allows for an intuitive identification of when sufficient evidence has been obtained.
```python
import numpy as np
import matplotlib.pyplot as plt

# Visualize the e-process path
# Update rule: E_t = E_{t-1} * e_t (e_t: incremental e-value of the t-th observation)
np.random.seed(42)
n = 200
data = np.random.normal(0.3, 1, n)  # simulate a treatment effect of δ=0.3
delta = 0.3  # minimum detectable effect fixed at the design stage

e, e_path = 1.0, []
for x in data:
    e_t = np.exp(delta * x - delta**2 / 2)  # Gaussian incremental e-value
    e *= e_t
    e_path.append(e)

fig, ax = plt.subplots(figsize=(10, 4))
ax.semilogy(e_path, color='steelblue', linewidth=1.5, label="e-value path")
ax.axhline(y=20, color='crimson', linestyle='--', linewidth=1.5,
           label="rejection threshold 1/α = 20 (α=0.05)")
ax.set_xlabel("cumulative number of observations")
ax.set_ylabel("e-value (log scale)")
ax.set_title("Continuous A/B test monitoring: the e-process path")
ax.legend()
plt.tight_layout()
plt.savefig("e_process_ab_test.png", dpi=150)
```

The point where the e-value trends upward on the log scale and crosses the threshold of 20 is the first moment at which the experiment can be stopped in a statistically safe way.
Example 2: Distributed Learning Environment — Combining Evidence Without Data Sharing
In federated learning or multi-silo environments, each node does not share raw data but transmits only e-values, which are combined centrally.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class SiloResult:
    silo_id: str
    e_value: float
    sample_size: int

def federated_e_combine(silo_results: List[SiloResult], alpha: float = 0.05) -> dict:
    """
    Multiply the e-values from each silo to perform a combined test.
    The Type I error guarantee is preserved without any data sharing.
    Precondition: every silo must compute its e-value independently,
    without sharing data with the other silos.
    """
    assert len(silo_results) > 0, "at least one silo result is required"
    e_combined = 1.0
    for silo in silo_results:
        e_combined *= silo.e_value
    threshold = 1.0 / alpha
    return {
        "e_combined": e_combined,
        "threshold": threshold,
        "reject_H0": e_combined > threshold,
        "total_samples": sum(s.sample_size for s in silo_results),
    }

# e-values computed independently at three hospital silos
hospital_results = [
    SiloResult("hospital-A", e_value=3.2, sample_size=80),
    SiloResult("hospital-B", e_value=2.7, sample_size=65),
    SiloResult("hospital-C", e_value=2.5, sample_size=90),
]

result = federated_e_combine(hospital_results, alpha=0.05)
print(f"combined e-value: {result['e_combined']:.2f}")  # 21.60
print(f"reject: {result['reject_H0']}")                 # True
print(f"total samples: {result['total_samples']}")      # 235
```

The combined e-value of 21.60 exceeds the threshold of 20, so the null hypothesis is rejected. What no single hospital could reject on its own becomes possible with nothing more than the product of three e-values, and without sharing any raw data.
Example 3: Streaming Anomaly Detection
This example detects change points in a time-series stream; the moment the e-process path crosses the threshold is the alarm.
```python
import numpy as np

def streaming_change_detection(stream: np.ndarray, alpha: float = 0.05) -> dict:
    """
    SPRT-style e-process anomaly detection.
    Optional stopping: the moment the threshold is crossed is the detection time.
    Update rule: E_t = E_{t-1} * exp(δ*x_t - δ²/2)
    """
    threshold = 1.0 / alpha  # 20.0
    e = 1.0
    e_path = [1.0]
    alarm_time = None
    # Baseline distribution: N(0,1); alternative: mean shift δ=0.5
    delta = 0.5
    for t, x in enumerate(stream, start=1):
        # incremental e-value under the Gaussian assumption
        e_t = np.exp(delta * x - delta**2 / 2)
        e = e * e_t
        e_path.append(e)
        if alarm_time is None and e > threshold:
            alarm_time = t
    return {
        "alarm_time": alarm_time,
        "final_e_value": e,
        "e_path": e_path,
    }

np.random.seed(0)
# first 200 points: normal regime N(0,1)
# next 100 points: anomalous regime N(0.5,1)
normal_stream = np.random.normal(0, 1, 200)
shift_stream = np.random.normal(0.5, 1, 100)
full_stream = np.concatenate([normal_stream, shift_stream])

result = streaming_change_detection(full_stream, alpha=0.05)
print(f"alarm time: t={result['alarm_time']}")
print(f"final e-value: {result['final_e_value']:.1f}")
```

An alarm raised after the change point (t=200) is a correct detection. The time it takes to reach the threshold of 20 is the detection delay; a larger delta detects faster but raises the false positive rate during the normal regime.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Optional stopping | Type I error $\alpha$ is guaranteed even if the experiment is stopped at any point |
| Optional continuation | If the test fails to reject, more data can be collected and the test repeated |
| Combining heterogeneous tests | Tests with different assumptions and filtrations can be combined freely by multiplication |
| Post-hoc significance level | The test remains valid even if $\alpha$ is changed after looking at the data |
| Assumption-free option | The universal inference e-process needs no strong assumptions such as normality |
| Intuitive interpretation | "An $E$-fold payoff from betting against the null hypothesis" is easy to explain even to non-statisticians |
| Multiple testing | Much weaker dependence conditions than p-values when controlling FWER and FDR |
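The multiple-testing advantage refers to procedures such as e-BH (the e-value analogue of Benjamini–Hochberg from Wang & Ramdas, JRSS-B 2022, cited in the references), whose FDR guarantee holds under arbitrary dependence between the e-values. A minimal sketch, our own illustration of that rule:

```python
import numpy as np

# Sketch of the e-BH procedure: reject the hypotheses with the k* largest
# e-values, where k* = max{k : k * e_(k) >= n / alpha} and e_(k) is the
# k-th largest e-value. Unlike BH on p-values, the FDR <= alpha guarantee
# needs no assumption on the dependence between the e-values.
def e_bh(e_values: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    n = len(e_values)
    order = np.argsort(e_values)[::-1]        # indices, largest e-value first
    sorted_e = e_values[order]
    k_values = np.arange(1, n + 1)
    valid = k_values * sorted_e >= n / alpha  # condition k * e_(k) >= n/alpha
    if not np.any(valid):
        return np.array([], dtype=int)
    k_star = k_values[valid].max()
    return np.sort(order[:k_star])            # indices of rejected hypotheses

e_vals = np.array([260.0, 1.2, 150.0, 0.4, 100.0, 3.0])
print("rejected:", e_bh(e_vals, alpha=0.05))  # → rejected: [0 2 4]
```

Note how large the e-values must be: with $n = 6$ hypotheses at $\alpha = 0.05$, even the single largest e-value must reach $n/\alpha = 120$ to be rejected on its own.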
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Power cost | Power can be lower than the optimal fixed-$n$ p-value test at the same $n$ | Design up front with `designSafeT()`; larger effects shrink the gap |
| e-value design complexity | Deriving an optimal e-value requires prior knowledge of the alternative | Universal inference eases the design burden |
| Immature software | No standard Python package; in R, mostly just safestats | Use the R safestats package or implement directly |
| Interpreting small e-values | $E_t < 1$ means weak evidence for the alternative, not acceptance of the null | Treat $E_t$ purely as a measure of evidence strength; design a separate test for null acceptance |
| Combining across filtrations | Configuring adjuster functions is tricky when filtrations differ | Follow the "adjust-then-combine" framework of Choe et al. (2024) |
| Lack of researcher familiarity | Extra communication cost to persuade reviewers and collaborators | Cite the Ramdas & Wang (2025) textbook in the methods section |
Universal inference: proposed by Wasserman et al. (PNAS 2020). Split the data in half, estimate the parameters on one half, and compute the split likelihood ratio on the other; this yields a valid e-value for essentially any hypothesis, with no normality or exponential-family assumptions.
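A sketch of the split-likelihood-ratio idea, using a Gaussian mean model purely as the running example (the method itself needs no such assumption; the function name is our own):

```python
import numpy as np

# Sketch of universal inference for H0: mu = 0 with unit-variance Gaussian
# data. Split the data in half, estimate mu on one half, then compute the
# likelihood ratio N(x; mu_hat, 1) / N(x; 0, 1) on the other half. The
# resulting statistic is a valid e-value with no regularity conditions.
def universal_e_value(x: np.ndarray, rng: np.random.Generator) -> float:
    idx = rng.permutation(len(x))
    d0, d1 = x[idx[: len(x) // 2]], x[idx[len(x) // 2 :]]
    mu_hat = float(np.mean(d0))                  # estimate on half 0
    # split likelihood ratio on half 1: log LR(x) = mu_hat*x - mu_hat^2/2
    log_lr = np.sum(mu_hat * d1 - mu_hat**2 / 2)
    return float(np.exp(log_lr))

rng = np.random.default_rng(3)
x = rng.normal(0.5, 1.0, 200)  # data with a true effect
print(f"universal inference e-value: {universal_e_value(x, rng):.1f}")
```

The price of this generality is power: half the data is spent on estimation, which is one concrete instance of the "power cost" listed in the table above.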
The Most Common Mistakes in Practice
- Designing the e-value from interim data: `deltaMin` in `designSafeT()` must be fixed from business requirements before the experiment. Choosing `deltaMin` from the same experiment's pilot data is circular reasoning (data dredging).
- Generating multiple e-values from the same data and multiplying them: this breaks independence and voids the validity of the combination. First check whether the silos or data splits actually overlap.
- Interpreting $E_t < 1$ as "proof of the null hypothesis": a low e-value means the evidence for the alternative is weak, not that the null is true. If an equivalence test is needed, a separate e-value must be designed for it.
In Conclusion
e-values are not a perfect alternative. You pay the cost of power, the Python ecosystem is still immature, and convincing your team requires energy. However, if you want to transform your A/B testing pipeline into one that is "okay to check anytime," you can start right now simply by logging e-values in parallel with your existing tests. It is perfectly fine to postpone the decision to switch until you have personally experienced the difference in power.
3 Steps to Start Right Now:
- Run your first safe t-test with the R `safestats` package: reanalyze your current A/B test data via `install.packages("safestats")` → `designSafeT(deltaMin = 0.5, alpha = 0.05)` → `safeTTest()`, and compare the p-value and e-value results side by side.
- Read Chapters 1–3 of the Ramdas & Wang (2025) textbook: the official arXiv version (arXiv:2410.23614) is free. In about 60 pages you can grasp the core theory, from the definition of the e-value and Ville's inequality to confidence sequences.
- Add e-process monitoring to your streaming pipeline: insert the `streaming_change_detection` function from Example 3 into a Kafka consumer loop or existing stream-processing logic, log the e-value alongside existing monitoring metrics, and track when the threshold of 20 is crossed.
Next Post: Advanced Confidence Sequences — Constructing Anytime-Valid Confidence Intervals and Real-Time Effect Size Estimation with e-Process
Reference Materials
Theoretical Background (Textbooks/Surveys)
- Hypothesis Testing with E-values | Ramdas & Wang, 2025 (arXiv:2410.23614)
- A tiny review on e-values and e-processes | Ruodu Wang, 2023
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science, 2023
- Testing by Betting: A Strategy for Statistical Communication | Shafer, JRSS-A 2021
- E-values | Wikipedia
- The Stats Map — E-Process
R/Python Packages and Implementations
- CRAN: safestats package
- safestats vignette — Safe Flexible Hypothesis Tests
- ICML 2025 Tutorial: Game-theoretic Statistics and Sequential Anytime-Valid Inference
Advanced Thesis (by Topic)
- Universal Inference — PNAS 2020 | Wasserman et al.
- Combining Evidence Across Filtrations | arXiv:2402.09698
- Regularized e-processes | arXiv:2410.01427
- E-values for Adaptive Clinical Trials | arXiv:2602.06379
- False Discovery Rate Control with E-values | JRSS-B, 2022
- The e-value | Nieuw Archief voor Wiskunde, 2024