How to Mathematically Allow A/B Test Peeking — Estimating Real-Time Effect Size with e-Process and Anytime-Valid Confidence Sequences
Anyone running an A/B test is tempted at least once. Intermediate results look promising, so why wait for thousands of additional data points? This is known as the peeking problem: if you look at the results midway and stop the experiment early, the type-I error rate (false positives) spirals out of control. According to simulation studies, if you check p < 0.05 after every observation for up to 1,000 observations, the actual type-I error rate rises above 30% instead of the nominal 5%. Traditional fixed-sample methods do not allow for this situation.
However, Netflix, Adobe, and Spotify are already applying frameworks that solve this problem to their actual production environments (Netflix TechBlog, 2024). These are Confidence Sequences and e-process-based anytime-valid inference. As shown below, you can calculate a "valid at any time" confidence interval in real-time with just a few lines of code.
```python
from confseq.betting import betting_cs

# diffs: per-observation metric differences, rescaled to [0, 1]
lower, upper = betting_cs(x=diffs, alpha=0.05, running_intersection=True)
```

TL;DR: After reading this
- You can understand why the e-process mathematically guarantees that "the error rate is maintained even if you stop at any point during data collection."
- You can directly compute and visualize confidence sequences with the confseq library.
- The results of multiple independent experiments can be safely combined using e-value multiplication.
Recommended Prior Knowledge: An understanding of basic A/B testing concepts and p-values is sufficient. You can grasp the core concepts through game-theoretic intuition even without a background in martingales or measure theory.
Table of Contents
- Key Concepts
- Practical Application
- Most Common Mistakes in Practice
- Pros and Cons Analysis
- In Conclusion
Key Concepts
Limitations of Traditional Confidence Intervals: Why Peeking Is a Problem
The classic 95% confidence interval guarantees the following:
If the experiment is repeated infinitely using the same method, 95% of the intervals contain the true parameter.
This guarantee is based on the premise that it is calculated only once after data collection is complete. Checking the p-value midway—stopping if significant and continuing otherwise—breaks this premise. This is because every time the results are examined, the chance for "this time it happens to look significant" accumulates.
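The accumulation effect is easy to reproduce. The following is a minimal simulation sketch, not a reproduction of any cited study: it assumes standard normal data with known variance under a true null, runs a two-sided z-test after every observation, and compares "reject at any look" with "reject only at the final look." The function name and all parameter values are illustrative.

```python
import numpy as np

def peeking_false_positive_rate(n_experiments=2000, n_obs=1000, z_crit=1.96, seed=0):
    """Share of null experiments flagged as significant at any look vs. at the final look."""
    rng = np.random.default_rng(seed)
    # Null data: standard normal observations (true mean 0, known variance 1)
    x = rng.standard_normal((n_experiments, n_obs))
    t = np.arange(1, n_obs + 1)
    # z-statistic recomputed after each new observation
    z = np.cumsum(x, axis=1) / np.sqrt(t)
    ever_rejected = (np.abs(z) > z_crit).any(axis=1)
    final_rejected = np.abs(z[:, -1]) > z_crit
    return ever_rejected.mean(), final_rejected.mean()

peek_rate, fixed_rate = peeking_false_positive_rate()
print(f"reject at any look:        {peek_rate:.3f}")
print(f"reject at final look only: {fixed_rate:.3f}")
```

The final-look rate should sit near the nominal 5%, while the any-look rate is several times higher, even though both are computed on exactly the same null data.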
Confidence Sequence: An interval that remains valid regardless of when you stop
The confidence sequence $\{C_t\}_{t \geq 1}$ provides a much stronger guarantee.
$$P\left(\forall t \geq 1 : \theta^* \in C_t\right) \geq 1 - \alpha$$
This formula says one thing: whether you observe 100 times or 10,000 times, and however often you look in between, the probability that the true effect size $\theta^*$ stays inside every interval of the sequence is at least $1 - \alpha$ (95% for $\alpha = 0.05$).
The key point is that $\forall t$ — holds simultaneously at every time point. This is time-uniform coverage.
Time-Uniform Coverage: A coverage guarantee that holds simultaneously across all possible interruption points, rather than a probability at a single point in time. It is an inherently stronger condition than the pointwise guarantee of traditional confidence intervals.
e-value and e-process: Game-theoretic measures of evidence
The key tool for constructing a confidence sequence is the e-value.
| Concept | Definition | Meaning |
|---|---|---|
| e-value | A non-negative random variable $E$ satisfying $\mathbb{E}[E] \leq 1$ under the null hypothesis $H_0$ | A game-theoretic statement: "If the null hypothesis is true, the expected profit from this bet is 0 or less" |
| e-process | Sequential version of the e-value, $\{E_t\}_{t \geq 1}$ | A real-time measure of evidence, updated as data accumulates |
| Difference from p-value | e-values can be composed through multiplication | p-values cannot combine results from independent experiments through simple multiplication |
The core guarantee of the e-process is expressed by Ville's inequality.
$$P\left(\exists t \geq 1 : E_t \geq 1/\alpha\right) \leq \alpha$$
Ville's Inequality, Intuitively: If the coin is fair (the null hypothesis is true), then no matter how long you bet, the probability that your wealth ever reaches the threshold $1/\alpha$ is at most $\alpha$. Conversely, reaching the threshold is strong evidence that "this coin is not fair." This is the mathematical basis of the sequential test.
(Mathematically, it is a time-uniform probability bound for a nonnegative supermartingale, a "stochastic process that does not increase in expectation," and is an extension of Doob's maximal inequality.)
Therefore, even if the null hypothesis is rejected at the first time point where $E_t \geq 1/\alpha$, the error rate is controlled to be less than or equal to $\alpha$.
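The coin-betting picture can be checked numerically. The sketch below assumes a fair coin coded as +1/-1 and a bettor who stakes a fixed fraction `lam` of current wealth on heads each round, so the wealth process is $E_t = \prod_i (1 + \lambda x_i)$. The function name and the choice `lam=0.5` are illustrative, not taken from any library.

```python
import numpy as np

def crossing_rate(n_paths=5000, n_steps=500, lam=0.5, alpha=0.05, seed=1):
    """Fraction of fair-coin betting paths whose wealth ever reaches 1/alpha."""
    rng = np.random.default_rng(seed)
    # Fair coin: +1 or -1 with equal probability, so the null mean is 0
    x = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
    # Log-wealth after each round; wealth starts at 1 and stays positive for lam < 1
    log_wealth = np.cumsum(np.log1p(lam * x), axis=1)
    ever_crossed = (log_wealth >= np.log(1.0 / alpha)).any(axis=1)
    return ever_crossed.mean()

rate = crossing_rate()
print(f"share of paths that ever reach 1/alpha = 20: {rate:.3f}")
```

Under the null, the observed share should stay around or below $\alpha = 0.05$ (up to simulation noise), no matter how many rounds are played.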
Three Approaches to Constructing an e-Process
There are three main approaches, depending on how you design a game in which no bettor can expect to profit under the null hypothesis $H_0$.
- Betting Martingale (simple null hypothesis): For every observation $x_t$, choose a bet $\lambda_t$ using only past data and update wealth multiplicatively. `confseq.betting.betting_cs` implements this method.
$$E_t = \prod_{i=1}^{t}(1 + \lambda_i(x_i - \mu_0))$$
- Mixture Martingale (composite alternative): Averages the likelihood ratio over a prior distribution $\pi(\theta)$ on the alternative parameter. It suits situations where you are testing for "somewhere other than the null hypothesis" without knowing exactly which alternative holds.
$$E_t = \int \prod_{i=1}^{t} \frac{f_\theta(x_i)}{f_0(x_i)} \, d\pi(\theta)$$
- Exponential Martingale (exponential-family distributions such as the normal distribution): It arises naturally for exponential-family distributions (distributions whose density can be written in exponential form, such as the normal and Bernoulli distributions). The method used to solve the unknown-variance problem in 2024 falls into this category.
SAVI (Safe Anytime-Valid Inference) Framework: This is an umbrella framework that integrates these three methodologies. By combining e-process (hypothesis testing) and confidence sequences (effect size estimation), it enables statistical inference valid at any point during data collection (Ramdas et al., Statistical Science 2023).
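To make the link between an e-process and a confidence sequence concrete, here is a toy sketch. It is not how `confseq` implements `betting_cs`; it assumes [0, 1]-valued data, a fixed bet size `lam`, and a coarse grid of candidate means. For each candidate mean $m$ it runs a two-sided betting wealth process and keeps the candidates whose wealth has never reached $1/\alpha$. All names here are hypothetical.

```python
import numpy as np

def grid_betting_cs(x, alpha=0.05, lam=0.3, n_grid=199):
    """Toy confidence sequence for the mean of [0, 1]-valued data via fixed-fraction betting."""
    assert 0.0 < lam < 1.0  # keeps every wealth factor strictly positive
    x = np.asarray(x, dtype=float)
    grid = np.linspace(0.005, 0.995, n_grid)       # candidate means m
    dev = x[:, None] - grid[None, :]               # shape (time, grid)
    log_up = np.cumsum(np.log1p(lam * dev), axis=0)     # bets that the mean is above m
    log_down = np.cumsum(np.log1p(-lam * dev), axis=0)  # bets that the mean is below m
    # Hedged wealth: average of the two one-sided wealth processes
    log_wealth = np.logaddexp(log_up, log_down) - np.log(2.0)
    alive = log_wealth < np.log(1.0 / alpha)            # candidates not rejected at time t
    alive = np.logical_and.accumulate(alive, axis=0)    # once rejected, stay rejected
    any_alive = alive.any(axis=1)
    lower = np.where(any_alive, np.where(alive, grid, np.inf).min(axis=1), np.nan)
    upper = np.where(any_alive, np.where(alive, grid, -np.inf).max(axis=1), np.nan)
    return lower, upper

rng = np.random.default_rng(7)
data = rng.binomial(1, 0.3, size=3000).astype(float)
lo, hi = grid_betting_cs(data)
print(f"t=3000: [{lo[-1]:.3f}, {hi[-1]:.3f}]")
```

A fixed `lam` is the simplest valid strategy; practical implementations adapt the bet to the data observed so far, which is what makes their intervals tighter than this sketch.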
With the concepts in place, let's implement them in actual code.
Practical Application
Example 1: Real-time Confidence Sequence Visualization with the confseq Library
First, install the library.

```bash
pip install confseq numpy matplotlib
```

This is a click-through rate (CTR) A/B test scenario. We run an experiment with Group A (CTR 10%) and Group B (CTR 13%) and track the confidence sequence in real time.
```python
import numpy as np
import matplotlib.pyplot as plt
from confseq.betting import betting_cs

# Reproducible simulation setup
rng = np.random.default_rng(42)

# True click-through rates: A = 0.10, B = 0.13 (a 3 percentage point effect)
n_obs = 2000
clicks_a = rng.binomial(1, 0.10, n_obs)
clicks_b = rng.binomial(1, 0.13, n_obs)

# Sequence of observed differences (B - A), each in {-1, 0, 1}
# Null hypothesis H0: E[CTR difference] = 0
# This reading is valid when users are independently randomized into the two groups
diffs = clicks_b.astype(float) - clicks_a.astype(float)

# betting_cs is documented for observations in [0, 1], so rescale from [-1, 1]
scaled = (diffs + 1.0) / 2.0

# e-process-based confidence sequence (time-uniform 95%)
lower_s, upper_s = betting_cs(
    x=scaled,
    alpha=0.05,
    running_intersection=True,  # interval can only shrink over time (recommended in practice)
    parallel=False,
)

# Map the bounds back to the original difference scale
lower = 2.0 * np.asarray(lower_s) - 1.0
upper = 2.0 * np.asarray(upper_s) - 1.0

# Visualization
t = np.arange(1, n_obs + 1)
running_mean = np.cumsum(diffs) / t
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Top panel: confidence sequence
axes[0].fill_between(t, lower, upper, alpha=0.3, label='95% confidence sequence')
axes[0].plot(t, running_mean, 'b-', linewidth=1.5, label='Running mean difference')
axes[0].axhline(y=0.03, color='r', linestyle='--', label='True effect (0.03)')
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[0].set_xlabel('Number of observations (t)')
axes[0].set_ylabel('CTR difference (B - A)')
axes[0].set_title('e-process-based confidence sequence: valid whenever you stop')
axes[0].legend()

# Bottom panel: interval width (convergence pattern)
width = upper - lower
axes[1].plot(t, width, 'g-', linewidth=1.5)
axes[1].set_xlabel('Number of observations (t)')
axes[1].set_ylabel('Interval width')
axes[1].set_title('The interval narrows as samples accumulate')

plt.tight_layout()
plt.savefig('confidence_sequence.png', dpi=150)
plt.show()

# Real-time conclusion: find the first time the interval excludes 0
first_significant = np.where((lower > 0) | (upper < 0))[0]
if len(first_significant) > 0:
    t_stop = first_significant[0] + 1
    print(f"First significant difference detected at t={t_stop}")
    print(f"  Confidence interval: [{lower[t_stop-1]:.4f}, {upper[t_stop-1]:.4f}]")
    print(f"  Observations so far: {t_stop} ({t_stop/n_obs:.1%} of total)")
```

Execution result (example; exact numbers vary with the seed and library version):

```text
First significant difference detected at t=847
  Confidence interval: [0.0021, 0.0589]
  Observations so far: 847 (42.4% of total)
```

In this illustrative run, the effect is detected at about the 42% mark of the 2,000 planned observations. A traditional fixed-sample design would have had to wait for all 2,000.
| Code Element | Role |
|---|---|
| `betting_cs()` | Computes a betting-martingale-based confidence sequence |
| `running_intersection=True` | The interval shrinks monotonically over time (recommended in practice) |
| `alpha=0.05` | Time-uniform confidence level 1-α = 95% |
| `lower > 0` condition | If the confidence sequence excludes 0, the effect is declared significant |
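Example 2: Confidence Sequences for Continuous Metrics (Session Time, Revenue)
The same method applies to continuous metrics such as session time or order amount, not only to binary metrics like CTR. You do not need to know the variance, but the betting approach does assume bounded observations, so in practice you clip the metric to a known range and rescale it.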
```python
import numpy as np
from confseq.betting import betting_cs

rng = np.random.default_rng(0)

# Simulated session time in seconds: Group A mean 120, Group B mean 125 (5-second improvement)
n = 5000
cap = 300.0  # assumed upper bound on session time; values are clipped to [0, cap]
obs_a = np.clip(rng.normal(120, 30, n), 0.0, cap)  # std 30 s; the variance is never passed in
obs_b = np.clip(rng.normal(125, 30, n), 0.0, cap)
diffs = obs_b - obs_a  # lies in [-cap, cap]

# betting_cs is documented for observations in [0, 1]: rescale, then map the bounds back
scaled = (diffs + cap) / (2.0 * cap)
lower_s, upper_s = betting_cs(
    x=scaled,
    alpha=0.05,
    running_intersection=True,
)
lower = 2.0 * cap * np.asarray(lower_s) - cap
upper = 2.0 * cap * np.asarray(upper_s) - cap

t_arr = np.arange(1, n + 1)
running_mean = np.cumsum(diffs) / t_arr

# Monitoring report at the main checkpoints
check_points = [100, 500, 1000, 2000, 5000]
print(f"{'t':>6} | {'mean':>10} | {'lower':>9} | {'upper':>9} | {'width':>9} | {'has 0?':>7}")
print("-" * 65)
for t in check_points:
    idx = t - 1
    width = upper[idx] - lower[idx]
    contains_zero = "yes" if lower[idx] <= 0 <= upper[idx] else "no"
    print(f"{t:>6} | {running_mean[idx]:>10.4f} | {lower[idx]:>9.4f} | "
          f"{upper[idx]:>9.4f} | {width:>9.4f} | {contains_zero:>7}")
```

Execution result (example; exact numbers vary with the seed, the chosen bounds, and the library version):

```text
     t |       mean |     lower |     upper |     width |  has 0?
-----------------------------------------------------------------
   100 |     5.2341 |  -18.9432 |   29.4114 |   48.3546 |     yes
   500 |     4.8912 |   -5.1203 |   14.9027 |   20.0230 |     yes
  1000 |     5.1234 |    0.4312 |    9.8156 |    9.3844 |      no
  2000 |     4.9876 |    1.8234 |    8.1518 |    6.3284 |      no
  5000 |     5.0312 |    2.7891 |    7.2733 |    4.4842 |      no
```

| Checkpoint | Expected Behavior |
|---|---|
| t = 100 | Wide interval but includes the true mean |
| t = 1000 | Capture the signal, narrow down to a decision-making level |
| t = 5000 | Width close to the fixed sample t-interval |
| Full Path | Simultaneous Inclusion Guarantee at All Points |
Example 3: Combining Multiple e-Values from Independent Experiments
The most powerful property of the e-value is multiplicative composability. Multiplying the e-values of several independent experiments yields a valid combined e-value. Unlike p-values, which need a procedure such as Fisher's combination method (a chi-squared transformation), e-values can simply be multiplied.
```python
import numpy as np
from scipy.stats import binom


def compute_evalue_bernoulli(successes: int, trials: int, null_p: float) -> float:
    """
    Bernoulli e-value: mixture likelihood under a Beta(1,1) = Uniform prior,
    divided by the likelihood under the null.

    Under the Beta(1,1) prior, the marginal probability of any success count
    in `trials` draws has the closed form 1/(trials + 1), so:
        log_marginal_alt = -log(trials + 1)
    """
    log_marginal_alt = -np.log(trials + 1)  # mixture (alternative) likelihood, closed form
    log_null = binom.logpmf(successes, trials, null_p)
    return float(np.exp(log_marginal_alt - log_null))


# Three independent teams test the same hypothesis in separate experiments
experiments = [
    {"successes": 45, "trials": 400, "name": "Team A (Marketing)"},
    {"successes": 23, "trials": 200, "name": "Team B (Product)"},
    {"successes": 67, "trials": 600, "name": "Team C (Data)"},
]
null_ctr = 0.10  # null hypothesis: CTR = 10%
alpha = 0.05     # rejection threshold: 1/0.05 = 20

print(f"{'Experiment':>20} | {'e-value':>10} | {'Rejects alone?':>14}")
print("-" * 50)
combined_e = 1.0
for exp_data in experiments:
    e = compute_evalue_bernoulli(
        exp_data["successes"], exp_data["trials"], null_ctr
    )
    combined_e *= e  # valid because the experiments are independent
    reject = "yes" if e >= 1 / alpha else "no"
    print(f"{exp_data['name']:>20} | {e:>10.3f} | {reject:>14}")
print("-" * 50)
print(f"{'Combined e-value':>20} | {combined_e:>10.3f} | "
      f"{'yes' if combined_e >= 1/alpha else 'no':>14}")
print(f"\nRejection threshold (1/alpha): {1/alpha:.1f}")
print(f"Conclusion: {'reject the null (effect present)' if combined_e >= 1/alpha else 'cannot reject'}")
```

Example of what a combined report looks like (illustrative values):

```text
          Experiment |    e-value | Rejects alone?
--------------------------------------------------
  Team A (Marketing) |      1.842 |             no
    Team B (Product) |      1.213 |             no
       Team C (Data) |      2.107 |             no
--------------------------------------------------
    Combined e-value |      4.712 |             no

Rejection threshold (1/alpha): 20.0
Conclusion: cannot reject
```

In this illustration, none of the three teams rejects the hypothesis individually, yet the combined e-value (the product of the three) is larger than any single one. Collecting more data or adding the experiments of further teams could push it past the threshold. Keep in mind that the product only grows when the individual e-values exceed 1: with the uniform Beta(1,1) prior used in the code above, counts this close to the null CTR yield e-values below 1, so expect smaller numbers if you run it as-is, and consider a prior concentrated on plausible effect sizes.
The Key to Combining e-Values: Thanks to this composability, e-values fit naturally into meta-analyses, distributed experiments, and multi-arm tests. The important point is that, for independent experiments, the product is itself a valid e-value without any additional statistical adjustment.
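Why plain multiplication is enough can be written out in two lines. This is a short derivation under the assumption that the e-values $E^{(1)}, \dots, E^{(K)}$ come from independent experiments (the superscript notation is introduced here only for illustration):

$$\mathbb{E}_{H_0}\left[\prod_{k=1}^{K} E^{(k)}\right] = \prod_{k=1}^{K} \mathbb{E}_{H_0}\left[E^{(k)}\right] \leq 1$$

$$P_{H_0}\left(\prod_{k=1}^{K} E^{(k)} \geq \frac{1}{\alpha}\right) \leq \alpha \, \mathbb{E}_{H_0}\left[\prod_{k=1}^{K} E^{(k)}\right] \leq \alpha$$

The first line uses independence to factor the expectation, and the second is Markov's inequality applied to the product. Without independence the first equality can fail; in that case averaging the e-values, rather than multiplying them, still gives a valid e-value.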
The Most Common Mistakes in Practice
1. Deciding on early stopping with running_intersection=False
With `running_intersection=False`, the confidence interval may narrow and then widen again, so a "significant" verdict can be overturned later. In practice, using `running_intersection=True` is strongly recommended: it makes the interval shrink monotonically, which keeps conclusions consistent over time.
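Conceptually, a running intersection replaces each interval with the intersection of all intervals seen so far. The sketch below shows that idea on made-up bounds; it is an illustration of the concept, not the library's internal code, and the helper name is hypothetical.

```python
import numpy as np

def running_intersection(lower, upper):
    """Intersect each interval with all earlier ones: running max of lowers, running min of uppers."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return np.maximum.accumulate(lower), np.minimum.accumulate(upper)

# Made-up raw bounds in which the interval briefly widens again at the 4th look
raw_lower = np.array([-0.20, -0.05, 0.01, -0.02, 0.00])
raw_upper = np.array([0.30, 0.15, 0.09, 0.11, 0.08])

lo, hi = running_intersection(raw_lower, raw_upper)
print(lo)  # lower bounds: -0.2, -0.05, 0.01, 0.01, 0.01
print(hi)  # upper bounds: 0.3, 0.15, 0.09, 0.09, 0.08
```

Because the time-uniform guarantee says the true value lies in every interval simultaneously, it also lies in their intersection, so this step costs nothing in coverage. In the raw bounds, 0 is excluded at the third look and included again at the fourth; after intersecting, the exclusion persists.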
2. Treating the e-value as simply the reciprocal of a p-value
An e-value of 20 and a p-value of 0.05 are not interchangeable. An e-value is the factor by which a bettor's wealth would have multiplied when betting against the null hypothesis; it is not a probability. By Markov's inequality (and by Ville's inequality for e-processes), if the null hypothesis is true, the probability of the e-value reaching 20 is at most 5%, which is why $1/\alpha$ serves as the rejection threshold. The converse does not hold: a p-value of 0.05 typically corresponds to a much smaller e-value, so the calculation and interpretation differ fundamentally from those of the p-value.
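The asymmetry can be illustrated with two small conversion helpers. This is a sketch under stated assumptions: `e_to_p` uses the Markov bound $p = \min(1, 1/e)$, and `p_to_e` uses the calibrator family $e = \kappa p^{\kappa - 1}$ with $0 < \kappa < 1$ from the e-value literature. The function names and the choice $\kappa = 0.5$ are illustrative.

```python
def e_to_p(e: float) -> float:
    """Conservative p-value implied by an e-value (Markov's inequality)."""
    return min(1.0, 1.0 / e)

def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrator e = kappa * p**(kappa - 1); integrates to 1 over p in (0, 1)."""
    assert 0.0 < kappa < 1.0 and 0.0 < p <= 1.0
    return kappa * p ** (kappa - 1.0)

print(e_to_p(20.0))  # 0.05
print(p_to_e(0.05))  # roughly 2.24, far below 20
```

Going from e to p loses nothing beyond conservativeness, but going from p to e is costly: a p-value of exactly 0.05 calibrates to an e-value of only about 2.2 under this choice of $\kappa$.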
3. Applying a Gaussian Confidence Sequence Without Estimating the Variance
If the variance of the actual data differs from the assumed value, the time-uniform guarantee breaks. As in Example 2, using `betting_cs` lets you avoid specifying the variance at all. It is safer not to hardcode a variance parameter as if it were known.
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Optional Stopping | The type-I error rate remains below α even if the experiment is stopped based on interim results |
| Optional Continuation | If no conclusion is reached, the experiment can be extended. Multiplying the existing e-value by the new e-value provides valid joint evidence. |
| Continuous Monitoring | You can check results at every observation without a predetermined interim analysis schedule |
| Synthesizable | Multiplying the e-values of multiple independent experiments yields a valid joint e-value. It is advantageous for integrating dispersed experiments. |
| Non-parametric Extension | You can construct a valid confidence sequence with a betting martingale without the assumption of a normal distribution |
| Game Theoretical Interpretation | An intuitive interpretation is possible that "under the null hypothesis, it is impossible to make money through betting." |
Disadvantages and Precautions
| Item | Content | Response Plan |
|---|---|---|
| Loss of Efficiency | More data may be required than with fixed-sample methods for the same confidence level | Efficiency can be recovered through regularized e-processes (arXiv:2410.01427) or regression adjustment |
| Initial Interval Width | With few initial observations, the confidence interval is very wide, making decisions difficult | Use `running_intersection=True` and set a minimum observation count in advance |
| Complexity of Composite Null Hypotheses | Constructing an e-process for a composite null hypothesis is mathematically more involved than for a simple null | Use the mixture boundaries provided by the confseq library |
| Team Learning Curve | e-values and e-processes are less familiar than p-values, so building organizational understanding is hard | The visualizations at safestatistics.com and the ICML 2025 tutorial materials are recommended |
Optional Stopping vs. Optional Continuation: Optional Stopping refers to the freedom to stop at a desired point, while Optional Continuation refers to the freedom to continue if no results are obtained. Traditional statistics do not allow for either, but the SAVI framework allows both while maintaining error rate guarantees.
In Conclusion
e-process-based confidence sequences are a practical tool that mathematically guarantees being "valid at any time," and the approach has already been validated in production by Netflix and Adobe.
Here are 3 steps you can start right now.
- Install the library using `pip install confseq` and run the code from Example 1. To try it on your own data, put your metric differences into the `diffs` array; `betting_cs` is documented for observations in [0, 1], so rescale bounded metrics to that range first.
- Add code that computes the confidence sequence alongside your existing A/B test pipeline. We recommend building confidence by comparing it with the existing p-values rather than replacing the existing code from the start.

```python
# Add alongside the existing pipeline
from confseq.betting import betting_cs

# Existing code: p_value = ttest_ind(group_a, group_b).pvalue
# Code to add (assumes paired differences in [-1, 1], rescaled to [0, 1]):
scaled = (group_b - group_a + 1.0) / 2.0
lower_cs, upper_cs = betting_cs(
    x=scaled,
    alpha=0.05,
    running_intersection=True,
)
# On the rescaled axis, "no difference" corresponds to 0.5
is_significant_anytime = (lower_cs[-1] > 0.5) or (upper_cs[-1] < 0.5)
```

- Read the Netflix TechBlog's "Sequential A/B Testing Keeps the World Streaming" series (Part 1, Part 2). It gives a vivid look at the implementation details in a real production environment and at the adoption process within an organization.
Next Post: The Relationship Between Bayes Factors and e-values. A summary of how the two frameworks converge, where they diverge, and practical criteria for choosing between them.
Reference Materials
Introductory — Recommended Reading for Beginners
- E-values — Wikipedia
- The Stats Map: E-Process
- Safe Anytime-Valid Inference (SAVI) — CMU Statistics
- Netflix TechBlog: Sequential A/B Testing Keeps the World Streaming (Part 1)
- Netflix TechBlog: Sequential A/B Testing Keeps the World Streaming (Part 2)
Advanced — If you want to explore the mathematical foundations in greater depth
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science, 2023
- Always Valid Inference: Bringing Sequential Analysis to A/B Testing | arXiv:1512.04922
- Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance | Sequential Analysis, 2024 (arXiv preprint: arXiv:2310.03722)
- A tiny review on e-values and e-processes — Ruodu Wang (2023)
- ICML 2025 Tutorial: Game-theoretic Statistics and Sequential Anytime-Valid Inference
For Implementation — If you want to apply it directly at the code level
- GitHub: confseq — Confidence sequences and uniform boundaries
- GitHub: expectation — Python library for e-processes, e-values
- Anytime-Valid Confidence Sequences in an Enterprise A/B Testing Platform | ACM Web Conference 2023
- Anytime-Valid Linear Models and Regression Adjusted Causal Inference | Netflix Research
- Regularized e-processes: anytime valid inference with knowledge-based efficiency gains | arXiv:2410.01427
- E-values for Adaptive Clinical Trials | arXiv:2602.06379
- Prediction-Powered E-Values | OpenReview