How to Mathematically Allow A/B Test Peeking — Estimating Real-Time Effect Size with e-Process and Anytime-Valid Confidence Sequences

Anyone running an A/B test is tempted at least once. Intermediate results look promising, so why wait for thousands of additional data points? This is the peeking problem: if you check results midway and stop the experiment as soon as they look significant, the type-I error rate (the false-positive rate) spirals out of control. Simulation studies show that when p < 0.05 is checked after every observation for up to 1,000 observations, the actual type-I error rate climbs above 30% instead of staying at 5%. Traditional fixed-sample methods offer no protection in this situation.

However, Netflix, Adobe, and Spotify already run frameworks that solve this problem in production (Netflix TechBlog, 2024): confidence sequences and e-process-based anytime-valid inference. With the confseq library, an interval that is valid at any stopping time takes only a few lines of code. Note that betting_cs works on observations in [0, 1], so a bounded metric has to be rescaled first.

from confseq.betting import betting_cs
# scaled_diffs: metric differences rescaled to [0, 1]
lower, upper = betting_cs(x=scaled_diffs, alpha=0.05, running_intersection=True)

TL;DR — After reading this

You will understand why an e-process mathematically guarantees that the error rate is controlled no matter when data collection stops.
You will be able to compute and visualize confidence sequences with the confseq library.
You will be able to combine the results of multiple independent experiments safely by multiplying their e-values.

Recommended Prior Knowledge: An understanding of basic A/B testing concepts and p-values is sufficient. You can grasp the core concepts through game-theoretic intuition even without a background in martingales or measure theory.

Key Concepts

Limitations of Traditional Confidence Intervals: Why Peeking Is a Problem

The classic 95% confidence interval guarantees the following:

If the experiment is repeated infinitely using the same method, 95% of the intervals contain the true parameter.

This guarantee assumes the interval is computed exactly once, after data collection is complete. Checking the p-value midway, stopping if it is significant and continuing otherwise, breaks that premise: every additional look is another chance for the result to appear significant by luck, and those chances accumulate.
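The accumulation is easy to reproduce. The following is a minimal sketch, assuming a simplified A/A setting (standard-normal differences with known variance and a z-test); the function name and simulation sizes are illustrative choices, not part of any library.

```python
import math
import random

def false_positive_rate(n_max=500, n_sims=400, peek=True, z_crit=1.96, seed=0):
    """Fraction of simulated A/A tests (no true effect) that end up 'significant'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0
        for t in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)   # H0 holds: mean difference 0, sd 1 (assumed known)
            looking = peek or t == n_max   # peek=True tests after every observation
            if looking and abs(total / math.sqrt(t)) > z_crit:
                rejections += 1            # |z| > 1.96 corresponds to two-sided p < 0.05
                break                      # the experiment is stopped at this point
    return rejections / n_sims

rate_peeking = false_positive_rate(peek=True)
rate_single = false_positive_rate(peek=False)
print(f"check after every observation: {rate_peeking:.2f}")
print(f"single check at the end:       {rate_single:.2f}")
```

The single-look rate should land near the nominal 5%, while the peeking rate should be several times larger, even though no effect exists in either case.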

Confidence Sequence: An Interval That Remains Valid Whenever You Stop

A confidence sequence $\{C_t\}_{t \geq 1}$ provides a much stronger guarantee.

$$P\left(\forall t \geq 1 : \theta^* \in C_t\right) \geq 1 - \alpha$$

This formula says one thing: whether you look after 100 observations, after 10,000, or at any point in between, the probability that every interval in the sequence contains the true effect size $\theta^*$ is at least $1 - \alpha$ (95% for $\alpha = 0.05$).

The key is the $\forall t$: the guarantee holds simultaneously at every time point. This property is called time-uniform coverage.

Time-Uniform Coverage: A coverage guarantee that holds simultaneously across all possible stopping points, rather than a probability at a single point in time. It is an inherently stronger condition than the pointwise guarantee of traditional confidence intervals.

e-value and e-process: Game-Theoretic Measures of Evidence

The key tool for constructing a confidence sequence is the e-value.

Concept	Definition	Meaning
e-value	A non-negative random variable $E$ with $\mathbb{E}_{H_0}[E] \leq 1$	A game-theoretic statement: if the null hypothesis is true, a bet that starts with 1 unit cannot be expected to end with more than 1 unit
e-process	A non-negative sequence $\{E_t\}_{t \geq 1}$ with $\mathbb{E}_{H_0}[E_\tau] \leq 1$ at every stopping time $\tau$	A real-time measure of evidence that is updated as data accumulates
Difference from p-value	e-values from independent experiments can be combined by multiplication	p-values from independent experiments cannot be combined by simple multiplication
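The defining property $\mathbb{E}_{H_0}[E] \leq 1$ can be checked by direct enumeration. The sketch below uses the uniform-prior Bernoulli e-value that appears later in Example 3; the helper names and the choice n = 50 are illustrative.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def mixture_e_value(k, n, p0):
    """Uniform-prior marginal probability 1/(n+1) divided by the null probability."""
    return (1.0 / (n + 1)) / binom_pmf(k, n, p0)

def expected_e_value(n, p_true, p0):
    """Exact expectation of the e-value when the data come from Binomial(n, p_true)."""
    return sum(binom_pmf(k, n, p_true) * mixture_e_value(k, n, p0) for k in range(n + 1))

under_null = expected_e_value(n=50, p_true=0.10, p0=0.10)
under_alt = expected_e_value(n=50, p_true=0.30, p0=0.10)
print(f"E[e-value] when H0 is true:  {under_null:.6f}")
print(f"E[e-value] when H0 is false: {under_alt:.3e}")
```

Under the null hypothesis the expectation equals 1 (up to floating-point error), which is exactly the "no expected gain" condition. When the true rate differs from the null value, the expectation exceeds 1.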

The core guarantee of the e-process is Ville's inequality.

$$P\left(\exists t \geq 1 : E_t \geq 1/\alpha\right) \leq \alpha$$

Ville's Inequality Made Easy: If the coin is fair (the null hypothesis is true), then no matter how long you bet, the probability that your wealth ever reaches the threshold $1/\alpha$ is at most $\alpha$. Conversely, crossing the threshold is strong evidence that the coin is not fair. This is the mathematical basis of the sequential test.

(Mathematically, it is a time-uniform probability bound for non-negative supermartingales, that is, stochastic processes whose conditional expectation never increases. It can be viewed as a time-uniform version of Markov's inequality and is closely related to Doob's maximal inequality.)

Therefore, even if the null hypothesis is rejected at the first time point where $E_t \geq 1/\alpha$, the error rate is controlled to be less than or equal to $\alpha$.
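The coin-betting picture can be simulated directly. This is a rough sketch, assuming a fair coin and a fixed bet size of 0.5 (both arbitrary choices for illustration); the function names are hypothetical.

```python
import random

def ever_crosses(threshold, n_steps, bet, rng):
    """Play one betting game on a fair coin; report whether wealth ever reaches the threshold."""
    wealth = 1.0
    for _ in range(n_steps):
        x = 1.0 if rng.random() < 0.5 else 0.0   # fair coin, so H0 (mean 0.5) is true
        wealth *= 1.0 + bet * (x - 0.5)          # betting-martingale update with mu_0 = 0.5
        if wealth >= threshold:
            return True
    return False

def crossing_rate(alpha=0.05, n_paths=4000, n_steps=300, bet=0.5, seed=1):
    """Share of simulated paths whose wealth ever reaches 1/alpha."""
    rng = random.Random(seed)
    hits = sum(ever_crosses(1.0 / alpha, n_steps, bet, rng) for _ in range(n_paths))
    return hits / n_paths

rate = crossing_rate()
print(f"share of paths that ever reached 1/alpha = 20: {rate:.3f}")
```

Up to simulation noise, the share should not exceed α = 0.05, even though every path is checked after every single coin flip.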

Three Ways to Construct an e-Process

Constructions fall into three types, depending on how the game is designed so that it cannot be won in expectation under the null hypothesis $H_0$.

Betting Martingale (point null on the mean, $\mu = \mu_0$): For every observation $x_t$, choose a bet $\lambda_t$ using only past data and update the wealth multiplicatively. The bet must keep every factor non-negative. confseq.betting.betting_cs implements this method. $$E_t = \prod_{i=1}^{t}(1 + \lambda_i(x_i - \mu_0))$$
Mixture Martingale (composite alternative): Averages the likelihood ratio over a prior distribution $\pi(\theta)$ on the alternative parameter. It suits situations where you are testing for "something other than the null hypothesis" without knowing exactly which alternative holds. $$E_t = \int \prod_{i=1}^{t} \frac{f_\theta(x_i)}{f_0(x_i)} \, d\pi(\theta)$$
Exponential Martingale (exponential-family distributions): Derived naturally for the exponential family, the class of distributions whose densities can be written in exponential form, such as the normal and Bernoulli distributions. The method used to address the unknown-variance problem in 2024 falls into this category.

SAVI (Safe Anytime-Valid Inference) Framework: An umbrella framework that unifies these three constructions. By combining e-processes (hypothesis testing) with confidence sequences (effect-size estimation), it enables statistical inference that is valid at any point during data collection (Ramdas et al., Statistical Science 2023).

With the concepts in place, let's implement them in code.
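For the mixture martingale, the integral has a closed form in simple cases. As an assumed illustrative setting (not one of the article's examples), take unit-variance Gaussian observations, a null mean of 0, and a prior $\pi = N(0, \rho^2)$ on the alternative mean. The likelihood ratio is then $\exp(\theta S_t - t\theta^2/2)$ with $S_t = \sum_i x_i$, and integrating gives $E_t = (t\rho^2 + 1)^{-1/2} \exp\left(\rho^2 S_t^2 / (2(t\rho^2 + 1))\right)$. The sketch below checks this closed form against a plain numerical integration.

```python
import math

def mixture_e_closed_form(s, t, rho):
    """Closed-form mixture e-value: unit-variance Gaussian data, null mean 0, N(0, rho^2) prior."""
    v = t * rho**2 + 1.0
    return math.exp(rho**2 * s**2 / (2.0 * v)) / math.sqrt(v)

def mixture_e_numeric(s, t, rho, grid=20001, span=10.0):
    """Trapezoid-rule evaluation of the same integral over theta in [-span*rho, span*rho]."""
    lo, hi = -span * rho, span * rho
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        theta = lo + i * h
        likelihood_ratio = math.exp(theta * s - 0.5 * t * theta**2)  # prod f_theta(x_i) / f_0(x_i)
        prior = math.exp(-0.5 * (theta / rho) ** 2) / (rho * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * likelihood_ratio * prior * h
    return total

# Hypothetical data summary: 100 observations whose sum is 30 (running mean 0.3)
closed = mixture_e_closed_form(s=30.0, t=100, rho=1.0)
numeric = mixture_e_numeric(s=30.0, t=100, rho=1.0)
print(f"closed form: {closed:.4f}, numerical integral: {numeric:.4f}")
```

Because the e-value depends on the data only through $S_t$ and $t$, it can be updated in constant time per observation, which is what makes continuous monitoring cheap.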

Practical Application

Example 1: Real-Time Confidence Sequence Visualization with the `confseq` Library

First, install the library.

pip install confseq numpy matplotlib

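This is a click-through rate (CTR) A/B test scenario. We simulate Group A (CTR 10%) and Group B (CTR 13%) and track the confidence sequence in real time. Because betting_cs works on observations in [0, 1], the code rescales the paired differences from [-1, 1] before the call and maps the bounds back afterward.

python

import numpy as np
import matplotlib.pyplot as plt
from confseq.betting import betting_cs

# Reproducible simulation setup
rng = np.random.default_rng(42)

# True click-through rates: A = 0.10, B = 0.13 (a 3 percentage-point effect)
n_obs = 2000
clicks_a = rng.binomial(1, 0.10, n_obs)
clicks_b = rng.binomial(1, 0.13, n_obs)

# Sequence of observed differences (B - A), each in {-1, 0, 1}
# Null hypothesis H0: E[CTR difference] = 0
# This interpretation holds when both groups come from the same population
# through independent random assignment
diffs = clicks_b.astype(float) - clicks_a.astype(float)

# betting_cs works on observations in [0, 1]: map [-1, 1] -> [0, 1]
scaled = (diffs + 1.0) / 2.0

# e-process-based confidence sequence (time-uniform 95%)
lower_s, upper_s = betting_cs(
    x=scaled,
    alpha=0.05,
    running_intersection=True,  # interval shrinks monotonically over time (recommended in practice)
    parallel=False
)

# Map the bounds back to the original difference scale
lower = 2.0 * np.asarray(lower_s) - 1.0
upper = 2.0 * np.asarray(upper_s) - 1.0

# Visualization
t = np.arange(1, n_obs + 1)
running_mean = np.cumsum(diffs) / t

fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Top panel: confidence sequence
axes[0].fill_between(t, lower, upper, alpha=0.3, label='95% confidence sequence')
axes[0].plot(t, running_mean, 'b-', linewidth=1.5, label='Running mean difference')
axes[0].axhline(y=0.03, color='r', linestyle='--', label='True effect (0.03)')
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[0].set_xlabel('Number of observations (t)')
axes[0].set_ylabel('CTR difference (B - A)')
axes[0].set_title('e-process-based confidence sequence: valid whenever you stop')
axes[0].legend()

# Bottom panel: interval width (convergence pattern)
width = upper - lower
axes[1].plot(t, width, 'g-', linewidth=1.5)
axes[1].set_xlabel('Number of observations (t)')
axes[1].set_ylabel('Interval width')
axes[1].set_title('The interval narrows as samples accumulate')

plt.tight_layout()
plt.savefig('confidence_sequence.png', dpi=150)
plt.show()

# Real-time decision: find the first time the interval excludes 0
first_significant = np.where((lower > 0) | (upper < 0))[0]
if len(first_significant) > 0:
    t_stop = first_significant[0] + 1
    print(f"t={t_stop}: first significant difference detected")
    print(f"  Confidence interval: [{lower[t_stop-1]:.4f}, {upper[t_stop-1]:.4f}]")
    print(f"  Observations so far: {t_stop} ({t_stop/n_obs:.1%} of total)")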

Execution Result (illustrative; exact numbers depend on the random seed and library version):

t=847: first significant difference detected
  Confidence interval: [0.0021, 0.0589]
  Observations so far: 847 (42.4% of total)

In this run the effect was detected after about 42% of the 2,000 planned observations per group. A traditional fixed-sample design would have had to wait for all 2,000.

Code Element	Role
`betting_cs()`	Computes the betting-martingale confidence sequence
`running_intersection=True`	The interval shrinks monotonically over time (recommended in practice)
`alpha=0.05`	Time-uniform confidence level 1-α = 95%
`(lower > 0) | (upper < 0)` condition	The interval excludes 0, so the difference is declared significant

Example 2: Confidence Sequence for Continuous Metrics (Session Time, Revenue)

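The same approach applies to continuous metrics such as session time or order amount, not only to binary metrics like CTR. You do not need to know the variance, but the betting construction does need known bounds on each observation. The example below assumes session time is capped at 300 seconds (an illustrative bound; use whatever cap your logging pipeline enforces) and rescales the differences to [0, 1].

python

import numpy as np
from confseq.betting import betting_cs

rng = np.random.default_rng(0)

# True session time (seconds): Group A mean 120 s, Group B mean 125 s (a 5 s improvement)
true_effect = 5.0
n = 5000
max_session = 300.0  # assumed known cap on session time (illustrative)

obs_a = np.clip(rng.normal(120, 30, n), 0.0, max_session)  # sd 30 s; the variance is never passed in
obs_b = np.clip(rng.normal(125, 30, n), 0.0, max_session)
diffs = obs_b - obs_a  # bounded in [-max_session, max_session]

# Rescale to [0, 1], the range betting_cs works on
scaled = (diffs + max_session) / (2.0 * max_session)

# No variance input: the betting strategy adapts to the observed spread
lower_s, upper_s = betting_cs(
    x=scaled,
    alpha=0.05,
    running_intersection=True
)

# Map the bounds back to seconds
lower = 2.0 * max_session * np.asarray(lower_s) - max_session
upper = 2.0 * max_session * np.asarray(upper_s) - max_session

t_arr = np.arange(1, n + 1)
running_mean = np.cumsum(diffs) / t_arr

# Monitoring report at selected checkpoints
check_points = [100, 500, 1000, 2000, 5000]
print(f"{'t':>6} | {'Mean':>10} | {'Lower':>8} | {'Upper':>8} | {'Width':>8} | {'Has 0?':>8}")
print("-" * 65)
for t in check_points:
    idx = t - 1
    width = upper[idx] - lower[idx]
    contains_zero = "Yes" if lower[idx] <= 0 <= upper[idx] else "No"
    print(f"{t:>6} | {running_mean[idx]:>10.4f} | {lower[idx]:>8.4f} | "
          f"{upper[idx]:>8.4f} | {width:>8.4f} | {contains_zero:>8}")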

Execution Result (illustrative; exact numbers depend on the random seed and library version):

     t |       Mean |    Lower |    Upper |    Width |   Has 0?
-----------------------------------------------------------------
   100 |     5.2341 | -18.9432 |  29.4114 |  48.3546 |      Yes
   500 |     4.8912 |  -5.1203 |  14.9027 |  20.0230 |      Yes
  1000 |     5.1234 |   0.4312 |   9.8156 |   9.3844 |       No
  2000 |     4.9876 |   1.8234 |   8.1518 |   6.3284 |       No
  5000 |     5.0312 |   2.7891 |   7.2733 |   4.4842 |       No

Time	Expected Behavior
t = 100	Wide interval that still contains the true mean difference
t = 1000	Signal captured; interval narrow enough to support a decision
t = 5000	Width approaches that of a fixed-sample t-interval, though it stays somewhat wider
Full path	All intervals contain the true value simultaneously with probability at least 95%

Example 3: Combining Multiple e-Values from Independent Experiments

The most powerful property of the e-value is multiplicative composability: multiplying the e-values of several independent experiments yields a valid combined e-value. Combining p-values requires a dedicated procedure such as Fisher's method (a chi-squared transformation), whereas e-values can simply be multiplied.

python

import numpy as np
from scipy.stats import binom


def compute_evalue_bernoulli(successes: int, trials: int, null_p: float) -> float:
    """
    Bernoulli e-value (mixture likelihood under a Beta(1,1) = Uniform prior)

    Under a Beta(1,1) prior, the marginal probability of any success count
    out of n trials is 1/(n+1), so:
      log_marginal_alt = -log(trials + 1)
    """
    log_marginal_alt = -np.log(trials + 1)   # mixture alternative likelihood (closed form)
    log_null = binom.logpmf(successes, trials, null_p)
    return float(np.exp(log_marginal_alt - log_null))


# Three independent teams test the same hypothesis in separate experiments
experiments = [
    {"successes": 45, "trials": 400, "name": "Team A (Marketing)"},
    {"successes": 23, "trials": 200, "name": "Team B (Product)"},
    {"successes": 67, "trials": 600, "name": "Team C (Data)"},
]

null_ctr = 0.10   # null hypothesis: CTR = 10%
alpha = 0.05      # rejection threshold: 1/0.05 = 20

print(f"{'Experiment':>20} | {'e-value':>10} | {'Rejects?':>10}")
print("-" * 46)

combined_e = 1.0
for exp_data in experiments:
    e = compute_evalue_bernoulli(
        exp_data["successes"], exp_data["trials"], null_ctr
    )
    combined_e *= e
    reject = "Yes" if e >= 1 / alpha else "No"
    print(f"{exp_data['name']:>20} | {e:>10.3f} | {reject:>10}")

print("-" * 46)
print(f"{'Combined e-value':>20} | {combined_e:>10.3f} | "
      f"{'Yes' if combined_e >= 1/alpha else 'No':>10}")
print(f"\nRejection threshold (1/alpha): {1/alpha:.1f}")
print(f"Conclusion: {'reject the null hypothesis, effect detected' if combined_e >= 1/alpha else 'cannot reject'}")

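Execution Result (values rounded to three decimals):

          Experiment |    e-value |   Rejects?
----------------------------------------------
  Team A (Marketing) |      0.055 |         No
    Team B (Product) |      0.072 |         No
       Team C (Data) |      0.050 |         No
----------------------------------------------
    Combined e-value |      0.000 |         No

Rejection threshold (1/alpha): 20.0
Conclusion: cannot reject

None of the three experiments rejects on its own, and neither does the product. In fact every e-value here is below 1. The uniform prior spreads the alternative over every CTR from 0 to 1, so observed rates of 11 to 12% are explained better by the 10% null than by such a diffuse alternative. Multiplication compounds evidence in both directions, so the combined value (about 0.0002 before rounding) is smaller still. Reaching the threshold of 20 would take more data or an alternative concentrated near the effect sizes you actually expect.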

The Key to Combining e-Values: Thanks to this composability, e-values fit naturally into meta-analyses, experiments distributed across teams, and multi-arm tests. The combination itself is valid without any additional statistical adjustment, provided the experiments are independent.
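How much evidence a product of e-values accumulates depends on the alternative each e-value bets on. As a hedged illustration using the same three success counts as Example 3, the sketch below swaps the uniform-prior mixture for a simple likelihood ratio against one fixed alternative, p1 = 0.12. That value is an assumption made for this example and would have to be chosen before looking at the data.

```python
import math

def lr_e_value(successes, trials, p0, p1):
    """Likelihood-ratio e-value for H0: p = p0 against the fixed alternative p = p1."""
    log_e = (successes * math.log(p1 / p0)
             + (trials - successes) * math.log((1 - p1) / (1 - p0)))
    return math.exp(log_e)

experiments = [(45, 400), (23, 200), (67, 600)]  # (successes, trials) per team
p0 = 0.10  # null CTR
p1 = 0.12  # assumed alternative CTR, fixed in advance (illustrative)

e_values = [lr_e_value(k, n, p0, p1) for k, n in experiments]
combined = math.prod(e_values)

for (k, n), e in zip(experiments, e_values):
    print(f"{k}/{n}: e-value = {e:.3f}")
print(f"combined e-value = {combined:.3f} (threshold 20)")
```

Each e-value comes out at roughly 1.25 and the product is just under 2: larger than any single experiment, yet still far below the threshold of 20. Multiplication accumulates evidence; it does not manufacture it.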

The Most Common Mistakes in Practice

1. Deciding on Early Stopping with `running_intersection=False`

With running_intersection=False, the confidence interval can narrow and then widen again, so a verdict of "significant" may be reversed later. In practice, running_intersection=True is strongly recommended: the interval can then only shrink, which keeps conclusions consistent over time.
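Conceptually, a running intersection just intersects each interval with all earlier ones. The sketch below shows the operation on made-up bounds; it illustrates the idea and is not the library's internal code.

```python
from itertools import accumulate

def running_intersection(lower, upper):
    """Intersect each interval with all earlier ones: running max of lower, running min of upper."""
    return list(accumulate(lower, max)), list(accumulate(upper, min))

# Made-up interval bounds in which the fourth interval widens again
lower = [-0.30, -0.10, 0.01, -0.02, 0.02]
upper = [0.40, 0.25, 0.12, 0.15, 0.10]

lower_ri, upper_ri = running_intersection(lower, upper)
print(lower_ri)  # [-0.3, -0.1, 0.01, 0.01, 0.02]
print(upper_ri)  # [0.4, 0.25, 0.12, 0.12, 0.1]
```

The time-uniform guarantee is preserved because a parameter that lies in every interval also lies in every intersection. Once the intersected interval excludes 0, it keeps excluding 0.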

2. Treating the e-value as the Reciprocal of a p-value

An e-value of 20 is not interchangeable with a p-value of 0.05. An e-value is not a probability; it measures how many times a bet against the null hypothesis has multiplied its initial stake. Ville's inequality is what links it to the $1/\alpha$ threshold: if the null hypothesis is true, the probability of ever reaching an e-value of 20 is at most 5%. The conversion runs in one direction only. $1/E$ is a valid (conservative) p-value, but $1/p$ is generally not a valid e-value, so the two quantities have to be computed and interpreted differently.
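The asymmetry can be made concrete. The sketch below converts an e-value to a p-value with Markov's inequality, and converts a p-value to an e-value with one standard family of calibrators, $e = \kappa p^{\kappa - 1}$ for $0 < \kappa < 1$. The choice $\kappa = 0.5$ is an arbitrary illustration, and the function names are hypothetical.

```python
def e_to_p(e):
    """Markov's inequality: min(1, 1/e) is a valid (conservative) p-value."""
    if e <= 0.0:
        return 1.0
    return min(1.0, 1.0 / e)

def p_to_e(p, kappa=0.5):
    """Calibrator kappa * p**(kappa - 1); it integrates to 1 over p in (0, 1)."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    return kappa * p ** (kappa - 1.0)

p_from_e = e_to_p(20.0)
e_from_p = p_to_e(0.05)
print(f"e = 20   -> p <= {p_from_e:.3f}")
print(f"p = 0.05 -> e =  {e_from_p:.3f}")
```

An e-value of 20 certifies p ≤ 0.05, but a p-value of 0.05 calibrates to an e-value of only about 2.2 with this calibrator, far from 20.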

3. Applying a Gaussian Confidence Sequence Without Variance Estimation

A Gaussian confidence sequence takes the variance as an input. If the variance of the actual data differs from the assumed value, the time-uniform guarantee breaks. For bounded metrics, using betting_cs as in Example 2 avoids specifying the variance at all. Do not hardcode a variance parameter on the mistaken assumption that it is known.

Pros and Cons Analysis

Advantages

Item	Content
Optional Stopping	The type-I error rate stays at or below α even if the experiment is stopped based on interim results
Optional Continuation	If no conclusion is reached, the experiment can be extended. Multiplying the existing e-value by the e-value of the new data gives valid joint evidence.
Continuous Monitoring	Results can be checked after every observation without a predetermined interim-analysis schedule
Composability	Multiplying the e-values of multiple independent experiments yields a valid joint e-value, which helps when integrating experiments run in different places
Non-parametric Extension	A betting martingale gives a valid confidence sequence for bounded data without assuming a normal distribution
Game-Theoretic Interpretation	An intuitive reading is available: under the null hypothesis, no betting strategy is expected to make money

Disadvantages and Precautions

Item	Content	Response Plan
Loss of Efficiency	More data may be needed than with a fixed-sample method at the same confidence level	Efficiency can be partly recovered with a normalization e-process (arXiv:2410.01427) or regression adjustment
Initial Interval Width	With few observations the confidence interval is very wide, which makes early decisions difficult	Use `running_intersection=True` and set a minimum number of observations in advance
Complexity of Composite Null Hypotheses	Constructing an e-process for a composite null hypothesis is mathematically harder than for a simple one	Use the mixture boundaries provided by the `confseq` library
Team Learning Curve	e-values and e-processes are less familiar than p-values, so building organizational understanding takes effort	Use the visualizations at `safestatistics.com` and the ICML 2025 tutorial materials
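To get a feel for the efficiency cost, interval half-widths can be compared under an assumed model. The sketch below uses the Gaussian-mixture confidence sequence for unit-variance data, a different construction from betting_cs that is chosen here only because its radius has a closed form. The tuning parameter rho = 0.1 is an arbitrary assumption.

```python
import math

def mixture_cs_radius(t, alpha=0.05, rho=0.1):
    """Half-width of the two-sided Gaussian-mixture confidence sequence (unit variance)."""
    v = t * rho**2 + 1.0
    return math.sqrt(2.0 * v / (t**2 * rho**2) * math.log(math.sqrt(v) / alpha))

def fixed_ci_radius(t, z=1.96):
    """Half-width of the classical fixed-sample 95% z-interval (unit variance)."""
    return z / math.sqrt(t)

ratios = {}
for t in (10, 100, 1000, 10000):
    ratios[t] = mixture_cs_radius(t) / fixed_ci_radius(t)
    print(f"t={t:>6}: confidence-sequence width / fixed-sample width = {ratios[t]:.2f}")
```

The ratio is always above 1, which is the price of being valid at every stopping time. The value of rho controls where along the time axis that price is smallest.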

Optional Stopping vs. Optional Continuation: Optional Stopping refers to the freedom to stop at a desired point, while Optional Continuation refers to the freedom to continue if no results are obtained. Traditional statistics do not allow for either, but the SAVI framework allows both while maintaining error rate guarantees.

In Conclusion

e-process-based confidence sequences are a practical tool that comes with a mathematical guarantee of being valid at any time, and Netflix and Adobe have already validated the approach in production.

Here are three steps you can start on right now.

1. Install the library with pip install confseq and run the code from Example 1. Replace the diffs array with your own metric differences, rescaled to [0, 1] before the call.
2. Add the confidence-sequence calculation alongside your existing A/B test pipeline. Rather than replacing the existing code from day one, we recommend building confidence by comparing the result with your current p-values for a while.

# Add to the existing pipeline
from confseq.betting import betting_cs
# Existing code: p_value = ttest_ind(group_a, group_b).pvalue
# Added code (assumes each metric lies in [0, 1], so differences lie in [-1, 1]):
lower_cs, upper_cs = betting_cs(
    x=(group_b - group_a + 1) / 2,  # paired differences rescaled to [0, 1]
    alpha=0.05,
    running_intersection=True
)
# On the rescaled axis, a zero effect corresponds to 0.5
is_significant_anytime = (lower_cs[-1] > 0.5) or (upper_cs[-1] < 0.5)

3. Read the Netflix TechBlog series "Sequential A/B Testing Keeps the World Streaming" (Part 1, Part 2). It gives a vivid look at implementation details in a real production environment and at how the method was adopted within the organization.

Next Post: The Relationship Between Bayes Factors and e-values. It summarizes how the two frameworks converge, where they diverge, and the criteria for choosing between them in practice.

