Why It Is Safe to Stop A/B Testing — A Practical Guide to e-value and e-process
Have you ever looked at the results midway through running an A/B test? From the perspective of classical statistics, at that moment, the Type I error rate (false positive rate) is no longer guaranteed. Traditional hypothesis testing is based on the premise that the sample size is predetermined and the test is performed only once. However, in real-world software systems, data flows in real time, decisions must be made at any moment, and it is impossible to know in advance when to end an experiment.
This article is intended for data engineers or backend developers who are familiar with the concepts of p-values and confidence intervals. If you have ever found the constraint that a p-value must be viewed "only once" inconvenient in practice, this article will be particularly useful. After reading this, you will be able to construct your own statistically safe A/B test pipeline that allows you to look into the experiment midway.
The e-value and the e-process form the general theory of modern sequential inference that bridges exactly this gap: they guarantee Type I error control no matter when you stop, and they let heterogeneous tests be combined freely. We cover everything in order, from the mathematical foundations to practical R and Python code, and on to common pitfalls encountered in the field.
Key Concepts
e-value: "How much do you win betting against the null hypothesis?"
The quickest way to understand the e-value is through a betting analogy. If you bet $1 believing the null hypothesis ($H₀: no effect) is false, the amount you receive after seeing the experimental results is $E$. For this game to be "fair"—that is, for there to be no expected return when the null hypothesis is actually true—the expected value of the e-value must be at most 1. This is the mathematical definition of the e-value.
$$E \geq 0, \quad \mathbb{E}_{H_0}[E] \leq 1$$
Betting Score: If $E = 20$, it is strong evidence against the null hypothesis. If $E = 0.3$, it means that you lost more than your principal when betting on the alternative hypothesis, which is not evidence that the null hypothesis is true, but rather that the evidence for the alternative hypothesis is weak.
The most significant difference from the p-value is that the e-value controls the Type I error (the false positive rate at significance level $\alpha$) directly, via Markov's inequality.
$$P_{H_0}(E \geq 1/\alpha) \leq \alpha$$
Therefore, the null hypothesis can be rejected at significance level $\alpha$ the moment $E \geq 1/\alpha$. With $\alpha = 0.05$, the rejection threshold is 20.
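This Markov guarantee is easy to verify numerically. Below is a minimal sketch (our own illustration, not from any package): we simulate likelihood-ratio e-values under a true null and check how often they cross $1/\alpha = 20$.

```python
import numpy as np

# Sketch: empirically check the Markov guarantee P_{H0}(E >= 1/alpha) <= alpha
# for a simple likelihood-ratio e-value. Under H0 the data are N(0,1); the
# e-value is the likelihood ratio against N(delta, 1), which has E_{H0}[E] = 1.
rng = np.random.default_rng(0)
alpha, delta, n_obs, n_sims = 0.05, 0.3, 50, 20_000

x = rng.normal(0.0, 1.0, size=(n_sims, n_obs))    # H0 data
log_e = np.sum(delta * x - delta**2 / 2, axis=1)  # log of the product e-value
false_positive_rate = np.mean(np.exp(log_e) >= 1 / alpha)

print(f"empirical P(E >= 20) = {false_positive_rate:.4f}  (bound: {alpha})")
```

With these settings the empirical crossing rate lands well below 0.05; Markov's inequality guarantees it can never systematically exceed $\alpha$.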
e-process: e-value flowing over time
An e-process is a sequence of e-values $(E_t)_{t \geq 0}$ that remains a valid e-value at any stopping time $\tau$:
$$\mathbb{E}_{H_0}[E_\tau] \leq 1 \quad \text{(for every stopping time } \tau\text{)}$$
In the language of martingale theory, under the null hypothesis an e-process is a nonnegative supermartingale: a process whose expected value stays constant or decreases over time. This property is the mathematical basis for the guarantee that it is "safe to stop at any time."
Sequential update rule: whenever a new observation $x_t$ arrives, the e-process is updated as follows.
$$E_t = E_{t-1} \cdot e_t$$
Here, $e_t$ is the incremental e-value for the $t$-th observation. Thanks to this product structure, data can be updated online in a streaming environment without re-reading the entire dataset.
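In practice the running product can overflow or underflow on long streams, so implementations typically accumulate in log space. A minimal sketch (our own illustration, assuming the Gaussian incremental e-value $e_t = \exp(\delta x_t - \delta^2/2)$ used later in this article):

```python
import math

# Sketch: streaming e-process update done in log space so that very long
# streams cannot overflow or underflow the running product E_t = E_{t-1} * e_t.
# Assumes the Gaussian incremental e-value e_t = exp(delta*x_t - delta**2/2).
class EProcess:
    def __init__(self, delta: float):
        self.delta = delta
        self.log_e = 0.0  # log E_0 = log 1 = 0

    def update(self, x_t: float) -> float:
        """Fold in one observation and return the current e-value E_t."""
        self.log_e += self.delta * x_t - self.delta**2 / 2  # add log e_t
        return math.exp(self.log_e)

ep = EProcess(delta=0.3)
for x in [0.5, 1.2, -0.1, 0.9]:
    e_t = ep.update(x)
print(f"E_4 = {e_t:.3f}")  # → E_4 = 1.768
```

Because each step only adds one term to a running sum, the entire dataset never needs to be re-read, which is exactly the streaming property claimed above.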
Ville's inequality is what guarantees this.
$$P_{H_0}\left(\sup_{t \geq 0} E_t \geq \frac{1}{\alpha}\right) \leq \alpha$$
Ville's Inequality: An inequality that guarantees that the probability of an e-process path crossing the threshold $1/\alpha$ at least once is less than or equal to $\alpha$. It structurally resolves the "multiple testing inflation" problem that occurs in p-value-based continuous monitoring.
From these two properties, the e-process simultaneously guarantees the following.
- Optional stopping: the Type I error stays at or below $\alpha$ no matter when you choose to stop
- Optional continuation: if the test does not yet reject, you can collect more data and test again
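The contrast with naive p-value "peeking" can be made concrete by simulation. The sketch below is our own illustration under an assumed one-sample Gaussian setting (H0: mean 0 vs. alternative mean $\delta$, unit variance), not code from the article's examples:

```python
import math
import numpy as np

# Sketch: under a true null, compare the false positive rate of
# (a) an e-process stopped the moment E_t >= 1/alpha, and
# (b) "peeking" at a one-sided z-test p-value after every observation.
rng = np.random.default_rng(1)
alpha, delta, n_max, n_sims = 0.05, 0.3, 100, 2000

e_rejections = p_rejections = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)              # H0 is true in every run
    log_e = np.cumsum(delta * x - delta**2 / 2)  # log e-process path
    if np.any(log_e >= math.log(1 / alpha)):     # optional stopping at E_t >= 20
        e_rejections += 1
    t = np.arange(1, n_max + 1)
    z = np.cumsum(x) / np.sqrt(t)                # running z statistic
    p = np.array([math.erfc(v / math.sqrt(2)) / 2 for v in z])  # one-sided p
    if np.any(p < alpha):                        # reject if any peek "succeeds"
        p_rejections += 1

print(f"e-process false positive rate: {e_rejections / n_sims:.3f}")
print(f"p-value peeking false positive rate: {p_rejections / n_sims:.3f}")
```

Under continuous peeking the p-value's false positive rate inflates several-fold, while the e-process rate stays below $\alpha$ regardless of the stopping rule.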
Relationship with mSPRT: From Special Cases to General Theory
| Method | Key assumption | Combinable with other tests | Stopping flexibility |
|---|---|---|---|
| Fixed-sample t-test | Normality, fixed $n$ | No | None |
| SPRT | Simple alternative, single parameter | Limited | Some |
| mSPRT | Composite alternative + specific prior | Only within the same design | Some |
| e-value / e-process | None (distribution-free) | Freely, even across heterogeneous tests | Complete |
mSPRT (Mixture Sequential Probability Ratio Test) is a special case of the e-value: the mixture likelihood ratio, obtained by integrating the likelihood ratio over a prior distribution $\pi$, satisfies the e-value condition. mSPRT, adopted in Optimizely's Stats Engine, is a pioneering example of this paradigm, but it requires a single-parameter family and a specific prior. The e-value provides a much wider design space.
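To make the connection concrete, here is a sketch of the Gaussian mSPRT statistic as an e-value (the closed form below follows from a $N(0, \tau^2)$ prior on the mean; the function name is our own illustration):

```python
import numpy as np

# Sketch: the mSPRT mixture likelihood ratio as an e-value. For x_i ~ N(delta, 1)
# and a N(0, tau^2) prior on delta, the mixture integral has the closed form
#   E_n = (1 + n*tau^2)^(-1/2) * exp(tau^2 * S_n^2 / (2*(1 + n*tau^2))),
# with S_n the running sum. It satisfies E_{H0}[E_n] <= 1 at every n.
def msprt_e_value(x: np.ndarray, tau: float = 1.0) -> float:
    n, s = len(x), float(np.sum(x))
    return float(np.exp(tau**2 * s**2 / (2 * (1 + n * tau**2)))
                 / np.sqrt(1 + n * tau**2))

rng = np.random.default_rng(7)
x = rng.normal(0.4, 1.0, 120)  # data with a true effect
print(f"mSPRT e-value: {msprt_e_value(x):.1f}")
```

Because the prior is baked into the statistic, this e-value is only "optimal" for effects near the prior's scale $\tau$, which is exactly the restriction the general e-value framework removes.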
Free Combination of Heterogeneous Tests: The Multiplication Principle
The most powerful property of an e-value is that the product of independently generated e-values is a valid e-value.
$$E_{\text{combined}} = E_1 \times E_2 \times \cdots \times E_k$$
Here, "independently generated" means that each e-value is computed without sharing data with the others; this is the condition that matters, rather than independence in the strictest mathematical sense. When it is satisfied, the following combinations are mathematically valid.
- Evidence generated in different laboratories
- Different data types (continuous + binary)
- e-processes generated under different filtrations
```python
import numpy as np

# e-values from three independent experiments (no data shared between silos)
e1 = 4.2  # RCT A
e2 = 3.1  # observational study B
e3 = 2.8  # distributed silo C

# Combined e-value: guaranteed to be a valid e-value
e_combined = e1 * e2 * e3
print(f"combined e-value: {e_combined:.2f}")  # 36.46

alpha = 0.05
threshold = 1 / alpha  # 20.0
print(f"reject (α=0.05): {e_combined > threshold}")  # True
```

Filtration: a sequence of information sets that grows over time. When different measurement methods or data sources each carry their own filtration, an "adjust-then-combine" framework (a class of adjuster functions) is required to combine their e-processes (Choe et al., 2024).
Key Triangle
| Concept | Role | Classical counterpart |
|---|---|---|
| e-value | Evidence measure at a single point | p-value |
| e-process | e-value path evolving over time (supermartingale) | Test-statistic time series |
| Confidence sequence | Anytime-valid confidence interval built from an e-process | Confidence interval |
Now, let's translate these concepts into actual code. e-value designs do pay a power cost, but in many scenarios the advantages of optional stopping and heterogeneous combination more than offset it. We will work through three practical examples.
Practical Application
Why R first: at the time of writing, the only production-ready e-value implementation is the R safestats package. Python is used for the scenarios that are hard to cover from R (federated learning, streaming).
Example 1: Continuous Monitoring of Online A/B Tests
We are running an experiment to improve the conversion rate. While the classical t-test requires the sample size to be determined in advance, the e-process-based approach allows checking daily and stopping when sufficient evidence is available.
```r
library(safestats)

# Design stage: fix the minimum effect size to detect (deltaMin) and the significance level.
# deltaMin must be decided from business requirements before the experiment starts.
design <- designSafeT(
  deltaMin = 0.3,            # minimum meaningful effect size (Cohen's d)
  alpha = 0.05,
  alternative = "greater",
  testType = "twoSample"
)

# Update the e-value as data accumulates (can be called repeatedly)
set.seed(42)
control <- rnorm(150, mean = 0, sd = 1)
treatment <- rnorm(150, mean = 0.3, sd = 1)

result <- safeTTest(
  x = treatment,
  y = control,
  designObj = design
)

print(result)
cat("e-value:", result$eValue, "\n")
cat("reject:", result$eValue > 1/0.05, "\n")
```

If `eValue` in the output exceeds 20, a conclusion can be drawn immediately without collecting additional data. If it is below 20, the experiment can simply continue; with a p-value, a "continuous monitoring" error would already have been committed at this point.
| Code component | Role |
|---|---|
| `designSafeT()` | Designs the e-value test from the minimum effect size (once, before the experiment) |
| `safeTTest()` | Computes the e-value from the accumulated data (can be called repeatedly, at any time) |
| `result$eValue > 20` | Rejection condition at significance level 0.05 |
Visualizing the e-process path during the experiment allows for an intuitive identification of when sufficient evidence has been obtained.
```python
import numpy as np
import matplotlib.pyplot as plt

# Visualize the e-process path
# Update rule: E_t = E_{t-1} * e_t (e_t: incremental e-value of the t-th observation)
np.random.seed(42)
n = 200
data = np.random.normal(0.3, 1, n)  # simulate a treatment effect of δ=0.3
delta = 0.3  # minimum detectable effect fixed at the design stage

e, e_path = 1.0, []
for x in data:
    e_t = np.exp(delta * x - delta**2 / 2)  # Gaussian incremental e-value
    e *= e_t
    e_path.append(e)

fig, ax = plt.subplots(figsize=(10, 4))
ax.semilogy(e_path, color='steelblue', linewidth=1.5, label="e-value path")
ax.axhline(y=20, color='crimson', linestyle='--', linewidth=1.5,
           label="rejection threshold 1/α = 20 (α=0.05)")
ax.set_xlabel("cumulative number of observations")
ax.set_ylabel("e-value (log scale)")
ax.set_title("Continuous A/B test monitoring: the e-process path")
ax.legend()
plt.tight_layout()
plt.savefig("e_process_ab_test.png", dpi=150)
```

The point where the e-value trends upward on the log scale and crosses the threshold of 20 is the first moment at which the experiment can be stopped in a statistically safe way.
Example 2: Distributed Learning Environment — Combining Evidence Without Data Sharing
In federated learning or multi-silo environments, each node does not share raw data but transmits only e-values, which are combined centrally.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class SiloResult:
    silo_id: str
    e_value: float
    sample_size: int

def federated_e_combine(silo_results: List[SiloResult], alpha: float = 0.05) -> dict:
    """
    Multiply the e-values from each silo to perform a combined test.
    The Type I error guarantee is preserved without any data sharing.
    Precondition: every silo must compute its e-value independently,
    without sharing data with the other silos.
    """
    assert len(silo_results) > 0, "at least one silo result is required"
    e_combined = 1.0
    for silo in silo_results:
        e_combined *= silo.e_value
    threshold = 1.0 / alpha
    return {
        "e_combined": e_combined,
        "threshold": threshold,
        "reject_H0": e_combined > threshold,
        "total_samples": sum(s.sample_size for s in silo_results),
    }

# e-values computed independently at three hospital silos
hospital_results = [
    SiloResult("hospital-A", e_value=3.2, sample_size=80),
    SiloResult("hospital-B", e_value=2.7, sample_size=65),
    SiloResult("hospital-C", e_value=2.5, sample_size=90),
]

result = federated_e_combine(hospital_results, alpha=0.05)
print(f"combined e-value: {result['e_combined']:.2f}")  # 21.60
print(f"reject: {result['reject_H0']}")                 # True
print(f"total samples: {result['total_samples']}")      # 235
```

The combined e-value of 21.60 exceeds the threshold of 20, so the null hypothesis is rejected. What no single hospital could reject on its own becomes possible with nothing more than the product of three e-values, and without sharing any raw data.
Example 3: Streaming Anomaly Detection
This example detects change points in a time-series stream; the moment the e-process path crosses the threshold is the alarm.
```python
import numpy as np

def streaming_change_detection(stream: np.ndarray, alpha: float = 0.05) -> dict:
    """
    SPRT-style e-process anomaly detection.
    Optional stopping: the moment the threshold is crossed is the detection time.
    Update rule: E_t = E_{t-1} * exp(δ*x_t - δ²/2)
    """
    threshold = 1.0 / alpha  # 20.0
    e = 1.0
    e_path = [1.0]
    alarm_time = None
    # Baseline distribution: N(0,1); alternative: mean shift δ=0.5
    delta = 0.5
    for t, x in enumerate(stream, start=1):
        # incremental e-value under the Gaussian assumption
        e_t = np.exp(delta * x - delta**2 / 2)
        e = e * e_t
        e_path.append(e)
        if alarm_time is None and e > threshold:
            alarm_time = t
    return {
        "alarm_time": alarm_time,
        "final_e_value": e,
        "e_path": e_path,
    }

np.random.seed(0)
# first 200 points: normal regime N(0,1)
# next 100 points: anomalous regime N(0.5,1)
normal_stream = np.random.normal(0, 1, 200)
shift_stream = np.random.normal(0.5, 1, 100)
full_stream = np.concatenate([normal_stream, shift_stream])

result = streaming_change_detection(full_stream, alpha=0.05)
print(f"alarm time: t={result['alarm_time']}")
print(f"final e-value: {result['final_e_value']:.1f}")
```

An alarm raised after the change point (t=200) is a correct detection. The time it takes to reach the threshold of 20 is the detection delay; a larger delta detects faster but raises the false positive rate during the normal regime.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Optional stopping | Type I error $\alpha$ is guaranteed even if the experiment is stopped at any point |
| Optional continuation | If the test fails to reject, more data can be collected and the test repeated |
| Combining heterogeneous tests | Tests with different assumptions and filtrations can be combined freely by multiplication |
| Post-hoc significance level | The test remains valid even if $\alpha$ is changed after looking at the data |
| Assumption-free option | The universal inference e-process needs no strong assumptions such as normality |
| Intuitive interpretation | "An $E$-fold payoff from betting against the null hypothesis" is easy to explain even to non-statisticians |
| Multiple testing | Much weaker dependence conditions than p-values when controlling FWER and FDR |
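The multiple-testing advantage refers to procedures such as e-BH (the e-value analogue of Benjamini–Hochberg from Wang & Ramdas, JRSS-B 2022, cited in the references), whose FDR guarantee holds under arbitrary dependence between the e-values. A minimal sketch, our own illustration of that rule:

```python
import numpy as np

# Sketch of the e-BH procedure: reject the hypotheses with the k* largest
# e-values, where k* = max{k : k * e_(k) >= n / alpha} and e_(k) is the
# k-th largest e-value. Unlike BH on p-values, the FDR <= alpha guarantee
# needs no assumption on the dependence between the e-values.
def e_bh(e_values: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    n = len(e_values)
    order = np.argsort(e_values)[::-1]        # indices, largest e-value first
    sorted_e = e_values[order]
    k_values = np.arange(1, n + 1)
    valid = k_values * sorted_e >= n / alpha  # condition k * e_(k) >= n/alpha
    if not np.any(valid):
        return np.array([], dtype=int)
    k_star = k_values[valid].max()
    return np.sort(order[:k_star])            # indices of rejected hypotheses

e_vals = np.array([260.0, 1.2, 150.0, 0.4, 100.0, 3.0])
print("rejected:", e_bh(e_vals, alpha=0.05))  # → rejected: [0 2 4]
```

Note how large the e-values must be: with $n = 6$ hypotheses at $\alpha = 0.05$, even the single largest e-value must reach $n/\alpha = 120$ to be rejected on its own.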
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Power cost | Power can be lower than the optimal fixed-$n$ p-value test at the same $n$ | Design up front with `designSafeT()`; larger effects shrink the gap |
| e-value design complexity | Deriving an optimal e-value requires prior knowledge of the alternative | Universal inference eases the design burden |
| Immature software | No standard Python package; in R, mostly just safestats | Use the R safestats package or implement directly |
| Interpreting small e-values | $E_t < 1$ means weak evidence for the alternative, not acceptance of the null | Treat $E_t$ purely as a measure of evidence strength; design a separate test for null acceptance |
| Combining across filtrations | Configuring adjuster functions is tricky when filtrations differ | Follow the "adjust-then-combine" framework of Choe et al. (2024) |
| Lack of researcher familiarity | Extra communication cost to persuade reviewers and collaborators | Cite the Ramdas & Wang (2025) textbook in the methods section |
Universal inference: proposed by Wasserman et al. (PNAS 2020). Split the data in half, estimate the parameters on one half, and compute the split likelihood ratio on the other; this yields a valid e-value for essentially any hypothesis, with no normality or exponential-family assumptions.
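A sketch of the split-likelihood-ratio idea, using a Gaussian mean model purely as the running example (the method itself needs no such assumption; the function name is our own):

```python
import numpy as np

# Sketch of universal inference for H0: mu = 0 with unit-variance Gaussian
# data. Split the data in half, estimate mu on one half, then compute the
# likelihood ratio N(x; mu_hat, 1) / N(x; 0, 1) on the other half. The
# resulting statistic is a valid e-value with no regularity conditions.
def universal_e_value(x: np.ndarray, rng: np.random.Generator) -> float:
    idx = rng.permutation(len(x))
    d0, d1 = x[idx[: len(x) // 2]], x[idx[len(x) // 2 :]]
    mu_hat = float(np.mean(d0))                  # estimate on half 0
    # split likelihood ratio on half 1: log LR(x) = mu_hat*x - mu_hat^2/2
    log_lr = np.sum(mu_hat * d1 - mu_hat**2 / 2)
    return float(np.exp(log_lr))

rng = np.random.default_rng(3)
x = rng.normal(0.5, 1.0, 200)  # data with a true effect
print(f"universal inference e-value: {universal_e_value(x, rng):.1f}")
```

The price of this generality is power: half the data is spent on estimation, which is one concrete instance of the "power cost" listed in the table above.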
The Most Common Mistakes in Practice
- Designing the e-value from interim data: `deltaMin` in `designSafeT()` must be fixed from business requirements before the experiment. Choosing `deltaMin` from the same experiment's pilot data is circular reasoning (data dredging).
- Generating multiple e-values from the same data and multiplying them: this breaks independence and voids the validity of the combination. First check whether the silos or data splits actually overlap.
- Interpreting $E_t < 1$ as "proof of the null hypothesis": a low e-value means the evidence for the alternative is weak, not that the null is true. If an equivalence test is needed, a separate e-value must be designed for it.
In Conclusion
e-values are not a perfect alternative. You pay the cost of power, the Python ecosystem is still immature, and convincing your team requires energy. However, if you want to transform your A/B testing pipeline into one that is "okay to check anytime," you can start right now simply by logging e-values in parallel with your existing tests. It is perfectly fine to postpone the decision to switch until you have personally experienced the difference in power.
3 Steps to Start Right Now:
- Run your first safe t-test with the R `safestats` package: reanalyze your current A/B test data via `install.packages("safestats")` → `designSafeT(deltaMin = 0.5, alpha = 0.05)` → `safeTTest()`, and compare the p-value and e-value results side by side.
- Read Chapters 1–3 of the Ramdas & Wang (2025) textbook: the official arXiv version (arXiv:2410.23614) is free. In about 60 pages you can grasp the core theory, from the definition of the e-value and Ville's inequality to confidence sequences.
- Add e-process monitoring to your streaming pipeline: insert the `streaming_change_detection` function from Example 3 into a Kafka consumer loop or existing stream-processing logic, log the e-value alongside existing monitoring metrics, and track when the threshold of 20 is crossed.
Next Post: Advanced Confidence Sequences — Constructing Anytime-Valid Confidence Intervals and Real-Time Effect Size Estimation with e-Process
Reference Materials
Theoretical Background (Textbooks/Surveys)
- Hypothesis Testing with E-values | Ramdas & Wang, 2025 (arXiv:2410.23614)
- A tiny review on e-values and e-processes | Ruodu Wang, 2023
- Game-Theoretic Statistics and Safe Anytime-Valid Inference | Statistical Science, 2023
- Testing by Betting: A Strategy for Statistical Communication | Shafer, JRSS-A 2021
- E-values | Wikipedia
- The Stats Map — E-Process
R/Python Packages and Implementations
- CRAN: safestats package
- safestats vignette — Safe Flexible Hypothesis Tests
- ICML 2025 Tutorial: Game-theoretic Statistics and Sequential Anytime-Valid Inference
Advanced Thesis (by Topic)
- Universal Inference — PNAS 2020 | Wasserman et al.
- Combining Evidence Across Filtrations | arXiv:2402.09698
- Regularized e-processes | arXiv:2410.01427
- E-values for Adaptive Clinical Trials | arXiv:2602.06379
- False Discovery Rate Control with E-values | JRSS-B, 2022
- The e-value | Nieuw Archief voor Wiskunde, 2024