ML Uncertainty Quantification in Batch and Sequential Settings: What Changes When Integrating e-values into Conformal Prediction

One of the most uncomfortable moments when deploying ML models to production is being asked, "How much can we trust this prediction?" Accuracy and F1 scores only summarize past performance; they say nothing statistically binding about how confident the model is on the input in front of it. Conformal prediction tackles this problem head-on. It provides a guarantee of the form "the probability that the true label is contained in this prediction set is at least 95%" without strong assumptions about model architecture or data distribution. In 2025, arXiv:2503.13050, "E-Values Expand the Scope of Conformal Prediction," presented a systematic way to keep that validity under conditions where existing methods break down. In sequential settings where the number of batches is unknown, p-value-based methods need a correction that depends on the total number of looks and otherwise lose their guarantee, whereas e-values preserve validity at any stopping point.

This article targets ML engineers and backend developers who have encountered the concept of conformal prediction at least once, or who understand the need for uncertainty estimation but are implementing it for the first time. Topics covered: the core principles of e-values, Python implementations in batch and sequential settings, and fuzzy prediction sets. Topics not covered: optimal design of betting functions (paper-level theory), step-size theory for online conformal (planned for a future post). Statistical terms such as Markov's inequality and supermartingales appear, but each is accompanied by an intuitive explanation. If you want to focus on running the code rather than theoretical rigor, you can skip the blockquote sections and still follow the core flow.

Core Concepts

Conformal Prediction — Model-Agnostic Uncertainty Guarantees

The idea behind conformal prediction is simple. For a new input x, you measure how well a candidate label y "fits" with existing data (conformity) as a score, and then bundle the labels whose scores meet a certain threshold into a prediction set. Only one assumption is required.

Exchangeability: The condition that the joint distribution does not change even if the order of data points is shuffled arbitrarily. This is a weaker condition than i.i.d., and the validity guarantee of conformal prediction relies on this single assumption alone.

Under this exchangeability assumption, for a specified error rate α, the probability that the true label is included in the prediction set can be guaranteed to be at least 1−α.

python

import numpy as np
from sklearn.ensemble import RandomForestClassifier
 
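def split_conformal_predict(clf, cal_X, cal_y, test_X, alpha=0.05):
    """
    Basic structure of split conformal prediction (p-value approach).
    Computes nonconformity scores on the calibration set and returns,
    for each test point, the labels whose score is at or below the threshold.
    """
    # Nonconformity score from predicted probabilities: 1 - probability of the true class
    # (assumes cal_y holds integer labels 0..K-1 in the order of clf.classes_)
    cal_probs = clf.predict_proba(cal_X)
    cal_scores = 1 - cal_probs[np.arange(len(cal_y)), cal_y]
 
    # Threshold: the ceil((n+1)(1-α))-th smallest calibration score.
    # If n is too small for the requested α, no finite threshold exists, so keep every label.
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    threshold = np.sort(cal_scores)[k - 1] if k <= n else np.inf
 
    # Include every label whose test score is at or below the threshold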
    test_probs = clf.predict_proba(test_X)
    prediction_sets = []
    for probs in test_probs:
        scores = 1 - probs
        pred_set = np.where(scores <= threshold)[0].tolist()
        prediction_sets.append(pred_set)
 
    return prediction_sets
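
The finite-sample correction behind the threshold can be checked in isolation. The sketch below uses a hypothetical helper, `conformal_threshold`, that is not part of any library: it picks the ⌈(n+1)(1−α)⌉-th smallest calibration score and falls back to an infinite threshold when the calibration set is too small for the requested α. The uniform scores are an assumption made purely for illustration.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score (hypothetical helper)."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # 1-based rank of the order statistic
    if k > n:
        return np.inf  # too few calibration points: every label must be kept
    return scores[k - 1]

# Empirical check on synthetic, exchangeable scores
rng = np.random.default_rng(0)
alpha, n_cal, n_trials = 0.1, 100, 5_000
covered = 0
for _ in range(n_trials):
    cal = rng.uniform(size=n_cal)
    test = rng.uniform()
    covered += int(test <= conformal_threshold(cal, alpha))
empirical_coverage = covered / n_trials
print(f"empirical coverage: {empirical_coverage:.3f} (target >= {1 - alpha})")
```

With 100 calibration scores and α = 0.1 the selected rank is 91, so the coverage should land slightly above 90% rather than exactly on it.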

e-values — Evidence Strength Expressed as a Betting Multiplier

Where a p-value expresses "the probability of obtaining a result this extreme if the null hypothesis were true," an e-value expresses "the strength of evidence against the null hypothesis" as a betting multiplier. Mathematically, an e-value E is a non-negative random variable whose expected value under the null hypothesis H₀ is at most 1.

E[E | H₀] ≤ 1

From this simple property, three powerful characteristics emerge.

Property	Description	Practical Meaning
Multiplicative combination	The product E₁ × E₂ × ··· of independent e-values is also a valid e-value	Batch results can be combined by simple multiplication
Anytime validity	Guarantees are maintained at any arbitrary point without pre-specifying sample size	Sequential monitoring, early stopping possible
Fuzzy membership representation	Degree of label inclusion expressed as a continuous value in [0,1] instead of binary (0/1)	Handles ambiguous label environments
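
The defining property E[E | H₀] ≤ 1 can be checked numerically with a classic e-value, the likelihood ratio. The sketch below assumes a null of N(0, 1) and an alternative of N(1, 1); both distributions and the name `likelihood_ratio_evalue` are illustrative choices, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
mu_alt = 1.0  # assumed alternative mean, chosen only for this illustration

def likelihood_ratio_evalue(x, mu):
    """Density ratio N(mu, 1) / N(0, 1) evaluated at x."""
    return np.exp(mu * x - 0.5 * mu ** 2)

# Data generated under the null H0: X ~ N(0, 1)
x_null = rng.normal(0.0, 1.0, size=200_000)
e_null = likelihood_ratio_evalue(x_null, mu_alt)

mean_under_null = e_null.mean()
false_rejection_rate = np.mean(e_null >= 1 / alpha)
print(f"mean e-value under H0: {mean_under_null:.3f}")
print(f"P(E >= 1/alpha) under H0: {false_rejection_rate:.5f} (bound: {alpha})")
```

The sample mean sits near 1, and the share of e-values reaching 1/α stays far below α, which is the betting interpretation in miniature: under the null, the bet does not make money on average.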

The theoretical basis for anytime validity is Ville's inequality.

Ville's Inequality (intuition): "No matter how many times you look, the probability of a false rejection never exceeds α." Formally, for a non-negative supermartingale {Mₜ} with M₀ = 1, P(∃t: Mₜ ≥ 1/α) ≤ α holds. Under the null hypothesis, the cumulative product of e-values (each one valid given the past) forms such a supermartingale, so the guarantee holds at every point in time simultaneously.

Supermartingale: A stochastic process whose conditional expected value, given everything observed so far, never increases over time. Think of it as a gamble that is fair at best: "what you can expect going forward is no greater than what you have now." Because the cumulative product of e-values satisfies this property under the null hypothesis, the error rate stays at or below α no matter when you stop observing.
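
A short simulation makes the "look as often as you like" claim concrete. The sketch below is a toy setup, not the paper's algorithm: each step multiplies the running wealth by a likelihood-ratio e-value (N(0.5, 1) against N(0, 1), an assumed pair), the data are drawn under the null, and the code counts how many paths ever reach 1/α within 100 steps.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05
mu = 0.5                      # assumed alternative mean for the per-step bet
n_paths, n_steps = 10_000, 100

# Under the null each factor exp(mu*x - mu^2/2) has expectation exactly 1
x = rng.normal(0.0, 1.0, size=(n_paths, n_steps))
factors = np.exp(mu * x - 0.5 * mu ** 2)

# Running product of e-values = the wealth process (a non-negative martingale under H0)
wealth = np.cumprod(factors, axis=1)

# Check after EVERY step whether the threshold 1/alpha was reached at least once
ever_crossed = (wealth >= 1 / alpha).any(axis=1)
crossing_rate = ever_crossed.mean()
print(f"paths that ever reached 1/alpha: {crossing_rate:.4f} (Ville bound: {alpha})")
```

Even though every path is inspected 100 times, the share of false alarms should stay at or below α up to simulation noise, with no correction for the number of looks.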

Conformal e-Prediction — e-values as the Core Test Statistic

Where traditional methods turn nonconformity scores into p-values, conformal e-prediction computes an e-value from the calibration set for each candidate label y. A large e-value is evidence against that label, so the prediction set at level α consists of the labels whose e-value stays below 1/α.

The reason this threshold is valid lies in Markov's inequality.

Markov's Inequality (intuition): "A random variable with a small expected value is unlikely to take on large values." For a non-negative random variable E, P(E ≥ c) ≤ E[E]/c holds. Since E[E] ≤ 1 under the null hypothesis, we get P(E ≥ 1/α) ≤ α. That is, rejecting the null hypothesis when the e-value is at least 1/α comes with a guarantee that the error rate does not exceed α; equivalently, 1/E is a valid (conservative) p-value. The reverse does not hold in general: for a uniformly distributed p-value the expectation of 1/p is infinite, so 1/p is not a genuine e-value. Thresholding 1/p at 1/α is still a valid single test because it is the same decision as p ≤ α, but products of 1/p values carry no guarantee. To combine evidence across batches, convert p-values with a calibrator such as κ·p^(κ−1) with 0 < κ < 1, or design a betting function suited to the data distribution to increase power.
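
The difference between 1/p and a calibrated conversion shows up in a deterministic check. The sketch below evaluates both on a fine midpoint grid over (0, 1), used here as a stand-in for uniformly distributed p-values. The helper name `p_to_e` and the choice κ = 0.5 are assumptions of this example.

```python
import numpy as np

def p_to_e(p, kappa=0.5):
    """Calibrator f(p) = kappa * p**(kappa - 1); it integrates to 1 over [0, 1] for 0 < kappa < 1."""
    p = np.asarray(p, dtype=float)
    return kappa * p ** (kappa - 1.0)

# Midpoint grid on (0, 1): averaging over it approximates an expectation under uniform p
m = 1_000_000
grid = (np.arange(m) + 0.5) / m

mean_calibrated = p_to_e(grid).mean()   # approximates the integral of f, which is 1
mean_reciprocal = (1.0 / grid).mean()   # keeps growing as the grid gets finer

print(f"average of kappa * p^(kappa-1): {mean_calibrated:.4f}")
print(f"average of 1/p on this grid:    {mean_reciprocal:.2f}")
```

The calibrated average stays at about 1, while the average of 1/p grows roughly like the logarithm of the grid size, which is the numerical face of its infinite expectation.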

Here is a summary of how conformal e-prediction differs from traditional conformal prediction:

	Traditional Conformal Prediction (p-value)	Conformal e-Prediction (e-value)
Core test statistic	p-value	e-value (betting multiplier)
Unknown number of batches	Requires union bound	Bypassed via multiplicative combination
Sequential anytime validity	Limited	Natively supported
Handling label ambiguity	Binary (0/1) only	[0,1] fuzzy membership
Cross-conformal theoretical guarantee	Can break under excessive randomization	Theoretically guaranteed

python

import numpy as np
from typing import List
 
def compute_conformal_evalue(cal_scores: np.ndarray, test_score: float) -> float:
    """
    Computes the evidence score 1/p from calibration and test nonconformity scores.
 
    Intuition: expresses how extreme the test score is relative to the
    calibration set as a "multiplier".
    Caveat: thresholding 1/p at 1/α is the same decision as p <= α, so a
    single test stays valid, but 1/p is not a genuine e-value (its null
    expectation is unbounded). Do not multiply these values across batches;
    use a calibrator or a betting function for that.
    """
    n = len(cal_scores)
    # Share of calibration scores at least as large as the test score → p-value
    p_value = (np.sum(cal_scores >= test_score) + 1) / (n + 1)
    return 1.0 / p_value
 
 
def conformal_evalue_predict(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                              test_X: np.ndarray,
                              alpha: float = 0.05) -> List[List[int]]:
    """
    Builds prediction sets from the per-label evidence scores.
    A label is kept unless the evidence against it reaches 1/alpha.
    """
    cal_probs = clf.predict_proba(cal_X)
    test_probs_all = clf.predict_proba(test_X)  # batch inference, computed once
    n_classes = cal_probs.shape[1]
 
    prediction_sets = []
    for test_probs in test_probs_all:
        pred_set = []
        for y_candidate in range(n_classes):
            cal_scores_y = 1 - cal_probs[cal_y == y_candidate, y_candidate]
            test_score_y = 1 - test_probs[y_candidate]
 
            # Labels with no calibration examples are skipped in this sketch
            if len(cal_scores_y) > 0:
                e_val = compute_conformal_evalue(cal_scores_y, test_score_y)
                if e_val < 1.0 / alpha:
                    pred_set.append(y_candidate)
 
        prediction_sets.append(pred_set)
 
    return prediction_sets
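
For comparison, an e-value can also be built directly from the scores, without going through a p-value. The sketch below uses a ratio-style construction: the test score divided by the average of all n + 1 scores. Under exchangeability each of the n + 1 scores contributes the same expected share of the total, so the expectation is 1. Treat the exact form, and the names `ratio_evalue` and `ratio_prediction_set`, as assumptions of this example and check them against the paper before relying on them.

```python
import numpy as np

def ratio_evalue(cal_scores, test_score):
    """(n + 1) * s_test / (sum of all n + 1 scores); scores must be non-negative."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    total = cal_scores.sum() + test_score
    if total <= 0:
        return 1.0  # every score is zero: neutral evidence
    return (len(cal_scores) + 1) * test_score / total

def ratio_prediction_set(cal_scores, candidate_scores, alpha=0.1):
    """Keep each candidate label whose e-value stays below 1/alpha."""
    return [y for y, s in enumerate(candidate_scores)
            if ratio_evalue(cal_scores, s) < 1.0 / alpha]

# Sanity check on exchangeable synthetic scores: the mean e-value should be close to 1
rng = np.random.default_rng(0)
scores = rng.uniform(size=(20_000, 51))  # 50 calibration scores + 1 test score per row
e_values = np.array([ratio_evalue(row[:-1], row[-1]) for row in scores])
mean_evalue = e_values.mean()
print(f"mean e-value under exchangeability: {mean_evalue:.3f}")

# A well-calibrated model: small calibration scores, one plausible and one implausible label
example_set = ratio_prediction_set([0.05] * 99, candidate_scores=[0.05, 1.0], alpha=0.1)
print(example_set)
```

Sets built this way tend to be wider than p-value-based sets at the same α, which is the price paid for being able to combine the e-values afterwards.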

Practical Applications

Example 1: Batch Anytime-Valid Conformal Prediction (Sequential Batch Scenario)

Consider a hospital scenario where clinical data for a new drug arrives sequentially in batches. Regulatory agencies, not knowing how many hospitals will contribute data, want statistically valid conclusions at any point in time. Traditional union bounds require knowing the total number of batches K, but the multiplicative property of e-values allows us to bypass this.

python

import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
 
# ── Data preparation ─────────────────────────────────────────────
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.6, random_state=42
)
X_cal, X_test_all, y_cal, y_test_all = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
 
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
 
# Sequential batch simulation: data arrives from 5 hospitals one after another
n_batches = 5
batch_size = len(X_test_all) // n_batches
sequential_batches: List[Tuple[np.ndarray, np.ndarray]] = [
    (
        X_test_all[i * batch_size:(i + 1) * batch_size],
        y_test_all[i * batch_size:(i + 1) * batch_size],
    )
    for i in range(n_batches)
]
 
 
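# ── Batch e-value computation ────────────────────────────────────
 
def compute_batch_evalue(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                          batch_X: np.ndarray, batch_y: np.ndarray,
                          kappa: float = 0.5) -> float:
    """
    Computes one e-value for a batch.
 
    Implementation: each conformal p-value is converted with the calibrator
    kappa * p**(kappa - 1), whose null expectation is at most 1 (plain 1/p
    is not a valid e-value). The batch e-value is the geometric mean of the
    per-sample values. The geometric mean never exceeds the arithmetic mean,
    and an average of e-values is again an e-value, so the result is valid
    for any batch size.
 
    ⚠️ Caution: the batch algorithm in arXiv:2503.13050 explicitly optimizes
    a betting function, so this simple version may have low power. Every
    batch also reuses the same calibration scores, so the batch e-values are
    not independent and the product across batches is not covered by the
    independence argument. Check the original paper before production use.
    """
    cal_probs = clf.predict_proba(cal_X)
    cal_scores = 1 - cal_probs[np.arange(len(cal_y)), cal_y]
 
    batch_probs = clf.predict_proba(batch_X)
    batch_scores = 1 - batch_probs[np.arange(len(batch_y)), batch_y]
 
    n_cal = len(cal_scores)
    p_values = np.array([
        (np.sum(cal_scores >= s) + 1) / (n_cal + 1)
        for s in batch_scores
    ])
 
    # Geometric mean of the calibrated per-sample e-values (normalizes for batch size)
    individual_evalues = kappa * p_values ** (kappa - 1.0)
    batch_evalue = float(np.exp(np.mean(np.log(individual_evalues))))
    return batch_evalue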
 
 
# ── Anytime monitor ──────────────────────────────────────────────
 
@dataclass
class BatchEValueMonitor:
    """
    Tracks the cumulative product of e-values for data that arrives
    sequentially in batches, so the test can be run at any time
    (anytime validity).
    """
    alpha: float = 0.05
    cumulative_evalue: float = 1.0
    batch_history: List[float] = field(default_factory=list)
 
    def update(self, batch_evalue: float) -> dict:
        """
        Updates the cumulative value with a new batch e-value.
        Returns whether the null is rejected and the current effective error level.
        """
        self.cumulative_evalue *= batch_evalue
        self.batch_history.append(batch_evalue)
 
        # Ville's inequality: P(∃t: E_cumul ≥ 1/α) ≤ α under the null
        reject = self.cumulative_evalue >= (1.0 / self.alpha)
        # Guard against a zero product before inverting
        current_alpha = (1.0 / self.cumulative_evalue
                         if self.cumulative_evalue > 0 else 1.0)
 
        return {
            "batch_num": len(self.batch_history),
            "batch_evalue": batch_evalue,
            "cumulative_evalue": self.cumulative_evalue,
            "reject_null": reject,
            "current_effective_alpha": min(current_alpha, 1.0),
        }
 
 
# ── Sequential run ───────────────────────────────────────────────
 
monitor = BatchEValueMonitor(alpha=0.05)
 
for batch_X, batch_y in sequential_batches:
    e_val = compute_batch_evalue(clf, X_cal, y_cal, batch_X, batch_y)
    result = monitor.update(e_val)
 
    print(f"Batch {result['batch_num']:2d} | "
          f"e-value: {result['batch_evalue']:.3f} | "
          f"cumulative: {result['cumulative_evalue']:.3f} | "
          f"reject: {result['reject_null']}")
 
    if result["reject_null"]:
        # The null tested here: batch scores are exchangeable with the calibration scores
        print("→ Null hypothesis rejected: statistically significant deviation from the calibration data")
        break  # the guarantee already obtained stays valid even if more batches arrive

Summary of the Batch e-value Accumulation Process

Step	Action	Key Point
Build calibration set	Obtain nonconformity score distribution from initial data	Can be shared across batches or use a sliding window
Compute batch e-value	Convert each batch's p-values to e-values with a calibrator (plain 1/p is not valid for products)	Power varies depending on betting strategy
Update cumulative product	`E_cumul *= E_batch`	Validity maintained without knowing the number of batches K
Anytime test	Reject if `E_cumul ≥ 1/α`	α-level guarantee regardless of when you stop
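
Multiplying per-batch values is only safe when every factor has a null expectation of at most 1. The simulation below is a sketch under the assumption of exactly uniform p-values, with κ = 0.5 chosen arbitrarily. It feeds data that satisfy the null into two accumulators: one multiplies geometric means of plain 1/p, the other multiplies geometric means of calibrated e-values.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, kappa = 0.05, 0.5
n_batches, batch_size = 5, 200

naive_product, calibrated_product = 1.0, 1.0
for _ in range(n_batches):
    p = 1.0 - rng.uniform(size=batch_size)  # null holds: p-values uniform on (0, 1]
    naive = np.exp(np.mean(np.log(1.0 / p)))                          # geometric mean of 1/p
    calibrated = np.exp(np.mean(np.log(kappa * p ** (kappa - 1.0))))  # geometric mean of calibrated e-values
    naive_product *= naive
    calibrated_product *= calibrated

print(f"product of 1/p batches:        {naive_product:.1f}")
print(f"product of calibrated batches: {calibrated_product:.3f}")
print(f"rejection threshold 1/alpha:   {1 / alpha:.0f}")
```

Although nothing is wrong with the data, the 1/p accumulator grows by a factor of about e per batch and crosses 1/α within a few batches, while the calibrated accumulator shrinks below 1, as an anytime-valid monitor should under the null.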

Example 2: Fuzzy Prediction Sets — Handling Ambiguous Labels

In radiology image interpretation, when multiple experts reach different diagnoses, it is difficult to define which label is the "correct" one in binary terms. Fuzzy prediction sets naturally accommodate this ambiguity by expressing label membership as a continuous value in the [0, 1] range.

python

import numpy as np
from typing import List, Optional, Dict
 
def compute_fuzzy_membership(e_value: float, threshold: float = 1.0) -> float:
    """
    Converts an e-value into a fuzzy membership value in [0, 1].
 
    A large e-value is evidence against the label, so membership falls as
    the e-value grows: 1 - log(1 + (e_value - threshold)) / log(10)
    - at or below threshold → 1.0 (no evidence against the label)
    - e_value = threshold + 9 or more → 0.0 (saturation point, fully excluded)
    - the log scale compresses large e-values, which stabilizes extremes
 
    The saturation point (+9) is an arbitrary constant of this heuristic.
    Where labels should drop out sooner, as in strict medical diagnosis,
    lower it to np.log1p(4.0); for widely spread data, raise it.
    """
    if e_value <= threshold:
        return 1.0
    return float(max(0.0, 1.0 - np.log1p(e_value - threshold) / np.log1p(9.0)))
 
 
def fuzzy_conformal_predict(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                             test_X: np.ndarray,
                             ambiguous_labels: Optional[np.ndarray] = None
                             ) -> List[Dict[int, float]]:
    """
    Builds fuzzy conformal prediction sets (a heuristic sketch).
 
    ambiguous_labels: soft-label matrix of shape (n_cal_samples, n_classes).
                      Each cell is the expert agreement rate (0 to 1) for that class.
                      If None, hard labels are treated as one-hot.
    Returns: one {label: membership} dictionary per test sample.
    """
    cal_probs = clf.predict_proba(cal_X)
    test_probs_all = clf.predict_proba(test_X)  # batch inference, computed once
    n_classes = cal_probs.shape[1]
 
    if ambiguous_labels is None:
        ambiguous_labels = np.eye(n_classes)[cal_y]
 
    results = []
    for test_probs in test_probs_all:
        memberships: Dict[int, float] = {}
 
        for y_candidate in range(n_classes):
            soft_weights = ambiguous_labels[:, y_candidate]
            cal_scores = 1 - cal_probs[:, y_candidate]
 
            valid_mask = soft_weights > 0
            if valid_mask.sum() < 5:  # require a minimum number of samples
                memberships[y_candidate] = 0.0
                continue
 
            test_score = 1 - test_probs[y_candidate]
            # Agreement-weighted p-value: low-agreement samples count for less.
            # (Scaling the scores themselves would distort the calibration distribution.)
            w = soft_weights[valid_mask]
            exceed = cal_scores[valid_mask] >= test_score
            p_val = (np.sum(w * exceed) + 1.0) / (np.sum(w) + 1.0)
            # 1/p serves as the evidence score; it is never multiplied across samples
            memberships[y_candidate] = compute_fuzzy_membership(1.0 / p_val)
 
        results.append(memberships)
 
    return results
 
 
# ── Usage example: medical diagnosis with labels from 3 experts ──
 
n_classes = clf.n_classes_
n_experts = 3
 
# Simulated expert labels; in practice these come from an annotation tool
np.random.seed(42)
expert_labels = np.random.randint(0, n_classes, size=(len(y_cal), n_experts))
 
# Build the per-class soft-label matrix
# ambiguous[i, c] = share of experts who assigned sample i to class c
ambiguous = np.zeros((len(y_cal), n_classes))
for c in range(n_classes):
    ambiguous[:, c] = np.mean(expert_labels == c, axis=1)
# Each row sums to 1.0 (agreement rates across classes)
 
fuzzy_preds = fuzzy_conformal_predict(
    clf, X_cal, y_cal, X_test_all[:10],
    ambiguous_labels=ambiguous,
)
 
for i, memberships in enumerate(fuzzy_preds[:3]):
    print(f"\nTest sample {i}:")
    for label, membership in sorted(memberships.items(), key=lambda x: -x[1]):
        bar = "█" * int(membership * 20)
        print(f"  class {label}: {membership:.3f} |{bar:<20}|")

Summary of the Fuzzy Prediction Set Construction Process

Step	Action	Key Point
Build soft label matrix	Create (n_samples, n_classes) matrix from expert agreement rates	Normalize so each row sums to 1
Weight calibration scores	Weight each calibration score by its expert agreement rate	Samples with lower agreement have reduced influence
Compute e-value	Derive the e-value from the weighted score distribution	A larger e-value means stronger evidence against the candidate label
Fuzzy membership transformation	Convert to [0,1] via `compute_fuzzy_membership`	Saturation threshold can be tuned to the domain

Strengths and Weaknesses

Strengths

Item	Description	When It Is Especially Useful
Simultaneous guarantee with unknown number of batches	Provides valid coverage even without knowing K, via multiplicative combination of e-values	Multi-site clinical trials, batch A/B testing
Anytime validity	Statistical guarantees maintained at any arbitrary point in time by Ville's inequality	Sequential experiments with early stopping
Fuzzy prediction sets	Accommodates label ambiguity with continuous [0,1] values instead of binary inclusion	Environments with disagreeing expert annotations
Easy design of conditional predictors	Easier to construct per-input customized prediction sets than with p-value methods	Cases requiring per-subgroup guarantees
Cross-conformal validity	While p-value-based cross-conformal can break under excessive randomization, e-value-based is theoretically guaranteed	Cases with limited data requiring cross-validation
Handling ambiguous labels	Valid coverage guaranteed in environments with annotation noise or label ambiguity	Medical imaging, emotion recognition, etc.

Weaknesses and Caveats

Item	Description	Mitigation
Exchangeability dependence	Assumption violated under strong temporal dependence (AR processes)	Sliding window recalibration, weighted conformal
Conservative prediction sets	Prediction sets can be wider than p-value methods with small calibration sets	Ensure sufficient calibration data (recommended n ≥ 200)
Computational cost	Full conformal requires model retraining for each candidate label	Use split conformal e-prediction to reduce computation
Software immaturity	Not natively supported in major libraries such as MAPIE or crepes	Refer to paper authors' GitHub code, implement directly
Distribution shift	Calibration set may become mismatched when distribution changes between batches	Use weighted e-values, adaptive recalibration strategies
Betting function design	Plain `1/p` is not a genuine e-value, and calibrated conversions are valid but may have low power	Use a calibrator for validity, then optimize the betting function based on domain knowledge (see arXiv:2503.13050)

Most Common Mistakes in Practice

Not separating the calibration set from the training data: Computing nonconformity scores on the same data used to train the model invalidates the coverage guarantee. Always keep a separate held-out calibration set.
Treating the reciprocal of a p-value as an e-value: 1/p gives a valid single test at threshold 1/α, but it is not a genuine e-value, so multiplying such values across batches breaks the guarantee. Use a calibrator or a betting function tailored to the data; a well-chosen one also increases power and avoids unnecessarily wide prediction sets.
Assuming exchangeability across batches while ignoring distribution shift: When the data distribution changes from batch to batch (seasonal time series, differences in hospital equipment, etc.), trusting the cumulative e-value without verification can cause the actual coverage to diverge significantly from the guaranteed value.
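
One of the mitigations for distribution shift named in the weaknesses table, sliding-window recalibration, can be sketched in a few lines. The class below is a hypothetical helper, not an API of MAPIE or crepes. It keeps only the most recent nonconformity scores and computes the conformal p-value against that window. The window size is an assumption to tune, and exchangeability is still assumed within the window.

```python
from collections import deque
import numpy as np

class SlidingCalibration:
    """Keeps only the most recent nonconformity scores (hypothetical helper)."""

    def __init__(self, window: int = 500):
        self.scores = deque(maxlen=window)  # old scores drop out automatically

    def add(self, score: float) -> None:
        self.scores.append(float(score))

    def p_value(self, test_score: float) -> float:
        """Conformal p-value of test_score against the current window."""
        cal = np.fromiter(self.scores, dtype=float)
        return (np.sum(cal >= test_score) + 1) / (len(cal) + 1)

calib = SlidingCalibration(window=5)
for s in range(10):        # scores 0..9 arrive in order; only 5..9 are retained
    calib.add(s)
print(list(calib.scores))  # the five most recent scores
print(calib.p_value(7.0))  # (3 + 1) / (5 + 1)
```

A shorter window adapts faster to drift but leaves fewer calibration scores, which makes the resulting p-values coarser.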

Closing Thoughts

By combining e-values with conformal prediction, you can maintain statistically rigorous uncertainty guarantees even in sequential environments with an unknown number of batches and even with real-world data featuring ambiguous labels. The theoretical foundation traces back to Vovk's 2020 work, and arXiv:2503.13050 in 2025 rapidly extended it into three practical applications: batch anytime-valid prediction, fuzzy prediction sets, and handling ambiguous ground truth. The online conformal prediction covered in the next post connects naturally to this batch e-value accumulation concept, so familiarizing yourself with the BatchEValueMonitor flow from this post will be very helpful.

Here are 3 steps you can take right now.

Hands-on conformal prediction basics: After pip install mapie, run the classification example in the official documentation to directly verify the coverage guarantee of split conformal. MAPIE does not natively support e-values, but it is a good starting point for understanding the nonconformity score computation structure.
Experiment by implementing e-values directly: Try applying the BatchEValueMonitor code from this post to make_classification data. You can use the visualization code below to see how actual coverage changes as you vary the calibration set size (100, 300, 600; the calibration split in this post holds 600 samples).

python

import matplotlib.pyplot as plt
 
cal_sizes = [100, 300, 600]  # X_cal holds 600 samples in the setup above
coverages = []
 
for cal_size in cal_sizes:
    cal_X_sub = X_cal[:cal_size]
    cal_y_sub = y_cal[:cal_size]
    pred_sets = conformal_evalue_predict(clf, cal_X_sub, cal_y_sub, X_test_all)
    # Share of test points whose true label falls inside the prediction set
    coverage = np.mean([
        y_true in pred_set
        for y_true, pred_set in zip(y_test_all, pred_sets)
    ])
    coverages.append(coverage)
 
plt.figure(figsize=(8, 4))
plt.plot(cal_sizes, coverages, marker="o", label="Empirical coverage")
plt.axhline(y=0.95, color="red", linestyle="--", label="Target coverage (95%)")
plt.xlabel("Calibration set size")
plt.ylabel("Empirical coverage")
plt.title("Calibration set size vs. coverage")
plt.legend()
plt.tight_layout()
plt.show()

Read the original paper: arXiv:2503.13050 "E-Values Expand the Scope of Conformal Prediction" presents the batch setting algorithm clearly in pseudocode, making it a great resource for confirming the theoretical foundations before integrating into a real system.

Next post: Online Conformal Prediction — how to maintain real-time coverage on streaming data using a decreasing step size. We will explore how this connects to the batch e-value accumulation concept covered in this post.

References

Introductory — Good starting points for first-time readers

A Gentle Introduction to Conformal Prediction | arXiv:2107.07511 — A suitable starting point for those encountering conformal prediction for the first time.
MAPIE — scikit-learn-contrib | GitHub — Useful for understanding the p-value-based conformal prediction structure through code.
awesome-conformal-prediction | GitHub curated list — Browse papers, tutorials, and libraries all in one place.

Core — Resources directly connected to this post

E-Values Expand the Scope of Conformal Prediction | arXiv:2503.13050 — The original paper on batch anytime-valid prediction and fuzzy prediction sets.
Conformal e-prediction — Vovk | ScienceDirect 2025 — The theoretical foundation of conformal e-prediction.
Fuzzy Prediction Sets: Conformal Prediction with E-values | arXiv:2509.13130 — Covers the connection between fuzzy prediction sets and decision theory.

Further Reading

ML Uncertainty Quantification in Batch and Sequential Settings: What Changes When Integrating e-values into Conformal Prediction | DEV BAK - 기술블로그

ML Uncertainty Quantification in Batch and Sequential Settings: What Changes When Integrating e-values into Conformal Prediction

Core Concepts

Conformal Prediction — Model-Agnostic Uncertainty Guarantees

Exchangeability: The condition that the joint distribution does not change even if the order of data points is shuffled arbitrarily. This is a weaker condition than i.i.d., and the validity guarantee of conformal prediction relies on this single assumption alone.

Under this exchangeability assumption, for a specified error rate α, the probability that the true label is included in the prediction set can be guaranteed to be at least 1−α.

python

import numpy as np
from sklearn.ensemble import RandomForestClassifier
 
def split_conformal_predict(clf, cal_X, cal_y, test_X, alpha=0.05):
    """
    Split Conformal Prediction (p-value 방식) 기본 구조.
    보정 집합으로 비적합 점수를 계산하고
    임계값 이상의 레이블을 예측 집합으로 반환합니다.
    """
    # 소프트맥스 확률 기반 비적합 점수: 1 - 정답 클래스의 예측 확률
    cal_probs = clf.predict_proba(cal_X)
    cal_scores = 1 - cal_probs[np.arange(len(cal_y)), cal_y]
 
    # (1-α) 분위수를 임계값으로 사용
    n = len(cal_scores)
    threshold = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n)
 
    # 테스트 데이터에 대해 임계값 이하인 레이블을 예측 집합으로 포함
    test_probs = clf.predict_proba(test_X)
    prediction_sets = []
    for probs in test_probs:
        scores = 1 - probs
        pred_set = np.where(scores <= threshold)[0].tolist()
        prediction_sets.append(pred_set)
 
    return prediction_sets

e-values — Evidence Strength Expressed as a Betting Multiplier

E[E | H₀] ≤ 1

From this simple property, three powerful characteristics emerge.

Property	Description	Practical Meaning
Multiplicative combination	The product E₁ × E₂ × ··· of independent e-values is also a valid e-value	Batch results can be combined by simple multiplication
Anytime validity	Guarantees are maintained at any arbitrary point without pre-specifying sample size	Sequential monitoring, early stopping possible
Fuzzy membership representation	Degree of label inclusion expressed as a continuous value in [0,1] instead of binary (0/1)	Handles ambiguous label environments

The theoretical basis for anytime validity is Ville's inequality.

Ville's Inequality (intuition): "No matter how many times you look, the probability of a false rejection never exceeds α." Formally, for a non-negative supermartingale {Mₜ}, P(∃t: Mₜ ≥ 1/α) ≤ α holds. Because the cumulative product of the reciprocals of e-values forms this supermartingale, statistical guarantees are maintained at any arbitrary point in time.

Supermartingale: A stochastic process whose expected value decreases or stays the same over time. Think of it as a fair gamble where "what you can expect going forward is no greater than what you have now." Because the cumulative product of reciprocals of e-values satisfies this property under the null hypothesis, it is possible to guarantee that the error rate will not exceed α no matter when you stop observing.

Conformal e-Prediction — e-values as the Core Test Statistic

The reason the 1/p_value transformation is valid lies in Markov's inequality.

Markov's Inequality (intuition): "A random variable with a small expected value is unlikely to take on large values." For a non-negative random variable E, P(E ≥ c) ≤ E[E]/c holds. Setting E = 1/p, since E[E] ≤ 1 under the null hypothesis, we get P(E ≥ 1/α) ≤ α. That is, rejecting the null hypothesis when the e-value is at least 1/α comes with a guarantee that the error rate does not exceed α. However, E = 1/p is valid but not optimal. To increase power, it is necessary to design a betting function suited to the data distribution.

Here is a summary of how conformal e-prediction differs from traditional conformal prediction:

	Traditional Conformal Prediction (p-value)	Conformal e-Prediction (e-value)
Core test statistic	p-value	e-value (betting multiplier)
Unknown number of batches	Requires union bound	Bypassed via multiplicative combination
Sequential anytime validity	Limited	Natively supported
Handling label ambiguity	Binary (0/1) only	[0,1] fuzzy membership
Cross-conformal theoretical guarantee	Can break under excessive randomization	Theoretically guaranteed

python

import numpy as np
from typing import List
 
def compute_conformal_evalue(cal_scores: np.ndarray, test_score: float) -> float:
    """
    보정 집합의 비적합 점수와 테스트 점수로부터 e-value를 계산합니다.
 
    직관: 테스트 점수가 보정 집합보다 얼마나 극단적인가를 '배율'로 표현합니다.
    E = 1/p 변환은 마르코프 부등식으로 유효성이 보장되지만,
    더 정교한 베팅 전략을 사용하면 검출력(power)이 향상됩니다.
    """
    n = len(cal_scores)
    # 테스트 점수보다 크거나 같은 보정 점수의 비율 → p-value
    p_value = (np.sum(cal_scores >= test_score) + 1) / (n + 1)
    return 1.0 / p_value
 
 
def conformal_evalue_predict(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                              test_X: np.ndarray,
                              threshold: float = 1.0) -> List[List[int]]:
    """
    e-value 기반 예측 집합 구성.
    e-value > threshold인 레이블을 집합에 포함합니다.
    """
    cal_probs = clf.predict_proba(cal_X)
    test_probs_all = clf.predict_proba(test_X)  # 배치 추론으로 사전 계산
    n_classes = cal_probs.shape[1]
 
    prediction_sets = []
    for test_probs in test_probs_all:
        pred_set = []
        for y_candidate in range(n_classes):
            cal_scores_y = 1 - cal_probs[cal_y == y_candidate, y_candidate]
            test_score_y = 1 - test_probs[y_candidate]
 
            if len(cal_scores_y) > 0:
                e_val = compute_conformal_evalue(cal_scores_y, test_score_y)
                if e_val > threshold:
                    pred_set.append(y_candidate)
 
        prediction_sets.append(pred_set)
 
    return prediction_sets

Practical Applications

Example 1: Batch Anytime-Valid Conformal Prediction (Sequential Batch Scenario)

python

import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
 
# ── 데이터 준비 ──────────────────────────────────────────────────
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.6, random_state=42
)
X_cal, X_test_all, y_cal, y_test_all = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
 
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
 
# 순차 배치 시뮬레이션 — 5개 병원에서 데이터가 순차 도착
n_batches = 5
batch_size = len(X_test_all) // n_batches
sequential_batches: List[Tuple[np.ndarray, np.ndarray]] = [
    (
        X_test_all[i * batch_size:(i + 1) * batch_size],
        y_test_all[i * batch_size:(i + 1) * batch_size],
    )
    for i in range(n_batches)
]
 
 
# ── 배치 e-value 계산 ────────────────────────────────────────────
 
def compute_batch_evalue(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                          batch_X: np.ndarray, batch_y: np.ndarray) -> float:
    """
    배치 데이터로부터 배치 e-value를 계산합니다.
 
    구현: 각 샘플의 e-value(= 1/p-value)의 기하평균을 사용합니다.
    이론적으로 독립 e-value의 곱도 유효한 e-value이며,
    기하평균은 배치 크기 차이를 보정한 샘플당 증거 강도입니다.
 
    ⚠️ 주의: arXiv:2503.13050의 배치 알고리즘은 베팅 함수(betting function)를
    명시적으로 최적화합니다. 이 단순 구현은 이론적 보장은 유지하지만
    검출력이 낮을 수 있습니다. 프로덕션 적용 전 원논문 확인을 권장합니다.
    """
    cal_probs = clf.predict_proba(cal_X)
    cal_scores = 1 - cal_probs[np.arange(len(cal_y)), cal_y]
 
    batch_probs = clf.predict_proba(batch_X)
    batch_scores = 1 - batch_probs[np.arange(len(batch_y)), batch_y]
 
    n_cal = len(cal_scores)
    p_values = np.array([
        (np.sum(cal_scores >= s) + 1) / (n_cal + 1)
        for s in batch_scores
    ])
 
    # 각 샘플의 e-value 기하평균 (배치 크기 정규화)
    individual_evalues = 1.0 / p_values
    batch_evalue = float(np.exp(np.mean(np.log(individual_evalues))))
    return batch_evalue
 
 
# ── Anytime monitor ──────────────────────────────────────────────

@dataclass
class BatchEValueMonitor:
    """
    Monitor that tracks the running product of e-values for data
    arriving sequentially in batches.
    Provides anytime validity: the test can be evaluated at any time,
    provided each batch e-value has expectation at most 1 under the
    null given the earlier batches.
    """
    alpha: float = 0.05
    cumulative_evalue: float = 1.0
    batch_history: List[float] = field(default_factory=list)

    def update(self, batch_evalue: float) -> dict:
        """
        Update the running product with the e-value of a new batch.
        Returns whether the null hypothesis is rejected and the
        current effective error level.
        """
        self.cumulative_evalue *= batch_evalue
        self.batch_history.append(batch_evalue)

        # Ville's inequality: P(E_cumul ≥ 1/α at any time) ≤ α under the null
        reject = self.cumulative_evalue >= (1.0 / self.alpha)
        current_alpha = 1.0 / self.cumulative_evalue

        return {
            "batch_num": len(self.batch_history),
            "batch_evalue": batch_evalue,
            "cumulative_evalue": self.cumulative_evalue,
            "reject_null": reject,
            "current_effective_alpha": min(current_alpha, 1.0),
        }
 
 
# ── Sequential run example ───────────────────────────────────────

monitor = BatchEValueMonitor(alpha=0.05)

for batch_X, batch_y in sequential_batches:
    e_val = compute_batch_evalue(clf, X_cal, y_cal, batch_X, batch_y)
    result = monitor.update(e_val)

    print(f"Batch {result['batch_num']:2d} | "
          f"e-value: {result['batch_evalue']:.3f} | "
          f"cumulative: {result['cumulative_evalue']:.3f} | "
          f"reject: {result['reject_null']}")

    if result["reject_null"]:
        print("→ Null hypothesis rejected: statistically significant effect detected")
        break  # The guarantee already obtained stays valid even if more batches arrive later

Summary of the Batch e-value Accumulation Process

Step	Action	Key Point
Build calibration set	Obtain nonconformity score distribution from initial data	Can be shared across batches or use a sliding window
Compute batch e-value	Convert each batch's p-values to e-values	Power varies depending on betting strategy
Update cumulative product	`E_cumul *= E_batch`	Validity maintained without knowing the number of batches K
Anytime test	Reject if `E_cumul ≥ 1/α`	α-level guarantee regardless of when you stop
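
The anytime guarantee in the last row can be checked empirically. The following is a minimal simulation sketch, not the algorithm from arXiv:2503.13050. It assumes independent uniform p-values under the null and the calibrator `kappa * p**(kappa - 1)` with `kappa = 0.5`, both chosen here for illustration; the function names are hypothetical. It counts how often the running product ever reaches `1/α` across many simulated runs, a fraction that Ville's inequality bounds by α.

```python
import random


def calibrate(p: float, kappa: float = 0.5) -> float:
    """p-to-e calibrator kappa * p**(kappa - 1); it integrates to 1 over (0, 1]."""
    return kappa * p ** (kappa - 1.0)


def ever_rejects(n_batches: int, alpha: float, rng: random.Random) -> bool:
    """Simulate one run under the null; True if the product ever reaches 1/alpha."""
    wealth = 1.0
    for _ in range(n_batches):
        p = 1.0 - rng.random()  # uniform on (0, 1], so p is never exactly 0
        wealth *= calibrate(p)
        if wealth >= 1.0 / alpha:
            return True
    return False


def false_alarm_rate(n_runs: int = 5000, n_batches: int = 50,
                     alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of simulated null runs that reject at any of the n_batches steps."""
    rng = random.Random(seed)
    hits = sum(ever_rejects(n_batches, alpha, rng) for _ in range(n_runs))
    return hits / n_runs


rate = false_alarm_rate()
print(f"empirical false-alarm rate: {rate:.4f} (bound: 0.05)")
```

Because the check runs after every batch, this is the "stop whenever you like" scenario. A fixed-level p-value test repeated after every batch would not come with the same bound.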

Example 2: Fuzzy Prediction Sets — Handling Ambiguous Labels

python

import numpy as np
from typing import List, Optional, Dict
 
def compute_fuzzy_membership(e_value: float, threshold: float = 1.0) -> float:
    """
    Convert an e-value to a fuzzy membership value in [0, 1].

    Formula: log(1 + (e_value - threshold)) / log(10)
    - below threshold → 0.0 (not in the prediction set)
    - e_value = threshold + 9 → 1.0 (saturation point)
    - the log scale flattens as the e-value grows, which stabilizes extreme values

    The saturation point (+9) is an arbitrary constant meaning "fully included
    when the e-value is 9 above the threshold". If fast saturation is needed,
    as in strict medical diagnosis, lower it to np.log1p(4.0); for data with
    a wide spread it can be set larger.
    """
    if e_value < threshold:
        return 0.0
    return float(min(1.0, np.log1p(e_value - threshold) / np.log1p(9.0)))
 
 
def fuzzy_conformal_predict(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                             test_X: np.ndarray,
                             ambiguous_labels: Optional[np.ndarray] = None
                             ) -> List[Dict[int, float]]:
    """
    Build fuzzy conformal prediction sets.

    ambiguous_labels: soft-label matrix of shape (n_cal_samples, n_classes).
                      Each cell is the expert agreement rate (0 to 1) for that class.
                      If None, the hard labels are one-hot encoded.
    Returns: a list with one {label: membership value} dict per test sample.
    """
    cal_probs = clf.predict_proba(cal_X)
    test_probs_all = clf.predict_proba(test_X)  # precomputed in one batched inference call
    n_classes = cal_probs.shape[1]
 
    if ambiguous_labels is None:
        ambiguous_labels = np.eye(n_classes)[cal_y]
 
    results = []
    for test_probs in test_probs_all:
        memberships: Dict[int, float] = {}
 
        for y_candidate in range(n_classes):
            soft_weights = ambiguous_labels[:, y_candidate]
            cal_scores = 1 - cal_probs[:, y_candidate]
            weighted_scores = cal_scores * soft_weights
 
            valid_mask = soft_weights > 0
            if valid_mask.sum() < 5:  # require a minimum number of samples
                memberships[y_candidate] = 0.0
                continue
 
            test_score = 1 - test_probs[y_candidate]
            e_val = compute_conformal_evalue(weighted_scores[valid_mask], test_score)
            memberships[y_candidate] = compute_fuzzy_membership(e_val)
 
        results.append(memberships)
 
    return results
 
 
# ── Usage example: medical diagnosis with labels from 3 experts ──

n_classes = clf.n_classes_
n_experts = 3

# Simulated expert labels; in practice these come from an annotation tool
np.random.seed(42)
expert_labels = np.random.randint(0, n_classes, size=(len(y_cal), n_experts))

# Build the per-class soft-label matrix
# ambiguous[i, c] = fraction of experts who assigned class c to sample i
ambiguous = np.zeros((len(y_cal), n_classes))
for c in range(n_classes):
    ambiguous[:, c] = np.mean(expert_labels == c, axis=1)
# Each row sums to 1.0 (sum of the per-class agreement rates)
 
fuzzy_preds = fuzzy_conformal_predict(
    clf, X_cal, y_cal, X_test_all[:10],
    ambiguous_labels=ambiguous,
)
 
for i, memberships in enumerate(fuzzy_preds[:3]):
    print(f"\nTest sample {i}:")
    for label, membership in sorted(memberships.items(), key=lambda x: -x[1]):
        bar = "█" * int(membership * 20)
        print(f"  Class {label}: {membership:.3f} |{bar:<20}|")

Summary of the Fuzzy Prediction Set Construction Process

Step	Action	Key Point
Build soft label matrix	Create (n_samples, n_classes) matrix from expert agreement rates	Normalize so each row sums to 1
Compute weighted nonconformity scores	Apply soft label weights to nonconformity scores	Classes with lower agreement have reduced influence
Compute e-value	Derive e-value from the weighted score distribution	Higher e-value when expert opinions agree
Fuzzy membership transformation	Convert to [0,1] via `compute_fuzzy_membership`	Saturation threshold can be tuned to the domain
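
The first and last rows of this table can be tried in isolation on a toy input. The sketch below uses only the standard library. `soft_label_matrix` and the `saturation` parameter are hypothetical names introduced here; `membership` mirrors the log-scaled mapping of `compute_fuzzy_membership` with the saturation constant exposed as an argument, and the vote data is made up for illustration.

```python
import math
from typing import List


def soft_label_matrix(votes: List[List[int]], n_classes: int) -> List[List[float]]:
    """Row i, column c = fraction of annotators who assigned class c to sample i."""
    return [
        [row.count(c) / len(row) for c in range(n_classes)]
        for row in votes
    ]


def membership(e_value: float, threshold: float = 1.0,
               saturation: float = 9.0) -> float:
    """Log-scaled mapping to [0, 1]; reaches 1.0 at threshold + saturation."""
    if e_value < threshold:
        return 0.0
    return min(1.0, math.log1p(e_value - threshold) / math.log1p(saturation))


# Four samples, each labeled by three annotators (illustrative data)
votes = [[0, 0, 0], [0, 1, 1], [2, 1, 0], [2, 2, 1]]
soft = soft_label_matrix(votes, n_classes=3)
for row in soft:
    print([round(v, 3) for v in row])

# Compare the default saturation (+9) with a faster one (+4)
for e in (0.5, 1.0, 2.0, 5.0, 10.0, 50.0):
    print(f"e={e:5.1f}  sat=9: {membership(e):.3f}  "
          f"sat=4: {membership(e, saturation=4.0):.3f}")
```

Each row of the matrix sums to 1 by construction, since every annotator casts exactly one vote per sample. Lowering the saturation constant makes the membership reach 1.0 at a smaller e-value.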

Strengths and Weaknesses

Strengths

Item	Description	When It Is Especially Useful
Simultaneous guarantee with unknown number of batches	Provides valid coverage even without knowing K, via multiplicative combination of e-values	Multi-site clinical trials, batch A/B testing
Anytime validity	Statistical guarantees maintained at any arbitrary point in time by Ville's inequality	Sequential experiments with early stopping
Fuzzy prediction sets	Accommodates label ambiguity with continuous [0,1] values instead of binary inclusion	Environments with disagreeing expert annotations
Easy design of conditional predictors	Easier to construct per-input customized prediction sets than with p-value methods	Cases requiring per-subgroup guarantees
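Cross-conformal validity	For p-value-based cross-conformal prediction, validity is only an empirical observation and can break under excessive randomization, whereas the e-value-based version carries a theoretical guarantee because an average of e-values is again an e-value	Cases with limited data requiring cross-validation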
Handling ambiguous labels	Valid coverage guaranteed in environments with annotation noise or label ambiguity	Medical imaging, emotion recognition, etc.

Weaknesses and Caveats

Item	Description	Mitigation
Exchangeability dependence	Assumption violated under strong temporal dependence (AR processes)	Sliding window recalibration, weighted conformal
Conservative prediction sets	Prediction sets can be wider than p-value methods with small calibration sets	Ensure sufficient calibration data (recommended n ≥ 200)
Computational cost	Full conformal requires model retraining for each candidate label	Use split conformal e-prediction to reduce computation
Software immaturity	Not natively supported in major libraries such as MAPIE or crepes	Refer to paper authors' GitHub code, implement directly
Distribution shift	Calibration set may become mismatched when distribution changes between batches	Use weighted e-values, adaptive recalibration strategies
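Betting function design	The plain reciprocal `E = 1/p` is not a valid e-value (its expected value under the null exceeds 1); valid p-to-e calibrators such as `E = κ·p^(κ-1)` with 0 < κ < 1 keep the guarantee but may have low power	Optimize the betting function based on domain knowledge (see arXiv:2503.13050)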
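
The sliding-window recalibration listed as a mitigation above might look like the sketch below. It assumes nonconformity scores arrive as plain floats; `SlidingWindowCalibrator` is a hypothetical helper, not part of MAPIE, crepes, or the paper's code. Keeping only recent scores is a heuristic for tracking slow drift and does not by itself restore the exchangeability guarantee.

```python
from collections import deque
from typing import Deque, Iterable


class SlidingWindowCalibrator:
    """Keeps only the most recent `window` nonconformity scores."""

    def __init__(self, window: int = 500) -> None:
        # A deque with maxlen drops the oldest entries automatically
        self.scores: Deque[float] = deque(maxlen=window)

    def extend(self, new_scores: Iterable[float]) -> None:
        """Add freshly labeled scores; older ones fall out of the window."""
        self.scores.extend(new_scores)

    def p_value(self, test_score: float) -> float:
        """Conformal p-value of test_score against the current window."""
        n = len(self.scores)
        n_at_least = sum(s >= test_score for s in self.scores)
        return (n_at_least + 1) / (n + 1)


calib = SlidingWindowCalibrator(window=4)
calib.extend([0.1, 0.2, 0.3, 0.4])
p_before = calib.p_value(0.35)  # 1 of 4 scores >= 0.35 → (1 + 1) / (4 + 1) = 0.4
calib.extend([0.8, 0.9])        # the two oldest scores (0.1, 0.2) are evicted
p_after = calib.p_value(0.35)   # 3 of 4 scores >= 0.35 → (3 + 1) / (4 + 1) = 0.8
print(p_before, p_after)
```

The window size trades off responsiveness against stability: a short window follows drift quickly but makes the p-value grid coarser.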

Most Common Mistakes in Practice

Not separating the calibration set from the training data: Computing nonconformity scores on the same data used to train the model invalidates the coverage guarantee. It is essential to maintain a separate held-out calibration set.
Using the plain reciprocal of p-values as e-values: E = 1/p is not a valid e-value, because its expected value under the null exceeds 1. A p-to-e calibrator such as E = κ·p^(κ-1) with 0 < κ < 1 is valid but not optimal. To increase power, a betting function tailored to the data is needed; omitting this can result in unnecessarily wide prediction sets.
Assuming exchangeability across batches while ignoring distribution shift: In environments where the data distribution changes from batch to batch (seasonal time series, differences in hospital equipment, etc.), trusting the cumulative e-value without verification can cause the actual coverage to diverge significantly from the guaranteed value.
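
Whether a p-to-e transform is a valid e-value can be checked by computing its expectation under the null, which must not exceed 1. The sketch below assumes an exchangeable conformal p-value with no ties, so that it is uniform on the grid {1/(n+1), ..., 1}; `null_expectation` is a hypothetical helper and `n_cal = 200` is an illustrative choice.

```python
from typing import Callable


def null_expectation(transform: Callable[[float], float], n_cal: int) -> float:
    """Exact E[transform(p)] for p uniform on {1/(n_cal+1), ..., 1}."""
    grid = [k / (n_cal + 1) for k in range(1, n_cal + 2)]
    return sum(transform(p) for p in grid) / len(grid)


n_cal = 200

# Plain reciprocal: the expectation is a harmonic number, which is above 1
naive = null_expectation(lambda p: 1.0 / p, n_cal)

# Calibrator kappa * p**(kappa - 1) with kappa = 0.5: the expectation stays at or below 1
calibrated = null_expectation(lambda p: 0.5 * p ** -0.5, n_cal)

print(f"E[1/p]           = {naive:.3f}  (valid e-value: {naive <= 1.0})")
print(f"E[0.5 * p^-0.5]  = {calibrated:.3f}  (valid e-value: {calibrated <= 1.0})")
```

The calibrator passes because it is a decreasing function that integrates to 1 over (0, 1], so its average over the discrete grid cannot exceed that integral.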

Closing Thoughts

Here are 3 steps you can take right now.

Hands-on conformal prediction basics: After pip install mapie, run the classification example in the official documentation to directly verify the coverage guarantee of split conformal. MAPIE does not natively support e-values, but it is a good starting point for understanding the nonconformity score computation structure.
Experiment by implementing e-values directly: Try applying the BatchEValueMonitor code from this post to make_classification data. You can use the visualization code below to see how prediction set size and actual coverage change as you vary the calibration set size (100, 500, 1000).

python

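import matplotlib.pyplot as plt

cal_sizes = [100, 500, 1000]
coverages = []
avg_set_sizes = []

for cal_size in cal_sizes:
    # Use the first cal_size calibration samples (assumes len(X_cal) >= 1000)
    cal_X_sub = X_cal[:cal_size]
    cal_y_sub = y_cal[:cal_size]
    pred_sets = conformal_evalue_predict(clf, cal_X_sub, cal_y_sub, X_test_all)
    coverage = np.mean([
        y_true in pred_set
        for y_true, pred_set in zip(y_test_all, pred_sets)
    ])
    coverages.append(coverage)
    avg_set_sizes.append(np.mean([len(pred_set) for pred_set in pred_sets]))
    print(f"n_cal={cal_size}: coverage={coverage:.3f}, "
          f"mean set size={avg_set_sizes[-1]:.2f}")

plt.figure(figsize=(8, 4))
plt.plot(cal_sizes, coverages, marker="o", label="Empirical coverage")
plt.axhline(y=0.95, color="red", linestyle="--", label="Target coverage (95%)")
plt.xlabel("Calibration set size")
plt.ylabel("Empirical coverage")
plt.title("Calibration set size vs. coverage")
plt.legend()
plt.tight_layout()
plt.show()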

Read the original paper: arXiv:2503.13050 "E-Values Expand the Scope of Conformal Prediction" presents the batch setting algorithm clearly in pseudocode, making it a great resource for confirming the theoretical foundations before integrating into a real system.

Next post: Online Conformal Prediction — how to maintain real-time coverage on streaming data using a decreasing step size. We will explore how this connects to the batch e-value accumulation concept covered in this post.

References

Introductory — Good starting points for first-time readers

A Gentle Introduction to Conformal Prediction | arXiv:2107.07511 — A suitable starting point for those encountering conformal prediction for the first time.
MAPIE — scikit-learn-contrib | GitHub — Useful for understanding the p-value-based conformal prediction structure through code.
awesome-conformal-prediction | GitHub curated list — Browse papers, tutorials, and libraries all in one place.

Core — Resources directly connected to this post

E-Values Expand the Scope of Conformal Prediction | arXiv:2503.13050 — The original paper on batch anytime-valid prediction and fuzzy prediction sets.
Conformal e-prediction — Vovk | ScienceDirect 2025 — The theoretical foundation of conformal e-prediction.
Fuzzy Prediction Sets: Conformal Prediction with E-values | arXiv:2509.13130 — Covers the connection between fuzzy prediction sets and decision theory.
