ML Uncertainty Quantification in Batch and Sequential Settings: What Changes When Integrating e-values into Conformal Prediction
One of the most uncomfortable moments when deploying ML models to production is being asked, "How much can we trust this prediction?" Accuracy or F1 scores only summarize past performance, and statistically guaranteeing how confident a model is at any given moment is no easy task. Conformal Prediction is a framework that tackles this problem head-on: it provides a statistical guarantee of the form "the probability that the true label is contained in this prediction set is at least 95%" without strong assumptions about model architecture or data distribution. Then, with the 2025 publication of arXiv:2503.13050 "E-Values Expand the Scope of Conformal Prediction," a systematic approach was presented to maintain validity even under conditions where existing methods broke down. In sequential settings where the number of batches is unknown, p-value-based methods need a correction (such as a union bound) that depends on the total number of batches, so their guarantee cannot be fixed in advance, whereas e-values preserve validity at any arbitrary stopping point.
This article targets ML engineers and backend developers who have encountered the concept of conformal prediction at least once, or who understand the need for uncertainty estimation but are implementing it for the first time. Topics covered: the core principles of e-values, Python implementations in batch and sequential settings, and fuzzy prediction sets. Topics not covered: optimal design of betting functions (paper-level theory), step-size theory for online conformal (planned for a future post). Statistical terms such as Markov's inequality and supermartingales appear, but each is accompanied by an intuitive explanation. If you want to focus on running the code rather than theoretical rigor, you can skip the blockquote sections and still follow the core flow.
Core Concepts
Conformal Prediction — Model-Agnostic Uncertainty Guarantees
The idea behind conformal prediction is simple. For a new input x, you measure how well a candidate label y "fits" with existing data (conformity) as a score, and then bundle the labels whose scores meet a certain threshold into a prediction set. Only one assumption is required.
Exchangeability: The condition that the joint distribution does not change even if the order of data points is shuffled arbitrarily. This is a weaker condition than i.i.d., and the validity guarantee of conformal prediction relies on this single assumption alone.
Under this exchangeability assumption, for a specified error rate α, the probability that the true label is included in the prediction set can be guaranteed to be at least 1−α.
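A small simulation can make this guarantee concrete. The sketch below assumes i.i.d. uniform scores (i.i.d. data is always exchangeable) and uses the finite-sample threshold, the ceil((n+1)(1−α))-th smallest calibration score. The helper names `conformal_threshold` and `simulate_coverage` are illustrative and do not come from any library.

```python
import math
import random


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only a trivial set is valid
        return math.inf
    return sorted(cal_scores)[k - 1]


def simulate_coverage(n_cal=50, alpha=0.1, n_trials=20_000, seed=0):
    """Fraction of trials in which the test score falls at or below the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # n_cal calibration scores plus one test score, all drawn i.i.d.
        scores = [rng.random() for _ in range(n_cal + 1)]
        cal, test = scores[:-1], scores[-1]
        hits += test <= conformal_threshold(cal, alpha)
    return hits / n_trials


coverage = simulate_coverage()
print(f"empirical coverage: {coverage:.3f} (target: at least 0.900)")
```

With 50 calibration points and α = 0.1, the threshold is the 46th smallest score, so the exact coverage is 46/51, slightly above the 90% target. The guarantee is marginal: it averages over both the calibration draw and the test point.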
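```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def split_conformal_predict(clf, cal_X, cal_y, test_X, alpha=0.05):
    """
    Basic structure of split conformal prediction (p-value style).

    Computes nonconformity scores on the calibration set and returns, for
    each test point, the labels whose score is at or below the threshold.
    Assumes integer labels 0..n_classes-1 that match the predict_proba columns.
    """
    # Nonconformity score from class probabilities:
    # 1 - predicted probability of the true class
    cal_probs = clf.predict_proba(cal_X)
    cal_scores = 1 - cal_probs[np.arange(len(cal_y)), cal_y]

    # Finite-sample-corrected (1 - alpha) quantile used as the threshold.
    # The level is capped at 1.0 because it can exceed 1 for very small n;
    # method="higher" (NumPy >= 1.22) avoids interpolating below the target rank.
    n = len(cal_scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    threshold = np.quantile(cal_scores, q_level, method="higher")

    # Include every label whose score is at or below the threshold
    test_probs = clf.predict_proba(test_X)
    prediction_sets = []
    for probs in test_probs:
        scores = 1 - probs
        pred_set = np.where(scores <= threshold)[0].tolist()
        prediction_sets.append(pred_set)
    return prediction_sets
```

e-values: Evidence Strength Expressed as a Betting Multiplier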
Where a p-value expresses "the probability of obtaining a result this extreme if the null hypothesis were true," an e-value expresses "the strength of evidence against the null hypothesis" as a betting multiplier. Mathematically, an e-value E is a non-negative random variable whose expected value under the null hypothesis H₀ is at most 1.
E[E | H₀] ≤ 1

From this simple property, three powerful characteristics emerge.
| Property | Description | Practical Meaning |
|---|---|---|
| Multiplicative combination | The product E₁ × E₂ × ··· of independent e-values is also a valid e-value | Batch results can be combined by simple multiplication |
| Anytime validity | Guarantees are maintained at any arbitrary point without pre-specifying sample size | Sequential monitoring, early stopping possible |
| Fuzzy membership representation | Degree of label inclusion expressed as a continuous value in [0,1] instead of binary (0/1) | Handles ambiguous label environments |
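The betting interpretation and the multiplicative property can be checked exactly on a toy problem. The sketch below assumes the null hypothesis "the coin is fair" and a bettor who believes heads has probability `q`; the bet pays `2q` on heads and `2(1 − q)` on tails. The function names are illustrative.

```python
import math
from itertools import product


def coin_evalue(outcome: str, q: float) -> float:
    """Betting multiplier for one flip when H0 says the coin is fair."""
    return 2 * q if outcome == "H" else 2 * (1 - q)


def wealth(flips: str, q: float) -> float:
    """Product of per-flip e-values: the bettor's wealth after the sequence."""
    total = 1.0
    for f in flips:
        total *= coin_evalue(f, q)
    return total


def expected_under_null(n_flips: int, q: float) -> float:
    """Exact expectation of the product over all 2**n equally likely sequences."""
    return sum(
        wealth("".join(seq), q) * 0.5 ** n_flips
        for seq in product("HT", repeat=n_flips)
    )


print("E[E] for one flip under H0:   ", expected_under_null(1, 0.7))
print("E[E1*E2*E3] under H0:         ", expected_under_null(3, 0.7))
print("wealth after 8 straight heads:", wealth("HHHHHHHH", 0.7))
```

Under the fair coin the expected multiplier is 1 for any `q`, and so is the expected product over independent flips. Wealth only grows persistently when the data actually favor the alternative.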
The theoretical basis for anytime validity is Ville's inequality.
Ville's Inequality (intuition): "No matter how many times you look, the probability of a false rejection never exceeds α." Formally, for a non-negative supermartingale {Mₜ} with M₀ = 1, P(∃t: Mₜ ≥ 1/α) ≤ α holds. Under the null hypothesis, the running product of sequentially computed e-values (each with conditional expectation at most 1 given the past) forms such a supermartingale, so statistical guarantees are maintained at any arbitrary point in time.
Supermartingale: A stochastic process whose expected next value, given everything observed so far, is no greater than its current value. Think of it as a fair or unfavorable gamble where "what you can expect going forward is no greater than what you have now." Because the cumulative product of e-values satisfies this property under the null hypothesis, it is possible to guarantee that the error rate will not exceed α no matter when you stop observing.
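Ville's inequality can be checked by simulation. The sketch below assumes a fair coin (so the null hypothesis is true) and a bettor whose wealth is multiplied by `2q` on heads and `2(1 − q)` on tails, which makes the wealth process a non-negative martingale starting at 1. It then counts how many paths ever reach 1/α. The function names and parameter values are illustrative.

```python
import random


def ever_crosses(alpha: float, q: float, horizon: int, rng: random.Random) -> bool:
    """Follow one wealth path under a fair coin; report whether it ever reaches 1/alpha."""
    wealth = 1.0
    for _ in range(horizon):
        heads = rng.random() < 0.5  # H0 is true: the coin is fair
        wealth *= 2 * q if heads else 2 * (1 - q)
        if wealth >= 1 / alpha:
            return True  # a false rejection at this (data-dependent) time
    return False


def crossing_rate(alpha=0.05, q=0.7, horizon=200, n_paths=5_000, seed=1) -> float:
    """Monte Carlo estimate of P(exists t <= horizon: M_t >= 1/alpha) under H0."""
    rng = random.Random(seed)
    crossed = sum(ever_crosses(alpha, q, horizon, rng) for _ in range(n_paths))
    return crossed / n_paths


rate = crossing_rate()
print(f"paths that ever reached 1/alpha: {rate:.3f} (Ville bound: 0.050)")
```

Every path is monitored at all 200 steps, yet the share of false rejections stays within the α bound up to simulation noise. This is the behavior that repeated peeking at a fixed-sample p-value does not offer.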
Conformal e-Prediction — e-values as the Core Test Statistic
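Where traditional methods turn nonconformity scores into p-values, conformal e-prediction computes an e-value from the calibration set for each candidate label y. Because a large e-value is evidence against the hypothesis "y is the true label," the prediction set keeps the labels whose e-values stay below 1/α.
The reason the 1/α threshold is valid lies in Markov's inequality.
Markov's Inequality (intuition): "A random variable with a small expected value is unlikely to take on large values." For a non-negative random variable E, P(E ≥ c) ≤ E[E]/c holds. Since E[E] ≤ 1 under the null hypothesis, setting c = 1/α gives P(E ≥ 1/α) ≤ α. That is, rejecting the null hypothesis when the e-value is at least 1/α comes with a guarantee that the error rate does not exceed α. The same argument shows that min(1, 1/E) is a valid p-value. The reverse direction needs more care: the plain reciprocal 1/p of a p-value is not an e-value, because its expectation under a uniformly distributed p is infinite (and grows like ln n for a conformal p-value built from n calibration points). A valid conversion uses a calibrator such as E = κ·p^(κ−1) with 0 < κ < 1, which integrates to 1 over [0, 1]. Such a generic calibrator is valid but not optimal. To increase power, it is necessary to design a betting function suited to the data distribution.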
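An exact calculation helps when choosing a p-to-e conversion. The sketch below assumes the idealized case in which a conformal p-value is uniform on the grid {1/(n+1), ..., 1}, and compares the expectation of the reciprocal 1/p with that of the power calibrator κ·p^(κ−1). The names `conformal_p_support` and `power_calibrator` are illustrative.

```python
def conformal_p_support(n_cal: int) -> list:
    """Possible values of a conformal p-value with n_cal calibration points."""
    m = n_cal + 1
    return [k / m for k in range(1, m + 1)]


def power_calibrator(p: float, kappa: float = 0.5) -> float:
    """p-to-e calibrator kappa * p**(kappa - 1); integrates to 1 over [0, 1]."""
    return kappa * p ** (kappa - 1)


n_cal = 99
ps = conformal_p_support(n_cal)  # each value is equally likely under exchangeability

mean_inverse = sum(1 / p for p in ps) / len(ps)
mean_calibrated = sum(power_calibrator(p) for p in ps) / len(ps)

print(f"E[1/p]              = {mean_inverse:.3f}")
print(f"E[0.5 * p**(-0.5)]  = {mean_calibrated:.3f}")
```

The reciprocal averages to the harmonic number H₁₀₀ (a little above 5), so it violates E[E] ≤ 1 and its products carry no anytime-valid guarantee. The calibrated version stays at or below 1, at the price of smaller values for any given p.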
Here is a summary of how conformal e-prediction differs from traditional conformal prediction:
| | Traditional Conformal Prediction (p-value) | Conformal e-Prediction (e-value) |
|---|---|---|
| Core test statistic | p-value | e-value (betting multiplier) |
| Unknown number of batches | Requires union bound | Bypassed via multiplicative combination |
| Sequential anytime validity | Limited | Natively supported |
| Handling label ambiguity | Binary (0/1) only | [0,1] fuzzy membership |
| Cross-conformal theoretical guarantee | Can break under excessive randomization | Theoretically guaranteed |
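```python
import numpy as np
from typing import List


def compute_conformal_evalue(cal_scores: np.ndarray, test_score: float,
                             kappa: float = 0.5) -> float:
    """
    Compute an e-value from calibration nonconformity scores and a test score.

    Intuition: expresses how extreme the test score is relative to the
    calibration set as a "multiplier". The conformal p-value is converted
    with the calibrator E = kappa * p**(kappa - 1), 0 < kappa < 1, which keeps
    the expectation at most 1 under exchangeability. The plain reciprocal 1/p
    does not have this property. A more refined betting strategy improves power.
    """
    n = len(cal_scores)
    # Share of calibration scores >= the test score -> conformal p-value
    p_value = (np.sum(cal_scores >= test_score) + 1) / (n + 1)
    return float(kappa * p_value ** (kappa - 1))


def conformal_evalue_predict(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                             test_X: np.ndarray,
                             threshold: float = 20.0) -> List[List[int]]:
    """
    Build prediction sets from e-values.

    A large e-value is evidence against a candidate label, so a label is kept
    while its e-value stays below threshold = 1/alpha (20.0 for alpha = 0.05).
    Expect wide sets: Markov-based thresholds are conservative.
    """
    cal_probs = clf.predict_proba(cal_X)
    test_probs_all = clf.predict_proba(test_X)  # precomputed with batch inference
    n_classes = cal_probs.shape[1]
    prediction_sets = []
    for test_probs in test_probs_all:
        pred_set = []
        for y_candidate in range(n_classes):
            # Class-conditional calibration scores for this candidate label
            cal_scores_y = 1 - cal_probs[cal_y == y_candidate, y_candidate]
            test_score_y = 1 - test_probs[y_candidate]
            if len(cal_scores_y) > 0:
                e_val = compute_conformal_evalue(cal_scores_y, test_score_y)
                if e_val < threshold:
                    pred_set.append(y_candidate)
        prediction_sets.append(pred_set)
    return prediction_sets
```

Practical Applications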
Example 1: Batch Anytime-Valid Conformal Prediction (Sequential Batch Scenario)
Consider a hospital scenario where clinical data for a new drug arrives sequentially in batches. Regulatory agencies, not knowing how many hospitals will contribute data, want statistically valid conclusions at any point in time. Traditional union bounds require knowing the total number of batches K, but the multiplicative property of e-values allows us to bypass this.
```python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ── Data preparation ─────────────────────────────────────────────
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.6, random_state=42
)
X_cal, X_test_all, y_cal, y_test_all = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Sequential batch simulation: data arrives from 5 hospitals in turn
n_batches = 5
batch_size = len(X_test_all) // n_batches
sequential_batches: List[Tuple[np.ndarray, np.ndarray]] = [
    (
        X_test_all[i * batch_size:(i + 1) * batch_size],
        y_test_all[i * batch_size:(i + 1) * batch_size],
    )
    for i in range(n_batches)
]


# ── Batch e-value computation ────────────────────────────────────
def compute_batch_evalue(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                         batch_X: np.ndarray, batch_y: np.ndarray,
                         kappa: float = 0.5) -> float:
    """
    Compute a batch e-value from one batch of labeled data.

    Implementation: each conformal p-value is calibrated into an e-value with
    E = kappa * p**(kappa - 1) (the plain reciprocal 1/p is not a valid
    e-value), and the batch e-value is the geometric mean of those values.
    The geometric mean never exceeds the arithmetic mean, which is itself a
    valid e-value, so the result is a valid but conservative per-sample
    measure of evidence that is comparable across batch sizes.

    Caution: the batch algorithm in arXiv:2503.13050 explicitly optimizes a
    betting function, so this simple version may have low power. Reusing one
    calibration set also makes the batch e-values dependent, so treat the
    cumulative product as a sketch. Check the original paper before
    production use.
    """
    cal_probs = clf.predict_proba(cal_X)
    cal_scores = 1 - cal_probs[np.arange(len(cal_y)), cal_y]
    batch_probs = clf.predict_proba(batch_X)
    batch_scores = 1 - batch_probs[np.arange(len(batch_y)), batch_y]
    n_cal = len(cal_scores)
    p_values = np.array([
        (np.sum(cal_scores >= s) + 1) / (n_cal + 1)
        for s in batch_scores
    ])
    # Calibrate each p-value, then take the geometric mean
    # (normalizes for batch size)
    individual_evalues = kappa * p_values ** (kappa - 1)
    batch_evalue = float(np.exp(np.mean(np.log(individual_evalues))))
    return batch_evalue


# ── Anytime monitor ──────────────────────────────────────────────
@dataclass
class BatchEValueMonitor:
    """
    Monitor that maintains the cumulative product of e-values for data
    arriving sequentially in batches. The test can be run at any time
    (anytime validity).
    """
    alpha: float = 0.05
    cumulative_evalue: float = 1.0
    batch_history: List[float] = field(default_factory=list)

    def update(self, batch_evalue: float) -> dict:
        """
        Update the cumulative value with a new batch e-value.
        Returns the rejection decision and the current effective error rate.
        """
        self.cumulative_evalue *= batch_evalue
        self.batch_history.append(batch_evalue)
        # Ville's inequality: P(exists t: E_cumul(t) >= 1/alpha) <= alpha
        reject = self.cumulative_evalue >= (1.0 / self.alpha)
        current_alpha = 1.0 / self.cumulative_evalue
        return {
            "batch_num": len(self.batch_history),
            "batch_evalue": batch_evalue,
            "cumulative_evalue": self.cumulative_evalue,
            "reject_null": reject,
            "current_effective_alpha": min(current_alpha, 1.0),
        }


# ── Sequential run ───────────────────────────────────────────────
monitor = BatchEValueMonitor(alpha=0.05)
for batch_X, batch_y in sequential_batches:
    e_val = compute_batch_evalue(clf, X_cal, y_cal, batch_X, batch_y)
    result = monitor.update(e_val)
    print(f"Batch {result['batch_num']:2d} | "
          f"e-value: {result['batch_evalue']:.3f} | "
          f"cumulative: {result['cumulative_evalue']:.3f} | "
          f"reject: {result['reject_null']}")
    if result["reject_null"]:
        print("-> Null hypothesis rejected: statistically significant evidence found")
        break  # the guarantee already obtained stays valid even if more batches arrive
```

Summary of the Batch e-value Accumulation Process
| Step | Action | Key Point |
|---|---|---|
| Build calibration set | Obtain nonconformity score distribution from initial data | Can be shared across batches or use a sliding window |
| Compute batch e-value | Convert each batch's p-values to e-values | Power varies depending on betting strategy |
| Update cumulative product | `E_cumul *= E_batch` | Validity maintained without knowing the number of batches K |
| Anytime test | Reject if `E_cumul ≥ 1/α` | α-level guarantee regardless of when you stop |
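The four steps can also be exercised without a model to see how the cumulative e-value behaves when the null hypothesis holds and when it does not. The sketch below uses synthetic uniform scores and the power calibrator κ·p^(κ−1) with κ = 0.5 as the p-to-e conversion; it then aggregates each batch with a geometric mean. All helper names and the shifted score distribution are assumptions of this example.

```python
import bisect
import math
import random


def conformal_p(sorted_cal, score):
    """Conformal p-value: share of calibration scores at least as large as `score`."""
    n = len(sorted_cal)
    n_ge = n - bisect.bisect_left(sorted_cal, score)
    return (n_ge + 1) / (n + 1)


def batch_evalue(sorted_cal, batch_scores, kappa=0.5):
    """Geometric mean of the calibrated e-values kappa * p**(kappa - 1)."""
    logs = [
        math.log(kappa) + (kappa - 1) * math.log(conformal_p(sorted_cal, s))
        for s in batch_scores
    ]
    return math.exp(sum(logs) / len(logs))


def first_crossing(draw_score, alpha=0.05, n_batches=20, batch_size=50, seed=0):
    """Return the 1-based batch index where E_cumul >= 1/alpha, or None."""
    rng = random.Random(seed)
    sorted_cal = sorted(rng.random() for _ in range(500))  # calibration scores
    e_cumul = 1.0
    for t in range(1, n_batches + 1):
        batch = [draw_score(rng) for _ in range(batch_size)]
        e_cumul *= batch_evalue(sorted_cal, batch)
        if e_cumul >= 1 / alpha:
            return t
    return None


# Null holds: batch scores come from the calibration distribution
null_stop = first_crossing(lambda rng: rng.random())
# Null violated: batch scores are pushed toward 1 (more nonconforming)
shift_stop = first_crossing(lambda rng: rng.random() ** 0.25)

print("stopping batch under the null:", null_stop)
print("stopping batch under a shift: ", shift_stop)
```

When the batches match the calibration distribution, each calibrated batch e-value tends to sit below 1 and the product drifts toward 0, so the monitor never fires. Under the shift the product grows geometrically and crosses 1/α after a handful of batches, with no need to fix the number of batches in advance.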
Example 2: Fuzzy Prediction Sets — Handling Ambiguous Labels
In radiology image interpretation, when multiple experts reach different diagnoses, it is difficult to define which label is the "correct" one in binary terms. Fuzzy prediction sets naturally accommodate this ambiguity by expressing label membership as a continuous value in the [0, 1] range.
```python
import numpy as np
from typing import List, Optional, Dict


def compute_fuzzy_membership(e_value: float, threshold: float = 1.0) -> float:
    """
    Convert an e-value into a fuzzy membership value in [0, 1].

    A large e-value is evidence against the candidate label, so membership
    decreases as the e-value grows:
        membership = 1 - log(1 + (e_value - threshold)) / log(10)
    - e_value <= threshold     -> 1.0 (fully included)
    - e_value >= threshold + 9 -> 0.0 (fully excluded; saturation point)
    - the log scale flattens the response for large e-values

    The saturation point (+9) is an arbitrary constant. Lower it to
    np.log1p(4.0) when fast saturation is needed (e.g. strict medical
    diagnosis), or raise it for widely spread data.
    """
    if e_value <= threshold:
        return 1.0
    return float(max(0.0, 1.0 - np.log1p(e_value - threshold) / np.log1p(9.0)))


def fuzzy_conformal_predict(clf, cal_X: np.ndarray, cal_y: np.ndarray,
                            test_X: np.ndarray,
                            ambiguous_labels: Optional[np.ndarray] = None
                            ) -> List[Dict[int, float]]:
    """
    Build fuzzy conformal prediction sets.

    ambiguous_labels: soft-label matrix of shape (n_cal_samples, n_classes).
        Each cell is the expert agreement rate (0 to 1) for that class.
        If None, hard labels are one-hot encoded.
    Returns: a list with one {label: membership} dict per test sample.
    """
    cal_probs = clf.predict_proba(cal_X)
    test_probs_all = clf.predict_proba(test_X)  # precomputed with batch inference
    n_classes = cal_probs.shape[1]
    if ambiguous_labels is None:
        ambiguous_labels = np.eye(n_classes)[cal_y]
    results = []
    for test_probs in test_probs_all:
        memberships: Dict[int, float] = {}
        for y_candidate in range(n_classes):
            soft_weights = ambiguous_labels[:, y_candidate]
            cal_scores = 1 - cal_probs[:, y_candidate]
            weighted_scores = cal_scores * soft_weights
            valid_mask = soft_weights > 0
            if valid_mask.sum() < 5:  # require a minimum number of samples
                memberships[y_candidate] = 0.0
                continue
            test_score = 1 - test_probs[y_candidate]
            e_val = compute_conformal_evalue(weighted_scores[valid_mask], test_score)
            memberships[y_candidate] = compute_fuzzy_membership(e_val)
        results.append(memberships)
    return results


# ── Usage example: medical diagnosis with labels from 3 experts ──
n_classes = clf.n_classes_
n_experts = 3
# Simulated expert labels; in practice these come from an annotation tool
np.random.seed(42)
expert_labels = np.random.randint(0, n_classes, size=(len(y_cal), n_experts))
# Build the per-class soft-label matrix
# ambiguous[i, c] = share of experts who assigned sample i to class c
ambiguous = np.zeros((len(y_cal), n_classes))
for c in range(n_classes):
    ambiguous[:, c] = np.mean(expert_labels == c, axis=1)
# Each row sums to 1.0 (the agreement rates across classes)
fuzzy_preds = fuzzy_conformal_predict(
    clf, X_cal, y_cal, X_test_all[:10],
    ambiguous_labels=ambiguous,
)
for i, memberships in enumerate(fuzzy_preds[:3]):
    print(f"\nTest sample {i}:")
    for label, membership in sorted(memberships.items(), key=lambda x: -x[1]):
        bar = "█" * int(membership * 20)
        print(f"  Class {label}: {membership:.3f} |{bar:<20}|")
```

Summary of the Fuzzy Prediction Set Construction Process
| Step | Action | Key Point |
|---|---|---|
| Build soft label matrix | Create (n_samples, n_classes) matrix from expert agreement rates | Normalize so each row sums to 1 |
| Compute weighted nonconformity scores | Apply soft label weights to nonconformity scores | Classes with lower agreement have reduced influence |
| Compute e-value | Derive e-value from the weighted score distribution | Lower agreement shrinks the weighted calibration scores, which lowers the p-value and raises the e-value against the label |
| Fuzzy membership transformation | Convert to [0,1] via `compute_fuzzy_membership` | Saturation threshold can be tuned to the domain |
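Downstream systems often still need a crisp set of labels. One common way to get there is to threshold the memberships, which fuzzy-set terminology calls an α-cut (this α is a membership level and is unrelated to the error rate α used elsewhere in this post). The sketch below assumes memberships in the `{label: membership}` format described above; `alpha_cut` and the example values are illustrative.

```python
from typing import Dict, List


def alpha_cut(memberships: Dict[int, float], level: float) -> List[int]:
    """Return the labels whose membership is at least `level` (0 < level <= 1)."""
    if not 0.0 < level <= 1.0:
        raise ValueError("level must lie in (0, 1]")
    return sorted(label for label, m in memberships.items() if m >= level)


# Hypothetical memberships for one test sample
memberships = {0: 0.92, 1: 0.35, 2: 0.0}

strict_set = alpha_cut(memberships, 0.5)   # only strongly supported labels
lenient_set = alpha_cut(memberships, 0.3)  # also keeps borderline labels

print("level 0.5:", strict_set)
print("level 0.3:", lenient_set)
```

Cuts are nested: raising the level can only remove labels, so a reviewer can start with a lenient cut and tighten it without ever seeing a label appear that was previously excluded.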
Strengths and Weaknesses
Strengths
| Item | Description | When It Is Especially Useful |
|---|---|---|
| Simultaneous guarantee with unknown number of batches | Provides valid coverage even without knowing K, via multiplicative combination of e-values | Multi-site clinical trials, batch A/B testing |
| Anytime validity | Statistical guarantees maintained at any arbitrary point in time by Ville's inequality | Sequential experiments with early stopping |
| Fuzzy prediction sets | Accommodates label ambiguity with continuous [0,1] values instead of binary inclusion | Environments with disagreeing expert annotations |
| Easy design of conditional predictors | Easier to construct per-input customized prediction sets than with p-value methods | Cases requiring per-subgroup guarantees |
| Cross-conformal validity | While p-value-based cross-conformal can break under excessive randomization, e-value-based is theoretically guaranteed | Cases with limited data requiring cross-validation |
| Handling ambiguous labels | Valid coverage guaranteed in environments with annotation noise or label ambiguity | Medical imaging, emotion recognition, etc. |
Weaknesses and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Exchangeability dependence | Assumption violated under strong temporal dependence (AR processes) | Sliding window recalibration, weighted conformal |
| Conservative prediction sets | Prediction sets can be wider than p-value methods with small calibration sets | Ensure sufficient calibration data (recommended n ≥ 200) |
| Computational cost | Full conformal requires model retraining for each candidate label | Use split conformal e-prediction to reduce computation |
| Software immaturity | Not natively supported in major libraries such as MAPIE or crepes | Refer to paper authors' GitHub code, implement directly |
| Distribution shift | Calibration set may become mismatched when distribution changes between batches | Use weighted e-values, adaptive recalibration strategies |
| Betting function design | A generic p-to-e calibrator such as E = κ·p^(κ−1) is valid but may have low power; the plain reciprocal 1/p is not a valid e-value | Optimize betting function based on domain knowledge (see arXiv:2503.13050) |
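The sliding-window recalibration listed as a mitigation can be sketched with a bounded buffer. The example assumes that labeled scores keep arriving so the buffer can be refreshed, and that exchangeability holds approximately within the window. `SlidingCalibration` is a hypothetical helper, not part of MAPIE or crepes.

```python
from collections import deque


class SlidingCalibration:
    """Keep only the most recent `window` nonconformity scores for calibration."""

    def __init__(self, window: int):
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically

    def update(self, score: float) -> None:
        """Add the score of a newly labeled example."""
        self.scores.append(score)

    def p_value(self, score: float) -> float:
        """Conformal p-value of `score` against the current window."""
        n_ge = sum(s >= score for s in self.scores)
        return (n_ge + 1) / (len(self.scores) + 1)


cal = SlidingCalibration(window=4)
for s in [0.1, 0.2, 0.3, 0.4]:      # scores from the old regime
    cal.update(s)
p_before = cal.p_value(0.9)          # 0.9 looks extreme against the old scores

for s in [0.8, 0.9, 1.0, 1.1]:      # scores after the distribution has shifted
    cal.update(s)
p_after = cal.p_value(0.9)           # 0.9 is typical once the window has refreshed

print("p-value before refresh:", p_before)
print("p-value after refresh: ", p_after)
```

A shorter window adapts faster but makes the p-values coarser, since the smallest attainable p-value is 1/(window + 1).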
Most Common Mistakes in Practice
- Not separating the calibration set from the training data: Computing nonconformity scores on the same data used to train the model invalidates the coverage guarantee. It is essential to maintain a separate held-out calibration set.
- Using the reciprocal of a p-value as an e-value: `E = 1/p` is not a valid e-value, and even a valid generic calibrator is not optimal. To increase power, a betting function tailored to the data is needed; omitting this can result in unnecessarily wide prediction sets.
- Assuming exchangeability across batches while ignoring distribution shift: In environments where the data distribution changes from batch to batch (seasonal time series, differences in hospital equipment, etc.), trusting the cumulative e-value without verification can cause the actual coverage to diverge significantly from the guaranteed value.
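The first mistake is easy to reproduce. The sketch below assumes a 1-nearest-neighbour regressor and the absolute residual as the nonconformity score. It uses regression rather than classification because the overfitting is extreme and easy to see; the data-generating function and all helper names are illustrative.

```python
import math
import random


def nn_predict(train, x):
    """1-nearest-neighbour regression: return y of the closest training x."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]


def conformal_quantile(scores, alpha):
    """ceil((n+1)(1-alpha))-th smallest score, or infinity if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def coverage(train, cal, test, alpha=0.1):
    """Share of test points whose residual is within the calibrated half-width."""
    q = conformal_quantile([abs(y - nn_predict(train, x)) for x, y in cal], alpha)
    return sum(abs(y - nn_predict(train, x)) <= q for x, y in test) / len(test)


rng = random.Random(0)


def sample(n):
    """Draw n points from y = sin(2*pi*x) + Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)) for x in xs]


train, cal, test = sample(200), sample(200), sample(500)

cov_reused = coverage(train, train, test)   # mistake: calibrate on the training data
cov_heldout = coverage(train, cal, test)    # correct: separate calibration set

print(f"coverage, calibrated on training data: {cov_reused:.3f}")
print(f"coverage, calibrated on held-out data: {cov_heldout:.3f}")
```

The memorizing model has zero residual on its own training points, so the calibrated interval collapses and almost no test point is covered. With a held-out calibration set the coverage returns to roughly the 90% target.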
Closing Thoughts
By combining e-values with conformal prediction, you can maintain statistically rigorous uncertainty guarantees even in sequential environments with an unknown number of batches and even with real-world data featuring ambiguous labels. The theoretical foundation traces back to Vovk's 2020 work, and arXiv:2503.13050 in 2025 rapidly extended it into three practical applications: batch anytime-valid prediction, fuzzy prediction sets, and handling ambiguous ground truth. The online conformal prediction covered in the next post connects naturally to this batch e-value accumulation concept, so familiarizing yourself with the BatchEValueMonitor flow from this post will be very helpful.
Here are 3 steps you can take right now.
- Hands-on conformal prediction basics: After `pip install mapie`, run the classification example in the official documentation to directly verify the coverage guarantee of split conformal. MAPIE does not natively support e-values, but it is a good starting point for understanding the nonconformity score computation structure.
- Experiment by implementing e-values directly: Try applying the `BatchEValueMonitor` code from this post to `make_classification` data. You can use the visualization code below to see how actual coverage changes as you vary the calibration set size (100, 300, 600).

```python
import matplotlib.pyplot as plt

cal_sizes = [100, 300, 600]  # X_cal holds 600 samples in the split used above
coverages = []
for cal_size in cal_sizes:
    cal_X_sub = X_cal[:cal_size]
    cal_y_sub = y_cal[:cal_size]
    pred_sets = conformal_evalue_predict(clf, cal_X_sub, cal_y_sub, X_test_all)
    # Share of test points whose true label is inside the prediction set
    coverage = np.mean([
        y_true in pred_set
        for y_true, pred_set in zip(y_test_all, pred_sets)
    ])
    coverages.append(coverage)

plt.figure(figsize=(8, 4))
plt.plot(cal_sizes, coverages, marker="o", label="Actual coverage")
plt.axhline(y=0.95, color="red", linestyle="--", label="Target coverage (95%)")
plt.xlabel("Calibration set size")
plt.ylabel("Actual coverage")
plt.title("Calibration set size vs. coverage")
plt.legend()
plt.tight_layout()
plt.show()
```

- Read the original paper: arXiv:2503.13050 "E-Values Expand the Scope of Conformal Prediction" presents the batch setting algorithm clearly in pseudocode, making it a great resource for confirming the theoretical foundations before integrating into a real system.
Next post: Online Conformal Prediction — how to maintain real-time coverage on streaming data using a decreasing step size. We will explore how this connects to the batch e-value accumulation concept covered in this post.
References
Introductory — Good starting points for first-time readers
- A Gentle Introduction to Conformal Prediction | arXiv:2107.07511 — A suitable starting point for those encountering conformal prediction for the first time.
- MAPIE — scikit-learn-contrib | GitHub — Useful for understanding the p-value-based conformal prediction structure through code.
- awesome-conformal-prediction | GitHub curated list — Browse papers, tutorials, and libraries all in one place.
Core — Resources directly connected to this post
- E-Values Expand the Scope of Conformal Prediction | arXiv:2503.13050 — The original paper on batch anytime-valid prediction and fuzzy prediction sets.
- Conformal e-prediction — Vovk | ScienceDirect 2025 — The theoretical foundation of conformal e-prediction.
- Fuzzy Prediction Sets: Conformal Prediction with E-values | arXiv:2509.13130 — Covers the connection between fuzzy prediction sets and decision theory.
Further Reading