How to Statistically Terminate Canaries Early with Futility Stopping and Hierarchical Testing: A Practical Guide to Beta-Spending Design
This article assumes knowledge of A/B testing basics (p-value, null hypothesis) and experience with canary deployment. It is suitable for backend and data engineers operating canary pipelines.
When operating a canary deployment, two painful situations keep recurring. First, the waste of pushing a clearly ineffective version all the way to the final deployment stage: even though the purchase conversion rate was already below baseline at the first interim analysis, unnecessary user exposure continues under the pretext that "more data is needed." Second, the confusion of false positives firing "even though nothing has actually changed" while simultaneously testing 10 metrics such as purchase conversion rate, response latency, and error rate. If each metric is tested independently at α=0.05, the probability that at least one is a false positive is 1-(0.95)^10 ≈ 40%.
This article covers the end-to-end design for early termination of failing canaries: setting futility boundaries with Beta-Spending, and controlling FWER at or below α in multi-metric environments with Hierarchical Testing. We examine, step by step, the practice of experimentation platforms such as Statsig and Eppo, boundary calculation with R's gsDesign package, Python pipeline integration, and integration with Kubernetes Argo Rollouts. By the end of this article, you will have Python code and R scripts that you can attach to your canary pipelines immediately.
Key Concepts
The three tools covered in this section—Alpha-Spending, Beta-Spending, and Hierarchical Testing—can be used independently, but combining these three completes a Canary Pipeline that is "statistically valid even after multiple checks, automatically stops ineffective experiments, and prevents the accumulation of false positives across multiple metrics." Let's build each concept step by step.
Alpha-Spending and Beta-Spending: The Two Axes of Error Budgeting
Sequential testing is a design in which you look at the data multiple times before all of it has been collected. The problem is that false positives accumulate with every interim look: if the same experiment is checked weekly and declared "significant" whenever p < 0.05, the actual error rate far exceeds α. The key idea that solves this is allocating the error budget across the looks.
| Classification | Controlled Error | Boundary Direction | Decision When Crossed |
|---|---|---|---|
| Alpha-Spending | Type I error (false positive, α) | Upper bound | Crossed above → effect detected, proceed with deployment |
| Beta-Spending | Type II error (false negative, β) | Lower bound | Crossed below → deemed futile, terminate canary |
The Lan-DeMets (1983) method, the de facto standard for Alpha-Spending, has the advantage that the number of interim analyses need not be fixed in advance. The O'Brien-Fleming-type spending function consumes almost no α at the early interim analyses, so the final analysis retains a threshold nearly identical to that of a fixed-sample test.
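To see how slowly O'Brien-Fleming-type spending consumes the budget, here is a minimal Python sketch of the Lan-DeMets spending function (the same shape gsDesign's sfLDOF implements), assuming a one-sided α of 0.025:

```python
from scipy import stats

def obf_spending(t: float, alpha: float = 0.025) -> float:
    """Cumulative alpha spent by information fraction t (0 < t <= 1).

    Lan-DeMets O'Brien-Fleming-type spending function:
        alpha(t) = 2 * (1 - Phi( Phi^{-1}(1 - alpha/2) / sqrt(t) ))
    """
    z = stats.norm.ppf(1 - alpha / 2)
    return float(2 * (1 - stats.norm.cdf(z / t ** 0.5)))

for t in (0.10, 0.25, 0.50, 1.00):
    print(f"t={t:.2f}  cumulative alpha spent: {obf_spending(t):.6f}")
```

At t=0.10 essentially nothing is spent, by t=0.50 only about 6% of the budget is gone, and the full 0.025 becomes available only at the final analysis — which is exactly why the final threshold stays close to the fixed-sample one.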
Terminology — Futility Stopping: The act of ending an experiment early when it is determined, based on the data collected so far, that there is a low probability of obtaining meaningful results even if the experiment is continued to the end. This reduces resource waste and unnecessary user exposure.
Conditional Power and Dynamic Futility Boundaries
The CP-based β-spending function published by Ni et al. in the Biometrical Journal in 2024 takes the classical fixed futility boundary one step further.
Conditional Power (CP) estimates the "probability of rejecting the null hypothesis in the final analysis" in real time based on the data accumulated to date. While classical β-spending determined the boundary based solely on the analysis time point (information fraction), CP-based methods dynamically adjust the futility boundary by reflecting the trend of the actual observed effect size.
# Python 3.9+
import numpy as np
from scipy import stats

def conditional_power(
    z_current: float,     # test statistic at the current look
    t: float,             # information fraction (information collected / planned total)
    theta: float,         # drift parameter: expected value of Z at the final analysis (t=1)
    alpha: float = 0.025  # one-sided significance level
) -> float:
    """
    Conditional power under an assumed drift parameter theta.

    Using the B-value decomposition B(t) = Z(t) * sqrt(t), the final
    statistic given the data at information fraction t is distributed as
        Z(1) | Z(t)  ~  N( Z(t)*sqrt(t) + theta*(1 - t),  1 - t ).
    CP is the probability that Z(1) exceeds z_alpha.
    Pass theta = z_current / sqrt(t) to condition on the observed trend
    instead of the design effect. CP < 0.2 is a common futility-review trigger.
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    b_final_mean = z_current * np.sqrt(t) + theta * (1 - t)
    return float(1 - stats.norm.cdf((z_alpha - b_final_mean) / np.sqrt(1 - t)))

# Example: 50% of the data collected, observed effect at about half the planned level
cp_design = conditional_power(z_current=0.8, t=0.5, theta=2.0)              # design effect
cp_trend = conditional_power(z_current=0.8, t=0.5, theta=0.8 / 0.5 ** 0.5)  # observed trend
print(f"CP (design effect):  {cp_design:.3f}")  # → 0.289
print(f"CP (observed trend): {cp_trend:.3f}")   # → 0.121 → below 0.2, futility review

Caution — In a canary environment, the traffic fraction and the information fraction are not necessarily the same. If the variance of the canary and baseline groups differ, or the response rate changes over time, the amount of statistical information actually collected diverges from the traffic fraction. Ignoring this can trigger the futility boundary earlier or later than intended. In practice, track the variance in real time and compute the information fraction separately.
Hierarchical Testing and FWER Control
If 10 metrics are each tested at α=0.05, the probability that at least one is a false positive reaches 1-(0.95)^10 ≈ 40%. This is why the Family-Wise Error Rate (FWER) must be controlled.
The simple Bonferroni correction (α/k) controls FWER but sacrifices considerable power: with k=10, the per-metric α drops to 0.005, making even real effects hard to detect. Moreover, Bonferroni is valid under arbitrary dependence between metrics — which is precisely why it is conservative. Real metrics (purchase conversion rate, session duration, click-through rate) are often positively correlated, and under positive correlation the correction is more stringent than necessary. To mitigate this overconservatism, permutation-based multiple testing or the Šidák correction can be considered.
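The arithmetic is quick to check. A small sketch (plain Python) comparing the uncorrected FWER with the Bonferroni and Šidák per-metric thresholds for k=10 metrics:

```python
k, alpha = 10, 0.05

# FWER if all k metrics are tested independently at alpha, uncorrected
fwer_uncorrected = 1 - (1 - alpha) ** k

# Per-metric thresholds: Bonferroni (valid under any dependence) vs
# Sidak (exact under independence, slightly less conservative)
bonferroni = alpha / k
sidak = 1 - (1 - alpha) ** (1 / k)

print(f"uncorrected FWER:        {fwer_uncorrected:.3f}")  # → 0.401
print(f"Bonferroni per-metric α: {bonferroni:.5f}")        # → 0.00500
print(f"Šidák per-metric α:      {sidak:.5f}")             # → 0.00512
```

The gap between Bonferroni and Šidák is small here; the real power recovery comes from the hierarchical structure, which spends α only where the metrics matter most.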
Hierarchical Testing (Gatekeeping) minimizes this loss by utilizing logical priorities among metrics.
[Primary metric: purchase conversion rate]  ← full α allocated
      │
      │ significant (p < α)
      ▼
[Secondary metric group: session time, click-through rate]  ← Bonferroni at α/2 each
      │
      │ at least one significant within the group (disjunctive)
      ▼
[Tertiary metric group: page load, API error rate]  ← Bonferroni at α/2 each

Mathematical Guarantee: if an α-level procedure is applied to the primary metric, and each lower level is tested only on the condition that the level above was rejected, the overall FWER is controlled at α or less. This is proven via the Closed Testing Principle.
In the structure above, the rule "open the next gate if at least one metric in the level is significant" is disjunctive gatekeeping. It has higher power than the conjunctive rule — "all metrics must be significant to open the next gate" — but is correspondingly less conservative. In either case the FWER ≤ α guarantee holds, but the chosen rule must be pre-registered before the experiment begins.
Terminology — Gatekeeping: a multiple-testing design in which the gate to a lower-level test opens only after the higher-level test passes. If a gate is opened without the level above passing, FWER is no longer controlled.
Overall Structure of Group Sequential Design
The overall framework combining Alpha-Spending, Beta-Spending, and hierarchical testing is as follows.
Z-statistic
 4 ┤
   │  ━━┓            ← upper (efficacy): O'Brien-Fleming starts very high
 3 ┤    ┗━━┓            and gradually decreases with each analysis
   │       ┗━━━┓
 2 ┤           ┗━━━━━━━━━━━  (final: ≈ fixed-sample threshold)
   │                ┌ ─ ─ ─ ─ ─  (final: converges with the upper bound)
 1 ┤           ┌ ─ ─┘
   │      ┌ ─ ─┘     ← lower (futility): starts negative (almost always passes)
 0 ┼──────┼─────────────────────► information fraction t
-1 ┤ ─ ─ ─┘            and gradually rises with each analysis
   │
     t=0.10    t=0.25    t=0.50    t=1.00

statistic > upper bound → effect detected → proceed with rollout or roll back immediately
statistic < lower bound → futility → terminate canary
between the boundaries  → continue

Key characteristics of the O'Brien-Fleming boundary: at the first look (t=0.10) the upper bound is very strict (e.g., Z > 3.47) to prevent premature rollout decisions, while the lower bound is very loose (e.g., Z < -0.48) to prevent early termination due to noise. At the final analysis (t=1.0) the two boundaries converge, forcing a final decision.
Practical Application
The four examples below use a single virtual e-commerce API server canary deployment as a common scenario.
- Primary metric: purchase conversion rate (purchase_rate)
- Secondary metrics: session time (session_time), click-through rate (click_rate)
- Tertiary metrics: page load time (page_load_ms), API error rate (error_rate)
- Checkpoints: traffic 10% → 25% → 50% → 100%
Example 1: Designing Canary Deployment Boundaries with R gsDesign
# install.packages("gsDesign")
library(gsDesign)
# 4 analyses (3 interim + 1 final), one-sided α=0.025, β=0.20 (80% power)
# delta1 = 0.3: a "small-to-medium" effect size in Cohen's d terms,
# corresponding to roughly a 0.5%p difference in purchase conversion rate.
# Choose it from your past A/B test history as the "minimum effect worth detecting."
design <- gsDesign(
  k = 4,           # 3 interim analyses + 1 final
  test.type = 4,   # upper efficacy bound + non-binding futility lower bound
  alpha = 0.025,   # one-sided α
  beta = 0.20,     # Type II error (80% power)
  sfu = sfLDOF,    # Alpha-Spending: O'Brien-Fleming-like
  sfl = sfLDOF,    # Beta-Spending: futility boundary
  delta1 = 0.3,    # minimum effect size to detect (Cohen's d)
  n.fix = 1000     # fixed-sample reference size
)
print(design)
# Example output:
# Analysis    N   Z (upper)  Z (lower/futility)
#        1  310        3.47               -0.48
#        2  620        2.78                0.94
#        3  930        2.29                1.71
#        4 1051        2.02                2.02

| Column | Meaning |
|---|---|
| N | Cumulative sample size required up to the analysis point (total 1051, ≈5% inflation over the fixed sample of 1000) |
| Z (upper) | Effect detected when exceeded → proceed with deployment |
| Z (lower) | Deemed futile when not met → terminate canary |
Why the lower bound of the first interim analysis is -0.48: early on, when only 10% of the data has been collected, noise dominates, so a slightly negative test statistic does not establish the absence of an effect. Only a value below -0.48 — an effect clearly in the opposite direction, i.e., an obvious regression — terminates the analysis. As the analyses progress, the lower bound rises (0.94 → 1.71 → 2.02), making the futility judgment increasingly strict.
Note: O'Brien-Fleming conserves α early on (upper limit 3.47), keeping the final analysis similar to the fixed sample standard (2.02). On the other hand, the lower limit of futility becomes stricter over time.
Example 2: Integrating into a Canary Deployment Pipeline with Python
Apply the boundary values calculated by gsDesign in the Python pipeline. When selecting the checkpoint to evaluate against, the important point is to use the most recently passed checkpoint, not a future checkpoint that has not yet been reached.
# Python 3.9+
from dataclasses import dataclass
from typing import Literal

@dataclass
class SequentialBoundary:
    upper: float  # efficacy (effect detection) boundary
    lower: float  # futility boundary

# Boundary values taken from the gsDesign output in Example 1
BOUNDARIES: dict[float, SequentialBoundary] = {
    0.10: SequentialBoundary(upper=3.47, lower=-0.48),
    0.25: SequentialBoundary(upper=2.78, lower=0.94),
    0.50: SequentialBoundary(upper=2.29, lower=1.71),
    1.00: SequentialBoundary(upper=2.02, lower=2.02),
}

def evaluate_canary(
    z_stat: float,
    traffic_fraction: float
) -> Literal["continue", "deploy", "stop_futility"]:
    """
    Decide the canary action from the current test statistic and traffic fraction.
    Uses the most recently passed checkpoint as the reference.
    """
    checkpoints = sorted(BOUNDARIES.keys())
    passed = [c for c in checkpoints if c <= traffic_fraction]
    if not passed:
        return "continue"  # first checkpoint not reached yet
    checkpoint = passed[-1]  # most recently passed checkpoint
    boundary = BOUNDARIES[checkpoint]
    if z_stat >= boundary.upper:
        return "deploy"  # effect detected → full rollout
    elif z_stat <= boundary.lower:
        return "stop_futility"  # futility → terminate canary
    else:
        return "continue"  # keep going until the next analysis

# Usage: z=0.72 at the 25% checkpoint
result = evaluate_canary(z_stat=0.72, traffic_fraction=0.25)
print(f"Canary decision: {result}")  # → stop_futility (0.72 < 0.94 lower bound)

Example 3: Implementing Multi-Metric Hierarchical Testing
It is practical to separate gatekeeping logic from p-value calculation. Different methods are used for p-values depending on the metric characteristics (ratio, continuous, count, etc.), while only the gatekeeping logic is reused commonly.
# Python 3.9+
import numpy as np
from scipy import stats as scipy_stats

def compute_proportion_pvalue(
    obs_count: int, obs_total: int,
    ctrl_count: int, ctrl_total: int
) -> float:
    """Two-sided z-test p-value for a difference in proportions."""
    p_obs = obs_count / obs_total
    p_ctrl = ctrl_count / ctrl_total
    p_pool = (obs_count + ctrl_count) / (obs_total + ctrl_total)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / obs_total + 1 / ctrl_total))
    if se == 0:
        return 1.0
    z = (p_obs - p_ctrl) / se
    return float(2 * (1 - scipy_stats.norm.cdf(abs(z))))

def hierarchical_gate(
    p_values: dict[str, float],
    hierarchy: list[list[str]],
    alpha: float = 0.05
) -> dict[str, dict]:
    """
    Hierarchical gatekeeping: a lower level is tested only if the level
    above was significant. Within each level, apply Bonferroni correction;
    if at least one metric is significant (disjunctive), open the gate to
    the next level. Guarantees FWER <= alpha.
    To switch to the more conservative "all significant" (conjunctive) rule,
    change the gate update to require every metric in the level.
    """
    results: dict[str, dict] = {}
    gate_open = True
    for metric_group in hierarchy:
        if not gate_open:
            for name in metric_group:
                results[name] = {"tested": False, "reason": "gate_closed"}
            continue
        alpha_adj = alpha / len(metric_group)  # within-level Bonferroni
        level_any_significant = False
        for name in metric_group:
            p = p_values.get(name, 1.0)
            significant = p < alpha_adj
            if significant:
                level_any_significant = True
            results[name] = {
                "tested": True,
                "p_value": p,
                "alpha_used": alpha_adj,
                "significant": significant,
            }
        gate_open = level_any_significant  # disjunctive gatekeeping
    return results

# ─────────────────────────────────────────────────────────────
# Usage: e-commerce canary, data at the 25% checkpoint (500 users per arm)
# ─────────────────────────────────────────────────────────────
raw_p_values = {
    # Primary: purchase conversion rate (canary 27/500 vs baseline 25/500)
    "purchase_rate": compute_proportion_pvalue(27, 500, 25, 500),
    # Secondary: session time, click-through rate (in practice, use a suitable test such as a t-test)
    "session_time": 0.08,
    "click_rate": 0.03,
    # Tertiary: page load, API error rate
    "page_load_ms": 0.12,
    "error_rate": 0.04,
}
hierarchy = [
    ["purchase_rate"],               # level 1: most important
    ["session_time", "click_rate"],  # level 2: only after level 1 passes
    ["page_load_ms", "error_rate"],  # level 3: only after level 2 passes
]
gate_results = hierarchical_gate(raw_p_values, hierarchy, alpha=0.05)
for name, r in gate_results.items():
    if r["tested"]:
        sig = "★ significant" if r["significant"] else "  not significant"
        print(f"{sig} | {name}: p={r['p_value']:.4f} (α_adj={r['alpha_used']:.4f})")
    else:
        print(f"  untested | {name}: {r['reason']}")

Example 4: Actual Application in Kubernetes + Argo Rollouts
After pre-computing the z-statistic in Prometheus with a recording rule, wire the boundary values into an Argo Rollouts AnalysisTemplate. The futility_lower values below are taken directly from the gsDesign output of Example 1.
# argo-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server-canary
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10          # first checkpoint: 10% traffic
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: sequential-boundary-check
            args:
              - name: checkpoint
                value: "0.10"
              - name: futility_lower   # see Example 1: 10% lower bound = -0.48
                value: "-0.48"
        - setWeight: 25          # second checkpoint: 25% traffic
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: sequential-boundary-check
            args:
              - name: checkpoint
                value: "0.25"
              - name: futility_lower   # see Example 1: 25% lower bound = 0.94
                value: "0.94"
        - setWeight: 50
        - pause: {duration: 20m}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: sequential-boundary-check
spec:
  args:
    - name: checkpoint
    - name: futility_lower
  metrics:
    - name: z-stat-purchase-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          # A Prometheus recording rule pre-computes the canary vs. baseline z-statistic
          query: |
            canary_z_statistic{
              metric="purchase_rate",
              checkpoint="{{args.checkpoint}}"
            }
      # "continue" succeeds only while the statistic stays above the futility
      # lower bound (gsDesign output from Example 1)
      successCondition: "result[0] > {{args.futility_lower}}"
      failureCondition: "result[0] <= {{args.futility_lower}}"

Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Rapid failure detection | Ineffective canaries are terminated early, before full user exposure, minimizing damage |
| Resource conservation | Reduces the cost of continuing pointless experiments (compute, engineering time) |
| Statistical rigor | FWER control prevents false-positive accumulation across multiple metrics |
| Flexible monitoring | The Lan-DeMets method does not require fixing the number and timing of interim analyses in advance |
| Design flexibility | Non-binding futility boundaries leave room for business judgment |
| Power efficiency | Hierarchical testing retains more power on secondary metrics than flat Bonferroni |
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Risk of false futility | A truly effective canary can be discontinued early | Use non-binding boundaries; set the CP threshold conservatively (0.1–0.15) |
| Dependence on pre-planning | α/β-spending functions and effect sizes are only valid if pre-registered before the experiment | Make an Experiment Design Doc process mandatory |
| Sample inflation | More total samples are required than a fixed-sample design of equal power | Pre-compute the inflation factor from gsDesign's n.I field |
| Hierarchy design complexity | Mis-specified priorities between metrics can forfeit the chance to detect a critical metric | Agree on the hierarchy in advance in joint workshops between product and statistics teams |
| Overconservatism with correlated metrics | Real metrics are correlated, so Bonferroni can be overconservative | Consider permutation-based multiple testing or the Šidák correction |
Terminology Supplement — Binding vs. Non-binding Futility Boundary: with a binding boundary, termination is mandatory once the lower bound is crossed, and the α calculation takes credit for this (slightly lowering the efficacy bounds); if the team then continues anyway, the Type I error guarantee is broken. A non-binding boundary statistically recommends termination but allows continuing on business judgment; it costs slightly more sample, but the α guarantee survives even when the boundary is overridden — which is what buys the operational flexibility.
The Most Common Mistakes in Practice
- Retrospectively adjusting boundaries after analysis: The moment you adjust the boundary "just this once," the FWER guarantee is broken. Boundaries must be locked in the code/documentation before the experiment starts.
- Treating all metrics equally without primary metrics: Dividing 10 metrics using Bonferroni without a hierarchical structure excessively lowers the power of the truly important indicators. Be sure to define 1 or 2 primary metrics first.
- Setting only Beta-Spending and omitting Alpha-Spending: If there is only a futility boundary and no efficacy boundary, the criterion for early detection of effect in the intermediate analysis disappears. Both boundaries must be designed together.
In Conclusion
By setting a futility boundary with Beta-Spending and controlling FWER with hierarchical testing, canary deployment is elevated from "slow observation" to "automated statistical decision-making." Experimentation platforms such as Statsig and Eppo have adopted this approach as a default option because its practicality has already been proven.
3 Steps to Start Right Now:
- Run the boundary calculation in R: after running install.packages("gsDesign"), use the one-liner below to print the α/β boundary values for your current canary sample size and share them in the team Slack. Just sharing those first numbers starts the design discussion.

  gsDesign(k=4, test.type=4, alpha=0.025, beta=0.20, sfu=sfLDOF, sfl=sfLDOF, delta1=0.3, n.fix=1000)

- Document the metric hierarchy: take the list of metrics used in the current experiment, explicitly designate the 1–2 primary metrics for which "if this is not significant, there is no need to look at the rest," and record them in the Experiment Design Doc.
- Wire the z-statistic into an Argo Rollouts or Flagger AnalysisTemplate: pre-compute the canary vs. baseline z-statistic with a Prometheus recording rule, and plug the lower boundary value from Step 1 into successCondition to build an automatic futility-termination pipeline.
Finally, let us note one limitation: the method described in this article requires the number of interim analyses and their checkpoints to be determined in advance. mSPRT, the subject of the next article, removes exactly this constraint — it enables Anytime-Valid testing, which lets you stop or continue monitoring at any time.
Next Post: A Comparative Analysis of How mSPRT (Mixture Sequential Probability Ratio Test) Guarantees Anytime-Valid p-values and How the Implementations of Netflix, Optimizely, and Spotify Differ
Reference Materials
- What is Beta-Spending? | Analytics ToolKit Glossary
- Beta spending function based on conditional power in group sequential design (2024) | PubMed
- Sequential A/B Testing Keeps the World Streaming — Part 1 | Netflix TechBlog
- Sequential A/B Testing Keeps the World Streaming — Part 2 | Netflix TechBlog
- Beyond Bonferroni: Hierarchical Multiple Testing in Empirical Research | NBER Working Paper (2025)
- Hierarchical Testing of Multiple Endpoints in Group-Sequential Trials | Statistics in Medicine
- A gatekeeping procedure to test a primary and a secondary endpoint in a group sequential design | PubMed
- Futility Monitoring in Clinical Trials | PMC/NIH
- gsDesign: Spending Function Overview | R Documentation
- Defining Group Sequential Boundaries | rpact
- Choosing a Sequential Testing Framework | Spotify Engineering (2023)
- Bonferroni Correction for Multiple Comparisons | Statsig Docs
- Sequential Testing | Eppo Docs
- Introducing Kayenta: Automated Canary Analysis Tool | Google Cloud Blog
- A Flexible Futility Monitoring Method with Time-Varying Conditional Power Boundary | PMC
- A Gentle Introduction to Group Sequential Design | CRAN gsDesign