How to Statistically Terminate Canaries Early with Futility Stopping and Hierarchical Testing: A Practical Guide to Beta-Spending Design
This article assumes knowledge of A/B testing basics (p-value, null hypothesis) and experience with canary deployment. It is suitable for backend and data engineers operating canary pipelines.
When operating a canary deployment, two painful situations keep recurring. First, the waste of pushing a clearly ineffective version all the way to the final deployment stage: even though the purchase conversion rate was already below baseline at the first interim analysis, unnecessary user exposure continues under the pretext that "more data is needed." Second, the confusion of false positives firing "even though nothing has actually changed" while simultaneously testing 10 metrics such as purchase conversion rate, response latency, and error rate. If each metric is tested independently at α=0.05, the probability that at least one is a false positive is 1-(0.95)^10 ≈ 40%.
This article covers the end-to-end design for early termination of failing canaries: setting futility boundaries with Beta-Spending, and controlling FWER at or below α in multi-metric environments with Hierarchical Testing. We examine, step by step, the practice of experimentation platforms such as Statsig and Eppo, boundary calculation with R's gsDesign package, Python pipeline integration, and integration with Kubernetes Argo Rollouts. By the end of this article, you will have Python code and R scripts that you can attach to your canary pipelines immediately.
Key Concepts
The three tools covered in this section—Alpha-Spending, Beta-Spending, and Hierarchical Testing—can be used independently, but combining these three completes a Canary Pipeline that is "statistically valid even after multiple checks, automatically stops ineffective experiments, and prevents the accumulation of false positives across multiple metrics." Let's build each concept step by step.
Alpha-Spending and Beta-Spending: The Two Axes of Error Budgeting
Sequential testing is a design in which you look at the data multiple times before all of it has been collected. The problem is that false positives accumulate with every interim look: if the same experiment is checked weekly and declared "significant" whenever p < 0.05, the actual error rate far exceeds α. The key idea that solves this is allocating the error budget across the looks.
| Classification | Controlled Error | Boundary Direction | Decision When Crossed |
|---|---|---|---|
| Alpha-Spending | Type I error (false positive, α) | Upper bound | Crossed above → effect detected, proceed with deployment |
| Beta-Spending | Type II error (false negative, β) | Lower bound | Crossed below → deemed futile, terminate canary |
The Lan-DeMets (1983) method, the de facto standard for Alpha-Spending, has the advantage that the number of interim analyses need not be fixed in advance. The O'Brien-Fleming-type spending function consumes almost no α at the early interim analyses, so the final analysis retains a threshold nearly identical to that of a fixed-sample test.
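To see how slowly O'Brien-Fleming-type spending consumes the budget, here is a minimal Python sketch of the Lan-DeMets spending function (the same shape gsDesign's sfLDOF implements), assuming a one-sided α of 0.025:

```python
from scipy import stats

def obf_spending(t: float, alpha: float = 0.025) -> float:
    """Cumulative alpha spent by information fraction t (0 < t <= 1).

    Lan-DeMets O'Brien-Fleming-type spending function:
        alpha(t) = 2 * (1 - Phi( Phi^{-1}(1 - alpha/2) / sqrt(t) ))
    """
    z = stats.norm.ppf(1 - alpha / 2)
    return float(2 * (1 - stats.norm.cdf(z / t ** 0.5)))

for t in (0.10, 0.25, 0.50, 1.00):
    print(f"t={t:.2f}  cumulative alpha spent: {obf_spending(t):.6f}")
```

At t=0.10 essentially nothing is spent, by t=0.50 only about 6% of the budget is gone, and the full 0.025 becomes available only at the final analysis — which is exactly why the final threshold stays close to the fixed-sample one.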
Terminology — Futility Stopping: The act of ending an experiment early when it is determined, based on the data collected so far, that there is a low probability of obtaining meaningful results even if the experiment is continued to the end. This reduces resource waste and unnecessary user exposure.
Conditional Power and Dynamic Futility Boundaries
The CP-based β-spending function published by Ni et al. in the Biometrical Journal in 2024 takes the classical fixed futility boundary one step further.
Conditional Power (CP) estimates the "probability of rejecting the null hypothesis in the final analysis" in real time based on the data accumulated to date. While classical β-spending determined the boundary based solely on the analysis time point (information fraction), CP-based methods dynamically adjust the futility boundary by reflecting the trend of the actual observed effect size.
# Python 3.9+
import numpy as np
from scipy import stats

def conditional_power(
    z_current: float,     # test statistic at the current look
    t: float,             # information fraction (information collected / planned total)
    theta: float,         # drift parameter: expected value of Z at the final analysis (t=1)
    alpha: float = 0.025  # one-sided significance level
) -> float:
    """
    Conditional power under an assumed drift parameter theta.

    Using the B-value decomposition B(t) = Z(t) * sqrt(t), the final
    statistic given the data at information fraction t is distributed as
        Z(1) | Z(t)  ~  N( Z(t)*sqrt(t) + theta*(1 - t),  1 - t ).
    CP is the probability that Z(1) exceeds z_alpha.
    Pass theta = z_current / sqrt(t) to condition on the observed trend
    instead of the design effect. CP < 0.2 is a common futility-review trigger.
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    b_final_mean = z_current * np.sqrt(t) + theta * (1 - t)
    return float(1 - stats.norm.cdf((z_alpha - b_final_mean) / np.sqrt(1 - t)))

# Example: 50% of the data collected, observed effect at about half the planned level
cp_design = conditional_power(z_current=0.8, t=0.5, theta=2.0)              # design effect
cp_trend = conditional_power(z_current=0.8, t=0.5, theta=0.8 / 0.5 ** 0.5)  # observed trend
print(f"CP (design effect):  {cp_design:.3f}")  # → 0.289
print(f"CP (observed trend): {cp_trend:.3f}")   # → 0.121 → below 0.2, futility review

Caution — In a canary environment, the traffic fraction and the information fraction are not necessarily the same. If the variance of the canary and baseline groups differ, or the response rate changes over time, the amount of statistical information actually collected diverges from the traffic fraction. Ignoring this can trigger the futility boundary earlier or later than intended. In practice, track the variance in real time and compute the information fraction separately.
Hierarchical Testing and FWER Control
If 10 metrics are each tested at α=0.05, the probability that at least one is a false positive reaches 1-(0.95)^10 ≈ 40%. This is why the Family-Wise Error Rate (FWER) must be controlled.
The simple Bonferroni correction (α/k) controls FWER but sacrifices considerable power: with k=10, the per-metric α drops to 0.005, making even real effects hard to detect. Moreover, Bonferroni is valid under arbitrary dependence between metrics — which is precisely why it is conservative. Real metrics (purchase conversion rate, session duration, click-through rate) are often positively correlated, and under positive correlation the correction is more stringent than necessary. To mitigate this overconservatism, permutation-based multiple testing or the Šidák correction can be considered.
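The arithmetic is quick to check. A small sketch (plain Python) comparing the uncorrected FWER with the Bonferroni and Šidák per-metric thresholds for k=10 metrics:

```python
k, alpha = 10, 0.05

# FWER if all k metrics are tested independently at alpha, uncorrected
fwer_uncorrected = 1 - (1 - alpha) ** k

# Per-metric thresholds: Bonferroni (valid under any dependence) vs
# Sidak (exact under independence, slightly less conservative)
bonferroni = alpha / k
sidak = 1 - (1 - alpha) ** (1 / k)

print(f"uncorrected FWER:        {fwer_uncorrected:.3f}")  # → 0.401
print(f"Bonferroni per-metric α: {bonferroni:.5f}")        # → 0.00500
print(f"Šidák per-metric α:      {sidak:.5f}")             # → 0.00512
```

The gap between Bonferroni and Šidák is small here; the real power recovery comes from the hierarchical structure, which spends α only where the metrics matter most.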
Hierarchical Testing (Gatekeeping) minimizes this loss by utilizing logical priorities among metrics.
[Primary metric: purchase conversion rate]  ← full α allocated
      │
      │ significant (p < α)
      ▼
[Secondary metric group: session time, click-through rate]  ← Bonferroni at α/2 each
      │
      │ at least one significant within the group (disjunctive)
      ▼
[Tertiary metric group: page load, API error rate]  ← Bonferroni at α/2 each

Mathematical Guarantee: if an α-level procedure is applied to the primary metric, and each lower level is tested only on the condition that the level above was rejected, the overall FWER is controlled at α or less. This is proven via the Closed Testing Principle.
In the structure above, the rule "open the next gate if at least one metric in the level is significant" is disjunctive gatekeeping. It has higher power than the conjunctive rule — "all metrics must be significant to open the next gate" — but is correspondingly less conservative. In either case the FWER ≤ α guarantee holds, but the chosen rule must be pre-registered before the experiment begins.
Terminology — Gatekeeping: a multiple-testing design in which the gate to a lower-level test opens only after the higher-level test passes. If a gate is opened without the level above passing, FWER is no longer controlled.
Overall Structure of Group Sequential Design
The overall framework combining Alpha-Spending, Beta-Spending, and hierarchical testing is as follows.
Z-statistic
 4 ┤
   │  ━━┓            ← upper (efficacy): O'Brien-Fleming starts very high
 3 ┤    ┗━━┓            and gradually decreases with each analysis
   │       ┗━━━┓
 2 ┤           ┗━━━━━━━━━━━  (final: ≈ fixed-sample threshold)
   │                ┌ ─ ─ ─ ─ ─  (final: converges with the upper bound)
 1 ┤           ┌ ─ ─┘
   │      ┌ ─ ─┘     ← lower (futility): starts negative (almost always passes)
 0 ┼──────┼─────────────────────► information fraction t
-1 ┤ ─ ─ ─┘            and gradually rises with each analysis
   │
     t=0.10    t=0.25    t=0.50    t=1.00

statistic > upper bound → effect detected → proceed with rollout or roll back immediately
statistic < lower bound → futility → terminate canary
between the boundaries  → continue

Key characteristics of the O'Brien-Fleming boundary: at the first look (t=0.10) the upper bound is very strict (e.g., Z > 3.47) to prevent premature rollout decisions, while the lower bound is very loose (e.g., Z < -0.48) to prevent early termination due to noise. At the final analysis (t=1.0) the two boundaries converge, forcing a final decision.
Practical Application
The four examples below use a single virtual e-commerce API server canary deployment as a common scenario.
- Primary metric: purchase conversion rate (purchase_rate)
- Secondary metrics: session time (session_time), click-through rate (click_rate)
- Tertiary metrics: page load time (page_load_ms), API error rate (error_rate)
- Checkpoints: traffic 10% → 25% → 50% → 100%
Example 1: Designing Canary Deployment Boundaries with R gsDesign
# install.packages("gsDesign")
library(gsDesign)
# 4 analyses (3 interim + 1 final), one-sided α=0.025, β=0.20 (80% power)
# delta1 = 0.3: a "small-to-medium" effect size in Cohen's d terms,
# corresponding to roughly a 0.5%p difference in purchase conversion rate.
# Choose it from your past A/B test history as the "minimum effect worth detecting."
design <- gsDesign(
  k = 4,           # 3 interim analyses + 1 final
  test.type = 4,   # upper efficacy bound + non-binding futility lower bound
  alpha = 0.025,   # one-sided α
  beta = 0.20,     # Type II error (80% power)
  sfu = sfLDOF,    # Alpha-Spending: O'Brien-Fleming-like
  sfl = sfLDOF,    # Beta-Spending: futility boundary
  delta1 = 0.3,    # minimum effect size to detect (Cohen's d)
  n.fix = 1000     # fixed-sample reference size
)
print(design)
# Example output:
# Analysis    N   Z (upper)  Z (lower/futility)
#        1  310        3.47               -0.48
#        2  620        2.78                0.94
#        3  930        2.29                1.71
#        4 1051        2.02                2.02

| Column | Meaning |
|---|---|
| N | Cumulative sample size required up to the analysis point (total 1051, ≈5% inflation over the fixed sample of 1000) |
| Z (upper) | Effect detected when exceeded → proceed with deployment |
| Z (lower) | Deemed futile when not met → terminate canary |
Why the lower bound of the first interim analysis is -0.48: early on, when only 10% of the data has been collected, noise dominates, so a slightly negative test statistic does not establish the absence of an effect. Only a value below -0.48 — an effect clearly in the opposite direction, i.e., an obvious regression — terminates the analysis. As the analyses progress, the lower bound rises (0.94 → 1.71 → 2.02), making the futility judgment increasingly strict.
Note: O'Brien-Fleming conserves α early on (upper limit 3.47), keeping the final analysis similar to the fixed sample standard (2.02). On the other hand, the lower limit of futility becomes stricter over time.
Example 2: Integrating into a Canary Deployment Pipeline with Python
Apply the boundary values calculated by gsDesign in the Python pipeline. When selecting the checkpoint to evaluate against, the important point is to use the most recently passed checkpoint, not a future checkpoint that has not yet been reached.
# Python 3.9+
from dataclasses import dataclass
from typing import Literal

@dataclass
class SequentialBoundary:
    upper: float  # efficacy (effect detection) boundary
    lower: float  # futility boundary

# Boundary values taken from the gsDesign output in Example 1
BOUNDARIES: dict[float, SequentialBoundary] = {
    0.10: SequentialBoundary(upper=3.47, lower=-0.48),
    0.25: SequentialBoundary(upper=2.78, lower=0.94),
    0.50: SequentialBoundary(upper=2.29, lower=1.71),
    1.00: SequentialBoundary(upper=2.02, lower=2.02),
}

def evaluate_canary(
    z_stat: float,
    traffic_fraction: float
) -> Literal["continue", "deploy", "stop_futility"]:
    """
    Decide the canary action from the current test statistic and traffic fraction.
    Uses the most recently passed checkpoint as the reference.
    """
    checkpoints = sorted(BOUNDARIES.keys())
    passed = [c for c in checkpoints if c <= traffic_fraction]
    if not passed:
        return "continue"  # first checkpoint not reached yet
    checkpoint = passed[-1]  # most recently passed checkpoint
    boundary = BOUNDARIES[checkpoint]
    if z_stat >= boundary.upper:
        return "deploy"  # effect detected → full rollout
    elif z_stat <= boundary.lower:
        return "stop_futility"  # futility → terminate canary
    else:
        return "continue"  # keep going until the next analysis

# Usage: z=0.72 at the 25% checkpoint
result = evaluate_canary(z_stat=0.72, traffic_fraction=0.25)
print(f"Canary decision: {result}")  # → stop_futility (0.72 < 0.94 lower bound)

Example 3: Implementing Multi-Metric Hierarchical Testing
It is practical to separate gatekeeping logic from p-value calculation. Different methods are used for p-values depending on the metric characteristics (ratio, continuous, count, etc.), while only the gatekeeping logic is reused commonly.
# Python 3.9+
import numpy as np
from scipy import stats as scipy_stats

def compute_proportion_pvalue(
    obs_count: int, obs_total: int,
    ctrl_count: int, ctrl_total: int
) -> float:
    """Two-sided z-test p-value for a difference in proportions."""
    p_obs = obs_count / obs_total
    p_ctrl = ctrl_count / ctrl_total
    p_pool = (obs_count + ctrl_count) / (obs_total + ctrl_total)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / obs_total + 1 / ctrl_total))
    if se == 0:
        return 1.0
    z = (p_obs - p_ctrl) / se
    return float(2 * (1 - scipy_stats.norm.cdf(abs(z))))

def hierarchical_gate(
    p_values: dict[str, float],
    hierarchy: list[list[str]],
    alpha: float = 0.05
) -> dict[str, dict]:
    """
    Hierarchical gatekeeping: a lower level is tested only if the level
    above was significant. Within each level, apply Bonferroni correction;
    if at least one metric is significant (disjunctive), open the gate to
    the next level. Guarantees FWER <= alpha.
    To switch to the more conservative "all significant" (conjunctive) rule,
    change the gate update to require every metric in the level.
    """
    results: dict[str, dict] = {}
    gate_open = True
    for metric_group in hierarchy:
        if not gate_open:
            for name in metric_group:
                results[name] = {"tested": False, "reason": "gate_closed"}
            continue
        alpha_adj = alpha / len(metric_group)  # within-level Bonferroni
        level_any_significant = False
        for name in metric_group:
            p = p_values.get(name, 1.0)
            significant = p < alpha_adj
            if significant:
                level_any_significant = True
            results[name] = {
                "tested": True,
                "p_value": p,
                "alpha_used": alpha_adj,
                "significant": significant,
            }
        gate_open = level_any_significant  # disjunctive gatekeeping
    return results

# ─────────────────────────────────────────────────────────────
# Usage: e-commerce canary, data at the 25% checkpoint (500 users per arm)
# ─────────────────────────────────────────────────────────────
raw_p_values = {
    # Primary: purchase conversion rate (canary 27/500 vs baseline 25/500)
    "purchase_rate": compute_proportion_pvalue(27, 500, 25, 500),
    # Secondary: session time, click-through rate (in practice, use a suitable test such as a t-test)
    "session_time": 0.08,
    "click_rate": 0.03,
    # Tertiary: page load, API error rate
    "page_load_ms": 0.12,
    "error_rate": 0.04,
}
hierarchy = [
    ["purchase_rate"],               # level 1: most important
    ["session_time", "click_rate"],  # level 2: only after level 1 passes
    ["page_load_ms", "error_rate"],  # level 3: only after level 2 passes
]
gate_results = hierarchical_gate(raw_p_values, hierarchy, alpha=0.05)
for name, r in gate_results.items():
    if r["tested"]:
        sig = "★ significant" if r["significant"] else "  not significant"
        print(f"{sig} | {name}: p={r['p_value']:.4f} (α_adj={r['alpha_used']:.4f})")
    else:
        print(f"  untested | {name}: {r['reason']}")

Example 4: Actual Application in Kubernetes + Argo Rollouts
After pre-computing the z-statistic in Prometheus with a recording rule, wire the boundary values into an Argo Rollouts AnalysisTemplate. The futility_lower values below are taken directly from the gsDesign output of Example 1.
# argo-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server-canary
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10          # first checkpoint: 10% traffic
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: sequential-boundary-check
            args:
              - name: checkpoint
                value: "0.10"
              - name: futility_lower   # see Example 1: 10% lower bound = -0.48
                value: "-0.48"
        - setWeight: 25          # second checkpoint: 25% traffic
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: sequential-boundary-check
            args:
              - name: checkpoint
                value: "0.25"
              - name: futility_lower   # see Example 1: 25% lower bound = 0.94
                value: "0.94"
        - setWeight: 50
        - pause: {duration: 20m}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: sequential-boundary-check
spec:
  args:
    - name: checkpoint
    - name: futility_lower
  metrics:
    - name: z-stat-purchase-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          # A Prometheus recording rule pre-computes the canary vs. baseline z-statistic
          query: |
            canary_z_statistic{
              metric="purchase_rate",
              checkpoint="{{args.checkpoint}}"
            }
      # "continue" succeeds only while the statistic stays above the futility
      # lower bound (gsDesign output from Example 1)
      successCondition: "result[0] > {{args.futility_lower}}"
      failureCondition: "result[0] <= {{args.futility_lower}}"

Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Rapid failure detection | Ineffective canaries are terminated early, before full user exposure, minimizing damage |
| Resource conservation | Reduces the cost of continuing pointless experiments (compute, engineering time) |
| Statistical rigor | FWER control prevents false-positive accumulation across multiple metrics |
| Flexible monitoring | The Lan-DeMets method does not require fixing the number and timing of interim analyses in advance |
| Design flexibility | Non-binding futility boundaries leave room for business judgment |
| Power efficiency | Hierarchical testing retains more power on secondary metrics than flat Bonferroni |
Disadvantages and Precautions
| Item | Description | Mitigation |
|---|---|---|
| Risk of false futility | A truly effective canary can be discontinued early | Use non-binding boundaries; set the CP threshold conservatively (0.1–0.15) |
| Dependence on pre-planning | α/β-spending functions and effect sizes are only valid if pre-registered before the experiment | Make an Experiment Design Doc process mandatory |
| Sample inflation | More total samples are required than a fixed-sample design of equal power | Pre-compute the inflation factor from gsDesign's n.I field |
| Hierarchy design complexity | Mis-specified priorities between metrics can forfeit the chance to detect a critical metric | Agree on the hierarchy in advance in joint workshops between product and statistics teams |
| Overconservatism with correlated metrics | Real metrics are correlated, so Bonferroni can be overconservative | Consider permutation-based multiple testing or the Šidák correction |
Terminology Supplement — Binding vs. Non-binding Futility Boundary: with a binding boundary, termination is mandatory once the lower bound is crossed, and the α calculation takes credit for this (slightly lowering the efficacy bounds); if the team then continues anyway, the Type I error guarantee is broken. A non-binding boundary statistically recommends termination but allows continuing on business judgment; it costs slightly more sample, but the α guarantee survives even when the boundary is overridden — which is what buys the operational flexibility.
The Most Common Mistakes in Practice
- Retrospectively adjusting boundaries after analysis: The moment you adjust the boundary "just this once," the FWER guarantee is broken. Boundaries must be locked in the code/documentation before the experiment starts.
- Treating all metrics equally without primary metrics: Dividing 10 metrics using Bonferroni without a hierarchical structure excessively lowers the power of the truly important indicators. Be sure to define 1 or 2 primary metrics first.
- Setting only Beta-Spending and omitting Alpha-Spending: If there is only a futility boundary and no efficacy boundary, the criterion for early detection of effect in the intermediate analysis disappears. Both boundaries must be designed together.
In Conclusion
By setting a futility boundary with Beta-Spending and controlling FWER with hierarchical testing, canary deployment is elevated from "slow observation" to "automated statistical decision-making." Experimentation platforms such as Statsig and Eppo have adopted this approach as a default option because its practicality has already been proven.
3 Steps to Start Right Now:
- Run the boundary calculation in R: after running install.packages("gsDesign"), use the one-liner below to print the α/β boundary values for your current canary sample size and share them in the team Slack. Just sharing those first numbers starts the design discussion.

  gsDesign(k=4, test.type=4, alpha=0.025, beta=0.20, sfu=sfLDOF, sfl=sfLDOF, delta1=0.3, n.fix=1000)

- Document the metric hierarchy: take the list of metrics used in the current experiment, explicitly designate the 1–2 primary metrics for which "if this is not significant, there is no need to look at the rest," and record them in the Experiment Design Doc.
- Wire the z-statistic into an Argo Rollouts or Flagger AnalysisTemplate: pre-compute the canary vs. baseline z-statistic with a Prometheus recording rule, and plug the lower boundary value from Step 1 into successCondition to build an automatic futility-termination pipeline.
Finally, let us note one limitation: the method described in this article requires the number of interim analyses and their checkpoints to be determined in advance. mSPRT, the subject of the next article, removes exactly this constraint — it enables Anytime-Valid testing, which lets you stop or continue monitoring at any time.
Next Post: A Comparative Analysis of How mSPRT (Mixture Sequential Probability Ratio Test) Guarantees Anytime-Valid p-values and How the Implementations of Netflix, Optimizely, and Spotify Differ
Reference Materials
- What is Beta-Spending? | Analytics ToolKit Glossary
- Beta spending function based on conditional power in group sequential design (2024) | PubMed
- Sequential A/B Testing Keeps the World Streaming — Part 1 | Netflix TechBlog
- Sequential A/B Testing Keeps the World Streaming — Part 2 | Netflix TechBlog
- Beyond Bonferroni: Hierarchical Multiple Testing in Empirical Research | NBER Working Paper (2025)
- Hierarchical Testing of Multiple Endpoints in Group-Sequential Trials | Statistics in Medicine
- A gatekeeping procedure to test a primary and a secondary endpoint in a group sequential design | PubMed
- Futility Monitoring in Clinical Trials | PMC/NIH
- gsDesign: Spending Function Overview | R Documentation
- Defining Group Sequential Boundaries | rpact
- Choosing a Sequential Testing Framework | Spotify Engineering (2023)
- Bonferroni Correction for Multiple Comparisons | Statsig Docs
- Sequential Testing | Eppo Docs
- Introducing Kayenta: Automated Canary Analysis Tool | Google Cloud Blog
- A Flexible Futility Monitoring Method with Time-Varying Conditional Power Boundary | PMC
- A Gentle Introduction to Group Sequential Design | CRAN gsDesign