The Mathematics of A/B Testing That Can Be Stopped at Any Time: Sequential Hypothesis Testing and Always Valid Inference for Solving Peeking Problems with mSPRT
When running A/B tests, you are bound to feel this temptation at least once: "The results look pretty good right now; shouldn't I stop already?" With traditional statistical methods, this behavior is fatal. Simply looking at the results before reaching the predetermined sample size—known as "peeking"—can inflate the Type I error (false positive) rate from the nominal α=0.05 to 0.2 or higher. A change with no effect gets declared the "winner," and the experiment's results become worthless.
Is there, then, no method that is statistically safe to peek at any time? There is. The methods adopted by companies like Optimizely, Netflix, Spotify, and Amplitude to solve this problem are mSPRT (Mixture Sequential Probability Ratio Test) and Always Valid Inference. mSPRT is designed so that the test statistic satisfies the martingale property, mathematically guaranteeing that the Type I error rate stays at α or below, no matter how many times you peek or when you stop the experiment.
In this article, you will learn the mathematical principles, Python implementation, and practical application strategies of mSPRT all at once. After reading, you will understand how to set τ (the prior-distribution standard deviation) to avoid losing power, why an experiment never terminates unless a lower threshold is set, and how the martingale property creates the mathematical guarantee that it is "safe to look at any time."
Key Concepts
SPRT: Sequential Hypothesis Testing
Fixed-sample tests (t-test, z-test) determine the sample size n before the experiment and make a judgment only once after all the data has been collected. On the other hand, the Sequential Probability Ratio Test (SPRT) is a sequential hypothesis testing method proposed by Abraham Wald (1945) that calculates the likelihood ratio for each incoming data point and immediately draws one of three conclusions.
Likelihood Ratio:
$$\Lambda_n = \prod_{i=1}^{n} \frac{f_{\theta_1}(X_i)}{f_{\theta_0}(X_i)}$$
The decision rule is determined by two thresholds A (upper limit) and B (lower limit).
| Condition | Conclusion |
|---|---|
| $\Lambda_n \geq A$ | H₁ Accept → Experiment End |
| $\Lambda_n \leq B$ | H₀ Accept → Experiment End |
| $B < \Lambda_n < A$ | Collect additional data |
Calculate the threshold value using Wald's approximation.
- Upper bound: $A \approx \frac{1-\beta}{\alpha}$
- Lower bound: $B \approx \frac{\beta}{1-\alpha}$
For example, if α=0.05 and β=0.20, then A≈16 and B≈0.211.
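The Wald approximation above can be computed directly. A minimal sketch (the function name is illustrative, not from any library):

```python
# Wald's approximate SPRT thresholds from the target error rates.
def wald_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    upper = (1 - beta) / alpha      # A: accept H1 when the likelihood ratio >= A
    lower = beta / (1 - alpha)      # B: accept H0 when the likelihood ratio <= B
    return upper, lower

A, B = wald_thresholds(alpha=0.05, beta=0.20)
print(round(A, 3), round(B, 3))  # ≈ 16.0 and ≈ 0.211, matching the example above
```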
Type I Error (α): The probability of incorrectly judging that there is a difference when there actually is no difference. It is an error that creates false winners in A/B testing. Type II Error (β): The probability of missing a real difference by reporting "no difference." Power = 1 − β.
Limitations of Classic SPRT: Single Effect Size Assumption
Classical SPRT is based on the simple hypothesis that "the alternative hypothesis has exactly one effect size (θ₁)." In actual A/B testing, it is impossible to know in advance whether the click-through rate will increase by 1% or 3%. If this assumption is broken, Type I error control fails.
mSPRT: Applying a distribution to the alternative hypothesis
The core idea of mSPRT is to assign a prior distribution π(θ) to the parameter θ of the alternative hypothesis instead of a single θ₁, and to integrate (average) the likelihood ratio over that distribution. It became an industry standard after Johari et al. (2015) applied it to online A/B testing.
Mixture Likelihood Ratio:
$$\tilde{\Lambda}_n = \int \prod_{i=1}^{n} \frac{f_{\theta}(X_i)}{f_{\theta_0}(X_i)} \, \pi(\theta) \, d\theta$$
Don't worry if the formula looks complicated — for the normal case this integral reduces to the closed-form expression below, so you can jump straight to the code without following the derivation.
If normal data is used with a normal prior distribution (mean 0, variance τ²), a closed-form solution exists.
$$\tilde{\Lambda}_n = \sqrt{\frac{2\sigma^2}{2\sigma^2 + n\tau^2}} \cdot \exp\left(\frac{n^2\tau^2(\bar{Y}_n - \bar{X}_n)^2}{4\sigma^2(2\sigma^2 + n\tau^2)}\right)$$
- σ²: Population variance (known value or estimate)
- τ²: Variance of the prior distribution — A key parameter reflecting the expected effect size
- n: Number of observations per group
- $\bar{X}_n$, $\bar{Y}_n$: Sample mean of control group/treatment group
Impact of the τ setting: A large τ admits a wider range of effect sizes as plausible alternatives, making the test conservative and requiring more data to reach a conclusion. Conversely, if τ is set far smaller than the actual effect size, power is lost badly. In practice, `tau = sigma * 0.5` is a safe starting point.
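The τ trade-off is easy to see numerically. The sketch below evaluates the closed-form statistic for one fixed, illustrative scenario (n = 200 per group, observed mean difference 0.3, σ = 1 — all assumed numbers) under a too-small, a moderate, and a too-large τ:

```python
import numpy as np

def mixture_lr(mean_diff: float, n: int, sigma: float, tau: float) -> float:
    """Closed-form normal-normal mSPRT statistic from the formula above."""
    s2, t2 = sigma ** 2, tau ** 2
    denom = 2 * s2 + n * t2
    return np.sqrt(2 * s2 / denom) * np.exp(n ** 2 * t2 * mean_diff ** 2 / (4 * s2 * denom))

# Same observed difference, three choices of tau (sigma = 1).
for tau in (0.05, 0.5, 2.0):
    lam = mixture_lr(mean_diff=0.3, n=200, sigma=1.0, tau=tau)
    print(f"tau={tau}: lambda = {lam:.2f}")
```

With these particular numbers the moderate τ = 0.5 yields by far the largest statistic, while both the too-small and too-large settings shrink it — the "lost power" effect described above.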
Always Valid Inference: The Mathematical Basis of "It's Okay to Look At Anytime"
Martingale: The mixture likelihood ratio $\tilde{\Lambda}_n$ satisfies the martingale property under the null hypothesis: the conditional expectation of its next value, given the data so far, equals its current value. From this property, Ville's inequality follows.
From the martingale property, the following inequality holds.
$$P_{H_0}\left(\sup_{n \geq 1} \tilde{\Lambda}_n \geq \frac{1}{\alpha}\right) \leq \alpha$$
When the null hypothesis is true, the probability that the mixture likelihood ratio ever exceeds the critical value 1/α at any point during the experiment is at most α. No matter how many times you look or when you stop, the Type I error rate is controlled. This is Always Valid Inference.
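Ville's inequality can be checked empirically: simulate many A/A experiments (H₀ true), peek after every single observation, and count how often the mixture ratio ever crosses 1/α. A minimal sketch reusing the closed-form statistic above (the simulation sizes 500 and 2000 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma, tau, n_max, n_sims = 0.05, 1.0, 0.5, 2000, 500
false_positives = 0

for _ in range(n_sims):
    # H0 is true: both groups share the same mean.
    x = rng.normal(0.0, sigma, n_max)
    y = rng.normal(0.0, sigma, n_max)
    n = np.arange(1, n_max + 1)
    diff = np.cumsum(y) / n - np.cumsum(x) / n          # running mean difference
    denom = 2 * sigma**2 + n * tau**2
    lam = np.sqrt(2 * sigma**2 / denom) * np.exp(
        n**2 * tau**2 * diff**2 / (4 * sigma**2 * denom)
    )
    if np.any(lam >= 1 / alpha):                        # peeked after every observation
        false_positives += 1

print(f"empirical Type I error: {false_positives / n_sims:.3f}  (guaranteed <= {alpha})")
```

Despite peeking 2,000 times per experiment, the empirical false-positive rate stays below α, as the inequality promises.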
Practical Application
Example 1: Real-time Monitoring of Revenue per User A/B Test
We assume an experiment to improve revenue per user. Assuming the revenue data follows a normal distribution, we implement real-time decision-making using mSPRT.
Caution: This implementation assumes a normal distribution, so it is suitable for continuous data such as revenue per user and time spent. Do not apply it directly to binary conversion rate (0/1 click) data.
```python
import numpy as np
from dataclasses import dataclass
from typing import Literal


@dataclass
class SPRTResult:
    n: int
    lambda_mix: float
    decision: Literal["reject H0", "accept H0", "continue"]
    x_bar: float
    y_bar: float


def compute_msprt(
    x: np.ndarray,
    y: np.ndarray,
    sigma: float,
    tau: float,
    alpha: float = 0.05,
    beta: float = 0.20,
) -> list[SPRTResult]:
    """
    Sequentially compute the mSPRT mixture likelihood ratio.

    Parameters
    ----------
    x : control-group observations
    y : treatment-group observations
    sigma : known (or estimated) common standard deviation
    tau : prior standard deviation — reflects the expected effect size.
          "Effect size around half of sigma" → tau = sigma * 0.5
    alpha : target Type I error rate (default 0.05)
    beta : target Type II error rate (default 0.20, i.e., 80% power)
    """
    threshold = 1.0 / alpha     # upper bound: accept-H1 threshold
    lower = beta / (1 - alpha)  # lower bound: accept-H0 threshold (Wald approximation)
    sigma2 = sigma ** 2
    tau2 = tau ** 2
    results = []
    for n in range(1, len(x) + 1):
        x_bar = np.mean(x[:n])
        y_bar = np.mean(y[:n])
        # Closed-form mSPRT (normal-normal conjugate)
        denom = 2 * sigma2 + n * tau2
        scale = np.sqrt(2 * sigma2 / denom)
        exponent = (n ** 2 * tau2 * (y_bar - x_bar) ** 2) / (4 * sigma2 * denom)
        lambda_mix = scale * np.exp(exponent)
        if lambda_mix >= threshold:
            decision = "reject H0"
        elif lambda_mix <= lower:
            decision = "accept H0"
        else:
            decision = "continue"
        results.append(
            SPRTResult(n=n, lambda_mix=lambda_mix,
                       decision=decision, x_bar=x_bar, y_bar=y_bar)
        )
        if decision != "continue":
            break
    return results


# --- Simulation ---
# Normality assumption: suitable for continuous data such as revenue per user.
np.random.seed(42)
ctrl = np.random.normal(loc=0.0, scale=1.0, size=1000)
treat = np.random.normal(loc=0.3, scale=1.0, size=1000)  # true effect size 0.3
history = compute_msprt(ctrl, treat, sigma=1.0, tau=0.5, alpha=0.05, beta=0.20)
final = history[-1]
print(f"Decision: {final.decision}")
print(f"Observations: {final.n} / 1000")
print(f"Mixture likelihood ratio: {final.lambda_mix:.4f}")
print(f"Control mean: {final.x_bar:.4f}, Treatment mean: {final.y_bar:.4f}")
```

Key Code Explanation:
| Variable/Line | Role |
|---|---|
| `threshold = 1/alpha` | Accept-H₁ threshold. With α=0.05, this is 20 |
| `lower = beta/(1-alpha)` | Accept-H₀ threshold. With α=0.05, β=0.20, ≈ 0.211 (Wald approximation: β/(1−α)) |
| `scale` | Normalization factor from the prior-distribution integral; decreases as n grows |
| `exponent` | Term that amplifies the difference between the two group means; grows rapidly with the effect |
| Early `break` | Stop immediately once a decision is reached — the core of sample efficiency |
Pros and Cons Analysis
Advantages
| Item | Content |
|---|---|
| Sample Efficiency | Achieve equivalent power with an average of 40–60% less data compared to fixed-sample tests |
| Real-time Monitoring | Check results at any time without peeking issues (Always Valid) |
| Early Termination | Experiment may be terminated earlier than planned if effectiveness is clear |
| Complex Hypothesis Handling | Can set distribution range rather than single effect size as the alternative hypothesis |
| Mathematical Guarantee | The Type I error rate is always mathematically guaranteed due to the martingale property |
Disadvantages and Precautions
Peeking Problem: In fixed-sample testing, the act of checking intermediate results and stopping the experiment when they look favorable, before reaching the predetermined sample size n. Repeated intermediate checks alone can inflate the actual Type I error rate from α=0.05 to 0.2 or higher. mSPRT solves this problem mathematically.
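The inflation itself is easy to reproduce: run repeated two-sample z-tests at α = 0.05 on A/A data (no true effect) and stop at the first "significant" peek. A minimal sketch (the check interval and horizon are arbitrary illustration values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n_max, check_every, n_sims = 0.05, 2000, 100, 500
z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value, ≈ 1.96
false_positives = 0

for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n_max)    # H0 true: both groups identical
    y = rng.normal(0.0, 1.0, n_max)
    for n in range(check_every, n_max + 1, check_every):
        # Two-sample z statistic with known unit variance.
        z = (y[:n].mean() - x[:n].mean()) / np.sqrt(2 / n)
        if abs(z) >= z_crit:           # stop at the first "significant" peek
            false_positives += 1
            break

print(f"Type I error with peeking every {check_every} obs: {false_positives / n_sims:.3f}")
```

With 20 peeks per experiment, the realized false-positive rate lands well above the nominal 0.05 — the failure mode mSPRT is built to avoid.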
| Item | Content | Response Plan |
|---|---|---|
| Practical significance not distinguished | Detects even minute effects that are statistically significant but practically negligible | Apply truncated mSPRT (arXiv 2509.07892) |
| τ must be pre-specified | The prior variance τ² must be set in advance, and results are sensitive to it | Estimate τ from pilot-experiment data, or set it conservatively large |
| Known-variance assumption | The basic formula assumes σ² is known, limiting practical application | Use the t-distribution-based anytime-valid t-test extension (arXiv 2310.03722) |
| Multiple metrics | Multiple-comparison issues arise when testing several metrics simultaneously | Combine e-values by multiplication, or apply Bonferroni correction |
| Implementation complexity | Higher barrier to entry than traditional tests, both conceptually and in parameter setting | Start with an open-source platform such as GrowthBook |
The Most Common Mistakes in Practice
- Setting τ too small: If τ is much smaller than the actual effect size, power drops sharply. Setting τ too large requires more data, but it is safer than setting it too small. If you are starting without pilot data, use `tau = sigma * 0.5` as your starting point.
- Omitting the lower threshold (H₀ acceptance): If you set only the accept-H₁ threshold (upper bound) and ignore the accept-H₀ threshold (lower bound), experiments with no effect will never terminate. Setting `lower = beta / (1 - alpha)` is mandatory, not optional.
- Ignoring conservatism in batch-based testing: If you evaluate once per day instead of after every observation, the actual Type I error rate falls below α, costing power. This is safe but inefficient. If real-time updates are possible, use them.
Extended Application
Example 2: Real-time Detection of ML Model Drift
mSPRT is not limited to A/B testing. The following is an example of detecting data drift in a production ML model in real time using single-hypothesis SPRT. The key difference is that, unlike the mSPRT in Example 1 which deals with the composite alternative hypothesis (the entire parameter distribution), here a single hypothesis, "a state where drift has occurred (μ₁)", is specified in advance.
```python
import numpy as np
from scipy.stats import norm


def drift_detector_sprt(
    reference_scores: np.ndarray,
    stream_scores: np.ndarray,
    alert_threshold: float = 20.0,
) -> dict:
    """
    Drift detector based on simple-hypothesis SPRT.

    ⚠️ Unlike the mSPRT in Example 1 (composite alternative hypothesis),
    the drift direction and magnitude are specified in advance here.

    reference_scores: baseline model scores (training/validation set)
    stream_scores   : incoming real-time scores
    """
    mu0 = np.mean(reference_scores)
    sigma0 = np.std(reference_scores)
    mu1 = mu0 - 0.5 * sigma0  # drift hypothesis: mean drops by 0.5 sigma

    log_lambda = 0.0  # cumulative log-likelihood ratio
    for i, score in enumerate(stream_scores, 1):
        log_lr = norm.logpdf(score, mu1, sigma0) - norm.logpdf(score, mu0, sigma0)
        log_lambda += log_lr
        if log_lambda >= np.log(alert_threshold):
            return {
                "drift_detected": True,
                "at_sample": i,
                "log_lambda": round(log_lambda, 4),
            }
        if log_lambda <= -np.log(alert_threshold):
            # Reset on reaching the lower bound.
            # ⚠️ This reset borrows the restart idea of Page's CUSUM into SPRT.
            # Pure SPRT assumes no resets, so do not assume that classic SPRT's
            # Type I error guarantee carries over to this code unchanged.
            log_lambda = 0.0
    return {"drift_detected": False, "samples_checked": len(stream_scores)}


# Simulation: drift begins after the 500th observation
np.random.seed(7)
ref = np.random.normal(0.8, 0.1, 1000)
normal = np.random.normal(0.8, 0.1, 500)
drifted = np.random.normal(0.75, 0.1, 300)  # mean drops
stream = np.concatenate([normal, drifted])
result = drift_detector_sprt(ref, stream)
print(result)
# Example output (with the fixed seed; exact values may vary by environment):
# {'drift_detected': True, 'at_sample': 512, 'log_lambda': 3.04...}
```

Use the log-likelihood ratio: it prevents floating-point underflow and turns multiplication into addition, improving numerical stability. It is essential whenever there are many observations.
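The underflow risk is concrete, not hypothetical. The sketch below (5,000 observations is an arbitrary illustration) multiplies raw normal densities — each at most about 0.4, so their product vanishes below float64's smallest representable value — while the sum of log-densities stays perfectly finite:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
obs = rng.normal(0.0, 1.0, 5000)  # 5,000 observations under the reference model

dens = norm.pdf(obs, loc=0.0, scale=1.0)                # each value <= ~0.4
naive_product = np.prod(dens)                           # underflows to exactly 0.0
log_sum = np.sum(norm.logpdf(obs, loc=0.0, scale=1.0))  # stays finite

print(naive_product)          # 0.0
print(np.isfinite(log_sum))   # True
```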
In Conclusion
By reading this article, you have learned three things: how to set τ to maintain power, why ineffective experiments never end without a lower threshold (the H₀ acceptance criterion), and how the martingale property creates the mathematical guarantee that it is "safe to look at any time." These three points are why Optimizely, Netflix, and Amplitude run real-time experiments on this technique.
3 Steps to Start Right Now:
- Copy the code block above exactly and apply it to your current experiment data with `tau=sigma*0.5`, `alpha=0.05`, and `beta=0.20`. You may also use the `msprt` package from PyPI, but since its API may differ from the code in this article, check its official documentation before using it.
- Launch the GrowthBook open source locally, enable the Sequential Testing option, and compare it side-by-side with your existing fixed-sample results.
- If you want to delve deeper, read the original paper "Always Valid Inference" by Johari et al. (2015) and run a simulation with your own experimental data to verify for yourself whether the Type I error rate is actually guaranteed.
Next Article: e-value and e-process — Beyond mSPRT, a General Theory of Modern Sequential Inference Freely Combining Heterogeneous Tests
Reference Materials
- Always Valid Inference: Continuous Monitoring of A/B Tests | Johari et al., arXiv
- Sequential Probability Ratio Test: SPRT and Mixture SPRT (mSPRT) | Medium, Carey Chou
- Sequential Test for Practical Significance: Truncated mSPRT | arXiv 2509.07892
- Choosing a Sequential Testing Framework | Spotify Engineering
- Sequential Testing at Booking.com | Booking.ai
- Mastering the mSPRT for A/B Testing | HackerNoon
- Sequential Probability Ratio Test | Wikipedia
- expectation: Python library for e-values and sequential testing | GitHub
- GrowthBook Sequential Testing Docs
- Statsig Sequential Testing | Statsig Blog
- Hypothesis Testing with E-values | Ramdas & Wang, 2025
- Python msprt package | PyPI
- Anytime-valid t-tests and confidence sequences | arXiv 2310.03722
- Sequential Confidence Intervals for A/B Testing | MDPI Mathematics, 2025