Catching Missing PII Masking Automatically Before Deployment
A masking gap detection system built with Presidio and CI pipelines
Honest confession — I once managed masking rules in a spreadsheet. Every time a new column was added to the database, our entire process amounted to asking on Slack, "Did you update the masking rules?" Then one day I discovered customer phone numbers being logged in plaintext in the staging environment. The column had been added two weeks earlier. We were lucky the audit team didn't find it first.
After that experience, I built a pipeline to automatically verify the completeness of masking rules. Now we operate under a structure where PRs cannot be merged unless masking gaps are caught at the PR stage. In this post, I'll share the entire process of building a pipeline that integrates PII detection tools with CI to block deployments the moment a gap appears in masking rules. If you're a backend or infrastructure developer who works directly with CI/CD pipelines, this is something you can apply right away. For other roles, it should help you get a sense of "our team needs this kind of automation."
Core Concepts
"Shift-Left Privacy" — Bringing Masking Verification to the Development Stage
The traditional approach was to discover masking gaps during periodic audits after production deployment. The problem? The data has already been exposed. Shift-Left Privacy pulls this verification into the development stage — specifically into pre-commit hooks or CI pipelines — to catch gaps before code is merged.
At first I thought, "Isn't this just adding one more PII scan like running SAST?" And that intuition is actually correct. The trend of embedding PII detection into CI at the same level as SAST is becoming increasingly common, and the partnership between Checkmarx and HoundDog.ai is a good example. They include PII leak detection as a default item in existing AppSec coverage, catching things like "the email stored in this variable is leaking into logs" through static analysis before code is merged.
GDPR is already a familiar story, and with the EU AI Act phasing into effect from 2025, PII management in AI training data has become a legal obligation. According to Verizon's 2024 DBIR (Data Breach Investigations Report), roughly half of all data breach incidents involved personal information. And when 85% of organizations reportedly have secrets hardcoded in plaintext in source code, PII management is bound to be in even worse shape. Managing this with human memory alone, without automation, has already exceeded its limits.
The 3-Stage Structure of Masking Completeness Verification
When first designing the pipeline's backbone, I wondered "Where do I even start?" But once I organized it, it broke down cleanly into three stages.
[1. PII Discovery] → [2. Masking Rule Mapping] → [3. Gap Detection]

| Stage | What It Does | Key Technology |
|---|---|---|
| PII Discovery | Identifies sensitive data fields in code, schemas, and logs | NER, regular expressions, checksum verification |
| Rule Mapping | Cross-references whether masking rules are applied to detected PII fields | Masking rules YAML + schema diff |
| Gap Detection | Classifies fields without rules as "uncovered" and fails the build | CI gate, alert integration |
Core Principle: The entire reason this pipeline exists is to version-control masking rules like code and automatically re-verify coverage every time a schema changes.
4 Masking Strategies — When to Use Which?
When writing masking rules, there comes a moment where you wonder whether to use MASK, REDACT, HASH, or ENCRYPT. Each serves a different purpose.
| Strategy | Behavior | Use Case | Reversible |
|---|---|---|---|
| MASK | Replaces some characters with `*` | Emails, phone numbers — e.g. `us***@example.com` | No |
| REDACT | Replaces the entire value with `[REDACTED]` | Log output, debugging environments | No |
| HASH | One-way transformation using SHA-256, etc. | When analytical ID consistency is needed | No (one-way) |
| ENCRYPT | Two-way encryption using AES-256, etc. | When only authorized users should decrypt | Yes, with the key |
Here are the combinations frequently used in practice: For cases like emails and phone numbers where "the format needs to be visible but not the full value," use MASK. For things that should never appear in logs, use REDACT. When you need to track the same user in an analytics pipeline, use HASH. When internal administrators need to see the original value, apply ENCRYPT.
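To make the four strategies concrete, here is a minimal sketch in plain Python. The function names (`mask_email`, `redact`, `hash_value`) are illustrative helpers, not any library's API — in a real pipeline you'd use an anonymization library such as presidio-anonymizer:

```python
import hashlib


def mask_email(value: str, keep: int = 2) -> str:
    """MASK: keep the first few characters of the local part, hide the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:keep]}***@{domain}"


def redact(value: str) -> str:
    """REDACT: replace the entire value, regardless of content."""
    return "[REDACTED]"


def hash_value(value: str, salt: str = "") -> str:
    """HASH: one-way SHA-256. The same input always yields the same token,
    which preserves ID consistency for analytics joins."""
    return hashlib.sha256((salt + value).encode()).hexdigest()


# ENCRYPT would use two-way encryption (e.g. AES-256 via a crypto library)
# so that only key holders can recover the original value — omitted here
# since it requires key management beyond a short sketch.

print(mask_email("user@example.com"))  # us***@example.com
print(redact("010-1234-5678"))         # [REDACTED]
```

Note how HASH is the only option among the irreversible three that still lets you count distinct users or join tables on the masked value.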
NER + Regex + Domain Rules: Why Multi-Layer Detection Is Necessary
This is a situation frequently encountered in practice — regular expressions alone cannot determine whether the name "Kim Cheolsu" is PII or general text. Conversely, using only an NER model can miss obvious patterns like "010-1234-5678" depending on context. That's why tools like Microsoft Presidio take a hybrid approach combining NER, regex, and checksum verification.
Detection Layer Structure:
- Layer 1: Regex → structured patterns like phone numbers, emails, national IDs
- Layer 2: NER model → unstructured text like person names and addresses
- Layer 3: Domain rules → Korean national ID checksum, business registration number validation
- Layer 4: Context → confidence adjustment based on surrounding words ("Contact:", "Name:")

Terminology — NER (Named Entity Recognition): A natural language processing technique that automatically identifies proper nouns such as person names, places, and organization names in text. In PII detection, this is extended to recognize all personally identifiable information, including phone numbers, addresses, and more.
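The layered idea can be sketched in a few lines using card numbers as the example: a regex (Layer 1) proposes candidates, a Luhn checksum (a Layer 3 domain rule) rejects random digit runs, and a nearby context word (Layer 4) raises confidence. This is a toy illustration — the function names are made up and not Presidio's API:

```python
import re

# Layer 1: structural pattern — 13 to 16 digits, optionally separated
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_ok(number: str) -> bool:
    """Layer 3 domain rule: Luhn checksum filters out random digit runs."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(
        d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
        for i, d in enumerate(digits)
    )
    return total % 10 == 0


def detect_card(text: str) -> list[dict]:
    findings = []
    for m in CARD_RE.finditer(text):
        if not luhn_ok(m.group()):
            continue  # pattern matched, but checksum failed -> likely not a card
        score = 0.6
        # Layer 4: context words just before the match raise confidence
        if re.search(r"(card|payment)", text[max(0, m.start() - 20):m.start()], re.I):
            score = 0.9
        findings.append({"value": m.group(), "score": score})
    return findings


print(detect_card("card: 4111 1111 1111 1111"))
```

Each layer cheaply discards what the previous one let through, which is exactly why hybrid detectors produce far fewer false positives than any single technique alone.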
Practical Application
Example 1: Blocking Hardcoded PII in Code with Pre-commit Hooks
This is the starting point that delivers the fastest results with the least effort. You'd be surprised how often real PII like user@test.com or 010-1234-5678 gets hardcoded in test code.
I once put a real email address in test code thinking, "It's just test code, whatever." The problem was that the test ran in CI, the email was printed as-is in the logs, the monitoring system collected those logs, and they ended up on a dashboard visible to the entire team. Letting my guard down because it was test code turned into quite an embarrassing situation. Setting up a pre-commit hook blocks the commit itself, which changes your habits entirely.
```yaml
# .pre-commit-config.yaml
repos:
  # PII detection — block hardcoded personal information in code
  - repo: https://github.com/uktrade/pii-secret-check-hooks
    rev: v0.5.0
    hooks:
      - id: pii_secret_filename_check
      - id: pii_secret_content_check
  # Secret detection — block leaks of API keys, passwords, etc.
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.21.0
    hooks:
      - id: gitleaks
```

Practical Tip: PII detection and secret detection (`gitleaks`) serve different purposes, but running them in parallel in the pipeline is effective. PII rules protect personal information while secret rules protect access credentials — both are "sensitive information that shouldn't be in code."
Once an infrastructure team member sets up .pre-commit-config.yaml, team members only need to run pip install pre-commit && pre-commit install. From then on, scans run automatically with every commit.
Example 2: Blocking Masking Gaps at the PR Stage with GitHub Actions + Presidio
This is the core of the pipeline. It's a workflow that cross-references DB schema files against the masking rules YAML, and blocks the PR if newly added columns aren't included in the masking rules. When I first set this up, I thought "schema parsing is going to be complicated," but since you only need to extract tables and columns from SQL DDL, it turned out to be simpler than expected.
```yaml
# .github/workflows/pii-check.yml
name: PII Masking Verification
on: [pull_request]
jobs:
  pii-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install presidio-analyzer presidio-anonymizer pyyaml
      - name: Verify Masking Completeness
        run: |
          python scripts/scan_schema_coverage.py \
            --schema-dir ./db/schemas \
            --masking-rules ./config/masking-rules.yaml \
            --fail-on-uncovered
```

Masking rules are kept in this YAML format alongside schema files and version-controlled together.
```yaml
# config/masking-rules.yaml
tables:
  users:
    email: MASK
    phone: REDACT
    name: HASH
    address: ENCRYPT
  orders:
    shipping_address: MASK
    recipient_name: HASH
```

And here's the core schema-rule cross-referencing script. I initially left this part as a black box in the draft, but ultimately decided "the article is meaningless without this" and chose to share it directly.
```python
# scripts/scan_schema_coverage.py
import argparse
import re
import sys
from pathlib import Path

import yaml


def extract_columns_from_ddl(schema_dir: str) -> dict[str, list[str]]:
    """Extract the column list for each table from SQL DDL files."""
    tables = {}
    create_re = re.compile(
        r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\(", re.IGNORECASE
    )
    column_re = re.compile(r"^\s+(\w+)\s+\w+", re.MULTILINE)
    for sql_file in Path(schema_dir).glob("**/*.sql"):
        content = sql_file.read_text()
        for match in create_re.finditer(content):
            table_name = match.group(1)
            # Walk forward to find the parenthesis closing the CREATE TABLE body
            start = match.end()
            depth, end = 1, start
            for i in range(start, len(content)):
                if content[i] == "(":
                    depth += 1
                elif content[i] == ")":
                    depth -= 1
                    if depth == 0:
                        end = i
                        break
            body = content[start:end]
            # Ignore constraint lines that aren't real columns
            skip = {"constraint", "primary", "foreign", "unique", "index", "check"}
            columns = [
                m.group(1)
                for m in column_re.finditer(body)
                if m.group(1).lower() not in skip
            ]
            tables[table_name] = columns
    return tables


def load_masking_rules(rules_path: str) -> dict[str, list[str]]:
    """Load the list of columns that have masking rules defined in the YAML."""
    with open(rules_path) as f:
        config = yaml.safe_load(f)
    return {
        table: list(columns.keys())
        for table, columns in config.get("tables", {}).items()
    }


def find_uncovered(schema_dir: str, rules_path: str) -> list[dict]:
    """Return the columns that have no masking rule."""
    schema_tables = extract_columns_from_ddl(schema_dir)
    rule_tables = load_masking_rules(rules_path)
    uncovered = []
    for table, columns in schema_tables.items():
        covered = set(rule_tables.get(table, []))
        for col in columns:
            if col not in covered and col not in ("id", "created_at", "updated_at"):
                uncovered.append({"table": table, "column": col})
    return uncovered


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--schema-dir", required=True)
    parser.add_argument("--masking-rules", required=True)
    parser.add_argument("--fail-on-uncovered", action="store_true")
    parser.add_argument("--warn-only", action="store_true")
    args = parser.parse_args()

    uncovered = find_uncovered(args.schema_dir, args.masking_rules)
    if not uncovered:
        print("All columns are covered by masking rules.")
        return

    print(f"\n{'!' * 50}")
    print(f" {len(uncovered)} uncovered column(s) detected:")
    print(f"{'!' * 50}\n")
    for item in uncovered:
        print(f" - {item['table']}.{item['column']}")

    if args.fail_on_uncovered and not args.warn_only:
        sys.exit(1)
    print("\n[WARN] Running in warn-only mode. CI will not fail.")


if __name__ == "__main__":
    main()
```

| Component | Role |
|---|---|
| `extract_columns_from_ddl` | Parses SQL DDL files to extract column lists per table |
| `load_masking_rules` | Loads the table and column lists from masking rules defined in YAML |
| `find_uncovered` | Finds columns that exist in the schema but not in the masking rules |
| `--fail-on-uncovered` | Exits with code 1 if uncovered columns are found, blocking the PR |
| `--warn-only` | Outputs warnings only and allows CI to pass (useful during initial adoption) |
The key to this setup is that it automatically verifies masking rule coverage on PRs with schema changes. If a developer adds `ALTER TABLE users ADD COLUMN birthday DATE;` but doesn't include `birthday` in the masking rules, CI automatically raises a red flag.
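One caveat: the parser above only reads `CREATE TABLE` statements, so it covers the `ALTER TABLE` case only if your schema files are regenerated after each migration. If migrations add columns via standalone `ALTER TABLE` files instead, you could extend extraction with something like the sketch below (the function name is mine, and the regex assumes simple `ALTER TABLE t ADD [COLUMN] c type;` DDL):

```python
import re

ALTER_ADD_RE = re.compile(
    r"ALTER\s+TABLE\s+(\w+)\s+ADD\s+(?:COLUMN\s+)?(\w+)\s+\w+", re.IGNORECASE
)


def extract_columns_from_alters(content: str) -> dict[str, list[str]]:
    """Collect columns added via ALTER TABLE ... ADD COLUMN statements."""
    tables: dict[str, list[str]] = {}
    for table, column in ALTER_ADD_RE.findall(content):
        tables.setdefault(table, []).append(column)
    return tables


ddl = "ALTER TABLE users ADD COLUMN birthday DATE;"
print(extract_columns_from_alters(ddl))  # {'users': ['birthday']}
```

Merging this result into the `CREATE TABLE` output before the coverage check closes the gap for migration-style repositories.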
Example 3: Double Verification After Masking — "Is the Masking Actually Working?"
There's one more point I want to address here. A masking rule "existing" and "actually working" are two different problems. I've personally experienced cases where rules existed but the application order got tangled, leaving some records unmasked. On our team, this happened during a masking pipeline refactoring — it looked fine on the surface, so it took three weeks to discover. That's why double verification by re-scanning masked data with Presidio is necessary.
```python
from presidio_analyzer import AnalyzerEngine


def validate_masking_completeness(
    masked_db, masking_rules: dict, score_threshold: float = 0.7
) -> list[dict]:
    """Verify, record by record, whether PII still remains in the masked DB."""
    analyzer = AnalyzerEngine()
    violations = []
    for table, columns in masking_rules.items():
        for col in columns:
            samples = masked_db.sample(table, col, n=1000)
            for idx, value in enumerate(samples):
                if value is None:
                    continue
                results = analyzer.analyze(
                    text=str(value),
                    language="en",
                    score_threshold=score_threshold,
                )
                if results:
                    violations.append({
                        "table": table,
                        "column": col,
                        "row_index": idx,
                        "detected_pii": [r.entity_type for r in results],
                    })
    return violations
```

| Parameter | Description |
|---|---|
| `n=1000` | Uses sampling instead of full scans to maintain CI speed |
| `score_threshold=0.7` | Too low and false positives explode; too high and you get false negatives. 0.7 is an empirically balanced starting point |
| Per-record iteration | Concatenating samples into one string erases record boundaries, making it impossible to trace which record the PII was detected in. Always analyze record by record |
Note: The reason `language="en"` is set in the code above is explained in Example 4 below — Presidio's default NER model does not support Korean, so separate configuration is required for Korean PII detection.
Adding this script as the final step in CI catches the ghost rule problem where "rules exist but masking isn't actually applied."
Example 4: Custom Korean PII Recognizers — Presidio's Korean Limitations and Workarounds
There's something I need to address honestly here. Presidio's default NER model (spaCy's `en_core_web_lg`) does not support Korean. Even if you pass `language="ko"`, the NER layer essentially doesn't work, and only regex-based recognizers operate. I initially didn't know this and expected that just setting `language="ko"` would catch Korean names too — I was quite surprised when I saw the test results.
To properly detect Korean PII, two things are needed:
- Regex-based custom recognizers: Structured patterns like resident registration numbers, phone numbers
- Korean NLP model integration: Unstructured text like Korean names, addresses (requires a spaCy Korean model or separate NLP pipeline)
Let's start with regex-based custom recognizers that can be applied immediately.
```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Korean resident registration number (RRN) recognizer — with/without hyphen
rrn_patterns = [
    Pattern(
        name="rrn_with_hyphen",
        regex=r"\b\d{6}-[1-4]\d{6}\b",
        score=0.9,
    ),
    Pattern(
        name="rrn_without_hyphen",
        regex=r"\b\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])[1-4]\d{6}\b",
        score=0.7,
    ),
]
rrn_recognizer = PatternRecognizer(
    supported_entity="KR_RRN",
    name="Korean RRN Recognizer",
    patterns=rrn_patterns,
    supported_language="en",
)

# Korean phone number recognizer
phone_recognizer = PatternRecognizer(
    supported_entity="KR_PHONE",
    name="Korean Phone Recognizer",
    patterns=[
        Pattern(
            name="kr_mobile",
            regex=r"\b01[016789]-?\d{3,4}-?\d{4}\b",
            score=0.85,
        )
    ],
    supported_language="en",
)

# Register the custom recognizers on top of the default English NLP engine
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(rrn_recognizer)
analyzer.registry.add_recognizer(phone_recognizer)

# Usage
results = analyzer.analyze(
    text="주민번호는 900101-1234567이고 연락처는 010-9876-5432입니다.",
    language="en",
    entities=["KR_RRN", "KR_PHONE"],
)
```

You might be curious why the score values differ. The hyphenated pattern (900101-1234567) has a clearly identifiable resident registration number format, so it's set high at 0.9. The pattern without a hyphen (9001011234567) can be confused with an ordinary digit string, so it's lowered to 0.7. Giving lower scores to patterns with higher false-positive potential allows flexible control when filtering by `score_threshold`.
Why `supported_language="en"`? If you call with `language="ko"` without separately registering a Korean NLP engine in Presidio, an error occurs. Since regex-based recognizers work regardless of language, they function correctly even when mounted on the default English engine. To support Korean NER (unstructured detection of names, addresses, etc.), you need to register a spaCy Korean model (`ko_core_news_sm`) or another NLP pipeline with `NlpEngineProvider`. This will be covered in detail in the next post.
Pros and Cons Analysis
Pros
| Item | Description |
|---|---|
| Pre-deployment blocking | Discovers masking gaps before deployment, preventing data breach incidents |
| Automated compliance | CI automatically ensures compliance with regulations like GDPR, HIPAA, and the EU AI Act |
| Audit trail | All masking verification results are preserved in CI logs, usable as audit evidence |
| Schema drift detection | Automatically detects masking gaps when new columns or tables are added |
| Shortened developer feedback loop | Early feedback from pre-commit and CI reduces code review burden |
Cons and Caveats
| Item | Description | Mitigation |
|---|---|---|
| False Positives | Normal data may be misidentified as PII, unnecessarily blocking builds | Start score_threshold at 0.7 and tune to your team's situation. Our team initially set it to 0.5, got exhausted by false positives, and raised it to 0.75 |
| False Negatives | May miss unstructured PII (names in free text, etc.) | Secure coverage with a multi-layer approach of NER + regex + domain rules |
| Korean limitations | Presidio's default NER does not support Korean, so only regex-based detection works | Requires adding custom recognizers + integrating a spaCy Korean model |
| CI speed degradation | Full schema scans on large schemas increase pipeline time | Use schema diff-based incremental scanning to verify only changed portions |
| Rule maintenance cost | Rules quickly become useless if they don't evolve with the schema | Keep masking rules alongside schema files and change them in the same PR |
| Re-identification risk | Masking individual fields alone still allows re-identification through quasi-identifier combinations | Add a separate verification layer such as k-anonymity |
Supplementary term — Quasi-identifier: Information that cannot identify an individual on its own, but becomes identifiable when multiple pieces are combined. For example, research has shown that combining date of birth + gender + zip code can identify a significant number of individuals.
Supplementary term — k-anonymity: A privacy model that ensures every record in a dataset has at least k records with the same quasi-identifier combination. If k=5, searching by a specific combination must return at least 5 people, making it difficult to single out one individual.
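To make the k-anonymity idea concrete, here is a minimal sketch that checks whether every quasi-identifier combination in a dataset appears at least k times. It's illustrative only — the function name and sample data are mine, and production checks would use a dedicated anonymization tool:

```python
from collections import Counter


def violates_k_anonymity(rows: list[dict], quasi_ids: list[str], k: int) -> list[tuple]:
    """Return quasi-identifier combinations that appear fewer than k times."""
    combos = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return [combo for combo, count in combos.items() if count < k]


rows = [
    {"birth_year": 1990, "gender": "F", "zip": "04524"},
    {"birth_year": 1990, "gender": "F", "zip": "04524"},
    {"birth_year": 1985, "gender": "M", "zip": "06164"},  # unique -> re-identifiable
]
print(violates_k_anonymity(rows, ["birth_year", "gender", "zip"], k=2))
# [(1985, 'M', '06164')]
```

Any combination this returns is a record an attacker could single out even after field-level masking, which is exactly the re-identification risk the table above warns about.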
Most Common Mistakes in Practice
- **Managing masking rules and schemas separately** — If schemas change through migrations but masking rules live in a wiki, falling out of sync is only a matter of time. Change them in the same repository, in the same PR.
- **Running with the default `score_threshold` without tuning** — Too low, and developers tire of false positives and start ignoring warnings; too high, and actual PII slips through. Running in `--warn-only` mode for about a week at first to find the appropriate threshold is absolutely necessary.
- **Verifying only the "existence" of rules, not their "operation"** — Masking rules existing in YAML doesn't guarantee masking is actually happening. Include the double verification introduced in Example 3. I once skipped this and had three weeks of unmasked data accumulate in staging.
Conclusion
The core of this pipeline is replacing "human memory" with a system, but the real value lies beyond that. Once built, verification runs automatically every time the schema changes, and you can tangibly feel the cognitive load decrease across the entire team. As this automation matures, it can evolve to the level of integrating with a Data Catalog to automatically suggest masking rules when new tables are added.
Here are 3 steps you can start with right now:

1. **Install pre-commit hooks** — After `pip install pre-commit`, add PII detection hooks to `.pre-commit-config.yaml`, and you can block hardcoded PII in code starting today. This is the starting point that delivers the fastest results with the least effort.
2. **Write a masking rules YAML** — Create `masking-rules.yaml` based on your current DB schema. You don't need to cover everything from the start — begin with the most sensitive tables. Columns like `email`, `phone`, `name`, `address`, `date_of_birth` in the `users` table and `card_number`, `billing_address` in the `payments` table are almost always PII. Start by catching these "sure things" and expand incrementally to keep the burden low.
3. **Add schema coverage verification to CI** — Set up the `scan_schema_coverage.py` script and GitHub Actions workflow introduced above as a PR trigger. Run with the `--warn-only` flag for the first two weeks, outputting warnings only, then switch to `--fail-on-uncovered` once the team is comfortable.
Next post: "Improving Korean PII Detection Accuracy with Presidio Custom NLP Engines — From spaCy Korean Model Integration to Name and Address Recognition"
References
- Microsoft Presidio — GitHub | Open-source PII detection and anonymization framework
- Presidio Official Documentation | Analyzer, anonymizer, and custom recognizer guide
- Presidio Research & Evaluation | PII detection model evaluation tools
- PII Masker (HydroXai) — GitHub | High-precision AI detection based on DeBERTa-v3
- HoundDog.ai — CI Integrations | CI/CD pipeline PII scanner integration guide
- Checkmarx + HoundDog.ai PII Leak Detection | Case study of integrating PII leak detection into AppSec
- Piiano — Application Data Leaks Detection | PII leak detection case analysis in open-source projects
- pii-secret-check-hooks — GitHub | PII + secret checking pre-commit hooks
- PII Detection in Integration Testing | hoop.dev | How to catch PII leaks in integration testing
- Top Open Source Sensitive Data Discovery Tools (2025) | Bytebase | Comparison of open-source sensitive data discovery tools
- Cyber Defense Magazine — PII Leak Detection in Code | The importance of code-level PII leak detection