Catching Missing PII Masking Automatically Before Deployment
A masking gap detection system built with Presidio and CI pipelines
Honest confession — I once managed masking rules in a spreadsheet. Every time a new column was added to the database, our entire process amounted to asking on Slack, "Did you update the masking rules?" Then one day I discovered customer phone numbers being logged in plaintext in the staging environment. The column had been added two weeks earlier. We were lucky the audit team didn't find it first.
After that experience, I built a pipeline to automatically verify the completeness of masking rules. Now we operate under a structure where PRs cannot be merged unless masking gaps are caught at the PR stage. In this post, I'll share the entire process of building a pipeline that integrates PII detection tools with CI to block deployments the moment a gap appears in masking rules. If you're a backend or infrastructure developer who works directly with CI/CD pipelines, this is something you can apply right away. For other roles, it should help you get a sense of "our team needs this kind of automation."
Core Concepts
"Shift-Left Privacy" — Bringing Masking Verification to the Development Stage
The traditional approach was to discover masking gaps during periodic audits after production deployment. The problem? The data has already been exposed. Shift-Left Privacy pulls this verification into the development stage — specifically into pre-commit hooks or CI pipelines — to catch gaps before code is merged.
At first I thought, "Isn't this just adding one more PII scan like running SAST?" And that intuition is actually correct. The trend of embedding PII detection into CI at the same level as SAST is becoming increasingly common, and the partnership between Checkmarx and HoundDog.ai is a good example. They include PII leak detection as a default item in existing AppSec coverage, catching things like "the email stored in this variable is leaking into logs" through static analysis before code is merged.
GDPR is already a familiar story, and with the EU AI Act phasing into effect from 2025, PII management in AI training data has become a legal obligation. According to Verizon's 2024 DBIR (Data Breach Investigations Report), roughly half of all data breach incidents involved personal information. And when 85% of organizations reportedly have secrets hardcoded in plaintext in source code, PII management is bound to be in even worse shape. Managing this with human memory alone, without automation, has already exceeded its limits.
The 3-Stage Structure of Masking Completeness Verification
When first designing the pipeline's backbone, I wondered "Where do I even start?" But once I organized it, it broke down cleanly into three stages.
[1. PII Discovery] → [2. Masking Rule Mapping] → [3. Gap Detection]

| Stage | What It Does | Key Technology |
|---|---|---|
| PII Discovery | Identifies sensitive data fields in code, schemas, and logs | NER, regular expressions, checksum verification |
| Rule Mapping | Cross-references whether masking rules are applied to detected PII fields | Masking rules YAML + schema diff |
| Gap Detection | Classifies fields without rules as "uncovered" and fails the build | CI gate, alert integration |
Core Principle: The entire reason this pipeline exists is to version-control masking rules like code and automatically re-verify coverage every time a schema changes.
4 Masking Strategies — When to Use Which?
When writing masking rules, there comes a moment where you wonder whether to use MASK, REDACT, HASH, or ENCRYPT. Each serves a different purpose.
| Strategy | Behavior | Use Case | Reversible |
|---|---|---|---|
| MASK | Replaces some characters with `*` | Emails, phone numbers — e.g. `us***@example.com` | No |
| REDACT | Replaces the entire value with `[REDACTED]` | Log output, debugging environments | No |
| HASH | One-way transformation using SHA-256, etc. | When analytical ID consistency is needed | No (one-way) |
| ENCRYPT | Two-way encryption using AES-256, etc. | When only authorized users should decrypt | Yes, with the key |
Here are the combinations frequently used in practice: For cases like emails and phone numbers where "the format needs to be visible but not the full value," use MASK. For things that should never appear in logs, use REDACT. When you need to track the same user in an analytics pipeline, use HASH. When internal administrators need to see the original value, apply ENCRYPT.
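To make the four strategies concrete, here is a minimal sketch in plain Python. The function names (`mask_email`, `redact`, `hash_value`) are illustrative helpers, not any library's API — in a real pipeline you'd use an anonymization library such as presidio-anonymizer:

```python
import hashlib


def mask_email(value: str, keep: int = 2) -> str:
    """MASK: keep the first few characters of the local part, hide the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:keep]}***@{domain}"


def redact(value: str) -> str:
    """REDACT: replace the entire value, regardless of content."""
    return "[REDACTED]"


def hash_value(value: str, salt: str = "") -> str:
    """HASH: one-way SHA-256. The same input always yields the same token,
    which preserves ID consistency for analytics joins."""
    return hashlib.sha256((salt + value).encode()).hexdigest()


# ENCRYPT would use two-way encryption (e.g. AES-256 via a crypto library)
# so that only key holders can recover the original value — omitted here
# since it requires key management beyond a short sketch.

print(mask_email("user@example.com"))  # us***@example.com
print(redact("010-1234-5678"))         # [REDACTED]
```

Note how HASH is the only option among the irreversible three that still lets you count distinct users or join tables on the masked value.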
NER + Regex + Domain Rules: Why Multi-Layer Detection Is Necessary
This is a situation frequently encountered in practice — regular expressions alone cannot determine whether the name "Kim Cheolsu" is PII or general text. Conversely, using only an NER model can miss obvious patterns like "010-1234-5678" depending on context. That's why tools like Microsoft Presidio take a hybrid approach combining NER, regex, and checksum verification.
Detection Layer Structure:
- Layer 1: Regex → structured patterns like phone numbers, emails, national IDs
- Layer 2: NER model → unstructured text like person names and addresses
- Layer 3: Domain rules → Korean national ID checksum, business registration number validation
- Layer 4: Context → confidence adjustment based on surrounding words ("Contact:", "Name:")

Terminology — NER (Named Entity Recognition): A natural language processing technique that automatically identifies proper nouns such as person names, places, and organization names in text. In PII detection, this is extended to recognize all personally identifiable information, including phone numbers, addresses, and more.
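The layered idea can be sketched in a few lines using card numbers as the example: a regex (Layer 1) proposes candidates, a Luhn checksum (a Layer 3 domain rule) rejects random digit runs, and a nearby context word (Layer 4) raises confidence. This is a toy illustration — the function names are made up and not Presidio's API:

```python
import re

# Layer 1: structural pattern — 13 to 16 digits, optionally separated
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_ok(number: str) -> bool:
    """Layer 3 domain rule: Luhn checksum filters out random digit runs."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(
        d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
        for i, d in enumerate(digits)
    )
    return total % 10 == 0


def detect_card(text: str) -> list[dict]:
    findings = []
    for m in CARD_RE.finditer(text):
        if not luhn_ok(m.group()):
            continue  # pattern matched, but checksum failed -> likely not a card
        score = 0.6
        # Layer 4: context words just before the match raise confidence
        if re.search(r"(card|payment)", text[max(0, m.start() - 20):m.start()], re.I):
            score = 0.9
        findings.append({"value": m.group(), "score": score})
    return findings


print(detect_card("card: 4111 1111 1111 1111"))
```

Each layer cheaply discards what the previous one let through, which is exactly why hybrid detectors produce far fewer false positives than any single technique alone.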
Practical Application
Example 1: Blocking Hardcoded PII in Code with Pre-commit Hooks
This is the starting point that delivers the fastest results with the least effort. You'd be surprised how often real PII like user@test.com or 010-1234-5678 gets hardcoded in test code.
I once put a real email address in test code thinking, "It's just test code, whatever." The problem was that the test ran in CI, the email was printed as-is in the logs, the monitoring system collected those logs, and they ended up on a dashboard visible to the entire team. Letting my guard down because it was test code turned into quite an embarrassing situation. Setting up a pre-commit hook blocks the commit itself, which changes your habits entirely.
```yaml
# .pre-commit-config.yaml
repos:
  # PII detection — block hardcoded personal information in code
  - repo: https://github.com/uktrade/pii-secret-check-hooks
    rev: v0.5.0
    hooks:
      - id: pii_secret_filename_check
      - id: pii_secret_content_check
  # Secret detection — block leaks of API keys, passwords, etc.
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.21.0
    hooks:
      - id: gitleaks
```

Practical Tip: PII detection and secret detection (`gitleaks`) serve different purposes, but running them in parallel in the pipeline is effective. PII rules protect personal information while secret rules protect access credentials — both are "sensitive information that shouldn't be in code."
Once an infrastructure team member sets up .pre-commit-config.yaml, team members only need to run pip install pre-commit && pre-commit install. From then on, scans run automatically with every commit.
Example 2: Blocking Masking Gaps at the PR Stage with GitHub Actions + Presidio
This is the core of the pipeline. It's a workflow that cross-references DB schema files against the masking rules YAML, and blocks the PR if newly added columns aren't included in the masking rules. When I first set this up, I thought "schema parsing is going to be complicated," but since you only need to extract tables and columns from SQL DDL, it turned out to be simpler than expected.
```yaml
# .github/workflows/pii-check.yml
name: PII Masking Verification
on: [pull_request]
jobs:
  pii-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install presidio-analyzer presidio-anonymizer pyyaml
      - name: Verify Masking Completeness
        run: |
          python scripts/scan_schema_coverage.py \
            --schema-dir ./db/schemas \
            --masking-rules ./config/masking-rules.yaml \
            --fail-on-uncovered
```

Masking rules are kept in this YAML format alongside schema files and version-controlled together.
```yaml
# config/masking-rules.yaml
tables:
  users:
    email: MASK
    phone: REDACT
    name: HASH
    address: ENCRYPT
  orders:
    shipping_address: MASK
    recipient_name: HASH
```

And here's the core schema-rule cross-referencing script. I initially left this part as a black box in the draft, but ultimately decided "the article is meaningless without this" and chose to share it directly.
```python
# scripts/scan_schema_coverage.py
import argparse
import re
import sys
from pathlib import Path

import yaml


def extract_columns_from_ddl(schema_dir: str) -> dict[str, list[str]]:
    """Extract the column list for each table from SQL DDL files."""
    tables = {}
    create_re = re.compile(
        r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\(", re.IGNORECASE
    )
    column_re = re.compile(r"^\s+(\w+)\s+\w+", re.MULTILINE)
    for sql_file in Path(schema_dir).glob("**/*.sql"):
        content = sql_file.read_text()
        for match in create_re.finditer(content):
            table_name = match.group(1)
            # Walk forward to find the parenthesis closing the CREATE TABLE body
            start = match.end()
            depth, end = 1, start
            for i in range(start, len(content)):
                if content[i] == "(":
                    depth += 1
                elif content[i] == ")":
                    depth -= 1
                    if depth == 0:
                        end = i
                        break
            body = content[start:end]
            # Ignore constraint lines that aren't real columns
            skip = {"constraint", "primary", "foreign", "unique", "index", "check"}
            columns = [
                m.group(1)
                for m in column_re.finditer(body)
                if m.group(1).lower() not in skip
            ]
            tables[table_name] = columns
    return tables


def load_masking_rules(rules_path: str) -> dict[str, list[str]]:
    """Load the list of columns that have masking rules defined in the YAML."""
    with open(rules_path) as f:
        config = yaml.safe_load(f)
    return {
        table: list(columns.keys())
        for table, columns in config.get("tables", {}).items()
    }


def find_uncovered(schema_dir: str, rules_path: str) -> list[dict]:
    """Return the columns that have no masking rule."""
    schema_tables = extract_columns_from_ddl(schema_dir)
    rule_tables = load_masking_rules(rules_path)
    uncovered = []
    for table, columns in schema_tables.items():
        covered = set(rule_tables.get(table, []))
        for col in columns:
            if col not in covered and col not in ("id", "created_at", "updated_at"):
                uncovered.append({"table": table, "column": col})
    return uncovered


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--schema-dir", required=True)
    parser.add_argument("--masking-rules", required=True)
    parser.add_argument("--fail-on-uncovered", action="store_true")
    parser.add_argument("--warn-only", action="store_true")
    args = parser.parse_args()

    uncovered = find_uncovered(args.schema_dir, args.masking_rules)
    if not uncovered:
        print("All columns are covered by masking rules.")
        return

    print(f"\n{'!' * 50}")
    print(f" {len(uncovered)} uncovered column(s) detected:")
    print(f"{'!' * 50}\n")
    for item in uncovered:
        print(f" - {item['table']}.{item['column']}")

    if args.fail_on_uncovered and not args.warn_only:
        sys.exit(1)
    print("\n[WARN] Running in warn-only mode. CI will not fail.")


if __name__ == "__main__":
    main()
```

| Component | Role |
|---|---|
| `extract_columns_from_ddl` | Parses SQL DDL files to extract column lists per table |
| `load_masking_rules` | Loads the table and column lists from masking rules defined in YAML |
| `find_uncovered` | Finds columns that exist in the schema but not in the masking rules |
| `--fail-on-uncovered` | Exits with code 1 if uncovered columns are found, blocking the PR |
| `--warn-only` | Outputs warnings only and allows CI to pass (useful during initial adoption) |
The key to this setup is that it automatically verifies masking rule coverage on PRs with schema changes. If a developer adds `ALTER TABLE users ADD COLUMN birthday DATE;` but doesn't include `birthday` in the masking rules, CI automatically raises a red flag.
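One caveat: the parser above only reads `CREATE TABLE` statements, so it covers the `ALTER TABLE` case only if your schema files are regenerated after each migration. If migrations add columns via standalone `ALTER TABLE` files instead, you could extend extraction with something like the sketch below (the function name is mine, and the regex assumes simple `ALTER TABLE t ADD [COLUMN] c type;` DDL):

```python
import re

ALTER_ADD_RE = re.compile(
    r"ALTER\s+TABLE\s+(\w+)\s+ADD\s+(?:COLUMN\s+)?(\w+)\s+\w+", re.IGNORECASE
)


def extract_columns_from_alters(content: str) -> dict[str, list[str]]:
    """Collect columns added via ALTER TABLE ... ADD COLUMN statements."""
    tables: dict[str, list[str]] = {}
    for table, column in ALTER_ADD_RE.findall(content):
        tables.setdefault(table, []).append(column)
    return tables


ddl = "ALTER TABLE users ADD COLUMN birthday DATE;"
print(extract_columns_from_alters(ddl))  # {'users': ['birthday']}
```

Merging this result into the `CREATE TABLE` output before the coverage check closes the gap for migration-style repositories.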
Example 3: Double Verification After Masking — "Is the Masking Actually Working?"
There's one more point I want to address here. A masking rule "existing" and "actually working" are two different problems. I've personally experienced cases where rules existed but the application order got tangled, leaving some records unmasked. On our team, this happened during a masking pipeline refactoring — it looked fine on the surface, so it took three weeks to discover. That's why double verification by re-scanning masked data with Presidio is necessary.
```python
from presidio_analyzer import AnalyzerEngine


def validate_masking_completeness(
    masked_db, masking_rules: dict, score_threshold: float = 0.7
) -> list[dict]:
    """Verify, record by record, whether PII still remains in the masked DB."""
    analyzer = AnalyzerEngine()
    violations = []
    for table, columns in masking_rules.items():
        for col in columns:
            samples = masked_db.sample(table, col, n=1000)
            for idx, value in enumerate(samples):
                if value is None:
                    continue
                results = analyzer.analyze(
                    text=str(value),
                    language="en",
                    score_threshold=score_threshold,
                )
                if results:
                    violations.append({
                        "table": table,
                        "column": col,
                        "row_index": idx,
                        "detected_pii": [r.entity_type for r in results],
                    })
    return violations
```

| Parameter | Description |
|---|---|
| `n=1000` | Uses sampling instead of full scans to maintain CI speed |
| `score_threshold=0.7` | Too low and false positives explode; too high and you get false negatives. 0.7 is an empirically balanced starting point |
| Per-record iteration | Concatenating samples into one string erases record boundaries, making it impossible to trace which record the PII was detected in. Always analyze record by record |
Note: The reason `language="en"` is set in the code above is explained in Example 4 below — Presidio's default NER model does not support Korean, so separate configuration is required for Korean PII detection.
Adding this script as the final step in CI catches the ghost rule problem where "rules exist but masking isn't actually applied."
Example 4: Custom Korean PII Recognizers — Presidio's Korean Limitations and Workarounds
There's something I need to address honestly here. Presidio's default NER model (spaCy's `en_core_web_lg`) does not support Korean. Even if you pass `language="ko"`, the NER layer essentially doesn't work, and only regex-based recognizers operate. I initially didn't know this and expected that just setting `language="ko"` would catch Korean names too — I was quite surprised when I saw the test results.
To properly detect Korean PII, two things are needed:
- Regex-based custom recognizers: Structured patterns like resident registration numbers, phone numbers
- Korean NLP model integration: Unstructured text like Korean names, addresses (requires a spaCy Korean model or separate NLP pipeline)
Let's start with regex-based custom recognizers that can be applied immediately.
```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Korean resident registration number (RRN) recognizer — with/without hyphen
rrn_patterns = [
    Pattern(
        name="rrn_with_hyphen",
        regex=r"\b\d{6}-[1-4]\d{6}\b",
        score=0.9,
    ),
    Pattern(
        name="rrn_without_hyphen",
        regex=r"\b\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])[1-4]\d{6}\b",
        score=0.7,
    ),
]
rrn_recognizer = PatternRecognizer(
    supported_entity="KR_RRN",
    name="Korean RRN Recognizer",
    patterns=rrn_patterns,
    supported_language="en",
)

# Korean phone number recognizer
phone_recognizer = PatternRecognizer(
    supported_entity="KR_PHONE",
    name="Korean Phone Recognizer",
    patterns=[
        Pattern(
            name="kr_mobile",
            regex=r"\b01[016789]-?\d{3,4}-?\d{4}\b",
            score=0.85,
        )
    ],
    supported_language="en",
)

# Register the custom recognizers on top of the default English NLP engine
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(rrn_recognizer)
analyzer.registry.add_recognizer(phone_recognizer)

# Usage
results = analyzer.analyze(
    text="주민번호는 900101-1234567이고 연락처는 010-9876-5432입니다.",
    language="en",
    entities=["KR_RRN", "KR_PHONE"],
)
```

You might be curious why the score values differ. The hyphenated pattern (900101-1234567) has a clearly identifiable resident registration number format, so it's set high at 0.9. The pattern without a hyphen (9001011234567) can be confused with an ordinary digit string, so it's lowered to 0.7. Giving lower scores to patterns with higher false-positive potential allows flexible control when filtering by `score_threshold`.
Why `supported_language="en"`? If you call with `language="ko"` without separately registering a Korean NLP engine in Presidio, an error occurs. Since regex-based recognizers work regardless of language, they function correctly even when mounted on the default English engine. To support Korean NER (unstructured detection of names, addresses, etc.), you need to register a spaCy Korean model (`ko_core_news_sm`) or another NLP pipeline with `NlpEngineProvider`. This will be covered in detail in the next post.
Pros and Cons Analysis
Pros
| Item | Description |
|---|---|
| Pre-deployment blocking | Discovers masking gaps before deployment, preventing data breach incidents |
| Automated compliance | CI automatically ensures compliance with regulations like GDPR, HIPAA, and the EU AI Act |
| Audit trail | All masking verification results are preserved in CI logs, usable as audit evidence |
| Schema drift detection | Automatically detects masking gaps when new columns or tables are added |
| Shortened developer feedback loop | Early feedback from pre-commit and CI reduces code review burden |
Cons and Caveats
| Item | Description | Mitigation |
|---|---|---|
| False Positives | Normal data may be misidentified as PII, unnecessarily blocking builds | Start score_threshold at 0.7 and tune to your team's situation. Our team initially set it to 0.5, got exhausted by false positives, and raised it to 0.75 |
| False Negatives | May miss unstructured PII (names in free text, etc.) | Secure coverage with a multi-layer approach of NER + regex + domain rules |
| Korean limitations | Presidio's default NER does not support Korean, so only regex-based detection works | Requires adding custom recognizers + integrating a spaCy Korean model |
| CI speed degradation | Full schema scans on large schemas increase pipeline time | Use schema diff-based incremental scanning to verify only changed portions |
| Rule maintenance cost | Rules quickly become useless if they don't evolve with the schema | Keep masking rules alongside schema files and change them in the same PR |
| Re-identification risk | Masking individual fields alone still allows re-identification through quasi-identifier combinations | Add a separate verification layer such as k-anonymity |
Supplementary term — Quasi-identifier: Information that cannot identify an individual on its own, but becomes identifiable when multiple pieces are combined. For example, research has shown that combining date of birth + gender + zip code can identify a significant number of individuals.
Supplementary term — k-anonymity: A privacy model that ensures every record in a dataset has at least k records with the same quasi-identifier combination. If k=5, searching by a specific combination must return at least 5 people, making it difficult to single out one individual.
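To make the k-anonymity idea concrete, here is a minimal sketch that checks whether every quasi-identifier combination in a dataset appears at least k times. It's illustrative only — the function name and sample data are mine, and production checks would use a dedicated anonymization tool:

```python
from collections import Counter


def violates_k_anonymity(rows: list[dict], quasi_ids: list[str], k: int) -> list[tuple]:
    """Return quasi-identifier combinations that appear fewer than k times."""
    combos = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return [combo for combo, count in combos.items() if count < k]


rows = [
    {"birth_year": 1990, "gender": "F", "zip": "04524"},
    {"birth_year": 1990, "gender": "F", "zip": "04524"},
    {"birth_year": 1985, "gender": "M", "zip": "06164"},  # unique -> re-identifiable
]
print(violates_k_anonymity(rows, ["birth_year", "gender", "zip"], k=2))
# [(1985, 'M', '06164')]
```

Any combination this returns is a record an attacker could single out even after field-level masking, which is exactly the re-identification risk the table above warns about.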
Most Common Mistakes in Practice
- **Managing masking rules and schemas separately** — If schemas change through migrations but masking rules live in a wiki, falling out of sync is only a matter of time. Change them in the same repository, in the same PR.
- **Running with the default `score_threshold` without tuning** — Too low, and developers tire of false positives and start ignoring warnings; too high, and actual PII slips through. Running in `--warn-only` mode for about a week at first to find the appropriate threshold is absolutely necessary.
- **Verifying only the "existence" of rules, not their "operation"** — Masking rules existing in YAML doesn't guarantee masking is actually happening. Include the double verification introduced in Example 3. I once skipped this and had three weeks of unmasked data accumulate in staging.
Conclusion
The core of this pipeline is replacing "human memory" with a system, but the real value lies beyond that. Once built, verification runs automatically every time the schema changes, and you can tangibly feel the cognitive load decrease across the entire team. As this automation matures, it can evolve to the level of integrating with a Data Catalog to automatically suggest masking rules when new tables are added.
Here are 3 steps you can start with right now:

1. **Install pre-commit hooks** — After `pip install pre-commit`, add PII detection hooks to `.pre-commit-config.yaml`, and you can block hardcoded PII in code starting today. This is the starting point that delivers the fastest results with the least effort.
2. **Write a masking rules YAML** — Create `masking-rules.yaml` based on your current DB schema. You don't need to cover everything from the start — begin with the most sensitive tables. Columns like `email`, `phone`, `name`, `address`, `date_of_birth` in the `users` table and `card_number`, `billing_address` in the `payments` table are almost always PII. Start by catching these "sure things" and expand incrementally to keep the burden low.
3. **Add schema coverage verification to CI** — Set up the `scan_schema_coverage.py` script and GitHub Actions workflow introduced above as a PR trigger. Run with the `--warn-only` flag for the first two weeks, outputting warnings only, then switch to `--fail-on-uncovered` once the team is comfortable.
Next post: "Improving Korean PII Detection Accuracy with Presidio Custom NLP Engines — From spaCy Korean Model Integration to Name and Address Recognition"
References
- Microsoft Presidio — GitHub | Open-source PII detection and anonymization framework
- Presidio Official Documentation | Analyzer, anonymizer, and custom recognizer guide
- Presidio Research & Evaluation | PII detection model evaluation tools
- PII Masker (HydroXai) — GitHub | High-precision AI detection based on DeBERTa-v3
- HoundDog.ai — CI Integrations | CI/CD pipeline PII scanner integration guide
- Checkmarx + HoundDog.ai PII Leak Detection | Case study of integrating PII leak detection into AppSec
- Piiano — Application Data Leaks Detection | PII leak detection case analysis in open-source projects
- pii-secret-check-hooks — GitHub | PII + secret checking pre-commit hooks
- PII Detection in Integration Testing | hoop.dev | How to catch PII leaks in integration testing
- Top Open Source Sensitive Data Discovery Tools (2025) | Bytebase | Comparison of open-source sensitive data discovery tools
- Cyber Defense Magazine — PII Leak Detection in Code | The importance of code-level PII leak detection