Building a Korean PII Detection Pipeline with Presidio + spaCy
From Custom NLP Engine Integration to Name and Address Recognition
AI/ML · Data Engineering
Prerequisites: Basic Python 3.8+ syntax, experience installing pip packages, and a casual interest in NLP are all you need.
Table of Contents
- Core Concepts
- Practical Application
- The Most Common Mistakes in Practice
- Pros and Cons Analysis
- Conclusion
I still remember the bewilderment when I first ran Presidio on Korean text. When I fed it "김철수에게 연락주세요" (Please contact Kim Cheolsu), it grabbed "김철수에게" as a single token, and it couldn't find the resident registration number at all. I had high expectations since it was an open-source PII detection framework made by Microsoft, and it worked quite well for English — but it fell flat when faced with Korean.
In 2025, the Personal Information Protection Commission published the "Guide for Processing Personal Information in Generative AI Development and Utilization," making PII filtering in AI training data mandatory rather than optional. So I documented everything I figured out through hands-on trial and error: replacing the NLP engine with a spaCy Korean model and implementing custom recognizers tailored to Korea-specific PII formats. By the time you finish this article, you'll have a complete, working Korean PII detection pipeline in your hands.
Core Concepts
Presidio's Architecture — Why a "Custom NLP Engine" Is Needed
Presidio is broadly divided into two engines: the AnalyzerEngine, which finds PII in text, and the AnonymizerEngine, which masks, replaces, or encrypts the found PII. Inside the AnalyzerEngine sits the NlpEngine, which handles tokenization, morphological analysis, and Named Entity Recognition (NER) — and its default is the English spaCy model (en_core_web_lg).
What is NER (Named Entity Recognition)? It's an NLP technique that automatically finds and classifies proper nouns like person names, places, and organization names in text. In the sentence "김철수가 서울에서 근무한다" (Kim Cheolsu works in Seoul), tagging "김철수" → Person and "서울" → Location is what NER does.
[Text Input]
↓
[NlpEngine] → Tokenization + NER (spaCy/Stanza/Transformers)
↓
[Recognizer Registry] → Regex Matching + Context Keywords + NER Results Combined
↓
[RecognizerResult] → Entity Type, Position, Confidence Score
↓
[AnonymizerEngine] → Masking / Replacement / Encryption

The key point is that the NlpEngine is plugin-based. By passing a configuration dictionary or YAML file to NlpEngineProvider, you can swap in any model you want — whether it's a spaCy Korean model or a Hugging Face Transformer. And Presidio's PII recognition doesn't rely on NER alone; it combines NER + regex patterns + checksum validation + context keywords in a multi-layered approach. Thanks to this multi-layered structure, even in environments like Korean where NER alone can't provide full coverage, you can set up multiple safety nets.
What is NlpEngine? It's an abstraction layer inside Presidio responsible for text tokenization, morphological analysis, and named entity recognition. It officially supports three backends — spaCy, Stanza, and Hugging Face Transformers — and regardless of which backend you use, the rest of Presidio's pipeline works identically.
spaCy Korean Models — What They Can Do and Where Their Limits Are
The Korean pipelines (ko_core_news_sm/md/lg) introduced in spaCy v3.3 include tok2vec, tagger, morphologizer, parser, lemmatizer, senter, and ner components. NER is trained on the KLUE dataset (a benchmark dataset for evaluating Korean natural language understanding capabilities, encompassing various tasks including NER, sentence similarity, sentiment analysis, and more) and recognizes persons (PER), locations (LOC), organizations (ORG), and other entities.
There are two noteworthy points in practice. First, the md and lg models use floret vectors.
What are floret vectors? It's a vector technology that compresses fastText's subword embeddings into a Bloom filter-based hash table. It enables vector representations for out-of-vocabulary words without significantly increasing model size, making it particularly effective for agglutinative languages with rich particle and ending variations.
This means that forms with attached particles like "김철수에게" (to Kim Cheolsu) can also have vector representations. The other point is the tokenization method — it uses space + punctuation segmentation based on UD Korean Kaist, so it works without mecab-ko. This is quite welcome news for those who found mecab-ko installation cumbersome in deployment environments.
However, the limitations are also clear:
| Challenge | Specific Situation | Impact |
|---|---|---|
| Particle Attachment | "김철수에게" (to Kim Cheolsu), "서울시에서" (in Seoul) | Entity boundaries become ambiguous, reducing NER accuracy |
| Korea-Specific PII Formats | Resident registration number 900101-1234567, mobile phone 010-1234-5678 | Not supported by built-in recognizers |
| Address System Complexity | "서울특별시 강남구 테헤란로 123 OO빌딩 4층" (Seoul Gangnam-gu Teheran-ro 123 OO Building 4th Floor) | Hierarchical structure + mixed old/new address formats + abbreviations |
| Short Names | Korean names are 2–4 characters | Difficult to distinguish from common nouns |
Types of Presidio Recognizers — Which One Should You Use
To implement Korean PII detection, it's important to understand Presidio's recognizer system:
| Recognizer Type | Suitable PII | Characteristics |
|---|---|---|
| PatternRecognizer | Resident registration numbers, phone numbers, emails | Regex + context keyword based, simple to implement |
| EntityRecognizer | Person names, organization names, addresses | Leverages NER results or implements custom logic |
| RemoteRecognizer | Complex PII | Delegates recognition to external API calls |
The practical rule of thumb: for PII with fixed formats like resident registration numbers or phone numbers, PatternRecognizer is sufficient. For context-dependent PII like names or addresses, inheriting EntityRecognizer and adding validation logic on top of NER results is the effective approach.
Practical Application
Example 1: Setting Up the Presidio Engine with a spaCy Korean Model
The very first thing to do is replace Presidio's NLP engine with a Korean model.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download ko_core_news_lg

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "ko", "model_name": "ko_core_news_lg"}
    ]
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["ko"]
)

results = analyzer.analyze(
    text="김철수의 전화번호는 010-1234-5678이고 주민등록번호는 900101-1234567입니다.",
    language="ko"
)

for r in results:
    print(f"{r.entity_type}: score={r.score}, start={r.start}, end={r.end}")

| Code Element | Role |
|---|---|
| `nlp_engine_name: "spacy"` | Specifies spaCy as the NLP backend |
| `lang_code: "ko"` | Registers the Korean language code |
| `model_name: "ko_core_news_lg"` | Uses the large Korean model trained on KLUE (approximately 550MB per official documentation) |
| `supported_languages=["ko"]` | Configures the AnalyzerEngine to process Korean input |
Running it in this state produces the following result:
PERSON: score=0.85, start=0, end=4

"김철수" (Kim Cheolsu) is detected as PERSON, but the resident registration number and phone number don't appear in the results at all. This is because the built-in recognizers only support English-language PII (SSN, US phone numbers, etc.). This is where custom recognizers become necessary.
Example 2: Custom PatternRecognizer for Korean Resident Registration Numbers and Phone Numbers
PII with fixed formats can be cleanly handled with regex-based PatternRecognizer.
from presidio_analyzer import Pattern, PatternRecognizer
# Resident registration number (RRN) recognizer
kr_rrn_pattern = Pattern(
name="kr_rrn_pattern",
regex=r"\b(\d{2})(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])-?([1-4])\d{6}\b",
score=0.85
)
kr_rrn_recognizer = PatternRecognizer(
supported_entity="KR_RRN",
supported_language="ko",
patterns=[kr_rrn_pattern],
context=["주민등록번호", "주민번호", "생년월일"]
)
# Mobile phone number recognizer
kr_phone_pattern = Pattern(
name="kr_phone_pattern",
regex=r"\b(01[016789])-?(\d{3,4})-?(\d{4})\b",
score=0.7
)
kr_phone_recognizer = PatternRecognizer(
supported_entity="KR_PHONE_NUMBER",
supported_language="ko",
patterns=[kr_phone_pattern],
context=["전화번호", "휴대폰", "연락처", "핸드폰"]
)
# ⚠️ Reuses the analyzer instance created in Example 1 as-is
analyzer.registry.add_recognizer(kr_rrn_recognizer)
analyzer.registry.add_recognizer(kr_phone_recognizer)

| Parameter | Description |
|---|---|
| `regex` | Uses a pattern that validates date-of-birth validity for resident registration numbers |
| `score` | Default confidence score. Resident registration numbers get 0.85 since the format itself is unique; phone numbers get 0.7 since they can be confused with general number sequences |
| `context` | If these keywords appear nearby, the score is adjusted upward. The key point is to include enough Korean keywords |
⚠️ Known limitations of the resident registration number regex: the `[1-4]` in the pattern above only covers domestic resident registration numbers. Foreign registration numbers start with 5–8 in the second half, so to detect those as well, expand it to `[1-8]`. Additionally, `\b` (word boundary) may behave differently than expected around Korean text, because Python's regex engine treats Hangul syllables as word characters; in production, it's safer to use explicit boundary conditions such as `(?<=\s|^)` and `(?=\s|$)`, or digit guards like `(?<!\d)` and `(?!\d)` when Hangul can directly follow the number.
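A quick standard-library check makes the `\b` pitfall concrete. This is an illustrative sketch, not the full recognizer pattern: the regex below is a simplified RRN fragment already widened to `[1-8]`, and the digit guards `(?<!\d)`/`(?!\d)` are one possible workaround assumed here, not an official Presidio recommendation.

```python
import re

# Python's re module treats Hangul syllables as word characters, so there is
# no \b boundary between the final digit "7" and the following "입".
text = "주민등록번호는 900101-1234567입니다."

with_word_boundary = re.search(r"\b\d{6}-?[1-8]\d{6}\b", text)
assert with_word_boundary is None  # the trailing \b silently fails to match

# Negative digit lookarounds still anchor the number when Hangul follows it.
with_digit_guard = re.search(r"(?<!\d)\d{6}-?[1-8]\d{6}(?!\d)", text)
print(with_digit_guard.group())  # 900101-1234567
```

The failure mode matters for PII work: a pattern that fails to match is a false negative, which is exactly the kind of miss a compliance filter cannot afford.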
Now, after registering the custom recognizers and analyzing the same text again:
results = analyzer.analyze(
text="김철수의 전화번호는 010-1234-5678이고 주민등록번호는 900101-1234567입니다.",
language="ko"
)
for r in results:
    print(f"{r.entity_type}: score={r.score}, start={r.start}, end={r.end}")

PERSON: score=0.85, start=0, end=4
KR_PHONE_NUMBER: score=0.7, start=12, end=25
KR_RRN: score=0.85, start=36, end=50

You can confirm that both the phone number and resident registration number are detected. The context parameter is surprisingly powerful here — in the sentence "전화번호는 010-1234-5678입니다" (The phone number is 010-1234-5678), when the keyword "전화번호" (phone number) appears nearby, the base score of 0.7 gets adjusted upward. I initially filled this in carelessly myself, but after thoroughly populating the context keywords, accuracy improved noticeably.
Example 3: Enhancing Korean Name Recognition with NER
This is personally the part where I struggled the most. While spaCy's Korean NER model does detect person names with the PER label, it's not easy to distinguish whether "이수" is a person's name or the common noun "이수 (理數, mathematical principles)." Initially, I used NER results as-is without a surname dictionary, and it flagged every instance of "이수" as a person's name — resulting in a disaster where half the data got masked. Ultimately, I improved reliability by layering Korean surname dictionary-based validation on top of NER results.
from presidio_analyzer import EntityRecognizer, RecognizerResult
KOREAN_SURNAMES = {
"김", "이", "박", "최", "정", "강", "조", "윤", "장", "임",
"한", "오", "서", "신", "권", "황", "안", "송", "류", "전",
"홍", "고", "문", "양", "손", "배", "백", "허", "유", "남",
"심", "노", "하", "곽", "성", "차", "주", "우", "구", "민",
}
class KoreanNameRecognizer(EntityRecognizer):
    def load(self):
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        if not nlp_artifacts:
            return results
        for ent in nlp_artifacts.entities:
            if ent.label_ not in ("PER", "PERSON"):
                continue
            # ⚠️ rstrip works on a character set, so names like "이만" can
            # collapse to "". In production, use a morphological analyzer
            # (mecab-ko, kiwi, etc.) to separate particles instead.
            name = ent.text.rstrip("이가을를에게의와과도만")
            if len(name) < 2:
                continue
            score = 0.7
            if name[0] in KOREAN_SURNAMES:
                score = 0.9
            results.append(RecognizerResult(
                entity_type="KR_PERSON",
                start=ent.start_char,
                end=ent.start_char + len(name),
                score=score,
            ))
        return results

# ⚠️ Reuses the analyzer instance created in Example 1 as-is
kr_name_recognizer = KoreanNameRecognizer(
    supported_entities=["KR_PERSON"],
    supported_language="ko",
)
analyzer.registry.add_recognizer(kr_name_recognizer)

| Processing Step | Description |
|---|---|
| NER Filtering | Extracts only PER/PERSON labels from spaCy NER results |
| Particle Removal | Removes trailing particles using rstrip to extract the pure name |
| Surname Validation | If the first character is in the Korean surname dictionary, the score is raised from 0.7 to 0.9 |
| Length Validation | Korean names are at least 2 characters, so anything shorter is excluded |
Why the `rstrip` code is dangerous: `rstrip("이가을를에게의와과도만")` strips based on a character set, not a suffix string. For the name "이만" (Lee Man), both "이" and "만" get stripped, leaving an empty string; the `len(name) < 2` check in the code above filters that case out, which means the name is silently dropped. Names ending in set characters fare no better: "김가을" (Kim Ga-eul) gets trimmed to "김" and is likewise discarded by the length check. In production environments, it is strongly recommended to use a morphological analyzer (mecab-ko, kiwipiepy, etc.) to separate particles accurately.
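A few lines of plain Python make the character-set behavior tangible, and sketch a safer suffix-based alternative. The `PARTICLES` tuple and `strip_particle` helper are hypothetical illustrations for this article, not part of Presidio or spaCy, and real particle handling should still go through a morphological analyzer:

```python
CHAR_SET = "이가을를에게의와과도만"

# rstrip strips *characters*, not a suffix string, so some names get mangled:
assert "김철수에게".rstrip(CHAR_SET) == "김철수"   # looks fine here...
assert "이만".rstrip(CHAR_SET) == ""              # ...but "이만" vanishes entirely
assert "김가을".rstrip(CHAR_SET) == "김"          # and "김가을" loses its given name

# A safer sketch: remove only whole particle strings, longest first, and keep
# at least two characters of the candidate name.
PARTICLES = ("에게", "의", "가", "이", "을", "를", "와", "과", "도", "만")

def strip_particle(token: str) -> str:
    for p in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(p) and len(token) - len(p) >= 2:
            return token[: -len(p)]
    return token

assert strip_particle("김철수에게") == "김철수"
assert strip_particle("이만") == "이만"  # left intact: the remainder would be too short
```

Even this suffix approach can mis-trim names whose last syllable coincides with a particle, which is exactly why the morphological-analyzer route is the production recommendation.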
One more thing to watch out for is the part that calculates the end position with ent.start_char + len(name). There can be edge cases where the length of the name after stripping particles doesn't match the position in the original text. In practice, it's a good idea to always run tests like the following:
test_cases = [
"김철수에게 연락했다",
"이만수의 기록",
"박지성과 손흥민이 만났다",
]
for text in test_cases:
    results = analyzer.analyze(text=text, language="ko")
    for r in results:
        detected = text[r.start:r.end]
        print(f"original: '{text}' → detected: '{detected}' ({r.entity_type}, {r.score})")

Example 4: Multilingual Configuration (English + Korean)
When you need to process text with mixed English and Korean, you can configure a multilingual engine neatly with a YAML configuration file.
# languages-config.yml
nlp_engine_name: spacy
models:
  - lang_code: ko
    model_name: ko_core_news_lg
  - lang_code: en
    model_name: en_core_web_lg

provider = NlpEngineProvider(conf_file="./languages-config.yml")
nlp_engine = provider.create_engine()
analyzer = AnalyzerEngine(
nlp_engine=nlp_engine,
supported_languages=["ko", "en"]
)
ko_results = analyzer.analyze(
text="고객 김철수(010-9876-5432)의 주민등록번호는 850315-1234567입니다.",
language="ko"
)
en_results = analyzer.analyze(
text="Contact John at john@example.com",
language="en"
)

It's convenient because you can handle both Korean and English PII with a single Presidio instance. However, there's a limitation that PatternRecognizer only supports one language (Issue #1606), so you need to create and register separate recognizers for Korean and English.
Example 5: Inserting a PII Filter into a RAG Pipeline
By this point, basic Korean PII detection is working, but when I tried to integrate it with an actual RAG system, new concerns arose — things like what format to output the masking results in, and how to log detection failures. Let me wrap up with a RAG preprocessing integration example, which is the most in-demand pattern these days.
import logging
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
logger = logging.getLogger(__name__)
anonymizer = AnonymizerEngine()
def mask_pii_for_rag(text: str) -> str:
    """Preprocessing function that masks PII at the front of a RAG pipeline."""
    results = analyzer.analyze(text=text, language="ko")
    logger.info(f"PII detection results: {len(results)} found")
    if not results:
        return text
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "KR_RRN": OperatorConfig("replace", {"new_value": "[주민번호]"}),
            "KR_PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[전화번호]"}),
            "KR_PERSON": OperatorConfig("replace", {"new_value": "[이름]"}),
            "DEFAULT": OperatorConfig("replace", {"new_value": "[개인정보]"}),
        }
    )
    return anonymized.text
# Usage example
raw_text = "고객 김철수(010-1234-5678)의 주민등록번호는 900101-1234567입니다."
safe_text = mask_pii_for_rag(raw_text)
print(safe_text)
# Result: "고객 [이름]([전화번호])의 주민등록번호는 [주민번호]입니다."

This way, you can prevent sensitive information like resident registration numbers or phone numbers from entering the LLM context. It's also the most direct technical approach to implementing the "de-identification of personal information in AI training data" required by the Personal Information Protection Commission's guide.
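Under the hood, replacing several spans in one string only works if earlier offsets stay valid, which is why span replacement runs right to left; AnonymizerEngine handles this for you, but the mechanic is easy to see with plain string slicing. The (start, end, replacement) triples below are offsets worked out by hand for this exact sentence, standing in for real analyzer results:

```python
# Masking spans from right to left keeps the offsets of the remaining spans
# valid, since each replacement only shifts text to its right.
text = "고객 김철수(010-1234-5678)의 주민등록번호는 900101-1234567입니다."
spans = [(3, 6, "[이름]"), (7, 20, "[전화번호]"), (31, 45, "[주민번호]")]

for start, end, replacement in sorted(spans, reverse=True):
    text = text[:start] + replacement + text[end:]

print(text)
# → 고객 [이름]([전화번호])의 주민등록번호는 [주민번호]입니다.
```

Processing left to right instead would require re-computing every later offset after each replacement, which is a classic source of off-by-one masking bugs.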
The Most Common Mistakes in Practice
This part is important enough that I want to highlight it separately. Here are three mistakes I've experienced or repeatedly seen around me.
1. Leaving context keywords as an empty list
If you only include regex patterns and leave context keywords empty, score adjustments don't happen and detection rates drop significantly. If you populate various expressions actually used in documents — such as "주민등록번호" (resident registration number), "주민번호" (ID number), "생년월일" (date of birth) — the perceived accuracy changes dramatically.
2. Forgetting post-processing for particles
When NER captures "김철수에게" (to Kim Cheolsu) as a whole PER entity and you pass it to Presidio as-is, the anonymization masks "에게" (to) as well, resulting in "[Name] contacted" — breaking the context and also negatively affecting downstream NLP processing.
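One concrete way to fix this is to shrink the entity span before handing it to the anonymizer, so the mask replaces only the name and the particle survives. `adjust_span` and its particle list are a hypothetical sketch for illustration, not a Presidio API:

```python
# Trim a trailing particle off an NER span so masking keeps the particle.
# The particle tuple is an illustrative subset, not an exhaustive list.
def adjust_span(text, start, end, particles=("에게", "의", "가", "이")):
    span = text[start:end]
    for p in sorted(particles, key=len, reverse=True):
        if span.endswith(p) and len(span) - len(p) >= 2:
            return start, end - len(p)
    return start, end

text = "김철수에게 연락했다"
start, end = adjust_span(text, 0, 5)          # NER captured "김철수에게" as PER
masked = text[:start] + "[이름]" + text[end:]
print(masked)  # [이름]에게 연락했다
```

The masked sentence still reads naturally ("contacted [name]"), which keeps downstream NLP steps intact.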
3. Deploying without test data
Since you can't use actual personal information, the process of building a test set with fictitious name, resident registration number, and phone number combinations and measuring precision/recall cannot be skipped.
What are Precision vs Recall? Precision is "the proportion of actual PII among what was identified as PII," and Recall is "the proportion of actual PII that was successfully detected." In PII detection, missing something (false negative) is more dangerous than a false alarm (false positive), so the general strategy is to secure recall first and then tune precision.
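In code, the two metrics are just ratios over true positives (TP), false positives (FP), and false negatives (FN); the counts below are made up purely for illustration:

```python
def precision_recall(tp: int, fp: int, fn: int):
    # Precision: of everything flagged as PII, how much really was PII?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of all real PII in the data, how much did we actually catch?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 47 correct detections, 3 false alarms, 5 missed entities
p, r = precision_recall(tp=47, fp=3, fn=5)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.94, recall=0.90
```

Computing both on a fixed synthetic test set before every deployment makes the recall-first tuning strategy measurable instead of anecdotal.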
Pros and Cons Analysis
Pros
| Item | Details |
|---|---|
| Modular Architecture | The NLP engine, recognizers, and anonymizer are separated, allowing you to swap just the Korean model or add only recognizers |
| Multi-layered Detection Strategy | By combining NER + regex + context keywords, it's possible to achieve higher recall compared to any single approach |
| Transformer Integration | You can directly connect KLUE-BERT, KoBERT, etc. via TransformersNlpEngine. The perceived accuracy for context-dependent entities like names and addresses improves significantly compared to spaCy NER — I'll cover this in detail in the next article |
| Production-Friendly | Provides Docker images and REST API server mode for easy integration into microservices |
| Active Ecosystem | Steadily updated under Microsoft's leadership, with ongoing discussions about next-generation features including LLM integration |
Cons and Caveats
| Item | Details | Mitigation |
|---|---|---|
| No Built-in Korean Support | Languages other than English follow a "bring your own model" approach, with significant initial cost for custom recognizer implementation | Start by building resident registration number, phone number, and name recognizers based on the code in this article, then expand gradually — this is the realistic approach |
| NER Accuracy Limitations | Accuracy may drop on real-world text such as colloquial language, social media, and customer service logs | Consider switching to Transformer models like KLUE-BERT, or domain-specific fine-tuning |
| Particle Handling Issues | Entity boundary recognition errors are frequent in forms like "김철수에게" (to Kim Cheolsu) | Morphological analyzer-based particle separation logic in the NER result post-processing step is essential |
| Model Size | ko_core_news_lg is approximately 550MB per official documentation, creating memory and loading time overhead | For lightweight environments, start with ko_core_news_sm and upgrade as needed |
| Korean Address Recognition | Regex alone has limitations due to hierarchical structures + mixed old/new addresses + abbreviations | A composite recognizer combining NER (LOC) + address keyword dictionary + regex is needed |
Conclusion
Thanks to Presidio's modular architecture, simply replacing the spaCy Korean model and adding custom recognizers is enough to extend an English-only PII detection pipeline for the Korean language environment. Of course, it's not perfect — especially for particle handling and address recognition, there's still a lot of manual work involved. But with regulations tightening now, "starting at a usable level and improving incrementally" is a far more realistic strategy than "waiting until it's perfect." In my experience, achieving 95%+ recall for resident registration numbers and phone numbers and 80%+ recall for person names is sufficient for an initial deployment, with tuning done afterward based on production service logs.
Three steps you can start right now:
1. Environment Setup — Install dependencies with `pip install presidio-analyzer presidio-anonymizer && python -m spacy download ko_core_news_lg`, and verify that the Korean NLP engine works correctly using the Example 1 code.
2. Register Custom Recognizers — Add the resident registration number and phone number `PatternRecognizer` from Example 2 and the `KoreanNameRecognizer` from Example 3 to your project, and check the detection results against actual business text (de-identified).
3. Pipeline Integration — Referencing the `mask_pii_for_rag` pattern from Example 5, insert a PII filter into your RAG pipeline or data preprocessing workflow. Since Presidio also supports REST API mode, you can operate it as a microservice by utilizing the official Docker image.
Next Article: Taking it one step further with Transformer models — I'll cover advanced Korean PII detection methods that go beyond spaCy NER limitations by connecting KLUE-BERT to `TransformersNlpEngine`.
References
Essential References — The documents to check first when following along with or extending the code in this article:
- Microsoft Presidio GitHub Repository — Framework source code and Docker images
- Presidio Custom NLP Engine Configuration Guide | Microsoft — Detailed instructions for NLP engine replacement
- Presidio Language Addition Tutorial | Microsoft — Official tutorial for non-English language support
- spaCy Korean Model Official Documentation | Explosion — Detailed information on model size, performance, and components
- KLUE Benchmark — The benchmark dataset used for Korean NER training
Advanced References — Useful when diving deep into specific issues or exploring alternatives:
- Presidio Analyzer Official Documentation | Microsoft
- Presidio Multilingual Support Documentation | Microsoft
- Presidio spaCy/Stanza NLP Engine Documentation | Microsoft
- Presidio Transformer NLP Engine Documentation | Microsoft
- spaCy Korean NER Particle Inclusion Issue (Issue #13705) | GitHub
- spaCy Korean Pipeline Feedback Discussion (Discussion #10624) | GitHub
- PatternRecognizer Multilingual Support Issue (Issue #1606) | GitHub
- LlamaIndex Presidio PII Masking Guide | LlamaIndex
- Generative AI Personal Information Processing Guide | Personal Information Protection Commission
- Kaggle: PII Microsoft Presidio KR Notebook
- mcp-pii-tools — Korean PII Tools | GitHub