Building a Korean PII Detection Pipeline with Presidio + spaCy
From Custom NLP Engine Integration to Name and Address Recognition
AI/ML · Data Engineering
Prerequisites: Basic Python 3.8+ syntax, experience installing pip packages, and a casual interest in NLP are all you need.
Table of Contents
- Core Concepts
- Practical Application
- The Most Common Mistakes in Practice
- Pros and Cons Analysis
- Conclusion
I still remember the bewilderment when I first ran Presidio on Korean text. When I fed it "김철수에게 연락주세요" (Please contact Kim Cheolsu), it grabbed "김철수에게" as a single token, and it couldn't find the resident registration number at all. I had high expectations since it was an open-source PII detection framework made by Microsoft, and it worked quite well for English — but it fell flat when faced with Korean.
In 2025, the Personal Information Protection Commission published the "Guide for Processing Personal Information in Generative AI Development and Utilization," making PII filtering in AI training data mandatory rather than optional. So I documented everything I figured out through hands-on trial and error: replacing the NLP engine with a spaCy Korean model and implementing custom recognizers tailored to Korea-specific PII formats. By the time you finish this article, you'll have a complete, working Korean PII detection pipeline in your hands.
Core Concepts
Presidio's Architecture — Why a "Custom NLP Engine" Is Needed
Presidio is broadly divided into two engines: the AnalyzerEngine, which finds PII in text, and the AnonymizerEngine, which masks, replaces, or encrypts the found PII. Inside the AnalyzerEngine sits the NlpEngine, which handles tokenization, morphological analysis, and Named Entity Recognition (NER) — and its default is the English spaCy model (en_core_web_lg).
What is NER (Named Entity Recognition)? It's an NLP technique that automatically finds and classifies proper nouns like person names, places, and organization names in text. In the sentence "김철수가 서울에서 근무한다" (Kim Cheolsu works in Seoul), tagging "김철수" → Person and "서울" → Location is what NER does.
[Text Input]
↓
[NlpEngine] → Tokenization + NER (spaCy/Stanza/Transformers)
↓
[Recognizer Registry] → Regex Matching + Context Keywords + NER Results Combined
↓
[RecognizerResult] → Entity Type, Position, Confidence Score
↓
[AnonymizerEngine] → Masking / Replacement / Encryption

The key point is that the NlpEngine is plugin-based. By passing a configuration dictionary or YAML file to NlpEngineProvider, you can swap in any model you want — whether it's a spaCy Korean model or a Hugging Face Transformer. And Presidio's PII recognition doesn't rely on NER alone; it combines NER + regex patterns + checksum validation + context keywords in a multi-layered approach. Thanks to this multi-layered structure, even in environments like Korean where NER alone can't provide full coverage, you can set up multiple safety nets.
What is NlpEngine? It's an abstraction layer inside Presidio responsible for text tokenization, morphological analysis, and named entity recognition. It officially supports three backends — spaCy, Stanza, and Hugging Face Transformers — and regardless of which backend you use, the rest of Presidio's pipeline works identically.
spaCy Korean Models — What They Can Do and Where Their Limits Are
The Korean pipelines (ko_core_news_sm/md/lg) introduced in spaCy v3.3 include tok2vec, tagger, morphologizer, parser, lemmatizer, senter, and ner components. NER is trained on the KLUE dataset (a benchmark dataset for evaluating Korean natural language understanding capabilities, encompassing various tasks including NER, sentence similarity, sentiment analysis, and more) and recognizes persons (PER), locations (LOC), organizations (ORG), and other entities.
There are two noteworthy points in practice. First, the md and lg models use floret vectors.
What are floret vectors? It's a vector technology that compresses fastText's subword embeddings into a Bloom filter-based hash table. It enables vector representations for out-of-vocabulary words without significantly increasing model size, making it particularly effective for agglutinative languages with rich particle and ending variations.
This means that forms with attached particles like "김철수에게" (to Kim Cheolsu) can also have vector representations. The other point is the tokenization method — it uses space + punctuation segmentation based on UD Korean Kaist, so it works without mecab-ko. This is quite welcome news for those who found mecab-ko installation cumbersome in deployment environments.
However, the limitations are also clear:
| Challenge | Specific Situation | Impact |
|---|---|---|
| Particle Attachment | "김철수에게" (to Kim Cheolsu), "서울시에서" (in Seoul) | Entity boundaries become ambiguous, reducing NER accuracy |
| Korea-Specific PII Formats | Resident registration number 900101-1234567, mobile phone 010-1234-5678 | Not supported by built-in recognizers |
| Address System Complexity | "서울특별시 강남구 테헤란로 123 OO빌딩 4층" (Seoul Gangnam-gu Teheran-ro 123 OO Building 4th Floor) | Hierarchical structure + mixed old/new address formats + abbreviations |
| Short Names | Korean names are 2–4 characters | Difficult to distinguish from common nouns |
Types of Presidio Recognizers — Which One Should You Use
To implement Korean PII detection, it's important to understand Presidio's recognizer system:
| Recognizer Type | Suitable PII | Characteristics |
|---|---|---|
| PatternRecognizer | Resident registration numbers, phone numbers, emails | Regex + context keyword based, simple to implement |
| EntityRecognizer | Person names, organization names, addresses | Leverages NER results or implements custom logic |
| RemoteRecognizer | Complex PII | Delegates recognition to external API calls |
The practical rule of thumb: for PII with fixed formats like resident registration numbers or phone numbers, PatternRecognizer is sufficient. For context-dependent PII like names or addresses, inheriting EntityRecognizer and adding validation logic on top of NER results is the effective approach.
Practical Application
Example 1: Setting Up the Presidio Engine with a spaCy Korean Model
The very first thing to do is replace Presidio's NLP engine with a Korean model.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download ko_core_news_lg

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "ko", "model_name": "ko_core_news_lg"}
    ]
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine,
    supported_languages=["ko"]
)

results = analyzer.analyze(
    text="김철수의 전화번호는 010-1234-5678이고 주민등록번호는 900101-1234567입니다.",
    language="ko"
)

for r in results:
    print(f"{r.entity_type}: score={r.score}, start={r.start}, end={r.end}")

| Code Element | Role |
|---|---|
| `nlp_engine_name: "spacy"` | Specifies spaCy as the NLP backend |
| `lang_code: "ko"` | Registers the Korean language code |
| `model_name: "ko_core_news_lg"` | Uses the large Korean model trained on KLUE (approximately 550MB per official documentation) |
| `supported_languages=["ko"]` | Configures the AnalyzerEngine to process Korean input |
Running it in this state produces the following result:
PERSON: score=0.85, start=0, end=4

"김철수" (Kim Cheolsu) is detected as PERSON, but the resident registration number and phone number don't appear in the results at all. This is because the built-in recognizers only support English-language PII (SSN, US phone numbers, etc.). This is where custom recognizers become necessary.
Example 2: Custom PatternRecognizer for Korean Resident Registration Numbers and Phone Numbers
PII with fixed formats can be cleanly handled with regex-based PatternRecognizer.
from presidio_analyzer import Pattern, PatternRecognizer
# Resident registration number (RRN) recognizer
kr_rrn_pattern = Pattern(
name="kr_rrn_pattern",
regex=r"\b(\d{2})(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])-?([1-4])\d{6}\b",
score=0.85
)
kr_rrn_recognizer = PatternRecognizer(
supported_entity="KR_RRN",
supported_language="ko",
patterns=[kr_rrn_pattern],
context=["주민등록번호", "주민번호", "생년월일"]
)
# Mobile phone number recognizer
kr_phone_pattern = Pattern(
name="kr_phone_pattern",
regex=r"\b(01[016789])-?(\d{3,4})-?(\d{4})\b",
score=0.7
)
kr_phone_recognizer = PatternRecognizer(
supported_entity="KR_PHONE_NUMBER",
supported_language="ko",
patterns=[kr_phone_pattern],
context=["전화번호", "휴대폰", "연락처", "핸드폰"]
)
# ⚠️ Reuses the analyzer instance created in Example 1 as-is
analyzer.registry.add_recognizer(kr_rrn_recognizer)
analyzer.registry.add_recognizer(kr_phone_recognizer)

| Parameter | Description |
|---|---|
| `regex` | Uses a pattern that validates date-of-birth validity for resident registration numbers |
| `score` | Default confidence score. Resident registration numbers get 0.85 since the format itself is unique; phone numbers get 0.7 since they can be confused with general number sequences |
| `context` | If these keywords appear nearby, the score is adjusted upward. The key point is to include enough Korean keywords |
⚠️ Known limitations of the resident registration number regex: the `[1-4]` in the pattern above only covers domestic resident registration numbers. Foreign registration numbers start with 5–8 in the second half, so to detect those as well, expand it to `[1-8]`. Additionally, `\b` (word boundary) may behave differently than expected around Korean text, because Python's regex engine treats Hangul syllables as word characters; in production, it's safer to use explicit boundary conditions such as `(?<=\s|^)` and `(?=\s|$)`, or digit guards like `(?<!\d)` and `(?!\d)` when Hangul can directly follow the number.
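A quick standard-library check makes the `\b` pitfall concrete. This is an illustrative sketch, not the full recognizer pattern: the regex below is a simplified RRN fragment already widened to `[1-8]`, and the digit guards `(?<!\d)`/`(?!\d)` are one possible workaround assumed here, not an official Presidio recommendation.

```python
import re

# Python's re module treats Hangul syllables as word characters, so there is
# no \b boundary between the final digit "7" and the following "입".
text = "주민등록번호는 900101-1234567입니다."

with_word_boundary = re.search(r"\b\d{6}-?[1-8]\d{6}\b", text)
assert with_word_boundary is None  # the trailing \b silently fails to match

# Negative digit lookarounds still anchor the number when Hangul follows it.
with_digit_guard = re.search(r"(?<!\d)\d{6}-?[1-8]\d{6}(?!\d)", text)
print(with_digit_guard.group())  # 900101-1234567
```

The failure mode matters for PII work: a pattern that fails to match is a false negative, which is exactly the kind of miss a compliance filter cannot afford.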
Now, after registering the custom recognizers and analyzing the same text again:
results = analyzer.analyze(
text="김철수의 전화번호는 010-1234-5678이고 주민등록번호는 900101-1234567입니다.",
language="ko"
)
for r in results:
    print(f"{r.entity_type}: score={r.score}, start={r.start}, end={r.end}")

PERSON: score=0.85, start=0, end=4
KR_PHONE_NUMBER: score=0.7, start=12, end=25
KR_RRN: score=0.85, start=36, end=50

You can confirm that both the phone number and resident registration number are detected. The context parameter is surprisingly powerful here — in the sentence "전화번호는 010-1234-5678입니다" (The phone number is 010-1234-5678), when the keyword "전화번호" (phone number) appears nearby, the base score of 0.7 gets adjusted upward. I initially filled this in carelessly myself, but after thoroughly populating the context keywords, accuracy improved noticeably.
Example 3: Enhancing Korean Name Recognition with NER
This is personally the part where I struggled the most. While spaCy's Korean NER model does detect person names with the PER label, it's not easy to distinguish whether "이수" is a person's name or the common noun "이수 (理數, mathematical principles)." Initially, I used NER results as-is without a surname dictionary, and it flagged every instance of "이수" as a person's name — resulting in a disaster where half the data got masked. Ultimately, I improved reliability by layering Korean surname dictionary-based validation on top of NER results.
from presidio_analyzer import EntityRecognizer, RecognizerResult
KOREAN_SURNAMES = {
"김", "이", "박", "최", "정", "강", "조", "윤", "장", "임",
"한", "오", "서", "신", "권", "황", "안", "송", "류", "전",
"홍", "고", "문", "양", "손", "배", "백", "허", "유", "남",
"심", "노", "하", "곽", "성", "차", "주", "우", "구", "민",
}
class KoreanNameRecognizer(EntityRecognizer):
    def load(self):
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        if not nlp_artifacts:
            return results
        for ent in nlp_artifacts.entities:
            if ent.label_ not in ("PER", "PERSON"):
                continue
            # ⚠️ rstrip works on a character set, so names like "이만" can
            # collapse to "". In production, use a morphological analyzer
            # (mecab-ko, kiwi, etc.) to separate particles instead.
            name = ent.text.rstrip("이가을를에게의와과도만")
            if len(name) < 2:
                continue
            score = 0.7
            if name[0] in KOREAN_SURNAMES:
                score = 0.9
            results.append(RecognizerResult(
                entity_type="KR_PERSON",
                start=ent.start_char,
                end=ent.start_char + len(name),
                score=score,
            ))
        return results

# ⚠️ Reuses the analyzer instance created in Example 1 as-is
kr_name_recognizer = KoreanNameRecognizer(
    supported_entities=["KR_PERSON"],
    supported_language="ko",
)
analyzer.registry.add_recognizer(kr_name_recognizer)

| Processing Step | Description |
|---|---|
| NER Filtering | Extracts only PER/PERSON labels from spaCy NER results |
| Particle Removal | Removes trailing particles using rstrip to extract the pure name |
| Surname Validation | If the first character is in the Korean surname dictionary, the score is raised from 0.7 to 0.9 |
| Length Validation | Korean names are at least 2 characters, so anything shorter is excluded |
Why the `rstrip` code is dangerous: `rstrip("이가을를에게의와과도만")` strips based on a character set, not a suffix string. For the name "이만" (Lee Man), both "이" and "만" get stripped, leaving an empty string; the `len(name) < 2` check in the code above filters that case out, which means the name is silently dropped. Names ending in set characters fare no better: "김가을" (Kim Ga-eul) gets trimmed to "김" and is likewise discarded by the length check. In production environments, it is strongly recommended to use a morphological analyzer (mecab-ko, kiwipiepy, etc.) to separate particles accurately.
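A few lines of plain Python make the character-set behavior tangible, and sketch a safer suffix-based alternative. The `PARTICLES` tuple and `strip_particle` helper are hypothetical illustrations for this article, not part of Presidio or spaCy, and real particle handling should still go through a morphological analyzer:

```python
CHAR_SET = "이가을를에게의와과도만"

# rstrip strips *characters*, not a suffix string, so some names get mangled:
assert "김철수에게".rstrip(CHAR_SET) == "김철수"   # looks fine here...
assert "이만".rstrip(CHAR_SET) == ""              # ...but "이만" vanishes entirely
assert "김가을".rstrip(CHAR_SET) == "김"          # and "김가을" loses its given name

# A safer sketch: remove only whole particle strings, longest first, and keep
# at least two characters of the candidate name.
PARTICLES = ("에게", "의", "가", "이", "을", "를", "와", "과", "도", "만")

def strip_particle(token: str) -> str:
    for p in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(p) and len(token) - len(p) >= 2:
            return token[: -len(p)]
    return token

assert strip_particle("김철수에게") == "김철수"
assert strip_particle("이만") == "이만"  # left intact: the remainder would be too short
```

Even this suffix approach can mis-trim names whose last syllable coincides with a particle, which is exactly why the morphological-analyzer route is the production recommendation.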
One more thing to watch out for is the part that calculates the end position with ent.start_char + len(name). There can be edge cases where the length of the name after stripping particles doesn't match the position in the original text. In practice, it's a good idea to always run tests like the following:
test_cases = [
"김철수에게 연락했다",
"이만수의 기록",
"박지성과 손흥민이 만났다",
]
for text in test_cases:
    results = analyzer.analyze(text=text, language="ko")
    for r in results:
        detected = text[r.start:r.end]
        print(f"original: '{text}' → detected: '{detected}' ({r.entity_type}, {r.score})")

Example 4: Multilingual Configuration (English + Korean)
When you need to process text with mixed English and Korean, you can configure a multilingual engine neatly with a YAML configuration file.
# languages-config.yml
nlp_engine_name: spacy
models:
  - lang_code: ko
    model_name: ko_core_news_lg
  - lang_code: en
    model_name: en_core_web_lg

provider = NlpEngineProvider(conf_file="./languages-config.yml")
nlp_engine = provider.create_engine()
analyzer = AnalyzerEngine(
nlp_engine=nlp_engine,
supported_languages=["ko", "en"]
)
ko_results = analyzer.analyze(
text="고객 김철수(010-9876-5432)의 주민등록번호는 850315-1234567입니다.",
language="ko"
)
en_results = analyzer.analyze(
text="Contact John at john@example.com",
language="en"
)

It's convenient because you can handle both Korean and English PII with a single Presidio instance. However, there's a limitation that PatternRecognizer only supports one language (Issue #1606), so you need to create and register separate recognizers for Korean and English.
Example 5: Inserting a PII Filter into a RAG Pipeline
By this point, basic Korean PII detection is working, but when I tried to integrate it with an actual RAG system, new concerns arose — things like what format to output the masking results in, and how to log detection failures. Let me wrap up with a RAG preprocessing integration example, which is the most in-demand pattern these days.
import logging
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
logger = logging.getLogger(__name__)
anonymizer = AnonymizerEngine()
def mask_pii_for_rag(text: str) -> str:
    """Preprocessing function that masks PII at the front of a RAG pipeline."""
    results = analyzer.analyze(text=text, language="ko")
    logger.info(f"PII detection results: {len(results)} found")
    if not results:
        return text
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "KR_RRN": OperatorConfig("replace", {"new_value": "[주민번호]"}),
            "KR_PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[전화번호]"}),
            "KR_PERSON": OperatorConfig("replace", {"new_value": "[이름]"}),
            "DEFAULT": OperatorConfig("replace", {"new_value": "[개인정보]"}),
        }
    )
    return anonymized.text
# Usage example
raw_text = "고객 김철수(010-1234-5678)의 주민등록번호는 900101-1234567입니다."
safe_text = mask_pii_for_rag(raw_text)
print(safe_text)
# Result: "고객 [이름]([전화번호])의 주민등록번호는 [주민번호]입니다."

This way, you can prevent sensitive information like resident registration numbers or phone numbers from entering the LLM context. It's also the most direct technical approach to implementing the "de-identification of personal information in AI training data" required by the Personal Information Protection Commission's guide.
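Under the hood, replacing several spans in one string only works if earlier offsets stay valid, which is why span replacement runs right to left; AnonymizerEngine handles this for you, but the mechanic is easy to see with plain string slicing. The (start, end, replacement) triples below are offsets worked out by hand for this exact sentence, standing in for real analyzer results:

```python
# Masking spans from right to left keeps the offsets of the remaining spans
# valid, since each replacement only shifts text to its right.
text = "고객 김철수(010-1234-5678)의 주민등록번호는 900101-1234567입니다."
spans = [(3, 6, "[이름]"), (7, 20, "[전화번호]"), (31, 45, "[주민번호]")]

for start, end, replacement in sorted(spans, reverse=True):
    text = text[:start] + replacement + text[end:]

print(text)
# → 고객 [이름]([전화번호])의 주민등록번호는 [주민번호]입니다.
```

Processing left to right instead would require re-computing every later offset after each replacement, which is a classic source of off-by-one masking bugs.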
The Most Common Mistakes in Practice
This part is important enough that I want to highlight it separately. Here are three mistakes I've experienced or repeatedly seen around me.
1. Leaving context keywords as an empty list
If you only include regex patterns and leave context keywords empty, score adjustments don't happen and detection rates drop significantly. If you populate various expressions actually used in documents — such as "주민등록번호" (resident registration number), "주민번호" (ID number), "생년월일" (date of birth) — the perceived accuracy changes dramatically.
2. Forgetting post-processing for particles
When NER captures "김철수에게" (to Kim Cheolsu) as a whole PER entity and you pass it to Presidio as-is, the anonymization masks "에게" (to) as well, resulting in "[Name] contacted" — breaking the context and also negatively affecting downstream NLP processing.
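One concrete way to fix this is to shrink the entity span before handing it to the anonymizer, so the mask replaces only the name and the particle survives. `adjust_span` and its particle list are a hypothetical sketch for illustration, not a Presidio API:

```python
# Trim a trailing particle off an NER span so masking keeps the particle.
# The particle tuple is an illustrative subset, not an exhaustive list.
def adjust_span(text, start, end, particles=("에게", "의", "가", "이")):
    span = text[start:end]
    for p in sorted(particles, key=len, reverse=True):
        if span.endswith(p) and len(span) - len(p) >= 2:
            return start, end - len(p)
    return start, end

text = "김철수에게 연락했다"
start, end = adjust_span(text, 0, 5)          # NER captured "김철수에게" as PER
masked = text[:start] + "[이름]" + text[end:]
print(masked)  # [이름]에게 연락했다
```

The masked sentence still reads naturally ("contacted [name]"), which keeps downstream NLP steps intact.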
3. Deploying without test data
Since you can't use actual personal information, the process of building a test set with fictitious name, resident registration number, and phone number combinations and measuring precision/recall cannot be skipped.
What are Precision vs Recall? Precision is "the proportion of actual PII among what was identified as PII," and Recall is "the proportion of actual PII that was successfully detected." In PII detection, missing something (false negative) is more dangerous than a false alarm (false positive), so the general strategy is to secure recall first and then tune precision.
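In code, the two metrics are just ratios over true positives (TP), false positives (FP), and false negatives (FN); the counts below are made up purely for illustration:

```python
def precision_recall(tp: int, fp: int, fn: int):
    # Precision: of everything flagged as PII, how much really was PII?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of all real PII in the data, how much did we actually catch?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 47 correct detections, 3 false alarms, 5 missed entities
p, r = precision_recall(tp=47, fp=3, fn=5)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.94, recall=0.90
```

Computing both on a fixed synthetic test set before every deployment makes the recall-first tuning strategy measurable instead of anecdotal.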
Pros and Cons Analysis
Pros
| Item | Details |
|---|---|
| Modular Architecture | The NLP engine, recognizers, and anonymizer are separated, allowing you to swap just the Korean model or add only recognizers |
| Multi-layered Detection Strategy | By combining NER + regex + context keywords, it's possible to achieve higher recall compared to any single approach |
| Transformer Integration | You can directly connect KLUE-BERT, KoBERT, etc. via TransformersNlpEngine. The perceived accuracy for context-dependent entities like names and addresses improves significantly compared to spaCy NER — I'll cover this in detail in the next article |
| Production-Friendly | Provides Docker images and REST API server mode for easy integration into microservices |
| Active Ecosystem | Steadily updated under Microsoft's leadership, with ongoing discussions about next-generation features including LLM integration |
Cons and Caveats
| Item | Details | Mitigation |
|---|---|---|
| No Built-in Korean Support | Languages other than English follow a "bring your own model" approach, with significant initial cost for custom recognizer implementation | Start by building resident registration number, phone number, and name recognizers based on the code in this article, then expand gradually — this is the realistic approach |
| NER Accuracy Limitations | Accuracy may drop on real-world text such as colloquial language, social media, and customer service logs | Consider switching to Transformer models like KLUE-BERT, or domain-specific fine-tuning |
| Particle Handling Issues | Entity boundary recognition errors are frequent in forms like "김철수에게" (to Kim Cheolsu) | Morphological analyzer-based particle separation logic in the NER result post-processing step is essential |
| Model Size | ko_core_news_lg is approximately 550MB per official documentation, creating memory and loading time overhead | For lightweight environments, start with ko_core_news_sm and upgrade as needed |
| Korean Address Recognition | Regex alone has limitations due to hierarchical structures + mixed old/new addresses + abbreviations | A composite recognizer combining NER (LOC) + address keyword dictionary + regex is needed |
Conclusion
Thanks to Presidio's modular architecture, simply replacing the spaCy Korean model and adding custom recognizers is enough to extend an English-only PII detection pipeline for the Korean language environment. Of course, it's not perfect — especially for particle handling and address recognition, there's still a lot of manual work involved. But with regulations tightening now, "starting at a usable level and improving incrementally" is a far more realistic strategy than "waiting until it's perfect." In my experience, achieving 95%+ recall for resident registration numbers and phone numbers and 80%+ recall for person names is sufficient for an initial deployment, with tuning done afterward based on production service logs.
Three steps you can start right now:
1. Environment Setup — Install dependencies with `pip install presidio-analyzer presidio-anonymizer && python -m spacy download ko_core_news_lg`, and verify that the Korean NLP engine works correctly using the Example 1 code.
2. Register Custom Recognizers — Add the resident registration number and phone number `PatternRecognizer` from Example 2 and the `KoreanNameRecognizer` from Example 3 to your project, and check the detection results against actual business text (de-identified).
3. Pipeline Integration — Referencing the `mask_pii_for_rag` pattern from Example 5, insert a PII filter into your RAG pipeline or data preprocessing workflow. Since Presidio also supports REST API mode, you can operate it as a microservice by utilizing the official Docker image.
Next Article: Taking it one step further with Transformer models — I'll cover advanced Korean PII detection methods that go beyond spaCy NER limitations by connecting KLUE-BERT to `TransformersNlpEngine`.
References
Essential References — The documents to check first when following along with or extending the code in this article:
- Microsoft Presidio GitHub Repository — Framework source code and Docker images
- Presidio Custom NLP Engine Configuration Guide | Microsoft — Detailed instructions for NLP engine replacement
- Presidio Language Addition Tutorial | Microsoft — Official tutorial for non-English language support
- spaCy Korean Model Official Documentation | Explosion — Detailed information on model size, performance, and components
- KLUE Benchmark — The benchmark dataset used for Korean NER training
Advanced References — Useful when diving deep into specific issues or exploring alternatives:
- Presidio Analyzer Official Documentation | Microsoft
- Presidio Multilingual Support Documentation | Microsoft
- Presidio spaCy/Stanza NLP Engine Documentation | Microsoft
- Presidio Transformer NLP Engine Documentation | Microsoft
- spaCy Korean NER Particle Inclusion Issue (Issue #13705) | GitHub
- spaCy Korean Pipeline Feedback Discussion (Discussion #10624) | GitHub
- PatternRecognizer Multilingual Support Issue (Issue #1606) | GitHub
- LlamaIndex Presidio PII Masking Guide | LlamaIndex
- Generative AI Personal Information Processing Guide | Personal Information Protection Commission
- Kaggle: PII Microsoft Presidio KR Notebook
- mcp-pii-tools — Korean PII Tools | GitHub