PII Detection in 48 Languages

AI Is Global. AI DLP Tools Are Not.

ChatGPT, Claude, Gemini, and other AI assistants are used by employees in Frankfurt, Tokyo, São Paulo, Dubai, and Seoul, not just San Francisco. When a German employee asks Claude to help draft a response to a customer, the prompt may contain a German IBAN, an Austrian social insurance number, or a French INSEE number. When a Korean analyst pastes a compliance report into ChatGPT, it may include the Korean Resident Registration Numbers (RRNs) of the individuals named.

Enterprise AI DLP tools built on English-centric ML models are blind to most of this. Their training data is dominated by English-language PII patterns: US Social Security Numbers, US credit card formats, English names. Non-English identifiers, especially structured national IDs that appear in non-Latin-script text, are heavily underrepresented and frequently missed.

Nightfall is one example. Its product documentation lists no multilingual coverage, and all customer case studies on its website feature US-based organizations. Its ML models are described as achieving 95% precision, a figure that likely reflects English-language benchmark performance.

What English-Centric DLP Misses

Japanese My Number (マイナンバー)

12-digit individual identification number issued to every resident of Japan, including foreign residents. Required for tax, pension, and disaster relief records. The final digit is a check digit computed from the first eleven. Absent from most English-trained DLP models.
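As a minimal sketch of what check-digit validation for My Number can look like, the Python below applies the publicly documented weighting scheme (weights 2 to 7 for the six rightmost payload digits, then 2 to 6 for the remaining five, modulo 11). The function names are illustrative, the sample number is synthetic, and nothing here is taken from cloak.business's actual recognizer.

```python
import re


def my_number_check_digit(first11: str) -> int:
    """Compute the check digit for the first 11 digits of a My Number."""
    if not re.fullmatch(r"[0-9]{11}", first11):
        raise ValueError("expected exactly 11 ASCII digits")
    total = 0
    # n counts positions from the rightmost of the 11 payload digits (1..11).
    for n, ch in enumerate(reversed(first11), start=1):
        weight = n + 1 if n <= 6 else n - 5
        total += int(ch) * weight
    remainder = total % 11
    # Remainders 0 and 1 map to check digit 0.
    return 0 if remainder <= 1 else 11 - remainder


def is_valid_my_number(candidate: str) -> bool:
    """Return True if the candidate is 12 digits with a matching check digit."""
    digits = re.sub(r"[\s-]", "", candidate)
    if not re.fullmatch(r"[0-9]{12}", digits):
        return False
    return my_number_check_digit(digits[:11]) == int(digits[11])


print(is_valid_my_number("1234 5678 9018"))  # synthetic test value
```

A checksum of this kind lets a recognizer reject most random 12-digit strings (order numbers, phone fragments) without needing any surrounding-language context.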

Korean Resident Registration Number (주민등록번호, RRN)

13-digit number encoding date of birth and gender. Heavily regulated under Korea's Personal Information Protection Act (PIPA). Appears in Korean-language documents without Latin-script context clues that would help an English ML model.
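Because the RRN encodes a birth date and a gender digit, a recognizer can sanity-check candidates structurally. The sketch below is an assumption-laden illustration: it handles only the commonly documented codes 1 to 4 for the seventh digit, ignores codes used for foreign residents and pre-1900 births, and skips the final check digit. All names and sample numbers are made up for the example.

```python
import re
from datetime import date
from typing import Dict, List

# Six birth-date digits, optional hyphen, seven further digits.
RRN_PATTERN = re.compile(r"(?<![0-9])([0-9]{6})-?([0-9]{7})(?![0-9])")

# Seventh digit -> (century, gender). Only codes 1-4 are covered here.
CENTURY_AND_GENDER = {
    "1": (1900, "male"),
    "2": (1900, "female"),
    "3": (2000, "male"),
    "4": (2000, "female"),
}


def find_rrn_candidates(text: str) -> List[Dict[str, object]]:
    """Return RRN-shaped matches whose date and gender digit are plausible."""
    hits = []
    for m in RRN_PATTERN.finditer(text):
        birth, tail = m.group(1), m.group(2)
        info = CENTURY_AND_GENDER.get(tail[0])
        if info is None:
            continue
        century, gender = info
        try:
            born = date(century + int(birth[:2]), int(birth[2:4]), int(birth[4:6]))
        except ValueError:
            continue  # impossible calendar date, so not an RRN
        hits.append({"match": m.group(0), "birth_date": born, "gender": gender})
    return hits


print(find_rrn_candidates("주민등록번호: 900101-1234567"))
```

The pattern matches on digits alone, so it fires inside Korean-language text without relying on Latin-script context clues.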

Arabic Personal Names

Arabic names are written right-to-left in Arabic script and follow different naming conventions (nasab patronymic chains, tribal names). NER models trained on English names in Latin script cannot identify Arabic names as person entities without language-specific training.

German IBAN + Steuer-IdNr + Ausweis

Germany is Europe's largest economy and a major target for GDPR enforcement. German IBANs (DE prefix, 22 characters, mod-97 validation), the Steueridentifikationsnummer (11-digit tax ID), and Personalausweis (national ID card) numbers all require country-specific patterns. A generic ML model may miss them when they are surrounded by German-language rather than English context.
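The mod-97 check mentioned above is the standard IBAN validation: move the first four characters to the end, convert letters to numbers (A=10 through Z=35), and confirm the result leaves remainder 1 when divided by 97. A minimal sketch, with a hypothetical function name and the widely published example IBAN as test input:

```python
import re


def is_valid_german_iban(candidate: str) -> bool:
    """Check shape (DE + 20 digits) and the mod-97 checksum of a German IBAN."""
    iban = re.sub(r"\s+", "", candidate).upper()
    if not re.fullmatch(r"DE[0-9]{20}", iban):
        return False
    # Move country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the IBAN scheme requires.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_german_iban("DE89 3704 0044 0532 0130 00"))
```

This sketch covers only the IBAN; the Steuer-IdNr and Personalausweis numbers use different check-digit schemes and would need their own validators.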

How 48-Language Detection Works

cloak.business uses a three-layer detection pipeline designed for global coverage:

317 Regex Recognizers

Country-specific patterns with checksum validation. Completely language-agnostic — a German IBAN is detected regardless of the surrounding language. Covers 70+ countries with structured national identifiers.

spaCy + Stanza NLP

Language-specific NER models for person names, organizations, and locations. Both libraries support dozens of languages with language-specific training data — not translated English models.

XLM-RoBERTa

Facebook AI's cross-lingual transformer trained on 100 languages. Enables entity detection in languages where spaCy/Stanza models are less complete — including Arabic, Hebrew, Persian, and APAC scripts.
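How the three layers are combined is not described here, so the following is only a sketch of one plausible arrangement: each layer is a callable that returns spans, and earlier layers win when spans overlap. The gazetteer functions are stand-ins for real spaCy, Stanza, or XLM-RoBERTa calls, and every name in the block is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    source: str  # which layer produced the span


Layer = Callable[[str], Iterable[Span]]


def regex_layer(text: str) -> List[Span]:
    # Shape-only IBAN pattern; a real recognizer would also verify the checksum.
    return [Span(m.start(), m.end(), "IBAN", "regex")
            for m in re.finditer(r"\bDE[0-9]{20}\b", text)]


def make_gazetteer_layer(entries: Dict[str, str], source: str) -> Layer:
    """Placeholder for an NER model: looks up fixed strings instead of predicting."""
    def layer(text: str) -> List[Span]:
        return [Span(m.start(), m.end(), label, source)
                for term, label in entries.items()
                for m in re.finditer(re.escape(term), text)]
    return layer


def run_pipeline(text: str, layers: List[Layer]) -> List[Span]:
    """Run layers in order; a span is dropped if it overlaps an accepted one."""
    accepted: List[Span] = []
    for layer in layers:
        for span in layer(text):
            if not any(span.start < a.end and a.start < span.end for a in accepted):
                accepted.append(span)
    return sorted(accepted, key=lambda s: s.start)


text = "Anna Schmidt aus Berlin zahlt von DE89370400440532013000."
layers = [
    regex_layer,
    make_gazetteer_layer({"Anna Schmidt": "PERSON"}, "ner_stub"),
    make_gazetteer_layer({"Schmidt": "PERSON", "Berlin": "LOCATION"}, "transformer_stub"),
]
for span in run_pipeline(text, layers):
    print(span.label, span.source, text[span.start:span.end])
```

Giving the checksum-validated layer priority is a design assumption: structured identifiers are the cases where a pattern match is most trustworthy, while the model layers fill in names and locations the patterns cannot express.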

Coverage by Region

Western, Central & Southern Europe

German (de) · French (fr) · Spanish (es) · Italian (it) · Portuguese (pt) · Dutch (nl) · Polish (pl) · Czech (cs) · Slovak (sk) · Hungarian (hu) · Romanian (ro) · Greek (el)

IBANs, tax IDs, social security numbers, VAT numbers — all checksum-validated

Northern & Eastern Europe

Swedish (sv) · Norwegian (nb) · Danish (da) · Finnish (fi) · Bulgarian (bg) · Croatian (hr) · Serbian (sr) · Slovenian (sl) · Ukrainian (uk) · Russian (ru) · Lithuanian (lt) · Latvian (lv) · Estonian (et)

Nordic personal numbers, Eastern European national IDs, passport formats

Middle East & RTL Scripts

Arabic (ar) · Hebrew (he) · Persian/Farsi (fa) · Turkish (tr)

RTL name detection, Arabic national IDs, Turkish TC Kimlik number

Asia-Pacific

Japanese (ja) · Korean (ko) · Chinese Simplified (zh) · Hindi (hi) · Bengali (bn) · Vietnamese (vi) · Thai (th) · Malay (ms) · Indonesian (id)

Japanese My Number, Korean RRN, Chinese Resident ID, Aadhaar (India)

Americas & Other

English (en) · Portuguese BR (pt-BR) · Spanish variants · Swahili (sw) · Macedonian (mk) · Basque (eu) · Catalan (ca) · Galician (gl) · Afrikaans (af)

SSN, SIN, CURP, CPF, CUIL — Americas national identifiers

Right-to-Left (RTL) Script Detection

Arabic, Hebrew, and Persian are written right-to-left. PII in these languages requires NER models that understand script direction and language morphology — not Latin-script models applied to transliterated text.

This matters practically: multinational companies with operations in Israel, UAE, Saudi Arabia, or Iran process employee and customer data that includes RTL-script names and identifiers. An English-centric DLP tool scanning a mixed-language document will detect the English PII but miss the Arabic or Hebrew PII adjacent to it.

XLM-RoBERTa was trained on Common Crawl data in 100 languages including Arabic (ar), Hebrew (he), and Persian (fa), enabling entity detection that is not available in English-first ML DLP models.
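For mixed-language documents, one preprocessing idea (an assumption for illustration, not a description of how cloak.business or XLM-RoBERTa works internally) is to split text into runs by Unicode script so that each run can be routed to a model that covers it. The sketch uses only Python's standard `unicodedata` module; the helper names and sample names are invented.

```python
import unicodedata
from typing import List, Tuple

KNOWN_SCRIPTS = ("ARABIC", "HEBREW", "LATIN")


def script_of(ch: str) -> str:
    """Classify a character by the script prefix of its Unicode name."""
    if not ch.isalpha():
        return "COMMON"  # spaces, digits, punctuation, combining marks
    name = unicodedata.name(ch, "")
    for script in KNOWN_SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into (script, substring) runs; COMMON chars join a neighbor."""
    runs: List[List[str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "COMMON" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "COMMON":
            # Leading punctuation or digits adopt the first real script seen.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


sample = "Contact أحمد بن علي or יעל כהן today"
for script, chunk in script_runs(sample):
    print(script, repr(chunk))
```

Persian text also classifies as ARABIC here because it uses the Arabic script, so telling Arabic from Persian would still require a language identifier on top of this.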

The Business Case for Multilingual PII Detection

GDPR applies language-neutrally: a French INSEE number is personal data under GDPR regardless of what language a DLP tool was trained in
Multinational compliance requires consistent detection: a compliance posture that catches US SSNs but misses German Steuer-IdNr creates an uneven and indefensible gap
Global AI adoption is accelerating: employees in APAC and MENA regions use the same ChatGPT and Claude that EU/US employees use, so the same protection should apply
Customer-facing AI generates multilingual PII: a French customer's name in a support ticket, a Japanese order number, and a Korean billing address all need detection

Limitations and When Multilingual Detection Falls Short

Multilingual NLP detection is powerful but not universal. No current system achieves 100% recall across all 48 supported languages simultaneously — recall rates vary by language family, with Germanic and Romance languages achieving higher accuracy than less-resourced languages like Swahili or Tagalog where training data is sparser.

Context-dependent entities remain a limitation: a word that is both a common noun and a person's name in a given language will produce false positives or false negatives depending on surrounding context. Setting confidence thresholds requires per-language calibration — a threshold that works for German may miss valid French entities or over-detect in Indonesian.
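Per-language calibration can be as simple as a lookup table of thresholds applied after detection. The numbers below are purely illustrative placeholders, not measured values, and the `Detection` type is a hypothetical stand-in for whatever a real detector returns.

```python
from typing import Dict, List, NamedTuple


class Detection(NamedTuple):
    text: str
    label: str
    score: float  # model confidence in [0, 1]
    lang: str     # ISO 639-1 code of the source text


DEFAULT_THRESHOLD = 0.80
# Illustrative values only; real thresholds come from per-language evaluation.
THRESHOLDS: Dict[str, float] = {"de": 0.85, "fr": 0.70, "id": 0.90}


def keep_confident(
    detections: List[Detection],
    thresholds: Dict[str, float] = THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> List[Detection]:
    """Keep detections whose score meets the threshold for their language."""
    return [d for d in detections if d.score >= thresholds.get(d.lang, default)]


candidates = [
    Detection("Rose", "PERSON", 0.75, "fr"),
    Detection("Rose", "PERSON", 0.75, "de"),
    Detection("Budi", "PERSON", 0.88, "id"),
    Detection("Amani", "PERSON", 0.82, "sw"),
]
for d in keep_confident(candidates):
    print(d.lang, d.text, d.score)
```

The same score of 0.75 is accepted for French and rejected for German in this toy data, which is the kind of divergence a single global threshold cannot express.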

For languages not yet in the supported set, or for highly domain-specific entity types (clinical codes, proprietary identifiers), regex-based custom recognizers remain the more reliable option. Multilingual ML should be treated as a complement to pattern rules, not a replacement for them.
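A custom recognizer for a proprietary identifier can be a few lines of regex. Both identifier formats below (`EMP-` plus six digits, and a `CASE/year/code` reference) are invented for the example; a real deployment would substitute its own formats.

```python
import re
from typing import Dict, Pattern

# Hypothetical in-house identifier formats, defined purely for illustration.
CUSTOM_PATTERNS: Dict[str, Pattern[str]] = {
    "EMPLOYEE_ID": re.compile(r"\bEMP-[0-9]{6}\b"),
    "CASE_REF": re.compile(r"\bCASE/[0-9]{4}/[A-Z]{2}[0-9]{3}\b"),
}


def redact_custom(text: str) -> str:
    """Replace every custom-pattern match with a <LABEL> placeholder."""
    for label, pattern in CUSTOM_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_custom("Ticket für EMP-004217, siehe CASE/2024/AB123."))
```

Because the patterns match the identifier itself rather than the words around it, the same rules work in any language the identifier appears in.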
