Two Approaches to the Same Problem
When a DLP tool reads the text "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00", it needs to identify that this contains an email address and a German IBAN. Two fundamentally different approaches exist for doing this:
Probabilistic (ML/LLM)
A trained model assigns a confidence score to each token or span. If the score exceeds a threshold, the span is classified as PII. A German IBAN might score 0.87 — classified. Or 0.73 — below threshold, missed.
Example: Nightfall's 100+ ML models, claimed 95% precision
Deterministic (Regex + Checksum)
A pattern matches the text structure, then a checksum algorithm validates the value mathematically. A German IBAN either satisfies mod-97 validation or it does not. No probability — 100% recall on valid values.
Example: cloak.business — 317 regex recognizers with checksum validation
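The mod-97 step is simple enough to sketch in a few lines. This is a minimal illustration of the ISO 13616 check (rearrange, map letters to numbers, take the remainder), not cloak.business's production code; the IBAN is the example value used throughout this article:

```python
def is_valid_iban(iban: str) -> bool:
    """Validate an IBAN via the ISO 13616 mod-97 check."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not s.isalnum():
        return False
    # Move the country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35).
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

print(is_valid_iban("DE89 3704 0044 0532 0130 00"))  # → True
print(is_valid_iban("DE89 3704 0044 0532 0130 01"))  # → False
```

Note that flipping a single digit makes the check fail: there is no threshold to tune and no confidence score to interpret.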
This is not a contest with a clear winner — each approach has distinct strengths. The key is understanding which data types benefit from deterministic detection and why this matters for compliance.
Where ML Detection Excels
Transformer-based ML models genuinely outperform regex for unstructured, context-dependent PII:
- Person names in context: "John from accounting". A regex cannot tell that "John" is a name at all; an ML model infers it from the surrounding words
- Addresses without a fixed format — Street addresses vary enormously by country and style
- Implicit PII — "the patient described above" in a healthcare context
- Freeform description of sensitive topics — Confidential project names, undisclosed business information
- Document classification — Categorizing a document as "employment contract" or "medical record"
For these use cases, ML is the right tool. cloak.business uses NLP models (spaCy, Stanza) and XLM-RoBERTa for exactly this category of detection alongside regex.
Where ML Detection Fails — Structured PII
The most dangerous detection gap is in structured PII — identifiers defined by national standards with specific formats and checksum algorithms. These are also the identifiers most tightly regulated by GDPR, HIPAA, and PCI-DSS:
- IBANs — International Bank Account Numbers (27 country formats, mod-97 checksum)
- Government IDs — German Personalausweis, French INSEE number, Austrian SSN, Dutch BSN
- Tax identifiers — German Steuer-IdNr, French SIRET, Italian Codice Fiscale (checksum-validated)
- Healthcare IDs — NHS numbers (modulus-11), Belgian NISS, Nordic personal numbers
- APAC national IDs — Japanese My Number, Korean RRN, Chinese Resident ID, Indian Aadhaar
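To show how mechanical these validations are, here is a sketch of the NHS modulus-11 check mentioned above: weight the first nine digits 10 down to 2, sum, and derive the check digit. The value used below is one that passes the check, chosen for illustration:

```python
def is_valid_nhs_number(value: str) -> bool:
    """Validate a 10-digit NHS number with the modulus-11 check."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # 10 is never a valid check digit
        return False
    return check == int(digits[9])

print(is_valid_nhs_number("943 476 5919"))  # → True
```

Each national identifier has its own variant of this pattern, which is why per-country recognizers are needed rather than one generic rule.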
Why does ML struggle here? These identifiers are rare in training data (especially non-English ones), they appear without surrounding semantic context that would help a language model, and they require mathematical validation that transformers are not designed to perform.
A German IBAN appearing as a bare string in a technical log, without the word "IBAN" nearby, will likely be missed by an ML classifier. A regex with mod-97 checksum validation will catch it every time, and the checksum also filters out digit strings that merely resemble an IBAN, keeping false positives near zero.
Detection Accuracy by Identifier Type
| Entity Type | Example | Regex (deterministic) | ML (probabilistic) |
|---|---|---|---|
| German IBAN | DE89 3704 0044 0532 0130 00 | Checksum algorithm (mod 97 on rearranged number) | May or may not detect, depends on training data |
| UK National Insurance | AB 12 34 56 C | Pattern + valid prefix/suffix validation | Probabilistic, high false-negative risk |
| French SIRET | 73282932000074 | 14-digit Luhn-variant validation | Probabilistic, semantic context required |
| Swiss AHV number | 756.1234.5678.97 | Check-digit validation (EAN-13 algorithm) | Likely missed, rare in training data |
| Korean RRN | YYMMDD-NNNNNNN | Date prefix + gender digit + checksum | Missed by English-centric models |
| Credit card (Visa) | 4532015112830366 | Luhn algorithm, deterministic | Well represented in training data, high accuracy |
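The Luhn check from the credit-card row is equally mechanical. A minimal sketch, using the Visa test number shown above:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any result above 9, and require sum % 10 == 0."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4532015112830366"))  # → True
```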
The Hybrid Approach: 317 Regex + NLP + XLM-RoBERTa
cloak.business combines both detection paradigms into a single pipeline:
Layer 1: 317 Regex Recognizers
Deterministic pattern matching with checksum validation. 211 country-specific + 49 secrets + 20 infrastructure + 39 global recognizers. 100% recall for valid structured identifiers.
Layer 2: NLP (spaCy + Stanza)
Named Entity Recognition for person names, organizations, locations, and dates in context. Language-specific models across 48 locales.
Layer 3: XLM-RoBERTa
Multilingual transformer model for cross-lingual entity detection. Handles non-Latin scripts (Arabic, Hebrew, Chinese, Japanese, Korean) where regex alone cannot identify names.
When both regex and NLP agree on a detection, confidence scores are combined and boosted. When only one layer fires, the score reflects the uncertainty. This prevents both false positives (random number strings that happen to pass a weak regex) and false negatives (valid IBANs without surrounding context).
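One way to express that agreement logic is a simple score-combination rule. The boost value and the combination strategy below are illustrative assumptions for the sketch, not cloak.business's actual parameters:

```python
from typing import Optional

def combine_scores(regex_score: Optional[float],
                   nlp_score: Optional[float],
                   boost: float = 0.15) -> float:
    """Combine per-layer confidences (illustrative rule only).

    Both layers fired  -> take the max and boost it, capped at 1.0.
    One layer fired    -> pass its score through, reflecting uncertainty.
    Neither fired      -> no detection.
    """
    scores = [s for s in (regex_score, nlp_score) if s is not None]
    if not scores:
        return 0.0
    if len(scores) == 2:
        return min(1.0, max(scores) + boost)
    return scores[0]

print(combine_scores(0.95, 0.80))  # both layers agree → boosted
print(combine_scores(None, 0.73))  # NLP only → score passes through
```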
The Compliance Implication
GDPR, HIPAA, and PCI-DSS focus precisely on the structured identifiers where ML detection is weakest:
- GDPR Special Category Data — national ID numbers, health identifiers — all structured
- PCI-DSS Primary Account Numbers — credit/debit card numbers with Luhn validation
- HIPAA Direct Identifiers — SSN, DEA numbers, NPI numbers — all follow strict formats
- Banking Regulation (EBA) — IBAN, BIC, account numbers — mathematically validated
A 95% ML accuracy rate sounds impressive — until you consider that the 5% of missed detections are concentrated in the exact data categories your compliance program is trying to protect. Deterministic regex turns that 95% into 100% for structured PII.
Limitations: When Pure Regex Is Not Ideal
Regex-first detection has a clear limitation: it requires the PII to have a predictable format. Freeform text entities — personal names, organization names, informal descriptions of locations — do not match fixed patterns. For documents heavy in narrative text (legal briefs, medical notes, customer correspondence), regex alone will have low recall for contextual entities and must be combined with NLP layers.
The drawback of the hybrid approach is latency: combining three detection layers (regex, NLP, transformer) adds processing overhead compared to regex-only. For high-throughput pipelines requiring sub-50ms latency, a regex-only preset targeting only structured identifiers may be a better fit than the full hybrid stack.
Best For: Compliance-regulated pipelines where structured PII (IBAN, SSN, passport numbers) must be detected with 100% recall. Not ideal for sub-50ms latency requirements or pure narrative text corpora with no structured identifiers.
Related Posts
Why 317 Pattern Recognizers Beat 30
Microsoft Presidio ships with ~30 recognizers. cloak.business uses 317 for IBANs, national IDs, and 70+ countries. Why it matters for AI pipelines.
How to Detect PII in Documents: A Complete Guide
How to detect PII in documents using regex, NLP, and ML. Includes code examples for pre-processing before OpenAI API calls. GDPR-compliant approaches.