Two Approaches to the Same Problem
When a DLP tool reads the text "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00", it needs to identify that this contains an email address and a German IBAN. Two fundamentally different approaches exist for doing this:
Probabilistic (ML/LLM)
A trained model assigns a confidence score to each token or span. If the score exceeds a threshold, the span is classified as PII. A German IBAN might score 0.87 — classified. Or 0.73 — below threshold, missed.
Example: Nightfall's 100+ ML models, claimed 95% precision
Deterministic (Regex + Checksum)
A pattern matches the text structure, then a checksum algorithm validates the value mathematically. A German IBAN either satisfies mod-97 validation or it does not. No probability — 100% recall on valid values.
Example: cloak.business — 317 regex recognizers with checksum validation
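The mod-97 step is simple enough to sketch in a few lines. This is a minimal illustration of the ISO 13616 check (rearrange, map letters to numbers, take the remainder), not cloak.business's production code; the IBAN is the example value used throughout this article:

```python
def is_valid_iban(iban: str) -> bool:
    """Validate an IBAN via the ISO 13616 mod-97 check."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not s.isalnum():
        return False
    # Move the country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35).
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

print(is_valid_iban("DE89 3704 0044 0532 0130 00"))  # → True
print(is_valid_iban("DE89 3704 0044 0532 0130 01"))  # → False
```

Note that flipping a single digit makes the check fail: there is no threshold to tune and no confidence score to interpret.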
This is not a contest with a clear winner — each approach has distinct strengths. The key is understanding which data types benefit from deterministic detection and why this matters for compliance.
Where ML Detection Excels
Transformer-based ML models genuinely outperform regex for unstructured, context-dependent PII:
- Person names in context: "John from accounting". A regex cannot tell that "John" is a name at all; an ML model infers it from the surrounding words
- Addresses without a fixed format — Street addresses vary enormously by country and style
- Implicit PII — "the patient described above" in a healthcare context
- Freeform description of sensitive topics — Confidential project names, undisclosed business information
- Document classification — Categorizing a document as "employment contract" or "medical record"
For these use cases, ML is the right tool. cloak.business uses NLP models (spaCy, Stanza) and XLM-RoBERTa for exactly this category of detection alongside regex.
Where ML Detection Fails — Structured PII
The most dangerous detection gap is in structured PII — identifiers defined by national standards with specific formats and checksum algorithms. These are also the identifiers most tightly regulated by GDPR, HIPAA, and PCI-DSS:
- IBANs — International Bank Account Numbers (27 country formats, mod-97 checksum)
- Government IDs — German Personalausweis, French INSEE number, Austrian SSN, Dutch BSN
- Tax identifiers — German Steuer-IdNr, French SIRET, Italian Codice Fiscale (checksum-validated)
- Healthcare IDs — NHS numbers (modulus-11), Belgian NISS, Nordic personal numbers
- APAC national IDs — Japanese My Number, Korean RRN, Chinese Resident ID, Indian Aadhaar
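To show how mechanical these validations are, here is a sketch of the NHS modulus-11 check mentioned above: weight the first nine digits 10 down to 2, sum, and derive the check digit. The value used below is one that passes the check, chosen for illustration:

```python
def is_valid_nhs_number(value: str) -> bool:
    """Validate a 10-digit NHS number with the modulus-11 check."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # 10 is never a valid check digit
        return False
    return check == int(digits[9])

print(is_valid_nhs_number("943 476 5919"))  # → True
```

Each national identifier has its own variant of this pattern, which is why per-country recognizers are needed rather than one generic rule.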
Why does ML struggle here? These identifiers are rare in training data (especially non-English ones), they appear without surrounding semantic context that would help a language model, and they require mathematical validation that transformers are not designed to perform.
A German IBAN appearing as a bare string in a technical log, without the word "IBAN" nearby, will likely be missed by an ML classifier. A regex with mod-97 checksum validation will catch it every time, and the checksum also filters out digit strings that merely resemble an IBAN, keeping false positives near zero.
Detection Accuracy by Identifier Type
| Entity Type | Example | Regex (deterministic) | ML (probabilistic) |
|---|---|---|---|
| German IBAN | DE89 3704 0044 0532 0130 00 | Checksum algorithm (mod 97 on rearranged number) | May or may not detect, depends on training data |
| UK National Insurance | AB 12 34 56 C | Pattern + valid prefix/suffix validation | Probabilistic, high false-negative risk |
| French SIRET | 73282932000074 | 14-digit Luhn-variant validation | Probabilistic, semantic context required |
| Swiss AHV number | 756.1234.5678.97 | Check-digit validation (EAN-13 algorithm) | Likely missed, rare in training data |
| Korean RRN | YYMMDD-NNNNNNN | Date prefix + gender digit + checksum | Missed by English-centric models |
| Credit card (Visa) | 4532015112830366 | Luhn algorithm, deterministic | Well represented in training data, high accuracy |
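The Luhn check from the credit-card row is equally mechanical. A minimal sketch, using the Visa test number shown above:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from any result above 9, and require sum % 10 == 0."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4532015112830366"))  # → True
```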
The Hybrid Approach: 317 Regex + NLP + XLM-RoBERTa
cloak.business combines both detection paradigms into a single pipeline:
Layer 1: 317 Regex Recognizers
Deterministic pattern matching with checksum validation. 211 country-specific + 49 secrets + 20 infrastructure + 39 global recognizers. 100% recall for valid structured identifiers.
Layer 2: NLP (spaCy + Stanza)
Named Entity Recognition for person names, organizations, locations, and dates in context. Language-specific models across 48 locales.
Layer 3: XLM-RoBERTa
Multilingual transformer model for cross-lingual entity detection. Handles non-Latin scripts (Arabic, Hebrew, Chinese, Japanese, Korean) where regex alone cannot identify names.
When both regex and NLP agree on a detection, confidence scores are combined and boosted. When only one layer fires, the score reflects the uncertainty. This prevents both false positives (random number strings that happen to pass a weak regex) and false negatives (valid IBANs without surrounding context).
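One way to express that agreement logic is a simple score-combination rule. The boost value and the combination strategy below are illustrative assumptions for the sketch, not cloak.business's actual parameters:

```python
from typing import Optional

def combine_scores(regex_score: Optional[float],
                   nlp_score: Optional[float],
                   boost: float = 0.15) -> float:
    """Combine per-layer confidences (illustrative rule only).

    Both layers fired  -> take the max and boost it, capped at 1.0.
    One layer fired    -> pass its score through, reflecting uncertainty.
    Neither fired      -> no detection.
    """
    scores = [s for s in (regex_score, nlp_score) if s is not None]
    if not scores:
        return 0.0
    if len(scores) == 2:
        return min(1.0, max(scores) + boost)
    return scores[0]

print(combine_scores(0.95, 0.80))  # both layers agree → boosted
print(combine_scores(None, 0.73))  # NLP only → score passes through
```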
The Compliance Implication
GDPR, HIPAA, and PCI-DSS focus precisely on the structured identifiers where ML detection is weakest:
- GDPR Special Category Data — national ID numbers, health identifiers — all structured
- PCI-DSS Primary Account Numbers — credit/debit card numbers with Luhn validation
- HIPAA Direct Identifiers — SSN, DEA numbers, NPI numbers — all follow strict formats
- Banking Regulation (EBA) — IBAN, BIC, account numbers — mathematically validated
A 95% ML accuracy rate sounds impressive — until you consider that the 5% of missed detections are concentrated in the exact data categories your compliance program is trying to protect. Deterministic regex turns that 95% into 100% for structured PII.
Limitations: When Pure Regex Is Not Ideal
Regex-first detection has a clear limitation: it requires the PII to have a predictable format. Freeform text entities — personal names, organization names, informal descriptions of locations — do not match fixed patterns. For documents heavy in narrative text (legal briefs, medical notes, customer correspondence), regex alone will have low recall for contextual entities and must be combined with NLP layers.
The drawback of the hybrid approach is latency: combining three detection layers (regex, NLP, transformer) adds processing overhead compared to regex-only. For high-throughput pipelines requiring sub-50ms latency, a regex-only preset targeting only structured identifiers may be a better fit than the full hybrid stack.
Best For: Compliance-regulated pipelines where structured PII (IBAN, SSN, passport numbers) must be detected with 100% recall. Not ideal for sub-50ms latency requirements or pure narrative text corpora with no structured identifiers.
Related Posts
Why 317 Pattern Recognizers Beat 30
Microsoft Presidio ships with ~30 recognizers. cloak.business uses 317 for IBANs, national IDs, and 70+ countries. Why it matters for AI pipelines.
How to Detect PII in Documents: A Complete Guide
How to detect PII in documents using regex, NLP, and ML. Includes code examples for pre-processing before OpenAI API calls. GDPR-compliant approaches.