Two Approaches to the Same Problem
When a DLP tool reads the text "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00", it needs to identify that this contains an email address and a German IBAN. Two fundamentally different approaches exist for doing this:
Probabilistic (ML/LLM)
A trained model assigns a confidence score to each token or span. If the score exceeds a threshold, the span is classified as PII. A German IBAN might score 0.87 — classified. Or 0.73 — below threshold, missed.
Example: Nightfall's 100+ ML models, with a claimed 95% precision
Deterministic (Regex + Checksum)
A pattern matches the text structure, then a checksum algorithm validates the value mathematically. A German IBAN either satisfies mod-97 validation or it does not. No probability — 100% recall on valid values.
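As a minimal sketch of this two-step pipeline, the following matches IBAN-shaped candidates and then applies the ISO 7064 mod-97 check. The regex here is a deliberately simplified illustration, not a production pattern, and the sample text is the sentence from the introduction:

```python
import re

# Simplified IBAN candidate pattern (illustrative only): two letters,
# two check digits, then space-grouped alphanumeric blocks.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{1,4}){3,8}\b")

def iban_mod97_valid(candidate: str) -> bool:
    """ISO 7064 mod-97: rearrange, convert letters to numbers, remainder must be 1."""
    s = candidate.replace(" ", "").upper()
    rearranged = s[4:] + s[:4]  # move country code + check digits to the end
    numeric = "".join(str(int(c, 36)) for c in rearranged)  # A=10 ... Z=35
    return int(numeric) % 97 == 1

text = "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00"
hits = [m.group() for m in IBAN_CANDIDATE.finditer(text)
        if iban_mod97_valid(m.group())]
```

A candidate that matches the pattern but fails the checksum is rejected, so a transposed or mistyped digit never produces a detection.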
Example: cloak.business — 317 regex recognizers with checksum validation
This is not a contest with a clear winner — each approach has distinct strengths. The key is understanding which data types benefit from deterministic detection and why this matters for compliance.
Where ML Detection Excels
Transformer-based ML models genuinely outperform regex for unstructured, context-dependent PII:
- Person names in context — "John from accounting" — regex cannot know "John" is a name without context
- Addresses without a fixed format — Street addresses vary enormously by country and style
- Implicit PII — "the patient described above" in a healthcare context
- Freeform description of sensitive topics — Confidential project names, undisclosed business information
- Document classification — Categorizing a document as "employment contract" or "medical record"
For these use cases, ML is the right tool, and cloak.business uses NLP models (spaCy, Stanza) and XLM-RoBERTa alongside regex for exactly this category of detection.
Where ML Detection Fails — Structured PII
The most dangerous detection gap is in structured PII — identifiers defined by national standards with specific formats and checksum algorithms. These are also the identifiers most tightly regulated by GDPR, HIPAA, and PCI-DSS:
- IBANs — International Bank Account Numbers (27 country formats, mod-97 checksum)
- Government IDs — German Personalausweis, French INSEE number, Austrian SSN, Dutch BSN
- Tax identifiers — German Steuer-IdNr, French SIRET, Italian Codice Fiscale (checksum-validated)
- Healthcare IDs — NHS numbers (modulus-11), Belgian NISS, Nordic personal numbers
- APAC national IDs — Japanese My Number, Korean RRN, Chinese Resident ID, Indian Aadhaar
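To make the checksum point concrete, here is a sketch of the NHS number's modulus-11 check mentioned above. The value 943 476 5919 is a widely used valid test number, not a real patient identifier:

```python
def nhs_number_valid(nhs: str) -> bool:
    """Modulus-11 check for 10-digit NHS numbers."""
    digits = [int(d) for d in nhs if d.isdigit()]
    if len(digits) != 10:
        return False
    # Weight the first nine digits 10, 9, ..., 2 and sum.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # a remainder of 1 yields no valid check digit
        return False
    return check == digits[9]
```

A transformer has no mechanism for this arithmetic; to a language model, a valid and an invalid 10-digit string look identical.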
Why does ML struggle here? These identifiers are rare in training data (especially non-English ones), they appear without surrounding semantic context that would help a language model, and they require mathematical validation that transformers are not designed to perform.
A German IBAN appearing as a bare string in a technical log — without the word "IBAN" nearby — will likely be missed by an ML classifier. A regex with mod-97 checksum validation will catch every valid value, and the checksum also filters out nearly all coincidental matches, since a random digit string passes mod-97 only about one time in 97.
Detection Accuracy by Identifier Type
| Entity Type | Example | Regex (deterministic) | ML (probabilistic) |
|---|---|---|---|
| German IBAN | DE89 3704 0044 0532 0130 00 | Checksum algorithm (mod-97 on rearranged number) | May or may not detect — depends on training data |
| UK National Insurance | AB 12 34 56 C | Pattern + valid prefix/suffix validation | Probabilistic — high false-negative risk |
| French SIRET | 73282932000074 | 14-digit Luhn-variant validation | Probabilistic — semantic context required |
| Swiss AHV number | 756.1234.5678.97 | Check-digit validation (EAN-13 algorithm) | Likely missed — rare in training data |
| Korean RRN | YYMMDD-NNNNNNN | Date prefix + gender digit + checksum | Missed by English-centric models |
| Credit card (Visa) | 4532015112830366 | Luhn algorithm — deterministic, 100% | Well-trained — high accuracy |
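The Luhn check behind the SIRET and credit card rows fits in a few lines. The Visa number below is a standard test value from the table:

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, subtract 9 if > 9."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Every row in the table reduces to a validation routine of roughly this size — which is why deterministic detection of structured identifiers is cheap as well as exact.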
The Hybrid Approach: 317 Regex + NLP + XLM-RoBERTa
cloak.business combines both detection paradigms into a single pipeline:
Layer 1: 317 Regex Recognizers
Deterministic pattern matching with checksum validation. 211 country-specific + 49 secrets + 20 infrastructure + 39 global recognizers. 100% recall for valid structured identifiers.
Layer 2: NLP (spaCy + Stanza)
Named Entity Recognition for person names, organizations, locations, and dates in context. Language-specific models across 48 locales.
Layer 3: XLM-RoBERTa
Multilingual transformer model for cross-lingual entity detection. Handles non-Latin scripts (Arabic, Hebrew, Chinese, Japanese, Korean) where regex alone cannot identify names.
When both regex and NLP agree on a detection, confidence scores are combined and boosted. When only one layer fires, the score reflects the uncertainty. This prevents both false positives (random number strings that happen to pass a weak regex) and false negatives (valid IBANs without surrounding context).
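The agreement-boosting logic might be sketched as follows. This is a hypothetical illustration — the `Detection` type, the flat 0.1 boost, and the function names are assumptions, not cloak.business's actual API:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Detection:
    start: int
    end: int
    entity: str
    score: float

def overlaps(a: Detection, b: Detection) -> bool:
    """Same entity type and overlapping character spans."""
    return a.entity == b.entity and a.start < b.end and b.start < a.end

def merge_layers(regex_hits, nlp_hits, boost=0.1):
    """Boost detections both layers agree on; pass the rest through unchanged."""
    merged = []
    for r in regex_hits:
        if any(overlaps(r, n) for n in nlp_hits):
            r = replace(r, score=min(1.0, r.score + boost))
        merged.append(r)
    # NLP-only detections keep their original, lower-confidence score.
    merged.extend(n for n in nlp_hits
                  if not any(overlaps(n, r) for r in regex_hits))
    return merged
```

Disagreement is informative in both directions: a regex hit with no NLP support keeps a lower score, and an NLP hit with no structural match is flagged but not boosted.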
The Compliance Implication
GDPR, HIPAA, and PCI-DSS focus precisely on the structured identifiers where ML detection is weakest:
- GDPR Special Category Data — national ID numbers, health identifiers — all structured
- PCI-DSS Primary Account Numbers — credit/debit card numbers with Luhn validation
- HIPAA Direct Identifiers — SSN, DEA numbers, NPI numbers — all follow strict formats
- Banking Regulation (EBA) — IBAN, BIC, account numbers — mathematically validated
A 95% ML accuracy rate sounds impressive — until you consider that the 5% of missed detections are concentrated in the exact data categories your compliance program is trying to protect. Deterministic regex turns that 95% into 100% for structured PII.