Deterministic vs. Probabilistic PII Detection

Why 317 regex recognizers outperform 100 ML models for structured PII — and when each approach wins.

March 14, 2026 · 7 min read

Two Approaches to the Same Problem

When a DLP tool reads the text "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00", it needs to identify that this contains an email address and a German IBAN. Two fundamentally different approaches exist for doing this:

Probabilistic (ML/LLM)

A trained model assigns a confidence score to each token or span. If the score exceeds a threshold, the span is classified as PII. A German IBAN might score 0.87 — classified. Or 0.73 — below threshold, missed.

Example: Nightfall's 100+ ML models, claimed 95% precision

Deterministic (Regex + Checksum)

A pattern matches the text structure, then a checksum algorithm validates the value mathematically. A German IBAN either satisfies mod-97 validation or it does not. No probability — 100% recall on valid values.

Example: cloak.business — 317 regex recognizers with checksum validation

This is not a contest with a clear winner — each approach has distinct strengths. The key is understanding which data types benefit from deterministic detection and why this matters for compliance.
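The two decision rules can be sketched in a few lines of Python (the function names are illustrative, not from any particular library):

```python
def classify_probabilistic(score: float, threshold: float = 0.8) -> bool:
    """ML-style rule: a span is PII only if the model's confidence clears a threshold."""
    return score >= threshold

def classify_deterministic(pattern_ok: bool, checksum_ok: bool) -> bool:
    """Regex-style rule: a span is PII iff the pattern matches AND the checksum holds."""
    return pattern_ok and checksum_ok

# The same valid German IBAN can land on either side of an ML threshold:
print(classify_probabilistic(0.87))  # True  (classified)
print(classify_probabilistic(0.73))  # False (missed)
```

The deterministic rule has no threshold to tune: a valid identifier is always detected, and an invalid one is always rejected.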

Where ML Detection Excels

Transformer-based ML models genuinely outperform regex for unstructured, context-dependent PII:

  • Person names in context — "John from accounting" — regex cannot know "John" is a name without context
  • Addresses without a fixed format — Street addresses vary enormously by country and style
  • Implicit PII — "the patient described above" in a healthcare context
  • Freeform description of sensitive topics — Confidential project names, undisclosed business information
  • Document classification — Categorizing a document as "employment contract" or "medical record"

For these use cases, ML is the right tool. cloak.business uses NLP models (spaCy, Stanza) and XLM-RoBERTa for exactly this category of detection alongside regex.

Where ML Detection Fails — Structured PII

The most dangerous detection gap is in structured PII — identifiers defined by national standards with specific formats and checksum algorithms. These are also the identifiers most tightly regulated by GDPR, HIPAA, and PCI-DSS:

  • IBANs — International Bank Account Numbers (27 country formats, mod-97 checksum)
  • Government IDs — German Personalausweis, French INSEE number, Austrian SSN, Dutch BSN
  • Tax identifiers — German Steuer-IdNr, French SIRET, Italian Codice Fiscale (checksum-validated)
  • Healthcare IDs — NHS numbers (modulus-11), Belgian NISS, Nordic personal numbers
  • APAC national IDs — Japanese My Number, Korean RRN, Chinese Resident ID, Indian Aadhaar

Why does ML struggle here? These identifiers are rare in training data (especially non-English ones), they appear without surrounding semantic context that would help a language model, and they require mathematical validation that transformers are not designed to perform.

A German IBAN appearing as a bare string in a technical log — without the word "IBAN" nearby — will likely be missed by an ML classifier. A regex with mod-97 checksum validation will catch it every time: 100% recall on valid values, with the checksum screening out random digit strings that merely look like IBANs.
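The mod-97 check itself is only a few lines. A minimal Python sketch of the standard IBAN validation (a real recognizer would also enforce per-country length and structure rules):

```python
def valid_iban(iban: str) -> bool:
    """Deterministic IBAN check (ISO 13616): move the first four characters
    to the end, convert letters to numbers (A=10 .. Z=35), and require that
    the resulting integer is congruent to 1 mod 97."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not s[:2].isalpha() or not s[2:4].isdigit():
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)  # 'D' -> '13', '3' -> '3'
    return int(digits) % 97 == 1

print(valid_iban("DE89 3704 0044 0532 0130 00"))  # True
print(valid_iban("DE89 3704 0044 0532 0130 01"))  # False (one digit off)
```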

Detection Accuracy by Identifier Type

| Entity Type | Example | Regex (deterministic) | ML (probabilistic) |
|---|---|---|---|
| German IBAN | DE89 3704 0044 0532 0130 00 | Checksum algorithm (mod 97 on rearranged number) | May or may not detect — depends on training data |
| UK National Insurance | AB 12 34 56 C | Pattern + valid prefix/suffix validation | Probabilistic — high false-negative risk |
| French SIRET | 73282932000074 | 14-digit Luhn variant validation | Probabilistic — semantic context required |
| Swiss AHV Number | 756.1234.5678.97 | Check digit validation (EAN-13 algorithm) | Likely missed — rare in training data |
| Korean RRN | YYMMDD-NNNNNNN | Date prefix + gender digit + checksum | Missed by English-centric models |
| Credit card (Visa) | 4532015112830366 | Luhn algorithm — deterministic 100% | Well-trained — high accuracy |

The Hybrid Approach: 317 Regex + NLP + XLM-RoBERTa

cloak.business combines both detection paradigms into a single pipeline:

Layer 1: 317 Regex Recognizers

Deterministic pattern matching with checksum validation. 211 country-specific + 49 secrets + 20 infrastructure + 39 global recognizers. 100% recall for valid structured identifiers.

Layer 2: NLP (spaCy + Stanza)

Named Entity Recognition for person names, organizations, locations, and dates in context. Language-specific models across 48 locales.

Layer 3: XLM-RoBERTa

Multilingual transformer model for cross-lingual entity detection. Handles non-Latin scripts (Arabic, Hebrew, Chinese, Japanese, Korean) where regex alone cannot identify names.

When both regex and NLP agree on a detection, confidence scores are combined and boosted. When only one layer fires, the score reflects the uncertainty. This prevents both false positives (random number strings that happen to pass a weak regex) and false negatives (valid IBANs without surrounding context).
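The score fusion described above might look like this sketch (an illustrative assumption, not cloak.business's actual scoring formula):

```python
from typing import Optional

def combine_scores(regex_score: Optional[float],
                   nlp_score: Optional[float],
                   boost: float = 0.1) -> float:
    """Illustrative fusion rule (an assumption, not the real implementation):
    when both layers fire, take the higher score and boost it; when only one
    fires, pass its score through unchanged, preserving the uncertainty."""
    fired = [s for s in (regex_score, nlp_score) if s is not None]
    if not fired:
        return 0.0
    if len(fired) == 2:
        return min(1.0, max(fired) + boost)  # agreement -> boosted confidence
    return fired[0]
```

Under this rule, a valid IBAN with no surrounding context still surfaces at the regex layer's full score, while a name that both NER and a pattern flag is promoted above either score alone.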

The Compliance Implication

GDPR, HIPAA, and PCI-DSS focus precisely on the structured identifiers where ML detection is weakest:

  • GDPR Special Category Data — national ID numbers, health identifiers — all structured
  • PCI-DSS Primary Account Numbers — credit/debit card numbers with Luhn validation
  • HIPAA Direct Identifiers — SSN, DEA numbers, NPI numbers — all follow strict formats
  • Banking Regulation (EBA) — IBAN, BIC, account numbers — mathematically validated

A 95% ML detection rate sounds impressive — until you consider that the missed 5% is concentrated in the exact data categories your compliance program is trying to protect. Deterministic regex turns that 95% into 100% recall for structured PII.

Limitations: When Pure Regex Is Not Ideal

Regex-first detection has a clear limitation: it requires the PII to have a predictable format. Freeform text entities — personal names, organization names, informal descriptions of locations — do not match fixed patterns. For documents heavy in narrative text (legal briefs, medical notes, customer correspondence), regex alone will have low recall for contextual entities and must be combined with NLP layers.

The drawback of the hybrid approach is latency: combining three detection layers (regex, NLP, transformer) adds processing overhead compared to regex-only. For high-throughput pipelines requiring sub-50ms latency, a regex-only preset targeting only structured identifiers may be a better fit than the full hybrid stack.

Best For: Compliance-regulated pipelines where structured PII (IBAN, SSN, passport numbers) must be detected with 100% recall. Not ideal for sub-50ms latency requirements or pure narrative text corpora with no structured identifiers.


Ready to Protect Your Data?

Start detecting and anonymizing PII in minutes with our free tier.