Deterministic vs. Probabilistic PII Detection

Why 317 regex recognizers outperform 100 ML models for structured PII — and when each approach wins.

March 14, 2026 · 7 min read

Two Approaches to the Same Problem

When a DLP tool reads the text "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00", it needs to identify that this contains an email address and a German IBAN. Two fundamentally different approaches exist for doing this:

Probabilistic (ML/LLM)

A trained model assigns a confidence score to each token or span. If the score exceeds a threshold, the span is classified as PII. A German IBAN might score 0.87 — classified. Or 0.73 — below threshold, missed.

Example: Nightfall's 100+ ML models, claimed 95% precision

Deterministic (Regex + Checksum)

A pattern matches the text structure, then a checksum algorithm validates the value mathematically. A German IBAN either satisfies mod-97 validation or it does not. There is no probability involved: every valid value that matches the pattern is detected.

Example: cloak.business — 317 regex recognizers with checksum validation
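The deterministic path can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not cloak.business's implementation: the names IBAN_CANDIDATE, iban_is_valid, and find_ibans are hypothetical, and the pattern is deliberately loose where a production recognizer would enforce per-country lengths.

```python
import re

# Loose structural pattern (an assumption for illustration): country code,
# two check digits, then 11 to 30 alphanumerics, optionally space-separated.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")

# Shape of an IBAN once spaces are removed.
IBAN_COMPACT = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def iban_is_valid(candidate: str) -> bool:
    """Return True if the candidate passes the mod-97 check."""
    compact = candidate.replace(" ", "").upper()
    if not IBAN_COMPACT.fullmatch(compact):
        return False
    # Move the country code and check digits to the end, map letters to
    # numbers (A=10 ... Z=35), and require a remainder of 1.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_ibans(text: str) -> list[str]:
    """Return substrings that look like an IBAN and pass the checksum."""
    return [
        match.group(0)
        for match in IBAN_CANDIDATE.finditer(text)
        if iban_is_valid(match.group(0))
    ]


sample = "john@acme.com transferred €14,250 from DE89 3704 0044 0532 0130 00"
print(find_ibans(sample))  # ['DE89 3704 0044 0532 0130 00']
```

The pattern alone only proposes candidates; the checksum decides. Because the loose pattern can swallow an uppercase token that directly follows an IBAN, a real recognizer would bound the match by the country's fixed length.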

This is not a contest with a clear winner — each approach has distinct strengths. The key is understanding which data types benefit from deterministic detection and why this matters for compliance.

Where ML Detection Excels

Transformer-based ML models genuinely outperform regex for unstructured, context-dependent PII:

  • Person names in context — "John from accounting" — regex cannot know "John" is a name without context
  • Addresses without a fixed format — Street addresses vary enormously by country and style
  • Implicit PII — "the patient described above" in a healthcare context
  • Freeform description of sensitive topics — Confidential project names, undisclosed business information
  • Document classification — Categorizing a document as "employment contract" or "medical record"

For these use cases, ML is the right tool. cloak.business uses NLP models (spaCy, Stanza) and XLM-RoBERTa for exactly this category of detection alongside regex.

Where ML Detection Fails — Structured PII

The most dangerous detection gap is in structured PII — identifiers defined by national standards with specific formats and checksum algorithms. These are also the identifiers most tightly regulated by GDPR, HIPAA, and PCI-DSS:

  • IBANs — International Bank Account Numbers (27 country formats, mod-97 checksum)
  • Government IDs — German Personalausweis, French INSEE number, Austrian SSN, Dutch BSN
  • Tax identifiers — German Steuer-IdNr, French SIRET, Italian Codice Fiscale (checksum-validated)
  • Healthcare IDs — NHS numbers (modulus-11), Belgian NISS, Nordic personal numbers
  • APAC national IDs — Japanese My Number, Korean RRN, Chinese Resident ID, Indian Aadhaar
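As one example of the checksums named in the list above, the NHS number's modulus-11 check can be written as a short function. This is a sketch for illustration; the function name nhs_number_is_valid is hypothetical, and the sample value is a commonly cited test number, not a real patient identifier.

```python
def nhs_number_is_valid(candidate: str) -> bool:
    """Return True if a 10-digit NHS number passes the modulus-11 check."""
    compact = candidate.replace(" ", "").replace("-", "")
    if len(compact) != 10 or not all(ch in "0123456789" for ch in compact):
        return False
    digits = [int(ch) for ch in compact]
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number can never be valid.
    return check != 10 and check == digits[9]


print(nhs_number_is_valid("943 476 5919"))  # True
print(nhs_number_is_valid("943 476 5918"))  # False
```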

Why does ML struggle here? These identifiers are rare in training data (especially non-English ones), they appear without surrounding semantic context that would help a language model, and they require mathematical validation that transformers are not designed to perform.

A German IBAN appearing as a bare string in a technical log, without the word "IBAN" nearby, will likely be missed by an ML classifier. A regex with mod-97 checksum validation will catch it every time, and the checksum rejects most look-alike strings that are not IBANs.

Detection Accuracy by Identifier Type

| Entity type | Example | Regex (deterministic) | ML (probabilistic) |
|---|---|---|---|
| German IBAN | DE89 3704 0044 0532 0130 00 | Checksum algorithm (mod 97 on rearranged number) | May or may not detect; depends on training data |
| UK National Insurance | AB 12 34 56 C | Pattern + valid prefix/suffix validation | Probabilistic; high false-negative risk |
| French SIRET | 73282932000074 | 14-digit Luhn variant validation | Probabilistic; semantic context required |
| Swiss AHV Number | 756.1234.5678.97 | Check digit validation (EAN-13 algorithm) | Likely missed; rare in training data |
| Korean RRN | YYMMDD-NNNNNNN | Date prefix + gender digit + checksum | Missed by English-centric models |
| Credit card (Visa) | 4532015112830366 | Luhn algorithm; 100% deterministic | Well-trained; high accuracy |
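Two of the checksums in this comparison, Luhn and the EAN-13 check digit, can be sketched as follows. This is illustrative code under stated assumptions: the helper names are hypothetical, separators are stripped before validation, and the SIRET check uses plain Luhn without any issuer-specific exceptions.

```python
def _digits(value: str) -> list:
    """Strip common separators; return digits, or [] if anything else remains."""
    compact = value.replace(" ", "").replace(".", "").replace("-", "")
    if not compact or not all(ch in "0123456789" for ch in compact):
        return []
    return [int(ch) for ch in compact]


def luhn_is_valid(value: str) -> bool:
    """Luhn check, as used for card numbers and 14-digit SIRET values."""
    digits = _digits(value)
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the right, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def ean13_is_valid(value: str) -> bool:
    """EAN-13 check digit, as used by the Swiss AHV number."""
    digits = _digits(value)
    if len(digits) != 13:
        return False
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]


print(luhn_is_valid("4532015112830366"))   # True (card example)
print(luhn_is_valid("73282932000074"))     # True (SIRET example)
print(ean13_is_valid("756.1234.5678.97"))  # True (AHV example)
```

Each function returns a plain yes or no for a given string, which is the property that distinguishes this layer from a confidence score.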

The Hybrid Approach: 317 Regex + NLP + XLM-RoBERTa

cloak.business combines both detection paradigms into a single pipeline:

Layer 1: 317 Regex Recognizers

Deterministic pattern matching with checksum validation. 211 country-specific + 49 secrets + 20 infrastructure + 39 global recognizers. 100% recall for valid structured identifiers.

Layer 2: NLP (spaCy + Stanza)

Named Entity Recognition for person names, organizations, locations, and dates in context. Language-specific models across 48 locales.

Layer 3: XLM-RoBERTa

Multilingual transformer model for cross-lingual entity detection. Handles non-Latin scripts (Arabic, Hebrew, Chinese, Japanese, Korean) where regex alone cannot identify names.

When both regex and NLP agree on a detection, confidence scores are combined and boosted. When only one layer fires, the score reflects the uncertainty. This reduces both false positives (random number strings that happen to pass a weak regex) and false negatives (valid IBANs without surrounding context).
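The article does not say how the scores are combined. One common choice is a noisy-OR, sketched below purely as an assumption: the function name combined_score and the formula are illustrative and are not taken from cloak.business.

```python
import math


def combined_score(layer_scores: list) -> float:
    """Combine per-layer confidence scores for one span with a noisy-OR.

    A single score passes through unchanged; agreeing layers yield a
    result higher than either score alone.
    """
    if not layer_scores:
        return 0.0
    if any(not 0.0 <= score <= 1.0 for score in layer_scores):
        raise ValueError("scores must lie in [0, 1]")
    # Probability that every layer is wrong, assuming independent layers.
    all_wrong = math.prod(1.0 - score for score in layer_scores)
    return 1.0 - all_wrong


print(round(combined_score([0.6]), 2))       # 0.6  (one layer fired)
print(round(combined_score([0.6, 0.7]), 2))  # 0.88 (both layers agree)
```

The independence assumption is optimistic when both layers rely on the same surface cues, so a real pipeline might cap or calibrate the boosted score.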

The Compliance Implication

GDPR, HIPAA, and PCI-DSS focus precisely on the structured identifiers where ML detection is weakest:

  • GDPR-regulated identifiers: national ID numbers (Art. 87) and health identifiers, all structured
  • PCI-DSS Primary Account Numbers: credit/debit card numbers with Luhn validation
  • HIPAA Direct Identifiers: SSN, DEA numbers, NPI numbers, all following strict formats
  • Banking Regulation (EBA): IBAN, BIC, account numbers, mathematically validated

A 95% ML accuracy rate sounds impressive until you consider where the errors fall: the missed detections are concentrated in the exact data categories your compliance program is trying to protect. For structured PII, deterministic regex with checksum validation closes that gap, because every valid identifier in a supported format is detected.
