Why 317 Pattern Recognizers Beat 30

The accuracy gap between generic and specialized PII detection tools.

February 1, 2026 | 8 min read | Microsoft Presidio

The Problem with Generic PII Tools

Microsoft Presidio is a solid open-source foundation for PII detection. We use it as our base. But out-of-the-box Presidio ships with approximately 30 built-in recognizers focused primarily on US formats.

When your documents contain German tax IDs, Japanese My Numbers, or Swiss AHV numbers, the stock recognizers have no pattern for them, so these identifiers go undetected.

The Numbers

Capability | Microsoft Presidio | cloak.business
Pattern recognizers | ~30 | 317
Countries covered | Primarily US | 70+
Entity types | ~20 | ~320
Languages (NER) | English by default | 48
Regional tax IDs | Few | 100+

What is Missing from Default Presidio

DACH Region (Germany, Austria, Switzerland)

Identifier | Presidio | cloak.business
German Steuer-ID | Not included | Checksum validated
German Steuernummer | Not included | All 16 Bundesland formats
German Personalausweis | Not included | Full pattern
Austrian SVN | Not included | 10-digit format
Swiss AHV-Nummer | Not included | 13-digit format

APAC Region

Identifier | Presidio | cloak.business
Japanese My Number | Not included | 12-digit checksum
Korean RRN | Not included | 13-digit format
Chinese Resident ID | Not included | 18-digit with region
Singapore NRIC | Not included | Letter + digits + checksum

The Consequence: Significant Accuracy Gaps

Research on multilingual PII detection found that hybrid approaches combining regex, NLP, and transformers outperform fine-tuned NER models by 82%.

For organizations operating in Germany, Japan, or other non-English markets, this is not a minor inconvenience; it is a compliance failure.

Why More Recognizers Matter

Pattern Diversity

The German Steuernummer alone has 16 different formats, one per Bundesland. Each requires its own pattern.
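To make the idea of per-state patterns concrete, here is a minimal sketch using only Python's `re` module. The three layouts shown (labelled Bavaria, Berlin, and North Rhine-Westphalia) are assumptions made for illustration and should be checked against official format descriptions; `STEUERNUMMER_PATTERNS` and `find_steuernummer` are hypothetical names, not part of Presidio or of our product.

```python
import re

# Illustrative layouts only (assumed for this sketch, not an authoritative list).
STEUERNUMMER_PATTERNS = {
    "3/3/5 (e.g. Bavaria)": r"\d{3}/\d{3}/\d{5}",
    "2/3/5 (e.g. Berlin)": r"\d{2}/\d{3}/\d{5}",
    "3/4/4 (e.g. North Rhine-Westphalia)": r"\d{3}/\d{4}/\d{4}",
}

# Guard each pattern so it cannot match inside a longer digit/slash run.
_COMPILED = {
    label: re.compile(rf"(?<![\d/]){body}(?![\d/])")
    for label, body in STEUERNUMMER_PATTERNS.items()
}


def find_steuernummer(text):
    """Return (layout_label, matched_text, start) for every layout that matches."""
    hits = []
    for label, pattern in _COMPILED.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


hits = find_steuernummer("Steuernummer: 181/815/08155")
```

The lookaround guards matter here: without them, the shorter 2/3/5 layout would also match the tail of a 3/3/5 number and produce a duplicate hit.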

Checksum Validation

Pattern matching without checksum validation produces false positives. Our recognizers include validation where applicable.
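As a sketch of how validation filters regex hits, the example below checks Swiss AHV numbers, whose last digit is an EAN-13 check digit. It uses only the standard library; `AHV_PATTERN`, `ean13_check_digit`, and `find_valid_ahv` are hypothetical names chosen for this illustration, and the number 756.9217.0769.85 is a commonly published sample value, not real data.

```python
import re

# 13 digits starting with the Swiss country prefix 756, dots optional.
AHV_PATTERN = re.compile(r"(?<!\d)756\.?\d{4}\.?\d{4}\.?\d{2}(?!\d)")


def ean13_check_digit(first_twelve):
    """Compute the EAN-13 check digit: weights 1, 3, 1, 3, ... from the left."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first_twelve))
    return (10 - total % 10) % 10


def find_valid_ahv(text):
    """Return only the pattern matches whose final digit passes the checksum."""
    valid = []
    for match in AHV_PATTERN.finditer(text):
        digits = match.group().replace(".", "")
        if ean13_check_digit(digits[:12]) == int(digits[12]):
            valid.append(match.group())
    return valid


sample = "AHV 756.9217.0769.85, typo 756.9217.0769.84"
valid = find_valid_ahv(sample)  # only the first number passes the checksum
```

Both numbers match the regex, but only one survives validation, which is the difference between a format match and a plausible identifier.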

Context Boosting

Detection confidence improves with context words. Our recognizers include context words in multiple languages.
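A minimal sketch of context boosting, assuming a simple substring check in a fixed character window around the match. The word lists, the window size, and the boost value are illustrative assumptions, and `boosted_score` is a hypothetical helper rather than Presidio's own context enhancer.

```python
import re

# Hypothetical context word lists, keyed by language.
CONTEXT_WORDS = {
    "en": ["tax id", "tax number"],
    "de": ["steuer-id", "steueridentifikationsnummer", "steuernummer"],
}


def boosted_score(text, start, end, base_score, context_words, window=40, boost=0.35):
    """Raise the score if any context word appears within `window` characters."""
    nearby = text[max(0, start - window): end + window].lower()
    if any(word in nearby for word in context_words):
        return min(1.0, base_score + boost)
    return base_score


text = "Meine Steuer-ID lautet 86095742719."
match = re.search(r"\d{11}", text)

# English-only context words leave the German sentence at its base score.
english_only = boosted_score(text, match.start(), match.end(), 0.4, CONTEXT_WORDS["en"])

# Adding German context words raises the score for the same match.
multilingual = boosted_score(
    text, match.start(), match.end(), 0.4, CONTEXT_WORDS["en"] + CONTEXT_WORDS["de"]
)
```

With the English list alone the score stays at its base value; only the multilingual list recognizes "Steuer-ID" as supporting context.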

The Hybrid Approach

Pattern recognizers alone are not enough. We combine three detection methods:

Method | Strength
Regex patterns | Structured identifiers (SSN, tax IDs)
NLP NER | Unstructured entities (names, locations)
Transformer models | Context-dependent detection
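One way to combine the three methods is to pool their detections and resolve overlapping spans by score. The sketch below assumes each detector already returns character spans with a confidence score; the `Detection` type and `merge_detections` function are hypothetical and stand in for whatever merging logic a real pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    start: int
    end: int
    entity: str
    score: float
    source: str  # "regex", "ner", or "transformer"


def merge_detections(*detector_outputs):
    """Pool detections from all detectors; on overlap keep the higher score."""
    candidates = sorted(
        (d for output in detector_outputs for d in output),
        key=lambda d: (-d.score, d.start),
    )
    kept = []
    for cand in candidates:
        overlaps = any(cand.start < k.end and k.start < cand.end for k in kept)
        if not overlaps:
            kept.append(cand)
    return sorted(kept, key=lambda d: d.start)


regex_hits = [Detection(23, 34, "DE_STEUER_ID", 0.9, "regex")]
ner_hits = [
    Detection(0, 10, "PERSON", 0.85, "ner"),
    Detection(23, 34, "PHONE_NUMBER", 0.4, "ner"),
]
merged = merge_detections(regex_hits, ner_hits)
```

Here the checksum-backed regex hit wins over the lower-scoring NER guess for the same span, while the NER-only name detection is kept.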

Key Takeaways

  • Generic tools fail outside their training data: hybrid approaches outperform fine-tuned NER models by 82% in the research cited above
  • Custom development is expensive: 1,000-2,000 hours to build comprehensive coverage
  • Checksum validation prevents false positives: pattern matching alone is not enough
  • Context words must be multilingual: English context does not boost German detections
  • Hybrid detection outperforms single methods: combine regex, NER, and transformers

When to Use Stock Presidio vs. Extended Recognizers

Stock Presidio (approximately 30 recognizers) is the right choice for US-only, English-language datasets where speed and simplicity outweigh recall. Extended recognizers with checksum validation are necessary when your data includes international documents (IBAN, Aadhaar, Swiss AHV, Brazilian CPF, or any EU national ID) where a false negative carries regulatory risk under GDPR Art. 83 or equivalent. Missed detections scale with dataset size. The 82% figure cited above is a relative improvement, not a miss rate, so it cannot be converted directly into a count of missed fields. As an illustration only: a detector that misses 18% of PII fields leaves 180,000 of every one million PII fields unredacted.

Limitations: When Extended Recognizers Are Not the Best Choice

Extended recognizers are not ideal for every use case. The trade-off is maintenance and precision: 317 custom recognizers covering 70+ countries mean more patterns to maintain and more opportunities for false positives if score thresholds are set too low. For US-only, English-language datasets where the stock Presidio recognizers (approximately 30) cover all required entities, the additional complexity is unnecessary overhead.

The drawback of checksum validation is occasional false negatives on edge cases: identifiers that match the expected format but fail the checksum, which are common in test data and legacy imports. If your pipeline processes synthetic or test-generated data, validation may reject valid-looking identifiers. Extended recognizers are best for real-world production data requiring international compliance coverage. They are not ideal for synthetic data pipelines, US-only datasets, or latency-critical pipelines where regex-only detection suffices.

Ready to Protect Your Data?

Start detecting and anonymizing PII in minutes with our free tier.