How Regex-First PII Detection Works

Regex-first PII detection combines 317 deterministic pattern recognizers for structured data (IDs, tax numbers, credit cards) with spaCy, Stanza, and XLM-RoBERTa NLP models for names and locations across 48 languages.


Regex-First: Why It Matters

Our Approach: Regex + NLP

317 regex recognizers: 100% reproducible for structured data
NLP for names & locations with confidence scores
Fully auditable — every detection traceable to a pattern or model
Transparent: you always know what matched and why
Fast, predictable performance
48 languages across 3 NLP engines

AI-Only Approaches

All detections are probabilistic
Can't explain why something was flagged
Requires large training datasets
Difficult to audit for compliance
Higher compute costs (GPU needed)
Model drift degrades accuracy over time

The 10-Step Process

From input to output, here's exactly what happens to your document

Input Text

Submit your document via web interface, API, or Office Add-in

Language Detection

System identifies the document language for optimal processing

Tokenization

Text is broken into tokens for pattern matching

Pattern Matching

317 regex recognizers and NLP models scan for 317 entity types across 70+ countries
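
To illustrate how a pattern match can stay traceable to the recognizer that produced it, here is a minimal sketch. The recognizer names, the two patterns, and the `Detection` fields are assumptions made for this example, not the product's actual rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str  # category of the match, e.g. "EMAIL_ADDRESS"
    recognizer: str   # name of the pattern that fired, kept for auditing
    start: int
    end: int
    text: str


# Hypothetical recognizers: illustrative patterns only.
RECOGNIZERS = {
    "email_basic": ("EMAIL_ADDRESS", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    "iban_de": ("IBAN", re.compile(r"\bDE\d{20}\b")),
}


def scan(text: str) -> list[Detection]:
    """Run every recognizer and record which one produced each match."""
    found = []
    for name, (entity_type, pattern) in RECOGNIZERS.items():
        for m in pattern.finditer(text):
            found.append(Detection(entity_type, name, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda d: d.start)
```

Because the patterns are fixed, calling `scan` twice on the same text returns the same list, and each entry names the pattern responsible for it.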

Context Analysis

Surrounding text improves detection accuracy

Confidence Scoring

Each detection receives a confidence score (0.0–1.0) enabling human-in-the-loop review decisions
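
One way context analysis and confidence scoring could interact is sketched below: a pattern match starts at a base score, and the score rises when a related keyword appears shortly before it. The pattern, keywords, window size, and score values are all assumptions chosen for illustration.

```python
import re

# Hypothetical values for this sketch, not the product's configuration.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT_WORDS = {"ssn", "social", "security"}
BASE_SCORE = 0.5
CONTEXT_BOOST = 0.35
WINDOW = 40  # number of characters inspected before each match


def score_matches(text: str) -> list[tuple[str, float]]:
    """Return (matched text, confidence) pairs, boosted by nearby context words."""
    results = []
    for m in SSN_PATTERN.finditer(text):
        before = text[max(0, m.start() - WINDOW):m.start()].lower()
        words = set(re.findall(r"[a-z]+", before))
        boost = CONTEXT_BOOST if words & CONTEXT_WORDS else 0.0
        results.append((m.group(), round(min(BASE_SCORE + boost, 1.0), 2)))
    return results
```

In this sketch the same digits score higher after "SSN:" than inside an order reference, which gives a reviewer a basis for deciding which detections to inspect first.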

Entity Classification

Detected items are categorized by type

Human-in-the-Loop Review

Review all detections, override false positives, and approve before anonymization

Apply Anonymization

Choose your method: Replace, Redact, Hash, Encrypt, Asymmetric Encrypt, Mask, or Keep
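
The sketch below shows what several of these methods might do to a detected span, using only the Python standard library. The operator behavior (placeholder format, SHA-256 for hashing, keeping the last four characters when masking) is assumed for the example; the encryption methods are omitted because they need a key and a cryptography library.

```python
import hashlib


def mask(value: str, visible: int = 4, char: str = "*") -> str:
    """Hide everything except the last `visible` characters."""
    if visible <= 0:
        return char * len(value)
    return char * max(len(value) - visible, 0) + value[-visible:]


# Hypothetical operators keyed by method name.
OPERATORS = {
    "replace": lambda value, entity_type: f"<{entity_type}>",
    "redact": lambda value, entity_type: "",
    "hash": lambda value, entity_type: hashlib.sha256(value.encode("utf-8")).hexdigest(),
    "mask": lambda value, entity_type: mask(value),
    "keep": lambda value, entity_type: value,
}


def anonymize(text: str, detections: list[tuple[int, int, str]], method: str) -> str:
    """Apply one method to every (start, end, entity_type) span."""
    operator = OPERATORS[method]
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, entity_type in sorted(detections, reverse=True):
        text = text[:start] + operator(text[start:end], entity_type) + text[end:]
    return text
```

Note that an unsalted hash of a short, guessable value (such as a phone number) can be reversed by brute force, so a real deployment would likely add a secret salt or key.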

Output Document

Download your anonymized document

MCP Server: Privacy-First AI Integration

How your data flows through the MCP Server to keep AI tools safe

The MCP Server acts as a privacy shield: it intercepts requests from AI tools, anonymizes PII, passes only the anonymized data to the AI, and optionally restores the original values in the response.

AI Tool Request

Your AI tool (Cursor, Claude) sends a request containing PII

MCP Server Intercepts

Server analyzes and detects all PII entities

Anonymization

PII is replaced with tokens or redacted

AI Processing

AI receives and processes only anonymized data

Response Return

AI response comes back through MCP Server

De-tokenization

Optional: Original values restored for user
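
The tokenize-then-restore flow above can be sketched with an in-memory mapping between original values and placeholder tokens. The `TokenVault` class, its method names, and the `[EMAIL_1]` token format are hypothetical; they are not the MCP Server's interface.

```python
import re


class TokenVault:
    """Keeps a two-way mapping so tokens can be swapped back after the AI call."""

    def __init__(self) -> None:
        self._by_value: dict[str, str] = {}
        self._by_token: dict[str, str] = {}

    def tokenize(self, text: str, pattern: re.Pattern, entity_type: str) -> str:
        def _swap(m: re.Match) -> str:
            value = m.group()
            if value not in self._by_value:
                # The same value always maps to the same token.
                token = f"[{entity_type}_{len(self._by_value) + 1}]"
                self._by_value[value] = token
                self._by_token[token] = value
            return self._by_value[value]

        return pattern.sub(_swap, text)

    def detokenize(self, text: str) -> str:
        for token, value in self._by_token.items():
            text = text.replace(token, value)
        return text
```

With this shape, the AI only ever sees text such as `Reply to [EMAIL_1]`, and `detokenize` restores the original address in the response before it reaches the user.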



Frequently Asked Questions

Does cloak.business use AI for detection?

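Detection is regex-first rather than AI-only. Structured identifiers are matched by deterministic regex patterns, so the same input always produces the same output. Names and locations are detected by NLP models (spaCy, Stanza, XLM-RoBERTa), and each of those detections carries a confidence score for review. This differs from AI-only approaches, where every detection is probabilistic.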

Why regex patterns instead of AI?

Regex patterns are auditable and reproducible, which simplifies compliance documentation. You can inspect exactly what each pattern matches. AI-only detection is probabilistic, and results that can vary between runs are difficult to document for compliance.

How accurate is the detection?

With 317 custom pattern recognizers that include validation beyond the regex itself (for example the Luhn checksum for card numbers, IBAN check digits, and SSN format rules), cloak.business achieves significantly higher accuracy than generic NER models, especially for structured identifiers like credit cards, tax IDs, and national ID numbers.
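
Checksum validation is what lets a recognizer reject strings that merely look like an identifier. The two functions below are a minimal sketch of the standard Luhn check and the IBAN mod-97 check; the function names are chosen for this example, and length and country rules are left out.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN check: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and test that the result mod 97 equals 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

A 16-digit string that fails `luhn_valid` can be dropped or scored lower before it ever reaches a reviewer, which is one way false positives on structured identifiers are reduced.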

Which languages are supported?

48 languages are supported with dedicated NLP models for named entity recognition. Pattern-based detection (regex) works across all languages since it matches character patterns regardless of language.

Can I add custom entity patterns?

Yes. The API supports custom recognizer definitions so you can add patterns for proprietary identifiers, internal reference numbers, or domain-specific data formats.
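
As a rough idea of what a custom recognizer definition could contain, the sketch below models one as a small Python object that can be tried locally against sample text. The `CustomRecognizer` class, its field names, and the `EMP-` identifier format are hypothetical; consult the API documentation for the actual request schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomRecognizer:
    # All field names here are assumptions made for this example.
    name: str
    entity_type: str
    pattern: str
    score: float = 0.8

    def find(self, text: str) -> list[tuple[int, int, str, float]]:
        """Return (start, end, entity_type, score) for every match."""
        return [
            (m.start(), m.end(), self.entity_type, self.score)
            for m in re.finditer(self.pattern, text)
        ]


# Example: an internal employee ID such as EMP-004217.
employee_id = CustomRecognizer(
    name="employee_id",
    entity_type="EMPLOYEE_ID",
    pattern=r"\bEMP-\d{6}\b",
)
```

Testing a pattern this way against representative documents before registering it helps catch patterns that are too broad or too narrow.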

See It in Action

Try our PII detection and anonymization free with 200 tokens per cycle.