cloak.business

How to Detect PII in Documents: A Complete Guide

A comparison of regex, NLP, and hybrid approaches to PII detection.

March 1, 2026 · 9 min read

What is PII?

Personally Identifiable Information (PII) is any data that can be used to identify a specific individual, either on its own or when combined with other data. The definition varies by regulation, but the core idea is the same: if it points to a person, it needs protection.

PII falls into three broad categories, each with different risk profiles and regulatory treatment:

Direct Identifiers

Full name, SSN, passport number, driver's license, email address, phone number

Directly identifies an individual without additional data

Quasi-Identifiers

Date of birth, ZIP code, gender, job title, IP address

Can identify individuals when combined with other data points

Sensitive Data

Medical records, financial data, biometrics, racial/ethnic origin, political opinions

Subject to stricter regulations under GDPR Article 9, HIPAA
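To make the quasi-identifier risk concrete, here is a minimal sketch that measures how many records become unique once several harmless-looking fields are combined. The records and the `unique_fraction` helper are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy records; no field here is a direct identifier.
records = [
    {"dob": "1990-04-12", "zip": "10115", "gender": "F"},
    {"dob": "1990-04-12", "zip": "10115", "gender": "M"},
    {"dob": "1985-11-03", "zip": "10115", "gender": "F"},
    {"dob": "1985-11-03", "zip": "10115", "gender": "F"},
]

def unique_fraction(rows, fields):
    """Share of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)

print(unique_fraction(records, ["zip"]))                   # 0.0
print(unique_fraction(records, ["dob", "zip", "gender"]))  # 0.5
```

ZIP code alone singles out nobody in this sample, but date of birth, ZIP code, and gender together isolate half of the records.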

Why PII Detection Matters

You cannot protect what you cannot find. PII detection is the foundational step in any data protection strategy. Without it, compliance is guesswork, breach response is slow, and AI systems risk leaking sensitive data.

Compliance

GDPR, CCPA, and HIPAA all require organizations to know what personal data they hold and where it lives. Automated detection makes this feasible at scale.

Breach Prevention

The average data breach costs $4.45M (IBM, 2023). Detecting PII before it leaves your perimeter — in documents, emails, or AI prompts — is far cheaper than incident response.

AI Safety

LLMs ingest everything you paste. Without PII detection, sensitive data ends up in training sets, chat logs, and third-party servers. Detection is the first line of defense.


Detection Methods Compared

There are four main approaches to PII detection. Each has trade-offs in speed, accuracy, and coverage. The best systems combine multiple methods.

Method | Speed | Accuracy | Scalability
Manual Review | Very slow | Variable | Does not scale
Regex Patterns | Very fast | High (known formats) | Excellent
NLP Models | Moderate | High (names, locations) | Good
Hybrid (Regex + NLP), used by cloak.business | Fast | Highest | Excellent

Manual Review

Strengths: Human judgment, context understanding

Weaknesses: Error-prone, expensive, inconsistent across reviewers

Regex Patterns

Strengths: Deterministic; reliably catches any value that follows a known format

Weaknesses: Limited to known formats, no context awareness
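A minimal sketch of regex-based detection, assuming two simplified patterns (email address and US SSN). The patterns and the `find_pii` helper are illustrative and are not cloak.business's actual recognizers.

```python
import re

# Illustrative patterns only; production recognizers need stricter rules.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return (type, start, end, value) for every pattern match, in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])

sample = "John Smith, SSN 123-45-6789, email john@example.com."
for hit in find_pii(sample):
    print(hit)
```

The SSN and the email address are found, but the name "John Smith" is not, because it follows no fixed format. That is the gap the weakness above describes.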

NLP Models

Strengths: Context-aware, handles unstructured text, finds names

Weaknesses: Language-dependent, requires training data, slower

Hybrid (Regex + NLP)

Strengths: Best of both worlds — format precision + context awareness

Weaknesses: More complex to implement and maintain
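The hybrid idea can be sketched as two passes whose results are merged. In this sketch `toy_ner_pass` is a hypothetical stand-in that only mimics the output of an NLP model with a fixed lookup, and the rule "keep the higher-scoring span on overlap" is an assumed merge policy rather than a documented one.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def regex_pass(text):
    """Format-based pass: exact pattern matches get full confidence."""
    return [{"type": "US_SSN", "start": m.start(), "end": m.end(), "score": 1.0}
            for m in SSN.finditer(text)]

def toy_ner_pass(text):
    """Hypothetical placeholder for an NER model (fixed lookup, fixed score)."""
    labels = {"John Smith": "PERSON", "Berlin": "LOCATION"}
    hits = []
    for phrase, label in labels.items():
        for m in re.finditer(re.escape(phrase), text):
            hits.append({"type": label, "start": m.start(),
                         "end": m.end(), "score": 0.85})
    return hits

def merge(*passes):
    """Combine passes; when spans overlap, keep the higher-scoring one."""
    candidates = sorted((h for p in passes for h in p),
                        key=lambda h: (-h["score"], h["start"]))
    kept = []
    for h in candidates:
        if all(h["end"] <= k["start"] or h["start"] >= k["end"] for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h["start"])

text = "John Smith, SSN 123-45-6789, lives in Berlin."
for entity in merge(regex_pass(text), toy_ner_pass(text)):
    print(entity["type"], text[entity["start"]:entity["end"]])
```

In a real system the placeholder pass would call an NLP library, and the merge step is where most of the extra implementation complexity lives.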

What cloak.business Detects

cloak.business uses the hybrid approach: 317 custom regex pattern recognizers combined with NLP models for names, locations, and context-dependent entities. This combination delivers the highest accuracy across structured and unstructured text.

~320 entity types

70+ countries covered

48 supported languages

Detection covers everything from universal identifiers (emails, phone numbers, credit cards) to country-specific formats (US SSN, German Personalausweis, Japanese My Number) and secrets (AWS keys, GitHub tokens, database connection strings).

Getting Started: Implementation Steps

Integrating PII detection into your workflow takes minutes, not weeks. Here is a typical implementation path:

1. Sign up and get your API key

Create a free account at cloak.business/auth/signup. Your API key is available immediately in the dashboard.

2. Choose your country presets

Select from 85+ country presets or use "Auto-detect" to let the system identify relevant entity types automatically.

3. Call the API

Send text to the analyze endpoint and get back detected entities with positions, types, and confidence scores.

Example: Detect PII via API

curl -X POST https://cloak.business/api/v1/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "John Smith, SSN 123-45-6789, lives in Berlin.",
    "language": "en",
    "entities": ["PERSON", "US_SSN", "LOCATION"]
  }'

# Response:
# [
#   { "type": "PERSON",   "start": 0,  "end": 10, "score": 0.92 },
#   { "type": "US_SSN",   "start": 16, "end": 27, "score": 1.0  },
#   { "type": "LOCATION", "start": 38, "end": 44, "score": 0.88 }
# ]
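Once the entities come back, a common next step is to mask them. The sketch below assumes the response has the shape shown in the example above, with `start` and `end` as end-exclusive character offsets. The `redact` helper and its `min_score` parameter are hypothetical and not part of the API.

```python
def redact(text, entities, min_score=0.0):
    """Replace each detected span with a <TYPE> placeholder.

    Spans are applied right to left so earlier offsets stay valid.
    Entities scoring below `min_score` are left untouched.
    """
    selected = [e for e in entities if e["score"] >= min_score]
    for e in sorted(selected, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"<{e['type']}>" + text[e["end"]:]
    return text

text = "John Smith, SSN 123-45-6789, lives in Berlin."
response = [
    {"type": "PERSON", "start": 0, "end": 10, "score": 0.92},
    {"type": "US_SSN", "start": 16, "end": 27, "score": 1.0},
    {"type": "LOCATION", "start": 38, "end": 44, "score": 0.88},
]

print(redact(text, response))
# <PERSON>, SSN <US_SSN>, lives in <LOCATION>.
print(redact(text, response, min_score=0.9))
# <PERSON>, SSN <US_SSN>, lives in Berlin.
```

Raising `min_score` trades recall for fewer false positives, which is why the confidence score is worth keeping alongside each span.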

Accuracy: Why Pattern Count Matters

Most open-source PII detection tools ship with 20-30 built-in recognizers. This covers the basics — US SSNs, credit cards, email addresses — but misses the vast majority of country-specific formats. A German tax ID, a Brazilian CPF, or a Japanese My Number will pass through undetected.

cloak.business uses 317 custom pattern recognizers with checksum validation and format verification. That is 10x more coverage than default Presidio, and each recognizer is tuned for real-world formats, not just textbook examples.

317 pattern recognizers vs. ~30 in default Presidio. More patterns mean fewer missed entities (false negatives) and more reliable compliance evidence.
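As an illustration of what checksum validation adds on top of a format match, here is a minimal sketch using the Luhn algorithm for payment card numbers. The `CARD` pattern and the `find_cards` helper are simplified assumptions, not the recognizers cloak.business ships.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_valid(number):
    """Luhn check: double every second digit from the right, sum, mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_cards(text):
    """Keep only format matches that also pass the checksum."""
    return [m.group() for m in CARD.finditer(text) if luhn_valid(m.group())]

text = "Valid: 4111 1111 1111 1111. Typo: 4111 1111 1111 1112."
print(find_cards(text))  # ['4111 1111 1111 1111']
```

Both numbers match the format, but only the first passes the checksum, so the second is rejected instead of being reported as a false positive.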

Read the detailed accuracy comparison

Compliance Frameworks That Require PII Detection

PII detection is not optional under major data protection regulations. Each framework has specific requirements that make automated detection a practical necessity:

GDPR

Article 30 — Records of Processing Activities

Organizations must maintain a register of all PII processing activities, including categories of personal data processed.

Maximum penalty: Up to 20M EUR or 4% of annual global turnover, whichever is higher

CCPA

Section 1798.100 — Right to Know

Businesses must disclose what personal information they collect. Requires knowing where PII exists in your systems.

Maximum penalty: Up to $7,500 per intentional violation

HIPAA

Safe Harbor De-identification (164.514)

Requires removal or generalization of 18 specific identifiers to de-identify protected health information (PHI).

Maximum penalty: Up to $1.5M per violation category per year
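Generalization under Safe Harbor can be sketched for three of the 18 identifiers: dates are reduced to the year, ages of 90 and over are grouped, and ZIP codes are cut to their first three digits (or replaced with 000 for sparsely populated prefixes). The helper names are hypothetical, and `SMALL_POPULATION_PREFIXES` is an illustrative placeholder for the Census-derived list a real implementation would need.

```python
from datetime import date

# Illustrative placeholder: a real system needs the full, current list of
# 3-digit ZIP prefixes whose combined population is 20,000 or fewer.
SMALL_POPULATION_PREFIXES = {"036", "059"}

def generalize_zip(zip_code):
    """Keep the first three digits, or 000 for sparsely populated prefixes."""
    prefix = zip_code[:3]
    return "000" if prefix in SMALL_POPULATION_PREFIXES else prefix

def generalize_date(d):
    """Drop month and day; only the year may be retained."""
    return str(d.year)

def generalize_age(age):
    """Ages of 90 and over are aggregated into a single category."""
    return "90+" if age >= 90 else str(age)

record = {"zip": "10115", "admitted": date(2024, 3, 17), "age": 93}
deidentified = {
    "zip": generalize_zip(record["zip"]),
    "admitted": generalize_date(record["admitted"]),
    "age": generalize_age(record["age"]),
}
print(deidentified)  # {'zip': '101', 'admitted': '2024', 'age': '90+'}
```

Detection still comes first: these transformations can only be applied once the fields or text spans holding the identifiers have been found.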

Where to Deploy PII Detection

Effective PII protection requires detection at every point where data enters or leaves your systems. cloak.business provides tools for each:

In AI Conversations

The Chrome extension intercepts PII in real time, before messages reach ChatGPT, Claude, Gemini, and 3 more platforms.

Chrome Extension

In Documents

The desktop app scans documents locally with zero-knowledge architecture. Data never leaves your machine.

Desktop App

In Your Pipeline

The REST API integrates into any workflow — ETL pipelines, CRM exports, document management systems, or custom apps.

API Documentation

In Office Documents

The Office Add-in detects and anonymizes PII directly inside Word, preserving document formatting.

Office Add-in

Key Takeaways

  • Hybrid detection is the gold standard — Regex alone misses names and context; NLP alone misses formatted IDs. Combine both for the highest accuracy
  • Pattern count directly impacts accuracy — 317 recognizers cover 10x more entity types than default open-source tools
  • Compliance mandates detection — GDPR, CCPA, and HIPAA all require organizations to know what PII they hold
  • Detect at every boundary — AI prompts, documents, APIs, and Office files all need coverage to prevent leaks


Ready to Protect Your Data?

Start detecting and anonymizing PII in minutes with our free plan.