Analyzer Guide -- Finding Personal Information
Last Updated: 2026-02-11
The Analyzer is the detection engine at the core of cloak.business. It scans text for personally identifiable information (PII) using a combination of pattern matching and natural language processing (NLP) models.
Table of Contents
- What the Analyzer Does
- How Detection Works
- Confidence Scores
- Context Words
- Entity Types
- Backend Request Limits
- Language Selection
- Tips for Best Results
What the Analyzer Does
The Analyzer examines text and identifies segments that contain personal or sensitive information. For each detected entity, it returns:
- The entity type (e.g., PERSON, CREDIT_CARD, DE_TAX_ID)
- The exact position in the text (start and end offsets)
- A confidence score indicating how certain the detection is
The Analyzer does not modify text. It only identifies PII. Anonymization is a separate step that uses the Analyzer's output.
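As a rough illustration, a detection can be modeled as a small record holding the three values listed above. The `Detection` class, its field names, and the assumption that the end offset is exclusive are hypothetical and not taken from the actual API schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected entity (hypothetical shape, not the real response schema)."""
    entity_type: str  # e.g. "PERSON", "CREDIT_CARD", "DE_TAX_ID"
    start: int        # offset of the first character of the match
    end: int          # offset one past the last character (assumed exclusive)
    score: float      # confidence between 0.0 and 1.0


def matched_text(text: str, detection: Detection) -> str:
    """Return the slice of the original text that a detection points at."""
    return text[detection.start:detection.end]


text = "Contact Anna Schmidt at anna@example.com"
detections = [
    Detection("PERSON", 8, 20, 0.85),
    Detection("EMAIL_ADDRESS", 24, 40, 0.95),
]

for detection in detections:
    print(detection.entity_type, repr(matched_text(text, detection)))
```

The input text is only read, never changed, which mirrors the split between analysis and anonymization.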
How Detection Works
Detection uses two complementary approaches that run in parallel:
Pattern-Based Recognition (317 Recognizers)
- Regex patterns match structured data formats: national ID numbers, tax IDs, credit card numbers, phone numbers, IBANs, license plates, postal codes, and more.
- Each recognizer is tuned for a specific country or format. For example, the German ID card recognizer matches the exact format of a German Personalausweis number.
- Pattern recognizers work regardless of the text's language because they match data format, not language.
- Coverage spans 75+ countries and 390+ entity types.
NLP-Based Recognition
- Named Entity Recognition (NER) models detect names of people, locations, and organizations.
- NLP models understand context and grammar to identify entities that do not follow a fixed pattern.
- Multiple NLP engines are available:
  - spaCy models for 25 languages
  - Stanza NER for 7 languages
  - XLM-RoBERTa transformers for 16 languages
When both a pattern recognizer and an NLP model detect the same entity, the higher of the two confidence scores is used.
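The "higher score wins" rule can be sketched as follows. This is a minimal illustration, not the Analyzer's implementation: the function name `merge_detections` and the dictionary keys are hypothetical, and it assumes "the same entity" means an identical entity type and identical start and end offsets.

```python
def merge_detections(detections):
    """Keep one detection per (type, start, end), preferring the highest score."""
    best = {}
    for det in detections:
        key = (det["entity_type"], det["start"], det["end"])
        if key not in best or det["score"] > best[key]["score"]:
            best[key] = det
    # Return results in reading order.
    return sorted(best.values(), key=lambda d: (d["start"], d["end"]))


pattern_hits = [{"entity_type": "PHONE_NUMBER", "start": 5, "end": 17, "score": 0.75}]
nlp_hits = [
    {"entity_type": "PHONE_NUMBER", "start": 5, "end": 17, "score": 0.6},
    {"entity_type": "PERSON", "start": 22, "end": 34, "score": 0.85},
]
merged = merge_detections(pattern_hits + nlp_hits)
```

Here the phone number is reported once with the pattern recognizer's score of 0.75, and the name found only by the NLP model is kept unchanged.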
Confidence Scores
Every detection includes a confidence score from 0.0 to 1.0.
| Range | Level | What It Means |
|---|---|---|
| 0.85 - 1.0 | High | Strong format match with checksum validation or strong contextual support. Very likely correct. |
| 0.5 - 0.85 | Medium | Pattern match with some context. Likely correct but worth reviewing. |
| 0.3 - 0.5 | Low | Generic pattern detected. May be a false positive. Review manually. |
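As a rough illustration, the ranges in the table can be mapped to levels like this. The function name `confidence_level` is hypothetical. Because adjacent ranges in the table share their boundary values (0.5 and 0.85), this sketch assumes each lower bound is inclusive, and because the table does not describe scores below 0.3, it reports them separately.

```python
def confidence_level(score: float) -> str:
    """Map a confidence score to the level names used in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.85:   # assumption: 0.85 itself counts as High
        return "High"
    if score >= 0.5:    # assumption: 0.5 itself counts as Medium
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Below listed ranges"  # not covered by the table


for value in (0.95, 0.6, 0.35):
    print(value, confidence_level(value))
```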
What Affects Confidence
- Checksum validation: Entity types like credit cards and IBANs include mathematical validation (Luhn algorithm, IBAN check digits). Passing validation significantly boosts confidence.
- Context words: The presence of related words near the detected value increases confidence (see below).
- Pattern specificity: A highly specific pattern (e.g., a German tax ID with exact format) scores higher than a generic numeric pattern.
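To make the checksum idea concrete, here is a minimal sketch of the standard Luhn check used for credit card numbers. It illustrates the algorithm in general, not the Analyzer's own code, and the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A 16-digit string that merely looks like a card number fails this check about nine times out of ten, which is why passing it justifies a higher score.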
Context Words
Each recognizer has language-specific context words that boost detection confidence when they appear near a potential entity.
How Context Words Work
- The pattern recognizer finds a value matching a known format.
- The system checks the surrounding text (typically within a 5-word window) for context words.
- If context words are present, the confidence score is increased.
Examples
| Entity Type | Context Words (English) | Context Words (German) |
|---|---|---|
| CREDIT_CARD | credit card, card number, CC | Kreditkarte, Kartennummer |
| DE_TAX_ID | tax ID, tax number | Steuer-ID, Steueridentifikationsnummer, IdNr |
| PHONE_NUMBER | phone, tel, call, mobile | Telefon, Handy, Rufnummer, Mobil |
| EMAIL_ADDRESS | email, e-mail, contact | E-Mail, Kontakt |
Context words are defined per recognizer and per language. The system supports context words in all 48 interface languages.
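The context-word steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: the boost amount of 0.35, the word-splitting rule, and all function and variable names are hypothetical, and the window is applied to five words on each side of the match.

```python
import re

# English context words for CREDIT_CARD, taken from the table above.
CONTEXT_WORDS = {"CREDIT_CARD": ["credit card", "card number", "CC"]}
WINDOW = 5    # words inspected on each side of the match
BOOST = 0.35  # hypothetical boost amount


def _contains_phrase(words, phrases):
    """Check whether any phrase occurs as whole words in the word list."""
    window = " " + " ".join(words) + " "
    return any(f" {phrase.lower()} " in window for phrase in phrases)


def boost_with_context(text, start, end, entity_type, score):
    """Raise the score if a context word appears near text[start:end]."""
    phrases = CONTEXT_WORDS.get(entity_type, [])
    before = re.findall(r"\w+", text[:start].lower())[-WINDOW:]
    after = re.findall(r"\w+", text[end:].lower())[:WINDOW]
    if _contains_phrase(before, phrases) or _contains_phrase(after, phrases):
        return min(1.0, score + BOOST)  # never exceed the maximum of 1.0
    return score


number = "4111 1111 1111 1111"
labeled = f"Card number: {number}"
start = labeled.find(number)
boosted = boost_with_context(labeled, start, start + len(number), "CREDIT_CARD", 0.5)
```

With the label "Card number:" in front, the score rises above the base value; the same number with no context word nearby keeps its original score.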
Entity Types
Detected entities are organized into the following categories:
Personal Identifiers
Names, dates of birth, ages, genders, nationalities, and biometric identifiers. Detected primarily by NLP models.
Government-Issued IDs
National ID numbers, passport numbers, driver's license numbers, social security numbers, and tax IDs. Covers 75+ countries with country-specific formats.
Financial Identifiers
Credit card numbers, IBANs, SWIFT/BIC codes, bank account numbers, and securities identifiers (CUSIP, ISIN, SEDOL, LEI).
Location Data
Addresses, postal codes, GPS coordinates, and IP addresses.
Digital Identifiers
Email addresses, phone numbers, URLs, MAC addresses, and license plates.
Technical Secrets
API keys, access tokens, SSH keys, database connection strings, and credentials from 30+ platforms (cloud providers, AI services, SaaS tools).
Healthcare
Medical record numbers, prescription numbers, diagnosis codes (ICD-10, ICD-11), procedure codes, and country-specific health insurance numbers.
Organization
Company names, legal entity identifiers, and registration numbers.
Temporal
Dates, times, and durations that may be personally identifying.
For the complete list of all 390+ entity types and 157+ presets, see the Entity & Preset Inventory.
Backend Request Limits
The analyzer enforces server-side limits on every request to prevent resource exhaustion. These limits apply regardless of which client sends the request.
| Limit | Value | Description |
|---|---|---|
| Entity types per request | 250 | Maximum entity types in the filter list |
| Ad-hoc recognizers per request | 50 | Maximum custom recognizers |
| Patterns per recognizer | 10 | Maximum regex patterns per ad-hoc recognizer |
| Context words per recognizer | 30 | Maximum context boost words |
| Total ad-hoc patterns per request | 200 | Total regex compilation budget |
Requests exceeding these limits receive a 422 validation error with a descriptive message. Clients can discover the current limits via the GET /limits endpoint.
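A client can check a request against these limits before sending it and avoid a rejected call. The sketch below hard-codes the values from the table; the key names, the request shape (a list of entity types plus a list of recognizer dictionaries with `patterns` and `context` fields), and the function name `check_limits` are hypothetical and not taken from the API.

```python
# Limits copied from the table above; key names are hypothetical.
LIMITS = {
    "entity_types": 250,
    "recognizers": 50,
    "patterns_per_recognizer": 10,
    "context_words_per_recognizer": 30,
    "total_patterns": 200,
}


def check_limits(entity_types, recognizers):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if len(entity_types) > LIMITS["entity_types"]:
        problems.append(f"too many entity types: {len(entity_types)}")
    if len(recognizers) > LIMITS["recognizers"]:
        problems.append(f"too many ad-hoc recognizers: {len(recognizers)}")
    total_patterns = 0
    for index, recognizer in enumerate(recognizers):
        patterns = recognizer.get("patterns", [])
        context = recognizer.get("context", [])
        total_patterns += len(patterns)
        if len(patterns) > LIMITS["patterns_per_recognizer"]:
            problems.append(f"recognizer {index}: too many patterns")
        if len(context) > LIMITS["context_words_per_recognizer"]:
            problems.append(f"recognizer {index}: too many context words")
    if total_patterns > LIMITS["total_patterns"]:
        problems.append(f"too many ad-hoc patterns in total: {total_patterns}")
    return problems


# 21 recognizers with 10 patterns each respect every per-recognizer limit
# but exceed the total budget of 200 patterns.
recognizers = [{"patterns": [r"\d{4}"] * 10, "context": ["id"]} for _ in range(21)]
problems = check_limits(["PERSON"], recognizers)
```

In practice the hard-coded values would be better replaced by whatever the GET /limits endpoint returns, since the limits may change.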
Language Selection
Auto-Detect (Default)
The system identifies the primary language of the input text automatically. This works well for most cases.
Manual Selection
Choose a specific language to:
- Load the correct NLP model for that language, improving name and location detection.
- Activate language-specific context words for pattern recognizers.
- Improve OCR accuracy when processing images.
What Language Affects
| Component | Effect of Language Selection |
|---|---|
| NLP models | Loads the appropriate language model for name/location detection |
| Context words | Activates language-specific context words that boost confidence |
| OCR (images) | Sends a language hint to the OCR engine for better text extraction |
| Pattern matching | Not affected -- patterns match data format regardless of language |
Tips for Best Results
- Provide more context. Longer text gives the Analyzer more surrounding words to evaluate. A credit card number in isolation may score lower than one preceded by "Card Number:".
- Use the right preset. Country-specific presets enable only the relevant entity types, reducing false positives. If you are processing German documents, select the Germany preset.
- Check low-confidence detections. Entities with scores below 0.5 are more likely to be false positives. Review them before anonymizing.
- Select the correct language. If auto-detect picks the wrong language, manually select the correct one. This improves NLP detection significantly.
- Include headers and labels. Documents with clear labels like "Name:", "Address:", "Tax ID:" provide strong context words that boost confidence scores.
- Review before anonymizing. Always review detected entities and deselect any false positives before proceeding to anonymization.