Analyzer Guide -- Finding Personal Information

Last Updated: 2026-02-11


The Analyzer is the detection engine at the core of cloak.business. It scans text for personally identifiable information (PII) using a combination of pattern matching and natural language processing (NLP) models.


Table of Contents

  1. What the Analyzer Does
  2. How Detection Works
  3. Confidence Scores
  4. Context Words
  5. Entity Types
  6. Backend Request Limits
  7. Language Selection
  8. Tips for Best Results

What the Analyzer Does

The Analyzer examines text and identifies segments that contain personal or sensitive information. For each detected entity, it returns:

  • The entity type (e.g., PERSON, CREDIT_CARD, DE_TAX_ID)
  • The exact position in the text (start and end offsets)
  • A confidence score indicating how certain the detection is

The Analyzer does not modify text. It only identifies PII. Anonymization is a separate step that uses the Analyzer's output.
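
As a rough illustration of that output, the sketch below models one detection as a small Python record and uses its offsets to read the matched span. The `Detection` class, its field names, the exclusive end offset, and the scores are assumptions made for this example, not the documented response schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected entity (hypothetical shape, for illustration only)."""
    entity_type: str  # e.g. "PERSON", "CREDIT_CARD", "DE_TAX_ID"
    start: int        # offset of the first character of the match
    end: int          # offset one past the last character (assumed exclusive)
    score: float      # confidence between 0.0 and 1.0


text = "Contact Anna Schmidt at anna@example.com."
detections = [
    Detection("PERSON", 8, 20, 0.85),          # illustrative score
    Detection("EMAIL_ADDRESS", 24, 40, 0.95),  # illustrative score
]

# The input text is left untouched; offsets are only used to look up spans.
spans = [(d.entity_type, text[d.start:d.end]) for d in detections]
print(spans)
```

Because the detections only point into the text, a later anonymization step can decide independently what to do with each span.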


How Detection Works

Detection uses two complementary approaches that run in parallel:

Pattern-Based Recognition (317 Recognizers)

  • Regex patterns match structured data formats: national ID numbers, tax IDs, credit card numbers, phone numbers, IBANs, license plates, postal codes, and more.
  • Each recognizer is tuned for a specific country or format. For example, the German ID card recognizer matches the exact format of a German Personalausweis number.
  • Pattern recognizers work regardless of the text's language because they match data format, not language.
  • Coverage spans 75+ countries and 390+ entity types.
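
To make the idea of a pattern recognizer concrete, here is a minimal sketch built on Python's `re` module. The `PatternRecognizer` class, the simplified e-mail regex, and the base score of 0.5 are all assumptions for illustration; the product's actual recognizers and patterns are not shown in this guide.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternRecognizer:
    """A regex-based recognizer (hypothetical structure)."""
    entity_type: str
    pattern: re.Pattern
    base_score: float  # illustrative value, not a product default

    def find(self, text: str) -> list[tuple[str, int, int, float]]:
        # Return (entity_type, start, end, score) for every regex match.
        return [
            (self.entity_type, m.start(), m.end(), self.base_score)
            for m in self.pattern.finditer(text)
        ]


# Deliberately simplified e-mail pattern; real patterns are usually stricter.
email = PatternRecognizer(
    entity_type="EMAIL_ADDRESS",
    pattern=re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    base_score=0.5,
)

# The same pattern matches in English and German text alike.
for sentence in ("Write to max@example.org today.",
                 "Schreiben Sie an max@example.org."):
    print(email.find(sentence))
```

The recognizer never inspects the surrounding words, which is why the language of the sentence does not matter at this stage.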

NLP-Based Recognition

  • Named Entity Recognition (NER) models detect names of people, locations, and organizations.
  • NLP models understand context and grammar to identify entities that do not follow a fixed pattern.
  • Multiple NLP engines are available:
    • spaCy models for 25 languages
    • Stanza NER for 7 languages
    • XLM-RoBERTa transformers for 16 languages

When a pattern recognizer and an NLP model both detect the same entity, the higher of the two confidence scores is used.
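
The "higher score wins" rule can be sketched as follows. This example assumes that two detections count as the same entity when their entity type and offsets are identical; how the Analyzer actually treats partially overlapping spans is not specified here. The function name and the tuple layout are hypothetical, and the sample hits are invented.

```python
def merge_detections(pattern_hits, nlp_hits):
    """Keep one detection per (entity_type, start, end), with the higher score.

    Each hit is a tuple: (entity_type, start, end, score).
    """
    best = {}
    for entity_type, start, end, score in [*pattern_hits, *nlp_hits]:
        key = (entity_type, start, end)
        if key not in best or score > best[key]:
            best[key] = score
    merged = [(t, s, e, score) for (t, s, e), score in best.items()]
    # Sort by position so the result follows the order of the text.
    return sorted(merged, key=lambda d: (d[1], d[2], d[0]))


pattern_hits = [("PHONE_NUMBER", 10, 22, 0.4)]
nlp_hits = [("PHONE_NUMBER", 10, 22, 0.7), ("PERSON", 0, 4, 0.85)]
print(merge_detections(pattern_hits, nlp_hits))
```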


Confidence Scores

Every detection includes a confidence score from 0.0 to 1.0.

  Range        Level    What It Means
  0.85 - 1.0   High     Strong format match with checksum validation or strong contextual support. Very likely correct.
  0.5 - 0.85   Medium   Pattern match with some context. Likely correct but worth reviewing.
  0.3 - 0.5    Low      Generic pattern detected. May be a false positive. Review manually.
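
A client that wants to label scores with these levels could use a helper like the one below. The listed ranges share their boundary values (0.85 and 0.5), so this sketch assumes a boundary score belongs to the higher level. Scores under 0.3 are not described in the table and get a placeholder label here. The function name is hypothetical.

```python
def confidence_level(score: float) -> str:
    """Map a confidence score to the level names used in this guide."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.85:   # boundary assigned to the higher level (assumption)
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Below listed ranges"  # not covered by the table


for s in (0.95, 0.85, 0.6, 0.35, 0.1):
    print(s, confidence_level(s))
```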

What Affects Confidence

  • Checksum validation: Entity types like credit cards and IBANs include mathematical validation (Luhn algorithm, IBAN check digits). Passing validation significantly boosts confidence.
  • Context words: The presence of related words near the detected value increases confidence (see below).
  • Pattern specificity: A highly specific pattern (e.g., a German tax ID with exact format) scores higher than a generic numeric pattern.
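
For reference, the Luhn check mentioned above is short enough to show in full. The sketch below is a standard implementation of the algorithm; how the Analyzer feeds the result into the final confidence score is not covered here. The sample number is a widely used test card number, not a real account.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Allow the common separators, then require plain ASCII digits.
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A 16-digit string that fails this check can still match a generic numeric pattern, which is one reason a validated match deserves a higher score than an unvalidated one.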

Context Words

Each recognizer has language-specific context words that boost detection confidence when they appear near a potential entity.

How Context Words Work

  1. The pattern recognizer finds a value matching a known format.
  2. The system checks the surrounding text (typically within a 5-word window) for context words.
  3. If context words are present, the confidence score is increased.
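
The three steps above can be sketched in a few lines. The function names, the simple word tokenizer, and the boost of 0.25 are assumptions chosen for this example; only the 5-word window comes from the description above.

```python
import re

TOKEN = re.compile(r"\w+")


def has_phrase(tokens, phrase):
    """True if `tokens` contains `phrase` as a contiguous run of words."""
    n = len(phrase)
    return n > 0 and any(
        tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)
    )


def boost_with_context(text, start, end, score, context_words,
                       window=5, boost=0.25):
    """Raise `score` if a context word appears within `window` words."""
    tokens_before = TOKEN.findall(text[:start].lower())
    tokens_after = TOKEN.findall(text[end:].lower())
    before = tokens_before[max(0, len(tokens_before) - window):]
    after = tokens_after[:window]
    for word in context_words:
        phrase = TOKEN.findall(word.lower())  # supports multi-word entries
        if has_phrase(before, phrase) or has_phrase(after, phrase):
            return min(1.0, score + boost)    # hypothetical boost size
    return score


text = "Card number: 4111 1111 1111 1111"
words = ["credit card", "card number", "CC"]
print(boost_with_context(text, 13, 32, 0.5, words))
print(boost_with_context("4111 1111 1111 1111", 0, 19, 0.5, words))
```

The second call has no surrounding text, so the score stays at its base value.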

Examples

  Entity Type     Context Words (English)        Context Words (German)
  CREDIT_CARD     credit card, card number, CC   Kreditkarte, Kartennummer
  DE_TAX_ID       tax ID, tax number             Steuer-ID, Steueridentifikationsnummer, IdNr
  PHONE_NUMBER    phone, tel, call, mobile       Telefon, Handy, Rufnummer, Mobil
  EMAIL_ADDRESS   email, e-mail, contact         E-Mail, Kontakt

Context words are defined per recognizer and per language. The system supports context words in all 48 interface languages.
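
One way to picture "per recognizer and per language" is a nested lookup table. The sketch below uses two rows from the examples above; the `CONTEXT_WORDS` structure, the `en`/`de` language codes, and the empty-list fallback are assumptions, not the product's internal format.

```python
CONTEXT_WORDS = {
    "CREDIT_CARD": {
        "en": ["credit card", "card number", "CC"],
        "de": ["Kreditkarte", "Kartennummer"],
    },
    "PHONE_NUMBER": {
        "en": ["phone", "tel", "call", "mobile"],
        "de": ["Telefon", "Handy", "Rufnummer", "Mobil"],
    },
}


def context_words_for(entity_type: str, language: str) -> list[str]:
    """Look up the context words for one recognizer and one language."""
    # Returning an empty list for unknown combinations is an assumption.
    return CONTEXT_WORDS.get(entity_type, {}).get(language, [])


print(context_words_for("CREDIT_CARD", "de"))
print(context_words_for("PHONE_NUMBER", "fr"))
```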


Entity Types

Detected entities are organized into the following categories:

Personal Identifiers

Names, dates of birth, ages, genders, nationalities, and biometric identifiers. Detected primarily by NLP models.

Government-Issued IDs

National ID numbers, passport numbers, driver's license numbers, social security numbers, and tax IDs. Covers 75+ countries with country-specific formats.

Financial Identifiers

Credit card numbers, IBANs, SWIFT/BIC codes, bank account numbers, and securities identifiers (CUSIP, ISIN, SEDOL, LEI).

Location Data

Addresses, postal codes, GPS coordinates, and IP addresses.

Digital Identifiers

Email addresses, phone numbers, URLs, MAC addresses, and license plates.

Technical Secrets

API keys, access tokens, SSH keys, database connection strings, and credentials from 30+ platforms (cloud providers, AI services, SaaS tools).

Healthcare

Medical record numbers, prescription numbers, diagnosis codes (ICD-10, ICD-11), procedure codes, and country-specific health insurance numbers.

Organization

Company names, legal entity identifiers, and registration numbers.

Temporal

Dates, times, and durations that may be personally identifying.

For the complete list of all 390+ entity types and 157+ presets, see the Entity & Preset Inventory.


Backend Request Limits

The Analyzer enforces server-side limits on every request to prevent resource exhaustion. These limits apply regardless of which client sends the request.

  Limit                               Value   Description
  Entity types per request            250     Maximum entity types in the filter list
  Ad-hoc recognizers per request      50      Maximum custom recognizers
  Patterns per recognizer             10      Maximum regex patterns per ad-hoc recognizer
  Context words per recognizer        30      Maximum context boost words
  Total ad-hoc patterns per request   200     Total regex compilation budget

Requests exceeding these limits receive a 422 validation error with a descriptive message. Clients can discover the current limits via the GET /limits endpoint.
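
A client can avoid a round trip by checking a request against these limits before sending it. The sketch below hard-codes the values listed above; in practice they would be read from GET /limits. The function name, the `patterns` and `context` keys, and the message wording are hypothetical, since the request schema is not described in this guide.

```python
LIMITS = {
    "entity_types": 250,
    "ad_hoc_recognizers": 50,
    "patterns_per_recognizer": 10,
    "context_words_per_recognizer": 30,
    "total_ad_hoc_patterns": 200,
}


def check_limits(entity_types, ad_hoc_recognizers, limits=LIMITS):
    """Return a list of limit violations (empty if the request fits)."""
    problems = []
    if len(entity_types) > limits["entity_types"]:
        problems.append(f"entity types: {len(entity_types)} > {limits['entity_types']}")
    if len(ad_hoc_recognizers) > limits["ad_hoc_recognizers"]:
        problems.append(
            f"ad-hoc recognizers: {len(ad_hoc_recognizers)} > {limits['ad_hoc_recognizers']}"
        )
    total_patterns = 0
    for i, rec in enumerate(ad_hoc_recognizers):
        patterns = rec.get("patterns", [])  # hypothetical field name
        context = rec.get("context", [])    # hypothetical field name
        total_patterns += len(patterns)
        if len(patterns) > limits["patterns_per_recognizer"]:
            problems.append(f"recognizer {i}: {len(patterns)} patterns")
        if len(context) > limits["context_words_per_recognizer"]:
            problems.append(f"recognizer {i}: {len(context)} context words")
    if total_patterns > limits["total_ad_hoc_patterns"]:
        problems.append(
            f"total ad-hoc patterns: {total_patterns} > {limits['total_ad_hoc_patterns']}"
        )
    return problems


# 25 recognizers with 10 patterns each pass every per-recognizer limit
# but exceed the total pattern budget of 200.
recognizers = [{"patterns": ["x"] * 10, "context": []} for _ in range(25)]
print(check_limits(["PERSON"], recognizers))
```

The example highlights that the total pattern budget can be exhausted well before the per-request recognizer limit is reached.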


Language Selection

Auto-Detect (Default)

The system identifies the primary language of the input text automatically. This works well for most cases.

Manual Selection

Choose a specific language to:

  • Load the correct NLP model for that language, improving name and location detection.
  • Activate language-specific context words for pattern recognizers.
  • Improve OCR accuracy when processing images.

What Language Affects

  Component          Effect of Language Selection
  NLP models         Loads the appropriate language model for name/location detection
  Context words      Activates language-specific context words that boost confidence
  OCR (images)       Sends a language hint to the OCR engine for better text extraction
  Pattern matching   Not affected -- patterns match data format regardless of language

Tips for Best Results

  1. Provide more context. Longer text gives the Analyzer more surrounding words to evaluate. A credit card number in isolation may score lower than one preceded by "Card Number:".

  2. Use the right preset. Country-specific presets enable only the relevant entity types, reducing false positives. If you are processing German documents, select the Germany preset.

  3. Check low-confidence detections. Entities with scores below 0.5 are more likely to be false positives. Review them before anonymizing.

  4. Select the correct language. If auto-detect picks the wrong language, manually select the correct one. This improves NLP detection significantly.

  5. Include headers and labels. Documents with clear labels like "Name:", "Address:", "Tax ID:" provide strong context words that boost confidence scores.

  6. Review before anonymizing. Always review detected entities and deselect any false positives before proceeding to anonymization.
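
If you post-process results in your own code, tip 3 can be turned into a small triage step that surfaces the weakest detections first. The dictionary shape and the function name below are assumptions for illustration; the 0.5 threshold comes from tip 3.

```python
REVIEW_THRESHOLD = 0.5  # from tip 3; adjust to your own risk tolerance


def split_for_review(detections, threshold=REVIEW_THRESHOLD):
    """Split detections into (likely_correct, review_first) by score."""
    likely_correct = [d for d in detections if d["score"] >= threshold]
    review_first = [d for d in detections if d["score"] < threshold]
    return likely_correct, review_first


detections = [
    {"entity_type": "CREDIT_CARD", "score": 0.95},   # illustrative values
    {"entity_type": "PHONE_NUMBER", "score": 0.4},
]
likely_correct, review_first = split_for_review(detections)
print([d["entity_type"] for d in review_first])
```

This only orders the review; per tip 6, the higher-scoring group still deserves a look before anonymization.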