System Architecture

Last Updated: 2026-02-12


Overview

cloak.business is built on Microsoft Presidio, an open-source SDK for PII detection and anonymization. The platform extends Presidio with 317 custom pattern recognizers, multilingual NLP models, image redaction, and a full suite of client applications.

The system follows a microservices architecture where each core capability runs as an independent service. This design allows services to be scaled, updated, and maintained independently.


Core Services

Analyzer Service

The Analyzer is the detection engine. It receives text and returns a list of detected PII entities with their types, positions, and confidence scores.

  • 317 pattern-based recognizers (regex) for structured data formats
  • NLP models (spaCy, Stanza NER, XLM-RoBERTa) for names, locations, and organizations
  • Context word analysis to refine confidence scores based on surrounding text
  • Backend-enforced request limits — caps on entity filters, ad-hoc recognizers, and regex patterns per request to prevent resource exhaustion
  • Supports 48 languages for detection
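
As a rough illustration of how a pattern recognizer and context word analysis can work together, the sketch below uses only the Python standard library. The regex, context words, scores, and window size are all assumptions chosen for the example; they are not the platform's actual recognizer definitions, and the code does not use Presidio's own API.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical recognizer definition: one regex, a base score, and context words.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
IBAN_CONTEXT = ("iban", "account", "bank")
BASE_SCORE = 0.5        # assumed score for a bare pattern match
CONTEXT_BOOST = 0.35    # assumed bonus when a context word is nearby
CONTEXT_WINDOW = 40     # number of characters inspected before each match


def detect_iban(text: str) -> list[Detection]:
    """Return IBAN-like matches, boosting the score when context words precede them."""
    results = []
    for match in IBAN_PATTERN.finditer(text):
        score = BASE_SCORE
        window = text[max(0, match.start() - CONTEXT_WINDOW):match.start()].lower()
        if any(word in window for word in IBAN_CONTEXT):
            score = min(1.0, score + CONTEXT_BOOST)
        results.append(Detection("IBAN_CODE", match.start(), match.end(), score))
    return results
```

Calling `detect_iban("Please wire it to IBAN DE89370400440532013000 today.")` yields one detection with the boosted score, while the same number without a nearby context word keeps the base score.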

Anonymizer Service

The Anonymizer takes detected entities and applies the chosen anonymization method:

  • Replace — Substitute with a type label (e.g., <PERSON>)
  • Redact — Remove entirely
  • Hash (SHA-256) — One-way cryptographic hash
  • Encrypt (AES-256-GCM) — Reversible encryption with session key
  • Mask — Partial character masking

The service also supports deanonymization for the reversible method (Encrypt), allowing authorized users to restore the original text within a session.
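
The non-reversible methods above can be sketched with the standard library alone. This is a minimal illustration, not the service's implementation: the function names, the `(entity_type, start, end)` tuple layout, and the choice to keep the last four characters when masking are assumptions. Encrypt is omitted because AES-256-GCM requires a third-party cryptography library.

```python
import hashlib


def replace(value: str, entity_type: str) -> str:
    return f"<{entity_type}>"          # substitute a type label


def redact(value: str, entity_type: str) -> str:
    return ""                          # remove the value entirely


def hash_sha256(value: str, entity_type: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask(value: str, entity_type: str) -> str:
    visible = 4                        # assumed: keep the last four characters
    hidden = max(0, len(value) - visible)
    return "*" * hidden + value[hidden:]


OPERATORS = {"replace": replace, "redact": redact, "hash": hash_sha256, "mask": mask}


def anonymize(text: str, entities: list[tuple[str, int, int]], method: str) -> str:
    """Apply one method to every (entity_type, start, end) span in the text."""
    operator = OPERATORS[method]
    # Work from the last span to the first so earlier offsets stay valid.
    for entity_type, start, end in sorted(entities, key=lambda e: e[1], reverse=True):
        text = text[:start] + operator(text[start:end], entity_type) + text[end:]
    return text
```

For example, `anonymize("Call Alice at 555-0100", [("PERSON", 5, 10), ("PHONE_NUMBER", 14, 22)], "replace")` returns `"Call <PERSON> at <PHONE_NUMBER>"`.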

Image Redactor Service

The Image Redactor processes images to find and redact PII:

  • Extracts text from images using OCR (37 Tesseract language packs)
  • Applies the same pattern recognizers used for text analysis
  • Draws colored bounding boxes over detected PII on the original image
  • Handles EXIF orientation correction for photos taken on mobile devices
  • Merges adjacent bounding boxes for multi-word entities
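
Merging adjacent bounding boxes can be illustrated as follows. The sketch assumes the OCR step returns word boxes for one entity in reading order, and it treats two boxes as adjacent when they sit on the same text line and the horizontal gap is small. The `Box` type, the half-height overlap rule, and the 15-pixel gap are hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int


def same_line(a: Box, b: Box) -> bool:
    """True when the vertical overlap covers at least half of the shorter box."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return overlap >= 0.5 * shorter


def merge_adjacent(boxes: list[Box], max_gap: int = 15) -> list[Box]:
    """Merge consecutive word boxes that share a line and are at most max_gap apart."""
    merged: list[Box] = []
    for box in boxes:  # boxes are assumed to arrive in reading order
        if merged:
            prev = merged[-1]
            if same_line(prev, box) and box.left - prev.right <= max_gap:
                # Grow the previous box to the union of both rectangles.
                merged[-1] = Box(
                    min(prev.left, box.left),
                    min(prev.top, box.top),
                    max(prev.right, box.right),
                    max(prev.bottom, box.bottom),
                )
                continue
        merged.append(box)
    return merged
```

With this rule, a two-word name on one line produces a single redaction rectangle, while a word on the next line keeps its own box.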

Structured Data Processor

Processes tabular and structured data formats (CSV, spreadsheets) by applying PII detection and anonymization to individual cells while preserving the data structure.
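
A cell-by-cell pass over CSV data might look like the sketch below. It is an illustration under stated assumptions: a single email regex stands in for the full detection engine, the header row is assumed to be left untouched, and the function names are hypothetical.

```python
import csv
import io
import re

# Stand-in for the real detection engine: one simple email pattern.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize_cell(value: str) -> str:
    return EMAIL.sub("<EMAIL_ADDRESS>", value)


def anonymize_csv(data: str) -> str:
    """Anonymize every data cell while keeping rows, columns, and quoting intact."""
    reader = csv.reader(io.StringIO(data, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader, None)
    if header is not None:
        writer.writerow(header)  # assumed: column names are not treated as PII
    for row in reader:
        writer.writerow([anonymize_cell(cell) for cell in row])
    return out.getvalue()
```

Because the `csv` module parses and re-serializes each row, cells that contain commas or quotes keep their position in the table after anonymization.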


Frontend Application

The web application is built with Next.js and provides:

  • Responsive design — Works on desktop, tablet, and mobile
  • 48 locale translations — Full UI in 48 languages with RTL support
  • Real-time analysis — Results appear as you type or upload
  • Interactive entity highlighting — Detected PII is visually highlighted with confidence scores
  • Configurable settings — Choose entity types, anonymization methods, confidence thresholds, and language

Detection Pipeline

When text is submitted for analysis, it passes through the following stages:

Input Text
    │
    ▼
┌─────────────────────┐
│  Language Detection  │  Identify text language for NLP model selection
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│    Tokenization      │  Break text into processable units
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Pattern Matching    │  Run 317 regex recognizers against text
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   NLP Processing     │  Run spaCy / Stanza / XLM-RoBERTa models
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Context Analysis    │  Check surrounding words to adjust confidence
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Confidence Scoring  │  Assign final confidence score to each entity
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Result Aggregation  │  Merge overlapping detections, deduplicate
└─────────┬───────────┘
          │
          ▼
    Detection Results
    (entity type, position, score)
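
The final stage of the pipeline above, result aggregation, can be sketched as a greedy overlap resolution. This is one plausible strategy, not necessarily the one the platform uses: exact duplicates are dropped, then the highest-scoring detection wins wherever spans overlap, with longer spans breaking ties. The `Detection` type and `aggregate` function are hypothetical names.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def aggregate(detections: list[Detection], threshold: float = 0.0) -> list[Detection]:
    """Deduplicate detections and keep the best one among overlapping spans."""
    # set() removes exact duplicates; rank by score, then by span length.
    ranked = sorted(
        set(detections),
        key=lambda d: (-d.score, -(d.end - d.start), d.start),
    )
    kept: list[Detection] = []
    for cand in ranked:
        if cand.score < threshold:
            continue
        # Keep the candidate only if it does not overlap anything already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda d: d.start)
```

The `threshold` parameter mirrors a configurable confidence threshold: detections scoring below it are discarded before overlaps are resolved.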

Data Flow

cloak.business is designed around a zero-storage principle:

  1. Input — The client sends text or an image to the service
  2. Processing — The service processes the input entirely in memory
  3. Response — Detection results (or the anonymized output) are returned to the client
  4. Disposal — No original text, no images, and no detection results are stored on the server after the response is sent

Documents are never written to disk, never logged, and never retained. The system processes data transiently and returns results immediately.


NLP Models

All NLP models are hosted on cloak.business's own servers in a German data center. No data is sent to external model providers.

| Model | Provider | Languages | Use Case |
|---|---|---|---|
| spaCy | Explosion AI | 25 | Named entity recognition (fast, general-purpose) |
| Stanza NER | Stanford NLP | 7 | High-accuracy NER for Arabic, Farsi, Hebrew, Hindi, Turkish, Ukrainian, Vietnamese |
| XLM-RoBERTa | Meta AI (model only) | 16 | Cross-lingual transformer for underserved languages |

Important: While these models were originally developed by their respective organizations, cloak.business runs them locally on its own infrastructure. No user data is transmitted to Meta, Stanford, Explosion AI, or any other third party.
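
Routing a detected language to one of the three model families could be sketched as a simple lookup. Only the Stanza language set comes from the table above (written here as ISO 639-1 codes). The spaCy and XLM-RoBERTa sets are illustrative placeholders, since the full language assignment is not listed in this document, and `select_model` is a hypothetical function name.

```python
# From the table above: the seven languages handled by Stanza NER.
STANZA_LANGUAGES = frozenset({"ar", "fa", "he", "hi", "tr", "uk", "vi"})

# Placeholder subsets for illustration only; the real lists are not documented here.
SPACY_LANGUAGES = frozenset({"en", "de", "fr"})
XLMR_LANGUAGES = frozenset({"sw", "ka"})


def select_model(language: str) -> str:
    """Return the model family assumed to handle the given language code."""
    code = language.strip().lower()
    if code in STANZA_LANGUAGES:
        return "stanza"
    if code in SPACY_LANGUAGES:
        return "spacy"
    if code in XLMR_LANGUAGES:
        return "xlm-roberta"
    raise ValueError(f"unsupported language: {language!r}")
```

Raising an error for unknown codes, rather than falling back to a default model, keeps unsupported languages from being silently analyzed with the wrong model.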


Client Applications

| Application | Technology | Description |
|---|---|---|
| Web App | Next.js | Full-featured browser interface |
| Desktop App | Tauri (Rust + Web) | Native app for Windows, macOS, Linux |
| Office Add-in | Office.js | Anonymize inside Word, Excel, PowerPoint |
| MCP Server | Model Context Protocol | AI tool integration (Claude Desktop, Cursor) |
| REST API | HTTP/JSON | Programmatic access for custom integrations |

All client applications connect to the same backend services, ensuring consistent detection and anonymization results regardless of which interface is used.