System Architecture

Last Updated: 2026-02-12

Overview

cloak.business is built on Microsoft Presidio, an open-source SDK for PII detection and anonymization. The platform extends Presidio with 317 custom pattern recognizers, multilingual NLP models, image redaction, and a full suite of client applications.

The system follows a microservices architecture where each core capability runs as an independent service. This design allows services to be scaled, updated, and maintained independently.

Core Services

Analyzer Service

The Analyzer is the detection engine. It receives text and returns a list of detected PII entities with their types, positions, and confidence scores.

317 pattern-based recognizers (regex) for structured data formats
NLP models (spaCy, Stanza NER, XLM-RoBERTa) for names, locations, and organizations
Context word analysis to refine confidence scores based on surrounding text
Backend-enforced request limits — caps on entity filters, ad-hoc recognizers, and regex patterns per request to prevent resource exhaustion
Supports 48 languages for detection
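
To make the combination of pattern matching and context word analysis concrete, here is a minimal sketch in plain Python. It is not Presidio's own recognizer API and not one of the 317 production recognizers: the simplified IBAN regex, the context word list, the score values, and the 40-character window are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical recognizer definition: one regex, a base score, context words.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
IBAN_CONTEXT = {"iban", "account", "bank"}
BASE_SCORE = 0.5        # assumed score for a bare regex match
CONTEXT_BOOST = 0.35    # assumed bonus when a context word is nearby
CONTEXT_WINDOW = 40     # characters inspected before each match


def find_ibans(text: str) -> list[Detection]:
    """Return IBAN-like matches, boosting the score when context words precede them."""
    detections = []
    for match in IBAN_PATTERN.finditer(text):
        window = text[max(0, match.start() - CONTEXT_WINDOW):match.start()].lower()
        words = set(re.findall(r"[a-z]+", window))
        score = BASE_SCORE
        if words & IBAN_CONTEXT:
            score = min(1.0, BASE_SCORE + CONTEXT_BOOST)
        detections.append(Detection("IBAN_CODE", match.start(), match.end(), score))
    return detections
```

Calling `find_ibans("Please wire it to IBAN DE89370400440532013000 today.")` yields one detection whose score includes the boost, because "iban" appears in the window before the match. The same number without a nearby context word keeps the base score.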

Anonymizer Service

The Anonymizer takes detected entities and applies the chosen anonymization method:

Replace — Substitute with a type label (e.g., <PERSON>)
Redact — Remove entirely
Hash (SHA-256) — One-way cryptographic hash
Encrypt (AES-256-GCM) — Reversible encryption with session key
Mask — Partial character masking

The service also supports deanonymization for reversible methods (Encrypt), allowing authorized users to restore the original text within a session.
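
The following sketch shows how four of the methods above could be applied to detected spans. It is an illustration under stated assumptions, not the service's implementation: the `Entity` type, the function names, and the choice to keep the last four characters when masking are hypothetical. Encrypt is left out because AES-256-GCM requires a third-party cryptography library, which would make the example less self-contained.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entity:
    entity_type: str
    start: int
    end: int


def apply_operator(value: str, entity_type: str, method: str) -> str:
    """Return the replacement text for one detected value."""
    if method == "replace":
        return f"<{entity_type}>"
    if method == "redact":
        return ""
    if method == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if method == "mask":
        # Assumed policy: keep the last four characters of longer values visible.
        visible = 4 if len(value) > 4 else 0
        return "*" * (len(value) - visible) + value[len(value) - visible:]
    raise ValueError(f"unsupported method: {method}")


def anonymize(text: str, entities: list[Entity], method: str) -> str:
    """Apply one method to every entity, assuming the spans do not overlap."""
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        replacement = apply_operator(text[ent.start:ent.end], ent.entity_type, method)
        text = text[:ent.start] + replacement + text[ent.end:]
    return text
```

Processing spans from right to left is the key design choice here: each replacement changes the length of the text, so editing from the end keeps the remaining offsets correct without recomputing them.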

Image Redactor Service

The Image Redactor processes images to find and redact PII:

Extracts text from images using OCR (37 Tesseract language packs)
Applies the same pattern recognizers used for text analysis
Draws colored bounding boxes over detected PII on the original image
Handles EXIF orientation correction for photos taken on mobile devices
Merges adjacent bounding boxes for multi-word entities
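
The merging of adjacent bounding boxes can be sketched as follows. This is a simplified illustration, not the service's code: the `Box` type, the 15-pixel gap threshold, and the rule that only boxes of the same entity type on the same text line are merged are assumptions. It also assumes the boxes arrive in reading order.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: int
    top: int
    width: int
    height: int
    entity_type: str


def _same_line(a: Box, b: Box) -> bool:
    # Two boxes share a line when their vertical extents overlap.
    return min(a.top + a.height, b.top + b.height) - max(a.top, b.top) > 0


def _horizontal_gap(a: Box, b: Box) -> int:
    # Negative when the boxes overlap horizontally.
    return max(a.left, b.left) - min(a.left + a.width, b.left + b.width)


def merge_adjacent(boxes: list[Box], max_gap: int = 15) -> list[Box]:
    """Merge each box into the previous one when both belong to the same entity run."""
    merged: list[Box] = []
    for box in boxes:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.entity_type == box.entity_type
            and _same_line(prev, box)
            and _horizontal_gap(prev, box) <= max_gap
        ):
            left = min(prev.left, box.left)
            top = min(prev.top, box.top)
            right = max(prev.left + prev.width, box.left + box.width)
            bottom = max(prev.top + prev.height, box.top + box.height)
            merged[-1] = Box(left, top, right - left, bottom - top, prev.entity_type)
        else:
            merged.append(box)
    return merged
```

With this approach a two-word name such as "John Smith" is covered by one rectangle instead of two, so the space between the words is not left visible.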

Structured Data Processor

Processes tabular and structured data formats (CSV, spreadsheets) by applying PII detection and anonymization to individual cells while preserving the data structure.
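
A minimal sketch of cell-level processing for CSV input, using only the Python standard library. The `anonymize_cell` function is a stand-in for a call to the detection and anonymization services. Its single email regex is an assumption made to keep the example runnable.

```python
import csv
import io
import re

# Illustrative pattern only; the real service applies its full recognizer set.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize_cell(value: str) -> str:
    """Stand-in for per-cell detection and anonymization."""
    return EMAIL.sub("<EMAIL_ADDRESS>", value)


def anonymize_csv(data: str) -> str:
    """Anonymize every cell while keeping the row and column layout unchanged."""
    reader = csv.reader(io.StringIO(data))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([anonymize_cell(cell) for cell in row])
    return out.getvalue()
```

Parsing with a CSV reader rather than running patterns over the raw file matters for structure preservation: quoted cells that contain commas or line breaks stay intact, and each replacement is confined to the cell it was found in.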

Frontend Application

The web application is built with Next.js and provides:

Responsive design — Works on desktop, tablet, and mobile
48 locale translations — Full UI in 48 languages with RTL support
Real-time analysis — Results appear as you type or upload
Interactive entity highlighting — Detected PII is visually highlighted with confidence scores
Configurable settings — Choose entity types, anonymization methods, confidence thresholds, and language

Detection Pipeline

When text is submitted for analysis, it passes through the following stages:

Input Text
    │
    ▼
┌─────────────────────┐
│  Language Detection  │  Identify text language for NLP model selection
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│    Tokenization      │  Break text into processable units
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Pattern Matching    │  Run 317 regex recognizers against text
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│   NLP Processing     │  Run spaCy / Stanza / XLM-RoBERTa models
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Context Analysis    │  Check surrounding words to adjust confidence
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Confidence Scoring  │  Assign final confidence score to each entity
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Result Aggregation  │  Merge overlapping detections, deduplicate
└─────────┬───────────┘
          │
          ▼
    Detection Results
    (entity type, position, score)
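
The final Result Aggregation stage can be illustrated with a short sketch. The exact resolution rules used by the platform are not described here, so the policy below is an assumption: exact duplicates are dropped, and when two detections overlap, the one with the higher score wins, with ties going to the longer span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


def aggregate(detections: list[Detection]) -> list[Detection]:
    """Drop duplicates and resolve overlaps in favor of the stronger detection."""
    # set() removes exact duplicates, for example the same span reported by
    # both a regex recognizer and an NLP model.
    ranked = sorted(
        set(detections),
        key=lambda d: (-d.score, -(d.end - d.start), d.start),
    )
    kept: list[Detection] = []
    for cand in ranked:
        # Keep a candidate only if it does not overlap anything already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda d: d.start)
```

The output is sorted by position, which matches the shape of the detection results shown at the end of the pipeline: entity type, position, and score for each surviving entity.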

Data Flow

cloak.business is designed around a zero-storage principle:

Input — The client sends text or an image to the service
Processing — The service processes the input entirely in memory
Response — Detection results (or the anonymized output) are returned to the client
Disposal — No original text, no images, and no detection results are stored on the server after the response is sent

Documents are never written to disk, never logged, and never retained. The system processes data transiently and returns results immediately.

NLP Models

All NLP models are hosted on cloak.business's own servers in a German data center. No data is sent to external model providers.

Model	Provider	Languages	Use Case
spaCy	Explosion AI	25	Named entity recognition — fast, general-purpose
Stanza NER	Stanford NLP	7	High-accuracy NER for Arabic, Farsi, Hebrew, Hindi, Turkish, Ukrainian, Vietnamese
XLM-RoBERTa	Meta AI (model only)	16	Cross-lingual transformer for underserved languages

Important: While these models were originally developed by their respective organizations, cloak.business runs them locally on its own infrastructure. No user data is transmitted to Meta, Stanford, Explosion AI, or any other third party.
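
As a sketch of how language-based model selection might be wired up, the snippet below builds a lookup from language code to model family. Only the seven Stanza languages are named in the table, so they are the only ones hard-coded. The spaCy and XLM-RoBERTa language lists are passed in by the caller because they are not given here. The language counts in the table sum to 48, which suggests, but does not confirm, that each language is served by exactly one model. The sketch assumes that and rejects double assignments. All function names are hypothetical.

```python
from typing import Iterable

# ISO 639-1 codes for the seven Stanza languages named in the table:
# Arabic, Farsi, Hebrew, Hindi, Turkish, Ukrainian, Vietnamese.
STANZA_LANGUAGES = ("ar", "fa", "he", "hi", "tr", "uk", "vi")


def build_router(
    spacy_languages: Iterable[str],
    xlmr_languages: Iterable[str],
) -> dict[str, str]:
    """Map each language code to one model family, rejecting double assignments."""
    router: dict[str, str] = {}
    for model, languages in (
        ("spacy", spacy_languages),
        ("stanza", STANZA_LANGUAGES),
        ("xlm-roberta", xlmr_languages),
    ):
        for code in languages:
            if code in router:
                raise ValueError(
                    f"{code} is assigned to both {router[code]} and {model}"
                )
            router[code] = model
    return router


def select_model(router: dict[str, str], language: str) -> str:
    """Return the model family for a detected language code."""
    try:
        return router[language]
    except KeyError:
        raise ValueError(f"no NER model configured for language {language!r}") from None
```

A router built this way would be consulted right after language detection, so that only the model responsible for the detected language is invoked for a given request.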

Client Applications

Application	Technology	Description
Web App	Next.js	Full-featured browser interface
Desktop App	Tauri (Rust + Web)	Native app for Windows, macOS, Linux
Office Add-in	Office.js	Anonymize inside Word, Excel, PowerPoint
MCP Server	Model Context Protocol	AI tool integration (Claude Desktop, Cursor)
REST API	HTTP/JSON	Programmatic access for custom integrations

All client applications connect to the same backend services, ensuring consistent detection and anonymization results regardless of which interface is used.