Structured Data Anonymization

Last Updated: 2026-02-16
Service Version: 4.19.0


Overview#

cloak.business supports anonymization of structured data formats including CSV files and JSON arrays. This allows you to process spreadsheets, database exports, and API responses while preserving data structure.

Key Benefits:

  • Column-level control: Choose which columns to anonymize
  • Preserve structure: Output maintains the same format as input
  • Batch efficiency: Process thousands of rows in a single request
  • Multiple operators: Apply different anonymization methods per column

Table of Contents#

  1. Supported Formats
  2. Web Interface
  3. API Usage
  4. Column Configuration
  5. CSV Processing
  6. JSON Processing
  7. Best Practices
  8. Troubleshooting

Supported Formats#

Format | Extension | Max Size | Max Rows
CSV    | .csv      | 10 MB    | 100,000
JSON   | .json     | 10 MB    | 100,000
TSV    | .tsv      | 10 MB    | 100,000

Encoding: UTF-8 recommended. Latin-1 and Windows-1252 are also supported.


Web Interface#

Processing CSV Files#

  1. Navigate to Dashboard > Structured Data
  2. Click Upload CSV or drag and drop your file
  3. Select columns to anonymize from the detected columns list
  4. Choose anonymization method for each column
  5. Click Process
  6. Download the anonymized CSV

Column Selection#

After upload, the interface shows:

  • Column name: Detected from CSV header
  • Sample values: First 3 values for identification
  • Include toggle: Enable/disable processing for this column
  • Method selector: Choose Replace, Redact, Hash, Mask, or Encrypt

API Usage#

Process JSON Data#

Endpoint: POST /api/presidio/structured/process

Process an array of JSON objects with column-level configuration.

Request:

curl -X POST https://cloak.business/api/presidio/structured/process \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {"name": "John Doe", "email": "john@example.com", "notes": "Customer since 2020"},
      {"name": "Jane Smith", "email": "jane@example.com", "notes": "VIP customer"}
    ],
    "columns": [
      {"column": "name", "entities": ["PERSON"], "operator": "replace"},
      {"column": "email", "entities": ["EMAIL_ADDRESS"], "operator": "hash"}
    ],
    "language": "en",
    "score_threshold": 0.5
  }'

Response:

{
  "data": [
    {"name": "<PERSON>", "email": "a1b2c3d4...", "notes": "Customer since 2020"},
    {"name": "<PERSON>", "email": "e5f6g7h8...", "notes": "VIP customer"}
  ],
  "stats": {
    "rows_processed": 2,
    "entities_found": 4,
    "columns_processed": 2
  },
  "processing_time": 0.234
}
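
If you are calling the API from code rather than curl, the same request works with any HTTP client. A minimal Python sketch using the third-party requests library (the payload mirrors the curl example above, shortened to one row; the API key is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "data": [
        {"name": "John Doe", "email": "john@example.com", "notes": "Customer since 2020"}
    ],
    "columns": [
        {"column": "name", "entities": ["PERSON"], "operator": "replace"},
        {"column": "email", "entities": ["EMAIL_ADDRESS"], "operator": "hash"}
    ],
    "language": "en",
    "score_threshold": 0.5
}

response = requests.post(
    "https://cloak.business/api/presidio/structured/process",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # json= also sets Content-Type: application/json
    timeout=60,
)
response.raise_for_status()
result = response.json()
print(result["stats"])  # e.g. {"rows_processed": 1, ...}
print(result["data"])   # anonymized rows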

Process CSV File#

Endpoint: POST /api/presidio/structured/process-csv

Upload and process a CSV file directly.

Request:

curl -X POST https://cloak.business/api/presidio/structured/process-csv \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@customers.csv" \
  -F "columns=name,email,phone" \
  -F "language=en" \
  -F "operator=replace" \
  -F "score_threshold=0.5" \
  --output anonymized_customers.csv

Response: Binary CSV file with anonymized data.

Response Headers:

Header              | Description
X-Processing-Time   | Processing duration in seconds
X-Entities-Found    | Total PII entities detected
Content-Disposition | Suggested filename
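
The CSV endpoint can be driven the same way from code. A minimal Python sketch using the requests library that uploads a file, saves the anonymized output, and reads the response headers listed above (file names and the API key are placeholders):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder

with open("customers.csv", "rb") as f:
    response = requests.post(
        "https://cloak.business/api/presidio/structured/process-csv",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("customers.csv", f, "text/csv")},
        data={
            "columns": "name,email,phone",
            "language": "en",
            "operator": "replace",
            "score_threshold": "0.5",
        },
        timeout=300,
    )

response.raise_for_status()

# Save the anonymized CSV returned in the response body
with open("anonymized_customers.csv", "wb") as out:
    out.write(response.content)

# Processing metadata is exposed via the response headers
print("Processing time:", response.headers.get("X-Processing-Time"))
print("Entities found:", response.headers.get("X-Entities-Found"))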

Column Configuration#

Column Config Object#

Field           | Type     | Required | Description
column          | string   | Yes      | Column name to process
entities        | string[] | No       | Entity types to detect (default: all)
operator        | string   | No       | Anonymization method (default: replace)
operator_params | object   | No       | Method-specific parameters

Operator Types#

Operator | Description              | Parameters
replace  | Replace with placeholder | new_value: custom placeholder
redact   | Remove entirely          | None
hash     | SHA-256 hash             | hash_type: sha256 or sha512
mask     | Partial masking          | masking_char, chars_to_mask, from_end
encrypt  | AES-256 encryption       | key: encryption key

Examples#

Replace with custom value:

{
  "column": "ssn",
  "entities": ["US_SSN"],
  "operator": "replace",
  "operator_params": {"new_value": "[SSN REMOVED]"}
}

Mask keeping last 4 characters:

{
  "column": "credit_card",
  "entities": ["CREDIT_CARD"],
  "operator": "mask",
  "operator_params": {
    "masking_char": "*",
    "chars_to_mask": 12,
    "from_end": false
  }
}

Hash email addresses:

{
  "column": "email",
  "entities": ["EMAIL_ADDRESS"],
  "operator": "hash",
  "operator_params": {"hash_type": "sha256"}
}

CSV Processing#

Input Requirements#

  • Header row required: First row must contain column names
  • Consistent columns: All rows must have the same number of columns
  • Text encoding: UTF-8 recommended
  • Quote handling: Standard CSV quoting with double quotes
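
Before uploading, you may want to confirm a file meets these requirements. A minimal sketch using Python's standard csv module (the file name is a placeholder and the checks are illustrative, not exhaustive):

import csv

def check_csv(path: str) -> None:
    # A UnicodeDecodeError here indicates the file is not valid UTF-8
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header or not any(cell.strip() for cell in header):
            raise ValueError("Missing header row")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                raise ValueError(
                    f"Row {line_no} has {len(row)} columns, expected {len(header)}"
                )
    print(f"OK: {len(header)} columns: {header}")

check_csv("customers.csv")  # placeholder file name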

Example Input CSV#

name,email,phone,address,notes
John Doe,john@example.com,555-123-4567,"123 Main St, City",Regular customer
Jane Smith,jane@example.com,555-987-6543,"456 Oak Ave, Town",VIP status

Processing Multiple Columns#

curl -X POST https://cloak.business/api/presidio/structured/process-csv \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@data.csv" \
  -F "columns=name,email,phone,address" \
  -F "operator=replace" \
  --output anonymized.csv

Example Output CSV#

name,email,phone,address,notes
<PERSON>,<EMAIL_ADDRESS>,<PHONE_NUMBER>,<LOCATION>,Regular customer
<PERSON>,<EMAIL_ADDRESS>,<PHONE_NUMBER>,<LOCATION>,VIP status

JSON Processing#

Input Format#

JSON data must be an array of objects:

[
  {"field1": "value1", "field2": "value2"},
  {"field1": "value3", "field2": "value4"}
]

Processing Nested Objects#

For nested data, flatten before processing or process individual nested arrays:

Before:

{
  "customer": {
    "name": "John Doe",
    "contact": {
      "email": "john@example.com"
    }
  }
}

Flatten to:

{
  "customer_name": "John Doe",
  "customer_contact_email": "john@example.com"
}
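
One way to produce this flattened form is a small recursive helper. A Python sketch that joins nested keys with underscores (the naming convention is a choice, not an API requirement):

def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into a single level, joining keys with sep."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

nested = {"customer": {"name": "John Doe", "contact": {"email": "john@example.com"}}}
print(flatten(nested))
# {'customer_name': 'John Doe', 'customer_contact_email': 'john@example.com'}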

Mixed Entity Types Per Column#

Different columns can have different entity configurations:

{
  "data": [...],
  "columns": [
    {
      "column": "full_name",
      "entities": ["PERSON"]
    },
    {
      "column": "contact_info",
      "entities": ["EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"]
    },
    {
      "column": "government_id",
      "entities": ["US_SSN", "US_PASSPORT", "US_DRIVER_LICENSE"]
    }
  ]
}

Best Practices#

1. Identify PII Columns First#

Before processing, analyze your data to identify which columns contain PII:

# Analyze first to see what entities exist
curl -X POST https://cloak.business/api/presidio/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Sample text from your data column"}'

2. Use Specific Entity Types#

Narrow entity detection to relevant types for faster processing:

{
  "column": "ssn_field",
  "entities": ["US_SSN"]
}

Restricting detection to specific types is faster and produces fewer false positives than scanning for all entity types.

3. Set Appropriate Thresholds#

  • High confidence (0.7+): Fewer false positives, may miss some PII
  • Medium confidence (0.5): Balanced (recommended)
  • Low confidence (0.3): Catches more, but more false positives

4. Test with Sample Data#

Process a small sample first to verify configuration:

{
  "data": [{"name": "Test User", "email": "test@example.com"}],
  "columns": [...]
}

5. Preserve Non-PII Columns#

Only include columns that need anonymization. Other columns pass through unchanged:

{
  "columns": [
    {"column": "name"},
    {"column": "email"}
  ]
}

Here, order_id, product, and quantity are not listed, so they are returned unchanged.

Troubleshooting#

Common Issues#

Issue              | Cause                | Solution
"Column not found" | Column name mismatch | Check exact column name (case-sensitive)
Empty output       | No PII detected      | Lower score_threshold or check entity types
Slow processing    | Large file           | Process in batches of 10,000 rows
Encoding errors    | Non-UTF-8 file       | Convert to UTF-8 before upload
Missing header     | No header row        | Add header row to CSV
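
For large inputs, batching keeps each request well under the size and row limits noted above. A Python sketch that splits a list of records into batches of 10,000 and processes each batch with the JSON endpoint (payload fields follow the earlier examples; the API key is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
BATCH_SIZE = 10_000

def process_in_batches(rows: list, columns: list) -> list:
    """Send rows to the structured endpoint in fixed-size batches."""
    anonymized = []
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        resp = requests.post(
            "https://cloak.business/api/presidio/structured/process",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"data": batch, "columns": columns, "language": "en"},
            timeout=300,
        )
        resp.raise_for_status()
        anonymized.extend(resp.json()["data"])
    return anonymized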

Error Responses#

400 Bad Request:

{
  "error": "Invalid request",
  "message": "Column 'customer_name' not found in data"
}

413 Payload Too Large:

{
  "error": "File too large",
  "message": "Maximum file size is 10 MB"
}

429 Rate Limited:

{
  "error": "Rate limit exceeded",
  "retry_after": 60
}
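
When a request is rate limited, the retry_after value indicates how many seconds to wait before retrying. A minimal Python retry wrapper, assuming the JSON body shown above (it can wrap any of the POST calls in this section):

import time
import requests

def post_with_retry(url: str, max_attempts: int = 5, **kwargs):
    """POST and retry on 429 responses, waiting for the retry_after hint."""
    for _ in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(response.json().get("retry_after", 60))
    raise RuntimeError("Rate limited after repeated attempts")

# Example: post_with_retry(url, headers=headers, json=payload, timeout=60)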

Debugging Tips#

  1. Check column names: Print headers with head -1 file.csv
  2. Verify encoding: Use file -i file.csv to check encoding
  3. Test single row: Process one row first to validate config
  4. Check entity coverage: Ensure your preset includes expected entity types

Token Cost#

Structured data processing uses the same token calculation as text analysis:

  • Base cost: 1 token per column per row
  • Entity cost: +0.5 token per entity found

Example:

  • 1,000 rows, 3 columns, 500 entities found
  • Cost: (1,000 × 3) + (500 × 0.5) = 3,250 tokens
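
The same calculation can be written as a small helper for estimating cost up front; entity counts are only known after processing, so treat that input as an estimate:

def estimate_tokens(rows: int, columns: int, entities_found: int) -> float:
    """1 token per column per row, plus 0.5 token per entity found."""
    return rows * columns + 0.5 * entities_found

print(estimate_tokens(1_000, 3, 500))  # 3250.0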


Document maintained by cloak.business