Structured Data Anonymization
Last Updated: 2026-02-16 Service Version: 4.19.0
Overview#
cloak.business supports anonymization of structured data formats including CSV files and JSON arrays. This allows you to process spreadsheets, database exports, and API responses while preserving data structure.
Key Benefits:
- Column-level control: Choose which columns to anonymize
- Preserve structure: Output maintains the same format as input
- Batch efficiency: Process thousands of rows in a single request
- Multiple operators: Apply different anonymization methods per column
Table of Contents#
- Supported Formats
- Web Interface
- API Usage
- Column Configuration
- CSV Processing
- JSON Processing
- Best Practices
- Troubleshooting
Supported Formats#
| Format | Extension | Max Size | Max Rows |
|---|---|---|---|
| CSV | .csv | 10 MB | 100,000 |
| JSON | .json | 10 MB | 100,000 |
| TSV | .tsv | 10 MB | 100,000 |
Encoding: UTF-8 recommended. Latin-1 and Windows-1252 are also supported.
Web Interface#
Processing CSV Files#
- Navigate to Dashboard > Structured Data
- Click Upload CSV or drag and drop your file
- Select columns to anonymize from the detected columns list
- Choose anonymization method for each column
- Click Process
- Download the anonymized CSV
Column Selection#
After upload, the interface shows:
- Column name: Detected from CSV header
- Sample values: First 3 values for identification
- Include toggle: Enable/disable processing for this column
- Method selector: Choose Replace, Redact, Hash, Mask, or Encrypt
API Usage#
Process JSON Data#
Endpoint: POST /api/presidio/structured/process
Process an array of JSON objects with column-level configuration.
Request:
curl -X POST https://cloak.business/api/presidio/structured/process \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"data": [
{"name": "John Doe", "email": "john@example.com", "notes": "Customer since 2020"},
{"name": "Jane Smith", "email": "jane@example.com", "notes": "VIP customer"}
],
"columns": [
{"column": "name", "entities": ["PERSON"], "operator": "replace"},
{"column": "email", "entities": ["EMAIL_ADDRESS"], "operator": "hash"}
],
"language": "en",
"score_threshold": 0.5
}'
Response:
{
"data": [
{"name": "<PERSON>", "email": "a1b2c3d4...", "notes": "Customer since 2020"},
{"name": "<PERSON>", "email": "e5f6a7b8...", "notes": "VIP customer"}
],
"stats": {
"rows_processed": 2,
"entities_found": 4,
"columns_processed": 2
},
"processing_time": 0.234
}
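The request above can also be sent from code. The sketch below uses only the Python standard library; the helper names are illustrative, while the endpoint, headers, and field names are taken from the curl example.

```python
import json
import urllib.request

API_URL = "https://cloak.business/api/presidio/structured/process"

def build_payload(data, columns, language="en", score_threshold=0.5):
    """Assemble the request body for the structured-process endpoint."""
    return {
        "data": data,
        "columns": columns,
        "language": language,
        "score_threshold": score_threshold,
    }

def process_structured(api_key, payload):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Separating payload construction from the HTTP call makes the column configuration easy to unit-test before sending real data.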
Process CSV File#
Endpoint: POST /api/presidio/structured/process-csv
Upload and process a CSV file directly.
Request:
curl -X POST https://cloak.business/api/presidio/structured/process-csv \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@customers.csv" \
-F "columns=name,email,phone" \
-F "language=en" \
-F "operator=replace" \
-F "score_threshold=0.5" \
--output anonymized_customers.csv
Response: The anonymized CSV file, returned as a file download.
Response Headers:
| Header | Description |
|---|---|
| X-Processing-Time | Processing duration in seconds |
| X-Entities-Found | Total PII entities detected |
| Content-Disposition | Suggested filename |
Column Configuration#
Column Config Object#
| Field | Type | Required | Description |
|---|---|---|---|
| column | string | Yes | Column name to process |
| entities | string[] | No | Entity types to detect (default: all) |
| operator | string | No | Anonymization method (default: replace) |
| operator_params | object | No | Method-specific parameters |
Operator Types#
| Operator | Description | Parameters |
|---|---|---|
| replace | Replace with placeholder | new_value: custom placeholder |
| redact | Remove entirely | None |
| hash | One-way hash (SHA-256 by default) | hash_type: sha256 or sha512 |
| mask | Partial masking | masking_char, chars_to_mask, from_end |
| encrypt | AES-256 encryption | key: encryption key |
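The mask and hash operators can be illustrated locally. The sketch below mirrors the parameter names from the table; it is an approximation of the behavior, not the service implementation.

```python
import hashlib

def mask(value: str, masking_char: str = "*", chars_to_mask: int = 4,
         from_end: bool = True) -> str:
    """Replace chars_to_mask characters with masking_char.
    from_end=True masks the trailing characters; False masks the leading ones."""
    n = min(chars_to_mask, len(value))
    if from_end:
        return value[:len(value) - n] + masking_char * n
    return masking_char * n + value[n:]

def hash_value(value: str, hash_type: str = "sha256") -> str:
    """One-way hash of the value, returned as lowercase hex."""
    return hashlib.new(hash_type, value.encode("utf-8")).hexdigest()

print(mask("4111222233334444", chars_to_mask=12, from_end=False))
# → ************4444
```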
Examples#
Replace with custom value:
{
"column": "ssn",
"entities": ["US_SSN"],
"operator": "replace",
"operator_params": {"new_value": "[SSN REMOVED]"}
}
Mask a 16-digit card number, keeping the last 4 digits:
{
"column": "credit_card",
"entities": ["CREDIT_CARD"],
"operator": "mask",
"operator_params": {
"masking_char": "*",
"chars_to_mask": 12,
"from_end": false
}
}
Hash email addresses:
{
"column": "email",
"entities": ["EMAIL_ADDRESS"],
"operator": "hash",
"operator_params": {"hash_type": "sha256"}
}
CSV Processing#
Input Requirements#
- Header row required: First row must contain column names
- Consistent columns: All rows must have the same number of columns
- Text encoding: UTF-8 recommended
- Quote handling: Standard CSV quoting with double quotes
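The requirements above can be checked locally before upload. A minimal sketch using Python's csv module (the function name is illustrative):

```python
import csv
import io

def validate_csv(text: str) -> list:
    """Return the header columns if the CSV is well-formed; raise
    ValueError for a missing header or inconsistent column counts."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header:
        raise ValueError("missing header row")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {line_no} has {len(row)} columns, expected {len(header)}"
            )
    return header
```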
Example Input CSV#
name,email,phone,address,notes
John Doe,john@example.com,555-123-4567,"123 Main St, City",Regular customer
Jane Smith,jane@example.com,555-987-6543,"456 Oak Ave, Town",VIP status
Processing Multiple Columns#
curl -X POST https://cloak.business/api/presidio/structured/process-csv \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@data.csv" \
-F "columns=name,email,phone,address" \
-F "operator=replace" \
--output anonymized.csv
Example Output CSV#
name,email,phone,address,notes
<PERSON>,<EMAIL_ADDRESS>,<PHONE_NUMBER>,<LOCATION>,Regular customer
<PERSON>,<EMAIL_ADDRESS>,<PHONE_NUMBER>,<LOCATION>,VIP status
JSON Processing#
Input Format#
JSON data must be an array of objects:
[
{"field1": "value1", "field2": "value2"},
{"field1": "value3", "field2": "value4"}
]
Processing Nested Objects#
For nested data, flatten before processing or process individual nested arrays:
Before:
{
"customer": {
"name": "John Doe",
"contact": {
"email": "john@example.com"
}
}
}
Flatten to:
{
"customer_name": "John Doe",
"customer_contact_email": "john@example.com"
}
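The flattening step above can be sketched as a small recursive helper, joining nested keys with an underscore (the function name and separator choice are illustrative):

```python
def flatten(obj: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into single-level keys joined with sep."""
    out = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            out.update(flatten(value, new_key, sep))
        else:
            out[new_key] = value
    return out

record = {"customer": {"name": "John Doe",
                       "contact": {"email": "john@example.com"}}}
print(flatten(record))
# → {'customer_name': 'John Doe', 'customer_contact_email': 'john@example.com'}
```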
Mixed Entity Types Per Column#
Different columns can have different entity configurations:
{
"data": [...],
"columns": [
{
"column": "full_name",
"entities": ["PERSON"]
},
{
"column": "contact_info",
"entities": ["EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"]
},
{
"column": "government_id",
"entities": ["US_SSN", "US_PASSPORT", "US_DRIVER_LICENSE"]
}
]
}
Best Practices#
1. Identify PII Columns First#
Before processing, analyze your data to identify which columns contain PII:
# Analyze first to see what entities exist
curl -X POST https://cloak.business/api/presidio/analyze \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"text": "Sample text from your data column"}'
2. Use Specific Entity Types#
Narrow entity detection to relevant types for faster processing:
{
"column": "ssn_field",
"entities": ["US_SSN"]
}
This is faster and produces fewer false positives than scanning for all entity types.
3. Set Appropriate Thresholds#
- High confidence (0.7+): Fewer false positives, may miss some PII
- Medium confidence (0.5): Balanced (recommended)
- Low confidence (0.3): Catches more, but more false positives
4. Test with Sample Data#
Process a small sample first to verify configuration:
{
"data": [{"name": "Test User", "email": "test@example.com"}],
"columns": [...]
}
5. Preserve Non-PII Columns#
Only include columns that need anonymization. Other columns pass through unchanged:
{
"columns": [
{"column": "name"},
{"column": "email"}
// "order_id", "product", "quantity" pass through unchanged
]
}
Troubleshooting#
Common Issues#
| Issue | Cause | Solution |
|---|---|---|
| "Column not found" | Column name mismatch | Check exact column name (case-sensitive) |
| Empty output | No PII detected | Lower score_threshold or check entity types |
| Slow processing | Large file | Process in batches of 10,000 rows |
| Encoding errors | Non-UTF-8 file | Convert to UTF-8 before upload |
| Missing header | No header row | Add header row to CSV |
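The batching advice above (splitting large files into chunks of at most 10,000 rows before submission) can be sketched as a simple slicing helper; the name and default are illustrative:

```python
def chunk_rows(rows: list, size: int = 10_000):
    """Yield successive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

batches = list(chunk_rows(list(range(25_000))))
print([len(b) for b in batches])  # → [10000, 10000, 5000]
```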
Error Responses#
400 Bad Request:
{
"error": "Invalid request",
"message": "Column 'customer_name' not found in data"
}
413 Payload Too Large:
{
"error": "File too large",
"message": "Maximum file size is 10 MB"
}
429 Rate Limited:
{
"error": "Rate limit exceeded",
"retry_after": 60
}
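A client can honor the retry_after field with a simple wait-and-retry loop. In the sketch below, `send` is a stand-in for whatever function performs the actual HTTP call and returns a (status, body) pair; the helper name is illustrative.

```python
import time

def send_with_retry(send, max_attempts: int = 3):
    """Call send(); on a 429 response, sleep for retry_after seconds
    and try again, up to max_attempts total attempts."""
    for _ in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(body.get("retry_after", 60))
    return status, body  # still rate-limited after max_attempts
```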
Debugging Tips#
- Check column names: Print headers with
head -1 file.csv - Verify encoding: Use
file -i file.csvto check encoding - Test single row: Process one row first to validate config
- Check entity coverage: Ensure your preset includes expected entity types
Token Cost#
Structured data processing uses the same token calculation as text analysis:
- Base cost: 1 token per column per row
- Entity cost: +0.5 token per entity found
Example:
- 1,000 rows, 3 columns, 500 entities found
- Cost: (1,000 × 3) + (500 × 0.5) = 3,250 tokens
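The cost formula above can be written as a one-line helper for pre-flight estimates (the function name is illustrative):

```python
def token_cost(rows: int, columns: int, entities_found: int) -> float:
    """Tokens = (rows x columns processed) + 0.5 per entity found."""
    return rows * columns + 0.5 * entities_found

print(token_cost(1_000, 3, 500))  # → 3250.0
```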
Related Documentation#
- API Reference - Complete API documentation
- Entity Inventory - All 390+ entity types
- Batch Processing - Batch text analysis
Document maintained by cloak.business