Structured Data Anonymization

Last Updated: 2026-02-16
Service Version: 4.19.0


Overview#

cloak.business supports anonymization of structured data formats including CSV files and JSON arrays. This allows you to process spreadsheets, database exports, and API responses while preserving data structure.

Key Benefits:

  • Column-level control: Choose which columns to anonymize
  • Preserve structure: Output maintains the same format as input
  • Batch efficiency: Process thousands of rows in a single request
  • Multiple operators: Apply different anonymization methods per column

Table of Contents#

  1. Supported Formats
  2. Web Interface
  3. API Usage
  4. Column Configuration
  5. CSV Processing
  6. JSON Processing
  7. Best Practices
  8. Troubleshooting

Supported Formats#

Format | Extension | Max Size | Max Rows
CSV    | .csv      | 10 MB    | 100,000
JSON   | .json     | 10 MB    | 100,000
TSV    | .tsv      | 10 MB    | 100,000

Encoding: UTF-8 recommended. Latin-1 and Windows-1252 are also supported.


Web Interface#

Processing CSV Files#

  1. Navigate to Dashboard > Structured Data
  2. Click Upload CSV or drag and drop your file
  3. Select columns to anonymize from the detected columns list
  4. Choose anonymization method for each column
  5. Click Process
  6. Download the anonymized CSV

Column Selection#

After upload, the interface shows:

  • Column name: Detected from CSV header
  • Sample values: First 3 values for identification
  • Include toggle: Enable/disable processing for this column
  • Method selector: Choose Replace, Redact, Hash, Mask, or Encrypt

API Usage#

Process JSON Data#

Endpoint: POST /api/presidio/structured/process

Process an array of JSON objects with column-level configuration.

Request:

curl -X POST https://cloak.business/api/presidio/structured/process \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {"name": "John Doe", "email": "john@example.com", "notes": "Customer since 2020"},
      {"name": "Jane Smith", "email": "jane@example.com", "notes": "VIP customer"}
    ],
    "columns": [
      {"column": "name", "entities": ["PERSON"], "operator": "replace"},
      {"column": "email", "entities": ["EMAIL_ADDRESS"], "operator": "hash"}
    ],
    "language": "en",
    "score_threshold": 0.5
  }'

Response:

{
  "data": [
    {"name": "<PERSON>", "email": "a1b2c3d4...", "notes": "Customer since 2020"},
    {"name": "<PERSON>", "email": "e5f6g7h8...", "notes": "VIP customer"}
  ],
  "stats": {
    "rows_processed": 2,
    "entities_found": 4,
    "columns_processed": 2
  },
  "processing_time": 0.234
}
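
If you are calling the API from code rather than curl, the same request works with any HTTP client. A minimal Python sketch using the third-party requests library (the payload mirrors the curl example above, shortened to one row; the API key is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "data": [
        {"name": "John Doe", "email": "john@example.com", "notes": "Customer since 2020"}
    ],
    "columns": [
        {"column": "name", "entities": ["PERSON"], "operator": "replace"},
        {"column": "email", "entities": ["EMAIL_ADDRESS"], "operator": "hash"}
    ],
    "language": "en",
    "score_threshold": 0.5
}

response = requests.post(
    "https://cloak.business/api/presidio/structured/process",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # json= also sets Content-Type: application/json
    timeout=60,
)
response.raise_for_status()
result = response.json()
print(result["stats"])  # e.g. {"rows_processed": 1, ...}
print(result["data"])   # anonymized rows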

Process CSV File#

Endpoint: POST /api/presidio/structured/process-csv

Upload and process a CSV file directly.

Request:

curl -X POST https://cloak.business/api/presidio/structured/process-csv \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@customers.csv" \
  -F "columns=name,email,phone" \
  -F "language=en" \
  -F "operator=replace" \
  -F "score_threshold=0.5" \
  --output anonymized_customers.csv

Response: Binary CSV file with anonymized data.

Response Headers:

Header              | Description
X-Processing-Time   | Processing duration in seconds
X-Entities-Found    | Total PII entities detected
Content-Disposition | Suggested filename
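
The CSV endpoint can be driven the same way from code. A minimal Python sketch using the requests library that uploads a file, saves the anonymized output, and reads the response headers listed above (file names and the API key are placeholders):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder

with open("customers.csv", "rb") as f:
    response = requests.post(
        "https://cloak.business/api/presidio/structured/process-csv",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("customers.csv", f, "text/csv")},
        data={
            "columns": "name,email,phone",
            "language": "en",
            "operator": "replace",
            "score_threshold": "0.5",
        },
        timeout=300,
    )

response.raise_for_status()

# Save the anonymized CSV returned in the response body
with open("anonymized_customers.csv", "wb") as out:
    out.write(response.content)

# Processing metadata is exposed via the response headers
print("Processing time:", response.headers.get("X-Processing-Time"))
print("Entities found:", response.headers.get("X-Entities-Found"))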

Column Configuration#

Column Config Object#

Field           | Type     | Required | Description
column          | string   | Yes      | Column name to process
entities        | string[] | No       | Entity types to detect (default: all)
operator        | string   | No       | Anonymization method (default: replace)
operator_params | object   | No       | Method-specific parameters

Operator Types#

Operator | Description              | Parameters
replace  | Replace with placeholder | new_value: custom placeholder
redact   | Remove entirely          | None
hash     | SHA-256 hash             | hash_type: sha256 or sha512
mask     | Partial masking          | masking_char, chars_to_mask, from_end
encrypt  | AES-256 encryption       | key: encryption key

Examples#

Replace with custom value:

{
  "column": "ssn",
  "entities": ["US_SSN"],
  "operator": "replace",
  "operator_params": {"new_value": "[SSN REMOVED]"}
}

Mask keeping last 4 characters:

{
  "column": "credit_card",
  "entities": ["CREDIT_CARD"],
  "operator": "mask",
  "operator_params": {
    "masking_char": "*",
    "chars_to_mask": 12,
    "from_end": false
  }
}

Hash email addresses:

{
  "column": "email",
  "entities": ["EMAIL_ADDRESS"],
  "operator": "hash",
  "operator_params": {"hash_type": "sha256"}
}

CSV Processing#

Input Requirements#

  • Header row required: First row must contain column names
  • Consistent columns: All rows must have the same number of columns
  • Text encoding: UTF-8 recommended
  • Quote handling: Standard CSV quoting with double quotes
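
Before uploading, you may want to confirm a file meets these requirements. A minimal sketch using Python's standard csv module (the file name is a placeholder and the checks are illustrative, not exhaustive):

import csv

def check_csv(path: str) -> None:
    # A UnicodeDecodeError here indicates the file is not valid UTF-8
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header or not any(cell.strip() for cell in header):
            raise ValueError("Missing header row")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                raise ValueError(
                    f"Row {line_no} has {len(row)} columns, expected {len(header)}"
                )
    print(f"OK: {len(header)} columns: {header}")

check_csv("customers.csv")  # placeholder file name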

Example Input CSV#

name,email,phone,address,notes
John Doe,john@example.com,555-123-4567,"123 Main St, City",Regular customer
Jane Smith,jane@example.com,555-987-6543,"456 Oak Ave, Town",VIP status

Processing Multiple Columns#

curl -X POST https://cloak.business/api/presidio/structured/process-csv \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@data.csv" \
  -F "columns=name,email,phone,address" \
  -F "operator=replace" \
  --output anonymized.csv

Example Output CSV#

name,email,phone,address,notes
<PERSON>,<EMAIL_ADDRESS>,<PHONE_NUMBER>,<LOCATION>,Regular customer
<PERSON>,<EMAIL_ADDRESS>,<PHONE_NUMBER>,<LOCATION>,VIP status

JSON Processing#

Input Format#

JSON data must be an array of objects:

[
  {"field1": "value1", "field2": "value2"},
  {"field1": "value3", "field2": "value4"}
]

Processing Nested Objects#

For nested data, flatten before processing or process individual nested arrays:

Before:

{
  "customer": {
    "name": "John Doe",
    "contact": {
      "email": "john@example.com"
    }
  }
}

Flatten to:

{
  "customer_name": "John Doe",
  "customer_contact_email": "john@example.com"
}
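
One way to produce this flattened form is a small recursive helper. A Python sketch that joins nested keys with underscores (the naming convention is a choice, not an API requirement):

def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into a single level, joining keys with sep."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

nested = {"customer": {"name": "John Doe", "contact": {"email": "john@example.com"}}}
print(flatten(nested))
# {'customer_name': 'John Doe', 'customer_contact_email': 'john@example.com'}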

Mixed Entity Types Per Column#

Different columns can have different entity configurations:

{
  "data": [...],
  "columns": [
    {
      "column": "full_name",
      "entities": ["PERSON"]
    },
    {
      "column": "contact_info",
      "entities": ["EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION"]
    },
    {
      "column": "government_id",
      "entities": ["US_SSN", "US_PASSPORT", "US_DRIVER_LICENSE"]
    }
  ]
}

Best Practices#

1. Identify PII Columns First#

Before processing, analyze your data to identify which columns contain PII:

# Analyze first to see what entities exist
curl -X POST https://cloak.business/api/presidio/analyze \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Sample text from your data column"}'

2. Use Specific Entity Types#

Narrow entity detection to relevant types for faster processing:

{
  "column": "ssn_field",
  "entities": ["US_SSN"]
}

Restricting detection to specific types is faster and produces fewer false positives than scanning for all entity types.

3. Set Appropriate Thresholds#

  • High confidence (0.7+): Fewer false positives, may miss some PII
  • Medium confidence (0.5): Balanced (recommended)
  • Low confidence (0.3): Catches more, but more false positives

4. Test with Sample Data#

Process a small sample first to verify configuration:

{
  "data": [{"name": "Test User", "email": "test@example.com"}],
  "columns": [...]
}

5. Preserve Non-PII Columns#

Only include columns that need anonymization. Other columns pass through unchanged:

{
  "columns": [
    {"column": "name"},
    {"column": "email"}
  ]
}

Here, order_id, product, and quantity are not listed, so they are returned unchanged.

Troubleshooting#

Common Issues#

Issue              | Cause                | Solution
"Column not found" | Column name mismatch | Check exact column name (case-sensitive)
Empty output       | No PII detected      | Lower score_threshold or check entity types
Slow processing    | Large file           | Process in batches of 10,000 rows
Encoding errors    | Non-UTF-8 file       | Convert to UTF-8 before upload
Missing header     | No header row        | Add header row to CSV
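
For large inputs, batching keeps each request well under the size and row limits noted above. A Python sketch that splits a list of records into batches of 10,000 and processes each batch with the JSON endpoint (payload fields follow the earlier examples; the API key is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
BATCH_SIZE = 10_000

def process_in_batches(rows: list, columns: list) -> list:
    """Send rows to the structured endpoint in fixed-size batches."""
    anonymized = []
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        resp = requests.post(
            "https://cloak.business/api/presidio/structured/process",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"data": batch, "columns": columns, "language": "en"},
            timeout=300,
        )
        resp.raise_for_status()
        anonymized.extend(resp.json()["data"])
    return anonymized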

Error Responses#

400 Bad Request:

{
  "error": "Invalid request",
  "message": "Column 'customer_name' not found in data"
}

413 Payload Too Large:

{
  "error": "File too large",
  "message": "Maximum file size is 10 MB"
}

429 Rate Limited:

{
  "error": "Rate limit exceeded",
  "retry_after": 60
}
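
When a request is rate limited, the retry_after value indicates how many seconds to wait before retrying. A minimal Python retry wrapper, assuming the JSON body shown above (it can wrap any of the POST calls in this section):

import time
import requests

def post_with_retry(url: str, max_attempts: int = 5, **kwargs):
    """POST and retry on 429 responses, waiting for the retry_after hint."""
    for _ in range(max_attempts):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(response.json().get("retry_after", 60))
    raise RuntimeError("Rate limited after repeated attempts")

# Example: post_with_retry(url, headers=headers, json=payload, timeout=60)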

Debugging Tips#

  1. Check column names: Print headers with head -1 file.csv
  2. Verify encoding: Use file -i file.csv to check encoding
  3. Test single row: Process one row first to validate config
  4. Check entity coverage: Ensure your preset includes expected entity types

Token Cost#

Structured data processing uses the same token calculation as text analysis:

  • Base cost: 1 token per column per row
  • Entity cost: +0.5 token per entity found

Example:

  • 1,000 rows, 3 columns, 500 entities found
  • Cost: (1,000 × 3) + (500 × 0.5) = 3,250 tokens
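
The same calculation can be written as a small helper for estimating cost up front; entity counts are only known after processing, so treat that input as an estimate:

def estimate_tokens(rows: int, columns: int, entities_found: int) -> float:
    """1 token per column per row, plus 0.5 token per entity found."""
    return rows * columns + 0.5 * entities_found

print(estimate_tokens(1_000, 3, 500))  # 3250.0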


Document maintained by cloak.business