PII & Data Privacy Glossary
Clear definitions of key privacy, compliance, and data protection terms used across the industry.
Privacy & Compliance Terms
Personally Identifiable Information (PII)
Any data that can identify a specific individual, such as names, email addresses, social security numbers, or phone numbers.
Anonymization
The irreversible process of altering data so that individuals cannot be identified, directly or indirectly.
Pseudonymization
Replacing identifiable data with artificial identifiers (pseudonyms) so that re-identification requires a separately held key.
De-identification
Removing or obscuring personal identifiers from data so that it can no longer be linked to a specific individual without additional information.
Data Subject
An identified or identifiable natural person whose personal data is processed by a controller or processor.
Data Controller
The entity that determines the purposes and means of processing personal data.
Data Processor
An entity that processes personal data on behalf of a data controller, following the controller's instructions.
Consent
A freely given, specific, informed, and unambiguous indication of a data subject's agreement to the processing of their personal data.
Lawful Basis
A legal ground under which personal data processing is permitted, such as consent, contract necessity, legal obligation, or legitimate interest.
Data Minimization
The principle that personal data collected should be adequate, relevant, and limited to what is necessary for its intended purpose.
Right to Erasure
A data subject's right to have their personal data deleted when it is no longer necessary, also known as the 'right to be forgotten' under GDPR.
Data Portability
The right of data subjects to receive their personal data in a structured, commonly used format and to transfer it to another controller.
Data Protection Officer (DPO)
A designated individual responsible for overseeing an organization's data protection strategy and ensuring compliance with privacy regulations.
Data Protection Impact Assessment (DPIA)
A process to identify and minimize data protection risks of a project, required under GDPR for high-risk processing activities.
Data Breach
A security incident where personal data is accessed, disclosed, altered, or destroyed without authorization.
Shadow AI
Use of AI tools (such as ChatGPT, Copilot, or Gemini) by employees without IT approval. Shadow AI is a common source of PII leaks, because users may paste sensitive business data (customer records, patient information, financial data) directly into AI prompts.
Data Minimization in AI Systems
The GDPR data minimization principle (Art. 5(1)(c)) requires organizations to collect and process only the personal data necessary for a specific purpose. In AI systems, this means anonymizing or removing PII before data enters AI pipelines, which reduces compliance risk and breach exposure.
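One simple way to apply data minimization before a record reaches an AI pipeline is an allowlist of fields. The sketch below is illustrative only: the field names and the `minimize` helper are hypothetical, and a real system would derive the allowlist from the documented processing purpose.

```python
# Hypothetical allowlist: only fields needed for the stated purpose pass through.
ALLOWED_FIELDS = frozenset({"ticket_id", "product", "issue_summary"})


def minimize(record: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Return a copy of `record` containing only allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "ticket_id": 1042,
    "product": "router",
    "issue_summary": "Device reboots every hour",
    "customer_email": "jane@example.com",  # not needed for triage, so dropped
    "phone": "555-0100",                   # not needed for triage, so dropped
}
minimized = minimize(record)
```

An allowlist fails closed: a field added to the record later is dropped unless someone decides it is needed, which is not true of a blocklist.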
Regulatory Frameworks
GDPR (General Data Protection Regulation)
The EU regulation governing the processing of personal data of individuals within the European Economic Area, effective since May 2018.
CCPA (California Consumer Privacy Act)
A California state law granting consumers rights over their personal information collected by businesses, effective since January 2020.
HIPAA (Health Insurance Portability and Accountability Act)
A US federal law establishing standards for protecting sensitive patient health information from disclosure without consent.
ISO 27001
An international standard for information security management systems (ISMS), specifying requirements for establishing, implementing, and continuously improving security controls.
SOC 2 (System and Organization Controls 2)
An auditing framework for service organizations that evaluates controls related to security, availability, processing integrity, confidentiality, and privacy.
EU AI Act
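European Union regulation on artificial intelligence (Regulation (EU) 2024/1689), in force since August 2024, with most obligations applying from August 2026. High-risk AI systems must meet data governance requirements for training, validation, and testing data (Art. 10), along with technical documentation and record-keeping duties. Where personal data is involved, GDPR obligations such as data minimization and DPIAs continue to apply, and anonymizing or pseudonymizing training data is a common way to reduce that exposure.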
ISO 42001
International standard for AI Management Systems (AIMS), published in 2023. Provides a framework for responsible AI development and deployment, including data quality, bias controls, and privacy safeguards. Often paired with ISO 27001 for organizations operating AI systems with personal data.
India DPDP Act
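India's Digital Personal Data Protection Act (2023), with implementing rules notified in 2025 and phased in afterward. Requires consent, or another permitted ground, for processing digital personal data, and requires breach notification to the Data Protection Board and affected individuals, with detailed reporting within 72 hours. Rather than blanket data localization, it lets the government restrict transfers to specified countries. Applies to processing within India and to processing abroad that is connected with offering goods or services to individuals in India.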
Technical Terms
Named Entity Recognition (NER)
An NLP technique that identifies and classifies named entities in text into predefined categories such as person names, locations, and organizations.
Natural Language Processing (NLP)
A branch of artificial intelligence that enables computers to understand, interpret, and generate human language.
Pattern Recognizer
A rule-based detector that uses regular expressions and context clues to identify specific data patterns, such as credit card numbers or social security numbers.
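A pattern recognizer of this kind can be sketched in a few lines. The example below is a minimal illustration, not any particular library's API: the `Match` type, the `find_ssns` function, the context words, and the score values are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass

# Matches the common US SSN layout 123-45-6789 (format only, no validity check).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Hypothetical context clues that make an SSN-shaped match more plausible.
CONTEXT_WORDS = ("ssn", "social security")


@dataclass
class Match:
    entity_type: str
    start: int
    end: int
    score: float


def find_ssns(text: str, window: int = 30) -> list[Match]:
    """Return SSN-shaped matches, scored higher when a context word is nearby."""
    lowered = text.lower()
    matches = []
    for m in SSN_PATTERN.finditer(text):
        nearby = lowered[max(0, m.start() - window): m.end() + window]
        has_context = any(word in nearby for word in CONTEXT_WORDS)
        score = 0.85 if has_context else 0.5  # illustrative values only
        matches.append(Match("US_SSN", m.start(), m.end(), score))
    return matches
```

The regular expression alone would also flag order references or part numbers with the same shape, which is why the context check adjusts the score instead of relying on the format alone.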
Confidence Score
A numerical value between 0 and 1 indicating how certain a detection engine is that a piece of text matches a specific entity type.
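In practice a confidence score is usually compared against a threshold to decide which detections to act on. The following sketch assumes detections are plain dictionaries with a `score` key; the structure and the default threshold of 0.6 are hypothetical.

```python
def filter_by_confidence(detections, threshold=0.6):
    """Keep only detections whose score meets or exceeds the threshold."""
    return [d for d in detections if d["score"] >= threshold]


detections = [
    {"entity_type": "PERSON", "text": "John Smith", "score": 0.92},
    {"entity_type": "LOCATION", "text": "May", "score": 0.35},
    {"entity_type": "US_SSN", "text": "123-45-6789", "score": 0.85},
]
accepted = filter_by_confidence(detections)
```

Lowering the threshold catches more PII at the cost of more false positives. Raising it does the reverse.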
Regular Expression (Regex)
A sequence of characters defining a search pattern, commonly used to validate and detect structured data formats like phone numbers or email addresses.
AES-256-GCM
An authenticated encryption algorithm using a 256-bit key with Galois/Counter Mode, providing both confidentiality and integrity verification of encrypted data.
Zero-Knowledge Encryption
An encryption architecture where only the user holds the decryption key, meaning even the service provider cannot access the plaintext data.
Tokenization
Replacing sensitive data with non-sensitive placeholder tokens that can be mapped back to the original data through a secure lookup.
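The lookup behind tokenization can be pictured as a two-way mapping. The sketch below uses an in-memory dictionary as a stand-in for the secure lookup; the `TokenVault` class and the `tok_` prefix are hypothetical, and a production vault would be a separately secured, access-controlled store.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a secure token lookup (illustration only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for `value`, or mint a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no information
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Map a token back to the original value; raises KeyError if unknown."""
        return self._token_to_value[token]


vault = TokenVault()
card = "4111 1111 1111 1111"
token = vault.tokenize(card)
```

Because the token is random, it reveals nothing about the original value. Recovery is possible only through the vault.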
Data Masking
Obscuring specific data within a dataset so that sensitive information is hidden while the data remains usable for testing or analysis.
Redaction
The permanent removal of sensitive information from a document or dataset, replacing it with a marker such as [REDACTED].
Synthetic Data
AI-generated data that statistically mimics real data without containing actual records. Compared with anonymization: anonymized data typically preserves more analytical accuracy for downstream ML, while synthetic data lowers re-identification risk but can drift statistically from the source. Reversible methods such as encryption or tokenization (which count as pseudonymization, not anonymization) are preferable when original records may be needed for compliance audits.
LLM Prompt Injection
An attack technique where malicious input manipulates a large language model to ignore instructions or leak sensitive information. In PII protection contexts, prompt injection can cause an AI model to reveal anonymized data patterns or user information. Pre-anonymizing inputs before they reach LLMs reduces the attack surface.
Privacy-by-Design
A principle codified in GDPR Art. 25 (data protection by design and by default) requiring data protection to be built into systems from the ground up rather than added as an afterthought. For AI systems, privacy-by-design can mean anonymizing data before it enters AI pipelines, implementing zero-knowledge encryption, and minimizing data retention.
Anonymization Methods
Replace
Substitutes detected PII with a generic placeholder of the same entity type, such as replacing 'John Smith' with '<PERSON>'.
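Given detected spans, the replace method amounts to string splicing. This is a minimal sketch that assumes detections arrive as non-overlapping `(start, end, entity_type)` tuples; the `replace_entities` name and the tuple layout are hypothetical.

```python
def replace_entities(text, detections):
    """Replace each detected span with a <ENTITY_TYPE> placeholder.

    `detections` is a list of non-overlapping (start, end, entity_type) tuples.
    Spans are processed right to left so earlier offsets stay valid.
    """
    result = text
    for start, end, entity_type in sorted(detections, reverse=True):
        result = result[:start] + f"<{entity_type}>" + result[end:]
    return result


text = "John Smith lives in Paris."
detections = [(0, 10, "PERSON"), (20, 25, "LOCATION")]
anonymized = replace_entities(text, detections)
# anonymized == "<PERSON> lives in <LOCATION>."
```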
Mask
Partially obscures PII by replacing characters with masking symbols, for example turning '123-45-6789' into '***-**-6789'.
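The masking example above can be reproduced with a short helper. This sketch assumes that separators such as hyphens are kept and that the last four alphanumeric characters stay visible. The `mask_value` name and its parameters are hypothetical.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every alphanumeric character except the last `visible` ones.

    Separators such as '-' or spaces are kept so the format stays recognizable.
    """
    total = sum(ch.isalnum() for ch in value)
    remaining_to_mask = max(0, total - visible)
    masked = []
    for ch in value:
        if ch.isalnum() and remaining_to_mask > 0:
            masked.append(mask_char)
            remaining_to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


masked_ssn = mask_value("123-45-6789")
# masked_ssn == "***-**-6789"
```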
Redact
Completely removes detected PII from the text, leaving no trace of the original value.
Hash
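Converts PII into a fixed-length cryptographic hash, allowing consistent replacement. Reversal is computationally infeasible for high-entropy inputs, but short, structured values such as phone numbers need a salt or secret key to resist guessing attacks.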
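Values such as phone numbers come from a small space, so a plain hash can be reversed by hashing every candidate and comparing. A keyed hash is one common mitigation. The sketch below uses HMAC-SHA256 from the Python standard library, which is a swapped-in variant of plain hashing and not necessarily what any given tool does. The `hash_pii` name and the normalization step are assumptions.

```python
import hashlib
import hmac


def hash_pii(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of a normalized PII value."""
    # Normalizing first makes "Jane@Example.com " and "jane@example.com" match.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-key-do-not-use-in-production"  # hypothetical; load from a secret store
digest_a = hash_pii("Jane@Example.com ", key)
digest_b = hash_pii("jane@example.com", key)
```

The same input and key always produce the same digest, so hashed values can still be joined or counted across datasets without exposing the original.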
Encrypt
Transforms PII using AES-256-GCM encryption with a user-held key, enabling authorized reversal (de-anonymization) when needed.
Frequently Asked Questions
What is the difference between anonymization and pseudonymization?
Anonymization irreversibly removes all identifying information so re-identification is impossible. Pseudonymization replaces identifiers with artificial ones while keeping a separate key that allows re-identification when authorized. Under GDPR, pseudonymized data is still considered personal data.
Why does PII detection use both NLP and pattern recognizers?
NLP models detect context-dependent entities like person names and locations that lack a fixed format. Pattern recognizers use regular expressions to catch structured identifiers like social security numbers, credit card numbers, and phone numbers. Combining both approaches covers entity types that either approach alone would miss.
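When both detectors run over the same text, their results have to be merged and any overlaps resolved. One simple policy, shown here as a sketch and not as any specific engine's behavior, is to keep the highest-scoring detection among overlapping spans. The `merge_detections` name and the tuple layout `(start, end, entity_type, score)` are hypothetical.

```python
def merge_detections(nlp_hits, pattern_hits):
    """Merge two detection lists, keeping the best-scoring hit per overlap.

    Each hit is a (start, end, entity_type, score) tuple.
    """
    chosen = []
    # Visit hits from highest to lowest score; accept a hit only if it does
    # not overlap anything already accepted.
    for hit in sorted(nlp_hits + pattern_hits, key=lambda h: h[3], reverse=True):
        if all(hit[1] <= c[0] or hit[0] >= c[1] for c in chosen):
            chosen.append(hit)
    return sorted(chosen)  # back to document order


nlp_hits = [(0, 10, "PERSON", 0.9), (15, 26, "DATE", 0.4)]
pattern_hits = [(15, 26, "US_SSN", 0.85)]
merged = merge_detections(nlp_hits, pattern_hits)
# merged == [(0, 10, "PERSON", 0.9), (15, 26, "US_SSN", 0.85)]
```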
What is zero-knowledge encryption and why does it matter?
Zero-knowledge encryption means only you hold the decryption key, so the service provider cannot read your data. This matters because even if a server is breached, your encrypted data remains unreadable without your key.
How does reversible encryption differ from hashing?
Hashing is a one-way transformation: once data is hashed, the original cannot be recovered from the hash itself. Reversible encryption (using AES-256-GCM) allows authorized users with the correct key to decrypt and recover the original data, enabling workflows where de-anonymization is needed.