001: Sensitive data discovery
Overview
The first step in implementing Zero Trust for your data assets is to identify and categorize these assets according to their sensitivity, which encompasses areas like confidentiality, criticality and business impact. The following tools are available for discovering Sensitive Data across the organization:
- Out of the box and custom classifiers (A.K.A. Sensitive Information Types). Classifiers enable you to identify documents and emails that contain contents matching certain patterns or criteria.
- Out of the box classifiers allow you to detect identifiable information corresponding to well-known entities such as people's national identity numbers, bank accounts or drivers licenses.
- Custom sensitive information types can be created to identify non-standard, business-specific identifiers or other sensitive information, or to customize how standard entities are detected.
- Advanced classifiers. These include:
- Exact Data Match for accurately detecting known personal information such as that corresponding to customers or employees.
- Trainable classifiers (both pre-trained and custom) to detect documents likely to belong to certain categories or classes
- Fingerprints to identify documents closely matching the content in well-known documents.
- Named entities, which can be used to detect people's names, addresses, credentials and other potentially sensitive data.
- Content Explorer and the Content Explorer Export PowerShell tool: Individual data assets matching individual classifiers can be identified and investigated in Content Explorer directly. Alternatively, information about all data assets can be identified and exported using PowerShell and imported into a SIEM tool for further analysis.
- Activity Explorer: it reflects creation, access and sharing of sensitive information in your environment, allowing you to perform an initial assessment of risky behaviors and actions by your users involving sensitive data.
- Unified Audit Log and custom tools built on top of it. Every action related to the creation, modification, classification, discovery or sharing of data in a Microsoft 365 tenant is reflected in the Audit Log, which can be connected to a SIEM to help identify the presence of sensitive information and patterns in its exposure and usage.
Reference
- Content Explorer: https://learn.microsoft.com/en-us/purview/data-classification-content-explorer
- Content Explorer PowerShell: https://learn.microsoft.com/en-us/powershell/module/exchange/export-contentexplorerdata
- Activity Explorer: https://learn.microsoft.com/en-us/purview/data-classification-activity-explorer
- Unified Audit Log: https://learn.microsoft.com/en-us/purview/audit-solutions-overview
- Out of the box Sensitive Information Types: https://aka.ms/sensitiveinfotypes
- Custom Sensitive information types: https://learn.microsoft.com/en-us/purview/sit-create-a-custom-sensitive-information-type
- Exact data matching: https://learn.microsoft.com/en-us/purview/sit-learn-about-exact-data-match-based-sits
- Trainable classifiers: https://learn.microsoft.com/en-us/purview/trainable-classifiers-learn-about
- Fingerprinting: https://learn.microsoft.com/en-us/purview/sit-document-fingerprinting
- Named Entities: https://learn.microsoft.com/en-us/purview/sit-named-entities-learn
Additional resources
Scenario | Preferred method | Alternative methods (less accurate) | Techniques to reduce false positives |
---|---|---|---|
Detect PII/PHI for known individuals (customers/patients) | Exact Data Match to data from LoB app extract | Custom SITs including employee ID + common PII SITs (e.g. SSN). | If using regular SITs, consider adding requiring the presence of All Full Names, and limiting rules to documents with more than a certain minimum match count. |
Detect PII/PHI for employees or contractors | Exact Data Match to data from HR system extract | Custom SITs including employee ID + Named Entities (e.g. all full names) + common PII SITs (e.g. SSN). | If using regular SITs, consider adding requiring the presence of All Full Names, and limiting rules to documents with more than a certain minimum match count. |
Detect forms with personal data (e.g. sign-up forms, tax forms, account management forms, etc.) | Form fingerprinting + standard SITs or fingerprinting + custom SITs | (Custom SITs) + keywords + OCR (for scanned forms) | Use EDM instead of custom SITs if not using fingerprinting. |
Contracts, legal documents or other business forms | Custom trainable classifiers (+ OCR) | OOB trainable classifiers + OCR | Identify documents in your organization that are correctly identified by the OOB trainable classifiers, copy them to a repository and use them to train a custom classifier. This will produce a fine tuned classifier that will better align with your organization's typical terms (e.g. include company names, jargon, boilerplate). |
Important contracts or other documents | Trainable classifier + sensitivity label (manually or automatically applied) | Trainable classifier + custom SIT (e.g. regex to detect monetary amounts in excess of $100K, or dictionary or EDM with important customer names) | Identify documents in your organization that are correctly identified by the OOB trainable classifiers, as well as documents manually tagged/labeled as such, copy them to a repository and use them to train a custom classifier. This will produce a fine tuned classifier that will better align with your organization's typical terms (e.g. include company names, jargon, boilerplate). You can also combine use of multiple trainable classifiers in a single rule, e.g. "Contracts" and "Documents about Project X" to find documents relevant to both subjects. |
General PII or PHI of unknown individuals (e.g. non-customers or prospective customers) | OOB SITs if available fo the desired PII or custom SITs. | Manual labeling | Copy and edit an existing SIT to fine tune its keyword requirements. Expand proximity limit requirements for matching content in filled forms in PDF format since content in forms is not stored within the form structure. Add requirements for named entities such as All Full Names. Ensure custom regexes are defined as "word match" or start and end with \b. |
Sensitive conversations of known nature | Custom Trainable classifier trained based on confirmed samples | Identify in Content Explorer documents with labels that are relevant to those subjects, extract them and use them to train a custom classifier. | |
Sensitive conversations of non-specific nature | Manual labeling (let the user decide) | Use dictionaries of "hush words" or other relevant keywords which might hint at sensitive subjects. | |
Scanned identity cards (or similar) | OOB or Custom SITs + OCR | Custom SIT with keyword lists present in such documents + OCR | Add requirement for Named Entities (names or addresses) if expected to be written in one line in the document. |
Project data | Location-based labeling (e.g. default label for library) | Manual labeling or custom trainable classifiers | Once location-based labeling or manual labeling has identified enough relevant documents use those to train a custom classifier. |