
001: Sensitive data discovery

Overview

The first step in implementing Zero Trust for your data assets is to identify and categorize those assets by sensitivity, which covers confidentiality, criticality and business impact. The following tools are available for discovering sensitive data across the organization:

  • Out-of-the-box and custom classifiers (also known as Sensitive Information Types, or SITs). Classifiers identify documents and emails whose content matches certain patterns or criteria.
    • Out-of-the-box classifiers detect identifiable information for well-known entities such as national identity numbers, bank account numbers or driver's licenses.
    • Custom sensitive information types can be created to identify non-standard, business-specific identifiers or other sensitive information, or to customize how standard entities are detected.
  • Advanced classifiers. These include:
    • Exact Data Match (EDM) for accurately detecting known personal information, such as customer or employee records.
    • Trainable classifiers (both pre-trained and custom) to detect documents likely to belong to certain categories or classes.
    • Fingerprints to identify documents closely matching the content in well-known documents.
    • Named entities, which can be used to detect people's names, addresses, credentials and other potentially sensitive data.
  • Content Explorer and the Content Explorer Export PowerShell tool. Data assets matching a given classifier can be identified and investigated directly in Content Explorer. Alternatively, information about all data assets can be exported using PowerShell and imported into a SIEM tool for further analysis.
  • Activity Explorer. It shows the creation, access and sharing of sensitive information in your environment, allowing you to perform an initial assessment of risky user behaviors and actions involving sensitive data.
  • Unified Audit Log and custom tools built on top of it. Every action related to the creation, modification, classification, discovery or sharing of data in a Microsoft 365 tenant is recorded in the Audit Log, which can be connected to a SIEM to help identify the presence of sensitive information and patterns in its exposure and usage.
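
To make the idea of a pattern-based classifier more concrete, the following Python sketch shows one way a custom sensitive information type could be modeled: a primary regular expression plus supporting keywords that must appear within a character window around the match. It is an illustration only and not how the product's detection engine is implemented. The `EMP-` identifier format, the keyword list, the `proximity` parameter and the two confidence levels are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    value: str
    start: int
    confidence: str  # "high" when a supporting keyword is nearby, else "low"


def find_matches(text, pattern, keywords, proximity=300):
    """Find pattern matches and rate each one by nearby keyword evidence."""
    results = []
    lowered = text.lower()
    for m in re.finditer(pattern, text):
        # Look for supporting keywords in a window around the match.
        window_start = max(0, m.start() - proximity)
        window_end = min(len(text), m.end() + proximity)
        window = lowered[window_start:window_end]
        has_evidence = any(k.lower() in window for k in keywords)
        results.append(
            Match(m.group(), m.start(), "high" if has_evidence else "low")
        )
    return results


# Hypothetical business identifier: "EMP-" followed by exactly six digits.
EMPLOYEE_ID = r"\bEMP-\d{6}\b"
# Hypothetical supporting keywords for that identifier.
KEYWORDS = ["employee id", "badge", "personnel"]

if __name__ == "__main__":
    for hit in find_matches("Employee ID: EMP-004217", EMPLOYEE_ID, KEYWORDS):
        print(hit.value, hit.confidence)
```

A match with no supporting keyword nearby is still reported, but at lower confidence, which lets a policy choose how aggressively to act on it.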

Reference

Additional resources

| Scenario | Preferred method | Alternative methods (less accurate) | Techniques to reduce false positives |
|---|---|---|---|
| Detect PII/PHI for known individuals (customers/patients) | Exact Data Match to data from LoB app extract | Custom SITs including employee ID + common PII SITs (e.g. SSN). | If using regular SITs, consider requiring the presence of All Full Names, and limiting rules to documents with more than a certain minimum match count. |
| Detect PII/PHI for employees or contractors | Exact Data Match to data from HR system extract | Custom SITs including employee ID + Named Entities (e.g. All Full Names) + common PII SITs (e.g. SSN). | If using regular SITs, consider requiring the presence of All Full Names, and limiting rules to documents with more than a certain minimum match count. |
| Detect forms with personal data (e.g. sign-up forms, tax forms, account management forms, etc.) | Form fingerprinting + standard SITs, or fingerprinting + custom SITs | (Custom SITs) + keywords + OCR (for scanned forms) | Use EDM instead of custom SITs if not using fingerprinting. |
| Contracts, legal documents or other business forms | Custom trainable classifiers (+ OCR) | OOB trainable classifiers + OCR | Identify documents in your organization that are correctly identified by the OOB trainable classifiers, copy them to a repository and use them to train a custom classifier. This will produce a fine-tuned classifier that better aligns with your organization's typical terms (e.g. company names, jargon, boilerplate). |
| Important contracts or other documents | Trainable classifier + sensitivity label (manually or automatically applied) | Trainable classifier + custom SIT (e.g. regex to detect monetary amounts in excess of $100K, or dictionary or EDM with important customer names) | Identify documents in your organization that are correctly identified by the OOB trainable classifiers, as well as documents manually tagged/labeled as such, copy them to a repository and use them to train a custom classifier. This will produce a fine-tuned classifier that better aligns with your organization's typical terms (e.g. company names, jargon, boilerplate). You can also combine multiple trainable classifiers in a single rule, e.g. "Contracts" and "Documents about Project X", to find documents relevant to both subjects. |
| General PII or PHI of unknown individuals (e.g. non-customers or prospective customers) | OOB SITs if available for the desired PII, or custom SITs. | Manual labeling | Copy and edit an existing SIT to fine-tune its keyword requirements. Expand proximity limit requirements for matching content in filled forms in PDF format, since content in forms is not stored within the form structure. Add requirements for named entities such as All Full Names. Ensure custom regexes are defined as "word match" or start and end with \b. |
| Sensitive conversations of known nature | Custom trainable classifier trained on confirmed samples | | In Content Explorer, identify documents with labels that are relevant to those subjects, extract them and use them to train a custom classifier. |
| Sensitive conversations of non-specific nature | Manual labeling (let the user decide) | Use dictionaries of "hush words" or other relevant keywords which might hint at sensitive subjects. | |
| Scanned identity cards (or similar) | OOB or custom SITs + OCR | Custom SIT with keyword lists present in such documents + OCR | Add a requirement for Named Entities (names or addresses) if they are expected to be written on one line in the document. |
| Project data | Location-based labeling (e.g. default label for library) | Manual labeling or custom trainable classifiers | Once location-based labeling or manual labeling has identified enough relevant documents, use those to train a custom classifier. |
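
Two of the false-positive techniques listed above, anchoring custom regexes with `\b` and requiring a minimum match count, can be illustrated with a short Python sketch. This is a simplified model under stated assumptions, not the product's matching logic: the 3-2-4 digit layout, the threshold of three distinct matches and the helper names `count_distinct` and `is_sensitive` are hypothetical choices for the example.

```python
import re

# Hypothetical identifier in a 3-2-4 digit layout (illustrative only).
# LOOSE has no word boundaries, so it also fires inside longer digit runs.
LOOSE = re.compile(r"\d{3}-\d{2}-\d{4}")
# STRICT starts and ends with \b, so the match must stand alone as a token.
STRICT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def count_distinct(pattern, text):
    """Count distinct values matched by the pattern in the text."""
    return len(set(pattern.findall(text)))


def is_sensitive(text, min_count=3):
    """Flag a document only when enough distinct strict matches are present."""
    return count_distinct(STRICT, text) >= min_count


part_number = "Part 9123-45-67890 is back in stock."
roster = "123-45-6789, 234-56-7890, 345-67-8901"

if __name__ == "__main__":
    print(bool(LOOSE.search(part_number)))   # loose pattern hits the part number
    print(bool(STRICT.search(part_number)))  # anchored pattern does not
    print(is_sensitive(roster))
```

The unanchored pattern matches a fragment of the part number, while the anchored one ignores it. The match-count threshold then separates a document with a single incidental value from one that lists several.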