016: Identify sensitive data using Trainable Classifiers

Overview

List all use cases where a file can be automatically identified as sensitive using trainable classifiers (pretrained classifiers provided by Microsoft and/or custom trainable classifiers). Consider whether the out-of-the-box classifiers cover your business needs and achieve the required accuracy in your environment, or whether they need to be fine-tuned by creating new classifiers based on their matches in your content or on manually selected content. To fine-tune an out-of-the-box classifier, copy all files detected by the original classifier to a specific location, remove files that were improperly detected, add known files that the classifier missed, and train a new classifier on the result. In most cases, simply training a classifier on the matches of the original classifier provides enough benefit, because the new classifier picks up internal terms and jargon, which can considerably improve its accuracy.
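The curation step in the fine-tuning workflow can be sketched as simple set arithmetic. The following Python is only an illustration of the logic, not part of any Microsoft tooling: the function names `build_seed_set` and `stage_seed_files` and the file names in the usage note are hypothetical, and it assumes you have already exported the list of files matched by the original classifier.

```python
from pathlib import Path
import shutil


def build_seed_set(matched, false_positives, missed):
    """Return the sorted list of files to train the new classifier on.

    matched:         files detected by the original classifier
    false_positives: matched files that were improperly detected
    missed:          known sensitive files the original classifier did not match
    """
    keep = (set(matched) - set(false_positives)) | set(missed)
    return sorted(keep)


def stage_seed_files(files, seed_dir):
    """Copy the selected files into one folder used as the training location.

    Assumption: file names are unique; files sharing a name would overwrite
    each other in the target folder.
    """
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy2(f, seed_dir / Path(f).name)
    return seed_dir
```

For example, `build_seed_set(["a.docx", "b.docx", "c.docx"], ["b.docx"], ["d.docx"])` keeps `a.docx` and `c.docx`, drops the false positive `b.docx`, and adds the missed file `d.docx`.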

For classifiers where a high rate of false positives is a problem, consider combining them with other classifier types such as sensitive information types (SITs), e.g.:

  • Contract classifier together with a custom SIT that looks for monetary amounts in excess of a certain value (e.g. $100,000).
  • Project collateral classifier together with a dictionary-based SIT that looks for terms referring to specific projects or stakeholders.
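The first combination above amounts to a logical AND between the classifier match and an amount check. The Python below is a minimal sketch of that logic only; in practice it would be configured as a custom SIT plus rule conditions rather than written as code. The regular expression, the names `has_amount_over` and `is_sensitive`, and the restriction to US dollar notation are all assumptions made for illustration.

```python
import re

# Assumed pattern: "$" followed by digits with optional thousands separators
# and optional cents, e.g. "$250,000.00" or "$1500000".
AMOUNT_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def has_amount_over(text, threshold=100_000):
    """Return True if the text contains a dollar amount strictly above threshold."""
    for m in AMOUNT_RE.finditer(text):
        value = float(m.group(1).replace(",", "") + (m.group(2) or ""))
        if value > threshold:
            return True
    return False


def is_sensitive(classifier_matched, text, threshold=100_000):
    """Flag a file only when the classifier matched AND a large amount is present."""
    return classifier_matched and has_amount_over(text, threshold)
```

Requiring both conditions trades some recall for precision: a contract that mentions no amount above the threshold is no longer flagged, so the threshold should reflect what the organization actually considers sensitive.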

Reference