Recognizers Development - Best Practices and Considerations
Recognizers are the main building blocks in Presidio. Each recognizer is in charge of detecting one or more entities in one or more languages. Recognizers define the logic for detection, as well as the confidence a prediction receives and a list of words to be used when context is leveraged.
Implementation Considerations
Accuracy
Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system. A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets. For tools and documentation on evaluating and analyzing recognizers, refer to the presidio-research GitHub repository.
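As a rough illustration of how false positives and false negatives turn into reportable numbers, the sketch below computes exact-match precision and recall over labeled spans. It is a hand-rolled helper, not the presidio-research evaluator; the function name `precision_recall` and the sample spans are made up for this example.

```python
from typing import Dict, Set, Tuple

# A span is (start, end, entity_type); this layout is an assumption of the sketch.
Span = Tuple[int, int, str]


def precision_recall(predicted: Set[Span], expected: Set[Span]) -> Dict[str, float]:
    """Exact-match precision and recall over (start, end, entity_type) spans."""
    true_positives = len(predicted & expected)
    false_positives = len(predicted - expected)   # predicted but not labeled
    false_negatives = len(expected - predicted)   # labeled but missed
    precision = (
        true_positives / (true_positives + false_positives) if predicted else 0.0
    )
    recall = (
        true_positives / (true_positives + false_negatives) if expected else 0.0
    )
    return {"precision": precision, "recall": recall}


# Made-up labels and predictions for a single text.
expected = {(0, 3, "TITLE"), (10, 14, "TITLE")}
predicted = {(0, 3, "TITLE"), (20, 23, "TITLE")}
metrics = precision_recall(predicted, expected)
print(metrics)
```

Exact-match scoring is the strictest choice; a partially overlapping span counts as both a false positive and a false negative, so it is worth stating which matching rule was used when reporting results.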
Note
When contributing recognizers to the Presidio OSS, new predefined recognizers should be added to the supported entities list and should follow the contribution guidelines.
Performance
Make sure your recognizer doesn't take too long to process text. Anything above 100ms per request with 100 tokens is probably not good enough.
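One way to check a recognizer against that budget is to time repeated calls on a text of roughly 100 tokens. The sketch below uses only the standard library; `median_latency_ms` is a hypothetical helper, and the `lambda` stands in for a real call such as a recognizer's `analyze` method.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(
    analyze_fn: Callable[[str], object], text: str, runs: int = 20
) -> float:
    """Call analyze_fn(text) `runs` times and return the median latency in ms."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        analyze_fn(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Roughly 100 whitespace-separated tokens; real tokenizers may count differently.
sample_text = " ".join(["token"] * 100)

# Placeholder workload: replace the lambda with the recognizer call under test.
latency = median_latency_ms(lambda t: t.split(), sample_text)
within_budget = latency < 100.0
print(f"median latency: {latency:.3f} ms, within budget: {within_budget}")
```

The median is used rather than the mean so that a single slow first call (for example, lazy model loading) does not dominate the result.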
Environment
When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies.
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a RemoteRecognizer on the presidio-analyzer side to interact with the model's endpoint.
Recognizer Types
Generally speaking, there are three types of recognizers:
Deny Lists
A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (["Mr.", "Mrs.", "Ms.", "Dr."]) to detect a "Title" entity.
See this documentation on adding a new recognizer. The PatternRecognizer class has built-in support for a deny-list input.
Pattern Based
Pattern based recognizers use regular expressions to identify entities in text.
See this documentation on adding a new recognizer via code.
The PatternRecognizer class should be extended.
See some examples here:
Examples
Examples of pattern based recognizers are the CreditCardRecognizer and EmailRecognizer.
Machine Learning (ML) Based or Rule-Based
Many PII entities are undetectable using naive approaches like deny-lists or regular expressions. In these cases, we would want to use either a Machine Learning model capable of identifying entities in free text, or a rule-based recognizer.
ML: Utilize SpaCy, Stanza or Transformers
Presidio currently uses spaCy as a framework for text analysis and Named Entity Recognition (NER), with stanza and Hugging Face transformers as alternatives. To avoid introducing new tools, it is recommended to first try spaCy, stanza or transformers before other tools where possible.
spaCy provides decent results compared to state-of-the-art NER models, but with much better computational performance.
spaCy, stanza and transformers models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.
In addition to those, it is also possible to use other ML models. In that case, a new EntityRecognizer should be created.
See an example using Flair here.
Apply Custom Logic
In some cases, rule-based logic provides reasonable ways for detecting entities.
The Presidio EntityRecognizer API allows you to use spaCy extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the EntityRecognizer should be created.
Considerations for selecting one option over another
- Accuracy.
- Ease of integration.
- Runtime considerations (for example, if the new model requires a GPU).
- 3rd party dependencies of the new model vs. the existing presidio-analyzer package.