Presidio Analyzer API Reference
Presidio analyzer package.
AnalysisExplanation
Hold tracing information to explain why PII entities were identified as such.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizer | str | name of recognizer that made the decision | required |
original_score | float | recognizer's confidence in result | required |
pattern_name | str | name of pattern (if decision was made by a PatternRecognizer) | None |
pattern | str | regex pattern that was applied (if PatternRecognizer) | None |
validation_result | float | result of a validation (e.g. checksum) | None |
textual_explanation | str | Free text for describing a decision of a logic or model | None |
Source code in presidio_analyzer/analysis_explanation.py
__repr__()
Create string representation of the object.
Source code in presidio_analyzer/analysis_explanation.py
append_textual_explanation_line(text)
Append a new line to textual_explanation field.
Source code in presidio_analyzer/analysis_explanation.py
set_improved_score(score)
Update the score and calculate the difference from the original score.
Source code in presidio_analyzer/analysis_explanation.py
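The score bookkeeping described for set_improved_score can be sketched with a small stand-in class. This is an illustration under assumptions, not the library's code: the class name `ExplanationSketch` and the field name `score_improvement` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExplanationSketch:
    """Hypothetical stand-in for the score fields of an explanation object."""

    original_score: float
    score: float = 0.0
    score_improvement: float = 0.0  # assumed field name, for illustration only

    def __post_init__(self) -> None:
        # Before any enhancement, the current score equals the original one.
        self.score = self.original_score

    def set_improved_score(self, score: float) -> None:
        # Store the new score and how far it moved from the original score.
        self.score = score
        self.score_improvement = self.score - self.original_score


explanation = ExplanationSketch(original_score=0.5)
explanation.set_improved_score(0.85)
print(round(explanation.score_improvement, 2))
```

Keeping the original score untouched lets a reader of the explanation see both the recognizer's own confidence and the later adjustment.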
set_supportive_context_word(word)
Set the context word which helped increase the score.
Source code in presidio_analyzer/analysis_explanation.py
to_dict()
Serialize self to dictionary.
Returns:
Type | Description |
---|---|
Dict | a dictionary |
Source code in presidio_analyzer/analysis_explanation.py
AnalyzerEngine
Entry point for Presidio Analyzer.
Orchestrating the detection of PII entities and all related logic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
registry | RecognizerRegistry | instance of type RecognizerRegistry | None |
nlp_engine | NlpEngine | instance of type NlpEngine (for example SpacyNlpEngine) | None |
app_tracer | AppTracer | instance of type AppTracer, used to trace the logic used during each request for interpretability reasons. | None |
log_decision_process | bool | bool, defines whether the decision process within the analyzer should be logged or not. | False |
default_score_threshold | float | Minimum confidence value for detected entities to be returned | 0 |
supported_languages | List[str] | List of possible languages this engine could be run on. Used for loading the right NLP models and recognizers for these languages. | None |
context_aware_enhancer | Optional[ContextAwareEnhancer] | instance of type ContextAwareEnhancer for enhancing confidence score based on context words, (LemmaContextAwareEnhancer will be created by default if None passed) | None |
Source code in presidio_analyzer/analyzer_engine.py
__add_recognizer_id_if_not_exists(results, recognizer)
staticmethod
Ensure each result's recognition metadata contains a recognizer id.
Check that every result in the list has a recognizer id inside its recognition metadata dictionary, and create it if it is missing. recognizer_id is needed for context-aware enhancement.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
results | List[RecognizerResult] | List of RecognizerResult | required |
recognizer | EntityRecognizer | Entity recognizer | required |
Source code in presidio_analyzer/analyzer_engine.py
__remove_decision_process(results)
staticmethod
Remove decision process / analysis explanation from response.
Source code in presidio_analyzer/analyzer_engine.py
__remove_low_scores(results, score_threshold=None)
Remove results for which the confidence is lower than the threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
results | List[RecognizerResult] | List of RecognizerResult | required |
score_threshold | float | float value for minimum possible confidence | None |
Returns:
Type | Description |
---|---|
List[RecognizerResult] | List[RecognizerResult] |
Source code in presidio_analyzer/analyzer_engine.py
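The threshold filtering described above can be sketched as follows. The `Result` dataclass is a hypothetical, simplified stand-in for RecognizerResult, and falling back to an engine-wide default when score_threshold is None is an assumption based on the default_score_threshold parameter of the engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical, simplified stand-in for RecognizerResult."""

    entity_type: str
    start: int
    end: int
    score: float


def remove_low_scores(
    results: List[Result],
    score_threshold: Optional[float] = None,
    default_score_threshold: float = 0.0,
) -> List[Result]:
    # Assumed behavior: use the engine-wide default when no per-call value is given.
    threshold = default_score_threshold if score_threshold is None else score_threshold
    # Keep only results whose confidence reaches the threshold.
    return [result for result in results if result.score >= threshold]


results = [
    Result("PHONE_NUMBER", 19, 31, 0.85),
    Result("DATE_TIME", 19, 31, 0.1),
]
kept = remove_low_scores(results, score_threshold=0.5)
print([result.entity_type for result in kept])
```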
analyze(text, language, entities=None, correlation_id=None, score_threshold=None, return_decision_process=False, ad_hoc_recognizers=None, context=None, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, nlp_artifacts=None)
Find PII entities in text using different PII recognizers for a given language.
:example:
    from presidio_analyzer import AnalyzerEngine

    # Set up the engine, loads the NLP module (spaCy model by default)
    # and other PII recognizers
    analyzer = AnalyzerEngine()

    # Call analyzer to get results
    results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en')
    print(results)
    # [type: PHONE_NUMBER, start: 19, end: 31, score: 0.85]
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | the text to analyze | required |
language | str | the language of the text | required |
entities | Optional[List[str]] | List of PII entities that should be looked for in the text. If entities=None then all entities are looked for. | None |
correlation_id | Optional[str] | cross call ID for this request | None |
score_threshold | Optional[float] | A minimum value for which to return an identified entity | None |
return_decision_process | Optional[bool] | Whether the analysis decision process steps are returned in the response. | False |
ad_hoc_recognizers | Optional[List[EntityRecognizer]] | List of recognizers which will be used only for this specific request. | None |
context | Optional[List[str]] | List of context words to enhance confidence score if matched with the recognized entity's recognizer context | None |
allow_list | Optional[List[str]] | List of words that the user defines as being allowed to keep in the text | None |
allow_list_match | Optional[str] | How the allow_list should be interpreted; either as "exact" or as "regex". - If | 'exact' |
regex_flags | Optional[int] | regex flags to be used when allow_list_match is "regex" | DOTALL \| MULTILINE \| IGNORECASE |
nlp_artifacts | Optional[NlpArtifacts] | precomputed NlpArtifacts | None |
Returns:
Type | Description |
---|---|
List[RecognizerResult] | an array of the found entities in the text |
Source code in presidio_analyzer/analyzer_engine.py
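The two allow_list_match modes can be illustrated with a stdlib-only sketch. The description of allow_list_match is cut short in the table above, so the reading here is an assumption: in "exact" mode a detected span is dropped when its text equals an allow-list entry, and in "regex" mode when any allow-list entry, treated as a regular expression, matches the span's text. The function name `apply_allow_list` and the plain `(start, end)` spans are hypothetical.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) offsets of a detected entity

DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE


def apply_allow_list(
    text: str,
    spans: List[Span],
    allow_list: List[str],
    allow_list_match: str = "exact",
    regex_flags: int = DEFAULT_FLAGS,
) -> List[Span]:
    if allow_list_match == "regex":
        # Treat every allow-list entry as an alternative in one compiled regex.
        compiled = re.compile("|".join(allow_list), flags=regex_flags)
        return [(s, e) for s, e in spans if not compiled.search(text[s:e])]
    if allow_list_match == "exact":
        # Drop a span only when its text is literally present in the allow list.
        return [(s, e) for s, e in spans if text[s:e] not in allow_list]
    raise ValueError("allow_list_match must be 'exact' or 'regex'")


text = "Contact Bill or Alice"
spans = [(8, 12), (16, 21)]  # "Bill" and "Alice"
print(apply_allow_list(text, spans, ["Bill"]))
print(apply_allow_list(text, spans, ["^al.*"], allow_list_match="regex"))
```

Under this reading, the regex flags matter only in "regex" mode, where IGNORECASE lets the pattern `^al.*` match "Alice".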
get_recognizers(language=None)
Return a list of PII recognizers currently loaded.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
language | Optional[str] | Return the recognizers supporting a given language. | None |
Returns:
Type | Description |
---|---|
List[EntityRecognizer] | List of [Recognizer] as a RecognizersAllResponse |
Source code in presidio_analyzer/analyzer_engine.py
get_supported_entities(language=None)
Return a list of the entities that can be detected.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
language | Optional[str] | Return only entities supported in a specific language. | None |
Returns:
Type | Description |
---|---|
List[str] | List of entity names |
Source code in presidio_analyzer/analyzer_engine.py
AnalyzerEngineProvider
Utility class for loading Presidio Analyzer.
Use this class to load a Presidio analyzer engine from a YAML file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analyzer_engine_conf_file | Optional[Union[Path, str]] | the path to the analyzer configuration file | None |
nlp_engine_conf_file | Optional[Union[Path, str]] | the path to the nlp engine configuration file | None |
recognizer_registry_conf_file | Optional[Union[Path, str]] | the path to the recognizer registry configuration file | None |
Source code in presidio_analyzer/analyzer_engine_provider.py
create_engine()
Load Presidio Analyzer from yaml configuration file.
Returns:
Type | Description |
---|---|
AnalyzerEngine | analyzer engine initialized with yaml configuration |
Source code in presidio_analyzer/analyzer_engine_provider.py
get_configuration(conf_file)
Retrieve the analyzer engine configuration from the provided file.
Source code in presidio_analyzer/analyzer_engine_provider.py
AnalyzerRequest
Analyzer request data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
req_data | Dict | A request dictionary with the following fields: text: the text to analyze; language: the language of the text; entities: List of PII entities that should be looked for in the text. If entities=None then all entities are looked for; correlation_id: cross call ID for this request; score_threshold: A minimum value for which to return an identified entity; log_decision_process: Should the decision points within the analysis be logged; return_decision_process: Should the decision points within the analysis be returned as part of the response | required |
Source code in presidio_analyzer/analyzer_request.py
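A request dictionary with the fields listed above might look like the following sketch. All values are illustrative, and the correlation id in particular is a made-up placeholder; only the field names come from the table.

```python
import json

# Illustrative request payload; the keys follow the field list above.
req_data = {
    "text": "My phone number is 212-555-5555",
    "language": "en",
    "entities": ["PHONE_NUMBER"],  # None would mean: look for all entities
    "correlation_id": "hypothetical-correlation-id",
    "score_threshold": 0.5,
    "log_decision_process": False,
    "return_decision_process": True,
}

# Serialize the request, for example to send it as a JSON body.
payload = json.dumps(req_data)
print(payload)
```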
BatchAnalyzerEngine
Batch analysis of documents (tables, lists, dicts).
Wrapper class to run Presidio Analyzer Engine on multiple values, either lists/iterators of strings, or dictionaries.
Source code in presidio_analyzer/batch_analyzer_engine.py
analyze_dict(input_dict, language, keys_to_skip=None, **kwargs)
Analyze a dictionary of keys (strings) and values/iterable of values.
Non-string values are returned as is.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_dict | Dict[str, Union[Any, Iterable[Any]]] | The input dictionary for analysis | required |
language | str | Input language | required |
keys_to_skip | Optional[List[str]] | Keys to ignore during analysis | None |
kwargs | | Additional keyword arguments for the | {} |
Source code in presidio_analyzer/batch_analyzer_engine.py
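The traversal that analyze_dict performs can be pictured with a stdlib-only sketch. Everything here is an assumption made for illustration: `find_digits` is a hypothetical stand-in for the real analyzer, results are plain tuples rather than DictAnalyzerResult objects, and the same keys_to_skip list is reused for nested dictionaries.

```python
import re
from typing import Any, Dict, Iterator, List, Optional, Tuple

Span = Tuple[int, int]


def find_digits(text: str) -> List[Span]:
    """Hypothetical stand-in for the analyzer: flags runs of digits."""
    return [match.span() for match in re.finditer(r"\d+", text)]


def analyze_dict_sketch(
    input_dict: Dict[str, Any], keys_to_skip: Optional[List[str]] = None
) -> Iterator[Tuple[str, Any, Any]]:
    skip = set(keys_to_skip or [])
    for key, value in input_dict.items():
        if key in skip:
            yield key, value, []  # skipped keys carry no findings
        elif isinstance(value, str):
            yield key, value, find_digits(value)
        elif isinstance(value, dict):
            # Nested dictionaries yield a nested iterator of results.
            yield key, value, analyze_dict_sketch(value, keys_to_skip)
        elif isinstance(value, (list, tuple)):
            # One list of findings per item; non-string items get none.
            yield key, value, [
                find_digits(item) if isinstance(item, str) else [] for item in value
            ]
        else:
            yield key, value, []  # non-string values are returned as is


doc = {
    "id": 7,
    "phone": "call 5550100",
    "notes": ["none", "ext 42"],
    "skip_me": "123",
    "address": {"zip": "10001"},
}
out = list(analyze_dict_sketch(doc, keys_to_skip=["skip_me"]))
print(out[1])
```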
analyze_iterator(texts, language, batch_size=None, **kwargs)
Analyze an iterable of strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
texts | Iterable[Union[str, bool, float, int]] | A list containing strings to be analyzed. | required |
language | str | Input language | required |
batch_size | Optional[int] | Batch size to process in a single iteration | None |
kwargs | | Additional parameters for the | {} |
Source code in presidio_analyzer/batch_analyzer_engine.py
ContextAwareEnhancer
A class representing an abstract context aware enhancer.
Context words might enhance the confidence score of a recognized entity. ContextAwareEnhancer is an abstract class to be inherited by classes that implement context-aware enhancement logic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context_similarity_factor | float | How much to enhance confidence of match entity | required |
min_score_with_context_similarity | float | Minimum confidence score | required |
context_prefix_count | int | how many words before the entity to match context | required |
context_suffix_count | int | how many words after the entity to match context | required |
Source code in presidio_analyzer/context_aware_enhancers/context_aware_enhancer.py
enhance_using_context(text, raw_results, nlp_artifacts, recognizers, context=None)
abstractmethod
Update results in case surrounding words are relevant to the context words.
Using the words surrounding the actual match, look for specific strings that, if found, contribute to the score of the result, improving the confidence that the match is indeed of that PII entity type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The actual text that was analyzed | required |
raw_results | List[RecognizerResult] | Recognizer results which didn't take context into consideration | required |
nlp_artifacts | NlpArtifacts | The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process | required |
recognizers | List[EntityRecognizer] | the list of recognizers | required |
context | Optional[List[str]] | list of context words | None |
Source code in presidio_analyzer/context_aware_enhancers/context_aware_enhancer.py
DictAnalyzerResult
dataclass
Data class for holding the output of the Presidio Analyzer on dictionaries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | key in dictionary | required |
value | Union[str, List[str], dict] | value to run analysis on (either string or list of strings) | required |
recognizer_results | Union[List[RecognizerResult], List[List[RecognizerResult]], Iterator[DictAnalyzerResult]] | Analyzer output for one value. Could be either: a list of recognizer results, if the input is one string; a list of lists of recognizer results, if the input is a list of strings; or an iterator of DictAnalyzerResult, if the input is a dictionary. In this last case the recognizer_results would be the iterator over the DictAnalyzerResults of the next level in the dictionary. | required |
Source code in presidio_analyzer/dict_analyzer_result.py
EntityRecognizer
A class representing an abstract PII entity recognizer.
EntityRecognizer is an abstract class to be inherited by Recognizers which hold the logic for recognizing specific PII entities.
EntityRecognizer exposes a method called enhance_using_context which can be overridden in case a custom context aware enhancement is needed in derived class of a recognizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
supported_entities | List[str] | the entities supported by this recognizer (for example, phone number, address, etc.) | required |
supported_language | str | the language supported by this recognizer. The supported language code is iso6391Name | 'en' |
name | str | the name of this recognizer (optional) | None |
version | str | the recognizer current version | '0.0.1' |
context | Optional[List[str]] | a list of words which can help boost confidence score when they appear in context of the matched entity | None |
Source code in presidio_analyzer/entity_recognizer.py
id
property
Return a unique identifier of this recognizer.
analyze(text, entities, nlp_artifacts)
abstractmethod
Analyze text to identify entities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text to be analyzed | required |
entities | List[str] | The list of entities this recognizer is able to detect | required |
nlp_artifacts | NlpArtifacts | A group of attributes which are the result of an NLP process over the input text. | required |
Returns:
Type | Description |
---|---|
List[RecognizerResult] | List of results detected by this recognizer. |
Source code in presidio_analyzer/entity_recognizer.py
enhance_using_context(text, raw_recognizer_results, other_raw_recognizer_results, nlp_artifacts, context=None)
Enhance confidence score using context of the entity.
Override this method in a derived class in case custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.
In case a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The actual text that was analyzed | required |
raw_recognizer_results | List[RecognizerResult] | This recognizer's results, to be updated based on recognizer specific context. | required |
other_raw_recognizer_results | List[RecognizerResult] | Other recognizer results matched in the given text to allow related entity context enhancement | required |
nlp_artifacts | NlpArtifacts | The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process | required |
context | Optional[List[str]] | list of context words | None |
Source code in presidio_analyzer/entity_recognizer.py
from_dict(entity_recognizer_dict)
classmethod
Create EntityRecognizer from a dict input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
entity_recognizer_dict | Dict | Dict containing keys and values for instantiation | required |
Source code in presidio_analyzer/entity_recognizer.py
get_supported_entities()
Return the list of entities this recognizer can identify.
Returns:
Type | Description |
---|---|
List[str] | A list of the supported entities by this recognizer |
Source code in presidio_analyzer/entity_recognizer.py
get_supported_language()
Return the language this recognizer can support.
Returns:
Type | Description |
---|---|
str | The language supported by this recognizer |
Source code in presidio_analyzer/entity_recognizer.py
get_version()
Return the version of this recognizer.
Returns:
Type | Description |
---|---|
str | The current version of this recognizer |
Source code in presidio_analyzer/entity_recognizer.py
load()
abstractmethod
Initialize the recognizer assets if needed (e.g. machine learning models).
Source code in presidio_analyzer/entity_recognizer.py
remove_duplicates(results)
staticmethod
Remove duplicate results.
Remove duplicates when two results have identical start, end, and entity type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
results | List[RecognizerResult] | List[RecognizerResult] | required |
Returns:
Type | Description |
---|---|
List[RecognizerResult] | List[RecognizerResult] |
Source code in presidio_analyzer/entity_recognizer.py
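The duplicate rule stated for remove_duplicates (same start, same end, same type) can be sketched as below. The `Result` dataclass is a hypothetical stand-in for RecognizerResult, and keeping the highest-scoring copy of each duplicate group is an assumption; the description above does not say which copy survives.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Result:
    """Hypothetical, simplified stand-in for RecognizerResult."""

    entity_type: str
    start: int
    end: int
    score: float


def remove_exact_duplicates(results: List[Result]) -> List[Result]:
    best: Dict[Tuple[str, int, int], Result] = {}
    for result in results:
        key = (result.entity_type, result.start, result.end)
        # Assumed tie-break: keep the copy with the highest score.
        if key not in best or result.score > best[key].score:
            best[key] = result
    # Dicts preserve insertion order, so first-seen keys come first.
    return list(best.values())


results = [
    Result("PHONE_NUMBER", 0, 5, 0.4),
    Result("PHONE_NUMBER", 0, 5, 0.8),
    Result("PERSON", 0, 5, 0.6),
]
deduplicated = remove_exact_duplicates(results)
print([(r.entity_type, r.score) for r in deduplicated])
```

Results that share a span but differ in entity type are kept, since the rule requires all three fields to match.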
to_dict()
Serialize self to dictionary.
Returns:
Type | Description |
---|---|
Dict | a dictionary |
Source code in presidio_analyzer/entity_recognizer.py
LemmaContextAwareEnhancer
Bases: ContextAwareEnhancer
A class representing a lemma based context aware enhancer logic.
Context words might enhance the confidence score of a recognized entity. LemmaContextAwareEnhancer is an implementation of lemma-based context-aware logic: it compares the spaCy lemmas of each word in the context of the matched entity to the given context and the recognizer's context words, and if they match, it enhances the recognized entity's confidence score by a given factor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context_similarity_factor | float | How much to enhance confidence of match entity | 0.35 |
min_score_with_context_similarity | float | Minimum confidence score | 0.4 |
context_prefix_count | int | how many words before the entity to match context | 5 |
context_suffix_count | int | how many words after the entity to match context | 0 |
Source code in presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py
enhance_using_context(text, raw_results, nlp_artifacts, recognizers, context=None)
Update results in case the lemmas of surrounding words or input context words are identical to the context words.
Using the words surrounding the actual match, look for specific strings that, if found, contribute to the score of the result, improving the confidence that the match is indeed of that PII entity type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The actual text that was analyzed | required |
raw_results | List[RecognizerResult] | Recognizer results which didn't take context into consideration | required |
nlp_artifacts | NlpArtifacts | The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process | required |
recognizers | List[EntityRecognizer] | the list of recognizers | required |
context | Optional[List[str]] | list of context words | None |
Source code in presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py
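How the four enhancer parameters interact can be sketched without spaCy. This is a simplification under assumptions: plain lowercased tokens stand in for lemmas, the helper names `context_window` and `enhance_score` are hypothetical, and the score update (add the factor, raise to the minimum, cap at 1.0) is an assumed reading of the parameter descriptions.

```python
from typing import List


def context_window(
    text: str, start: int, end: int, prefix_count: int = 5, suffix_count: int = 0
) -> List[str]:
    before = text[:start].split()
    after = text[end:].split()
    # Guard against before[-0:], which would return the whole list.
    prefix = before[-prefix_count:] if prefix_count > 0 else []
    suffix = after[:suffix_count]
    # Lowercased tokens replace lemmas in this sketch.
    return [word.strip(".,:;!?").lower() for word in prefix + suffix]


def enhance_score(
    score: float,
    window: List[str],
    context_words: List[str],
    context_similarity_factor: float = 0.35,
    min_score_with_context_similarity: float = 0.4,
) -> float:
    if not any(word.lower() in window for word in context_words):
        return score  # no supporting context word nearby
    boosted = max(score + context_similarity_factor, min_score_with_context_similarity)
    return min(boosted, 1.0)  # confidence never exceeds 1.0


text = "My phone number is 212-555-5555"
window = context_window(text, start=19, end=31)
print(window)
print(enhance_score(0.4, window, ["phone", "telephone"]))
```

With the default suffix count of 0, only words before the entity are considered, so a context word that follows the match would not boost the score.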
LocalRecognizer
Bases: ABC, EntityRecognizer
PII entity recognizer which runs on the same process as the AnalyzerEngine.
Source code in presidio_analyzer/local_recognizer.py
Pattern
A class that represents a regex pattern.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | the name of the pattern | required |
regex | str | the regex pattern to detect | required |
score | float | the pattern's strength (value varies between 0 and 1) | required |
Source code in presidio_analyzer/pattern.py
__repr__()
Return string representation of instance.
Source code in presidio_analyzer/pattern.py
__str__()
Return string representation of instance.
Source code in presidio_analyzer/pattern.py
from_dict(pattern_dict)
classmethod
Load an instance from a dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern_dict | Dict | a dictionary holding the pattern's parameters | required |
Returns:
Type | Description |
---|---|
Pattern | a Pattern instance |
Source code in presidio_analyzer/pattern.py
to_dict()
Turn this instance into a dictionary.
Returns:
Type | Description |
---|---|
Dict | a dictionary |
Source code in presidio_analyzer/pattern.py
PatternRecognizer
Bases: LocalRecognizer
PII entity recognizer using regular expressions or deny-lists.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
patterns | List[Pattern] | A list of patterns to detect | None |
deny_list | List[str] | A list of words to detect, in case our recognizer uses a predefined list of words (deny list) | None |
context | List[str] | list of context words | None |
deny_list_score | float | confidence score for a term identified using a deny-list | 1.0 |
global_regex_flags | Optional[int] | regex flags to be used in regex matching, including deny-lists. | DOTALL \| MULTILINE \| IGNORECASE |
Source code in presidio_analyzer/pattern_recognizer.py
__analyze_patterns(text, flags=None)
Evaluate all patterns in the provided text, including words in the provided deny-list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | text to analyze | required |
flags | int | regex flags | None |
Returns:
Type | Description |
---|---|
List[RecognizerResult] | A list of RecognizerResult |
Source code in presidio_analyzer/pattern_recognizer.py
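Evaluating patterns and a deny-list over a text can be sketched with the standard `re` module. The helper names, the `(name, regex, score)` tuples, and the way the deny-list is turned into a single regex are assumptions for illustration; only the default flag combination comes from the parameter table above.

```python
import re
from typing import List, Tuple

DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

PatternSpec = Tuple[str, str, float]  # (name, regex, score)
Match = Tuple[str, int, int, float]  # (pattern name, start, end, score)


def deny_list_to_regex(deny_list: List[str]) -> str:
    # Escape each term and require that it is not glued to other word characters.
    escaped = "|".join(re.escape(term) for term in deny_list)
    return rf"(?<!\w)(?:{escaped})(?!\w)"


def analyze_patterns(
    text: str, patterns: List[PatternSpec], flags: int = DEFAULT_FLAGS
) -> List[Match]:
    matches: List[Match] = []
    for name, regex, score in patterns:
        for found in re.finditer(regex, text, flags=flags):
            if found.group():  # ignore empty matches
                matches.append((name, found.start(), found.end(), score))
    return matches


patterns = [
    ("title_deny_list", deny_list_to_regex(["Mr.", "Mrs."]), 1.0),
    ("zip_weak", r"\b\d{5}\b", 0.1),
]
text = "Mr. Smith called mrs. Jones from 10001"
for match in analyze_patterns(text, patterns):
    print(match)
```

Because IGNORECASE is part of the default flags, the lowercase "mrs." is found even though the deny-list entry is capitalized.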
analyze(text, entities, nlp_artifacts=None, regex_flags=None)
Analyzes text to detect PII using regular expressions or deny-lists.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text to be analyzed | required |
entities | List[str] | Entities this recognizer can detect | required |
nlp_artifacts | Optional[NlpArtifacts] | Output values from the NLP engine | None |
regex_flags | Optional[int] | regex flags to be used in regex matching | None |
Returns:
Type | Description |
---|---|
List[RecognizerResult] | |
Source code in presidio_analyzer/pattern_recognizer.py
build_regex_explanation(recognizer_name, pattern_name, pattern, original_score, validation_result, regex_flags)
staticmethod
Construct an explanation for why this entity was detected.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizer_name | str | Name of recognizer detecting the entity | required |
pattern_name | str | Regex pattern name which detected the entity | required |
pattern | str | Regex pattern logic | required |
original_score | float | Score given by the recognizer | required |
validation_result | bool | Whether validation was used and its result | required |
regex_flags | int | Regex flags used in the regex matching | required |
Returns:
Type | Description |
---|---|
AnalysisExplanation | Analysis explanation |
Source code in presidio_analyzer/pattern_recognizer.py
from_dict(entity_recognizer_dict)
classmethod
Create instance from a serialized dict.
Source code in presidio_analyzer/pattern_recognizer.py
invalidate_result(pattern_text)
Check whether a result should be invalidated by running pruning logic.
For example, each SSN number group should not consist of all the same digits.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern_text | str | the text to be validated. Only the part in text that was detected by the regex engine | required |
Returns:
Type | Description |
---|---|
Optional[bool] | A bool indicating whether the result is invalidated |
Source code in presidio_analyzer/pattern_recognizer.py
to_dict()
Serialize instance into a dictionary.
Source code in presidio_analyzer/pattern_recognizer.py
validate_result(pattern_text)
Validate the pattern logic e.g., by running checksum on a detected pattern.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern_text | str | the text to be validated. Only the part in text that was detected by the regex engine | required |
Returns:
Type | Description |
---|---|
Optional[bool] | A bool indicating whether the validation was successful. |
Source code in presidio_analyzer/pattern_recognizer.py
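As an illustration of checksum-based validation, the sketch below applies the Luhn algorithm to the text matched by a regex. Luhn is chosen here only as a familiar example of a checksum, and the function name `luhn_is_valid` is hypothetical; a subclass would typically call such a helper from its own validate_result override.

```python
def luhn_is_valid(pattern_text: str) -> bool:
    # Keep digits only, so separators such as dashes or spaces are ignored.
    digits = [int(char) for char in pattern_text if char.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_is_valid("4111-1111-1111-1111"))
print(luhn_is_valid("4111-1111-1111-1112"))
```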
PresidioAnalyzerUtils
Utility functions for Presidio Analyzer.
The class provides a bundle of utility functions that help centralize the logic for reusability and maintainability.
Source code in presidio_analyzer/analyzer_utils.py
is_palindrome(text, case_insensitive=False)
staticmethod
Validate if input text is a true palindrome.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | input text string to check for palindrome | required |
case_insensitive | bool | optional flag to check palindrome with no case | False |
Returns:
Type | Description |
---|---|
True / False |
Source code in presidio_analyzer/analyzer_utils.py
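A palindrome check with an optional case-insensitive mode can be sketched as follows. This is a minimal stand-in, not the library's code: the name `is_palindrome_sketch` is hypothetical, and it only lowercases the text, so any extra normalization the real helper may apply is not reproduced.

```python
def is_palindrome_sketch(text: str, case_insensitive: bool = False) -> bool:
    # Optionally ignore case before comparing the text with its reverse.
    candidate = text.lower() if case_insensitive else text
    return candidate == candidate[::-1]


print(is_palindrome_sketch("ABCBA"))
print(is_palindrome_sketch("AbcBa"))
print(is_palindrome_sketch("AbcBa", case_insensitive=True))
```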
is_verhoeff_number(input_number)
staticmethod
Check if the input number is a true Verhoeff number.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_number | int | | required |
Returns:
Type | Description |
---|---|
|
Source code in presidio_analyzer/analyzer_utils.py
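Since the parameter and return descriptions are empty here, a sketch of the standard Verhoeff check may help. It assumes the method implements the usual Verhoeff algorithm, in which a number is valid when the running checksum over its digits, read right to left, ends at 0. The function name `verhoeff_check` is hypothetical.

```python
# Multiplication table of the dihedral group D5, used by the Verhoeff scheme.
_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)

# Position-dependent permutations; row i is the base permutation applied i times.
_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)


def verhoeff_check(input_number: int) -> bool:
    """Return True when a non-negative integer passes the Verhoeff checksum."""
    checksum = 0
    # Process digits from right to left; the position selects the permutation.
    for position, char in enumerate(reversed(str(input_number))):
        checksum = _D[checksum][_P[position % 8][int(char)]]
    return checksum == 0


print(verhoeff_check(2363))
print(verhoeff_check(2364))
```

Unlike simpler checksums, the Verhoeff scheme catches all single-digit errors and all transpositions of adjacent digits.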
sanitize_value(text, replacement_pairs)
staticmethod
Cleanse the input string of the replacement pairs specified as argument.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | input string | required |
replacement_pairs | List[Tuple[str, str]] | pairs of what has to be replaced with which value | required |
Returns:
Type | Description |
---|---|
str | cleansed string |
Source code in presidio_analyzer/analyzer_utils.py
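Applying replacement pairs in order can be sketched as below. The name `sanitize_value_sketch` is hypothetical, and applying the pairs sequentially, each on the output of the previous one, is an assumption about the intended behavior.

```python
from typing import List, Tuple


def sanitize_value_sketch(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    # Apply each (search, replacement) pair in turn to the running text.
    for search, replacement in replacement_pairs:
        text = text.replace(search, replacement)
    return text


print(sanitize_value_sketch("078-05 1120", [("-", ""), (" ", "")]))
```

A typical use is stripping separators from a matched value before running a checksum on it.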
RecognizerRegistry
Detect, register and hold all recognizers to be used by the analyzer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizers | Optional[Iterable[EntityRecognizer]] | An optional list of recognizers, that will be available instead of the predefined recognizers | None |
global_regex_flags | | regex flags to be used in regex matching, including deny-lists | required |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
__instantiate_recognizer(recognizer_class, supported_language)
Instantiate a recognizer class given type and input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizer_class | Type[EntityRecognizer] | Class object of the recognizer | required |
supported_language | str | Language this recognizer should support | required |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
add_nlp_recognizer(nlp_engine)
Add an NLP recognizer in accordance with the NLP engine.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nlp_engine | NlpEngine | The NLP engine. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
add_pattern_recognizer_from_dict(recognizer_dict)
Load a pattern recognizer from a Dict into the recognizer registry.
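Example:

    from presidio_analyzer import RecognizerRegistry

    registry = RecognizerRegistry()
    # Serialized PatternRecognizer: a deny-list of titles for German text
    recognizer = {
        "name": "Titles Recognizer",
        "supported_language": "de",
        "supported_entity": "TITLE",
        "deny_list": ["Mr.", "Mrs."],
    }
    registry.add_pattern_recognizer_from_dict(recognizer)
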
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizer_dict | Dict | Dict holding a serialization of a PatternRecognizer | required |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
add_recognizer(recognizer)
Add a new recognizer to the list of recognizers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizer | EntityRecognizer | Recognizer to add | required |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
add_recognizers_from_yaml(yml_path)
Read YAML file and load recognizers into the recognizer registry.
See an example YAML file here: https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/example_recognizers.yaml
Example:

    from presidio_analyzer import RecognizerRegistry

    # Path to a YAML file that defines the recognizers to load
    yaml_file = "recognizers.yaml"
    registry = RecognizerRegistry()
    registry.add_recognizers_from_yaml(yaml_file)

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
get_recognizers(language, entities=None, all_fields=False, ad_hoc_recognizers=None)
Return a list of recognizers that support the specified entities and language.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
entities | Optional[List[str]] | the requested entities | None |
language | str | the requested language | required |
all_fields | bool | a flag to return all fields of a requested language. | False |
ad_hoc_recognizers | Optional[List[EntityRecognizer]] | Additional recognizers provided by the user as part of the request | None |
Returns:
Type | Description |
---|---|
List[EntityRecognizer] | A list of the recognizers that support the supplied entities and language |
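The selection described above can be sketched without the library. This is an illustration under stated assumptions, not the registry's implementation: `Rec` and `select_recognizers` are hypothetical names, and the sketch assumes that `all_fields=True` skips the entity filter entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a recognizer (illustration only)."""
    name: str
    supported_entities: List[str]
    supported_language: str


def select_recognizers(
    recognizers: List[Rec],
    language: str,
    entities: Optional[List[str]] = None,
    all_fields: bool = False,
) -> List[Rec]:
    """Keep recognizers for the language that cover a requested entity."""
    by_language = [r for r in recognizers if r.supported_language == language]
    if all_fields:
        # Assumption: all_fields returns everything for the language.
        return by_language
    wanted = set(entities or [])
    return [r for r in by_language if wanted & set(r.supported_entities)]


available = [
    Rec("EmailRecognizer", ["EMAIL_ADDRESS"], "en"),
    Rec("PhoneRecognizer", ["PHONE_NUMBER"], "en"),
    Rec("EsNifRecognizer", ["ES_NIF"], "es"),
]
selected = select_recognizers(available, "en", entities=["PHONE_NUMBER"])
print([r.name for r in selected])
```

Ad hoc recognizers supplied with a request could be appended to `available` before filtering; that step is left out to keep the sketch short.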
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
get_supported_entities(languages=None)
Return the entities supported by the set of loaded recognizers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
languages | Optional[List[str]] | The languages to get the supported entities for. If languages=None, returns all entities for all languages. | None |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
load_predefined_recognizers(languages=None, nlp_engine=None)
Load the existing recognizers into memory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
languages | Optional[List[str]] | List of languages for which to load recognizers | None |
nlp_engine | NlpEngine | The NLP engine to use. | None |
Returns:
Type | Description |
---|---|
None | None |
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
remove_recognizer(recognizer_name, language=None)
Remove a recognizer based on its name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
recognizer_name | str | Name of recognizer to remove | required |
language | Optional[str] | The supported language of the recognizer to be removed, in case multiple recognizers with the same name are present, and only one should be removed. | None |
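The role of the optional `language` argument is easiest to see on a toy list. The sketch below is a hedged illustration only: recognizers are modelled as hypothetical `(name, supported_language)` tuples, and `remove_by_name` is an invented helper, not a registry method.

```python
from typing import List, Optional, Tuple

# Each entry is (name, supported_language); a stand-in for recognizer objects.
Entry = Tuple[str, str]


def remove_by_name(
    entries: List[Entry], name: str, language: Optional[str] = None
) -> List[Entry]:
    """Drop entries matching the name, and the language when one is given."""
    return [
        (n, lang)
        for n, lang in entries
        if not (n == name and (language is None or lang == language))
    ]


entries = [
    ("TitlesRecognizer", "en"),
    ("TitlesRecognizer", "de"),
    ("ZipRecognizer", "en"),
]
only_german_removed = remove_by_name(entries, "TitlesRecognizer", language="de")
all_titles_removed = remove_by_name(entries, "TitlesRecognizer")
print(only_german_removed)
print(all_titles_removed)
```

With no language given, every entry sharing the name is dropped; with a language, only the matching one is.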
Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
RecognizerResult
Represent the findings for a detected entity: the result of a recognizer analyzing the text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
entity_type | str | the type of the entity | required |
start | int | the start location of the detected entity | required |
end | int | the end location of the detected entity | required |
score | float | the score of the detection | required |
analysis_explanation | AnalysisExplanation | contains the explanation of why this entity was identified | None |
recognition_metadata | Dict | a dictionary of metadata to be used in recognizer specific cases, for example specific recognized context words and recognizer name | None |
Source code in presidio_analyzer/recognizer_result.py
__eq__(other)
Check whether two results are equal by comparing all class fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | RecognizerResult | another RecognizerResult | required |
Returns:
Type | Description |
---|---|
bool | bool |
Source code in presidio_analyzer/recognizer_result.py
__gt__(other)
Check if one result is greater than another by comparing the results' indices in the text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | RecognizerResult | another RecognizerResult | required |
Returns:
Type | Description |
---|---|
bool | bool |
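An index-based ordering of this kind can be sketched as follows. The sketch assumes results are compared by `start` first and by `end` when the starts are equal; `Span` is a hypothetical stand-in that keeps only the two indices.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in holding only the indices of a result."""
    start: int
    end: int

    def __gt__(self, other: "Span") -> bool:
        # Assumption: compare by start, then by end when starts are equal.
        if self.start == other.start:
            return self.end > other.end
        return self.start > other.start


spans = [Span(10, 15), Span(0, 4), Span(0, 9)]
# sorted() uses "<", which falls back to the reflected __gt__ defined above.
ordered = sorted(spans)
print(ordered)
```

Defining only `__gt__` is enough for `sorted` here because Python tries the reflected comparison when `__lt__` is not implemented.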
Source code in presidio_analyzer/recognizer_result.py
__hash__()
Hash the result data by using all class fields.
Returns:
Type | Description |
---|---|
int |
Source code in presidio_analyzer/recognizer_result.py
__repr__()
Return a string representation of the instance.
Source code in presidio_analyzer/recognizer_result.py
__str__()
Return a string representation of the instance.
Source code in presidio_analyzer/recognizer_result.py
append_analysis_explanation_text(text)
Add text to the analysis explanation.
Source code in presidio_analyzer/recognizer_result.py
contained_in(other)
Check if self is contained in a different RecognizerResult.
Returns:
Type | Description |
---|---|
bool | true if contained |
Source code in presidio_analyzer/recognizer_result.py
contains(other)
Check if this result contains, or has the same indices as, another result.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | RecognizerResult | another RecognizerResult | required |
Returns:
Type | Description |
---|---|
bool | bool |
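Containment between two results reduces to a comparison of their indices. The sketch below is a hedged illustration: `Interval` is a hypothetical stand-in, and it assumes that equal indices count as containment in both directions.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical stand-in for the start/end indices of a result."""
    start: int
    end: int

    def contains(self, other: "Interval") -> bool:
        # True when the other interval lies fully inside this one.
        return self.start <= other.start and self.end >= other.end

    def contained_in(self, other: "Interval") -> bool:
        # The mirror image of contains().
        return other.contains(self)


full_name = Interval(0, 14)
first_name = Interval(0, 6)
print(full_name.contains(first_name))
print(first_name.contained_in(full_name))
```

Under this assumption an interval both contains and is contained in an identical interval.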
Source code in presidio_analyzer/recognizer_result.py
equal_indices(other)
Check if the indices are equal between two results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | RecognizerResult | another RecognizerResult | required |
Returns:
Type | Description |
---|---|
bool | |
Source code in presidio_analyzer/recognizer_result.py
from_json(data)
classmethod
Create RecognizerResult from json.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Dict | e.g. { "start": 24, "end": 32, "score": 0.8, "entity_type": "NAME" } | required |
Returns:
Type | Description |
---|---|
RecognizerResult | RecognizerResult |
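The payload shape shown in the parameter description can be exercised with a small stand-alone sketch. `Result` is a hypothetical stand-in with only the four documented keys; it illustrates the dictionary round trip and is not the library class.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class Result:
    """Hypothetical stand-in with the four fields of the example payload."""
    entity_type: str
    start: int
    end: int
    score: float

    @classmethod
    def from_json(cls, data: Dict) -> "Result":
        # Read only the documented keys; extra keys are ignored in this sketch.
        return cls(
            entity_type=data["entity_type"],
            start=data["start"],
            end=data["end"],
            score=data["score"],
        )


payload = {"start": 24, "end": 32, "score": 0.8, "entity_type": "NAME"}
result = Result.from_json(payload)
round_tripped = asdict(result)
print(result)
```

Despite the method name, the input is an already-parsed dictionary rather than a JSON string, as the `Dict` type in the table indicates.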
Source code in presidio_analyzer/recognizer_result.py
has_conflict(other)
Check whether this result conflicts with another recognizer result.
This result has a conflict if either of the following holds:
1. Its indices are the same as the other's and its score is lower.
2. Its indices are contained in the other's.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | RecognizerResult | RecognizerResult | required |
Returns:
Type | Description |
---|---|
bool | |
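The two conflict conditions can be written out as a short sketch. `Scored` is a hypothetical stand-in, and the treatment of equal scores (no conflict here) is an assumption of this sketch; the library may break ties differently.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    """Hypothetical stand-in: indices plus a confidence score."""
    start: int
    end: int
    score: float

    def has_conflict(self, other: "Scored") -> bool:
        same_indices = self.start == other.start and self.end == other.end
        if same_indices:
            # Condition 1: identical span, and this result scores lower.
            # Assumption: a tie is not treated as a conflict in this sketch.
            return self.score < other.score
        # Condition 2: this span lies inside the other span.
        return other.start <= self.start and self.end <= other.end


weak = Scored(0, 10, 0.5)
strong = Scored(0, 10, 0.9)
inner = Scored(2, 5, 0.99)
print(weak.has_conflict(strong), strong.has_conflict(weak))
print(inner.has_conflict(strong), strong.has_conflict(inner))
```

Note that a contained span conflicts regardless of its score, so `inner` loses to `strong` even though its score is higher.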
Source code in presidio_analyzer/recognizer_result.py
intersects(other)
Check if self intersects with a different RecognizerResult.
Returns:
Type | Description |
---|---|
int | If intersecting, returns the number of intersecting characters. If not, returns 0 |
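The number of intersecting characters is a standard interval-overlap computation. The following is a minimal sketch assuming half-open `[start, end)` character spans; `overlap_length` is a hypothetical helper, not part of the library.

```python
def overlap_length(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Return how many characters two [start, end) spans share, or 0."""
    # The shared region runs from the later start to the earlier end.
    return max(0, min(end_a, end_b) - max(start_a, start_b))


print(overlap_length(0, 10, 5, 15))  # partially overlapping spans
print(overlap_length(0, 5, 7, 9))    # disjoint spans
```

Because 0 is falsy in Python, a return value like this can be used directly in a condition as well as for ranking overlaps by size.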
Source code in presidio_analyzer/recognizer_result.py
to_dict()
Serialize self to dictionary.
Returns:
Type | Description |
---|---|
Dict | a dictionary |
Source code in presidio_analyzer/recognizer_result.py
RemoteRecognizer
Bases: ABC, EntityRecognizer
A configuration for a recognizer that runs on a different process / remote machine.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
supported_entities | List[str] | A list of entities this recognizer can identify | required |
name | Optional[str] | name of recognizer | required |
supported_language | str | The language this recognizer can detect entities in | required |
version | str | Version of this recognizer | required |
Source code in presidio_analyzer/remote_recognizer.py
analyze(text, entities, nlp_artifacts)
abstractmethod
Call an external service for PII detection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | text to be analyzed | required |
entities | List[str] | Entities that should be looked for | required |
nlp_artifacts | NlpArtifacts | Additional metadata from the NLP engine | required |
Returns:
Type | Description |
---|---|
List of identified PII entities |
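The abstract-method pattern described here can be sketched with the standard library alone. Everything below is hypothetical and for illustration: `RemoteRecognizerSketch` stands in for the base class, `FakeServiceRecognizer` is an invented subclass, and `_call_service` returns canned detections where a real implementation would make a network call.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class RemoteRecognizerSketch(ABC):
    """Hypothetical stand-in for a base class with an abstract analyze()."""

    def __init__(self, supported_entities: List[str], name: Optional[str],
                 supported_language: str, version: str):
        self.supported_entities = supported_entities
        self.name = name
        self.supported_language = supported_language
        self.version = version

    @abstractmethod
    def analyze(self, text: str, entities: List[str], nlp_artifacts=None) -> List[Dict]:
        """Subclasses call their external service here."""


class FakeServiceRecognizer(RemoteRecognizerSketch):
    def _call_service(self, text: str) -> List[Dict]:
        # Stand-in for an HTTP/RPC call: "detects" the literal name Alice.
        idx = text.find("Alice")
        if idx < 0:
            return []
        return [{"entity_type": "PERSON", "start": idx, "end": idx + 5, "score": 0.85}]

    def analyze(self, text: str, entities: List[str], nlp_artifacts=None) -> List[Dict]:
        # Keep only the detections whose type was requested.
        return [d for d in self._call_service(text) if d["entity_type"] in entities]


rec = FakeServiceRecognizer(["PERSON"], "fake-service", "en", "0.0.1")
found = rec.analyze("Hello Alice", ["PERSON"], None)
print(found)
```

Because `analyze` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement it can be created.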
Source code in presidio_analyzer/remote_recognizer.py