Mkdocstrings
Overview
The pteredactyl module provides various functions and classes for data anonymization and redaction.
Defaults
change_model(new_model)
Change the default NER model.
Parameters
new_model (str): The new model path to be set as the default NER model.
Returns
None
Source code in pteredactyl\pteredactyl\defaults.py
show_defaults()
Print the default values used by pteredactyl.
This function shows the default values for the following variables:
- DEFAULT_NER_MODEL (for model_path)
- DEFAULT_SPACY_MODEL (for spacy_model)
- DEFAULT_ENTITIES (for entities)
- DEFAULT_REGEX_ENTITIES (for regex_entities)
Returns
None
Source code in pteredactyl\pteredactyl\defaults.py
Exceptions
MissingRegexRecogniserError
Bases: KeyError
Exception raised when a regex recogniser is requested but not found in the supported regex_entities list.
Attributes:
Name | Type | Description |
---|---|---|
message | str | The error message. |
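As a rough illustration of how an exception like this behaves, the sketch below re-declares a stand-in class with the documented base class and `message` attribute and raises it from a lookup. The `SUPPORTED_REGEX_ENTITIES` table, its placeholder patterns, and the `get_regex` helper are hypothetical names invented for this example, not part of pteredactyl.

```python
class MissingRegexRecogniserError(KeyError):
    """Illustrative stand-in for the documented exception (not the library's code)."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message


# Hypothetical lookup table with placeholder patterns.
SUPPORTED_REGEX_ENTITIES = {
    "NHS_NUMBER": r"\b\d{10}\b",
    "POSTCODE": r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b",
}


def get_regex(entity_type: str) -> str:
    """Return the pattern for a named regex entity, or raise if it is unknown."""
    if entity_type not in SUPPORTED_REGEX_ENTITIES:
        raise MissingRegexRecogniserError(
            f"No regex recogniser found for entity type: {entity_type}"
        )
    return SUPPORTED_REGEX_ENTITIES[entity_type]


try:
    get_regex("SOUTHAMPTON_LANDLINE")
except KeyError as err:  # the exception subclasses KeyError, so this catches it
    print(type(err).__name__)
```

Because the class derives from KeyError, callers that already handle KeyError around a lookup will also catch it.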
Source code in pteredactyl\pteredactyl\exceptions.py
Redactor
This module provides functionalities for text redaction.
Create an analyser engine with a Transformers NER model and spaCy model.
Source code in pteredactyl\pteredactyl\redactor.py
Redactor Analyser
Analyses text using the provided NER models and entities, and returns a list of the entities identified. It is recommended to first create an analyser and pass it in so that it can be reused:
    >>> analyser = create_analyser()
    >>> analyse(text=text, analyser=analyser)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text to be analyzed. | required |
analyser | AnalyzerEngine | An instance of AnalyzerEngine. If not provided, a new analyser will be created (creating one first via create_analyser() and passing it in is recommended). | None |
entities | list | A list of entity types to analyse. If not provided, a default list will be used. | DEFAULT_ENTITIES |
regex_entities | list | A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used. | DEFAULT_REGEX_ENTITIES |
model_path | str | The path to the model used for analysis (e.g. 'StanfordAIMI/stanford-deidentifier-base'). Used only if analyser not provided. | DEFAULT_NER_MODEL |
spacy_model | str | The spaCy model to use (e.g. 'en_core_web_sm'). Used only if analyser not provided. | DEFAULT_SPACY_MODEL |
language | str | The language of the text to be analyzed. Used only if analyser not provided. | 'en' |
mask_individual_words | bool | If True, adjacent entities are not joined together (e.g. for Jane Smith, 'Jane' and 'Smith' are identified separately if True and combined if False). | False |
text_separator | str | Text separator. Defaults to a single space. | ' ' |
rebuild_regex_recognisers | bool | If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution. | True |
**kwargs | | Additional keyword arguments for the analyzer. | {} |
Returns:
Name | Type | Description |
---|---|---|
list | list[RecognizerResult] | The analysis results. |
Example
    >>> from pteredactyl.redactor import analyse
    >>> text = "My name is John Doe and my NHS number is 7890123450"
    >>> results = analyse(text)
    >>> print(results)
    [RecognizerResult(entity_type='PERSON', start=10, end=19, score=1.0), RecognizerResult(entity_type='NHS_NUMBER', start=36, end=46, score=1.0)]
Source code in pteredactyl\pteredactyl\redactor.py
Redactor Anonymiser
Anonymises the given text by replacing the specified entities, detected by NER, and the specified regex entities, detected by regex. Regex entities take priority and are analysed first. It is recommended to first create an analyser and pass it in so that it can be reused.
Args:
text (str): The text to be anonymized.
analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created.
entities (list, optional): A list of entity types to anonymize. If not provided, a default list will be used.
regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
highlight (bool): If True, highlights the anonymized parts in the text.
replacement_lists (dict, optional): A dictionary with entity types as keys and lists of replacement values for hide-in-plain-sight redaction.
model_path (str): The path to the model used for analysis. Used only if analyser not provided.
spacy_model (str): The spaCy model to use. Used only if analyser not provided.
language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
mask_individual_words (bool): If True, prevents joining of next-door entities together (i.e. with Jane Smith, 'Jane' and 'Smith' are identified separately if True, combined if False).
Returns:
str: The anonymized text.
Example
    >>> analyser = create_analyser()
    >>> text = '''
    ... Patient Name: John Doe
    ... NHS Number: 7890123450
    ... Address: AB1 0CD
    ... Date: January 1, 2022
    ... Diagnostic Findings:
    ... The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
    ... The mass is suspected to be malignant and is likely to be a tumor.
    ... Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
    ... Recommendations:
    ... The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
    ... If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
    ... '''
    >>> results = anonymise(text, analyser=analyser, entities=["DATE_TIME", "PERSON"], regex_entities=["POSTCODE", "NHS_NUMBER"])
    >>> print(results)
    Patient Name: <PERSON>
    NHS Number: <NHS_NUMBER>
    Address: <POSTCODE>
    Date: <DATE_TIME>
    Diagnostic Findings:
    The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
    The mass is suspected to be malignant and is likely to be a tumor.
    Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
    Recommendations:
    The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
    If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
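The tagged output in the example can be pictured as a span replacement over the analysis results. The following is a minimal sketch of that idea only, assuming non-overlapping spans; `Span` and `mask_spans` are hypothetical names for this illustration, and pteredactyl's own replacement logic may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a RecognizerResult (entity type plus offsets)."""
    entity_type: str
    start: int
    end: int


def mask_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with an <ENTITY_TYPE> tag, assuming spans do not overlap."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "Patient Name: John Doe\nNHS Number: 7890123450"
name_start = text.index("John Doe")
nhs_start = text.index("7890123450")
spans = [
    Span("PERSON", name_start, name_start + len("John Doe")),
    Span("NHS_NUMBER", nhs_start, nhs_start + len("7890123450")),
]
masked = mask_spans(text, spans)
print(masked)
```

Replacing from the end of the string backwards avoids having to shift the remaining offsets each time a tag of a different length is substituted.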
Source code in pteredactyl\pteredactyl\redactor.py
Regex Check Functions
is_nhs_number(nhs_number)
Check if a given value is a valid NHS number.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
nhs_number | int or str | The NHS number to be checked. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the given value is a valid NHS number, otherwise False. |
Example
    >>> is_nhs_number(1234567890)
    True
    >>> is_nhs_number("1234567890")
    True
    >>> is_nhs_number("1234567898")
    False  # (fails checksum)
    >>> is_nhs_number("12345")
    False  # (fails length check)
Note
The NHS number is a 10-digit number used in the United Kingdom for healthcare identification. The last digit of the NHS number is a check digit, calculated with the Modulus 11 algorithm, that is used for validation.
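For readers unfamiliar with the Modulus 11 check, here is a minimal sketch of the standard calculation: the first nine digits are weighted 10 down to 2, summed, and the check digit is 11 minus the remainder modulo 11 (11 maps to 0, and 10 means the number is invalid). The function name `modulus_11_valid` is a hypothetical name for this illustration, and pteredactyl's own implementation may differ in details such as input cleaning.

```python
def modulus_11_valid(value) -> bool:
    """Return True if value is 10 digits with a correct Modulus 11 check digit."""
    digits = str(value)
    # Length and character check first (no spaces or dashes are tolerated here).
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False  # no valid check digit exists for this prefix
    return check == int(digits[9])


print(modulus_11_valid("7890123450"))  # number used in the examples on this page
print(modulus_11_valid("7890123451"))  # same prefix, wrong check digit
```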
Source code in pteredactyl\pteredactyl\regex_check_functions.py
Regex Entities
build_pteredactyl_recogniser(entity_type, regex, check_function)
Build a custom regex PteredactylRecogniser for pteredactyl.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
entity_type | str | The name of the entity to be recognised. | required |
regex | str or Pattern | The regular expression to match the entity. | required |
check_function | Callable | A function to check if the matched string is a valid entity. Should take a single argument (the matched string) and return a boolean. | required |
Returns:
Name | Type | Description |
---|---|---|
PteredactylRecogniser | PteredactylRecogniser | A custom presidio EntityRecognizer object. |
Example:
    >>> def check_soton_landline(input: str):
    ...     cleaned = input.replace('-','').replace(' ','')
    ...     return cleaned.startswith('0238')
    >>> recogniser = build_pteredactyl_recogniser(entity_type='SOUTHAMPTON_LANDLINE',
    ...                                           regex=r'(?:\d[\s-]?){11}',
    ...                                           check_function=check_soton_landline)
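To make the division of labour between `regex` and `check_function` concrete, the sketch below applies the same pattern and check function using only the standard `re` module: the regex proposes candidate matches and the check function accepts or rejects each one. `find_checked_matches` is a hypothetical helper written for this illustration, not the recogniser's actual matching code.

```python
import re
from typing import Callable


def check_soton_landline(candidate: str) -> bool:
    """Accept only numbers that start with 0238 once separators are removed."""
    cleaned = candidate.replace("-", "").replace(" ", "")
    return cleaned.startswith("0238")


def find_checked_matches(
    text: str, regex: str, check_function: Callable[[str], bool]
) -> list[str]:
    """Return regex matches that also pass the check function."""
    return [m.group() for m in re.finditer(regex, text) if check_function(m.group())]


text = "Call 023 8012 3456. Alternatively 020 7946 0000."
matches = find_checked_matches(text, r"(?:\d[\s-]?){11}", check_soton_landline)
print(matches)
```

Both phone numbers match the 11-digit pattern, but only the first passes the prefix check, which is the kind of filtering a check function is meant to provide.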
Source code in pteredactyl\pteredactyl\regex_entities.py
build_regex_entity_recogniser_list(regex_entities)
Build a list of custom regex PteredactylRecognisers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
regex_entities | list[str or PteredactylRecogniser] | A list of PteredactylRecogniser objects or strings referencing pre-built PteredactylRecognisers. | required |
Returns:
Type | Description |
---|---|
list[PteredactylRecogniser] | A list of custom presidio EntityRecognizer objects. |
Example:
    >>> recognisers = build_regex_entity_recogniser_list(['NHS_NUMBER',
    ...     'ENTITY_2',
    ...     PteredactylRecogniser(entity_type='SOUTHAMPTON_LANDLINE',
    ...                           regex=r'(?:\d[\s-]?){11}',
    ...                           check_function=check_soton_landline)
    ... ])
Source code in pteredactyl\pteredactyl\regex_entities.py
rebuild_analyser_regex_recognisers(analyser, regex_entities)
Rebuilds the analyser's regex recognisers with the supplied list of regex entities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analyser | AnalyzerEngine | The analyser to rebuild. | required |
regex_entities | list[str or PteredactylRecogniser] | The list of regex entities to use. | required |
Returns:
Source code in pteredactyl\pteredactyl\regex_entities.py
Support
find_substring_positions(s, sep=' ')
Finds the starting and ending indexes of substrings in the input string s.
The substrings are determined by splitting s at the separator.
Args:
s (str): The input string containing substrings separated by the separator.
sep (str): Separator for substrings.
Returns:
list[tuple[int, int]]: A list of tuples, each containing the start and end index of a substring.
Examples:
    >>> s = "abc\ndef"
    >>> positions = find_substring_positions(s, sep="\n")
    >>> print("Replacement Positions: ", positions)
    Replacement Positions:  [(0, 3), (4, 7)]
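One way such positions could be computed is to split on the separator and keep a running offset. This is a sketch under that assumption rather than the library's implementation; `substring_positions` is a hypothetical name, and it reports empty substrings (from repeated separators) as zero-length spans, which the real function may or may not do.

```python
def substring_positions(s: str, sep: str = " ") -> list[tuple[int, int]]:
    """Return (start, end) offsets of each piece of s when split at sep."""
    positions = []
    start = 0
    for part in s.split(sep):
        end = start + len(part)
        positions.append((start, end))
        # Skip over the separator to reach the start of the next piece.
        start = end + len(sep)
    return positions


print(substring_positions("abc def"))
print(substring_positions("abc\ndef", sep="\n"))
```

Each pair uses Python slice conventions, so `s[start:end]` recovers the corresponding substring.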
Source code in pteredactyl\pteredactyl\support.py
load_nlp_configuration(language, spacy_model)
Loads NLP configuration for spacy model
Parameters:
Name | Type | Description | Default |
---|---|---|---|
language | str | Model language (e.g. en) | required |
spacy_model | str | Name of spacy model (e.g. en_core_web_sm) | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict[str, Any] | Configuration dictionary that can be passed to create an NlpEngineProvider. |
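For orientation, a configuration dictionary of the kind Presidio's NlpEngineProvider accepts for spaCy names the engine and lists one entry per language and model. The sketch below builds such a dictionary; `build_nlp_configuration` is a hypothetical name, and the exact dictionary returned by load_nlp_configuration is an assumption here, since the source is not shown on this page.

```python
from typing import Any


def build_nlp_configuration(language: str, spacy_model: str) -> dict[str, Any]:
    """Build a spaCy configuration dict in the shape Presidio documents."""
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": language, "model_name": spacy_model}],
    }


config = build_nlp_configuration("en", "en_core_web_sm")
print(config)

# With presidio-analyzer installed, a dict like this is typically used as:
#   from presidio_analyzer.nlp_engine import NlpEngineProvider
#   nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine()
```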
Source code in pteredactyl\pteredactyl\support.py
load_nlp_engine(presidio_logger, nlp_configuration)
Loads a NlpEngineProvider by creating a new engine.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
presidio_logger | Logger | Logger object to set and restore logging level. | required |
nlp_configuration | dict | Configuration for the NlpEngineProvider. | required |
Returns:
Name | Type | Description |
---|---|---|
NlpEngineProvider | NlpEngine | The loaded engine. |
Source code in pteredactyl\pteredactyl\support.py
load_registry(transformers_recogniser, regex_entities)
Creates an AnalyzerEngine.registry by combining a TransformersRecogniser with a list of custom PteredactylRecognisers
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transformers_recogniser | TransformersRecogniser | Custom transformers recogniser | required |
regex_entities | list[str or PteredactylRecogniser] | Named regex entities used to generate PteredactylRecognisers, or custom PteredactylRecognisers | required |
Returns:
Name | Type | Description |
---|---|---|
RecognizerRegistry | RecognizerRegistry | Registry of recognisers for an AnalyzerEngine |
Source code in pteredactyl\pteredactyl\support.py
load_spacy_model(spacy_model)
Downloads spacy model if not already installed
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spacy_model | str | Name of spacy model | required |
Source code in pteredactyl\pteredactyl\support.py
load_transformers_recognizer(model_path)
Loads transformers recognizer with the specified model path
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path | str | Path to the transformer model | required |
Returns:
Name | Type | Description |
---|---|---|
TransformersRecogniser | TransformersRecogniser | Loaded transformers recognizer |
Source code in pteredactyl\pteredactyl\support.py
return_allowed_results(initial_results, allowed_entities, allowed_regex_entities)
Checks a list of RecognizerResults for allowed entities and returns a list of the allowed results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
initial_results |
list[RecognizerResult]
|
The list of RecognizerResults to filter. |
required |
allowed_entities |
list[str]
|
The list of entity types to allow. |
required |
allowed_regex_entities |
list[str]
|
The list of regex entity types to allow. |
required |
Returns:
Type | Description |
---|---|
list[RecognizerResult] | The filtered list of RecognizerResults. |
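The filtering described here amounts to keeping results whose entity type appears in either allow-list. The sketch below shows that idea with a hypothetical `Result` dataclass standing in for presidio's RecognizerResult and a hypothetical `filter_allowed` function; the library's own handling of edge cases is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for presidio's RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def filter_allowed(
    results: list[Result],
    allowed_entities: list[str],
    allowed_regex_entities: list[str],
) -> list[Result]:
    """Keep results whose entity type is in either allow-list, preserving order."""
    allowed = set(allowed_entities) | set(allowed_regex_entities)
    return [r for r in results if r.entity_type in allowed]


results = [
    Result("PERSON", 0, 8, 1.0),
    Result("LOCATION", 12, 18, 0.9),
    Result("NHS_NUMBER", 30, 40, 1.0),
]
kept = filter_allowed(results, ["PERSON"], ["NHS_NUMBER"])
print([r.entity_type for r in kept])
```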
Source code in pteredactyl\pteredactyl\support.py
split_results_into_individual_words(text, results, text_separator=' ')
Splits identified RecognizerResults into individual words. For example, a single result covering 'Jane Smith' becomes separate results for 'Jane' and 'Smith'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text |
str
|
The text that was analyzed. |
required |
results |
list[RecognizerResult]
|
The results of the analysis. |
required |
text_separator |
str
|
The separator used to split the text into individual words. |
' '
|
Returns:
Type | Description |
---|---|
list[RecognizerResult] | A list of RecognizerResults, each representing an individual word. |
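The word-level split can be sketched as slicing each result's text, splitting it on the separator, and emitting one result per word with adjusted offsets. `Result` and `split_into_words` are hypothetical names used for this illustration, with a plain dataclass standing in for presidio's RecognizerResult; this is not the library's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for presidio's RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def split_into_words(
    text: str, results: list[Result], text_separator: str = " "
) -> list[Result]:
    """Split each result into one result per word, keeping type and score."""
    word_results = []
    for result in results:
        offset = result.start
        for word in text[result.start:result.end].split(text_separator):
            if word:  # ignore empty pieces produced by repeated separators
                word_results.append(
                    Result(result.entity_type, offset, offset + len(word), result.score)
                )
            offset += len(word) + len(text_separator)
    return word_results


text = "Seen by Jane Smith today"
start = text.index("Jane Smith")
words = split_into_words(text, [Result("PERSON", start, start + len("Jane Smith"), 1.0)])
print([text[w.start:w.end] for w in words])
```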
Source code in pteredactyl\pteredactyl\support.py