
Overview

The pteredactyl module provides various functions and classes for data anonymization and redaction.

Defaults

change_model(new_model)

Change the default NER model.

Parameters

- new_model (str): The new model path to be set as the default NER model.

Returns

None

Source code in pteredactyl\pteredactyl\defaults.py
def change_model(new_model: str) -> None:
    """
    Change the default NER model.

    Parameters
    ----------
    new_model : str
        The new model path to be set as the default NER model.

    Returns
    -------
    None
    """
    global DEFAULT_NER_MODEL
    DEFAULT_NER_MODEL = new_model
    print(f"DEFAULT_NER_MODEL changed to: {DEFAULT_NER_MODEL}")

show_defaults()

Print the default values used by pteredactyl.

This function shows the default values for the following variables:
- DEFAULT_NER_MODEL (for model_path)
- DEFAULT_SPACY_MODEL (for spacy_model)
- DEFAULT_ENTITIES (for entities)
- DEFAULT_REGEX_ENTITIES (for regex_entities)

Returns

None

Source code in pteredactyl\pteredactyl\defaults.py
def show_defaults() -> None:
    """
    Print the default values used by pteredactyl.

    This function shows the default values for the following variables:
    - DEFAULT_NER_MODEL (for model_path)
    - DEFAULT_SPACY_MODEL (for spacy_model)
    - DEFAULT_ENTITIES (for entities)
    - DEFAULT_REGEX_ENTITIES (for regex_entities)

    Returns
    -------
    None
    """
    print("PteRedactyl Defaults")
    print("--------------------")
    print(f"DEFAULT_NER_MODEL:      {DEFAULT_NER_MODEL}")
    print(f"DEFAULT_SPACY_MODEL:    {DEFAULT_SPACY_MODEL}")
    print(f"DEFAULT_ENTITIES:       {DEFAULT_ENTITIES}")
    print(f"DEFAULT_REGEX_ENTITIES: {DEFAULT_REGEX_ENTITIES}")

Exceptions

MissingRegexRecogniserError

Bases: KeyError

Exception raised when a regex recogniser is requested but not found in the supported regex_entities list.

Attributes:

- message (str): The error message.

Source code in pteredactyl\pteredactyl\exceptions.py
class MissingRegexRecogniserError(KeyError):
    """
    Exception raised when a regex recogniser is requested but not found in the supported regex_entities list.

    Attributes:
        message (str): The error message.
    """

    def __init__(
        self,
        message: str = "No regex settings could be detected in pteredactyl.regex_entities",
    ):
        super().__init__(message)
        self.message = message

Redactor

This module provides functionality for text redaction.

Create an analyser engine with a Transformers NER model and spaCy model.

Source code in pteredactyl\pteredactyl\redactor.py
def create_analyser(
    model_path: str = DEFAULT_NER_MODEL,
    spacy_model: str = DEFAULT_SPACY_MODEL,
    language: str = "en",
    regex_entities: Sequence[str | PteredactylRecogniser] = DEFAULT_REGEX_ENTITIES,
) -> AnalyzerEngine:
    """
    Create an analyser engine with a Transformers NER model and spaCy model.
    """
    if not model_path:
        raise ValueError("No model path provided for NER model.")

    print(f"Using model path: {model_path}")

    if regex_entities:
        regex_entities = build_regex_entity_recogniser_list(
            regex_entities=regex_entities
        )

    load_spacy_model(spacy_model)

    transformers_recogniser = load_transformers_recognizer(model_path)

    nlp_configuration = load_nlp_configuration(
        language=language, spacy_model=spacy_model
    )

    registry = load_registry(
        transformers_recogniser=transformers_recogniser, regex_entities=regex_entities
    )

    nlp_engine = load_nlp_engine(
        presidio_logger=presidio_logger, nlp_configuration=nlp_configuration
    )

    analyser = AnalyzerEngine(nlp_engine=nlp_engine, registry=registry)

    return analyser

Redactor Analyser

Analyses text using the provided NER models and entities, and returns a list of the entities identified. It is recommended to first create an analyser and pass it in for reuse:

>>> analyser = create_analyser()
>>> results = analyse(text=text, analyser=analyser)

Parameters:

- text (str): The text to be analyzed. [required]
- analyser (AnalyzerEngine): An instance of AnalyzerEngine. If not provided, a new analyser will be created (recommend creating first via create_analyser(), before feeding in). [default: None]
- entities (list): A list of entity types to analyse. If not provided, a default list will be used. [default: DEFAULT_ENTITIES]
- regex_entities (list): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used. [default: DEFAULT_REGEX_ENTITIES]
- model_path (str): The path to the model used for analysis (e.g. 'StanfordAIMI/stanford-deidentifier-base'). Used only if analyser not provided. [default: DEFAULT_NER_MODEL]
- spacy_model (str): The spaCy model to use (e.g. 'en_core_web_sm'). Used only if analyser not provided. [default: DEFAULT_SPACY_MODEL]
- language (str): The language of the text to be analyzed. Used only if analyser not provided. [default: 'en']
- mask_individual_words (bool): If True, prevents adjacent entities from being joined together (i.e. with Jane Smith, both 'Jane' and 'Smith' are identified separately if True, combined if False). [default: False]
- text_separator (str): Text separator. [default: ' ']
- rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution. [default: True]
- **kwargs: Additional keyword arguments for the analyzer. [default: {}]

Returns:

- list[RecognizerResult]: The analysis results.

Example

>>> from pteredactyl.redactor import analyse
>>> text = "My name is John Doe and my NHS number is 7890123450"
>>> results = analyse(text)
>>> print(results)
[RecognizerResult(entity_type='PERSON', start=10, end=19, score=1.0),
 RecognizerResult(entity_type='NHS_NUMBER', start=36, end=46, score=1.0)]

Source code in pteredactyl\pteredactyl\redactor.py
def analyse(
    text: str,
    analyser: AnalyzerEngine | None = None,
    entities: str | list[str] = DEFAULT_ENTITIES,
    regex_entities: Sequence[str | PteredactylRecogniser] = DEFAULT_REGEX_ENTITIES,
    model_path: str = DEFAULT_NER_MODEL,
    spacy_model: str = DEFAULT_SPACY_MODEL,
    language: str = "en",
    mask_individual_words: bool = False,
    text_separator: str = " ",
    rebuild_regex_recognisers: bool = True,
    **kwargs,
) -> list[RecognizerResult]:
    """
    Analyses text using the provided NER models and entities, and returns a list of the entities identified.
    It is recommended to first create an analyser and pass it in for reuse:
        >>> analyser = create_analyser()
        >>> results = analyse(text=text, analyser=analyser)

    Args:
        text (str): The text to be analyzed.
        analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created
            (recommend creating first via create_analyser(), before feeding in).
        entities (list, optional): A list of entity types to analyse. If not provided, a default list will be used.
        regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
        model_path (str): The path to the model used for analysis (e.g. 'StanfordAIMI/stanford-deidentifier-base'). Used only if analyser not provided.
        spacy_model (str): The spaCy model to use (e.g. 'en_core_web_sm'). Used only if analyser not provided.
        language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
        mask_individual_words (bool): If True, prevents adjacent entities from being joined together
            (i.e. with Jane Smith, both 'Jane' and 'Smith' are identified separately if True, combined if False). Defaults to False.
        text_separator (str): Text separator. Default is whitespace.
        rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution.
        **kwargs: Additional keyword arguments for the analyzer.

    Returns:
        list: The analysis results.

    Example:
        >>> from pteredactyl.redactor import analyse
        >>> text = "My name is John Doe and my NHS number is 7890123450"
        >>> results = analyse(text)
        >>> print(results)
        [RecognizerResult(entity_type='PERSON', start=10, end=19, score=1.0),
         RecognizerResult(entity_type='NHS_NUMBER', start=36, end=46, score=1.0)]
    """

    # Prepare
    entities = [entities] if isinstance(entities, str) else entities if entities else []
    regex_entities = (
        build_regex_entity_recogniser_list(regex_entities=regex_entities)
        if regex_entities
        else []
    )
    allowed_entities = entities
    allowed_regex_entities = [
        regex_entity.entity_type for regex_entity in regex_entities
    ]
    entities = allowed_entities + allowed_regex_entities

    # Check Analyser
    if not analyser:
        analyser = create_analyser(
            model_path=model_path,
            spacy_model=spacy_model,
            language=language,
            regex_entities=regex_entities,
        )
    else:
        if rebuild_regex_recognisers:
            rebuild_analyser_regex_recognisers(
                analyser=analyser, regex_entities=regex_entities
            )

    # Analyse
    initial_results = analyser.analyze(
        text, language=language, entities=entities, **kwargs
    )

    if mask_individual_words:
        initial_results = split_results_into_individual_words(
            text=text, results=initial_results, text_separator=text_separator
        )

    results = return_allowed_results(
        initial_results=initial_results,
        allowed_entities=allowed_entities,
        allowed_regex_entities=allowed_regex_entities,
    )

    results.sort(key=lambda x: x.start)

    return results
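The "Prepare" block at the top of `analyse` normalises a bare entity string into a list and appends the regex entity types. A minimal standalone sketch of those lines (`prepare_entities` is a hypothetical helper, with plain strings standing in for the recognisers' entity types):

```python
def prepare_entities(entities, regex_entity_types):
    """Mirror of the normalisation at the top of analyse()/anonymise():
    a bare string becomes a one-element list, None or empty becomes [],
    and the regex entity types are appended to the allowed entities."""
    entities = [entities] if isinstance(entities, str) else list(entities or [])
    return entities + list(regex_entity_types)

print(prepare_entities("PERSON", ["NHS_NUMBER", "POSTCODE"]))
# ['PERSON', 'NHS_NUMBER', 'POSTCODE']
print(prepare_entities(None, ["NHS_NUMBER"]))
# ['NHS_NUMBER']
```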

Redactor Anonymiser

Anonymises the given text by replacing entities identified by NER, and regex entities identified by regex matching. Regex entities take priority and are analysed first. It is recommended to first create an analyser and pass it in for reuse.

Args:
    text (str): The text to be anonymized.
    analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created.
    entities (list, optional): A list of entity types to anonymize. If not provided, a default list will be used.
    regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
    highlight (bool): If True, highlights the anonymized parts in the text.
    replacement_lists (dict, optional): A dictionary with entity types as keys and lists of replacement values for hide-in-plain-sight redaction.
    model_path (str): The path to the model used for analysis. Used only if analyser not provided.
    spacy_model (str): The spaCy model to use. Used only if analyser not provided.
    language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
    mask_individual_words (bool): If True, prevents adjacent entities from being joined together (i.e. Jane Smith becomes <PERSON> <PERSON> if True, or <PERSON> if False). Defaults to False.
    text_separator (str): Text separator. Default is whitespace.
    rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution.
    **kwargs: Additional keyword arguments for analyse.

Returns:
    str: The anonymized text.

Example

>>> analyser = create_analyser()
>>> text = '''
    Patient Name: John Doe
    NHS Number: 7890123450
    Address: AB1 0CD
    Date: January 1, 2022
    Diagnostic Findings:
    The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
    The mass is suspected to be malignant and is likely to be a tumor.
    Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
    Recommendations:
    The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
    If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
    '''

>>> results = anonymise(text, analyser=analyser, entities=["DATE_TIME", "PERSON"], regex_entities=["POSTCODE", "NHS_NUMBER"])
>>> print(results)

Patient Name: <PERSON>
NHS Number: <NHS_NUMBER>
Address: <POSTCODE>
Date: <DATE_TIME>
Diagnostic Findings:
The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
The mass is suspected to be malignant and is likely to be a tumor.
Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
Recommendations:
The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
Source code in pteredactyl\pteredactyl\redactor.py
def anonymise(
    text: str,
    analyser: AnalyzerEngine | None = None,
    entities: str | list[str] = DEFAULT_ENTITIES,
    regex_entities: Sequence[str | PteredactylRecogniser] = DEFAULT_REGEX_ENTITIES,
    highlight: bool = False,
    replacement_lists: dict | None = None,
    model_path: str = DEFAULT_NER_MODEL,
    spacy_model: str = DEFAULT_SPACY_MODEL,
    language: str = "en",
    mask_individual_words: bool = False,
    text_separator: str = " ",
    rebuild_regex_recognisers: bool = True,
    **kwargs,
) -> str:
    """
    Anonymises the given text by replacing specified entities by NER, and regex entities by REGEX. Regex entities take priority and are analysed first.
    It is recommended to first create an analyser and feed this in to be reused.

    Args:
    text (str): The text to be anonymized.
    analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created.
    entities (list, optional): A list of entity types to anonymize. If not provided, a default list will be used.
    regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
    highlight (bool): If True, highlights the anonymized parts in the text.
    replacement_lists (dict, optional): A dictionary with entity types as keys and lists of replacement values for hide-in-plain-sight redaction.
    model_path (str): The path to the model used for analysis. Used only if analyser not provided.
    spacy_model (str): The spaCy model to use. Used only if analyser not provided.
    language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
    mask_individual_words (bool): If True, prevents joining of next-door entities together.
            (i.e. Jane Smith becomes <PERSON> <PERSON> if True, or <PERSON> if False). Defaults to False.
    text_separator (str): Text separator. Default is whitespace.
    rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution.
    **kwargs: Additional keyword arguments for analyse.

    Returns:
    str: The anonymized text.

    Example:
        >>> analyser = create_analyser()
        >>> text = '''
            Patient Name: John Doe
            NHS Number: 7890123450
            Address: AB1 0CD
            Date: January 1, 2022
            Diagnostic Findings:
            The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
            The mass is suspected to be malignant and is likely to be a tumor.
            Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
            Recommendations:
            The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
            If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
            '''

        >>> results = anonymise(text, analyser=analyser, entities=["DATE_TIME", "PERSON"], regex_entities=["POSTCODE", "NHS_NUMBER"])
        >>> print(results)

            Patient Name: <PERSON>
            NHS Number: <NHS_NUMBER>
            Address: <POSTCODE>
            Date: <DATE_TIME>
            Diagnostic Findings:
            The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
            The mass is suspected to be malignant and is likely to be a tumor.
            Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
            Recommendations:
            The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
            If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
    """
    # Prepare
    entities = [entities] if isinstance(entities, str) else entities if entities else []
    regex_entities = (
        build_regex_entity_recogniser_list(regex_entities=regex_entities)
        if regex_entities
        else []
    )
    allowed_entities = entities
    allowed_regex_entities = [
        regex_entity.entity_type for regex_entity in regex_entities
    ]
    entities = allowed_entities + allowed_regex_entities

    # Check Analyser
    if not analyser:
        analyser = create_analyser(
            model_path=model_path,
            spacy_model=spacy_model,
            language=language,
            regex_entities=regex_entities,
        )
    else:
        if rebuild_regex_recognisers:
            rebuild_analyser_regex_recognisers(
                analyser=analyser, regex_entities=regex_entities
            )

    # Analyse the text
    initial_results = analyse(
        text,
        analyser,
        entities=entities,
        regex_entities=regex_entities,
        mask_individual_words=mask_individual_words,
        text_separator=text_separator,
        rebuild_regex_recognisers=False,
        **kwargs,
    )

    # Create an OperatorConfig that randomly selects replacements from the replacement list
    operator_config = None
    if entities:
        if replacement_lists:
            operator_config = {}
            for entity in entities:
                if entity in replacement_lists:
                    operator_config[entity] = OperatorConfig(
                        "replace",
                        {"new_value": random.choice(replacement_lists[entity])},
                    )

    # Anonymise the text
    anonymiser = AnonymizerEngine()

    # if-else is strictly required as the anonymize method modifies initial_results variable when called
    if not mask_individual_words:
        anonymized_result = anonymiser.anonymize(
            text=text, analyzer_results=initial_results, operators=operator_config
        )
    else:
        # this is essentially AnonymizerEngine.anonymize without merging adjacent entities of the same type
        # some discussion around merging adjacent entities: https://github.com/microsoft/presidio/issues/1090
        analyzer_results = anonymiser._remove_conflicts_and_get_text_manipulation_data(
            initial_results, ConflictResolutionStrategy.MERGE_SIMILAR_OR_CONTAINED
        )
        operators = anonymiser._AnonymizerEngine__check_or_add_default_operator(
            operator_config
        )
        anonymized_result = anonymiser._operate(
            text, analyzer_results, operators, OperatorType.Anonymize
        )

    # TODO - could be managed by creating an Operatorconfig for "PHONE_NUMBER"
    anonymised_text = anonymized_result.text.replace("PHONE_NUMBER", "NUMBER")

    return highlight_text(anonymised_text) if highlight else anonymised_text
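The hide-in-plain-sight mapping built inside `anonymise` gives an operator only to entities that have a replacement list. A standalone sketch of that loop, with a plain `(strategy, params)` tuple standing in for presidio's `OperatorConfig`:

```python
import random

# replacement_lists maps entity types to candidate surrogate values;
# entities without a list fall through to the default anonymiser behaviour
replacement_lists = {"PERSON": ["Alex Smith", "Sam Jones", "Jo Bloggs"]}
entities = ["PERSON", "NHS_NUMBER"]

operator_config = {}
for entity in entities:
    if entity in replacement_lists:
        # random.choice picks one surrogate per run, as in the source above
        operator_config[entity] = (
            "replace",
            {"new_value": random.choice(replacement_lists[entity])},
        )

print(sorted(operator_config))  # only PERSON gets a replacement operator
```

Note that the choice is made once per call, so every `PERSON` span in one text receives the same surrogate.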

Regex Check Functions

is_nhs_number(nhs_number)

Check if a given value is a valid NHS number.

Parameters:

- nhs_number (int | str): The NHS number to be checked. [required]

Returns:

- bool: True if the given value is a valid NHS number, otherwise False.

Example

>>> is_nhs_number(9434765919)
True
>>> is_nhs_number("9434765919")
True
>>> is_nhs_number("1234567898")
False  # (fails checksum)
>>> is_nhs_number("12345")
False  # (fails length check)

Note

The NHS number is a 10-digit number used in the United Kingdom for healthcare identification. The last digit of the NHS number is a check digit calculated by a modulus 11 algorithm for validation.

Source code in pteredactyl\pteredactyl\regex_check_functions.py
def is_nhs_number(nhs_number: str | int) -> bool:
    """
    Check if a given value is a valid NHS number.

    Args:
        nhs_number (int | str): The NHS number to be checked.
            May be an int or a string of digits (spaces and hyphens are permitted and will be stripped).

    Returns:
        bool: True if the given value is a valid NHS number, otherwise False.

    Example:
        >>> is_nhs_number(9434765919)
        True
        >>> is_nhs_number("9434765919")
        True
        >>> is_nhs_number("1234567898")
        False  # (fails checksum)
        >>> is_nhs_number("12345")
        False  # (fails length check)

    Note:
        The NHS number is a 10-digit number used in the United Kingdom for healthcare identification.
        The last digit of the NHS number is a check digit calculated by a modulus 11 algorithm for validation.
    """

    # Prepare NHS Number
    nhs_number = (
        str(nhs_number)
        if isinstance(nhs_number, int)
        else nhs_number.replace(" ", "").replace("-", "")
    )

    # Check Only Digits
    if not nhs_number.isdigit():
        return False

    # Check Length
    if len(nhs_number) != 10:
        return False

    # Check Checksum
    total = 0
    for i, digit in enumerate(nhs_number[0:-1]):
        position = i + 1
        multiplier = 11 - position
        total += int(digit) * multiplier

    checksum = 11 - (total % 11)
    checksum = 0 if checksum == 11 else checksum
    check_digit = int(nhs_number[-1])

    if checksum != check_digit or checksum == 10:
        return False

    # All checks passed
    else:
        return True
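The modulus 11 check can be exercised on its own. Below is a compact restatement (`nhs_checksum_valid` is a hypothetical helper mirroring the source above, not part of pteredactyl); 9434765919 is a number whose final digit satisfies the checksum:

```python
def nhs_checksum_valid(nhs_number: str) -> bool:
    """Compact restatement of is_nhs_number's check: the first nine digits
    are weighted 10 down to 2 and summed; the check digit is 11 minus the
    remainder mod 11, with 11 mapping to 0 and 10 meaning invalid."""
    digits = nhs_number.replace(" ", "").replace("-", "")
    if not (digits.isdigit() and len(digits) == 10):
        return False
    total = sum(int(d) * (10 - i) for i, d in enumerate(digits[:9]))
    checksum = 11 - (total % 11)
    checksum = 0 if checksum == 11 else checksum
    return checksum != 10 and checksum == int(digits[9])

print(nhs_checksum_valid("943 476 5919"))  # True: separators stripped, checksum holds
print(nhs_checksum_valid("9434765918"))    # False: wrong check digit
```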

Regex Entities

build_pteredactyl_recogniser(entity_type, regex, check_function)

Build a custom regex recogniser for pteredactyl.

Parameters:

- entity_type (str): The name of the entity to be recognised. [required]
- regex (str | re.Pattern): The regular expression to match the entity. [required]
- check_function (Callable): A function to check if the matched string is a valid entity. Should take a single argument (the matched string) and return a boolean. [required]

Returns:

- PteredactylRecogniser: A custom presidio EntityRecognizer object.

Example:

>>> def check_soton_landline(input: str):
...     cleaned = input.replace('-', '').replace(' ', '')
...     return cleaned.startswith('0238')

>>> recogniser = build_pteredactyl_recogniser(entity_type='SOUTHAMPTON_LANDLINE',
...                                           regex=r'(?:\d[\s-]?){11}',
...                                           check_function=check_soton_landline)

Source code in pteredactyl\pteredactyl\regex_entities.py
def build_pteredactyl_recogniser(
    entity_type: str,
    regex: str | re.Pattern,
    check_function: Callable[..., bool] | None,
) -> PteredactylRecogniser:
    """
    Build a custom regex recogniser for pteredactyl.

    Args:
        entity_type (str): The name of the entity to be recognised.
        regex (str or re.Pattern): The regular expression to match the entity.
        check_function (Callable, optional): A function to check if the matched string is a valid entity. Should take a single argument (the matched string) and return a boolean.

    Returns:
        PteredactylRecogniser: A custom presidio EntityRecognizer object.

    Example:
    >>> def check_soton_landline(input: str):
    ...     cleaned = input.replace('-', '').replace(' ', '')
    ...     return cleaned.startswith('0238')

    >>> recogniser = build_pteredactyl_recogniser(entity_type='SOUTHAMPTON_LANDLINE',
    ...                                           regex=r'(?:\\d[\\s-]?){11}',
    ...                                           check_function=check_soton_landline)
    """

    regex = re.compile(regex) if isinstance(regex, str) else regex
    return PteredactylRecogniser(
        entity_type=entity_type, regex=regex, check_function=check_function
    )
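The example's `check_function` and regex can be tried independently of presidio; the sketch below uses only the standard library, with a hypothetical input string:

```python
import re

def check_soton_landline(s: str) -> bool:
    # As in the docstring example: strip separators, test the 0238 prefix
    cleaned = s.replace('-', '').replace(' ', '')
    return cleaned.startswith('0238')

pattern = re.compile(r'(?:\d[\s-]?){11}')  # eleven digits with optional separators
match = pattern.search("Call 023 8012 3456 for an appointment")

print(match is not None)                      # a candidate number was found
print(check_soton_landline(match.group()))    # and it passes the 0238 check
print(check_soton_landline("020 7946 0000"))  # a London prefix does not
```

This two-stage shape, a broad regex followed by a stricter check function, is the pattern `PteredactylRecogniser` wraps.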

build_regex_entity_recogniser_list(regex_entities)

Build a list of custom regex PteredactylRecognisers.

Parameters:

- regex_entities (list[str | PteredactylRecogniser]): A list of PteredactylRecogniser objects or strings referencing pre-built PteredactylRecognisers. [required]

Returns:

- list[PteredactylRecogniser]: A list of custom presidio EntityRecognizer objects.

Example:

>>> recognisers = build_regex_entity_recogniser_list(['NHS_NUMBER',
...                                                   'ENTITY_2',
...                                                   PteredactylRecogniser(entity_type='SOUTHAMPTON_LANDLINE',
...                                                                         regex=r'\b((?:\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}|\(?01\d{1,4}\)?\s?\d{1,4}\s?\d{1,4})\b',
...                                                                         check_function=check_so_landline)
...                                                  ])

Source code in pteredactyl\pteredactyl\regex_entities.py
def build_regex_entity_recogniser_list(
    regex_entities: str | PteredactylRecogniser | Sequence[str | PteredactylRecogniser],
) -> list[PteredactylRecogniser]:
    """
    Build a list of custom regex PteredactylRecognisers.

    Args:
        regex_entities (list[str or PteredactylRecogniser]): A list of PteredactylRecogniser objects or strings referencing pre-built PteredactylRecognisers.

    Returns:
        list[PteredactylRecogniser]: A list of custom presidio EntityRecognizer objects.

    Example:
    >>> recognisers = build_regex_entity_recogniser_list(['NHS_NUMBER',
    ...                                                    'ENTITY_2',
    ...                                                    PteredactylRecogniser(entity_type='SOUTHAMPTON_LANDLINE',
    ...                                                                          regex=r'\\b((?:\\+44\\s?7\\d{3}|\\(?07\\d{3}\\)?)\\s?\\d{3}\\s?\\d{3}|\\(?01\\d{1,4}\\)?\\s?\\d{1,4}\\s?\\d{1,4})\\b',
    ...                                                                          check_function=check_so_landline)
    ...                                                   ])
    """

    regex_entity_recognisers = []
    if type(regex_entities) in (str, PteredactylRecogniser):
        regex_entities = [regex_entities]

    for regex_entity in regex_entities:
        if isinstance(regex_entity, str):
            regex_entity_recognisers.append(
                fetch_pteredactyl_recogniser(entity_type=regex_entity)
            )
        else:
            regex_entity_recognisers.append(regex_entity)

    return regex_entity_recognisers
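The string-or-recogniser dispatch above can be sketched with a stand-in class (`FakeRecogniser` and the `PREBUILT` table are hypothetical, not part of pteredactyl; `PREBUILT` plays the role of `fetch_pteredactyl_recogniser`):

```python
from dataclasses import dataclass

@dataclass
class FakeRecogniser:
    # Hypothetical stand-in for PteredactylRecogniser
    entity_type: str

# Stand-in registry of pre-built recognisers, keyed by entity type
PREBUILT = {"NHS_NUMBER": FakeRecogniser("NHS_NUMBER")}

def build_list(regex_entities):
    # Same shape as build_regex_entity_recogniser_list: wrap a bare item
    # in a list, then resolve strings via the pre-built table
    if isinstance(regex_entities, (str, FakeRecogniser)):
        regex_entities = [regex_entities]
    return [PREBUILT[e] if isinstance(e, str) else e for e in regex_entities]

result = build_list(["NHS_NUMBER", FakeRecogniser("SOUTHAMPTON_LANDLINE")])
print([r.entity_type for r in result])  # ['NHS_NUMBER', 'SOUTHAMPTON_LANDLINE']
```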

rebuild_analyser_regex_recognisers(analyser, regex_entities)

Rebuilds the analyser's regex recognisers with the supplied list of regex entities.

Parameters:

- analyser (AnalyzerEngine): The analyser to rebuild. [required]
- regex_entities (list[str | PteredactylRecogniser]): The list of regex entities to use. [required]

Returns:

None
Source code in pteredactyl\pteredactyl\regex_entities.py
def rebuild_analyser_regex_recognisers(
    analyser: AnalyzerEngine, regex_entities: Sequence[str | PteredactylRecogniser]
) -> None:
    """
    Rebuilds the analyser's regex recognisers with the supplied list of regex entities.

    Args:
        analyser (AnalyzerEngine): The analyser to rebuild.
        regex_entities (list[str or PteredactylRecogniser]): The list of regex entities to use.

    Returns:
        None
    """

    analyser.registry.remove_recognizer(PTEREDACTYL_RECOGNISER_NAME)
    pteredactyl_recognisers = build_regex_entity_recogniser_list(regex_entities)
    for recogniser in pteredactyl_recognisers:
        analyser.registry.add_recognizer(recogniser)

Support

find_substring_positions(s, sep=' ')

Finds the starting and ending indexes of substrings in the input string s. The substrings are determined by splitting s at separator.

Args:
    s (str): The input string containing substrings separated by `sep`.
    sep (str): Separator for substrings.

Returns:
    list[tuple[int, int]]: A list of tuples, each containing the start and end index of a substring.

Examples:
>>> s = "abc\ndef"
>>> positions = find_substring_positions(s, sep="\n")
>>> print("Replacement Positions: ", positions)
Replacement Positions: [(0, 3), (4, 7)]

Source code in pteredactyl\pteredactyl\support.py
def find_substring_positions(s: str, sep: str = " ") -> list[tuple[int, int]]:
    """Finds the starting and ending indexes of substrings in the input string `s`.
    The substrings are determined by splitting `s` at separator.

    Args:
        s (str): The input string containing substrings separated by `sep`.
        sep (str): Separator for substrings.

    Returns:
        list[tuple[int, int]]: A list of tuples, each containing the start and end index of a substring.

    Examples:
    >>> s = "abc\ndef"
    >>> positions = find_substring_positions(s, sep="\n")
    >>> print("Replacement Positions: ", positions)
    Replacement Positions: [(0, 3), (4, 7)]
    """
    replacement_positions = []

    for substring in s.split(sep):
        for match in re.finditer(re.escape(substring), s):
            start, end = match.span()
            replacement_positions.append((start, end))

    return replacement_positions
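The docstring example can be reproduced standalone; the function body below is copied from the source above so the block is self-contained:

```python
import re

def find_substring_positions(s: str, sep: str = " ") -> list[tuple[int, int]]:
    # Copied from the source above: locate each sep-delimited substring in s
    replacement_positions = []
    for substring in s.split(sep):
        for match in re.finditer(re.escape(substring), s):
            replacement_positions.append(match.span())
    return replacement_positions

print(find_substring_positions("abc\ndef", sep="\n"))  # [(0, 3), (4, 7)]
print(find_substring_positions("John Doe"))            # [(0, 4), (5, 8)]
```

Note that because each substring is searched with `re.finditer` over the whole string, a substring that occurs more than once yields a position for every occurrence.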

load_nlp_configuration(language, spacy_model)

Loads NLP configuration for spacy model

Parameters:

- language (str): Model language (e.g. en). [required]
- spacy_model (str): Name of spacy model (e.g. en_core_web_sm). [required]

Returns:

- dict[str, Any]: Configuration dictionary that can be passed to create an NlpEngineProvider.

Source code in pteredactyl\pteredactyl\support.py
def load_nlp_configuration(language: str, spacy_model: str) -> dict[str, Any]:
    """Loads NLP configuration for spacy model

    Args:
        language (str): Model language (e.g. en)
        spacy_model (str): Name of spacy model (e.g. en_core_web_sm)

    Returns:
        dict: configuration dictionary that can be passed to create an NlpEngineProvider
    """
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {
                "lang_code": language,
                "model_name": spacy_model,
            }
        ],
        "ner_model_configuration": {"labels_to_ignore": SPACY_LABELS_TO_IGNORE},
    }
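Since the returned configuration is plain data, its shape can be checked without spaCy installed. The sketch below reproduces the function body with a placeholder ignore list (`SPACY_LABELS_TO_IGNORE` is defined elsewhere in pteredactyl and may differ):

```python
from typing import Any

# Placeholder: the real list lives in pteredactyl and may differ.
SPACY_LABELS_TO_IGNORE = ["CARDINAL", "ORDINAL"]


def load_nlp_configuration(language: str, spacy_model: str) -> dict[str, Any]:
    # Same structure as pteredactyl.support.load_nlp_configuration.
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": language, "model_name": spacy_model}],
        "ner_model_configuration": {"labels_to_ignore": SPACY_LABELS_TO_IGNORE},
    }


cfg = load_nlp_configuration("en", "en_core_web_sm")
assert cfg["nlp_engine_name"] == "spacy"
assert cfg["models"][0] == {"lang_code": "en", "model_name": "en_core_web_sm"}
```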

load_nlp_engine(presidio_logger, nlp_configuration)

Creates and loads an NlpEngine via an NlpEngineProvider.

Parameters:

- presidio_logger (Logger): Logger object to set and restore logging level. Required.
- nlp_configuration (dict): Configuration for the NlpEngineProvider. Required.

Returns:

- NlpEngine: The loaded engine.

Source code in pteredactyl\pteredactyl\support.py
def load_nlp_engine(
    presidio_logger: Logger, nlp_configuration: dict[str, Any]
) -> NlpEngine:
    """
    Loads a NlpEngineProvider by creating a new engine.

    Args:
        presidio_logger (Logger): Logger object to set and restore logging level.
        nlp_configuration (dict): Configuration for the NlpEngineProvider.

    Returns:
        NlpEngineProvider: The loaded engine.
    """
    log_level = presidio_logger.level
    presidio_logger.setLevel("ERROR")
    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()
    presidio_logger.setLevel(log_level)

    return nlp_engine
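The save-then-restore dance around the logger level is a general "silence a noisy call" pattern. A minimal sketch, using Python's standard `logging` module rather than presidio itself (`with_quiet_logger` is a hypothetical helper, not part of pteredactyl):

```python
import logging


def with_quiet_logger(logger: logging.Logger, fn, *args, **kwargs):
    """Run fn with the logger raised to ERROR, restoring the old level after."""
    old_level = logger.level
    logger.setLevel(logging.ERROR)
    try:
        return fn(*args, **kwargs)
    finally:
        logger.setLevel(old_level)


log = logging.getLogger("pteredactyl_demo")
log.setLevel(logging.INFO)
suppressed = with_quiet_logger(log, lambda: log.isEnabledFor(logging.INFO))
assert suppressed is False          # INFO was suppressed inside the call
assert log.level == logging.INFO    # original level restored afterwards
```

Unlike `load_nlp_engine`, the sketch restores the level in a `finally` block, so a failure inside the wrapped call cannot leave the logger stuck at ERROR.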

load_registry(transformers_recogniser, regex_entities)

Creates an AnalyzerEngine.registry by combining a TransformersRecogniser with a list of custom PteredactylRecognisers

Parameters:

- transformers_recogniser (TransformersRecogniser): Custom transformers recogniser. Required.
- regex_entities (list[str | PteredactylRecogniser]): Named regex entities used to generate PteredactylRecognisers, or custom PteredactylRecognisers. Required.

Returns:

- RecognizerRegistry: registry of Recognisers for an AnalyzerEngine

Source code in pteredactyl\pteredactyl\support.py
def load_registry(
    transformers_recogniser: TransformersRecogniser,
    regex_entities: Sequence[str | PteredactylRecogniser],
) -> RecognizerRegistry:
    """Creates an AnalyzerEngine.registry by combining a TransformersRecogniser with a list of custom PteredactylRecognisers

    Args:
        transformers_recogniser (TransformersRecogniser): Custom transformers recogniser
        regex_entities (list[str | PteredactylRecogniser]): Named regex entities used to generate PteredactylRecognisers, or custom PteredactylRecognisers

    Returns:
        RecognizerRegistry: registry of Recognisers for an AnalyzerEngine
    """
    registry = RecognizerRegistry()
    # registry.load_predefined_recognizers() # Presidio default recognizers - largely not needed
    registry.add_recognizer(transformers_recogniser)
    registry.remove_recognizer("SpacyRecognizer")

    if regex_entities:
        for entity in regex_entities:
            if isinstance(entity, str):
                recogniser = fetch_pteredactyl_recogniser(entity_type=entity)
            elif isinstance(entity, PteredactylRecogniser):
                recogniser = entity
            registry.add_recognizer(recogniser)

    return registry
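The interesting part of `load_registry` is the string-or-object dispatch on each regex entity. The sketch below isolates that branch with a hypothetical `normalise_entities` helper and an injected factory, so it runs without presidio installed:

```python
def normalise_entities(regex_entities, fetch):
    """Sketch of load_registry's dispatch: strings are resolved through a
    factory (fetch_pteredactyl_recogniser in the real code), while
    already-built recognisers pass through unchanged."""
    out = []
    for entity in regex_entities:
        out.append(fetch(entity) if isinstance(entity, str) else entity)
    return out


prebuilt = object()  # stands in for a custom PteredactylRecogniser
resolved = normalise_entities(
    ["NHS_NUMBER", prebuilt], fetch=lambda name: f"recogniser:{name}"
)
assert resolved == ["recogniser:NHS_NUMBER", prebuilt]
```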

load_spacy_model(spacy_model)

Downloads spacy model if not already installed

Parameters:

- spacy_model (str): Name of spacy model. Required.
Source code in pteredactyl\pteredactyl\support.py
def load_spacy_model(spacy_model: str) -> None:
    """Downloads spacy model if not already installed

    Args:
        spacy_model (str): Name of spacy model
    """
    if not spacy.util.is_package(spacy_model):
        print(f"Downloading model '{spacy_model}' for the first time, please wait...")
        spacy.cli.download(spacy_model)

load_transformers_recognizer(model_path)

Loads transformers recognizer with the specified model path

Parameters:

- model_path (str): Path to the transformer model. Required.

Returns:

- TransformersRecogniser: Loaded transformers recognizer

Source code in pteredactyl\pteredactyl\support.py
def load_transformers_recognizer(model_path: str) -> TransformersRecogniser:
    """Loads transformers recognizer with the specified model path

    Args:
        model_path (str): Path to the transformer model

    Returns:
        TransformersRecogniser: Loaded transformers recognizer
    """
    print(f"Loading transformers recognizer with model path: {model_path}")
    config = _get_config(model_path=model_path)
    transformers_recognizer = TransformersRecogniser(model_path=model_path)
    transformers_recognizer.load_transformer(**config)
    print(f"Model {model_path} loaded successfully")
    return transformers_recognizer

return_allowed_results(initial_results, allowed_entities, allowed_regex_entities)

Checks a list of RecognizerResults for allowed entities and returns a list of allowed results.

Parameters:

- initial_results (list[RecognizerResult]): The list of RecognizerResults to filter. Required.
- allowed_entities (list[str]): The list of entity types to allow. Required.
- allowed_regex_entities (list[str]): The list of regex entity types to allow. Required.

Returns:

- list[RecognizerResult]: The filtered list of RecognizerResults.

Source code in pteredactyl\pteredactyl\support.py
def return_allowed_results(
    initial_results: list[RecognizerResult],
    allowed_entities: list[str],
    allowed_regex_entities: list[str],
) -> list[RecognizerResult]:
    """
    Checks a list of RecognizerResults for allowed entities and returns a list of allowed results.

    Args:
        initial_results (list[RecognizerResult]): The list of RecognizerResults to filter.
        allowed_entities (list[str]): The list of entity types to allow.
        allowed_regex_entities (list[str]): The list of regex entity types to allow.

    Returns:
        list[RecognizerResult]: The filtered list of RecognizerResults.
    """
    results = []
    for result in initial_results:
        recogniser = result.recognition_metadata["recognizer_name"]
        entity_type = result.entity_type

        if recogniser == PTEREDACTYL_RECOGNISER_NAME:
            if entity_type in allowed_regex_entities:
                results.append(result)
        else:
            if entity_type in allowed_entities:
                results.append(result)

    return results
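The filter only reads two attributes of each result, so it can be exercised with a lightweight stand-in for presidio's RecognizerResult. In the sketch below, `FakeResult` and the recogniser name constant are stand-ins, not the real presidio types:

```python
from dataclasses import dataclass, field

PTEREDACTYL_RECOGNISER_NAME = "PteredactylRecogniser"  # assumed constant value


@dataclass
class FakeResult:  # stand-in for presidio's RecognizerResult
    entity_type: str
    recognition_metadata: dict = field(default_factory=dict)


def return_allowed_results(initial_results, allowed_entities, allowed_regex_entities):
    # Same branching as pteredactyl.support.return_allowed_results.
    results = []
    for result in initial_results:
        recogniser = result.recognition_metadata["recognizer_name"]
        if recogniser == PTEREDACTYL_RECOGNISER_NAME:
            if result.entity_type in allowed_regex_entities:
                results.append(result)
        elif result.entity_type in allowed_entities:
            results.append(result)
    return results


person = FakeResult("PERSON", {"recognizer_name": "TransformersRecogniser"})
nhs = FakeResult("NHS_NUMBER", {"recognizer_name": PTEREDACTYL_RECOGNISER_NAME})
kept = return_allowed_results([person, nhs], ["PERSON"], [])
assert kept == [person]  # NHS_NUMBER dropped: not in allowed_regex_entities
```

Regex-derived results are checked against `allowed_regex_entities`, everything else against `allowed_entities`, so the two allow-lists never interfere with each other.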

split_results_into_individual_words(text, results, text_separator=' ')

Splits identified RecognizerResults into individual words. For example, Jane Smith becomes <PERSON> <PERSON>, rather than <PERSON>.

Parameters:

- text (str): The text that was analyzed. Required.
- results (list[RecognizerResult]): The results of the analysis. Required.
- text_separator (str): The separator used to split the text into individual words. Default: ' '.

Returns:

- list[RecognizerResult]: A list of RecognizerResults, each representing an individual word.

Source code in pteredactyl\pteredactyl\support.py
def split_results_into_individual_words(
    text: str, results: list[RecognizerResult], text_separator: str = " "
) -> list[RecognizerResult]:
    """
    Splits identified RecognizerResults into individual words. For example, Jane Smith becomes <PERSON> <PERSON>, rather than <PERSON>.

    Args:
        text (str): The text that was analyzed.
        results (list[RecognizerResult]): The results of the analysis.
        text_separator (str): The separator used to split the text into individual words.

    Returns:
        list[RecognizerResult]: A list of RecognizerResults, each representing an individual word.
    """
    masked_individual_words_results = []
    for result in results:
        substrings = text[result.start : result.end]
        for substring_position in find_substring_positions(
            substrings, sep=text_separator
        ):
            offset = result.start
            masked_individual_words_results.append(
                RecognizerResult(
                    entity_type=result.entity_type,
                    start=substring_position[0] + offset,
                    end=substring_position[1] + offset,
                    score=result.score,
                    analysis_explanation=result.analysis_explanation,
                    recognition_metadata=result.recognition_metadata,
                )
            )
    return masked_individual_words_results
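The key step is the offset arithmetic: positions are found within the matched span, then shifted by the span's start so they index into the full text. A runnable sketch with a stand-in result class (`FakeResult` is hypothetical; the real code uses presidio's RecognizerResult):

```python
import re
from dataclasses import dataclass


@dataclass
class FakeResult:  # stand-in for presidio's RecognizerResult span fields
    entity_type: str
    start: int
    end: int


def find_substring_positions(s, sep=" "):
    # Same logic as pteredactyl.support.find_substring_positions.
    return [m.span() for sub in s.split(sep)
            for m in re.finditer(re.escape(sub), s)]


def split_results_into_individual_words(text, results, text_separator=" "):
    out = []
    for result in results:
        span_text = text[result.start:result.end]
        for start, end in find_substring_positions(span_text, sep=text_separator):
            # Shift word-relative positions back into full-text coordinates.
            out.append(FakeResult(result.entity_type,
                                  start + result.start, end + result.start))
    return out


text = "Seen by Jane Smith today"
hit = FakeResult("PERSON", 8, 18)  # covers "Jane Smith"
words = split_results_into_individual_words(text, [hit])
assert [(w.start, w.end) for w in words] == [(8, 12), (13, 18)]
```

The single PERSON span over "Jane Smith" becomes two PERSON spans, one per word, each carrying the original entity type.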