
Overview

The pteredactyl module provides various functions and classes for data anonymization and redaction.

Defaults

change_model(new_model)

Change the default NER model.

Parameters

- new_model (str): The new model path to be set as the default NER model.

Returns

None

Source code in pteredactyl\pteredactyl\defaults.py
def change_model(new_model: str) -> None:
    """
    Change the default NER model.

    Parameters
    ----------
    new_model : str
        The new model path to be set as the default NER model.

    Returns
    -------
    None
    """
    global DEFAULT_NER_MODEL
    DEFAULT_NER_MODEL = new_model
    print(f"DEFAULT_NER_MODEL changed to: {DEFAULT_NER_MODEL}")

show_defaults()

Print the default values used by pteredactyl.

This function shows the default values for the following variables:
- DEFAULT_NER_MODEL (for model_path)
- DEFAULT_SPACY_MODEL (for spacy_model)
- DEFAULT_ENTITIES (for entities)
- DEFAULT_REGEX_ENTITIES (for regex_entities)

Returns

None

Source code in pteredactyl\pteredactyl\defaults.py
def show_defaults() -> None:
    """
    Print the default values used by pteredactyl.

    This function shows the default values for the following variables:
    - DEFAULT_NER_MODEL (for model_path)
    - DEFAULT_SPACY_MODEL (for spacy_model)
    - DEFAULT_ENTITIES (for entities)
    - DEFAULT_REGEX_ENTITIES (for regex_entities)

    Returns
    -------
    None
    """
    print("PteRedactyl Defaults")
    print("--------------------")
    print(f"DEFAULT_NER_MODEL:      {DEFAULT_NER_MODEL}")
    print(f"DEFAULT_SPACY_MODEL:    {DEFAULT_SPACY_MODEL}")
    print(f"DEFAULT_ENTITIES:       {DEFAULT_ENTITIES}")
    print(f"DEFAULT_REGEX_ENTITIES: {DEFAULT_REGEX_ENTITIES}")

Exceptions

MissingRegexRecogniserError

Bases: KeyError

Exception raised when a regex recogniser is requested but not found in the supported regex_entities list.

Attributes:

- message (str): The error message.

Source code in pteredactyl\pteredactyl\exceptions.py
class MissingRegexRecogniserError(KeyError):
    """
    Exception raised when a regex recogniser is requested but not found in the supported regex_entities list.

    Attributes:
        message (str): The error message.
    """

    def __init__(
        self,
        message: str = "No regex settings could be detected in pteredactyl.regex_entities",
    ):
        super().__init__(message)
        self.message = message

Redactor

This module provides functionality for text redaction.

Create an analyser engine with a Transformers NER model and spaCy model.

Source code in pteredactyl\pteredactyl\redactor.py
def create_analyser(
    model_path: str = DEFAULT_NER_MODEL,
    spacy_model: str = DEFAULT_SPACY_MODEL,
    language: str = "en",
    regex_entities: Sequence[str | PteredactylRecogniser] = DEFAULT_REGEX_ENTITIES,
) -> AnalyzerEngine:
    """
    Create an analyser engine with a Transformers NER model and spaCy model.
    """
    if not model_path:
        raise ValueError("No model path provided for NER model.")

    print(f"Using model path: {model_path}")

    if regex_entities:
        regex_entities = build_regex_entity_recogniser_list(
            regex_entities=regex_entities
        )

    load_spacy_model(spacy_model)

    transformers_recogniser = load_transformers_recognizer(model_path)

    nlp_configuration = load_nlp_configuration(
        language=language, spacy_model=spacy_model
    )

    registry = load_registry(
        transformers_recogniser=transformers_recogniser, regex_entities=regex_entities
    )

    nlp_engine = load_nlp_engine(
        presidio_logger=presidio_logger, nlp_configuration=nlp_configuration
    )

    analyser = AnalyzerEngine(nlp_engine=nlp_engine, registry=registry)

    return analyser

Redactor Analyser

Analyses text using the provided NER models and entities, and returns a list of the entities identified. It is recommended to first create an analyser and pass it in for reuse:

>>> analyser = create_analyser()
>>> results = analyse(text=text, analyser=analyser)

Parameters:

- text (str): The text to be analyzed. [required]
- analyser (AnalyzerEngine): An instance of AnalyzerEngine. If not provided, a new analyser will be created (recommend creating first via create_analyser(), before feeding in). [default: None]
- entities (list): A list of entity types to analyse. If not provided, a default list will be used. [default: DEFAULT_ENTITIES]
- regex_entities (list): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used. [default: DEFAULT_REGEX_ENTITIES]
- model_path (str): The path to the model used for analysis (e.g. 'StanfordAIMI/stanford-deidentifier-base'). Used only if analyser not provided. [default: DEFAULT_NER_MODEL]
- spacy_model (str): The spaCy model to use (e.g. 'en_core_web_sm'). Used only if analyser not provided. [default: DEFAULT_SPACY_MODEL]
- language (str): The language of the text to be analyzed. Used only if analyser not provided. [default: 'en']
- mask_individual_words (bool): If True, prevents adjacent entities from being joined together (i.e. with Jane Smith, both 'Jane' and 'Smith' are identified separately if True, combined if False). [default: False]
- text_separator (str): Text separator. [default: ' ']
- rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution. [default: True]
- **kwargs: Additional keyword arguments for the analyzer. [default: {}]

Returns:

- list[RecognizerResult]: The analysis results.

Example

>>> from pteredactyl.redactor import analyse
>>> text = "My name is John Doe and my NHS number is 7890123450"
>>> results = analyse(text)
>>> print(results)
[RecognizerResult(entity_type='PERSON', start=10, end=19, score=1.0),
 RecognizerResult(entity_type='NHS_NUMBER', start=36, end=46, score=1.0)]

Source code in pteredactyl\pteredactyl\redactor.py
def analyse(
    text: str,
    analyser: AnalyzerEngine | None = None,
    entities: str | list[str] = DEFAULT_ENTITIES,
    regex_entities: Sequence[str | PteredactylRecogniser] = DEFAULT_REGEX_ENTITIES,
    model_path: str = DEFAULT_NER_MODEL,
    spacy_model: str = DEFAULT_SPACY_MODEL,
    language: str = "en",
    mask_individual_words: bool = False,
    text_separator: str = " ",
    rebuild_regex_recognisers: bool = True,
    **kwargs,
) -> list[RecognizerResult]:
    """
    Analyses text using the provided NER models and entities, and returns a list of the entities identified.
    It is recommended to first create an analyser and pass it in for reuse:
        >>> analyser = create_analyser()
        >>> results = analyse(text=text, analyser=analyser)

    Args:
        text (str): The text to be analyzed.
        analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created
            (recommend creating first via create_analyser(), before feeding in).
        entities (list, optional): A list of entity types to analyse. If not provided, a default list will be used.
        regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
        model_path (str): The path to the model used for analysis (e.g. 'StanfordAIMI/stanford-deidentifier-base'). Used only if analyser not provided.
        spacy_model (str): The spaCy model to use (e.g. 'en_core_web_sm'). Used only if analyser not provided.
        language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
        mask_individual_words (bool): If True, prevents adjacent entities from being joined together
            (i.e. with Jane Smith, both 'Jane' and 'Smith' are identified separately if True, combined if False). Defaults to False.
        text_separator (str): Text separator. Default is whitespace.
        rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution.
        **kwargs: Additional keyword arguments for the analyzer.

    Returns:
        list: The analysis results.

    Example:
        >>> from pteredactyl.redactor import analyse
        >>> text = "My name is John Doe and my NHS number is 7890123450"
        >>> results = analyse(text)
        >>> print(results)
        [RecognizerResult(entity_type='PERSON', start=10, end=19, score=1.0),
         RecognizerResult(entity_type='NHS_NUMBER', start=36, end=46, score=1.0)]
    """

    # Prepare
    entities = [entities] if isinstance(entities, str) else entities if entities else []
    regex_entities = (
        build_regex_entity_recogniser_list(regex_entities=regex_entities)
        if regex_entities
        else []
    )
    allowed_entities = entities
    allowed_regex_entities = [
        regex_entity.entity_type for regex_entity in regex_entities
    ]
    entities = allowed_entities + allowed_regex_entities

    # Check Analyser
    if not analyser:
        analyser = create_analyser(
            model_path=model_path,
            spacy_model=spacy_model,
            language=language,
            regex_entities=regex_entities,
        )
    else:
        if rebuild_regex_recognisers:
            rebuild_analyser_regex_recognisers(
                analyser=analyser, regex_entities=regex_entities
            )

    # Analyse
    initial_results = analyser.analyze(
        text, language=language, entities=entities, **kwargs
    )

    if mask_individual_words:
        initial_results = split_results_into_individual_words(
            text=text, results=initial_results, text_separator=text_separator
        )

    results = return_allowed_results(
        initial_results=initial_results,
        allowed_entities=allowed_entities,
        allowed_regex_entities=allowed_regex_entities,
    )

    results.sort(key=lambda x: x.start)

    return results
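The "Prepare" block at the top of `analyse` normalises a bare entity string into a list and appends the regex entity types. A minimal standalone sketch of those lines (`prepare_entities` is a hypothetical helper, with plain strings standing in for the recognisers' entity types):

```python
def prepare_entities(entities, regex_entity_types):
    """Mirror of the normalisation at the top of analyse()/anonymise():
    a bare string becomes a one-element list, None or empty becomes [],
    and the regex entity types are appended to the allowed entities."""
    entities = [entities] if isinstance(entities, str) else list(entities or [])
    return entities + list(regex_entity_types)

print(prepare_entities("PERSON", ["NHS_NUMBER", "POSTCODE"]))
# ['PERSON', 'NHS_NUMBER', 'POSTCODE']
print(prepare_entities(None, ["NHS_NUMBER"]))
# ['NHS_NUMBER']
```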

Redactor Anonymiser

Anonymises the given text by replacing entities identified by NER, and regex entities identified by regex matching. Regex entities take priority and are analysed first. It is recommended to first create an analyser and pass it in for reuse.

Args:
    text (str): The text to be anonymized.
    analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created.
    entities (list, optional): A list of entity types to anonymize. If not provided, a default list will be used.
    regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
    highlight (bool): If True, highlights the anonymized parts in the text.
    replacement_lists (dict, optional): A dictionary with entity types as keys and lists of replacement values for hide-in-plain-sight redaction.
    model_path (str): The path to the model used for analysis. Used only if analyser not provided.
    spacy_model (str): The spaCy model to use. Used only if analyser not provided.
    language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
    mask_individual_words (bool): If True, prevents adjacent entities from being joined together (i.e. Jane Smith becomes <PERSON> <PERSON> if True, or <PERSON> if False). Defaults to False.
    text_separator (str): Text separator. Default is whitespace.
    rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution.
    **kwargs: Additional keyword arguments for analyse.

Returns:
    str: The anonymized text.

Example

>>> analyser = create_analyser()
>>> text = '''
    Patient Name: John Doe
    NHS Number: 7890123450
    Address: AB1 0CD
    Date: January 1, 2022
    Diagnostic Findings:
    The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
    The mass is suspected to be malignant and is likely to be a tumor.
    Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
    Recommendations:
    The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
    If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
    '''

>>> results = anonymise(text, analyser=analyser, entities=["DATE_TIME", "PERSON"], regex_entities=["POSTCODE", "NHS_NUMBER"])
>>> print(results)

Patient Name: <PERSON>
NHS Number: <NHS_NUMBER>
Address: <POSTCODE>
Date: <DATE_TIME>
Diagnostic Findings:
The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
The mass is suspected to be malignant and is likely to be a tumor.
Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
Recommendations:
The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
Source code in pteredactyl\pteredactyl\redactor.py
def anonymise(
    text: str,
    analyser: AnalyzerEngine | None = None,
    entities: str | list[str] = DEFAULT_ENTITIES,
    regex_entities: Sequence[str | PteredactylRecogniser] = DEFAULT_REGEX_ENTITIES,
    highlight: bool = False,
    replacement_lists: dict | None = None,
    model_path: str = DEFAULT_NER_MODEL,
    spacy_model: str = DEFAULT_SPACY_MODEL,
    language: str = "en",
    mask_individual_words: bool = False,
    text_separator: str = " ",
    rebuild_regex_recognisers: bool = True,
    **kwargs,
) -> str:
    """
    Anonymises the given text by replacing specified entities by NER, and regex entities by REGEX. Regex entities take priority and are analysed first.
    It is recommended to first create an analyser and feed this in to be reused.

    Args:
    text (str): The text to be anonymized.
    analyser (AnalyzerEngine, optional): An instance of AnalyzerEngine. If not provided, a new analyser will be created.
    entities (list, optional): A list of entity types to anonymize. If not provided, a default list will be used.
    regex_entities (list, optional): A list of regex entities or PteredactylRecognisers to analyse. If not provided, a default list will be used.
    highlight (bool): If True, highlights the anonymized parts in the text.
    replacement_lists (dict, optional): A dictionary with entity types as keys and lists of replacement values for hide-in-plain-sight redaction.
    model_path (str): The path to the model used for analysis. Used only if analyser not provided.
    spacy_model (str): The spaCy model to use. Used only if analyser not provided.
    language (str): The language of the text to be analyzed. Defaults to "en". Used only if analyser not provided.
    mask_individual_words (bool): If True, prevents joining of next-door entities together.
            (i.e. Jane Smith becomes <PERSON> <PERSON> if True, or <PERSON> if False). Defaults to False.
    text_separator (str): Text separator. Default is whitespace.
    rebuild_regex_recognisers (bool): If True, and an existing analyser is provided, the analyser's regex recognisers will be rebuilt before execution.
    **kwargs: Additional keyword arguments for analyse.

    Returns:
    str: The anonymized text.

    Example:
        >>> analyser = create_analyser()
        >>> text = '''
            Patient Name: John Doe
            NHS Number: 7890123450
            Address: AB1 0CD
            Date: January 1, 2022
            Diagnostic Findings:
            The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
            The mass is suspected to be malignant and is likely to be a tumor.
            Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
            Recommendations:
            The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
            If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
            '''

        >>> results = anonymise(text, analyser=analyser, entities=["DATE_TIME", "PERSON"], regex_entities=["POSTCODE", "NHS_NUMBER"])
        >>> print(results)

            Patient Name: <PERSON>
            NHS Number: <NHS_NUMBER>
            Address: <POSTCODE>
            Date: <DATE_TIME>
            Diagnostic Findings:
            The CT scan of the patient's chest revealed a mass in the right upper lobe of the lungs.
            The mass is suspected to be malignant and is likely to be a tumor.
            Further diagnostic tests, such as biopsy or CT scan of the mass, may be required to confirm the diagnosis.
            Recommendations:
            The patient is advised to consult with a medical specialist for a thorough evaluation of the mass.
            If the tumor is malignant, further treatment, such as surgery or radiotherapy, may be recommended.
    """
    # Prepare
    entities = [entities] if isinstance(entities, str) else entities if entities else []
    regex_entities = (
        build_regex_entity_recogniser_list(regex_entities=regex_entities)
        if regex_entities
        else []
    )
    allowed_entities = entities
    allowed_regex_entities = [
        regex_entity.entity_type for regex_entity in regex_entities
    ]
    entities = allowed_entities + allowed_regex_entities

    # Check Analyser
    if not analyser:
        analyser = create_analyser(
            model_path=model_path,
            spacy_model=spacy_model,
            language=language,
            regex_entities=regex_entities,
        )
    else:
        if rebuild_regex_recognisers:
            rebuild_analyser_regex_recognisers(
                analyser=analyser, regex_entities=regex_entities
            )

    # Analyse the text
    initial_results = analyse(
        text,
        analyser,
        entities=entities,
        regex_entities=regex_entities,
        mask_individual_words=mask_individual_words,
        text_separator=text_separator,
        rebuild_regex_recognisers=False,
        **kwargs,
    )

    # Create an OperatorConfig that randomly selects replacements from the replacement list
    operator_config = None
    if entities:
        if replacement_lists:
            operator_config = {}
            for entity in entities:
                if entity in replacement_lists:
                    operator_config[entity] = OperatorConfig(
                        "replace",
                        {"new_value": random.choice(replacement_lists[entity])},
                    )

    # Anonymise the text
    anonymiser = AnonymizerEngine()

    # if-else is strictly required as the anonymize method modifies initial_results variable when called
    if not mask_individual_words:
        anonymized_result = anonymiser.anonymize(
            text=text, analyzer_results=initial_results, operators=operator_config
        )
    else:
        # this is essentially AnonymizerEngine.anonymize without merging adjacent entities of the same type
        # some discussion around merging adjacent entities: https://github.com/microsoft/presidio/issues/1090
        analyzer_results = anonymiser._remove_conflicts_and_get_text_manipulation_data(
            initial_results, ConflictResolutionStrategy.MERGE_SIMILAR_OR_CONTAINED
        )
        operators = anonymiser._AnonymizerEngine__check_or_add_default_operator(
            operator_config
        )
        anonymized_result = anonymiser._operate(
            text, analyzer_results, operators, OperatorType.Anonymize
        )

    # TODO - could be managed by creating an Operatorconfig for "PHONE_NUMBER"
    anonymised_text = anonymized_result.text.replace("PHONE_NUMBER", "NUMBER")

    return highlight_text(anonymised_text) if highlight else anonymised_text
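The hide-in-plain-sight mapping built inside `anonymise` gives an operator only to entities that have a replacement list. A standalone sketch of that loop, with a plain `(strategy, params)` tuple standing in for presidio's `OperatorConfig`:

```python
import random

# replacement_lists maps entity types to candidate surrogate values;
# entities without a list fall through to the default anonymiser behaviour
replacement_lists = {"PERSON": ["Alex Smith", "Sam Jones", "Jo Bloggs"]}
entities = ["PERSON", "NHS_NUMBER"]

operator_config = {}
for entity in entities:
    if entity in replacement_lists:
        # random.choice picks one surrogate per run, as in the source above
        operator_config[entity] = (
            "replace",
            {"new_value": random.choice(replacement_lists[entity])},
        )

print(sorted(operator_config))  # only PERSON gets a replacement operator
```

Note that the choice is made once per call, so every `PERSON` span in one text receives the same surrogate.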

Regex Check Functions

is_nhs_number(nhs_number)

Check if a given value is a valid NHS number.

Parameters:

- nhs_number (int | str): The NHS number to be checked. [required]

Returns:

- bool: True if the given value is a valid NHS number, otherwise False.

Example

>>> is_nhs_number(9434765919)
True
>>> is_nhs_number("9434765919")
True
>>> is_nhs_number("1234567898")
False  # (fails checksum)
>>> is_nhs_number("12345")
False  # (fails length check)

Note

The NHS number is a 10-digit number used in the United Kingdom for healthcare identification. The last digit of the NHS number is a check digit calculated by a modulus 11 algorithm for validation.

Source code in pteredactyl\pteredactyl\regex_check_functions.py
def is_nhs_number(nhs_number: str | int) -> bool:
    """
    Check if a given value is a valid NHS number.

    Args:
        nhs_number (int | str): The NHS number to be checked.
            May be an int or a string of digits (spaces and hyphens are permitted and will be stripped).

    Returns:
        bool: True if the given value is a valid NHS number, otherwise False.

    Example:
        >>> is_nhs_number(9434765919)
        True
        >>> is_nhs_number("9434765919")
        True
        >>> is_nhs_number("1234567898")
        False  # (fails checksum)
        >>> is_nhs_number("12345")
        False  # (fails length check)

    Note:
        The NHS number is a 10-digit number used in the United Kingdom for healthcare identification.
        The last digit of the NHS number is a check digit calculated by a modulus 11 algorithm for validation.
    """

    # Prepare NHS Number
    nhs_number = (
        str(nhs_number)
        if isinstance(nhs_number, int)
        else nhs_number.replace(" ", "").replace("-", "")
    )

    # Check Only Digits
    if not nhs_number.isdigit():
        return False

    # Check Length
    if len(nhs_number) != 10:
        return False

    # Check Checksum
    total = 0
    for i, digit in enumerate(nhs_number[0:-1]):
        position = i + 1
        multiplier = 11 - position
        total += int(digit) * multiplier

    checksum = 11 - (total % 11)
    checksum = 0 if checksum == 11 else checksum
    check_digit = int(nhs_number[-1])

    if checksum != check_digit or checksum == 10:
        return False

    # All checks passed
    else:
        return True
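The modulus 11 check can be exercised on its own. Below is a compact restatement (`nhs_checksum_valid` is a hypothetical helper mirroring the source above, not part of pteredactyl); 9434765919 is a number whose final digit satisfies the checksum:

```python
def nhs_checksum_valid(nhs_number: str) -> bool:
    """Compact restatement of is_nhs_number's check: the first nine digits
    are weighted 10 down to 2 and summed; the check digit is 11 minus the
    remainder mod 11, with 11 mapping to 0 and 10 meaning invalid."""
    digits = nhs_number.replace(" ", "").replace("-", "")
    if not (digits.isdigit() and len(digits) == 10):
        return False
    total = sum(int(d) * (10 - i) for i, d in enumerate(digits[:9]))
    checksum = 11 - (total % 11)
    checksum = 0 if checksum == 11 else checksum
    return checksum != 10 and checksum == int(digits[9])

print(nhs_checksum_valid("943 476 5919"))  # True: separators stripped, checksum holds
print(nhs_checksum_valid("9434765918"))    # False: wrong check digit
```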

Regex Entities

build_pteredactyl_recogniser(entity_type, regex, check_function)

Build a custom regex recogniser for pteredactyl.

Parameters:

- entity_type (str): The name of the entity to be recognised. [required]
- regex (str | re.Pattern): The regular expression to match the entity. [required]
- check_function (Callable): A function to check if the matched string is a valid entity. Should take a single argument (the matched string) and return a boolean. [required]

Returns:

- PteredactylRecogniser: A custom presidio EntityRecognizer object.

Example:

>>> def check_soton_landline(input: str):
...     cleaned = input.replace('-', '').replace(' ', '')
...     return cleaned.startswith('0238')

>>> recogniser = build_pteredactyl_recogniser(entity_type='SOUTHAMPTON_LANDLINE',
...                                           regex=r'(?:\d[\s-]?){11}',
...                                           check_function=check_soton_landline)

Source code in pteredactyl\pteredactyl\regex_entities.py
def build_pteredactyl_recogniser(
    entity_type: str,
    regex: str | re.Pattern,
    check_function: Callable[..., bool] | None,
) -> PteredactylRecogniser:
    """
    Build a custom regex recogniser for pteredactyl.

    Args:
        entity_type (str): The name of the entity to be recognised.
        regex (str or re.Pattern): The regular expression to match the entity.
        check_function (Callable, optional): A function to check if the matched string is a valid entity. Should take a single argument (the matched string) and return a boolean.

    Returns:
        PteredactylRecogniser: A custom presidio EntityRecognizer object.

    Example:
    >>> def check_soton_landline(input: str):
    ...     cleaned = input.replace('-', '').replace(' ', '')
    ...     return cleaned.startswith('0238')

    >>> recogniser = build_pteredactyl_recogniser(entity_type='SOUTHAMPTON_LANDLINE',
    ...                                           regex=r'(?:\\d[\\s-]?){11}',
    ...                                           check_function=check_soton_landline)
    """

    regex = re.compile(regex) if isinstance(regex, str) else regex
    return PteredactylRecogniser(
        entity_type=entity_type, regex=regex, check_function=check_function
    )
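The example's `check_function` and regex can be tried independently of presidio; the sketch below uses only the standard library, with a hypothetical input string:

```python
import re

def check_soton_landline(s: str) -> bool:
    # As in the docstring example: strip separators, test the 0238 prefix
    cleaned = s.replace('-', '').replace(' ', '')
    return cleaned.startswith('0238')

pattern = re.compile(r'(?:\d[\s-]?){11}')  # eleven digits with optional separators
match = pattern.search("Call 023 8012 3456 for an appointment")

print(match is not None)                      # a candidate number was found
print(check_soton_landline(match.group()))    # and it passes the 0238 check
print(check_soton_landline("020 7946 0000"))  # a London prefix does not
```

This two-stage shape, a broad regex followed by a stricter check function, is the pattern `PteredactylRecogniser` wraps.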

build_regex_entity_recogniser_list(regex_entities)

Build a list of custom regex PteredactylRecognisers.

Parameters:

- regex_entities (list[str | PteredactylRecogniser]): A list of PteredactylRecogniser objects or strings referencing pre-built PteredactylRecognisers. [required]

Returns:

- list[PteredactylRecogniser]: A list of custom presidio EntityRecognizer objects.

Example:

>>> recognisers = build_regex_entity_recogniser_list(['NHS_NUMBER',
...                                                   'ENTITY_2',
...                                                   PteredactylRecogniser(entity_type='SOUTHAMPTON_LANDLINE',
...                                                                         regex=r'\b((?:\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}|\(?01\d{1,4}\)?\s?\d{1,4}\s?\d{1,4})\b',
...                                                                         check_function=check_so_landline)
...                                                  ])

Source code in pteredactyl\pteredactyl\regex_entities.py
def build_regex_entity_recogniser_list(
    regex_entities: str | PteredactylRecogniser | Sequence[str | PteredactylRecogniser],
) -> list[PteredactylRecogniser]:
    """
    Build a list of custom regex PteredactylRecognisers.

    Args:
        regex_entities (list[str or PteredactylRecogniser]): A list of PteredactylRecogniser objects or strings referencing pre-built PteredactylRecognisers.

    Returns:
        list[PteredactylRecogniser]: A list of custom presidio EntityRecognizer objects.

    Example:
    >>> recognisers = build_regex_entity_recogniser_list(['NHS_NUMBER',
    ...                                                    'ENTITY_2',
    ...                                                    PteredactylRecogniser(entity_type='SOUTHAMPTON_LANDLINE',
    ...                                                                          regex=r'\\b((?:\\+44\\s?7\\d{3}|\\(?07\\d{3}\\)?)\\s?\\d{3}\\s?\\d{3}|\\(?01\\d{1,4}\\)?\\s?\\d{1,4}\\s?\\d{1,4})\\b',
    ...                                                                          check_function=check_so_landline)
    ...                                                   ])
    """

    regex_entity_recognisers = []
    if type(regex_entities) in (str, PteredactylRecogniser):
        regex_entities = [regex_entities]

    for regex_entity in regex_entities:
        if isinstance(regex_entity, str):
            regex_entity_recognisers.append(
                fetch_pteredactyl_recogniser(entity_type=regex_entity)
            )
        else:
            regex_entity_recognisers.append(regex_entity)

    return regex_entity_recognisers
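The string-or-recogniser dispatch above can be sketched with a stand-in class (`FakeRecogniser` and the `PREBUILT` table are hypothetical, not part of pteredactyl; `PREBUILT` plays the role of `fetch_pteredactyl_recogniser`):

```python
from dataclasses import dataclass

@dataclass
class FakeRecogniser:
    # Hypothetical stand-in for PteredactylRecogniser
    entity_type: str

# Stand-in registry of pre-built recognisers, keyed by entity type
PREBUILT = {"NHS_NUMBER": FakeRecogniser("NHS_NUMBER")}

def build_list(regex_entities):
    # Same shape as build_regex_entity_recogniser_list: wrap a bare item
    # in a list, then resolve strings via the pre-built table
    if isinstance(regex_entities, (str, FakeRecogniser)):
        regex_entities = [regex_entities]
    return [PREBUILT[e] if isinstance(e, str) else e for e in regex_entities]

result = build_list(["NHS_NUMBER", FakeRecogniser("SOUTHAMPTON_LANDLINE")])
print([r.entity_type for r in result])  # ['NHS_NUMBER', 'SOUTHAMPTON_LANDLINE']
```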

rebuild_analyser_regex_recognisers(analyser, regex_entities)

Rebuilds the analyser's regex recognisers with the supplied list of regex entities.

Parameters:

- analyser (AnalyzerEngine): The analyser to rebuild. [required]
- regex_entities (list[str | PteredactylRecogniser]): The list of regex entities to use. [required]

Returns:

None
Source code in pteredactyl\pteredactyl\regex_entities.py
def rebuild_analyser_regex_recognisers(
    analyser: AnalyzerEngine, regex_entities: Sequence[str | PteredactylRecogniser]
) -> None:
    """
    Rebuilds the analyser's regex recognisers with the supplied list of regex entities.

    Args:
        analyser (AnalyzerEngine): The analyser to rebuild.
        regex_entities (list[str or PteredactylRecogniser]): The list of regex entities to use.

    Returns:
        None
    """

    analyser.registry.remove_recognizer(PTEREDACTYL_RECOGNISER_NAME)
    pteredactyl_recognisers = build_regex_entity_recogniser_list(regex_entities)
    for recogniser in pteredactyl_recognisers:
        analyser.registry.add_recognizer(recogniser)

Support

find_substring_positions(s, sep=' ')

Finds the starting and ending indexes of substrings in the input string s. The substrings are determined by splitting s at separator.

Args:
    s (str): The input string containing substrings separated by `sep`.
    sep (str): Separator for substrings.

Returns:
    list[tuple[int, int]]: A list of tuples, each containing the start and end index of a substring.

Examples:
>>> s = "abc\ndef"
>>> positions = find_substring_positions(s, sep="\n")
>>> print("Replacement Positions: ", positions)
Replacement Positions: [(0, 3), (4, 7)]

Source code in pteredactyl\pteredactyl\support.py
def find_substring_positions(s: str, sep: str = " ") -> list[tuple[int, int]]:
    """Finds the starting and ending indexes of substrings in the input string `s`.
    The substrings are determined by splitting `s` at separator.

    Args:
        s (str): The input string containing substrings separated by `sep`.
        sep (str): Separator for substrings.

    Returns:
        list[tuple[int, int]]: A list of tuples, each containing the start and end index of a substring.

    Examples:
    >>> s = "abc\ndef"
    >>> positions = find_substring_positions(s, sep="\n")
    >>> print("Replacement Positions: ", positions)
    Replacement Positions: [(0, 3), (4, 7)]
    """
    replacement_positions = []

    for substring in s.split(sep):
        for match in re.finditer(re.escape(substring), s):
            start, end = match.span()
            replacement_positions.append((start, end))

    return replacement_positions
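The docstring example can be reproduced standalone; the function body below is copied from the source above so the block is self-contained:

```python
import re

def find_substring_positions(s: str, sep: str = " ") -> list[tuple[int, int]]:
    # Copied from the source above: locate each sep-delimited substring in s
    replacement_positions = []
    for substring in s.split(sep):
        for match in re.finditer(re.escape(substring), s):
            replacement_positions.append(match.span())
    return replacement_positions

print(find_substring_positions("abc\ndef", sep="\n"))  # [(0, 3), (4, 7)]
print(find_substring_positions("John Doe"))            # [(0, 4), (5, 8)]
```

Note that because each substring is searched with `re.finditer` over the whole string, a substring that occurs more than once yields a position for every occurrence.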

load_nlp_configuration(language, spacy_model)

Loads NLP configuration for spacy model

Parameters:

- language (str): Model language (e.g. en). [required]
- spacy_model (str): Name of spacy model (e.g. en_core_web_sm). [required]

Returns:

- dict[str, Any]: Configuration dictionary that can be passed to create an NlpEngineProvider.

Source code in pteredactyl\pteredactyl\support.py
def load_nlp_configuration(language: str, spacy_model: str) -> dict[str, Any]:
    """Loads NLP configuration for spacy model

    Args:
        language (str): Model language (e.g. en)
        spacy_model (str): Name of spacy model (e.g. en_core_web_sm)

    Returns:
        dict: configuration dictionary that can be passed to create an NlpEngineProvider
    """
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {
                "lang_code": language,
                "model_name": spacy_model,
            }
        ],
        "ner_model_configuration": {"labels_to_ignore": SPACY_LABELS_TO_IGNORE},
    }
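Since the returned configuration is plain data, its shape can be checked without spaCy installed. The sketch below reproduces the function body with a placeholder ignore list (`SPACY_LABELS_TO_IGNORE` is defined elsewhere in pteredactyl and may differ):

```python
from typing import Any

# Placeholder: the real list lives in pteredactyl and may differ.
SPACY_LABELS_TO_IGNORE = ["CARDINAL", "ORDINAL"]


def load_nlp_configuration(language: str, spacy_model: str) -> dict[str, Any]:
    # Same structure as pteredactyl.support.load_nlp_configuration.
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": language, "model_name": spacy_model}],
        "ner_model_configuration": {"labels_to_ignore": SPACY_LABELS_TO_IGNORE},
    }


cfg = load_nlp_configuration("en", "en_core_web_sm")
assert cfg["nlp_engine_name"] == "spacy"
assert cfg["models"][0] == {"lang_code": "en", "model_name": "en_core_web_sm"}
```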

load_nlp_engine(presidio_logger, nlp_configuration)

Creates and loads an NlpEngine via an NlpEngineProvider.

Parameters:

- presidio_logger (Logger): Logger object to set and restore logging level. Required.
- nlp_configuration (dict): Configuration for the NlpEngineProvider. Required.

Returns:

- NlpEngine: The loaded engine.

Source code in pteredactyl\pteredactyl\support.py
def load_nlp_engine(
    presidio_logger: Logger, nlp_configuration: dict[str, Any]
) -> NlpEngine:
    """
    Loads a NlpEngineProvider by creating a new engine.

    Args:
        presidio_logger (Logger): Logger object to set and restore logging level.
        nlp_configuration (dict): Configuration for the NlpEngineProvider.

    Returns:
        NlpEngineProvider: The loaded engine.
    """
    log_level = presidio_logger.level
    presidio_logger.setLevel("ERROR")
    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()
    presidio_logger.setLevel(log_level)

    return nlp_engine
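The save-then-restore dance around the logger level is a general "silence a noisy call" pattern. A minimal sketch, using Python's standard `logging` module rather than presidio itself (`with_quiet_logger` is a hypothetical helper, not part of pteredactyl):

```python
import logging


def with_quiet_logger(logger: logging.Logger, fn, *args, **kwargs):
    """Run fn with the logger raised to ERROR, restoring the old level after."""
    old_level = logger.level
    logger.setLevel(logging.ERROR)
    try:
        return fn(*args, **kwargs)
    finally:
        logger.setLevel(old_level)


log = logging.getLogger("pteredactyl_demo")
log.setLevel(logging.INFO)
suppressed = with_quiet_logger(log, lambda: log.isEnabledFor(logging.INFO))
assert suppressed is False          # INFO was suppressed inside the call
assert log.level == logging.INFO    # original level restored afterwards
```

Unlike `load_nlp_engine`, the sketch restores the level in a `finally` block, so a failure inside the wrapped call cannot leave the logger stuck at ERROR.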

load_registry(transformers_recogniser, regex_entities)

Creates an AnalyzerEngine.registry by combining a TransformersRecogniser with a list of custom PteredactylRecognisers

Parameters:

- transformers_recogniser (TransformersRecogniser): Custom transformers recogniser. Required.
- regex_entities (list[str | PteredactylRecogniser]): Named regex entities used to generate PteredactylRecognisers, or custom PteredactylRecognisers. Required.

Returns:

- RecognizerRegistry: registry of Recognisers for an AnalyzerEngine

Source code in pteredactyl\pteredactyl\support.py
def load_registry(
    transformers_recogniser: TransformersRecogniser,
    regex_entities: Sequence[str | PteredactylRecogniser],
) -> RecognizerRegistry:
    """Creates an AnalyzerEngine.registry by combining a TransformersRecogniser with a list of custom PteredactylRecognisers

    Args:
        transformers_recogniser (TransformersRecogniser): Custom transformers recogniser
        regex_entities (list[str | PteredactylRecogniser]): Named regex entities used to generate PteredactylRecognisers, or custom PteredactylRecognisers

    Returns:
        RecognizerRegistry: registry of Recognisers for an AnalyzerEngine
    """
    registry = RecognizerRegistry()
    # registry.load_predefined_recognizers() # Presidio default recognizers - largely not needed
    registry.add_recognizer(transformers_recogniser)
    registry.remove_recognizer("SpacyRecognizer")

    if regex_entities:
        for entity in regex_entities:
            if isinstance(entity, str):
                recogniser = fetch_pteredactyl_recogniser(entity_type=entity)
            elif isinstance(entity, PteredactylRecogniser):
                recogniser = entity
            registry.add_recognizer(recogniser)

    return registry
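The interesting part of `load_registry` is the string-or-object dispatch on each regex entity. The sketch below isolates that branch with a hypothetical `normalise_entities` helper and an injected factory, so it runs without presidio installed:

```python
def normalise_entities(regex_entities, fetch):
    """Sketch of load_registry's dispatch: strings are resolved through a
    factory (fetch_pteredactyl_recogniser in the real code), while
    already-built recognisers pass through unchanged."""
    out = []
    for entity in regex_entities:
        out.append(fetch(entity) if isinstance(entity, str) else entity)
    return out


prebuilt = object()  # stands in for a custom PteredactylRecogniser
resolved = normalise_entities(
    ["NHS_NUMBER", prebuilt], fetch=lambda name: f"recogniser:{name}"
)
assert resolved == ["recogniser:NHS_NUMBER", prebuilt]
```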

load_spacy_model(spacy_model)

Downloads spacy model if not already installed

Parameters:

- spacy_model (str): Name of spacy model. Required.
Source code in pteredactyl\pteredactyl\support.py
def load_spacy_model(spacy_model: str) -> None:
    """Downloads spacy model if not already installed

    Args:
        spacy_model (str): Name of spacy model
    """
    if not spacy.util.is_package(spacy_model):
        print(f"Downloading model '{spacy_model}' for the first time, please wait...")
        spacy.cli.download(spacy_model)

load_transformers_recognizer(model_path)

Loads transformers recognizer with the specified model path

Parameters:

- model_path (str): Path to the transformer model. Required.

Returns:

- TransformersRecogniser: Loaded transformers recognizer

Source code in pteredactyl\pteredactyl\support.py
def load_transformers_recognizer(model_path: str) -> TransformersRecogniser:
    """Loads transformers recognizer with the specified model path

    Args:
        model_path (str): Path to the transformer model

    Returns:
        TransformersRecogniser: Loaded transformers recognizer
    """
    print(f"Loading transformers recognizer with model path: {model_path}")
    config = _get_config(model_path=model_path)
    transformers_recognizer = TransformersRecogniser(model_path=model_path)
    transformers_recognizer.load_transformer(**config)
    print(f"Model {model_path} loaded successfully")
    return transformers_recognizer

return_allowed_results(initial_results, allowed_entities, allowed_regex_entities)

Checks a list of RecognizerResults for allowed entities and returns a list of allowed results.

Parameters:

- initial_results (list[RecognizerResult]): The list of RecognizerResults to filter. Required.
- allowed_entities (list[str]): The list of entity types to allow. Required.
- allowed_regex_entities (list[str]): The list of regex entity types to allow. Required.

Returns:

- list[RecognizerResult]: The filtered list of RecognizerResults.

Source code in pteredactyl\pteredactyl\support.py
def return_allowed_results(
    initial_results: list[RecognizerResult],
    allowed_entities: list[str],
    allowed_regex_entities: list[str],
) -> list[RecognizerResult]:
    """
    Checks a list of RecognizerResults for allowed entities and returns a list of allowed results.

    Args:
        initial_results (list[RecognizerResult]): The list of RecognizerResults to filter.
        allowed_entities (list[str]): The list of entity types to allow.
        allowed_regex_entities (list[str]): The list of regex entity types to allow.

    Returns:
        list[RecognizerResult]: The filtered list of RecognizerResults.
    """
    results = []
    for result in initial_results:
        recogniser = result.recognition_metadata["recognizer_name"]
        entity_type = result.entity_type

        if recogniser == PTEREDACTYL_RECOGNISER_NAME:
            if entity_type in allowed_regex_entities:
                results.append(result)
        else:
            if entity_type in allowed_entities:
                results.append(result)

    return results
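The filter only reads two attributes of each result, so it can be exercised with a lightweight stand-in for presidio's RecognizerResult. In the sketch below, `FakeResult` and the recogniser name constant are stand-ins, not the real presidio types:

```python
from dataclasses import dataclass, field

PTEREDACTYL_RECOGNISER_NAME = "PteredactylRecogniser"  # assumed constant value


@dataclass
class FakeResult:  # stand-in for presidio's RecognizerResult
    entity_type: str
    recognition_metadata: dict = field(default_factory=dict)


def return_allowed_results(initial_results, allowed_entities, allowed_regex_entities):
    # Same branching as pteredactyl.support.return_allowed_results.
    results = []
    for result in initial_results:
        recogniser = result.recognition_metadata["recognizer_name"]
        if recogniser == PTEREDACTYL_RECOGNISER_NAME:
            if result.entity_type in allowed_regex_entities:
                results.append(result)
        elif result.entity_type in allowed_entities:
            results.append(result)
    return results


person = FakeResult("PERSON", {"recognizer_name": "TransformersRecogniser"})
nhs = FakeResult("NHS_NUMBER", {"recognizer_name": PTEREDACTYL_RECOGNISER_NAME})
kept = return_allowed_results([person, nhs], ["PERSON"], [])
assert kept == [person]  # NHS_NUMBER dropped: not in allowed_regex_entities
```

Regex-derived results are checked against `allowed_regex_entities`, everything else against `allowed_entities`, so the two allow-lists never interfere with each other.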

split_results_into_individual_words(text, results, text_separator=' ')

Splits identified RecognizerResults into individual words. For example, Jane Smith becomes <PERSON> <PERSON>, rather than <PERSON>.

Parameters:

- text (str): The text that was analyzed. Required.
- results (list[RecognizerResult]): The results of the analysis. Required.
- text_separator (str): The separator used to split the text into individual words. Default: ' '.

Returns:

- list[RecognizerResult]: A list of RecognizerResults, each representing an individual word.

Source code in pteredactyl\pteredactyl\support.py
def split_results_into_individual_words(
    text: str, results: list[RecognizerResult], text_separator: str = " "
) -> list[RecognizerResult]:
    """
    Splits identified RecognizerResults into individual words. For example, Jane Smith becomes <PERSON> <PERSON>, rather than <PERSON>.

    Args:
        text (str): The text that was analyzed.
        results (list[RecognizerResult]): The results of the analysis.
        text_separator (str): The separator used to split the text into individual words.

    Returns:
        list[RecognizerResult]: A list of RecognizerResults, each representing an individual word.
    """
    masked_individual_words_results = []
    for result in results:
        substrings = text[result.start : result.end]
        for substring_position in find_substring_positions(
            substrings, sep=text_separator
        ):
            offset = result.start
            masked_individual_words_results.append(
                RecognizerResult(
                    entity_type=result.entity_type,
                    start=substring_position[0] + offset,
                    end=substring_position[1] + offset,
                    score=result.score,
                    analysis_explanation=result.analysis_explanation,
                    recognition_metadata=result.recognition_metadata,
                )
            )
    return masked_individual_words_results
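The key step is the offset arithmetic: positions are found within the matched span, then shifted by the span's start so they index into the full text. A runnable sketch with a stand-in result class (`FakeResult` is hypothetical; the real code uses presidio's RecognizerResult):

```python
import re
from dataclasses import dataclass


@dataclass
class FakeResult:  # stand-in for presidio's RecognizerResult span fields
    entity_type: str
    start: int
    end: int


def find_substring_positions(s, sep=" "):
    # Same logic as pteredactyl.support.find_substring_positions.
    return [m.span() for sub in s.split(sep)
            for m in re.finditer(re.escape(sub), s)]


def split_results_into_individual_words(text, results, text_separator=" "):
    out = []
    for result in results:
        span_text = text[result.start:result.end]
        for start, end in find_substring_positions(span_text, sep=text_separator):
            # Shift word-relative positions back into full-text coordinates.
            out.append(FakeResult(result.entity_type,
                                  start + result.start, end + result.start))
    return out


text = "Seen by Jane Smith today"
hit = FakeResult("PERSON", 8, 18)  # covers "Jane Smith"
words = split_results_into_individual_words(text, [hit])
assert [(w.start, w.end) for w in words] == [(8, 12), (13, 18)]
```

The single PERSON span over "Jane Smith" becomes two PERSON spans, one per word, each carrying the original entity type.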