Anonymising DataFrames

The anonymise_df() function is designed to anonymise sensitive information in a pandas DataFrame. This function can handle Named Entity Recognition (NER) and regex-based entity recognition to redact information such as names, addresses, and other personal identifiers. It can act on a single column or multiple, and redact inplace or by returning a new DataFrame/column

Parameters¶

df: The DataFrame containing the text to anonymise.
column: The name or list of names of the column(s) to anonymise.
analyser: An optional AnalyzerEngine instance. If not provided, a new analyser will be created.
entities: A list of entities to anonymise using the NER model.
regex_entities: A list of regex patterns or custom recognisers to identify and anonymise.
highlight: If True, anonymised parts are highlighted.
replacement_lists: A dictionary for hide-in-plain-sight redaction containing replacement values per entity type.
inplace: If True, the original DataFrame is modified; otherwise, a copy is returned.
col_inplace: If True, the original column is replaced with the redacted version.

Basic Example¶

Here's a basic example of anonymising a DataFrame column:

import pandas as pd
import pteredactyl as pt

analyser = pt.create_analyser()

# Create a DataFrame with sample data
df = pd.DataFrame({
    'text': [
        "John Doe's number is 07111 293892.",
        "Jane Smith's lives at 123 Shirley Road."
    ]
})

# Anonymise the 'text' column
anonymised_df = pt.anonymise_df(
    df=df,
    column='text',
    analyser=analyser,
    entities=['PERSON', 'PHONE_NUMBER', 'LOCATION'],
    inplace=False
)

# Print the result
print(anonymised_df)

                                      text                       text_redacted
0       John Doe's number is 07111 293892.  <PERSON>'s number is 07111 293892.
1  Jane Smith's lives at 123 Shirley Road.     <PERSON>'s lives at <LOCATION>.