Anonymising DataFrames

The anonymise_df() function anonymises sensitive information in a pandas DataFrame. It supports both Named Entity Recognition (NER) and regex-based entity recognition to redact names, addresses, and other personal identifiers. It can act on a single column or on several, and can redact either in place or by returning a new DataFrame or column.

Parameters

  • df: The DataFrame containing the text to anonymise.
  • column: The name or list of names of the column(s) to anonymise.
  • analyser: An optional AnalyzerEngine instance. If not provided, a new analyser will be created.
  • entities: A list of entities to anonymise using the NER model.
  • regex_entities: A list of regex patterns or custom recognisers to identify and anonymise.
  • highlight: If True, anonymised parts are highlighted.
  • replacement_lists: A dictionary for hide-in-plain-sight redaction containing replacement values per entity type.
  • inplace: If True, the original DataFrame is modified; otherwise, a copy is returned.
  • col_inplace: If True, the original column is replaced with the redacted version; otherwise the redacted text is written to a new column (text_redacted in the example below).
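
The interaction between inplace and col_inplace can be easier to see in a small sketch. The code below does not call pteredactyl at all: toy_redact and toy_anonymise_df are hypothetical stand-ins written in plain pandas, and the regex only masks simple "Firstname Surname" pairs. It illustrates the flag semantics as described in the parameter list, assuming the redacted column is named with a _redacted suffix as in the example output on this page. For simplicity the sketch returns the DataFrame in every case, which may differ from what the real function returns.

```python
import re
import pandas as pd

# Hypothetical stand-in for the real redaction step: masks capitalised
# first-name/surname pairs. This is NOT how pteredactyl detects entities.
def toy_redact(text: str) -> str:
    return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "<PERSON>", text)

def toy_anonymise_df(df, column, inplace=False, col_inplace=False):
    """Illustrative only: mimics the documented inplace/col_inplace flags."""
    columns = [column] if isinstance(column, str) else list(column)
    # inplace=False works on a copy, so the caller's DataFrame is untouched
    target = df if inplace else df.copy()
    for col in columns:
        # col_inplace=True overwrites the column; otherwise add "<col>_redacted"
        out_col = col if col_inplace else f"{col}_redacted"
        target[out_col] = target[col].map(toy_redact)
    return target

df = pd.DataFrame({"text": ["John Doe called.", "Jane Smith replied."]})

# Copy of the DataFrame with an extra 'text_redacted' column
new_df = toy_anonymise_df(df, "text")

# Copy of the DataFrame with 'text' itself overwritten
replaced = toy_anonymise_df(df, "text", col_inplace=True)
```

In both calls the original df keeps its single, unredacted text column, because inplace is left at False.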

Basic Example

Here's a basic example of anonymising a DataFrame column:

import pandas as pd
import pteredactyl as pt

analyser = pt.create_analyser()

# Create a DataFrame with sample data
df = pd.DataFrame({
    'text': [
        "John Doe's number is 07111 293892.",
        "Jane Smith's lives at 123 Shirley Road."
    ]
})

# Anonymise the 'text' column
anonymised_df = pt.anonymise_df(
    df=df,
    column='text',
    analyser=analyser,
    entities=['PERSON', 'PHONE_NUMBER', 'LOCATION'],
    inplace=False
)

# Print the result
print(anonymised_df)

Output:

                                      text                       text_redacted
0       John Doe's number is 07111 293892.  <PERSON>'s number is 07111 293892.
1  Jane Smith's lives at 123 Shirley Road.     <PERSON>'s lives at <LOCATION>.
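
In the output above, the phone number in row 0 is shown unredacted. Regex-based recognition is one way to catch fixed-format identifiers like this. The sketch below illustrates the general idea with Python's re module and pandas only; it does not use pteredactyl's regex_entities parameter, whose exact input format is not shown on this page. PHONE_PATTERN and redact_phone are hypothetical names, and the pattern is an assumption that covers only the "07111 293892" shape from the sample data.

```python
import re
import pandas as pd

# Hypothetical pattern for the sample's format: a leading 0, four digits,
# an optional space, then six digits. Real UK numbers vary far more.
PHONE_PATTERN = re.compile(r"\b0\d{4} ?\d{6}\b")

def redact_phone(text: str) -> str:
    # Replace every match with an entity-style placeholder
    return PHONE_PATTERN.sub("<PHONE_NUMBER>", text)

# Start from the redacted column shown in the output above
df = pd.DataFrame({
    "text_redacted": [
        "<PERSON>'s number is 07111 293892.",
        "<PERSON>'s lives at <LOCATION>.",
    ]
})

# Apply the regex pass on top of the existing redactions
df["text_redacted"] = df["text_redacted"].map(redact_phone)
```

After this pass, row 0 reads "<PERSON>'s number is <PHONE_NUMBER>." and row 1 is unchanged.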