Anonymising DataFrames
The anonymise_df() function anonymises sensitive information in a pandas DataFrame. It supports both Named Entity Recognition (NER) and regex-based entity recognition to redact information such as names, addresses, and other personal identifiers. It can act on one column or several, and can either redact in place or return a new DataFrame or column.
Parameters
df: The DataFrame containing the text to anonymise.
column: The name or list of names of the column(s) to anonymise.
analyser: An optional AnalyzerEngine instance. If not provided, a new analyser will be created.
entities: A list of entities to anonymise using the NER model.
regex_entities: A list of regex patterns or custom recognisers to identify and anonymise.
highlight: If True, anonymised parts are highlighted.
replacement_lists: A dictionary for hide-in-plain-sight redaction containing replacement values per entity type.
inplace: If True, the original DataFrame is modified; otherwise, a copy is returned.
col_inplace: If True, the original column is replaced with the redacted version.
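To make the idea behind replacement_lists more concrete, here is a minimal sketch of hide-in-plain-sight substitution using only the Python standard library. It is not pteredactyl's implementation: the helper hide_in_plain_sight, the sample values, and the assumption that the dictionary maps an entity type to a list of candidate strings are all illustrative.

```python
import random
import re

# Hypothetical replacement values keyed by entity type (assumed structure).
replacement_lists = {
    "PERSON": ["Alex Taylor", "Sam Patel"],
    "LOCATION": ["12 Example Street", "4 Sample Lane"],
}

# Matches placeholders such as <PERSON> or <PHONE_NUMBER>.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def hide_in_plain_sight(text, replacements, rng=None):
    """Swap each <ENTITY> placeholder for a random value from its list."""
    if rng is None:
        rng = random.Random()

    def substitute(match):
        options = replacements.get(match.group(1))
        # Leave the placeholder untouched when no list is supplied for it.
        return rng.choice(options) if options else match.group(0)

    return PLACEHOLDER.sub(substitute, text)


redacted = "<PERSON>'s lives at <LOCATION>."
print(hide_in_plain_sight(redacted, replacement_lists, random.Random(0)))
```

Passing a seeded random.Random makes the substitution repeatable, which can help when comparing runs. Entity types without a list keep their placeholder.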
Basic Example
Here's a basic example of anonymising a DataFrame column:
import pandas as pd
import pteredactyl as pt

# Create the analyser once so it can be reused across calls
analyser = pt.create_analyser()

# Create a DataFrame with sample data
df = pd.DataFrame({
    'text': [
        "John Doe's number is 07111 293892.",
        "Jane Smith's lives at 123 Shirley Road."
    ]
})

# Anonymise the 'text' column
anonymised_df = pt.anonymise_df(
    df=df,
    column='text',
    analyser=analyser,
    entities=['PERSON', 'PHONE_NUMBER', 'LOCATION'],
    inplace=False
)

# Print the result
print(anonymised_df)
                                      text                       text_redacted
0       John Doe's number is 07111 293892.  <PERSON>'s number is 07111 293892.
1  Jane Smith's lives at 123 Shirley Road.     <PERSON>'s lives at <LOCATION>.
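In the output above, the phone number in row 0 is left in place. A regex-based pattern is one way a value like this could be caught. The sketch below uses only Python's re module to show the idea. The pattern UK_MOBILE and the helper redact_phone_numbers are hypothetical and are not part of pteredactyl.

```python
import re

# Hypothetical pattern: UK mobile numbers written as 07xxx xxxxxx,
# with the space between the two groups optional.
UK_MOBILE = re.compile(r"\b07\d{3}\s?\d{6}\b")


def redact_phone_numbers(text, label="PHONE_NUMBER"):
    """Replace every match of UK_MOBILE with an <ENTITY>-style placeholder."""
    return UK_MOBILE.sub(f"<{label}>", text)


print(redact_phone_numbers("<PERSON>'s number is 07111 293892."))
# prints: <PERSON>'s number is <PHONE_NUMBER>.
```

The regex_entities parameter is documented as accepting regex patterns or custom recognisers. The exact form it expects is not shown in this section, so the sketch stays with standalone re calls and does not guess at that interface.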