Robust Utility-Preserving Text Anonymization Based on Large Language Models

1Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt, Germany
2Department of Electrical and Computer Engineering & Ingenuity Labs Research Institute, Queen’s University, Canada
Accepted to the ACL 2025 Main Conference

Abstract

Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenge of the re-identification ability of large language models (LLMs), which have shown advanced capability in memorizing detailed information and reasoning over dispersed pieces of information to draw conclusions. When defending against LLM-based re-identification, anonymization could jeopardize the utility of the resulting anonymized data in downstream tasks. In general, the interaction between anonymization and data utility requires a deeper understanding within the context of LLMs. In this paper, we propose a framework composed of three key LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component, which work collaboratively to perform anonymization. Extensive experiments demonstrate that the proposed model outperforms existing baselines, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. We provide detailed studies on these core modules. To support large-scale and real-time applications, we investigate the distillation of the anonymization capabilities into lightweight models.

Two Challenges Faced by the Text Anonymization Task

  • Privacy Safety: current text anonymization techniques are vulnerable to disclosure threats from increasingly sophisticated LLMs. Many recent studies have demonstrated that such models can re-identify private information, even from texts anonymized by advanced methods.
  • Utility for Downstream Tasks: existing studies only evaluate the utility of the anonymized text for downstream tasks after the anonymization process, so they cannot dynamically adapt the anonymization strategy to balance the privacy-utility tradeoff. Moreover, existing studies evaluate utility mainly from the perspective of text quality, without investigating the impact on downstream tasks.


Our Contributions

RUPTA: A multi-objective framework for text anonymization

This paper proposes to model the text anonymization task as a multi-objective optimization problem in which both privacy and utility are optimized during the anonymization process. Advanced LLMs are employed to act as the evaluators and the black-box optimizer. Specifically, a novel framework, RUPTA, is presented, in which a privacy evaluator, a utility evaluator, and an optimizer are integrated to effectively anonymize text, reducing the risk of re-identification while maintaining utility for downstream tasks.
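
To make the evaluator-optimizer interplay concrete, below is a minimal sketch of such an iterative loop, assuming an OpenAI-style chat client. The chat wrapper, the prompts, and the 0-10 scoring scales are illustrative assumptions for exposition, not the paper's exact prompts or implementation.

# Minimal sketch of a RUPTA-style loop: a privacy evaluator, a utility
# evaluator, and an LLM optimizer iteratively refine an anonymized text.
# Prompts and the `chat` helper are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str) -> str:
    """Single-turn call to a chat model (hypothetical thin wrapper)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def privacy_score(text: str, person: str) -> int:
    """Privacy evaluator: how re-identifiable the person still is (0-10)."""
    return int(chat(
        f"On a 0-10 scale, how confidently could '{person}' be re-identified "
        f"from this text? Answer with a single integer only.\n\n{text}"
    ))

def utility_score(text: str, task: str) -> int:
    """Utility evaluator: how useful the text remains for the task (0-10)."""
    return int(chat(
        f"On a 0-10 scale, how useful is this text for the task '{task}'? "
        f"Answer with a single integer only.\n\n{text}"
    ))

def anonymize(text: str, person: str, task: str, max_steps: int = 5) -> str:
    """Optimizer loop: rewrite to lower privacy risk while keeping utility."""
    current = text
    for _ in range(max_steps):
        p, u = privacy_score(current, person), utility_score(current, task)
        if p == 0:  # no residual re-identification risk detected
            break
        current = chat(
            f"Rewrite the text to reduce the re-identification risk of "
            f"'{person}' (currently {p}/10) while preserving its utility for "
            f"the task '{task}' (currently {u}/10). "
            f"Return only the rewritten text.\n\n{current}"
        )
    return current

Because both objectives are re-scored on every iteration, the optimizer can adapt its edits to the current privacy-utility tradeoff rather than committing to a fixed anonymization strategy up front.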


DB-bio Dataset

Previous anonymization studies have been conducted on celebrity data available on Wikipedia. Inspired by this, we sampled celebrity biographies from the DBpedia Classes dataset to build a new DB-bio dataset for our study and future research, using each celebrity's category label in DBpedia Classes as the occupation classification label. Anonymization results produced by GPT-4 under the RUPTA framework are released together with the dataset to facilitate future studies.
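
As a rough illustration of this construction, the sketch below samples biographies and keeps a category label as the occupation label. The file name, column names ("text", "l1", "l3"), the sample size, and the "Agent" filter are assumptions about a local export of DBpedia Classes, not the released dataset's schema.

# Illustrative sketch: assemble a DB-bio-style file from DBpedia Classes.
# Column names and the filter below are assumed, not the actual pipeline.
import pandas as pd

df = pd.read_csv("dbpedia_classes.csv")          # hypothetical local export
people = df[df["l1"] == "Agent"]                 # persons fall under Agent in DBpedia
sample = people.sample(n=1000, random_state=42)  # fixed seed for reproducibility

db_bio = sample.rename(columns={"text": "biography", "l3": "occupation"})
db_bio[["biography", "occupation"]].to_json(
    "db_bio.jsonl", orient="records", lines=True
)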

Poster

BibTeX

@article{yang2024robust,
  title={Robust Utility-Preserving Text Anonymization Based on Large Language Models},
  author={Yang, Tianyu and Zhu, Xiaodan and Gurevych, Iryna},
  journal={arXiv preprint arXiv:2407.11770},
  year={2024}
}