The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization
Abstract
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
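To make the two evaluation axes named in the abstract more concrete, here is a minimal sketch of how masking decisions could be scored against gold annotations. It is an illustrative simplification and not the metrics defined in the paper or the evaluation scripts in the TAB repository: the span format (character offsets), the function names, and the choice of character-level recall as a privacy proxy and character-level precision as a utility proxy are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# Hypothetical span format: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def covered_chars(spans: List[Span]) -> Set[int]:
    """Return the set of character positions covered by the given spans."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def masking_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Compute character-level (recall, precision) of predicted masks.

    Recall acts as a rough privacy proxy: how much of the text that should
    be masked was actually masked. Precision acts as a rough utility proxy:
    how much of the masked text really needed masking.
    """
    gold_chars = covered_chars(gold)
    pred_chars = covered_chars(predicted)
    overlap = len(gold_chars & pred_chars)
    recall = overlap / len(gold_chars) if gold_chars else 1.0
    precision = overlap / len(pred_chars) if pred_chars else 1.0
    return recall, precision


# Toy example with made-up offsets (not taken from the corpus).
gold_spans = [(0, 10), (20, 30)]
predicted_spans = [(0, 10), (25, 35)]
recall, precision = masking_scores(gold_spans, predicted_spans)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case the system masks the first span fully and half of the second, and also masks five characters that did not need masking, so both scores come out below 1. A real evaluation would more plausibly work at the level of annotated entities and identifier types, as the corpus annotations described above suggest.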
Similar resources
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
SecGraph: A Uniform and Open-source Evaluation System for Graph Data Anonymization and De-anonymization
In this paper, we analyze and systematize the state-of-the-art graph data privacy and utility techniques. Specifically, we propose and develop SecGraph (available at [1]), a uniform and open-source Secure Graph data sharing/publishing system. In SecGraph, we systematically study, implement, and evaluate 11 graph data anonymization algorithms, 19 data utility metrics, and 15 modern Structure-base...
Evaluation of Data Anonymization Tools
This survey was prompted by a request from one of the Siemens Business Units to look for data anonymization solutions currently available on the market. The customer plans to implement and deploy one within software development projects, to provide an offshore team with a fully functional environment that contains no critical data. Critical data are, for instance, Personal Identifiable Info...
C-safety: a framework for the anonymization of semantic trajectories
The increasing abundance of data about the trajectories of personal movement is opening new opportunities for analyzing and mining human mobility. However, new risks emerge, since it also opens new ways of intruding into personal privacy. Representing personal movements as sequences of places visited by a person (a semantic trajectory) poses great privacy threats. In this pap...
Journal
Journal title: Computational Linguistics
Year: 2022
ISSN: 1530-9312, 0891-2017
DOI: https://doi.org/10.1162/coli_a_00458