PIILO: an open-source system for personally identifiable information labeling and obfuscation

Impact factor: 1.6; JCR Q2 (Information Science & Library Science)
Langdon Holmes, Scott Crossley, Harshvardhan Sikka, Wesley Morris
Journal: Information and Learning Sciences
Published: 2023-10-18
DOI: 10.1108/ils-04-2023-0032 (https://doi.org/10.1108/ils-04-2023-0032)
Citations: 1

Abstract

Purpose: This study aims to report on an automatic deidentification system for labeling and obfuscating personally identifiable information (PII) in student-generated text.

Design/methodology/approach: The authors evaluate the performance of their deidentification system on two data sets of student-generated text. Each data set was human-annotated for PII. The authors evaluate using two approaches: per-token PII classification accuracy and a simulated reidentification attack design. In the reidentification attack, two reviewers attempted to recover student identities from the data after PII was obfuscated by the authors’ system. In both cases, results are reported in terms of recall and precision.

Findings: The authors’ deidentification system recalled 84% of student name tokens in their first data set (96% of full names). On the second data set, it achieved a recall of 74% for student name tokens (91% of full names) and 75% for all direct identifiers. After the second data set was obfuscated by the authors’ system, two reviewers attempted to recover the identities of students from the obfuscated data. They performed below chance, indicating that the obfuscated data presents a low identity disclosure risk.

Research limitations/implications: The two data sets used in this study are not representative of all forms of student-generated text, so further work is needed to evaluate performance on more data.

Practical implications: This paper presents an open-source and automatic deidentification system appropriate for student-generated text, with technical explanations and evaluations of performance.

Originality/value: Previous research on text deidentification has shown success in the medical domain. This paper builds on these approaches and applies them to text in the educational domain.
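The abstract reports per-token results as recall and precision. As a rough illustration of what those two figures measure, the sketch below computes them from parallel lists of per-token labels. It is not the authors' evaluation code: the function name `token_precision_recall`, the boolean label encoding (True means "this token is PII") and the toy data are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def token_precision_recall(
    gold: Sequence[bool], pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, recall) for per-token PII labels.

    gold[i] is True when a human annotator marked token i as PII;
    pred[i] is True when the system marked token i as PII.
    (Hypothetical encoding chosen for this sketch.)
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must label the same tokens")

    true_pos = sum(1 for g, p in zip(gold, pred) if g and p)
    false_pos = sum(1 for g, p in zip(gold, pred) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, pred) if g and not p)

    # Precision: share of tokens flagged by the system that really are PII.
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0

    # Recall: share of annotated PII tokens that the system flagged.
    annotated = true_pos + false_neg
    recall = true_pos / annotated if annotated else 0.0

    return precision, recall


# Toy data (invented): six tokens, three of them annotated as PII.
gold = [True, True, False, False, True, False]
pred = [True, False, False, True, True, False]
precision, recall = token_precision_recall(gold, pred)
```

For deidentification, recall is the figure tied most directly to disclosure risk, because every missed PII token stays in the released text; precision instead reflects how much non-PII text is obfuscated unnecessarily.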
Source journal
Information and Learning Sciences
Category: Information Science & Library Science
CiteScore: 9.50
Self-citation rate: 2.90%
Articles published: 30
About the journal: Information and Learning Sciences advances interdisciplinary research that explores scholarly intersections shared within two key fields: information science and the learning sciences / education sciences. The journal provides a publication venue for work that strengthens our scholarly understanding of human inquiry and learning phenomena, especially as they relate to the design and uses of information and e-learning systems innovations.