Prediction of monetary penalties for data protection cases in multiple languages

Aaron Ceross, Tingting Zhu
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, published 2021-06-21
DOI: 10.1145/3462757.3466097 (https://doi.org/10.1145/3462757.3466097)

Abstract

As the use of personal data becomes further entrenched in everyday societal interaction, the regulation of such data continues to grow in importance as an area of law. Data protection authorities, however, have limited resources with which to address an increasing number of investigations. Appropriate data-driven models, coupled with the automation of decision making, have the potential to help in such circumstances. In this paper, we evaluate machine learning models from the literature for natural language processing, such as Support Vector Machine (SVM), Random Forest, and Multinomial Naive Bayes (MNB) classifiers, in order to predict from a description of case facts whether a monetary penalty was levied. We tested these models on a novel data set collected from the data protection authority of Macao in three languages (Chinese, English, and Portuguese). Our experimental results show that the machine learning models provide the predictability needed to automate the evaluation of data protection cases. In particular, SVM performs consistently across the three languages, achieving an AUROC of 0.725, 0.762, and 0.748 for Chinese, English, and Portuguese, respectively. We further evaluated the interpretability of the results independently for each language and found that the salient texts identified are shared across the three languages.
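The abstract reports performance as AUROC (area under the receiver operating characteristic curve). As a rough illustration of what that number measures, the sketch below computes AUROC as the fraction of (penalty, no-penalty) case pairs in which the penalty case receives the higher classifier score, with ties counted as half. This is not the authors' code; the function name `auroc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def auroc(labels, scores):
    """Pairwise AUROC: probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie: half credit
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = monetary penalty levied, 0 = no penalty.
labels = [1, 0, 1, 1, 0, 0]
# Hypothetical classifier scores (higher = more likely a penalty).
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]

print(f"AUROC = {auroc(labels, scores):.3f}")
```

Under this reading, an AUROC of 0.5 corresponds to ranking no better than chance and 1.0 to a perfect ranking, which gives a scale for interpreting the reported values of 0.725 to 0.762.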