Prediction of monetary penalties for data protection cases in multiple languages

Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law
Pub Date: 2021-06-21
DOI: 10.1145/3462757.3466097

Aaron Ceross, Tingting Zhu



Abstract

As the use of personal data becomes further entrenched in societal interaction, the regulation of such data continues to grow as an important area of law. Data protection authorities, however, have limited resources to address an increasing number of investigations. Appropriate data-driven models, coupled with the automation of decision making, have the potential to help in such circumstances. In this paper, we evaluate machine learning models from the natural language processing literature, namely Support Vector Machine (SVM), Random Forest, and Multinomial Naive Bayes (MNB) classifiers, to predict whether a monetary penalty was levied based on a description of the case facts. We tested these models on a novel data set collected from the data protection authority of Macao in three languages (Chinese, English, and Portuguese). Our experimental results show that the machine learning models provide the predictive performance needed to automate the evaluation of data protection cases. In particular, SVM performs consistently across the three languages, achieving an AUROC of 0.725, 0.762, and 0.748 for Chinese, English, and Portuguese, respectively. We further evaluated the interpretability of the results independently for each language and found that the salient texts identified are shared across the three languages.
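The abstract does not describe the feature extraction, hyperparameters, or evaluation split, so the following is only a minimal sketch of the general technique it names: a Multinomial Naive Bayes text classifier scored with AUROC. Everything specific here is an assumption made for illustration: the whitespace tokenizer (which would not segment Chinese text), the Laplace smoothing value, the function names, and the toy case descriptions, which are invented and are not drawn from the Macao data set.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; Chinese would need a segmenter.
    return text.lower().split()


def train_mnb(docs, labels, alpha=1.0):
    """Fit a two-class multinomial Naive Bayes model with Laplace smoothing."""
    vocab = {tok for doc in docs for tok in tokenize(doc)}
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(tokenize(doc))
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for y in (0, 1):
        total = sum(counts[y].values()) + alpha * len(vocab)
        model["log_prior"][y] = math.log(n_docs[y] / len(labels))
        model["log_lik"][y] = {
            t: math.log((counts[y][t] + alpha) / total) for t in vocab
        }
    return model


def penalty_score(model, doc):
    """Log-odds that a penalty was levied; out-of-vocabulary tokens are ignored."""
    score = model["log_prior"][1] - model["log_prior"][0]
    for tok in tokenize(doc):
        if tok in model["vocab"]:
            score += model["log_lik"][1][tok] - model["log_lik"][0][tok]
    return score


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: label 1 = monetary penalty levied, 0 = no penalty.
train_docs = [
    "company disclosed customer data without consent",
    "operator published personal records without consent",
    "complaint withdrawn no violation found",
    "camera installed lawfully no violation found",
]
train_labels = [1, 1, 0, 0]
test_docs = ["records disclosed without consent", "no violation found"]
test_labels = [1, 0]

model = train_mnb(train_docs, train_labels)
test_scores = [penalty_score(model, d) for d in test_docs]
print("AUROC on toy test set:", auroc(test_scores, test_labels))
```

AUROC is computed here directly from its ranking definition, which makes it independent of any decision threshold. In this sketch, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of penalty and non-penalty cases.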


