Machine learning for readability of legislative sentences

Proceedings of the 15th International Conference on Artificial Intelligence and Law. Pub Date: 2015-06-08. DOI: 10.1145/2746090.2746095

Michael Curtotti, Eric C. McCreath, Tom Bruce, Sara S. Frug, W. Weibel, Nicolas Ceynowa


Citations: 16

Abstract

Machine learning for readability of legislative sentences

Improving the readability of legislation is an important and unresolved problem. Recently, researchers have begun to apply legal informatics to this problem. This paper applies machine learning to predict the readability of sentences from legislation and regulations. A corpus of sentences from the United States Code and the US Code of Federal Regulations was created. Each sentence was labelled for language difficulty using results from a large-scale crowdsourced study undertaken during 2014. The corpus was used as training and test data for machine learning. The corpus includes a version tagged using the Stanford parser's context-free grammar and a version tagged using the Stanford dependency grammar parser. The corpus is described and made available to interested researchers. We investigated whether extending the natural language features available as input to machine learning improves the accuracy of prediction. Among the features evaluated are those from the context-free and dependency grammars. Letter and word n-grams were also studied. We found that the addition of such features improves the accuracy of prediction on legal language. We also undertook a correlation study of natural language features and language difficulty, drawing insights as to the characteristics that may make legal language more difficult. These insights, and those from machine learning, enable us to describe a system for reducing legal language difficulty and to identify a number of suggested heuristics for improving the writing of legislation and regulations.
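
The abstract names letter and word n-grams among the features studied. As a rough illustration of what such features look like, the sketch below counts word bigrams and letter trigrams for one sentence. The function names, the whitespace tokenisation, the choice of n, and the `w2:`/`c3:` key prefixes are all assumptions made for this example; it is not the authors' feature extraction, and it omits the parser-derived grammar features entirely.

```python
from collections import Counter


def word_ngrams(sentence, n):
    """Count word n-grams using simple lowercase whitespace tokenisation."""
    tokens = sentence.lower().split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def letter_ngrams(sentence, n):
    """Count character n-grams (spaces and punctuation included)."""
    text = sentence.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def extract_features(sentence):
    """Merge word-bigram and letter-trigram counts into one sparse feature map.

    The prefixes keep the two feature families apart (hypothetical naming).
    """
    features = Counter()
    for gram, count in word_ngrams(sentence, 2).items():
        features["w2:" + gram] = count
    for gram, count in letter_ngrams(sentence, 3).items():
        features["c3:" + gram] = count
    return features


# Illustrative sentence written for this example, not taken from the corpus.
example = "The Secretary shall prescribe such regulations as may be necessary."
features = extract_features(example)
print(len(features), "distinct n-gram features")
```

A feature map like this could be passed to any classifier that accepts sparse counts, alongside a difficulty label for each sentence. Which learner and which n-gram sizes work best is not stated in the abstract.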
