Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
{"title":"空白还是幻觉?凝视机器生成的法律分析,进行精细文本评估","authors":"Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme","doi":"arxiv-2409.09947","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) show promise as a writing aid for professionals\nperforming legal analyses. However, LLMs can often hallucinate in this setting,\nin ways difficult to recognize by non-professionals and existing text\nevaluation metrics. In this work, we pose the question: when can\nmachine-generated legal analysis be evaluated as acceptable? We introduce the\nneutral notion of gaps, as opposed to hallucinations in a strict erroneous\nsense, to refer to the difference between human-written and machine-generated\nlegal analysis. Gaps do not always equate to invalid generation. Working with\nlegal experts, we consider the CLERC generation task proposed in Hou et al.\n(2024b), leading to a taxonomy, a fine-grained detector for predicting gap\ncategories, and an annotated dataset for automatic evaluation. Our best\ndetector achieves 67% F1 score and 80% precision on the test set. Employing\nthis detector as an automated metric on legal analysis generated by SOTA LLMs,\nwe find around 80% contain hallucinations of different kinds.","PeriodicalId":501112,"journal":{"name":"arXiv - CS - Computers and Society","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations\",\"authors\":\"Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme\",\"doi\":\"arxiv-2409.09947\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) show promise as a writing aid for professionals\\nperforming legal analyses. However, LLMs can often hallucinate in this setting,\\nin ways difficult to recognize by non-professionals and existing text\\nevaluation metrics. In this work, we pose the question: when can\\nmachine-generated legal analysis be evaluated as acceptable? We introduce the\\nneutral notion of gaps, as opposed to hallucinations in a strict erroneous\\nsense, to refer to the difference between human-written and machine-generated\\nlegal analysis. Gaps do not always equate to invalid generation. Working with\\nlegal experts, we consider the CLERC generation task proposed in Hou et al.\\n(2024b), leading to a taxonomy, a fine-grained detector for predicting gap\\ncategories, and an annotated dataset for automatic evaluation. Our best\\ndetector achieves 67% F1 score and 80% precision on the test set. 
Employing\\nthis detector as an automated metric on legal analysis generated by SOTA LLMs,\\nwe find around 80% contain hallucinations of different kinds.\",\"PeriodicalId\":501112,\"journal\":{\"name\":\"arXiv - CS - Computers and Society\",\"volume\":\"21 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computers and Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09947\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computers and Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09947","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations
Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs often hallucinate in this setting, in ways that are difficult for non-professionals and existing text-evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in the strict sense of erroneous content, to refer to differences between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), producing a taxonomy of gap categories, a fine-grained detector for predicting them, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% of outputs contain hallucinations of various kinds.
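To make the evaluation setup concrete, the sketch below shows one way a gap-category detector could be wrapped as an automated document-level metric: classify gap spans in each generation against its human-written reference, then report the fraction of generations containing at least one hallucination-type gap. All names here (GapSpan, classify_gaps, the category labels) are illustrative assumptions, not the paper's actual detector interface or taxonomy.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative category labels only; the paper defines its own gap taxonomy.
HALLUCINATION_CATEGORIES = {"misgrounded_claim", "fabricated_citation"}
BENIGN_CATEGORIES = {"omitted_detail", "stylistic_difference"}


@dataclass
class GapSpan:
    text: str       # span of the machine-generated analysis
    category: str   # predicted gap category


def classify_gaps(generated: str, reference: str) -> List[GapSpan]:
    """Hypothetical stand-in for a fine-grained gap detector.

    A real implementation would be a trained classifier comparing the
    generation against the reference analysis; this stub returns no spans
    so the aggregation logic below runs end to end.
    """
    return []


def hallucination_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of generations containing at least one hallucination-type gap."""
    if not pairs:
        return 0.0
    flagged = 0
    for generated, reference in pairs:
        spans = classify_gaps(generated, reference)
        if any(span.category in HALLUCINATION_CATEGORIES for span in spans):
            flagged += 1
    return flagged / len(pairs)


if __name__ == "__main__":
    demo_pairs = [("Machine-generated legal analysis ...", "Human-written analysis ...")]
    print(f"Hallucination rate: {hallucination_rate(demo_pairs):.0%}")
```

Under this framing, the reported figure that roughly 80% of SOTA LLM generations contain hallucinations corresponds to a high value of such a document-level rate, while benign gap categories are excluded from the count.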