Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
{"title":"空白还是幻觉?凝视机器生成的法律分析,进行精细文本评估","authors":"Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme","doi":"arxiv-2409.09947","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) show promise as a writing aid for professionals\nperforming legal analyses. However, LLMs can often hallucinate in this setting,\nin ways difficult to recognize by non-professionals and existing text\nevaluation metrics. In this work, we pose the question: when can\nmachine-generated legal analysis be evaluated as acceptable? We introduce the\nneutral notion of gaps, as opposed to hallucinations in a strict erroneous\nsense, to refer to the difference between human-written and machine-generated\nlegal analysis. Gaps do not always equate to invalid generation. Working with\nlegal experts, we consider the CLERC generation task proposed in Hou et al.\n(2024b), leading to a taxonomy, a fine-grained detector for predicting gap\ncategories, and an annotated dataset for automatic evaluation. Our best\ndetector achieves 67% F1 score and 80% precision on the test set. Employing\nthis detector as an automated metric on legal analysis generated by SOTA LLMs,\nwe find around 80% contain hallucinations of different kinds.","PeriodicalId":501112,"journal":{"name":"arXiv - CS - Computers and Society","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations\",\"authors\":\"Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme\",\"doi\":\"arxiv-2409.09947\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) show promise as a writing aid for professionals\\nperforming legal analyses. However, LLMs can often hallucinate in this setting,\\nin ways difficult to recognize by non-professionals and existing text\\nevaluation metrics. In this work, we pose the question: when can\\nmachine-generated legal analysis be evaluated as acceptable? We introduce the\\nneutral notion of gaps, as opposed to hallucinations in a strict erroneous\\nsense, to refer to the difference between human-written and machine-generated\\nlegal analysis. Gaps do not always equate to invalid generation. Working with\\nlegal experts, we consider the CLERC generation task proposed in Hou et al.\\n(2024b), leading to a taxonomy, a fine-grained detector for predicting gap\\ncategories, and an annotated dataset for automatic evaluation. Our best\\ndetector achieves 67% F1 score and 80% precision on the test set. 
Employing\\nthis detector as an automated metric on legal analysis generated by SOTA LLMs,\\nwe find around 80% contain hallucinations of different kinds.\",\"PeriodicalId\":501112,\"journal\":{\"name\":\"arXiv - CS - Computers and Society\",\"volume\":\"21 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computers and Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09947\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computers and Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09947","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in a strict erroneous sense, to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.
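To make the reported numbers concrete, below is a minimal sketch (not the authors' code) of how a fine-grained gap detector's sentence-level predictions could be scored with micro-averaged precision and F1, and then applied as a document-level automated metric that flags generations containing hallucination-type gaps. The category names and example data are hypothetical placeholders, not the taxonomy from the paper.

```python
# Hypothetical sketch: scoring a gap-category detector and using it as an
# automated hallucination metric. Category names and data are illustrative only.

# Gap categories that count as hallucinations in the strict erroneous sense
# (hypothetical labels; the paper's taxonomy may differ).
HALLUCINATION_CATEGORIES = {"misgrounded_citation", "fabricated_holding"}


def precision_recall_f1(gold, predicted, positive_labels):
    """Micro-averaged precision/recall/F1 over the positive (gap) labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if p in positive_labels and p == g)
    fp = sum(1 for g, p in zip(gold, predicted) if p in positive_labels and p != g)
    fn = sum(1 for g, p in zip(gold, predicted) if g in positive_labels and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def contains_hallucination(sentence_labels):
    """Document-level metric: does any sentence carry a hallucination-type gap?"""
    return any(label in HALLUCINATION_CATEGORIES for label in sentence_labels)


if __name__ == "__main__":
    # Toy gold vs. predicted gap categories for four sentences of one generation.
    gold = ["none", "misgrounded_citation", "none", "fabricated_holding"]
    pred = ["none", "misgrounded_citation", "fabricated_holding", "fabricated_holding"]
    positive = {"misgrounded_citation", "fabricated_holding"}

    p, r, f1 = precision_recall_f1(gold, pred, positive)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
    print("document flagged as containing hallucinations:", contains_hallucination(pred))
```

Under this kind of setup, the paper's headline figure (around 80% of SOTA LLM generations containing hallucinations) would correspond to the fraction of documents for which the document-level flag fires.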