Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
{"title":"空白还是幻觉?凝视机器生成的法律分析,进行精细文本评估","authors":"Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme","doi":"arxiv-2409.09947","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) show promise as a writing aid for professionals\nperforming legal analyses. However, LLMs can often hallucinate in this setting,\nin ways difficult to recognize by non-professionals and existing text\nevaluation metrics. In this work, we pose the question: when can\nmachine-generated legal analysis be evaluated as acceptable? We introduce the\nneutral notion of gaps, as opposed to hallucinations in a strict erroneous\nsense, to refer to the difference between human-written and machine-generated\nlegal analysis. Gaps do not always equate to invalid generation. Working with\nlegal experts, we consider the CLERC generation task proposed in Hou et al.\n(2024b), leading to a taxonomy, a fine-grained detector for predicting gap\ncategories, and an annotated dataset for automatic evaluation. Our best\ndetector achieves 67% F1 score and 80% precision on the test set. Employing\nthis detector as an automated metric on legal analysis generated by SOTA LLMs,\nwe find around 80% contain hallucinations of different kinds.","PeriodicalId":501112,"journal":{"name":"arXiv - CS - Computers and Society","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations\",\"authors\":\"Abe Bohan Hou, William Jurayj, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme\",\"doi\":\"arxiv-2409.09947\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) show promise as a writing aid for professionals\\nperforming legal analyses. However, LLMs can often hallucinate in this setting,\\nin ways difficult to recognize by non-professionals and existing text\\nevaluation metrics. In this work, we pose the question: when can\\nmachine-generated legal analysis be evaluated as acceptable? We introduce the\\nneutral notion of gaps, as opposed to hallucinations in a strict erroneous\\nsense, to refer to the difference between human-written and machine-generated\\nlegal analysis. Gaps do not always equate to invalid generation. Working with\\nlegal experts, we consider the CLERC generation task proposed in Hou et al.\\n(2024b), leading to a taxonomy, a fine-grained detector for predicting gap\\ncategories, and an annotated dataset for automatic evaluation. Our best\\ndetector achieves 67% F1 score and 80% precision on the test set. 
Employing\\nthis detector as an automated metric on legal analysis generated by SOTA LLMs,\\nwe find around 80% contain hallucinations of different kinds.\",\"PeriodicalId\":501112,\"journal\":{\"name\":\"arXiv - CS - Computers and Society\",\"volume\":\"21 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computers and Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09947\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computers and Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09947","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations
Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs often hallucinate in this setting, in ways that are difficult for non-professionals and existing text-evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps, as opposed to hallucinations in the strict sense of erroneous content, to refer to differences between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), producing a taxonomy of gap categories, a fine-grained detector for predicting them, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% of outputs contain hallucinations of various kinds.
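To make the evaluation setup concrete, the sketch below shows one way a gap-category detector could be wrapped as an automated document-level metric: classify gap spans in each generation against its human-written reference, then report the fraction of generations containing at least one hallucination-type gap. All names here (GapSpan, classify_gaps, the category labels) are illustrative assumptions, not the paper's actual detector interface or taxonomy.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative category labels only; the paper defines its own gap taxonomy.
HALLUCINATION_CATEGORIES = {"misgrounded_claim", "fabricated_citation"}
BENIGN_CATEGORIES = {"omitted_detail", "stylistic_difference"}


@dataclass
class GapSpan:
    text: str       # span of the machine-generated analysis
    category: str   # predicted gap category


def classify_gaps(generated: str, reference: str) -> List[GapSpan]:
    """Hypothetical stand-in for a fine-grained gap detector.

    A real implementation would be a trained classifier comparing the
    generation against the reference analysis; this stub returns no spans
    so the aggregation logic below runs end to end.
    """
    return []


def hallucination_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of generations containing at least one hallucination-type gap."""
    if not pairs:
        return 0.0
    flagged = 0
    for generated, reference in pairs:
        spans = classify_gaps(generated, reference)
        if any(span.category in HALLUCINATION_CATEGORIES for span in spans):
            flagged += 1
    return flagged / len(pairs)


if __name__ == "__main__":
    demo_pairs = [("Machine-generated legal analysis ...", "Human-written analysis ...")]
    print(f"Hallucination rate: {hallucination_rate(demo_pairs):.0%}")
```

Under this framing, the reported figure that roughly 80% of SOTA LLM generations contain hallucinations corresponds to a high value of such a document-level rate, while benign gap categories are excluded from the count.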