Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

Impact Factor: 1.2 · CAS Tier 2 (Sociology) · JCR Q1 (Law)
Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho
{"title":"Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools","authors":"Varun Magesh,&nbsp;Faiz Surani,&nbsp;Matthew Dahl,&nbsp;Mirac Suzgun,&nbsp;Christopher D. Manning,&nbsp;Daniel E. Ho","doi":"10.1111/jels.12413","DOIUrl":null,"url":null,"abstract":"<p>Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. However, the large language models used in these tools are prone to “hallucinate,” or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as “eliminating” or “avoid[ing]” hallucinations, or guaranteeing “hallucination-free” legal citations. Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.</p>","PeriodicalId":47187,"journal":{"name":"Journal of Empirical Legal Studies","volume":"22 2","pages":"216-242"},"PeriodicalIF":1.2000,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jels.12413","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Empirical Legal Studies","FirstCategoryId":"90","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jels.12413","RegionNum":2,"RegionCategory":"社会学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"LAW","Score":null,"Total":0}
Citations: 0

Abstract

Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. However, the large language models used in these tools are prone to “hallucinate,” or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as “eliminating” or “avoid[ing]” hallucinations, or guaranteeing “hallucination-free” legal citations. Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.
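For context, the sketch below illustrates the general retrieval-augmented generation (RAG) pattern the abstract refers to: retrieve passages from a trusted corpus, then have the model answer from those passages rather than from its parametric memory alone. It is a minimal, hypothetical illustration, not the pipeline of any evaluated product; the corpus, the toy lexical retriever, and the `generate` callable are placeholders introduced here for exposition.

```python
# Minimal RAG sketch (hypothetical; not the pipeline of any evaluated product).
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # e.g., a case citation
    text: str    # excerpt from the legal corpus


def retrieve(query: str, corpus: list[Passage], k: int = 3) -> list[Passage]:
    """Toy lexical retriever: rank passages by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q_words & set(p.text.lower().split())),
                  reverse=True)[:k]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Ground the model in retrieved sources instead of memory alone."""
    context = "\n\n".join(f"[{p.source}]\n{p.text}" for p in passages)
    return (
        "Answer the legal question using only the sources below, citing them.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


def answer(query: str, corpus: list[Passage], generate) -> str:
    """RAG constrains the model's inputs, but the model can still misread,
    over-generalize, or miscite the passages it is given, which is why
    retrieval reduces, rather than eliminates, hallucination."""
    return generate(build_prompt(query, retrieve(query, corpus)))


# Usage with a stub generator (a real system would call an LLM here):
corpus = [Passage("Doe v. Roe, 123 F.3d 456", "The court held that ...")]
print(answer("What did Doe v. Roe hold?", corpus, generate=lambda p: "[model output]"))
```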


Source journal
CiteScore: 2.30
Self-citation rate: 11.80%
Articles published: 34