Discriminative meets generative: Automated information retrieval from unstructured corporate documents via (large) language models

IF 6 · Region 3 (Management) · JCR Q2 (BUSINESS)
Sergej Levich , Lucas Knust
International Journal of Accounting Information Systems, Volume 56, Article 100750. Published 2025-07-12. DOI: 10.1016/j.accinf.2025.100750
https://www.sciencedirect.com/science/article/pii/S1467089525000260

Abstract

This paper demonstrates the potential of (large) language models to transform accounting practice and research by automating information retrieval from unstructured sources. While information retrieval in accounting has hitherto been predominantly addressed through handcrafted, rule-based systems, we have devised an approach based entirely on machine learning methods from the field of natural language processing (NLP). Specifically, we consider and contrast two modeling paradigms in NLP: discriminative modeling and generative modeling. In the former case, we fine-tune a language model pre-trained for multilingual, visually rich document understanding. In the latter, we apply prompting techniques to utilize a large language model (LLM) without prior training. We illustrate our approach by retrieving group ownership data from annual reports in Portable Document Format (PDF). We successfully retrieve group ownership information for both modeling paradigms, achieving a strong overall accuracy with a low percentage of false negatives. Furthermore, we consider the impact of different reporting languages and accounting standards.
Source journal
CiteScore: 9.00
Self-citation rate: 6.50%
Articles published: 23
About the journal: The International Journal of Accounting Information Systems will publish thoughtful, well-developed articles that examine the rapidly evolving relationship between accounting and information technology. Articles may range from empirical to analytical, from practice-based to the development of new techniques, but must be related to problems facing the integration of accounting and information technology. The journal will address (but will not limit itself to) the following specific issues: control and auditability of information systems; management of information technology; artificial intelligence research in accounting; development issues in accounting and information systems; human factors issues related to information technology; development of theories related to information technology; methodological issues in information technology research; information systems validation; human–computer interaction research in accounting information systems. The journal welcomes and encourages articles from both practitioners and academicians.