Discriminative meets generative: Automated information retrieval from unstructured corporate documents via (large) language models
Authors: Sergej Levich, Lucas Knust
Journal: International Journal of Accounting Information Systems, Volume 56, Article 100750
Publication date: 2025-07-12
Publication type: Journal Article
DOI: 10.1016/j.accinf.2025.100750
URL: https://www.sciencedirect.com/science/article/pii/S1467089525000260
Citations: 0
Abstract
This paper demonstrates the potential of (large) language models to transform accounting practice and research by automating information retrieval from unstructured sources. While information retrieval in accounting has hitherto been predominantly addressed through handcrafted, rule-based systems, we have devised an approach based entirely on machine learning methods from the field of natural language processing (NLP). Specifically, we consider and contrast two modeling paradigms in NLP: discriminative modeling and generative modeling. In the former case, we fine-tune a language model pre-trained for multilingual, visually rich document understanding. In the latter, we apply prompting techniques to utilize a large language model (LLM) without prior training. We illustrate our approach by retrieving group ownership data from annual reports in Portable Document Format (PDF). We successfully retrieve group ownership information for both modeling paradigms, achieving a strong overall accuracy with a low percentage of false negatives. Furthermore, we consider the impact of different reporting languages and accounting standards.
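The abstract reports results in terms of overall accuracy and the share of false negatives. As a minimal sketch of how such figures can be computed for a binary retrieval decision (for example, whether a page contains group ownership information), the Python snippet below may help. The function name, the boolean label encoding, and the choice to measure false negatives against all truly positive items are assumptions made here for illustration; the abstract does not state the exact definitions used in the paper.

```python
from typing import Sequence, Tuple


def accuracy_and_fn_rate(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float]:
    """Return (accuracy, false-negative rate) for binary labels.

    Hypothetical helper, not taken from the paper.
    The false-negative rate is computed as FN / (FN + TP), i.e. the
    fraction of truly relevant items that the model missed.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one labelled item is required")

    # Count predictions that agree with the ground truth.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # Count truly relevant items and those the model failed to flag.
    positives = sum(1 for t in y_true if t)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    accuracy = correct / len(y_true)
    # With no positive items there is nothing to miss; report 0.0.
    fn_rate = false_negatives / positives if positives else 0.0
    return accuracy, fn_rate


# Toy labels, invented purely for illustration (not data from the paper).
truth = [True, True, True, False, False]
predicted = [True, True, False, False, True]
acc, fnr = accuracy_and_fn_rate(truth, predicted)
print(f"accuracy={acc:.2f}, false-negative rate={fnr:.2f}")
```

A low false-negative rate matters in this setting because a missed ownership entry is silently absent from the retrieved data, whereas a false positive can still be filtered out in a later review step.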
About the journal:
The International Journal of Accounting Information Systems will publish thoughtful, well-developed articles that examine the rapidly evolving relationship between accounting and information technology. Articles may range from empirical to analytical, from practice-based to the development of new techniques, but must be related to problems facing the integration of accounting and information technology. The journal will address (but will not limit itself to) the following specific issues: control and auditability of information systems; management of information technology; artificial intelligence research in accounting; development issues in accounting and information systems; human factors issues related to information technology; development of theories related to information technology; methodological issues in information technology research; information systems validation; human–computer interaction research in accounting information systems. The journal welcomes and encourages articles from both practitioners and academicians.