{"title":"HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models","authors":"Mathieu Chartier , Nabil Dakkoune , Guillaume Bourgeois , Stéphane Jean","doi":"10.1016/j.datak.2024.102383","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Models (LLMs) such as ChatGPT or Bard have significantly transformed information retrieval and captured the public’s attention with their ability to generate customized responses across various topics. In this paper, we analyze the capabilities of different LLMs to generate responses related to historical facts in French. Our objective is to evaluate their reliability, comprehensiveness, and relevance for direct usability or extraction. To accomplish this, we propose a benchmark consisting of numerous historical questions covering various types, themes, and difficulty levels. Our evaluation of responses provided by 14 selected LLMs reveals several limitations in both content and structure. In addition to an overall insufficient precision rate, we observe uneven treatment of the French language, along with issues related to verbosity and inconsistency in the responses generated by LLMs.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102383"},"PeriodicalIF":2.7000,"publicationDate":"2024-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X24001071","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Large Language Models (LLMs) such as ChatGPT or Bard have significantly transformed information retrieval and captured the public’s attention with their ability to generate customized responses across a wide range of topics. In this paper, we analyze the capabilities of different LLMs to generate responses about historical facts in French. Our objective is to evaluate their reliability, comprehensiveness, and relevance, whether the responses are used directly or mined for information. To this end, we propose a benchmark of historical questions covering a variety of types, themes, and difficulty levels. Our evaluation of the responses provided by 14 selected LLMs reveals several limitations in both content and structure. Beyond an overall insufficient precision rate, we observe uneven handling of the French language, along with verbosity and inconsistency in the responses the LLMs generate.
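The paper's actual benchmark and grading pipeline are not reproduced on this page; purely as an illustration of the kind of precision measurement the abstract describes, the sketch below scores stubbed model responses against reference answers for questions tagged with a type, theme, and difficulty level. All names (Question, ask_llm, is_correct) and the sample question are hypothetical, not taken from HiBenchLLM, and the substring check stands in for whatever grading scheme the authors use.

from dataclasses import dataclass

@dataclass
class Question:
    text: str        # the historical question, posed in French
    qtype: str       # hypothetical type tag, e.g. "date" or "person"
    theme: str       # hypothetical theme tag, e.g. "Revolution francaise"
    difficulty: int  # hypothetical difficulty level, e.g. 1 (easy) to 3 (hard)
    answer: str      # reference answer used for scoring

def ask_llm(question: str) -> str:
    """Placeholder for a model call; swap in any real LLM client here."""
    return "Reponse generee par le modele."

def is_correct(response: str, reference: str) -> bool:
    """Naive substring match; a real grader would need fuzzier matching."""
    return reference.lower() in response.lower()

def evaluate(benchmark: list[Question]) -> float:
    """Return the fraction of benchmark questions answered correctly."""
    correct = sum(is_correct(ask_llm(q.text), q.answer) for q in benchmark)
    return correct / len(benchmark)

if __name__ == "__main__":
    benchmark = [
        Question("En quelle annee a eu lieu la prise de la Bastille ?",
                 "date", "Revolution francaise", 1, "1789"),
    ]
    print(f"Precision: {evaluate(benchmark):.2%}")

Running the same loop once per model would yield a per-model precision rate, which is the headline metric the abstract refers to; the type, theme, and difficulty tags additionally allow the scores to be broken down by question category.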
Journal description:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between its two related fields of interest: data engineering and knowledge engineering. DKE reaches a worldwide audience of researchers, designers, managers, and users. The major aim of the journal is to identify, investigate, and analyze the underlying principles in the design and effective use of data and knowledge systems.