Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods

Esmaeil Narimissa (Australian Taxation Office), David Raithel (Australian Taxation Office)
arXiv - CS - Information Retrieval · Published 2024-09-13 · DOI: arxiv-2409.08479 · Cited by: 0

Abstract

The performance of Retrieval-Augmented Generation (RAG) systems in information retrieval is significantly influenced by the characteristics of the documents being processed. In this study, the structured nature of textbooks, the conciseness of articles, and the narrative complexity of novels are shown to require distinct retrieval strategies. A comparative evaluation of multiple document-splitting methods reveals that the Recursive Character Splitter outperforms the Token-based Splitter in preserving contextual integrity. A novel evaluation technique is introduced, utilizing an open-source model to generate a comprehensive dataset of question-and-answer pairs, simulating realistic retrieval scenarios to enhance testing efficiency and metric reliability. The evaluation employs weighted scoring metrics, including SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system's accuracy and relevance. This approach establishes a refined standard for evaluating the precision of RAG systems, with future research focusing on optimizing chunk and overlap sizes to improve retrieval accuracy and efficiency.
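To make the two ideas in the abstract concrete, here is a minimal sketch of (a) recursive character splitting, which tries coarse separators before finer ones so chunks tend to end at natural boundaries, and (b) a weighted combination of answer-similarity scores. This is an illustrative reimplementation, not the authors' code: the separator hierarchy, chunk size, and weight values are assumptions, and only the standard-library `SequenceMatcher` metric is computed (BLEU, METEOR, and BERT Score would require external packages).

```python
from difflib import SequenceMatcher


def recursive_character_split(text, chunk_size=200,
                              separators=("\n\n", "\n", ". ", " ")):
    """Split text by trying coarse separators first, recursing with finer ones."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: fall back to a hard cut at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        piece = part + sep
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)   # flush the accumulated chunk
            current = ""
        if len(piece) > chunk_size:
            # Part is still too large: recurse with the finer separators.
            chunks.extend(recursive_character_split(part, chunk_size, rest))
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


def weighted_answer_score(reference, candidate, weights=None):
    """Combine per-metric similarity scores into one weighted score in [0, 1].

    Only SequenceMatcher is computed here; the paper's other metrics
    (BLEU, METEOR, BERT Score) would be added to `scores` the same way.
    The default weight is illustrative, not taken from the paper.
    """
    weights = weights or {"sequence_matcher": 1.0}
    scores = {"sequence_matcher": SequenceMatcher(None, reference, candidate).ratio()}
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total
```

The splitter preserves contextual integrity by preferring paragraph and sentence boundaries over arbitrary character offsets, which is the property the abstract credits for the Recursive Character Splitter's advantage over token-based splitting.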