Advanced Retrieval Augmented Generation: Multilingual Semantic Retrieval across Document Types by Finetuning Transformer Based Language Models and OCR Integration

Ismail Oubah, Dr. Selçuk Şener
Journal: Engineering and Technology Journal
DOI: 10.47191/etj/v9i07.09 (https://doi.org/10.47191/etj/v9i07.09)
Published: 2024-07-25 (Journal Article)

Abstract

This study presents an advanced system for multilingual semantic retrieval across diverse document types, integrating Retrieval-Augmented Generation (RAG) with transformer-based language models and Optical Character Recognition (OCR). To build a robust multilingual Question-Answering (QA) system, we constructed a custom dataset from XQuAD, FQuAD, and MLQA, augmented with synthetic data generated by OpenAI's GPT-3.5 Turbo to ensure comprehensive, context-rich answers. PaddleOCR provided high-quality text extraction for French, English, and Spanish, though Arabic remained challenging. The Multilingual E5 embedding model was fine-tuned with Multiple Negatives Ranking Loss to optimize retrieval of context-question pairs. Two models handled text generation: MT5, fine-tuned for stronger contextual understanding and longer answer generation and suitable for CPU-only deployments, and Llama 3 8B-Instruct, optimized for advanced language generation in professional and industry applications with ample GPU resources. Individual components were evaluated with F1, Exact Match (EM), and BLEU scores, and the end-to-end system with the RAGAS framework. MT5 showed promising results and excelled in context precision and relevance, while the quantized Llama 3 led in answer correctness and similarity. This work highlights the effectiveness of our RAG system in multilingual semantic retrieval, providing a robust solution for real-world QA applications and laying the groundwork for future advancements in multilingual document processing.
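The retriever fine-tuning described above uses Multiple Negatives Ranking Loss: within a batch of (question, context) pairs, the i-th context is the positive for the i-th question, and every other context in the batch acts as an in-batch negative. Below is a minimal NumPy sketch of that loss; the function name, the scale value, and the toy embeddings are illustrative assumptions, not the paper's actual training code.

```python
import numpy as np

def mnr_loss(q_emb, c_emb, scale=20.0):
    """Multiple Negatives Ranking Loss over a batch of (question, context)
    embedding pairs: the i-th context is the positive for the i-th question,
    and every other context in the batch serves as a negative."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    scores = scale * (q @ c.T)                     # scaled cosine-similarity matrix
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal

# Toy check: perfectly aligned pairs give near-zero loss, misaligned pairs do not.
aligned = mnr_loss(np.eye(4), np.eye(4))
misaligned = mnr_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In practice the question and context embeddings would come from the fine-tuned Multilingual E5 encoder (which expects its `query:` / `passage:` input prefixes), and libraries such as sentence-transformers ship this objective ready-made as `MultipleNegativesRankingLoss`.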