{"title":"高级检索增强生成:通过微调基于转换器的语言模型和 OCR 集成实现跨文档类型的多语言语义检索","authors":"Ismail Oubah, Dr. Selçuk Şener","doi":"10.47191/etj/v9i07.09","DOIUrl":null,"url":null,"abstract":"This study presents an advanced system for multilingual semantic retrieval of diverse document types, integrating Retrieval-Augmented Generation (RAG) with transformer-based language models and Optical Character Recognition (OCR) technologies. Addressing the challenge of creating a robust multilingual Question-Answering (QA) system, we developed a custom dataset derived from XQuAD, FQuAD, and MLQA, enhanced by synthetic data generated using OpenAI's GPT-3.5 Turbo. This ensured comprehensive, context-rich answers. The inclusion of Paddle OCR facilitated high-quality text extraction in French, English, and Spanish, though Arabic presented some difficulties. The Multilingual E5 embedding model was fine-tuned using the Multiple Negatives Ranking Loss approach, optimizing retrieval of context-question pairs. We utilized two models for text generation: MT5, fine-tuned for enhanced contextual understanding and longer answer generation, suitable for CPU-friendly uses, and LLAMA 3 8b-instruct, optimized for advanced language generation, ideal for professional and industry applications requiring extensive GPU resources. Evaluation employed metrics such as F1, EM, and BLEU scores for individual components, and the RAGAS framework for the entire system. MT5 showed promising results and excelled in context precision and relevancy, while the quantized version of LLAMA 3 led in answer correctness and similarity. 
This work highlights the effectiveness of our RAG system in multilingual semantic retrieval, providing a robust solution for real-world QA applications and laying the groundwork for future advancements in multilingual document processing.","PeriodicalId":507832,"journal":{"name":"Engineering and Technology Journal","volume":"45 12","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Advanced Retrieval Augmented Generation: Multilingual Semantic Retrieval across Document Types by Finetuning Transformer Based Language Models and OCR Integration\",\"authors\":\"Ismail Oubah, Dr. Selçuk Şener\",\"doi\":\"10.47191/etj/v9i07.09\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study presents an advanced system for multilingual semantic retrieval of diverse document types, integrating Retrieval-Augmented Generation (RAG) with transformer-based language models and Optical Character Recognition (OCR) technologies. Addressing the challenge of creating a robust multilingual Question-Answering (QA) system, we developed a custom dataset derived from XQuAD, FQuAD, and MLQA, enhanced by synthetic data generated using OpenAI's GPT-3.5 Turbo. This ensured comprehensive, context-rich answers. The inclusion of Paddle OCR facilitated high-quality text extraction in French, English, and Spanish, though Arabic presented some difficulties. The Multilingual E5 embedding model was fine-tuned using the Multiple Negatives Ranking Loss approach, optimizing retrieval of context-question pairs. We utilized two models for text generation: MT5, fine-tuned for enhanced contextual understanding and longer answer generation, suitable for CPU-friendly uses, and LLAMA 3 8b-instruct, optimized for advanced language generation, ideal for professional and industry applications requiring extensive GPU resources. 
Evaluation employed metrics such as F1, EM, and BLEU scores for individual components, and the RAGAS framework for the entire system. MT5 showed promising results and excelled in context precision and relevancy, while the quantized version of LLAMA 3 led in answer correctness and similarity. This work highlights the effectiveness of our RAG system in multilingual semantic retrieval, providing a robust solution for real-world QA applications and laying the groundwork for future advancements in multilingual document processing.\",\"PeriodicalId\":507832,\"journal\":{\"name\":\"Engineering and Technology Journal\",\"volume\":\"45 12\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Engineering and Technology Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.47191/etj/v9i07.09\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering and Technology Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.47191/etj/v9i07.09","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Advanced Retrieval Augmented Generation: Multilingual Semantic Retrieval across Document Types by Finetuning Transformer Based Language Models and OCR Integration
This study presents an advanced system for multilingual semantic retrieval across diverse document types, integrating Retrieval-Augmented Generation (RAG) with transformer-based language models and Optical Character Recognition (OCR) technologies. To address the challenge of building a robust multilingual Question-Answering (QA) system, we developed a custom dataset derived from XQuAD, FQuAD, and MLQA, augmented with synthetic data generated using OpenAI's GPT-3.5 Turbo to ensure comprehensive, context-rich answers. PaddleOCR enabled high-quality text extraction in French, English, and Spanish, though Arabic presented some difficulties. The Multilingual E5 embedding model was fine-tuned with the Multiple Negatives Ranking Loss objective to optimize retrieval of context-question pairs. We used two models for text generation: mT5, fine-tuned for stronger contextual understanding and longer answer generation and suitable for CPU-constrained deployments, and Llama 3 8B-Instruct, optimized for advanced language generation and suited to professional and industry applications with ample GPU resources. Evaluation employed F1, EM, and BLEU scores for individual components, and the RAGAS framework for the end-to-end system. mT5 showed promising results and excelled in context precision and relevance, while the quantized Llama 3 led in answer correctness and answer similarity. This work demonstrates the effectiveness of our RAG system for multilingual semantic retrieval, providing a robust solution for real-world QA applications and laying the groundwork for future advances in multilingual document processing.
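The abstract states that the embedding model was fine-tuned with Multiple Negatives Ranking Loss. The paper does not reproduce its training code here, so the following is only a minimal NumPy sketch of that objective as popularized by the sentence-transformers library: each (question, context) pair in a batch treats the other in-batch contexts as negatives, and the loss is cross-entropy over the scaled cosine-similarity matrix with the matching pair on the diagonal. The scale factor of 20 is an assumption matching the sentence-transformers default, not a value from the paper.

```python
import numpy as np

def multiple_negatives_ranking_loss(questions, contexts, scale=20.0):
    """In-batch-negatives loss: row i of `questions` should match row i of `contexts`."""
    # L2-normalize so the dot product is cosine similarity
    q = questions / np.linalg.norm(questions, axis=1, keepdims=True)
    c = contexts / np.linalg.norm(contexts, axis=1, keepdims=True)
    sim = scale * (q @ c.T)                      # (batch, batch) similarity logits
    # log-softmax over each row, numerically stabilized
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the correct context for question i is context i (the diagonal)
    return -np.mean(np.diag(log_probs))
```

With perfectly aligned pairs the loss approaches zero, while mismatched pairs are penalized heavily; in actual fine-tuning this loss would be minimized over the model's parameters rather than computed on fixed vectors.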
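The retrieval stage described above — matching a question embedding against stored context embeddings — can be illustrated with a short, self-contained sketch. The helper name `retrieve_top_k` and the pure cosine-similarity ranking are illustrative assumptions; the paper's system embeds texts with the fine-tuned Multilingual E5 model, which is not invoked here.

```python
import numpy as np

def retrieve_top_k(question_vec, context_vecs, k=3):
    """Return indices and scores of the k contexts most similar to the question."""
    q = question_vec / np.linalg.norm(question_vec)
    c = context_vecs / np.linalg.norm(context_vecs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity of each context to the question
    top = np.argsort(-scores)[:k]      # highest-scoring contexts first
    return top, scores[top]
```

In a full RAG pipeline the retrieved contexts would then be concatenated into the prompt of the generator model (mT5 or Llama 3 in this work) to produce the final answer.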