NLLB-E5: A Scalable Multilingual Retrieval Model
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
arXiv - CS - Information Retrieval, 2024-09-09. https://doi.org/arxiv-2409.05401

Despite significant progress in multilingual information retrieval, the lack
of models that effectively support multiple languages, particularly low-resource ones such as the Indic languages, remains a critical challenge. This paper presents NLLB-E5: A Scalable Multilingual Retrieval Model. NLLB-E5 leverages the built-in multilingual capabilities of the NLLB translation encoder and proposes distilling knowledge from the multilingual retriever E5, yielding a zero-shot retrieval model that handles multiple languages, including all major Indic languages, without requiring multilingual training data. We
evaluate the model on a comprehensive suite of existing benchmarks, including
Hindi-BEIR, highlighting its robust performance across diverse languages and
tasks. Our findings uncover task- and domain-specific challenges, providing valuable insights into retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and
language-agnostic text retrieval model, advancing the field of multilingual
information access and promoting digital inclusivity for millions of users
globally.
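
The following is a minimal, illustrative sketch of the distillation idea the abstract describes: aligning a student (the NLLB encoder) with a teacher retriever (E5) on English text, so the student's multilingual encoder inherits the teacher's retrieval-oriented embedding space. The checkpoint names, mean pooling, linear projection, and MSE loss below are assumptions chosen for illustration, not the authors' exact recipe.

```python
# Sketch: distill a retrieval embedding space from E5 (teacher)
# into the NLLB encoder (student) using only English text.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher_name = "intfloat/e5-base"                   # assumed E5 checkpoint
student_name = "facebook/nllb-200-distilled-600M"   # assumed NLLB checkpoint

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()

student_tok = AutoTokenizer.from_pretrained(student_name)
# NLLB is a seq2seq translation model; only its encoder is used here.
student = AutoModel.from_pretrained(student_name).get_encoder()

def embed(model, tok, texts):
    """Mean-pooled sentence embeddings over non-padding tokens."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

texts = ["query: how do rainbows form?"]  # E5 expects "query:"/"passage:" prefixes
with torch.no_grad():
    t_emb = embed(teacher, teacher_tok, texts)  # teacher targets, frozen

s_emb = embed(student, student_tok, texts)
# Hidden sizes differ (NLLB-600M: 1024, e5-base: 768), so project the
# student into the teacher's space; this projection is an assumption.
proj = torch.nn.Linear(s_emb.size(-1), t_emb.size(-1))
loss = F.mse_loss(proj(s_emb), t_emb)  # distillation loss on parallel English text
loss.backward()
```

Because NLLB's encoder already maps many languages, including the Indic languages, into a shared representation, aligning it with E5 on English alone is what makes the zero-shot multilingual retrieval claim plausible without multilingual training data.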