NLLB-E5: A Scalable Multilingual Retrieval Model
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
arXiv - CS - Information Retrieval · arXiv:2409.05401 · 2024-09-09
Abstract
Despite significant progress in multilingual information retrieval, the lack of models that effectively support many languages, particularly low-resource languages such as the Indic languages, remains a critical challenge. This paper presents NLLB-E5, a scalable multilingual retrieval model. NLLB-E5 leverages the built-in multilingual capabilities of the NLLB encoder, originally trained for translation tasks, and distills knowledge from the multilingual retriever E5 to enable zero-shot retrieval across many languages, including all major Indic languages, without requiring multilingual training data. We evaluate the model on a comprehensive suite of existing benchmarks, including Hindi-BEIR, highlighting its robust performance across diverse languages and tasks. Our findings uncover task- and domain-specific challenges, providing valuable insights into retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and language-agnostic text retrieval model, advancing multilingual information access and promoting digital inclusivity for millions of users globally.
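
The abstract describes distilling the multilingual retriever E5 (teacher) into the NLLB encoder (student) so that a single encoder maps text from many languages into a shared retrieval space. The paper's exact recipe is not given here, so the following is a minimal PyTorch sketch under stated assumptions: the checkpoints `intfloat/multilingual-e5-base` and `facebook/nllb-200-distilled-600M`, mean pooling, a linear projection to match embedding widths, and an MSE alignment loss are all illustrative choices, not the authors' method.

```python
# Minimal sketch of cross-model embedding distillation (teacher: E5,
# student: NLLB encoder). Checkpoint names, pooling, the projection, and
# the MSE objective are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher_name = "intfloat/multilingual-e5-base"      # assumed teacher checkpoint
student_name = "facebook/nllb-200-distilled-600M"   # assumed student checkpoint

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
student_tok = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()   # frozen teacher
student = AutoModel.from_pretrained(student_name).encoder  # NLLB encoder only

# NLLB's hidden size (1024) differs from E5-base's (768); a linear projection
# maps student embeddings into the teacher's space.
proj = torch.nn.Linear(student.config.d_model, teacher.config.hidden_size)
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5
)

def mean_pool(hidden, mask):
    # Average token states over non-padding positions.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def distill_step(texts):
    # Teacher embeddings are the targets; gradients flow only through
    # the student and the projection.
    with torch.no_grad():
        # E5 expects a "query: " or "passage: " prefix; "query: " used here.
        t_in = teacher_tok(["query: " + t for t in texts], padding=True,
                           truncation=True, return_tensors="pt")
        t_emb = mean_pool(teacher(**t_in).last_hidden_state,
                          t_in["attention_mask"])
    s_in = student_tok(texts, padding=True, truncation=True,
                       return_tensors="pt")
    s_emb = proj(mean_pool(student(**s_in).last_hidden_state,
                           s_in["attention_mask"]))
    loss = F.mse_loss(s_emb, t_emb)  # pull student space onto teacher space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one distillation step on a toy batch.
print(distill_step(["How do monsoons form?", "Delhi is the capital of India."]))
```

Under this reading, inference needs only the student encoder and projection: queries and documents in any NLLB-supported language are embedded into the teacher-aligned space and compared by similarity search, which is consistent with the abstract's claim of zero-shot retrieval without multilingual training data.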