NLLB-E5: A Scalable Multilingual Retrieval Model
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
arXiv - CS - Information Retrieval · arXiv:2409.05401 · 2024-09-09
Abstract
Despite significant progress in multilingual information retrieval, the lack of models that effectively support many languages, particularly low-resource languages such as the Indic languages, remains a critical challenge. This paper presents NLLB-E5, a scalable multilingual retrieval model. NLLB-E5 leverages the built-in multilingual capabilities of the NLLB encoder, originally trained for translation tasks, and distills knowledge from the multilingual retriever E5 to enable zero-shot retrieval across many languages, including all major Indic languages, without requiring multilingual training data. We evaluate the model on a comprehensive suite of existing benchmarks, including Hindi-BEIR, highlighting its robust performance across diverse languages and tasks. Our findings uncover task- and domain-specific challenges, providing valuable insights into retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and language-agnostic text retrieval model, advancing multilingual information access and promoting digital inclusivity for millions of users globally.
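
The abstract describes distilling the multilingual retriever E5 (teacher) into the NLLB encoder (student) so that a single encoder maps text from many languages into a shared retrieval space. The paper's exact recipe is not given here, so the following is a minimal PyTorch sketch under stated assumptions: the checkpoints `intfloat/multilingual-e5-base` and `facebook/nllb-200-distilled-600M`, mean pooling, a linear projection to match embedding widths, and an MSE alignment loss are all illustrative choices, not the authors' method.

```python
# Minimal sketch of cross-model embedding distillation (teacher: E5,
# student: NLLB encoder). Checkpoint names, pooling, the projection, and
# the MSE objective are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher_name = "intfloat/multilingual-e5-base"      # assumed teacher checkpoint
student_name = "facebook/nllb-200-distilled-600M"   # assumed student checkpoint

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
student_tok = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()   # frozen teacher
student = AutoModel.from_pretrained(student_name).encoder  # NLLB encoder only

# NLLB's hidden size (1024) differs from E5-base's (768); a linear projection
# maps student embeddings into the teacher's space.
proj = torch.nn.Linear(student.config.d_model, teacher.config.hidden_size)
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5
)

def mean_pool(hidden, mask):
    # Average token states over non-padding positions.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def distill_step(texts):
    # Teacher embeddings are the targets; gradients flow only through
    # the student and the projection.
    with torch.no_grad():
        # E5 expects a "query: " or "passage: " prefix; "query: " used here.
        t_in = teacher_tok(["query: " + t for t in texts], padding=True,
                           truncation=True, return_tensors="pt")
        t_emb = mean_pool(teacher(**t_in).last_hidden_state,
                          t_in["attention_mask"])
    s_in = student_tok(texts, padding=True, truncation=True,
                       return_tensors="pt")
    s_emb = proj(mean_pool(student(**s_in).last_hidden_state,
                           s_in["attention_mask"]))
    loss = F.mse_loss(s_emb, t_emb)  # pull student space onto teacher space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one distillation step on a toy batch.
print(distill_step(["How do monsoons form?", "Delhi is the capital of India."]))
```

Under this reading, inference needs only the student encoder and projection: queries and documents in any NLLB-supported language are embedded into the teacher-aligned space and compared by similarity search, which is consistent with the abstract's claim of zero-shot retrieval without multilingual training data.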