NLLB-E5: A Scalable Multilingual Retrieval Model
Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar, Jaydeep Sen
arXiv - CS - Information Retrieval, 2024-09-09. https://doi.org/arxiv-2409.05401

Despite significant progress in multilingual information retrieval, the lack
of models that effectively support multiple languages, particularly low-resource ones such as the Indic languages, remains a critical challenge. This paper presents NLLB-E5: A Scalable Multilingual Retrieval Model. NLLB-E5 leverages the built-in multilingual capabilities of the NLLB translation encoder and proposes distilling knowledge from the multilingual retriever E5, yielding a zero-shot retrieval model that handles multiple languages, including all major Indic languages, without requiring multilingual training data. We
evaluate the model on a comprehensive suite of existing benchmarks, including
Hindi-BEIR, highlighting its robust performance across diverse languages and
tasks. Our findings uncover task- and domain-specific challenges, providing valuable insights into retrieval performance, especially for low-resource languages. NLLB-E5 addresses the urgent need for an inclusive, scalable, and
language-agnostic text retrieval model, advancing the field of multilingual
information access and promoting digital inclusivity for millions of users
globally.
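
The following is a minimal, illustrative sketch of the distillation idea the abstract describes: aligning a student (the NLLB encoder) with a teacher retriever (E5) on English text, so the student's multilingual encoder inherits the teacher's retrieval-oriented embedding space. The checkpoint names, mean pooling, linear projection, and MSE loss below are assumptions chosen for illustration, not the authors' exact recipe.

```python
# Sketch: distill a retrieval embedding space from E5 (teacher)
# into the NLLB encoder (student) using only English text.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

teacher_name = "intfloat/e5-base"                   # assumed E5 checkpoint
student_name = "facebook/nllb-200-distilled-600M"   # assumed NLLB checkpoint

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModel.from_pretrained(teacher_name).eval()

student_tok = AutoTokenizer.from_pretrained(student_name)
# NLLB is a seq2seq translation model; only its encoder is used here.
student = AutoModel.from_pretrained(student_name).get_encoder()

def embed(model, tok, texts):
    """Mean-pooled sentence embeddings over non-padding tokens."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

texts = ["query: how do rainbows form?"]  # E5 expects "query:"/"passage:" prefixes
with torch.no_grad():
    t_emb = embed(teacher, teacher_tok, texts)  # teacher targets, frozen

s_emb = embed(student, student_tok, texts)
# Hidden sizes differ (NLLB-600M: 1024, e5-base: 768), so project the
# student into the teacher's space; this projection is an assumption.
proj = torch.nn.Linear(s_emb.size(-1), t_emb.size(-1))
loss = F.mse_loss(proj(s_emb), t_emb)  # distillation loss on parallel English text
loss.backward()
```

Because NLLB's encoder already maps many languages, including the Indic languages, into a shared representation, aligning it with E5 on English alone is what makes the zero-shot multilingual retrieval claim plausible without multilingual training data.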