MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Impact factor 4.2; Zone 1, Computer Science; JCR Q2, Computer Science, Artificial Intelligence

Transactions of the Association for Computational Linguistics. Pub Date: 2023-09-01. DOI: 10.1162/tacl_a_00595

Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin

Transactions of the Association for Computational Linguistics, volume 11, pages 1114-1131 (2023).


Abstract

MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close and typologically distant, drawn from 10 language families and 13 sub-families and associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.
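To make concrete how per-query relevance judgments of this kind are typically used, here is a minimal sketch that scores one ranked list against binary judgments with nDCG@k. The metric is an assumption chosen for illustration, since the abstract does not name one, and the query IDs, document IDs, and the function name `ndcg_at_k` are hypothetical toy values, not part of MIRACL.

```python
import math


def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order a retriever returned them.
    judgments: {doc_id: relevance}, here binary (1 relevant, 0 not).
    Documents without a judgment are treated as non-relevant (an assumption).
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    # Ideal ranking: all judged documents sorted by relevance, best first.
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy, made-up data: one query with three judged passages.
qrels = {"q1": {"d1": 1, "d2": 0, "d3": 1}}
# Hypothetical retriever output for the same query, in the same language.
run = {"q1": ["d3", "d4", "d1", "d2"]}

scores = {qid: ndcg_at_k(run[qid], qrels[qid]) for qid in run}
mean_ndcg = sum(scores.values()) / len(scores)
print(f"mean nDCG@10 over {len(scores)} query: {mean_ndcg:.4f}")
```

In a monolingual setting, a loop like this would be run separately for each language, with queries and judged passages drawn from that language only.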


Source journal

Transactions of the Association for Computational Linguistics

CiteScore: 32.60

Self-citation rate: 4.60%

Review time: 8 weeks

About the journal: The highly regarded quarterly journal Computational Linguistics has a companion journal called Transactions of the Association for Computational Linguistics. This open access journal publishes articles in all areas of natural language processing and is an important resource for academic and industry computational linguists, natural language processing experts, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, as well as linguists and philosophers. The journal disseminates work of vital relevance to these professionals on an annual basis.