The JOKER Corpus: English-French Parallel Data for Multilingual Wordplay Recognition

Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. Published: 2023-07-18. DOI: 10.1145/3539618.3591885

Liana Ermakova, Anne-Gwenn Bosser, A. Jatowt, Tristan Miller



Abstract

Despite recent advances in information retrieval and natural language processing, rhetorical devices that exploit ambiguity or subvert linguistic rules remain a challenge for such systems. However, corpus-based analysis of wordplay has been a perennial topic of scholarship in the humanities, including literary criticism, language education, and translation studies. The immense data-gathering effort required for these studies points to the need for specialized text retrieval and classification technology, and consequently for appropriate test collections. In this paper, we introduce and analyze a new dataset for research and applications in the retrieval and processing of wordplay. Developed for the JOKER track at CLEF 2023, our annotated corpus extends and improves upon past English wordplay detection datasets in several ways. First, we introduce hundreds of additional positive examples of wordplay; second, we provide French translations for the examples; and third, we provide negative examples of non-wordplay with characteristics closely matching those of the positive examples. This last feature helps ensure that AI models learn to distinguish wordplay from non-wordplay effectively, rather than simply separating texts that differ in length, style, or vocabulary. Our test collection thus represents a step towards wordplay-aware multilingual information retrieval.
