Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish

Impact factor 2.3; Zone 3 (Computer Science); JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, Otto Tarkka
Journal: Natural Language Engineering
Published: 2023-03-16
DOI: 10.1017/s1351324923000086 (https://doi.org/10.1017/s1351324923000086)

Abstract

In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on a methodology that allows the extraction of challenging examples of paraphrase pairs in their natural textual context, leading to a dataset potentially more suitable for evaluating the models’ ability to represent meaning, especially in document context, when compared with those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first large-scale, fully manually annotated corpus of paraphrases in Finnish. The corpus contains 104,645 manually labeled paraphrase pairs, of which 98% are verified to be true paraphrases, either universally or within their present context. In order to control the diversity of the paraphrase pairs and avoid certain biases easily introduced in automatic candidate extraction, the paraphrases are manually collected from different paraphrase-rich text sources. This allows us to create a challenging dataset including longer and more lexically diverse paraphrases than can be expected from those collected through heuristics. In addition to quality, this also allows us to keep the original document context for each pair, making it possible to study paraphrasing in context. To our knowledge, this is the first paraphrase corpus that provides the original document context for the annotated pairs.

We also study several paraphrase models trained and evaluated on the new data. Our initial paraphrase classification experiments indicate that the dataset is challenging when classifying with the detailed labeling scheme used in the corpus annotation, with accuracy lagging substantially behind human performance. However, when evaluating the models on a large-scale paraphrase retrieval task with almost 400M candidate sentences, the results are highly encouraging: 29–53% of the pairs are ranked in the top 10, depending on the paraphrase type.
The Turku Paraphrase Corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus as well as through the popular HuggingFace datasets under the CC-BY-SA license.
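To make the retrieval figure concrete, the following is a minimal sketch of how a "ranked in the top 10" rate could be computed. It is an illustration only: the abstract does not state which similarity measure or models were used, so cosine similarity over toy vectors is an assumption, and the function names `rank_of_target` and `top_k_rate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_target(query_vec, candidate_vecs, target_index):
    """Return the 1-based rank of the true paraphrase among all candidates."""
    target_score = cosine(query_vec, candidate_vecs[target_index])
    # Count candidates that score strictly higher than the true paraphrase.
    better = sum(
        1
        for i, cand in enumerate(candidate_vecs)
        if i != target_index and cosine(query_vec, cand) > target_score
    )
    return better + 1


def top_k_rate(queries, candidate_vecs, k=10):
    """Fraction of queries whose true paraphrase is ranked within the top k.

    queries: list of (query_vector, index_of_true_paraphrase) pairs.
    """
    hits = sum(
        1 for vec, target in queries
        if rank_of_target(vec, candidate_vecs, target) <= k
    )
    return hits / len(queries)


# Toy data standing in for sentence embeddings (hypothetical values).
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
queries = [([1.0, 0.05], 0), ([0.1, 1.0], 3)]

print(top_k_rate(queries, candidates, k=1))
print(top_k_rate(queries, candidates, k=2))
```

With hundreds of millions of candidates, an exhaustive scan like this would be impractical, and an approximate nearest-neighbor index would typically take its place; the metric itself stays the same.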
Source journal: Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
CiteScore: 5.90
Self-citation rate: 12.00%
Articles published: 60
Review time: >12 weeks
Journal description: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.