{"title":"PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models","authors":"Andrianos Michail, Simon Clematide, Juri Opitz","doi":"arxiv-2409.12060","DOIUrl":null,"url":null,"abstract":"The task of determining whether two texts are paraphrases has long been a\nchallenge in NLP. However, the prevailing notion of paraphrase is often quite\nsimplistic, offering only a limited view of the vast spectrum of paraphrase\nphenomena. Indeed, we find that evaluating models in a paraphrase dataset can\nleave uncertainty about their true semantic understanding. To alleviate this,\nwe release paraphrasus, a benchmark designed for multi-dimensional assessment\nof paraphrase detection models and finer model selection. We find that\nparaphrase detection models under a fine-grained evaluation lens exhibit\ntrade-offs that cannot be captured through a single classification dataset.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"118 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The task of determining whether two texts are paraphrases has long been a
challenge in NLP. However, the prevailing notion of paraphrase is often quite
simplistic, offering only a limited view of the vast spectrum of paraphrase
phenomena. Indeed, we find that evaluating models in a paraphrase dataset can
leave uncertainty about their true semantic understanding. To alleviate this,
we release paraphrasus, a benchmark designed for multi-dimensional assessment
of paraphrase detection models and finer model selection. We find that
paraphrase detection models under a fine-grained evaluation lens exhibit
trade-offs that cannot be captured through a single classification dataset.