Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems: Latest Publications

Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.9
Reda Yacouby, Dustin Axman
Abstract: In pursuit of the perfect supervised NLP classifier, razor-thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient as they fail to give a complete picture of the model's behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1) respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model's confidence score assignments have an impact on the system's behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments, are self-evident from the metrics' definitions; the remaining two, generalization and functional consistency, are demonstrated empirically.
Citations: 101

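The abstract does not spell out the exact formulas, so the following is a minimal sketch of one plausible formulation in which model confidence scores replace the hard counts in the usual precision/recall/F1 ratios. The function name and the confidence-mass aggregation are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def confidence_prf(y_true, p_pos):
    """Sketch of confidence-weighted precision/recall/F1 for binary classification.

    y_true : array of 0/1 gold labels
    p_pos  : array of model confidences (probabilities) for the positive class

    Instead of counting hard TP/FP/FN at a threshold, each example contributes
    its confidence mass, so small changes in confidence move the scores.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)

    c_tp = np.sum(p_pos * y_true)            # confidence given to true positives
    c_fp = np.sum(p_pos * (1.0 - y_true))    # confidence wrongly given to negatives
    c_fn = np.sum((1.0 - p_pos) * y_true)    # confidence withheld from positives

    c_precision = c_tp / (c_tp + c_fp) if (c_tp + c_fp) > 0 else 0.0
    c_recall = c_tp / (c_tp + c_fn) if (c_tp + c_fn) > 0 else 0.0
    c_f1 = (2 * c_precision * c_recall / (c_precision + c_recall)
            if (c_precision + c_recall) > 0 else 0.0)
    return c_precision, c_recall, c_f1

# Two models with identical thresholded predictions but different confidences
# now receive different scores, which is the sensitivity the abstract highlights.
print(confidence_prf([1, 1, 0, 0], [0.9, 0.8, 0.1, 0.2]))
print(confidence_prf([1, 1, 0, 0], [0.6, 0.55, 0.45, 0.4]))
```
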
On the Evaluation of Machine Translation n-best Lists
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.7
Jacob Bremerman, Huda Khayrallah, Douglas W. Oard, Matt Post
Abstract: The standard machine translation evaluation framework measures the single-best output of machine translation systems. There are, however, many situations where n-best lists are needed, yet there is no established way of evaluating them. This paper establishes a framework for addressing n-best evaluation by outlining three different questions one could consider when determining how one would define a ‘good’ n-best list and proposing evaluation measures for each question. The first and principal contribution is an evaluation measure that characterizes the translation quality of an entire n-best list by asking whether many of the valid translations are placed near the top of the list. The second is a measure that uses gold translations with preference annotations to ask to what degree systems can produce ranked lists in preference order. The third is a measure that rewards partial matches, evaluating the closeness of the many items in an n-best list to a set of many valid references. These three perspectives make clear that having access to many references can be useful when n-best evaluation is the goal.
Citations: 0

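To make the first measure's intuition concrete (valid translations should sit near the top of the list), here is a hedged sketch of a simple coverage-at-k score over n-best lists; the normalization and the function name are illustrative assumptions, not the measure defined in the paper.

```python
def nbest_coverage_at_k(nbest_lists, reference_sets, k=5):
    """Fraction of valid (reference) translations that appear in the top-k
    of each n-best list, averaged over segments.

    nbest_lists    : list of n-best lists, one per source segment
    reference_sets : list of sets of acceptable reference translations
    """
    scores = []
    for nbest, refs in zip(nbest_lists, reference_sets):
        top_k = set(nbest[:k])
        covered = len(refs & top_k)
        scores.append(covered / min(len(refs), k))
    return sum(scores) / len(scores) if scores else 0.0

nbest = [["the cat sat", "a cat sat", "cat sitting", "the dog sat"]]
refs = [{"the cat sat", "a cat sat"}]
print(nbest_coverage_at_k(nbest, refs, k=3))  # 1.0: both references are in the top 3
```
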
Item Response Theory for Efficient Human Evaluation of Chatbots
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.3
João Sedoc, L. Ungar
Abstract: Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.
Citations: 25

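For readers unfamiliar with IRT, the sketch below fits a basic paired-comparison model in the IRT spirit, with a per-system ability and a per-prompt discrimination, by gradient ascent on the log-likelihood. It is a schematic illustration under assumed variable names and a simplified parameterization, not the authors' estimation procedure.

```python
import numpy as np

def fit_paired_irt(wins, n_systems, n_prompts, lr=0.1, steps=2000):
    """Fit a simple paired-comparison model in the spirit of IRT.

    wins : list of (i, j, p) tuples meaning "system i beat system j on prompt p".
    Model: P(i beats j on prompt p) = sigmoid(a_p * (theta_i - theta_j)),
    where theta is system ability and a_p a per-prompt discrimination.
    """
    theta = np.zeros(n_systems)
    a = np.ones(n_prompts)
    for _ in range(steps):
        g_theta = np.zeros_like(theta)
        g_a = np.zeros_like(a)
        for i, j, p in wins:
            diff = theta[i] - theta[j]
            prob = 1.0 / (1.0 + np.exp(-a[p] * diff))
            # gradient of the log-likelihood of the observed win
            g_theta[i] += a[p] * (1 - prob)
            g_theta[j] -= a[p] * (1 - prob)
            g_a[p] += diff * (1 - prob)
        theta += lr * g_theta / len(wins)
        a += lr * g_a / len(wins)
        a = np.maximum(a, 1e-3)   # keep discriminations positive
        theta -= theta.mean()     # fix the location of the ability scale
    return theta, a

# Toy data: system 0 usually beats system 1; prompt 1 is more discriminative
wins = [(0, 1, 0), (0, 1, 1), (0, 1, 1), (1, 0, 0)]
theta, a = fit_paired_irt(wins, n_systems=2, n_prompts=2)
print(theta, a)
```
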
On Aligning OpenIE Extractions with Knowledge Bases: A Case Study
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.14
Kiril Gashteovski, Rainer Gemulla, Bhushan Kotnis, S. Hertling, Christian Meilicke
Abstract: Open information extraction (OIE) is the task of extracting relations and their corresponding arguments from a natural language text in an unsupervised manner. Outputs of such systems are used for downstream tasks such as question answering and automatic knowledge base (KB) construction. Many of these downstream tasks rely on aligning OIE triples with reference KBs. Such alignments are usually evaluated w.r.t. a specific downstream task and, to date, no direct manual evaluation of such alignments has been performed. In this paper, we directly evaluate how OIE triples from the OPIEC corpus are related to the DBpedia KB w.r.t. information content. First, we investigate OPIEC triples and DBpedia facts having the same arguments by comparing the information on the OIE surface relation with the KB relation. Second, we evaluate the expressibility of general OPIEC triples in DBpedia. We investigate whether—and, if so, how—a given OIE triple can be mapped to a single KB fact. We found that such mappings are not always possible because the information in the OIE triples tends to be more specific. Our evaluation suggests, however, that a significant part of OIE triples can be expressed by means of KB formulas instead of individual facts.
Citations: 24

One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.12
Jesper Brink Andersen, Mikkel Bak Bertelsen, Mikkel Hørby Schou, Manuel R. Ciosici, I. Assent
Abstract: Word embeddings are an active topic in the NLP research community. State-of-the-art neural models achieve high performance on downstream tasks, albeit at the cost of computationally expensive training. Cost-aware solutions require cheaper models that still achieve good performance. We present several reproduction studies of intrinsic evaluation tasks that evaluate non-contextual word representations in multiple languages. Furthermore, we present 50-8-8, a new data set for the outlier identification task, which avoids limitations of the original data set, such as ambiguous words, infrequent words, and multi-word tokens, while increasing the number of test cases. The data set is expanded to contain semantic and syntactic tests and is multilingual (English, German, and Italian). We provide an in-depth analysis of word embedding models with a range of hyper-parameters. Our analysis shows the suitability of different models and hyper-parameters for different tasks and the greater difficulty of representing German and Italian languages.
Citations: 2

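The outlier identification task itself is straightforward to operationalize: given a small word set, predict as the outlier the word that is least similar, on average, to the rest. The sketch below uses this standard compactness-style heuristic over pre-computed non-contextual vectors; the toy vectors are invented, and the exact protocol of the 50-8-8 data set may differ.

```python
import numpy as np

def predict_outlier(words, vectors):
    """Return the word whose embedding is least similar, on average, to the others.

    words   : list of words, one of which is the intended outlier
    vectors : dict mapping each word to a 1-D numpy array
    """
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    avg_sim = {}
    for w in words:
        others = [cos(vectors[w], vectors[o]) for o in words if o != w]
        avg_sim[w] = sum(others) / len(others)
    return min(avg_sim, key=avg_sim.get)

# Toy vectors: "piano" is the intended outlier among animals
vecs = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.1]),
    "horse": np.array([0.7, 0.3, 0.0]),
    "piano": np.array([0.0, 0.1, 0.9]),
}
print(predict_outlier(list(vecs), vecs))  # expected: "piano"
```
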
ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.15
Hanna Wecker, Annemarie Friedrich, Heike Adel
Abstract: This paper adds to the ongoing discussion in the natural language processing community on how to choose a good development set. Motivated by the real-life necessity of applying machine learning models to different data distributions, we propose a clustering-based data splitting algorithm. It creates development (or test) sets which are lexically different from the training data while ensuring similar label distributions. Hence, we are able to create challenging cross-validation evaluation setups while abstracting away from performance differences resulting from label distribution shifts between training and test data. In addition, we present a Python-based tool for analyzing and visualizing data split characteristics and model performance. We illustrate the workings and results of our approach using a sentiment analysis and a patent classification task.
Citations: 2

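As a rough illustration of the idea (not a reproduction of the authors' algorithm or their Python tool), the sketch below clusters documents by TF-IDF similarity and holds out whole clusters as the development set, so that development examples are lexically distant from the training data; the label-distribution balancing the paper also enforces is only noted in a comment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_based_split(texts, labels, n_clusters=5, dev_clusters=1, seed=0):
    """Hold out whole lexical clusters as the development set.

    Because dev documents come from clusters unseen in training, they are
    lexically different from the training data. A fuller implementation would
    also re-balance the held-out clusters so that train/dev label distributions
    match, which this sketch does not do.
    """
    tfidf = TfidfVectorizer().fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed,
                         n_init=10).fit_predict(tfidf)

    held_out = set(range(dev_clusters))  # naive choice of clusters to hold out
    train, dev = [], []
    for idx, cid in enumerate(cluster_ids):
        (dev if cid in held_out else train).append((texts[idx], labels[idx]))
    return train, dev

texts = ["the movie was great", "a great film", "terrible patent claim",
         "the patent describes a pump", "awful movie", "a pump for fluids"]
labels = ["review", "review", "patent", "patent", "review", "patent"]
train, dev = cluster_based_split(texts, labels, n_clusters=2, dev_clusters=1)
print(len(train), len(dev))
```
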
Truth or Error? Towards systematic analysis of factual errors in abstractive summaries
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.1
Klaus-Michael Lux, Maya Sappelli, M. Larson
Abstract: This paper presents a typology of errors produced by automatic summarization systems. The typology was created by manually analyzing the output of four recent neural summarization systems. Our work is motivated by the growing awareness of the need for better summary evaluation methods that go beyond conventional overlap-based metrics. Our typology is structured into two dimensions. First, the Mapping Dimension describes surface-level errors and provides insight into word-sequence transformation issues. Second, the Meaning Dimension describes issues related to interpretation and provides insight into breakdowns in truth, i.e., factual faithfulness to the original text. Comparative analysis revealed that two neural summarization systems leveraging pre-trained models have an advantage in decreasing grammaticality errors, but not necessarily factual errors. We also discuss the importance of ensuring that summary length and abstractiveness do not interfere with evaluating summary quality.
Citations: 13

ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.4
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Kyomin Jung
Abstract: In this paper, we propose an evaluation metric for image captioning systems using both image and text information. Unlike previous methods that rely on textual representations in evaluating the caption, our approach uses visiolinguistic representations. The proposed method generates image-conditioned embeddings for each token using ViLBERT from both generated and reference texts. Then, the contextual embeddings of the two sentences are compared to compute the similarity score. Experimental results on three benchmark datasets show that our method correlates significantly better with human judgments than all existing metrics.
Citations: 27

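The comparison step described in the abstract can be illustrated with a BERTScore-style greedy cosine matching over token embeddings. The sketch below assumes the image-conditioned embeddings have already been produced by some vision-and-language encoder such as ViLBERT and shows only the similarity computation; the exact matching and weighting used by ViLBERTScore may differ.

```python
import numpy as np

def greedy_embedding_f1(cand_emb, ref_emb):
    """BERTScore-style similarity between two sequences of token embeddings.

    cand_emb : (n_cand, d) array of embeddings for the generated caption
    ref_emb  : (n_ref, d) array of embeddings for the reference caption
    In ViLBERTScore these embeddings would be conditioned on the image.
    """
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # pairwise cosine similarities
    precision = sim.max(axis=1).mean()      # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()         # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
print(greedy_embedding_f1(rng.normal(size=(7, 16)), rng.normal(size=(9, 16))))
```
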
Evaluating Word Embeddings on Low-Resource Languages
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.17
Nathan Stringham, Michael Izbicki
Abstract: The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a 41-million-token dataset, and the smallest (Old Gujarati) has a dataset of only 1813 tokens.
Citations: 5

Grammaticality and Language Modelling
Pub Date: 2020-11-01 | DOI: 10.18653/v1/2020.eval4nlp-1.11
Jingcheng Niu, Gerald Penn
Abstract: Ever since Pereira (2000) provided evidence against Chomsky's (1957) conjecture that statistical language modelling is incommensurable with the aims of grammaticality prediction as a research enterprise, a new area of research has emerged that regards statistical language models as "psycholinguistic subjects" and probes their ability to acquire syntactic knowledge. The advent of The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) has earned a spot on the leaderboard for acceptability judgements, and the polemic between Lau et al. (2017) and Sprouse et al. (2018) has raised fundamental questions about the nature of grammaticality and how acceptability judgements should be elicited. All the while, we are told that neural language models continue to improve. That is not an easy claim to test at present, however, because there is almost no agreement on how to measure their improvement when it comes to grammaticality and acceptability judgements. The GLUE leaderboard bundles CoLA together with a Matthews correlation coefficient (MCC), although probably because CoLA's seminal publication was using it to compute inter-rater reliabilities. Researchers working in this area have used other accuracy and correlation scores, often driven by a need to reconcile and compare various discrete and continuous variables with each other. The score that we will advocate for in this paper, the point biserial correlation, in fact compares a discrete variable (for us, acceptability judgements) to a continuous variable (for us, neural language model probabilities). The only previous work in this area to choose the PBC that we are aware of is Sprouse et al. (2018a), and that paper actually applied it backwards (with some justification) so that the language model probability was treated as the discrete binary variable by setting a threshold. With the PBC in mind, we will first reappraise some recent work in syntactically targeted linguistic evaluations (Hu et al., 2020), arguing that while their experimental design sets a new high watermark for this topic, their results may not prove what they have claimed. We then turn to the task-independent assessment of language models as grammaticality classifiers. Prior to the introduction of the GLUE leaderboard, the vast majority of this assessment was essentially anecdotal, and we find the use of the MCC in this regard to be problematic. We conduct several studies with PBCs to compare several popular language models. We also study the effects of several variables such as normalization and data homogeneity on PBC.
Citations: 2

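Because the abstract's argument centers on the point biserial correlation (PBC) between a binary variable (acceptability judgements) and a continuous one (language-model probabilities), a short worked example may help; the numbers below are invented for illustration, and SciPy's pointbiserialr is used rather than any code from the paper.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Binary acceptability judgements (1 = acceptable) and the language model's
# (length-normalized) log-probabilities for the same sentences. Toy numbers.
acceptable = np.array([1, 1, 1, 0, 0, 1, 0, 0])
logprob = np.array([-1.2, -1.5, -1.1, -3.0, -2.8, -1.7, -2.5, -3.3])

r, p_value = pointbiserialr(acceptable, logprob)
print(f"point biserial r = {r:.3f}, p = {p_value:.3f}")
# Equivalently: r_pb = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean
# log-probabilities of the acceptable and unacceptable groups, s is the overall
# standard deviation, and p, q are the proportions of the two groups.
```
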