Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models

Reda Yacouby, Dustin Axman
{"title":"精准度、召回率和F1分数的概率扩展,以更彻底地评估分类模型","authors":"Reda Yacouby, Dustin Axman","doi":"10.18653/v1/2020.eval4nlp-1.9","DOIUrl":null,"url":null,"abstract":"In pursuit of the perfect supervised NLP classifier, razor thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient as they fail to give a complete picture of the model’s behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1) respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model’s confidence score assignments have an impact on the system’s behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments are self-evident from the metrics’ definitions; the remaining benefits, generalization, and functional consistency are demonstrated empirically.","PeriodicalId":448066,"journal":{"name":"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"101","resultStr":"{\"title\":\"Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models\",\"authors\":\"Reda Yacouby, Dustin Axman\",\"doi\":\"10.18653/v1/2020.eval4nlp-1.9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In pursuit of the perfect supervised NLP classifier, razor thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient as they fail to give a complete picture of the model’s behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1) respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model’s confidence score assignments have an impact on the system’s behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. 
Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments are self-evident from the metrics’ definitions; the remaining benefits, generalization, and functional consistency are demonstrated empirically.\",\"PeriodicalId\":448066,\"journal\":{\"name\":\"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"101\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.eval4nlp-1.9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.eval4nlp-1.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 101

Abstract

In pursuit of the perfect supervised NLP classifier, razor-thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient because they fail to give a complete picture of the model's behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1), respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model's confidence score assignments have an impact on the system's behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments, are self-evident from the metrics' definitions; the remaining benefits, generalization and functional consistency, are demonstrated empirically.
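The abstract does not spell out the metric definitions, but the names suggest replacing the hard counts in Precision and Recall with confidence-weighted ones. The Python sketch below is one plausible reading of that idea, not the paper's verbatim formulation: it assumes binary labels with a per-sample predicted probability for the positive class, and treats cTP, cFP, and cFN as sums of allocated and misallocated confidence rather than 0/1 counts.

```python
import numpy as np

def confidence_f1(y_true, p_pos):
    """Confidence-weighted precision, recall, and F1 (cPrecision, cRecall, cF1).

    A minimal sketch of one plausible formulation: each hard count
    (TP, FP, FN) is replaced by a sum of the model's confidence in the
    positive class. `y_true` holds binary ground-truth labels and
    `p_pos` the predicted positive-class probability per sample.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)

    # Soft counts: credit is proportional to confidence, so a 0.9
    # prediction on a true positive contributes 0.9 to cTP and 0.1 to cFN.
    c_tp = np.sum(p_pos[y_true == 1])        # confidence placed on true positives
    c_fn = np.sum(1.0 - p_pos[y_true == 1])  # confidence withheld from positives
    c_fp = np.sum(p_pos[y_true == 0])        # confidence wrongly given to negatives

    c_precision = c_tp / (c_tp + c_fp) if (c_tp + c_fp) else 0.0
    c_recall = c_tp / (c_tp + c_fn) if (c_tp + c_fn) else 0.0
    c_f1 = (2 * c_precision * c_recall / (c_precision + c_recall)
            if (c_precision + c_recall) else 0.0)
    return c_precision, c_recall, c_f1


if __name__ == "__main__":
    # Two hypothetical models that make identical thresholded predictions
    # but assign different confidences.
    labels = [1, 1, 0, 0]
    confident = [0.95, 0.90, 0.10, 0.05]
    hesitant = [0.60, 0.55, 0.45, 0.40]
    print(confidence_f1(labels, confident))  # cF1 = 0.925
    print(confidence_f1(labels, hesitant))   # cF1 = 0.575
```

Note how the two models in the usage example make the same hard predictions at a 0.5 threshold (both reach a threshold-based F1 of 1.0) yet receive different cF1 scores; under this reading, that gap is the "sensitivity to model confidence score assignments" the abstract highlights.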