Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models

Reda Yacouby, Dustin Axman
{"title":"精准度、召回率和F1分数的概率扩展,以更彻底地评估分类模型","authors":"Reda Yacouby, Dustin Axman","doi":"10.18653/v1/2020.eval4nlp-1.9","DOIUrl":null,"url":null,"abstract":"In pursuit of the perfect supervised NLP classifier, razor thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient as they fail to give a complete picture of the model’s behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1) respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model’s confidence score assignments have an impact on the system’s behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments are self-evident from the metrics’ definitions; the remaining benefits, generalization, and functional consistency are demonstrated empirically.","PeriodicalId":448066,"journal":{"name":"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"101","resultStr":"{\"title\":\"Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models\",\"authors\":\"Reda Yacouby, Dustin Axman\",\"doi\":\"10.18653/v1/2020.eval4nlp-1.9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In pursuit of the perfect supervised NLP classifier, razor thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient as they fail to give a complete picture of the model’s behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1) respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model’s confidence score assignments have an impact on the system’s behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. 
Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments are self-evident from the metrics’ definitions; the remaining benefits, generalization, and functional consistency are demonstrated empirically.\",\"PeriodicalId\":448066,\"journal\":{\"name\":\"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"101\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.eval4nlp-1.9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.eval4nlp-1.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 101

Abstract

In pursuit of the perfect supervised NLP classifier, razor-thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient because they fail to give a complete picture of the model's behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1), respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model's confidence score assignments have an impact on the system's behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments, are self-evident from the metrics' definitions; the remaining benefits, generalization and functional consistency, are demonstrated empirically.
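The abstract does not spell out the metric definitions, but the names suggest replacing the hard counts in Precision and Recall with confidence-weighted ones. The Python sketch below is one plausible reading of that idea, not the paper's verbatim formulation: it assumes binary labels with a per-sample predicted probability for the positive class, and treats cTP, cFP, and cFN as sums of allocated and misallocated confidence rather than 0/1 counts.

```python
import numpy as np

def confidence_f1(y_true, p_pos):
    """Confidence-weighted precision, recall, and F1 (cPrecision, cRecall, cF1).

    A minimal sketch of one plausible formulation: each hard count
    (TP, FP, FN) is replaced by a sum of the model's confidence in the
    positive class. `y_true` holds binary ground-truth labels and
    `p_pos` the predicted positive-class probability per sample.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)

    # Soft counts: credit is proportional to confidence, so a 0.9
    # prediction on a true positive contributes 0.9 to cTP and 0.1 to cFN.
    c_tp = np.sum(p_pos[y_true == 1])        # confidence placed on true positives
    c_fn = np.sum(1.0 - p_pos[y_true == 1])  # confidence withheld from positives
    c_fp = np.sum(p_pos[y_true == 0])        # confidence wrongly given to negatives

    c_precision = c_tp / (c_tp + c_fp) if (c_tp + c_fp) else 0.0
    c_recall = c_tp / (c_tp + c_fn) if (c_tp + c_fn) else 0.0
    c_f1 = (2 * c_precision * c_recall / (c_precision + c_recall)
            if (c_precision + c_recall) else 0.0)
    return c_precision, c_recall, c_f1


if __name__ == "__main__":
    # Two hypothetical models that make identical thresholded predictions
    # but assign different confidences.
    labels = [1, 1, 0, 0]
    confident = [0.95, 0.90, 0.10, 0.05]
    hesitant = [0.60, 0.55, 0.45, 0.40]
    print(confidence_f1(labels, confident))  # cF1 = 0.925
    print(confidence_f1(labels, hesitant))   # cF1 = 0.575
```

Note how the two models in the usage example make the same hard predictions at a 0.5 threshold (both reach a threshold-based F1 of 1.0) yet receive different cF1 scores; under this reading, that gap is the "sensitivity to model confidence score assignments" the abstract highlights.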