The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality

Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Pub Date: 2021-05-06. DOI: 10.1145/3411764.3445423

Mitchell L. Gordon, Kaitlyn Zhou, Kayur Patel, Tatsunori B. Hashimoto, Michael S. Bernstein

Citations: 89

Abstract

Machine learning classifiers for human-facing tasks such as comment toxicity and misinformation often score highly on metrics such as ROC AUC but are received poorly in practice. Why this gap? Today, metrics such as ROC AUC, precision, and recall are used to measure technical performance; however, human-computer interaction observes that evaluation of human-facing systems should account for people’s reactions to the system. In this paper, we introduce a transformation that more closely aligns machine learning classification metrics with the values and methods of user-facing performance measures. The disagreement deconvolution takes in any multi-annotator (e.g., crowdsourced) dataset, disentangles stable opinions from noise by estimating intra-annotator consistency, and compares each test set prediction to the individual stable opinions from each annotator. Applying the disagreement deconvolution to existing social computing datasets, we find that current metrics dramatically overstate the performance of many human-facing machine learning tasks: for example, performance on a comment toxicity task is corrected from .95 to .73 ROC AUC.
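The gap described in the abstract can be illustrated with a toy sketch. This is not the authors' method: it omits the noise-estimation step (intra-annotator consistency) entirely and uses plain accuracy rather than ROC AUC. It only shows why scoring against an aggregated majority label can look better than scoring against each annotator's individual label. All data, names, and functions below are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each comment has labels from three annotators (1 = toxic).
annotations = {
    "c1": [1, 1, 1],
    "c2": [1, 1, 0],
    "c3": [0, 0, 1],
    "c4": [0, 0, 0],
}
# Hypothetical classifier predictions for the same comments.
predictions = {"c1": 1, "c2": 1, "c3": 0, "c4": 0}


def majority_label(labels):
    """Return the most common label for one item (ties are not handled in this toy)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy_vs_majority(annotations, predictions):
    """Score each prediction against the aggregated (majority-vote) label."""
    hits = sum(predictions[k] == majority_label(v) for k, v in annotations.items())
    return hits / len(annotations)


def accuracy_vs_individuals(annotations, predictions):
    """Score each prediction against every individual annotator label."""
    pairs = [(predictions[k], label) for k, v in annotations.items() for label in v]
    return sum(p == label for p, label in pairs) / len(pairs)


print(accuracy_vs_majority(annotations, predictions))     # agreement with majority vote
print(accuracy_vs_individuals(annotations, predictions))  # agreement with individual annotators
```

In this toy data the classifier matches every majority label, yet it disagrees with two of the twelve individual annotations, so the per-annotator score is lower. The paper's transformation additionally separates stable opinions from labeling noise before making this comparison, which the sketch does not attempt.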
