Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

Proceedings of the ... AAAI Conference on Human Computation and Crowdsourcing Pub Date : 2023-11-03 DOI:10.1609/hcomp.v11i1.27548

Nari Johnson, Ángel Alexander Cabrera, Gregory Plumb, Ameet Talwalkar


Citations: 0

Abstract


Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about an object detection model. Our results provide positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when creating and evaluating new tools for slice discovery.
