The Limits of Abstract Evaluation Metrics: The Case of Hate Speech Detection

Proceedings of the 2017 ACM on Web Science Conference
Published: 2017-06-25
DOI: 10.1145/3091478.3098871

Alexandra Olteanu, Kartik Talamadupula, Kush R. Varshney

Citations: 38

Abstract

Wagstaff (2012) draws attention to the pervasiveness of abstract evaluation metrics that explicitly ignore or remove problem specifics. While such metrics allow practitioners to compare numbers across application domains, they offer limited insight into the impact of algorithmic decisions on humans and into how humans perceive the algorithm's correctness. Even for problems that are mathematically the same, both the real cost of (mathematically) identical errors and the cost users perceive may vary significantly with the specifics of each problem domain and with the user perceiving the result. While the real cost of errors has been considered previously, little attention has been paid to perceived cost. We advocate including human-centered metrics that elicit error costs from users along two dimensions: the nature of the error and the user context. Focusing on hate speech detection on social media, we demonstrate that even when performance is held fixed under an abstract metric such as precision, users' perception of correctness varies greatly depending on the nature of the errors and on user characteristics.
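
The core claim, that two systems with identical precision can differ sharply in how costly their errors feel, can be illustrated with a toy calculation. This is a minimal sketch, not the authors' method, and it uses no data from the paper: the two false-positive categories and their weights in the hypothetical `PERCEIVED_COST` table are placeholders for the costs that, per the paper's proposal, would be elicited from users.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    flagged: bool     # the classifier labeled the post as hate speech
    is_hate: bool     # ground-truth label
    error_type: str   # hypothetical category of a wrong flag; "" when correct


# Hypothetical perceived cost of each kind of false positive (illustrative only,
# not values from the paper).
PERCEIVED_COST = {"offensive_but_not_hate": 0.2, "benign": 1.0}


def precision(preds):
    """Abstract metric: fraction of flagged posts that really are hate speech."""
    flagged = [p for p in preds if p.flagged]
    return sum(p.is_hate for p in flagged) / len(flagged)


def perceived_error_cost(preds):
    """Sum the hypothetical perceived cost of every false positive."""
    return sum(PERCEIVED_COST[p.error_type]
               for p in preds if p.flagged and not p.is_hate)


def make_system(num_true_pos, false_pos_types):
    """Build a toy prediction set with the given true and false positives."""
    preds = [Prediction(True, True, "") for _ in range(num_true_pos)]
    preds += [Prediction(True, False, t) for t in false_pos_types]
    return preds


# Both systems flag 10 posts and get 8 right, so precision is identical...
system_a = make_system(8, ["offensive_but_not_hate"] * 2)
system_b = make_system(8, ["benign"] * 2)

print(precision(system_a), precision(system_b))  # 0.8 0.8
# ...but the kind of mistake each one makes leads to very different perceived costs.
print(perceived_error_cost(system_a), perceived_error_cost(system_b))  # 0.4 2.0
```

A per-user weight could be layered on top of `PERCEIVED_COST` to model the second dimension the abstract names (user context). It is left out here to keep the example minimal.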
