Garbage in, garbage out?: do machine learning application papers in social computing report where human-labeled training data comes from?

Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. Pub Date: 2019-12-17. DOI: 10.1145/3351095.3372862

R. Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, Jenny Huang


Citation count: 98

Abstract

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing (specifically, papers from arXiv and traditional publications performing an ML classification task on Twitter data) give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, determining whether the paper reports who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers was disclosed, and whether the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally important aspect of whether such data is reliable in the first place.
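As an illustration of what an "inter-rater reliability metric" can look like in practice, the sketch below computes Cohen's kappa for two labelers who independently labeled the same items. This is only one common metric, chosen here for illustration; the abstract does not say which metrics the surveyed papers used. The function name `cohens_kappa` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers over the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both labelers pick the same category
    # if each labeled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both labelers used a single identical category; kappa is undefined.
        return math.nan

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: two labelers, six tweets, two categories.
labeler_1 = ["spam", "spam", "ham", "ham", "spam", "ham"]
labeler_2 = ["spam", "ham", "ham", "ham", "spam", "ham"]
kappa = cohens_kappa(labeler_1, labeler_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why reporting it (rather than raw percent agreement alone) is generally considered more informative about label reliability.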


