Lessons from archives: strategies for collecting sociocultural data in machine learning

Eun Seo Jo, Timnit Gebru
{"title":"Lessons from archives: strategies for collecting sociocultural data in machine learning","authors":"Eun Seo Jo, Timnit Gebru","doi":"10.1145/3351095.3372829","DOIUrl":null,"url":null,"abstract":"A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature however, data collection remains an overlooked part of the machine learning (ML) pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest standing communal effort to gather human information and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural ML. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection and draw from interdisciplinary expertise.","PeriodicalId":377829,"journal":{"name":"Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"213","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3351095.3372829","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 213

Abstract

A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature however, data collection remains an overlooked part of the machine learning (ML) pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest standing communal effort to gather human information and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural ML. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection and draw from interdisciplinary expertise.
从档案中吸取教训:机器学习中收集社会文化数据的策略
越来越多的研究表明,机器学习系统中公平性、问责制、透明度和道德方面的许多问题都植根于围绕数据收集和注释过程的决策。然而,尽管数据收集具有基本性质,但它仍然是机器学习(ML)管道中被忽视的一部分。在本文中,我们认为应该在ML中形成一个新的专业化,专注于数据收集和注释的方法:需要制度框架和程序的努力。特别是对于社会文化数据,可以从档案和图书馆中找到相似之处。档案是收集人类信息的长期共同努力,档案学者已经开发了语言和程序来解决和讨论与数据收集有关的许多挑战,如同意、权力、包容性、透明度、道德和隐私。我们讨论了档案文件收集实践中的这五种关键方法,这些方法可以为社会文化机器学习的数据收集提供信息。通过展示来自另一个领域的数据收集实践,我们鼓励机器学习研究在数据收集方面更加认知和系统化,并借鉴跨学科的专业知识。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信