数据报表:从技术概念到社区实践

ACM Journal on Responsible Computing Pub Date : 2023-05-08 DOI:10.1145/3594737

Angelina McMillan-Major, Emily M. Bender, Batya Friedman

{"title":"数据报表:从技术概念到社区实践","authors":"Angelina McMillan-Major, Emily M. Bender, Batya Friedman","doi":"10.1145/3594737","DOIUrl":null,"url":null,"abstract":"Responsible computing ultimately requires that technical communities develop and adopt tools, processes, and practices that mitigate harms and support human flourishing. Prior efforts toward the responsible development and use of datasets, machine learning models, and other technical systems have led to the creation of documentation toolkits to facilitate transparency, diagnosis, and inclusion. This work takes the next step: to catalyze community uptake, alongside toolkit improvement. Specifically, starting from one such proposed toolkit specialized for language datasets, data statements for natural language processing (NLP), we explore how to improve the toolkit in three senses: (1) the content of the toolkit itself, (2) engagement with professional practice, and (3) moving from a conceptual proposal to a tested schema that the intended community of use may readily adopt. To achieve these goals, we first conducted a workshop with NLP practitioners in order to identify gaps and limitations of the toolkit as well as to develop best practices for writing data statements, yielding an interim improved toolkit. Then we conducted an analytic comparison between the interim toolkit and another documentation toolkit, datasheets for datasets. Based on these two integrated processes, we present our revised Version 2 schema and best practices in a guide for writing data statements. Our findings more generally provide integrated processes for co-evolving both technology and practice to address ethical concerns within situated technical communities.","PeriodicalId":329595,"journal":{"name":"ACM Journal on Responsible Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Data Statements: From Technical Concept to Community Practice\",\"authors\":\"Angelina McMillan-Major, Emily M. Bender, Batya Friedman\",\"doi\":\"10.1145/3594737\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Responsible computing ultimately requires that technical communities develop and adopt tools, processes, and practices that mitigate harms and support human flourishing. Prior efforts toward the responsible development and use of datasets, machine learning models, and other technical systems have led to the creation of documentation toolkits to facilitate transparency, diagnosis, and inclusion. This work takes the next step: to catalyze community uptake, alongside toolkit improvement. Specifically, starting from one such proposed toolkit specialized for language datasets, data statements for natural language processing (NLP), we explore how to improve the toolkit in three senses: (1) the content of the toolkit itself, (2) engagement with professional practice, and (3) moving from a conceptual proposal to a tested schema that the intended community of use may readily adopt. To achieve these goals, we first conducted a workshop with NLP practitioners in order to identify gaps and limitations of the toolkit as well as to develop best practices for writing data statements, yielding an interim improved toolkit. Then we conducted an analytic comparison between the interim toolkit and another documentation toolkit, datasheets for datasets. Based on these two integrated processes, we present our revised Version 2 schema and best practices in a guide for writing data statements. Our findings more generally provide integrated processes for co-evolving both technology and practice to address ethical concerns within situated technical communities.\",\"PeriodicalId\":329595,\"journal\":{\"name\":\"ACM Journal on Responsible Computing\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Journal on Responsible Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3594737\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Journal on Responsible Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3594737","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

负责任的计算最终需要技术社区开发和采用减轻危害和支持人类繁荣的工具、过程和实践。之前对数据集、机器学习模型和其他技术系统的负责任开发和使用的努力导致了文档工具包的创建，以促进透明度、诊断和包容性。这项工作的下一步是:促进社区的吸收，同时改进工具包。具体地说，我们从一个专门针对语言数据集、自然语言处理(NLP)的数据语句的拟议工具包开始，探索如何在三个方面改进工具包:(1)工具包本身的内容，(2)与专业实践的接触，以及(3)从概念建议转变为预期使用社区可能容易采用的经过测试的模式。为了实现这些目标，我们首先与NLP从业者进行了一次研讨会，以确定工具包的差距和局限性，并开发编写数据语句的最佳实践，从而产生一个临时改进的工具包。然后，我们对临时工具包和另一个文档工具包(数据集的数据表)进行了分析比较。基于这两个集成的过程，我们在编写数据语句的指南中提出了修订后的Version 2模式和最佳实践。我们的发现更普遍地为共同发展技术和实践提供了集成过程，以解决技术社区内的伦理问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Data Statements: From Technical Concept to Community Practice

Responsible computing ultimately requires that technical communities develop and adopt tools, processes, and practices that mitigate harms and support human flourishing. Prior efforts toward the responsible development and use of datasets, machine learning models, and other technical systems have led to the creation of documentation toolkits to facilitate transparency, diagnosis, and inclusion. This work takes the next step: to catalyze community uptake, alongside toolkit improvement. Specifically, starting from one such proposed toolkit specialized for language datasets, data statements for natural language processing (NLP), we explore how to improve the toolkit in three senses: (1) the content of the toolkit itself, (2) engagement with professional practice, and (3) moving from a conceptual proposal to a tested schema that the intended community of use may readily adopt. To achieve these goals, we first conducted a workshop with NLP practitioners in order to identify gaps and limitations of the toolkit as well as to develop best practices for writing data statements, yielding an interim improved toolkit. Then we conducted an analytic comparison between the interim toolkit and another documentation toolkit, datasheets for datasets. Based on these two integrated processes, we present our revised Version 2 schema and best practices in a guide for writing data statements. Our findings more generally provide integrated processes for co-evolving both technology and practice to address ethical concerns within situated technical communities.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Journal on Responsible Computing

自引率

0.00%

发文量