Fides: Towards a Platform for Responsible Data Science

Proceedings of the 29th International Conference on Scientific and Statistical Database Management
Pub Date: 2017-06-27
DOI: 10.1145/3085504.3085530

Julia Stoyanovich, Bill Howe, S. Abiteboul, G. Miklau, Arnaud Sahuguet, G. Weikum


Citations: 32

Abstract

Fides: Towards a Platform for Responsible Data Science

Issues of responsible data analysis and use are coming to the forefront of data science research and practice, with the most significant efforts to date coming from the data mining, machine learning, and security and privacy communities. In these fields, research has focused on analyzing the fairness, accountability, and transparency (FAT) properties of specific algorithms and their outputs. Although these issues are most apparent in the social sciences, where fairness is interpreted in terms of the distribution of resources across protected groups, managing bias in source data affects a variety of fields. Consider climate change studies that require representative data from geographically diverse regions, or supply chain analyses that require data representing the diversity of products and customers. Any domain that involves sparse or sampled data is exposed to potential bias. In this vision paper, we argue that FAT properties must be treated as database system issues, further upstream in the data science lifecycle: bias in source data goes unnoticed, and bias may be introduced during pre-processing (fairness); spurious correlations lead to reproducibility problems (accountability); and assumptions made during pre-processing have invisible but significant effects on decisions (transparency). As machine learning methods continue to be applied broadly by non-experts, the potential for misuse increases. We see a need for a data sharing and collaborative analytics platform with features that encourage (and in some cases enforce) best practices at all stages of the data science lifecycle. We describe features of such a platform, which we term Fides, in the context of urban analytics, and outline a systems research agenda for responsible data science.

