Ethical Dimensions for Data Quality

Donatella Firmani, Letizia Tanca, Riccardo Torlone
ACM Journal of Data and Information Quality (JDIQ), Vol. 1, No. 1, Article 1, pages 1-5. Publication date: January 2019 (online: 2019-12-05). DOI: 10.1145/3362121. © 2019 Association for Computing Machinery.

Abstract

Transparency is the ability to interpret the information extraction process in order to verify which aspects of the data determine its results. In this context, transparency metrics can use the notions of (i) data provenance [19, 18], by measuring the amount of meta-data describing where the original data come from, and (ii) explanation [15], by describing how a result has been obtained.

Diversity is the degree to which different kinds of objects are represented in a dataset; several metrics are proposed in [9]. Ensuring diversity at the beginning of the information extraction process may be useful for enforcing fairness at the end. The diversity dimension may conflict with established dimensions in the Trust cluster of [5], which prioritizes a few high-reputation sources.

Data Protection concerns the ways to secure data, algorithms, and models against unauthorized access. Defining measures can be an elusive goal since, on the one hand, anonymized datasets that are secure in isolation can reveal sensitive information when combined [1], and, on the other hand, robust techniques such as ε-differential privacy [10] can only describe the privacy impact of specific queries. Data protection is related to the well-established security dimension of [5].

3 ETHICAL CHALLENGES IN THE INFORMATION EXTRACTION PROCESS

We highlight some challenges of complying with the dimensions of the Ethics Cluster throughout the three phases of the information extraction process mentioned in the Introduction.

A. Source Selection.
Data can typically come from multiple sources, and it is most desirable that each of these complies with the ethics dimensions described in the previous section. If some sources do not comply with a dimension individually, we should keep in mind that the really important requirement is that the data finally used for analysis or recommendations do. It is thus appropriate to consider ethics for multiple sources in combination, so that bias towards a certain category in a single source can be offset by another source with the opposite bias. While for the fairness, transparency, and diversity dimensions this is clearly possible, for privacy we can only act on the single data sources, because adding more information can only lower the protection level or, at most, leave it unchanged.

Ethics in source selection is tightly related to the transparency of the source, specifically for sources that are themselves aggregators. Information lineage is of paramount importance in this case and can be accomplished with the popular notion of provenance [18]; however, how to capture the most fine-grained type of provenance, namely data provenance, remains an open question [12]. A more general challenge is source meta-data extraction, especially for interpreting unstructured contents and thus their ethical implications. Finally, we note that the data acquisition process also plays a role, and developing inherently transparent and fair collection and extraction methods is an almost unstudied topic.

B. Data Integration.

Ensuring ethics in the selection step is not enough: even if the collected data satisfy the ethical requirements, their integration does not necessarily do so [1]. Data integration usually involves three main steps: (i) schema matching, i.e., the alignment of the schemata of the data sources (when present); (ii) identification of the items stored in different data sources that refer to the same entity (also called record linkage or entity resolution); and (iii) construction of an integrated database over the data sources, obtained by merging their contents (also called data fusion). Each step is prone to different ethical concerns, as discussed below.

Schema Matching. Groups treated fairly in the sources can become over- or under-represented as a consequence of the integration process, possibly causing unfair decisions in the following steps. Similar issues arise in connection with diversity.

Entity Resolution. Integrating sources that, in isolation, protect identity (e.g., via anonymization) might generate a dataset that violates privacy: an instance of this is the so-called linkage attack [1]. We refer the reader to [21] for a survey of techniques and challenges of privacy-preserving entity resolution in the context of Big Data.

Data Fusion. Data disclosure, i.e., violation of data protection, can also happen in the fusion step, if privacy-preserving noise is accidentally removed by merging the data. Fusion can also affect fairness, when combining data coming from different sources leads to the exclusion of some groups.

In all the above steps transparency is fundamental: we can check the fulfilment of the ethical dimensions only if we can (i) provide explanations of the intermediate results and (ii) describe the provenance of the final data. Unfortunately, this can conflict with data protection, since removing identity information can cause a lack of transparency, which ultimately may lead to unfair outcomes.
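The linkage attack mentioned above can be illustrated with a minimal sketch: two releases that each look safe in isolation re-identify individuals once joined on shared quasi-identifiers. The records, field names, and helper function below are hypothetical, not from the paper.

```python
# "Anonymized" medical release: names removed, quasi-identifiers kept.
medical = [
    {"zip": "13053", "birth_year": 1965, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "13068", "birth_year": 1971, "sex": "M", "diagnosis": "flu"},
]

# Public voter roll: no medical data, but names plus the same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "zip": "13053", "birth_year": 1965, "sex": "F"},
    {"name": "Bob Jones", "zip": "13068", "birth_year": 1971, "sex": "M"},
]

def linkage_attack(medical, voters):
    """Join the two releases on (zip, birth_year, sex) to re-identify patients."""
    index = {(v["zip"], v["birth_year"], v["sex"]): v["name"] for v in voters}
    reidentified = {}
    for m in medical:
        key = (m["zip"], m["birth_year"], m["sex"])
        if key in index:  # a unique quasi-identifier combination leaks identity
            reidentified[index[key]] = m["diagnosis"]
    return reidentified

print(linkage_attack(medical, voters))
# {'Alice Smith': 'diabetes', 'Bob Jones': 'flu'}
```

Each release satisfies a naive notion of anonymity on its own; the violation only emerges at integration time, which is why privacy must be assessed over sources in combination.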
Like source selection, the integration process – especially the last two steps, where schema information is not present – can benefit from the existence of meta-data, allowing us to infer contextual meanings for individual terms and phrases. Fair extraction of meta-data is an exciting topic, as stereotypes and prejudices can often be found in automatically derived word semantics.

C. Knowledge Extraction.

An information extraction process presents the user with data organized so as to satisfy their information needs. Here we highlight some ethical challenges for a sample of the many possible information extraction operations.

Search and Query. These are typical data selection tasks. Diversifying the results of information retrieval and recommendation systems has traditionally been used to minimize the dissatisfaction of the average user [4]. However, since these search algorithms are also employed in critical tasks such as job candidate selection or university admissions, diversity has also become a way to ensure the fairness of the selection process [9]. Interestingly, if integrated data are unfair and over-represent a certain category, diversity can lead to data exclusion of that same category.

Aggregation. Many typical decision-support queries, such as GROUP BY queries, might yield biased results: trends appearing in different groups of data can disappear or even be reversed when these groups are combined, leading to incorrect insights. The work of [17] provides a framework for incorporating fairness in aggregated data based on independence tests, for specific aggregations. A direction for future work is detecting bias in combined data with full-fledged query systems.

Analytics. Data are typically analyzed by means of statistical, data mining, and machine learning techniques, providing encouraging results in decision making, even in data management problems [14].
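The aggregation pitfall above is an instance of Simpson's paradox: a GROUP BY view and its aggregate can rank the same alternatives in opposite ways. A small sketch with the classic kidney-stone treatment figures (used here purely for illustration):

```python
data = {
    # group: {treatment: (successes, trials)}
    "small_stones": {"A": (81, 87),   "B": (234, 270)},
    "large_stones": {"A": (192, 263), "B": (55, 80)},
}

def success_rate(successes, trials):
    return successes / trials

# Per-group comparison (the GROUP BY view): treatment A wins in both groups.
for group, arms in data.items():
    rates = {t: round(success_rate(*counts), 3) for t, counts in arms.items()}
    print(group, rates)
# small_stones {'A': 0.931, 'B': 0.867}
# large_stones {'A': 0.73, 'B': 0.688}

# Combined comparison: summing over groups flips the ranking, so B looks better.
combined = {
    t: success_rate(
        sum(data[g][t][0] for g in data),
        sum(data[g][t][1] for g in data),
    )
    for t in ("A", "B")
}
print("combined", {t: round(r, 3) for t, r in combined.items()})
# combined {'A': 0.78, 'B': 0.826}
```

The reversal happens because group sizes are unbalanced across treatments, which is exactly the kind of bias a fairness-aware aggregation framework must detect.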
However, while we are able to understand statistics and data mining models, when using techniques such as deep learning we are still far from fully grasping how a model produces its output. Therefore, explaining such systems has become an important new research area [16], related to the fairness and transparency of the training data as well as of the learning process.

4 RESEARCH DIRECTIONS

In the spirit of the responsible data science initiatives towards a full-fledged data quality perspective on ethics (see, for instance, redasci.org and dataresponsibly.github.io), the key ingredient is shared responsibility. As for any other engineering product, responsibility for data usage is shared by a contractor and a producer: only if the latter is able to provide a quality certification for the various ethical dimensions can the former share the responsibility for improper usage. Similarly, producers should be aware of their responsibility when quality drops below the granted level. While such guarantees are available for many classical quality dimensions, for instance timeliness, the same does not hold for most of the ethical dimensions. Privacy already has a well-defined way of guaranteeing a privacy level by design: (i) in the knowledge extraction step, thanks to the notion of ε-differential privacy [10], and (ii) in the integration step (see [21] for a survey). The so-called nutritional labels [13] mark a major step towards the idea of a quality certificate for fairness and diversity in the source selection and knowledge extraction steps, but how to preserve these properties throughout the process remains an open problem. Transparency is perhaps the hardest dimension to guarantee, and we believe that the well-known notion of provenance [12] can provide a promising starting point.
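The by-design privacy guarantee of ε-differential privacy mentioned above is typically obtained with mechanisms such as Laplace noise. The sketch below (an illustration, not the paper's method; all names and data are hypothetical) answers a counting query with noise calibrated to the query's sensitivity:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse-CDF method."""
    u = random.random() - 0.5                # uniform on (-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Answer a counting query with ε-differential privacy.

    A count has sensitivity 1 (one record changes it by at most 1),
    so Laplace noise with scale 1/ε suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 52, 29, 64, 47]          # hypothetical records
noisy = dp_count(ages, lambda a: a > 40, epsilon=1.0)
print(round(noisy, 2))                       # close to the true count 4; varies per run
```

Note the limitation the paper points out: the guarantee is stated per query, so the overall privacy impact of a whole extraction process must be tracked by composing the ε budgets of all queries issued.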
However, the rise of machine learning and deep learning techniques for some data integration tasks as well [8] poses new and exciting challenges in tracking the way integration is achieved [22]. Summing up, recent literature provides a variety of methods for verifying and enforcing ethical dimensions. However, they typically apply to the very early steps (such as collection) or the very late steps (such as analytics) of the information extraction process, and very few works study how to preserve ethics by design throughout the process.

5 RELATED WORKS AND CONCLUDING REMARKS

An early attempt to con