Ethical Dimensions for Data Quality

Donatella Firmani, Letizia Tanca, and Riccardo Torlone
https://doi.org/10.1145/3362121
{"title":"数据质量的伦理维度","authors":"D. Firmani, L. Tanca, Riccardo Torlone","doi":"10.1145/3362121","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2019 Association for Computing Machinery. 1936-1955/2019/1-ART1 $15.00 https://doi.org/10.1145/3362121 ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. 1:2 Donatella Firmani, Letizia Tanca, and Riccardo Torlone Transparency is the ability to interpret the information extraction process in order to verify which aspects of the data determine its results. In this context, transparency metrics can use the notions of (i) data provenance [19, 18], by measuring the amount of meta-data describing where the original data come from; (ii) explanation [15], by describing how a result has been obtained. Diversity is the degree to which different kinds of objects are represented in a dataset. Several metrics are proposed in [9]. Ensuring diversity at the beginning of the information extraction process may be useful for enforcing fairness at the end. The diversity dimension may conflict with established dimensions in the Trust cluster of [5], that prioritizes few high-reputation sources. Data Protection concerns the ways to secure data, algorithms and models against unauthorized access. Defining measures can be an elusive goal since, on the one hand, anonymized datasets that are secure in isolation can reveal sensible information when combined [1], and on the other hand, robust techniques such as ε-differential privacy [10] can only describe the privacy impact of specific queries. Data protection is related to the well-established security dimension of [5]. 3 ETHICAL CHALLENGES IN THE INFORMATION EXTRACTION PROCESS We highlight some challenges of complying with the dimensions of the Ethics Cluster, throughout the three phases of the information extraction process mentioned in the Introduction. A. Source Selection. Data can typically come from multiple sources, and it is most desirable that each of these complies with the ethics dimensions described in the previous section. If sources do not comply with (some) dimension individually, we should consider that the really important requirement is that the data that are finally used for analysis or recommendations do. It is thus appropriate to consider ethics for multiple sources in combination, so that the bias towards a certain category in a single source can be eliminated by another source with opposite bias. While for the fairness, transparency and diversity dimensions this is clearly possible, for the privacy we can only act on the single data sources because adding more information can only lower the protection level, or, at most, leave it as it is. Ethics in source selection is tightly related to the transparency of the source, specifically for sources that are themselves aggregators. Information lineage is of paramount importance in this case and can be accomplished with the popular notion of provenance [18]; however, how to capture the most fine-grained type of provenance, namely data provenance, remains an open question [12]. A more general challenge is source meta-data extraction, especially for interpreting unstructured contents and thus their ethical implications. 
Finally, we note that also the data acquisition process plays a role, and developing inherently transparent and fair collection and extraction methods is an almost unstudied topic. B. Data Integration. Ensuring ethics in the selection step is not enough: even if the collected data satisfy the ethical requirements, not necessarily their integration does [1]. Data integration usually involves three main steps: (i) schema matching, i.e. the alignment of the schemata of the data sources (when present), (ii) identification of the items stored in different data sources that refer to the same entity (also called record linkage or entity resolution), and (iii) construction of an integrated database over the data sources, obtained by merging their contents (also called data fusion). Each step is prone to different ethical concerns, as discussed below. Schema Matching. Groups treated fairly in the sources can become overor under-represented as a consequence of the integration process, possibly causing, in the following steps, unfair decisions. Similar issues arise in connection with diversity. Entity Resolution. Integrating sources that, in isolation, protect identity (e.g. via anonymization) might generate a dataset that violates privacy: an instance of this is the so-called linkage attack [1]. We refer the reader to [21] for a survey of techniques and challenges of privacy-preserving entity resolution in the context of Big Data. ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. Ethical Dimensions for DataQuality 1:3 Data Fusion. Data disclosure, i.e., violation of data protection, can happen also in the fusion step if privacy-preserving noise is accidentally removed by merging the data. Fusion can also affect fairness, when combining data coming from different sources leads to the exclusion of some groups. In all the above steps transparency is fundamental: we can check the fulfilment of the ethical dimensions only if we can (i) provide explanations of the intermediate results (ii) describe the provenance of the final data. Unfortunately, this can conflict with data protection since removing identity information can cause lack of transparency, which ultimately may lead to unfair outcomes. As source selection, also the integration process – especially the last two steps, where schema information is not present – can benefit from the existence of meta-data, allowing to infer contextual meanings for individual terms and phrases. Fair extraction of meta-data is an exciting topic, as stereotypes and prejudices can be often found into automatically derived word semantics. C. Knowledge Extraction. An information extraction process presents the user with data organized as to satisfy their information needs. Here we highlight some ethical challenges for a sample of the many possible information extractions operations. Search and Query. These are typical data selection tasks. Diversifying the results of information retrieval and recommendation systems has traditionally been used to minimize dissatisfaction of the average user [4]. However, since these search algorithms are employed also in critical tasks such as job candidate selection or for university admissions, diversity has also become a way to ensure the fairness of the selection process [9]. Interestingly, if integrated data are unfair and over-represent a certain category, diversity can lead to data exclusion of the same category. Aggregation. 
Many typical decision-support queries, such as GROUP BY queries, might yield biased result, e.g. trends appearing in different groups of data can disappear or even be reversedwhen these groups are combined, leading to incorrect insights. The work of [17] provides a framework for incorporating fairness in aggregated data based on independence tests, for specific aggregations. A future work is to detect bias in combined data with full-fledged query systems. Analytics. Data are typically analyzed by means of statistical, data mining and machine learning techniques, providing encouraging results in decision making, even in data management problems [14]. However, while we are able to understand statistics and data mining models, when using techniques such as deep learning we are still far from fully grasping how a model produces its output. Therefore, explaining systems has become an important new research area [16], related to the fairness and transparency of the training data as well as of the learning process. 4 RESEARCH DIRECTIONS In the spirit of the responsible data science initiatives towards a full-fledged data quality perspective on ethics (see, for instance, redasci.org and dataresponsibly.github.io), the key ingredient is shared responsibility. Like for any other engineering product, responsibility for data usage is shared by a contractor and a producer: only if the latter is able to provide a quality certification for the various ethical dimensions, the former can share the responsibility for improper usage. Similarly, producers should be aware of their responsibility when quality goes below the granted level. While such guarantees are available for many classical dimensions of quality, for instance timeliness, the same does not hold for most of the ethical dimensions. Privacy already has a well defined way for guaranteeing a privacy level by design: (i) in the knowledge extraction step, thanks to the notion of ε-differential privacy [10], and (ii) in the integration step (see [21] for a survey). The so-called nutritional labels [13] mark a major step towards the idea of a quality certificate for fairness and diversity in the source selection and knowledge extraction steps, but how to preserve these properties throughout the process remains instead an open problem. Transparency is perhaps the hardest dimension to guarantee, and we believe that the well-known notion of provenance [12] ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. 1:4 Donatella Firmani, Letizia Tanca, and Riccardo Torlone can provide a promising starting point. However, the rise of machine learning and deep learning techniques also for some data integration tasks [8] poses new and exciting challenges in tracking the way integration is achieved [22]. Summing up, recent literature provides a variety of methods for verifying/enforcing ethical dimensions. However, they typically apply to the very early (such as, collection) or very late steps (such as, analytics) of the information extraction process, but very few works study how to preserve ethics by design throughout the process. 
5 RELATEDWORKS AND CONCLUDING REMARKS An early attempt to con","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"37 1","pages":"1 - 5"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Ethical Dimensions for Data Quality\",\"authors\":\"D. Firmani, L. Tanca, Riccardo Torlone\",\"doi\":\"10.1145/3362121\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2019 Association for Computing Machinery. 1936-1955/2019/1-ART1 $15.00 https://doi.org/10.1145/3362121 ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. 1:2 Donatella Firmani, Letizia Tanca, and Riccardo Torlone Transparency is the ability to interpret the information extraction process in order to verify which aspects of the data determine its results. In this context, transparency metrics can use the notions of (i) data provenance [19, 18], by measuring the amount of meta-data describing where the original data come from; (ii) explanation [15], by describing how a result has been obtained. Diversity is the degree to which different kinds of objects are represented in a dataset. Several metrics are proposed in [9]. Ensuring diversity at the beginning of the information extraction process may be useful for enforcing fairness at the end. The diversity dimension may conflict with established dimensions in the Trust cluster of [5], that prioritizes few high-reputation sources. Data Protection concerns the ways to secure data, algorithms and models against unauthorized access. Defining measures can be an elusive goal since, on the one hand, anonymized datasets that are secure in isolation can reveal sensible information when combined [1], and on the other hand, robust techniques such as ε-differential privacy [10] can only describe the privacy impact of specific queries. Data protection is related to the well-established security dimension of [5]. 3 ETHICAL CHALLENGES IN THE INFORMATION EXTRACTION PROCESS We highlight some challenges of complying with the dimensions of the Ethics Cluster, throughout the three phases of the information extraction process mentioned in the Introduction. A. Source Selection. Data can typically come from multiple sources, and it is most desirable that each of these complies with the ethics dimensions described in the previous section. If sources do not comply with (some) dimension individually, we should consider that the really important requirement is that the data that are finally used for analysis or recommendations do. It is thus appropriate to consider ethics for multiple sources in combination, so that the bias towards a certain category in a single source can be eliminated by another source with opposite bias. While for the fairness, transparency and diversity dimensions this is clearly possible, for the privacy we can only act on the single data sources because adding more information can only lower the protection level, or, at most, leave it as it is. Ethics in source selection is tightly related to the transparency of the source, specifically for sources that are themselves aggregators. 
Information lineage is of paramount importance in this case and can be accomplished with the popular notion of provenance [18]; however, how to capture the most fine-grained type of provenance, namely data provenance, remains an open question [12]. A more general challenge is source meta-data extraction, especially for interpreting unstructured contents and thus their ethical implications. Finally, we note that also the data acquisition process plays a role, and developing inherently transparent and fair collection and extraction methods is an almost unstudied topic. B. Data Integration. Ensuring ethics in the selection step is not enough: even if the collected data satisfy the ethical requirements, not necessarily their integration does [1]. Data integration usually involves three main steps: (i) schema matching, i.e. the alignment of the schemata of the data sources (when present), (ii) identification of the items stored in different data sources that refer to the same entity (also called record linkage or entity resolution), and (iii) construction of an integrated database over the data sources, obtained by merging their contents (also called data fusion). Each step is prone to different ethical concerns, as discussed below. Schema Matching. Groups treated fairly in the sources can become overor under-represented as a consequence of the integration process, possibly causing, in the following steps, unfair decisions. Similar issues arise in connection with diversity. Entity Resolution. Integrating sources that, in isolation, protect identity (e.g. via anonymization) might generate a dataset that violates privacy: an instance of this is the so-called linkage attack [1]. We refer the reader to [21] for a survey of techniques and challenges of privacy-preserving entity resolution in the context of Big Data. ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. Ethical Dimensions for DataQuality 1:3 Data Fusion. Data disclosure, i.e., violation of data protection, can happen also in the fusion step if privacy-preserving noise is accidentally removed by merging the data. Fusion can also affect fairness, when combining data coming from different sources leads to the exclusion of some groups. In all the above steps transparency is fundamental: we can check the fulfilment of the ethical dimensions only if we can (i) provide explanations of the intermediate results (ii) describe the provenance of the final data. Unfortunately, this can conflict with data protection since removing identity information can cause lack of transparency, which ultimately may lead to unfair outcomes. As source selection, also the integration process – especially the last two steps, where schema information is not present – can benefit from the existence of meta-data, allowing to infer contextual meanings for individual terms and phrases. Fair extraction of meta-data is an exciting topic, as stereotypes and prejudices can be often found into automatically derived word semantics. C. Knowledge Extraction. An information extraction process presents the user with data organized as to satisfy their information needs. Here we highlight some ethical challenges for a sample of the many possible information extractions operations. Search and Query. These are typical data selection tasks. Diversifying the results of information retrieval and recommendation systems has traditionally been used to minimize dissatisfaction of the average user [4]. 
However, since these search algorithms are employed also in critical tasks such as job candidate selection or for university admissions, diversity has also become a way to ensure the fairness of the selection process [9]. Interestingly, if integrated data are unfair and over-represent a certain category, diversity can lead to data exclusion of the same category. Aggregation. Many typical decision-support queries, such as GROUP BY queries, might yield biased result, e.g. trends appearing in different groups of data can disappear or even be reversedwhen these groups are combined, leading to incorrect insights. The work of [17] provides a framework for incorporating fairness in aggregated data based on independence tests, for specific aggregations. A future work is to detect bias in combined data with full-fledged query systems. Analytics. Data are typically analyzed by means of statistical, data mining and machine learning techniques, providing encouraging results in decision making, even in data management problems [14]. However, while we are able to understand statistics and data mining models, when using techniques such as deep learning we are still far from fully grasping how a model produces its output. Therefore, explaining systems has become an important new research area [16], related to the fairness and transparency of the training data as well as of the learning process. 4 RESEARCH DIRECTIONS In the spirit of the responsible data science initiatives towards a full-fledged data quality perspective on ethics (see, for instance, redasci.org and dataresponsibly.github.io), the key ingredient is shared responsibility. Like for any other engineering product, responsibility for data usage is shared by a contractor and a producer: only if the latter is able to provide a quality certification for the various ethical dimensions, the former can share the responsibility for improper usage. Similarly, producers should be aware of their responsibility when quality goes below the granted level. While such guarantees are available for many classical dimensions of quality, for instance timeliness, the same does not hold for most of the ethical dimensions. Privacy already has a well defined way for guaranteeing a privacy level by design: (i) in the knowledge extraction step, thanks to the notion of ε-differential privacy [10], and (ii) in the integration step (see [21] for a survey). The so-called nutritional labels [13] mark a major step towards the idea of a quality certificate for fairness and diversity in the source selection and knowledge extraction steps, but how to preserve these properties throughout the process remains instead an open problem. Transparency is perhaps the hardest dimension to guarantee, and we believe that the well-known notion of provenance [12] ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. 1:4 Donatella Firmani, Letizia Tanca, and Riccardo Torlone can provide a promising starting point. However, the rise of machine learning and deep learning techniques also for some data integration tasks [8] poses new and exciting challenges in tracking the way integration is achieved [22]. Summing up, recent literature provides a variety of methods for verifying/enforcing ethical dimensions. However, they typically apply to the very early (such as, collection) or very late steps (such as, analytics) of the information extraction process, but very few works study how to preserve ethics by design throughout the process. 
5 RELATEDWORKS AND CONCLUDING REMARKS An early attempt to con\",\"PeriodicalId\":15582,\"journal\":{\"name\":\"Journal of Data and Information Quality (JDIQ)\",\"volume\":\"37 1\",\"pages\":\"1 - 5\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-12-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Data and Information Quality (JDIQ)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3362121\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3362121","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23
Transparency is the ability to interpret the information extraction process in order to verify which aspects of the data determine its results. In this context, transparency metrics can build on the notions of (i) data provenance [19, 18], by measuring the amount of meta-data describing where the original data come from, and (ii) explanation [15], by describing how a result has been obtained.

Diversity is the degree to which different kinds of objects are represented in a dataset. Several metrics are proposed in [9]. Ensuring diversity at the beginning of the information extraction process may be useful for enforcing fairness at the end. The diversity dimension may conflict with established dimensions in the Trust cluster of [5], which prioritizes a few high-reputation sources.

Data Protection concerns the ways to secure data, algorithms and models against unauthorized access. Defining measures can be an elusive goal since, on the one hand, anonymized datasets that are secure in isolation can reveal sensitive information when combined [1] and, on the other hand, robust techniques such as ε-differential privacy [10] can only describe the privacy impact of specific queries. Data protection is related to the well-established security dimension of [5].

3 ETHICAL CHALLENGES IN THE INFORMATION EXTRACTION PROCESS

We highlight some challenges of complying with the dimensions of the Ethics Cluster throughout the three phases of the information extraction process mentioned in the Introduction.

A. Source Selection. Data can typically come from multiple sources, and it is most desirable that each of these complies with the ethics dimensions described in the previous section. When the sources do not individually comply with some dimension, the really important requirement is that the data finally used for analysis or recommendations do. It is thus appropriate to consider ethics for multiple sources in combination, so that the bias towards a certain category in a single source can be offset by another source with the opposite bias. While this is clearly possible for the fairness, transparency and diversity dimensions, for privacy we can only act on the single data sources, because adding more information can only lower the protection level or, at most, leave it as it is. Ethics in source selection is tightly related to the transparency of the source, especially for sources that are themselves aggregators. Information lineage is of paramount importance in this case and can be addressed with the popular notion of provenance [18]; however, how to capture the most fine-grained type of provenance, namely data provenance, remains an open question [12]. A more general challenge is source meta-data extraction, especially for interpreting unstructured contents and thus their ethical implications. Finally, we note that the data acquisition process also plays a role, and developing inherently transparent and fair collection and extraction methods is an almost unstudied topic.
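As a minimal illustration of the source-combination argument above, the following Python sketch (with invented records and a hypothetical protected attribute, not data from the article) measures how a category is represented in each candidate source and in their union: two sources skewed in opposite directions can yield a balanced combined dataset, whereas no analogous combination can restore a privacy guarantee once additional information has been added.

```python
from collections import Counter

# Hypothetical candidate sources: each record carries a protected attribute.
source_a = [{"id": i, "gender": "F" if i % 4 == 0 else "M"}
            for i in range(100)]          # skewed towards M (75% / 25%)
source_b = [{"id": i, "gender": "M" if i % 4 == 0 else "F"}
            for i in range(100, 200)]     # skewed towards F (25% / 75%)

def representation(records, attribute="gender"):
    """Share of each attribute value in the dataset (a crude representation check)."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

print("source A:", representation(source_a))              # {'M': 0.75, 'F': 0.25}
print("source B:", representation(source_b))              # {'F': 0.75, 'M': 0.25}
print("combined:", representation(source_a + source_b))   # {'M': 0.5, 'F': 0.5}
```

This is only a crude representation check; the diversity metrics proposed in [9] are considerably richer.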
B. Data Integration. Ensuring ethics in the selection step is not enough: even if the collected data satisfy the ethical requirements, their integration does not necessarily do so [1]. Data integration usually involves three main steps: (i) schema matching, i.e., the alignment of the schemata of the data sources (when present); (ii) identification of the items stored in different data sources that refer to the same entity (also called record linkage or entity resolution); and (iii) construction of an integrated database over the data sources, obtained by merging their contents (also called data fusion). Each step is prone to different ethical concerns, as discussed below.

Schema Matching. Groups treated fairly in the sources can become over- or under-represented as a consequence of the integration process, possibly causing unfair decisions in the following steps. Similar issues arise in connection with diversity.

Entity Resolution. Integrating sources that, in isolation, protect identity (e.g., via anonymization) might generate a dataset that violates privacy: an instance of this is the so-called linkage attack [1]. We refer the reader to [21] for a survey of the techniques and challenges of privacy-preserving entity resolution in the context of Big Data.

Data Fusion. Data disclosure, i.e., a violation of data protection, can also happen in the fusion step if privacy-preserving noise is accidentally removed by merging the data. Fusion can also affect fairness, when combining data coming from different sources leads to the exclusion of some groups.

In all the above steps transparency is fundamental: we can check the fulfilment of the ethical dimensions only if we can (i) provide explanations of the intermediate results and (ii) describe the provenance of the final data. Unfortunately, this can conflict with data protection, since removing identity information can cause a lack of transparency, which ultimately may lead to unfair outcomes. As with source selection, the integration process – especially the last two steps, where schema information is not present – can benefit from the existence of meta-data, which makes it possible to infer contextual meanings for individual terms and phrases. Fair extraction of meta-data is an exciting topic, as stereotypes and prejudices can often be found in automatically derived word semantics.

C. Knowledge Extraction. An information extraction process presents the user with data organized so as to satisfy their information needs. Here we highlight some ethical challenges for a sample of the many possible information extraction operations.

Search and Query. These are typical data selection tasks. Diversifying the results of information retrieval and recommendation systems has traditionally been used to minimize the dissatisfaction of the average user [4]. However, since these search algorithms are also employed in critical tasks such as job candidate selection or university admissions, diversity has also become a way to ensure the fairness of the selection process [9]. Interestingly, if the integrated data are unfair and over-represent a certain category, diversity can lead to the exclusion of data of that same category.

Aggregation. Many typical decision-support queries, such as GROUP BY queries, might yield biased results: for example, trends appearing in different groups of data can disappear or even be reversed when these groups are combined, leading to incorrect insights. The work of [17] provides a framework for incorporating fairness in aggregated data based on independence tests, for specific aggregations. A future direction is to detect bias in combined data with full-fledged query systems.
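A self-contained Python sketch of this kind of aggregation bias (an instance of Simpson's paradox), with invented admission figures rather than data from the article: each department admits women at a higher rate than men, yet aggregating over departments, as a GROUP BY on the group attribute alone would do, reverses the trend.

```python
# Hypothetical (applicants, admitted) counts per department and group.
data = {
    ("dept_X", "men"):   (80, 48),   # 60% admitted
    ("dept_X", "women"): (20, 14),   # 70% admitted
    ("dept_Y", "men"):   (20, 4),    # 20% admitted
    ("dept_Y", "women"): (80, 24),   # 30% admitted
}

def admission_rate(cells):
    """Overall admission rate for a list of (applicants, admitted) cells."""
    applicants = sum(a for a, _ in cells)
    admitted = sum(x for _, x in cells)
    return admitted / applicants

# Per-department rates: women are admitted at a higher rate in both departments.
for dept in ("dept_X", "dept_Y"):
    for group in ("men", "women"):
        print(dept, group, admission_rate([data[(dept, group)]]))

# Aggregating over departments reverses the trend (men: 0.52, women: 0.38).
print("overall men  ", admission_rate([data[("dept_X", "men")], data[("dept_Y", "men")]]))
print("overall women", admission_rate([data[("dept_X", "women")], data[("dept_Y", "women")]]))
```

Detecting such reversals before an aggregate is reported is broadly the kind of situation that frameworks based on independence tests, such as [17], are meant to address.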
Analytics. Data are typically analyzed by means of statistical, data mining and machine learning techniques, providing encouraging results in decision making, even in data management problems [14]. However, while we are able to understand statistical and data mining models, when using techniques such as deep learning we are still far from fully grasping how a model produces its output. Therefore, explaining such systems has become an important new research area [16], related to the fairness and transparency of the training data as well as of the learning process.

4 RESEARCH DIRECTIONS

In the spirit of the responsible data science initiatives towards a full-fledged data quality perspective on ethics (see, for instance, redasci.org and dataresponsibly.github.io), the key ingredient is shared responsibility. As for any other engineering product, responsibility for data usage is shared by a contractor and a producer: only if the latter is able to provide a quality certification for the various ethical dimensions can the former share the responsibility for improper usage. Similarly, producers should be aware of their responsibility when quality goes below the granted level. While such guarantees are available for many classical dimensions of quality, for instance timeliness, the same does not hold for most of the ethical dimensions. Privacy already has a well-defined way of guaranteeing a privacy level by design: (i) in the knowledge extraction step, thanks to the notion of ε-differential privacy [10], and (ii) in the integration step (see [21] for a survey). The so-called nutritional labels [13] mark a major step towards the idea of a quality certificate for fairness and diversity in the source selection and knowledge extraction steps, but how to preserve these properties throughout the process remains an open problem. Transparency is perhaps the hardest dimension to guarantee, and we believe that the well-known notion of provenance [12] can provide a promising starting point. However, the rise of machine learning and deep learning techniques also for some data integration tasks [8] poses new and exciting challenges in tracking the way integration is achieved [22].

Summing up, the recent literature provides a variety of methods for verifying and enforcing the ethical dimensions. However, they typically apply to the very early steps (such as collection) or the very late steps (such as analytics) of the information extraction process, and very few works study how to preserve ethics by design throughout the process.

5 RELATED WORKS AND CONCLUDING REMARKS

An early attempt to con