Journal of Data and Information Quality (JDIQ): Latest Articles

Ethical Dimensions for Data Quality
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-12-05 DOI: 10.1145/3362121
D. Firmani, L. Tanca, Riccardo Torlone
{"title":"Ethical Dimensions for Data Quality","authors":"D. Firmani, L. Tanca, Riccardo Torlone","doi":"10.1145/3362121","DOIUrl":"https://doi.org/10.1145/3362121","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2019 Association for Computing Machinery. 1936-1955/2019/1-ART1 $15.00 https://doi.org/10.1145/3362121 ACM J. Data Inform. Quality, Vol. 1, No. 1, Article 1. Publication date: January 2019. 1:2 Donatella Firmani, Letizia Tanca, and Riccardo Torlone Transparency is the ability to interpret the information extraction process in order to verify which aspects of the data determine its results. In this context, transparency metrics can use the notions of (i) data provenance [19, 18], by measuring the amount of meta-data describing where the original data come from; (ii) explanation [15], by describing how a result has been obtained. Diversity is the degree to which different kinds of objects are represented in a dataset. Several metrics are proposed in [9]. Ensuring diversity at the beginning of the information extraction process may be useful for enforcing fairness at the end. The diversity dimension may conflict with established dimensions in the Trust cluster of [5], that prioritizes few high-reputation sources. Data Protection concerns the ways to secure data, algorithms and models against unauthorized access. Defining measures can be an elusive goal since, on the one hand, anonymized datasets that are secure in isolation can reveal sensible information when combined [1], and on the other hand, robust techniques such as ε-differential privacy [10] can only describe the privacy impact of specific queries. Data protection is related to the well-established security dimension of [5]. 3 ETHICAL CHALLENGES IN THE INFORMATION EXTRACTION PROCESS We highlight some challenges of complying with the dimensions of the Ethics Cluster, throughout the three phases of the information extraction process mentioned in the Introduction. A. Source Selection. Data can typically come from multiple sources, and it is most desirable that each of these complies with the ethics dimensions described in the previous section. If sources do not comply with (some) dimension individually, we should consider that the really important requirement is that the data that are finally used for analysis or recommendations do. It is thus appropriate to consider ethics for multiple sources in combination, so that the bias towards a certain category in a single source can be eliminated by another source with opposite bias. While for the fairness, transparency and diversity dimensions this is clearly possible, for the privacy we can only act on the single data sources because adding more information can only lower the protection level, or, at most, leave it as it is. Ethics in source selection is tightly related to the transparency of the source, specifically for sources that are themselves aggregators. 
Information lineage is of paramount importance in this case and can be accomplished w","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"37 1","pages":"1 - 5"},"PeriodicalIF":0.0,"publicationDate":"2019-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79060201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
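The point that a biased source can be offset by a source with the opposite bias can be illustrated with a simple diversity score. The sketch below uses normalized Shannon entropy over category labels as a stand-in for the diversity metrics of [9]; the metric choice, the categories, and the source contents are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import log

def diversity(categories):
    """Normalized Shannon entropy over category labels: 0 when a single
    category dominates completely, 1 when all categories are equally
    represented (illustrative stand-in for the metrics of [9])."""
    counts = Counter(categories)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return entropy / log(k)

# Two hypothetical sources, each biased towards one category.
source_a = ["male"] * 90 + ["female"] * 10
source_b = ["male"] * 10 + ["female"] * 90

print(diversity(source_a))             # ~0.47: a single biased source
print(diversity(source_b))             # same bias, opposite direction
print(diversity(source_a + source_b))  # 1.0: the combined data is balanced
```

Either source alone scores low because one category dominates; their union is balanced and scores 1.0, which is the sense in which a source with the opposite bias can restore diversity before the analysis starts.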
Citations: 23
Experience
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-11-16 DOI: 10.1145/3341107
R. Aswani, A. Kar, P. Ilavarasan
{"title":"Experience","authors":"R. Aswani, A. Kar, P. Ilavarasan","doi":"10.1145/3341107","DOIUrl":"https://doi.org/10.1145/3341107","url":null,"abstract":"Governance of misinformation is a serious concern in social media platforms. Based on experiences gathered from different case studies, we offer insights for the policymakers on managing misinformation in social media. These platforms are widely used for not just communication but also content consumption. Managing misinformation is thus a challenge for policymakers and the platforms. This article explores the factors of rapid propagation of misinformation based on our experiences in the domain. An average of about 1.5 million tweets were analysed in each of the three different cases surrounding misinformation. The findings indicate that the tweet emotion and polarity plays a significant role in determining whether the shared content is authentic or not. A deeper exploration highlights that a higher element of surprise combined with other emotions is present in such tweets. Further, the tweets that show case-neutral content often lack the possibilities of virality when it comes to misinformation. The second case explores whether the misinformation is being propagated intentionally by means of the identified fake profiles or it is done by authentic users, which can also be either intentional, for gaining attention, or unintentional, under the assumption that the information is correct. Last, network attributes, including topological analysis, community, and centrality analysis, also catalyze the propagation of misinformation. Policymakers can utilize these findings in this experience study for the governance of misinformation. Tracking and disruption in any one of the identified drivers could act as a control mechanism to manage misinformation propagation.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"214 1","pages":"1 - 18"},"PeriodicalIF":0.0,"publicationDate":"2019-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74761198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Automated Selection and Quality Assessment of Primary Studies
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-11-16 DOI: 10.1145/3356901
Yusra Shakeel, J. Krüger, Ivonne von Nostitz-Wallwitz, G. Saake, Thomas Leich
{"title":"Automated Selection and Quality Assessment of Primary Studies","authors":"Yusra Shakeel, J. Krüger, Ivonne von Nostitz-Wallwitz, G. Saake, Thomas Leich","doi":"10.1145/3356901","DOIUrl":"https://doi.org/10.1145/3356901","url":null,"abstract":"Researchers use systematic literature reviews (SLRs) to synthesize existing evidence regarding a research topic. While being an important means to condense knowledge, conducting an SLR requires a large amount of time and effort. Consequently, researchers have proposed semi-automatic techniques to support different stages of the review process. Two of the most time-consuming tasks are (1) to select primary studies and (2) to assess their quality. In this article, we report an SLR in which we identify, discuss, and synthesize existing techniques of the software-engineering domain that aim to semi-automate these two tasks. Instead of solely providing statistics, we discuss these techniques in detail and compare them, aiming to improve our understanding of supported and unsupported activities. To this end, we identified eight primary studies that report unique techniques that have been published between 2007 and 2016. Most of these techniques rely on text mining and can be beneficial for researchers, but an independent validation using real SLRs is missing for most of them. Moreover, the results indicate the necessity of developing more reliable techniques, providing access to their implementations, and extending their scope to further activities to facilitate the selection and quality assessment of primary studies.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"27 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2019-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80177822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Getting Rid of Data
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-11-11 DOI: 10.1145/3326920
T. Milo
{"title":"Getting Rid of Data","authors":"T. Milo","doi":"10.1145/3326920","DOIUrl":"https://doi.org/10.1145/3326920","url":null,"abstract":"We are experiencing an amazing data-centered revolution. Incredible amounts of data are collected, integrated, and analyzed, leading to key breakthroughs in science and society. This well of knowledge, however, is at a great risk if we do not dispense with some of the data flood. First, the amount of generated data grows exponentially and already at 2020 is expected to be more than twice the available storage. Second, even disregarding storage constraints, uncontrolled data retention risks privacy and security, as recognized, e.g., by the recent EU Data Protection reform. Data disposal policies must be developed to benefit and protect organizations and individuals. Retaining the knowledge hidden in the data while respecting storage, processing, and regulatory constraints is a great challenge. The difficulty stems from the distinct, intricate requirements entailed by each type of constraint, the scale and velocity of data, and the constantly evolving needs. While multiple data sketching, summarization, and deletion techniques were developed to address specific aspects of the problem, we are still very far from a comprehensive solution. Every organization has to battle the same tough challenges with ad hoc solutions that are application-specific and rarely sharable. In this article, we will discuss the logical, algorithmic, and methodological foundations required for the systematic disposal of large-scale data, for constraints enforcement and for the development of applications over the retained information. In particular, we will overview relevant related work, highlighting new research challenges and potential reuse of existing techniques.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"10 1","pages":"1 - 7"},"PeriodicalIF":0.0,"publicationDate":"2019-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86586104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Improving Adaptive Video Streaming through Session Classification
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-09-07 DOI: 10.1145/3309682
Zahaib Akhtar, Anh Minh Le, Yun Seong Nam, Jessica Chen, R. Govindan, Ethan Katz-Bassett, Sanjay G. Rao, Jibin Zhan
{"title":"Improving Adaptive Video Streaming through Session Classification","authors":"Zahaib Akhtar, Anh Minh Le, Yun Seong Nam, Jessica Chen, R. Govindan, Ethan Katz-Bassett, Sanjay G. Rao, Jibin Zhan","doi":"10.1145/3309682","DOIUrl":"https://doi.org/10.1145/3309682","url":null,"abstract":"With internet video gaining increasing popularity and soaring to dominate network traffic, extensive studies are being carried out on how to achieve higher Quality of Experience (QoE) with the delivery of video content. Associated with the chunk-based streaming protocol, Adaptive Bitrate (ABR) algorithms have recently emerged to cope with the diverse and fluctuating network conditions by dynamically adjusting bitrates for future chunks. This inevitably involves predicting the future throughput of a video session. Some of the session features like Internet Service Provider (ISP), geographical location, and so on, could affect network conditions and contain helpful information for this throughput prediction. In this article, we consider how our knowledge about the session features can be utilized to improve ABR quality via customized parameter settings. We present our ABR-independent, QoE-driven, feature-based partition method to classify the logged video sessions so that different parameter settings could be adopted in different situations to reach better quality. A variation of Decision Tree is developed for the classification and has been applied to a sample ABR for evaluation. The experiment shows that our approach can improve the average bitrate of the sample ABR by 36.1% without causing the increase of the rebuffering ratio where 99% of the sessions can get improvement. It can also improve the rebuffering ratio by 87.7% without causing the decrease of the average bitrate, where, among those sessions involved in rebuffering, 82% receives improvement and 18% remains the same.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"524 1","pages":"1 - 29"},"PeriodicalIF":0.0,"publicationDate":"2019-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86888747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Data Transparency with Blockchain and AI Ethics
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-08-21 DOI: 10.1145/3312750
E. Bertino, A. Kundu, Zehra Sura
{"title":"Data Transparency with Blockchain and AI Ethics","authors":"E. Bertino, A. Kundu, Zehra Sura","doi":"10.1145/3312750","DOIUrl":"https://doi.org/10.1145/3312750","url":null,"abstract":"Providing a 360° view of a given data item especially for sensitive data is essential toward not only protecting the data and associated privacy but also assuring trust, compliance, and ethics of the systems that use or manage such data. With the advent of General Data Protection Regulation, California Data Privacy Law, and other such regulatory requirements, it is essential to support data transparency in all such dimensions. Moreover, data transparency should not violate privacy and security requirements. In this article, we put forward a vision for how data transparency would be achieved in a de-centralized fashion using blockchain technology.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"82 1","pages":"1 - 8"},"PeriodicalIF":0.0,"publicationDate":"2019-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88499876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 63
Assessing the Readiness of Academia in the Topic of False and Unverified Information
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-08-21 DOI: 10.1145/3313788
A. E. Fard, S. Cunningham
{"title":"Assessing the Readiness of Academia in the Topic of False and Unverified Information","authors":"A. E. Fard, S. Cunningham","doi":"10.1145/3313788","DOIUrl":"https://doi.org/10.1145/3313788","url":null,"abstract":"The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking financial markets, and influencing crowd decisions in important events. This phenomenon needs to be properly curbed, otherwise it can contaminate other aspects of our social life. In this regard, academia as a key institution against false and unverified information is expected to play a pivotal role. Despite a great deal of research in this arena, the amount of progress by academia is not clear yet. This can lead to misjudgements about the performance of the topic of interest that can ultimately result in wrong science policies regarding academic efforts for quelling false and unverified information. In this research, we address this issue by assessing the readiness of academia in the topic of false and unverified information. To this end, we adopt the emergence framework and measure its dimensions (novelty, growth, coherence, and impact) over more than 21,000 articles published by academia about false and unverified information. Our results show the current body of research has had organic growth so far, which is not promising enough for confronting the problem of false and unverified information. To tackle this problem, we suggest an external push strategy that, compared to the early stages of the topic of interest, reinforces the emergence dimensions and leads to a higher level in every dimension.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"18 3","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2019-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91504309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Different Faces of False
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-08-21 DOI: 10.1145/3339468
M. Babcock, David M. Beskow, Kathleen M. Carley
{"title":"Different Faces of False","authors":"M. Babcock, David M. Beskow, Kathleen M. Carley","doi":"10.1145/3339468","DOIUrl":"https://doi.org/10.1145/3339468","url":null,"abstract":"The task of combating false information online appears daunting, in part due to a public focus on how quickly it can spread and the clamor for automated platform-based interventions. While such concerns can be warranted, threat analysis and intervention design both benefit from a fuller understanding of different types of false information and of the community responses to them. Here, we present a study of the most tweeted about movie ever (Black Panther) in which the spread of false information of four different types is compared to the ad hoc Twitter community response. We find that (1) false information tweets played a small part in the overall conversation, (2) community-based debunking and shaming responses to false posts about attacks at theaters overwhelmed such posts by orders of magnitude, (3) as another form of community response, one type of false narrative (Satire) was used to attack another (Fake Attacks), and (4) the four types of false-information tweets differed in the use of hashtags and in the role played by originating users and responding users. Overall, this work helps to illustrate the importance of investigating “on-the-ground” community responses to fake news and other types of digital false information and to inform identification and intervention design and implementation.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 15"},"PeriodicalIF":0.0,"publicationDate":"2019-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88701510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-08-19 DOI: 10.1145/3317573
Junhua Ding, Xinchuan Li, Xiaojun Kang, V. Gudivada
{"title":"A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning","authors":"Junhua Ding, Xinchuan Li, Xiaojun Kang, V. Gudivada","doi":"10.1145/3317573","DOIUrl":"https://doi.org/10.1145/3317573","url":null,"abstract":"Deep learning has been widely used for extracting values from big data. As many other machine learning algorithms, deep learning requires significant training data. Experiments have shown both the volume and the quality of training data can significantly impact the effectiveness of the value extraction. In some cases, the volume of training data is not sufficiently large for effectively training a deep learning model. In other cases, the quality of training data is not high enough to achieve the optimal performance. Many approaches have been proposed for augmenting training data to mitigate the deficiency. However, whether the augmented data are “fit for purpose” of deep learning is still a question. A framework for comprehensively evaluating the effectiveness of the augmented data for deep learning is still not available. In this article, we first discuss a data augmentation approach for deep learning. The approach includes two components: the first one is to remove noisy data in a dataset using a machine learning based classification to improve its quality, and the second one is to increase the volume of the dataset for effectively training a deep learning model. To evaluate the quality of the augmented data in fidelity, variety, and veracity, a data quality evaluation framework is proposed. We demonstrated the effectiveness of the data augmentation approach and the data quality evaluation framework through studying an automated classification of biology cell images using deep learning. The experimental results clearly demonstrated the impact of the volume and quality of training data to the performance of deep learning and the importance of the data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted for deep learning study in other domains.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"10 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2019-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91285858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Experience
Journal of Data and Information Quality (JDIQ) Pub Date: 2019-08-19 DOI: 10.1145/3328746
M. Bosu, Stephen G. MacDonell
{"title":"Experience","authors":"M. Bosu, Stephen G. MacDonell","doi":"10.1145/3328746","DOIUrl":"https://doi.org/10.1145/3328746","url":null,"abstract":"Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used to both improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"260 1","pages":"1 - 38"},"PeriodicalIF":0.0,"publicationDate":"2019-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74958642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2