{"title":"Experience","authors":"R. Aswani, A. Kar, P. Ilavarasan","doi":"10.1145/3341107","DOIUrl":"https://doi.org/10.1145/3341107","url":null,"abstract":"Governance of misinformation is a serious concern in social media platforms. Based on experiences gathered from different case studies, we offer insights for the policymakers on managing misinformation in social media. These platforms are widely used for not just communication but also content consumption. Managing misinformation is thus a challenge for policymakers and the platforms. This article explores the factors of rapid propagation of misinformation based on our experiences in the domain. An average of about 1.5 million tweets were analysed in each of the three different cases surrounding misinformation. The findings indicate that the tweet emotion and polarity plays a significant role in determining whether the shared content is authentic or not. A deeper exploration highlights that a higher element of surprise combined with other emotions is present in such tweets. Further, the tweets that show case-neutral content often lack the possibilities of virality when it comes to misinformation. The second case explores whether the misinformation is being propagated intentionally by means of the identified fake profiles or it is done by authentic users, which can also be either intentional, for gaining attention, or unintentional, under the assumption that the information is correct. Last, network attributes, including topological analysis, community, and centrality analysis, also catalyze the propagation of misinformation. Policymakers can utilize these findings in this experience study for the governance of misinformation. Tracking and disruption in any one of the identified drivers could act as a control mechanism to manage misinformation propagation.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"214 1","pages":"1 - 18"},"PeriodicalIF":0.0,"publicationDate":"2019-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74761198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Selection and Quality Assessment of Primary Studies","authors":"Yusra Shakeel, J. Krüger, Ivonne von Nostitz-Wallwitz, G. Saake, Thomas Leich","doi":"10.1145/3356901","DOIUrl":"https://doi.org/10.1145/3356901","url":null,"abstract":"Researchers use systematic literature reviews (SLRs) to synthesize existing evidence regarding a research topic. While being an important means to condense knowledge, conducting an SLR requires a large amount of time and effort. Consequently, researchers have proposed semi-automatic techniques to support different stages of the review process. Two of the most time-consuming tasks are (1) to select primary studies and (2) to assess their quality. In this article, we report an SLR in which we identify, discuss, and synthesize existing techniques of the software-engineering domain that aim to semi-automate these two tasks. Instead of solely providing statistics, we discuss these techniques in detail and compare them, aiming to improve our understanding of supported and unsupported activities. To this end, we identified eight primary studies that report unique techniques that have been published between 2007 and 2016. Most of these techniques rely on text mining and can be beneficial for researchers, but an independent validation using real SLRs is missing for most of them. Moreover, the results indicate the necessity of developing more reliable techniques, providing access to their implementations, and extending their scope to further activities to facilitate the selection and quality assessment of primary studies.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"27 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2019-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80177822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Getting Rid of Data","authors":"T. Milo","doi":"10.1145/3326920","DOIUrl":"https://doi.org/10.1145/3326920","url":null,"abstract":"We are experiencing an amazing data-centered revolution. Incredible amounts of data are collected, integrated, and analyzed, leading to key breakthroughs in science and society. This well of knowledge, however, is at a great risk if we do not dispense with some of the data flood. First, the amount of generated data grows exponentially and already at 2020 is expected to be more than twice the available storage. Second, even disregarding storage constraints, uncontrolled data retention risks privacy and security, as recognized, e.g., by the recent EU Data Protection reform. Data disposal policies must be developed to benefit and protect organizations and individuals. Retaining the knowledge hidden in the data while respecting storage, processing, and regulatory constraints is a great challenge. The difficulty stems from the distinct, intricate requirements entailed by each type of constraint, the scale and velocity of data, and the constantly evolving needs. While multiple data sketching, summarization, and deletion techniques were developed to address specific aspects of the problem, we are still very far from a comprehensive solution. Every organization has to battle the same tough challenges with ad hoc solutions that are application-specific and rarely sharable. In this article, we will discuss the logical, algorithmic, and methodological foundations required for the systematic disposal of large-scale data, for constraints enforcement and for the development of applications over the retained information. In particular, we will overview relevant related work, highlighting new research challenges and potential reuse of existing techniques.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"10 1","pages":"1 - 7"},"PeriodicalIF":0.0,"publicationDate":"2019-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86586104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Adaptive Video Streaming through Session Classification","authors":"Zahaib Akhtar, Anh Minh Le, Yun Seong Nam, Jessica Chen, R. Govindan, Ethan Katz-Bassett, Sanjay G. Rao, Jibin Zhan","doi":"10.1145/3309682","DOIUrl":"https://doi.org/10.1145/3309682","url":null,"abstract":"With internet video gaining increasing popularity and soaring to dominate network traffic, extensive studies are being carried out on how to achieve higher Quality of Experience (QoE) with the delivery of video content. Associated with the chunk-based streaming protocol, Adaptive Bitrate (ABR) algorithms have recently emerged to cope with the diverse and fluctuating network conditions by dynamically adjusting bitrates for future chunks. This inevitably involves predicting the future throughput of a video session. Some of the session features like Internet Service Provider (ISP), geographical location, and so on, could affect network conditions and contain helpful information for this throughput prediction. In this article, we consider how our knowledge about the session features can be utilized to improve ABR quality via customized parameter settings. We present our ABR-independent, QoE-driven, feature-based partition method to classify the logged video sessions so that different parameter settings could be adopted in different situations to reach better quality. A variation of Decision Tree is developed for the classification and has been applied to a sample ABR for evaluation. The experiment shows that our approach can improve the average bitrate of the sample ABR by 36.1% without causing the increase of the rebuffering ratio where 99% of the sessions can get improvement. It can also improve the rebuffering ratio by 87.7% without causing the decrease of the average bitrate, where, among those sessions involved in rebuffering, 82% receives improvement and 18% remains the same.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"524 1","pages":"1 - 29"},"PeriodicalIF":0.0,"publicationDate":"2019-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86888747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Transparency with Blockchain and AI Ethics","authors":"E. Bertino, A. Kundu, Zehra Sura","doi":"10.1145/3312750","DOIUrl":"https://doi.org/10.1145/3312750","url":null,"abstract":"Providing a 360° view of a given data item especially for sensitive data is essential toward not only protecting the data and associated privacy but also assuring trust, compliance, and ethics of the systems that use or manage such data. With the advent of General Data Protection Regulation, California Data Privacy Law, and other such regulatory requirements, it is essential to support data transparency in all such dimensions. Moreover, data transparency should not violate privacy and security requirements. In this article, we put forward a vision for how data transparency would be achieved in a de-centralized fashion using blockchain technology.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"82 1","pages":"1 - 8"},"PeriodicalIF":0.0,"publicationDate":"2019-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88499876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the Readiness of Academia in the Topic of False and Unverified Information","authors":"A. E. Fard, S. Cunningham","doi":"10.1145/3313788","DOIUrl":"https://doi.org/10.1145/3313788","url":null,"abstract":"The spread of false and unverified information has the potential to inflict damage by harming the reputation of individuals or organisations, shaking financial markets, and influencing crowd decisions in important events. This phenomenon needs to be properly curbed, otherwise it can contaminate other aspects of our social life. In this regard, academia as a key institution against false and unverified information is expected to play a pivotal role. Despite a great deal of research in this arena, the amount of progress by academia is not clear yet. This can lead to misjudgements about the performance of the topic of interest that can ultimately result in wrong science policies regarding academic efforts for quelling false and unverified information. In this research, we address this issue by assessing the readiness of academia in the topic of false and unverified information. To this end, we adopt the emergence framework and measure its dimensions (novelty, growth, coherence, and impact) over more than 21,000 articles published by academia about false and unverified information. Our results show the current body of research has had organic growth so far, which is not promising enough for confronting the problem of false and unverified information. To tackle this problem, we suggest an external push strategy that, compared to the early stages of the topic of interest, reinforces the emergence dimensions and leads to a higher level in every dimension.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"18 3","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2019-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91504309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Different Faces of False","authors":"M. Babcock, David M. Beskow, Kathleen M. Carley","doi":"10.1145/3339468","DOIUrl":"https://doi.org/10.1145/3339468","url":null,"abstract":"The task of combating false information online appears daunting, in part due to a public focus on how quickly it can spread and the clamor for automated platform-based interventions. While such concerns can be warranted, threat analysis and intervention design both benefit from a fuller understanding of different types of false information and of the community responses to them. Here, we present a study of the most tweeted about movie ever (Black Panther) in which the spread of false information of four different types is compared to the ad hoc Twitter community response. We find that (1) false information tweets played a small part in the overall conversation, (2) community-based debunking and shaming responses to false posts about attacks at theaters overwhelmed such posts by orders of magnitude, (3) as another form of community response, one type of false narrative (Satire) was used to attack another (Fake Attacks), and (4) the four types of false-information tweets differed in the use of hashtags and in the role played by originating users and responding users. Overall, this work helps to illustrate the importance of investigating “on-the-ground” community responses to fake news and other types of digital false information and to inform identification and intervention design and implementation.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 15"},"PeriodicalIF":0.0,"publicationDate":"2019-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88701510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning","authors":"Junhua Ding, Xinchuan Li, Xiaojun Kang, V. Gudivada","doi":"10.1145/3317573","DOIUrl":"https://doi.org/10.1145/3317573","url":null,"abstract":"Deep learning has been widely used for extracting values from big data. As many other machine learning algorithms, deep learning requires significant training data. Experiments have shown both the volume and the quality of training data can significantly impact the effectiveness of the value extraction. In some cases, the volume of training data is not sufficiently large for effectively training a deep learning model. In other cases, the quality of training data is not high enough to achieve the optimal performance. Many approaches have been proposed for augmenting training data to mitigate the deficiency. However, whether the augmented data are “fit for purpose” of deep learning is still a question. A framework for comprehensively evaluating the effectiveness of the augmented data for deep learning is still not available. In this article, we first discuss a data augmentation approach for deep learning. The approach includes two components: the first one is to remove noisy data in a dataset using a machine learning based classification to improve its quality, and the second one is to increase the volume of the dataset for effectively training a deep learning model. To evaluate the quality of the augmented data in fidelity, variety, and veracity, a data quality evaluation framework is proposed. We demonstrated the effectiveness of the data augmentation approach and the data quality evaluation framework through studying an automated classification of biology cell images using deep learning. The experimental results clearly demonstrated the impact of the volume and quality of training data to the performance of deep learning and the importance of the data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted for deep learning study in other domains.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"10 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2019-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91285858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience","authors":"M. Bosu, Stephen G. MacDonell","doi":"10.1145/3328746","DOIUrl":"https://doi.org/10.1145/3328746","url":null,"abstract":"Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used to both improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"260 1","pages":"1 - 38"},"PeriodicalIF":0.0,"publicationDate":"2019-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74958642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}