{"title":"Content-Aware Trust Propagation Toward Online Review Spam Detection","authors":"Hao Xue, Qiaozhi Wang, Bo Luo, Hyunjin Seo, Fengjun Li","doi":"10.1145/3305258","DOIUrl":"https://doi.org/10.1145/3305258","url":null,"abstract":"With the increasing popularity of online review systems, a large volume of user-generated content becomes available to help people make reasonable judgments about the quality of services and products from unknown providers. However, these platforms are frequently abused since fraudulent information can be freely inserted by potentially malicious users without validation. Consequently, online review systems become targets of individual and professional spammers, who insert deceptive reviews by manipulating the rating and/or the content of the reviews. In this work, we propose a review spamming detection scheme based on the deviation between the aspect-specific opinions extracted from individual reviews and the aggregated opinions on the corresponding aspects. In particular, we model the influence on the trustworthiness of the user due to his opinion deviations from the majority in the form of a deviation-based penalty, and integrate this penalty into a three-layer trust propagation framework to iteratively compute the trust scores for users, reviews, and review targets, respectively. The trust scores are effective indicators of spammers, since they reflect the overall deviation of a user from the aggregated aspect-specific opinions across all targets and all aspects. 
Experiments on the dataset collected from Yelp.com show that the proposed detection scheme based on aspect-specific content-aware trust propagation is able to measure users’ trustworthiness based on opinions expressed in reviews.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"3 1","pages":"1 - 31"},"PeriodicalIF":0.0,"publicationDate":"2019-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73968581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Fact-Checking Using Context and Discourse Information","authors":"Pepa Atanasova, Preslav Nakov, Lluís Màrquez i Villodre, Alberto Barrón-Cedeño, Georgi Karadzhov, Tsvetomila Mihaylova, Mitra Mohtarami, James R. Glass","doi":"10.1145/3297722","DOIUrl":"https://doi.org/10.1145/3297722","url":null,"abstract":"We study the problem of automatic fact-checking, paying special attention to the impact of contextual and discourse information. We address two related tasks: (i) detecting check-worthy claims and (ii) fact-checking claims. We develop supervised systems based on neural networks, kernel-based support vector machines, and combinations thereof, which make use of rich input representations in terms of discourse cues and contextual features. For the check-worthiness estimation task, we focus on political debates, and we model the target claim in the context of the full intervention of a participant and the previous and following turns in the debate, taking into account contextual meta information. For the fact-checking task, we focus on answer verification in a community forum, and we model the veracity of the answer with respect to the entire question–answer thread in which it occurs as well as with respect to other related posts from the entire forum. 
We develop annotated datasets for both tasks and we run an extensive experimental evaluation, confirming that both types of information, but especially contextual features, play an important role.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"4 1","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2019-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90290747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Combating Digital Misinformation and Disinformation","authors":"Naeemul Hassan, Chengkai Li, Jun Yang, Cong Yu","doi":"10.1145/3321484","DOIUrl":"https://doi.org/10.1145/3321484","url":null,"abstract":"We are delighted to present this special issue of the Journal of Data and Information Quality (ACM JDIQ) on Combating Digital Misinformation and Disinformation. This issue presents an overview of innovative research primarily at the intersection of information credibility, machine learning, and data science, from theory to practice, with a focus on combating misinformation and disinformation. The spread of misinformation and disinformation is one of the most serious challenges facing the news industry, and a threat to democracy worldwide. The problem has reached an unprecedented level via social media, where content can be created and disseminated to a large audience at little to no cost and revenues are driven by clicks. Researchers from multiple disciplines have proposed various strategies, built automated and semiautomated systems [1, 3], and recommended policy changes across the media ecosystem [2, 4]. Recently, researchers have also explored how artificial intelligence techniques, particularly machine learning and natural language processing, can be leveraged to combat falsehoods online. In this special issue of JDIQ, we provide a representative collection of insightful articles at the intersection of data quality and credibility, from theory to practice, with a focus on improvements in veracity and value. The articles went through a rigorous review procedure involving at least three expert reviewers for each article. After two rounds of review, we selected five articles that made contributions to both research and practice. 
Zannettou et al., in “The Web of False Information: Rumors, Fake News, Hoaxes, Clickbait, and Various Other Shenanigans,” provide a typology of false information content on the Web and survey the latest research directions. They identify several lines of work in the false information ecosystem. In particular, they survey research from the perspectives of false information propagation, perception, and identification. Then, the authors turn specifically to the spread of false information in the political domain and investigate the velocity and consequences of that spread in communities. Finally, the authors delineate several future research directions that can help understand and mitigate this misinformation problem.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 3"},"PeriodicalIF":0.0,"publicationDate":"2019-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91269194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augmenting Data Quality through High-Precision Gender Categorization","authors":"Daniel Müller, Pratiksha Jain, Yieh-Funk Te","doi":"10.1145/3297720","DOIUrl":"https://doi.org/10.1145/3297720","url":null,"abstract":"Mappings of first name to gender have been widely recognized as a critical tool for the completion, study, and validation of data records in a range of areas. In this study, we investigate how organizations with large databases of existing entities can create their own mappings between first names and gender and how these mappings can be improved and utilized. To this end, we first explore a dataset with demographic information on more than 4 million people, which was provided by a car insurance company. Then, we study how naming conventions have changed over time and how they differ by nationality. Next, we build a probabilistic first-name-to-gender mapping and augment the mapping by adding nationality and decade of birth to improve the mapping's performance. We test our mapping in two-label and three-label settings and further validate our mapping by categorizing patent filings by gender of the inventor. We compare the results with previous studies’ outcomes and find that our mapping produces high-precision results. We validate that the additional information of nationality and year of birth improves the precision scores of name-to-gender mappings. 
Therefore, the proposed approach constitutes an efficient process for improving the data quality of organizations’ records, if the gender attribute is missing or unreliable.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"188 12 1","pages":"1 - 18"},"PeriodicalIF":0.0,"publicationDate":"2019-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78875723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Patterns for Fact Checking in Knowledge Graphs","authors":"Peng Lin, Qi Song, Yinghui Wu, Jiaxing Pi","doi":"10.1145/3286488","DOIUrl":"https://doi.org/10.1145/3286488","url":null,"abstract":"This article presents a new framework that incorporates graph patterns to support fact checking in knowledge graphs. Our method discovers discriminant graph patterns to construct classifiers for fact prediction. First, we propose a class of graph fact checking rules (GFCs). A GFC incorporates graph patterns that best distinguish true and false facts of generalized fact statements. We provide statistical measures to characterize useful patterns that are both discriminant and diversified. Second, we show that it is feasible to discover GFCs in large graphs with optimality guarantees. We develop an algorithm that performs localized search to generate a stream of graph patterns, and dynamically assembles the best GFCs from multiple GFC sets, where each set ensures quality scores within certain ranges. The algorithm guarantees a (1/2−ϵ) approximation when it (early) terminates. We also develop a space-efficient alternative that dynamically spawns prioritized patterns with best marginal gains to the verified GFCs. It guarantees a (1−1/e) approximation. Both strategies guarantee a bounded time cost independent of the size of the underlying graph. Third, to support fact checking, we develop two classifiers, which make use of top-ranked GFCs as predictive rules or instance-level features of the pattern matches induced by GFCs, respectively. 
Using real-world data, we experimentally verify the efficiency and the effectiveness of GFC-based techniques for fact checking in knowledge graphs and verify their application in knowledge exploration and news prediction.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"104 1","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2019-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76020747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparency, Fairness, Data Protection, Neutrality","authors":"S. Abiteboul, Julia Stoyanovich","doi":"10.1145/3310231","DOIUrl":"https://doi.org/10.1145/3310231","url":null,"abstract":"The data revolution continues to transform every sector of science, industry, and government. Due to the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly—in accordance with laws and ethical norms. In this article, we discuss three recent regulatory frameworks: the European Union’s General Data Protection Regulation (GDPR), the New York City Automated Decisions Systems (ADS) Law, and the Net Neutrality principle, which aim to protect the rights of individuals who are impacted by data collection and analysis. These frameworks are prominent examples of a global trend: Governments are starting to recognize the need to regulate data-driven algorithmic technology. Our goal in this article is to bring these regulatory frameworks to the attention of the data management community and to underscore the technical challenges they raise and that we, as a community, are well-equipped to address. The main takeaway of this article is that legal and ethical norms cannot be incorporated into data-driven systems as an afterthought. 
Rather, we must think in terms of responsibility by design, viewing it as a systems requirement.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 9"},"PeriodicalIF":0.0,"publicationDate":"2019-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83166713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience","authors":"Christian Sillaber, Andrea Mussmann, R. Breu","doi":"10.1145/3297721","DOIUrl":"https://doi.org/10.1145/3297721","url":null,"abstract":"Governance, risk, and compliance (GRC) managers often struggle to document the current state of their organizations. This is due to the complexity of their IS landscape, the complex regulatory and organizational environment, and the frequent changes to both. GRC tools seek to support them by integrating existing information sources. However, a comprehensive analysis of how the data is managed in such tools, as well as the impact of data quality, is still missing. To build a basis of empirical data, we conducted a series of interviews with information security managers responsible for GRC management activities in their organizations. The results of a qualitative content analysis of these interviews suggest that decision makers largely depend on high-quality documentation but struggle to maintain their documentation at the required level for long periods of time. This work discusses factors affecting the quality of GRC data and information and provides insights into approaches implemented by organizations to analyze, improve, and maintain the quality of their GRC data and information.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"14 1","pages":"1 - 14"},"PeriodicalIF":0.0,"publicationDate":"2019-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72674418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization","authors":"A. Lazar, Ling Jin, C. Spurlock, Kesheng Wu, A. Sim, A. Todd","doi":"10.1145/3301294","DOIUrl":"https://doi.org/10.1145/3301294","url":null,"abstract":"The goal of this work is to investigate the impact of missing values in clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering on life course trajectories, is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were investigated previously on single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, together with dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. 
We demonstrate the use of t-distributed stochastic neighbor embedding to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"17 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2019-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87224688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dependencies for Graphs","authors":"W. Fan","doi":"10.1145/3310230","DOIUrl":"https://doi.org/10.1145/3310230","url":null,"abstract":"What are graph dependencies? What do we need them for? What new challenges do they introduce? This article tackles these questions. It aims to incite curiosity and interest in this emerging area of research.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"109 1","pages":"1 - 12"},"PeriodicalIF":0.0,"publicationDate":"2019-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77230194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Classification Quality in Uncertain Graphs","authors":"Michele Dallachiesa, C. Aggarwal, Themis Palpanas","doi":"10.1145/3242095","DOIUrl":"https://doi.org/10.1145/3242095","url":null,"abstract":"In many real applications that use and analyze networked data, the links in the network graph may be erroneous or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, then the classification accuracy in the underlying network may be affected adversely. In this article, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection and show that the incorporation of uncertainty in the classification process as a first-class citizen is beneficial. We experimentally evaluate the proposed approach using different real data sets and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 20"},"PeriodicalIF":0.0,"publicationDate":"2019-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83126227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}