{"title":"Visual Interactive Creation, Customization, and Analysis of Data Quality Metrics","authors":"C. Bors, T. Gschwandtner, Simone Kriglstein, S. Miksch, M. Pohl","doi":"10.1145/3190578","DOIUrl":"https://doi.org/10.1145/3190578","url":null,"abstract":"During data preprocessing, analysts spend a significant part of their time and effort profiling the quality of the data along with cleansing and transforming the data for further analysis. While quality metrics—ranging from general to domain-specific measures—support assessment of the quality of a dataset, there are hardly any approaches to visually support the analyst in customizing and applying such metrics. Yet, visual approaches could facilitate users’ involvement in data quality assessment. We present MetricDoc, an interactive environment for assessing data quality that provides customizable, reusable quality metrics in combination with immediate visual feedback. Moreover, we provide an overview visualization of these quality metrics along with error visualizations that facilitate interactive navigation of the data to determine the causes of quality issues present in the data. 
In this article, we describe the architecture, design, and evaluation of MetricDoc, which underwent several design cycles, including heuristic evaluation and expert reviews as well as a focus group with data quality, human-computer interaction, and visual analytics experts.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"12 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2018-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84487131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Reading of Biomedical Data Dictionaries","authors":"N. Ashish, Arihant Patawari","doi":"10.1145/3177874","DOIUrl":"https://doi.org/10.1145/3177874","url":null,"abstract":"This article describes an approach for the automated reading of biomedical data dictionaries. Automated reading is the process of extracting element details for each of the data elements from a data dictionary in a document format (such as PDF) to a completely structured representation. A structured representation is essential if the data dictionary metadata are to be used in applications such as data integration and also in evaluating the quality of the associated data. We present an approach and implemented solution for the problem, considering different formats of data dictionaries. We have a particular focus on the most challenging format with a machine-learning classification solution to the problem using conditional random field classifiers. We present an evaluation using several actual data dictionaries, demonstrating the effectiveness of our approach.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"180 1","pages":"1 - 20"},"PeriodicalIF":0.0,"publicationDate":"2018-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76045729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InfoClean","authors":"Fei Chiang, Dhruv Gairola","doi":"10.1145/3190577","DOIUrl":"https://doi.org/10.1145/3190577","url":null,"abstract":"Data quality has become a pervasive challenge for organizations as they wrangle with large, heterogeneous datasets to extract value. Given the proliferation of sensitive and confidential information, it is crucial to consider data privacy concerns during the data cleaning process. For example, in medical database applications, varying levels of privacy are enforced across the attribute values. Attributes such as a patient’s country or city of residence may be less sensitive than the patient’s prescribed medication. Traditional data cleaning techniques assume the data is openly accessible, without considering the differing levels of information sensitivity. In this work, we take the first steps toward a data cleaning model that integrates privacy as part of the data cleaning process. We present a privacy-aware data cleaning framework that differentiates the information content among the attribute values during the data cleaning process to resolve data inconsistencies while minimizing the amount of information disclosed. Our data repair algorithm includes a set of data disclosure operations that considers the information content of the underlying attribute values, while maximizing data utility. 
Our evaluation using real datasets shows that our algorithm scales well, and achieves improved performance and comparable repair accuracy against existing data cleaning solutions.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"29 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78796294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenge Paper","authors":"A. Gal, Arik Senderovich, M. Weidlich","doi":"10.1145/3165712","DOIUrl":"https://doi.org/10.1145/3165712","url":null,"abstract":"Queues represent a setting where agents compete over a scarce resource: People wait for public transportation, jobs wait to be processed, patients await treatment at a hospital, and so on. While data logs record many aspects of our lives, information about queues is rarely recorded. Queue mining (Senderovich et al. 2015) is the process of revealing queue information from data logs for the purpose of discovering queueing models, conformance checking, and optimization. As such, queue mining enables bottleneck detection and delay prediction (Gal et al. 2017). A queueing network is the most general form of a queueing model, represented as a directed graph with nodes being the queueing stations (corresponding to types of resources), edges corresponding to routing between stations, and node attributes corresponding to station dynamics (e.g., arrival patterns, service time distributions, station capacity, and service policy—for example, first-come first-served). Customers arrive into a queueing station, wait (enqueued) before being served by the station, and then leave to the next station (or exit the system). Queueing networks are often assumed to have a single customer type and an immediate Markovian routing (after completion at a station a customer appears in the next station with some probability). 
Also, simple queueing networks","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"219 1","pages":"1 - 5"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82731396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPMDL","authors":"M. Alshayeb, Yasser Shaaban, Jarallah AlGhamdi","doi":"10.1145/3185049","DOIUrl":"https://doi.org/10.1145/3185049","url":null,"abstract":"Software metrics are becoming more acceptable measures for software quality assessment. However, there is no standard form to represent metric definitions, which would be useful for metrics exchange and customization. In this article, we propose the Software Product Metrics Definition Language (SPMDL). We develop an XML-based description language to define software metrics in a precise and reusable form. Metric definitions in SPMDL are based on meta-models extracted from either source code or design artifacts, such as the Dagstuhl Middle Meta-model, with support for various abstraction levels. The language defines several flexible computation mechanisms, such as extended Object Constraint Language queries and predefined graph operations on the meta-model. SPMDL provides an unambiguous description of the metric definition; it is also easy to use and is extensible.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"140 1","pages":"1 - 30"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76387610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience","authors":"Fathoni A. Musyaffa, C. Engels, Maria-Esther Vidal, F. Orlandi, Sören Auer","doi":"10.1145/3190576","DOIUrl":"https://doi.org/10.1145/3190576","url":null,"abstract":"Public administrations are continuously publishing open data, increasing the amount of government open data over time. The published data includes budgets and spending as part of fiscal data; publishing these data is an important part of transparent and accountable governance. However, open fiscal data should also meet open data publication guidelines. When the requirements in data guidelines are not met, data analysis over published datasets cannot be performed effectively. In this article, we present Open Fiscal Data Publication (OFDP), a framework to assess the quality of open fiscal datasets. We also report an extensive assessment of open fiscal data, the common data quality issues found, and guidelines for publishing open fiscal data. We studied and surveyed the main quality factors for open fiscal datasets. Moreover, the collected quality factors have been scored, according to the results of a questionnaire, within the OFDP assessment framework. We gather and comprehensively analyze a representative set of 77 fiscal datasets from several public administrations across different regions and at different levels (e.g., supranational, national, municipality). We characterize quality issues commonly arising in these datasets. Our assessment shows that many quality factors in fiscal data publication still need to be addressed before the data can be analyzed effectively. 
Our proposed guidelines allow for publishing open fiscal data where these quality issues are avoided.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"12 1","pages":"1 - 10"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77147680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Web of False Information","authors":"Savvas Zannettou, Michael Sirivianos, Jeremy Blackburn, N. Kourtellis","doi":"10.1145/3309699","DOIUrl":"https://doi.org/10.1145/3309699","url":null,"abstract":"A new era of Information Warfare has arrived. Various actors, including state-sponsored ones, are weaponizing information on Online Social Networks to run false-information campaigns with targeted manipulation of public opinion on specific topics. These false-information campaigns can have dire consequences for the public: shifting their opinions and actions, especially with respect to critical world events like major elections. Evidently, the problem of false information on the Web is a crucial one that needs increased public awareness as well as immediate attention from law enforcement agencies, public institutions, and, in particular, the research community. In this article, we take a step in this direction by providing a typology of the Web’s false-information ecosystem, composed of various types of false information, actors, and their motives. We report a comprehensive overview of existing research on the false-information ecosystem by identifying several lines of work: (1) how the public perceives false information; (2) understanding the propagation of false information; (3) detecting and containing false information on the Web; and (4) false information on the political stage. In this work, we pay particular attention to political false information as: (1) it can have dire consequences for the community (e.g., when election results are manipulated) and (2) previous work shows that this type of false information propagates faster and further than other types of false information. 
Finally, for each of these lines of work, we report several future research directions that can help us better understand and mitigate the emerging problem of false-information dissemination on the Web.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"278 1","pages":"1 - 37"},"PeriodicalIF":0.0,"publicationDate":"2018-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82812774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Veracity Assessment in RDF Knowledge Bases","authors":"Diego Esteves, A. Rula, Aniketh Janardhan Reddy, Jens Lehmann","doi":"10.1145/3177873","DOIUrl":"https://doi.org/10.1145/3177873","url":null,"abstract":"Among different characteristics of knowledge bases, data quality is one of the most relevant to maximize the benefits of the provided information. Knowledge base quality assessment poses a number of big data challenges such as high volume, variety, velocity, and veracity. In this article, we focus on answering questions related to the assessment of the veracity of facts through Deep Fact Validation (DeFacto), a triple validation framework designed to assess facts in RDF knowledge bases. Despite current developments in the research area, the underlying framework faces many challenges. This article pinpoints and discusses these issues and conducts a thorough analysis of its pipeline, aiming at reducing the error propagation through its components. Furthermore, we discuss recent developments related to this fact validation as well as describing advantages and drawbacks of state-of-the-art models. As a result of this exploratory analysis, we give insights and directions toward a better architecture to tackle the complex task of fact-checking in knowledge bases.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"4 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2018-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78295673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases","authors":"Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, J. Zobel, Karin M. Verspoor","doi":"10.1145/3131611","DOIUrl":"https://doi.org/10.1145/3131611","url":null,"abstract":"The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale to high volumes of data, heuristic approaches have been employed, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. 
This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"5 1","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2018-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85091079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}