{"title":"Scalable Methods for Measuring the Connectivity and Quality of Large Numbers of Linked Datasets","authors":"M. Mountantonakis, Yannis Tzitzikas","doi":"10.1145/3165713","DOIUrl":"https://doi.org/10.1145/3165713","url":null,"abstract":"Although the ultimate objective of Linked Data is linking and integration, it is not currently evident how connected the current Linked Open Data (LOD) cloud is. In this article, we focus on methods, supported by special indexes and algorithms, for performing measurements related to the connectivity of more than two datasets that are useful in various tasks including (a) Dataset Discovery and Selection; (b) Object Coreference, i.e., for obtaining complete information about a set of entities, including provenance information; (c) Data Quality Assessment and Improvement, i.e., for assessing the connectivity between any set of datasets and monitoring their evolution over time, as well as for estimating data veracity; (d) Dataset Visualizations; and various other tasks. Since it would be prohibitively expensive to perform all these measurements in a naïve way, in this article, we introduce indexes (and their construction algorithms) that can speed up such tasks. In brief, we introduce (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the owl:sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), and, finally, (iv) two lattice-based incremental algorithms for speeding up the computation of the intersection of URIs of any set of datasets. For enhancing scalability, we propose parallel index construction algorithms and parallel lattice-based incremental algorithms, we evaluate the achieved speedup using either a single machine or a cluster of machines, and we provide insights regarding the factors that affect efficiency. Finally, we report measurements about the connectivity of the (billion triples-sized) LOD cloud that have never been carried out so far.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"124 1","pages":"1 - 49"},"PeriodicalIF":0.0,"publicationDate":"2018-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78178169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Requirements for Data Quality Metrics","authors":"Bernd Heinrich, Diana Hristova, Mathias Klier, Alexander Schiller, Michael Szubartowicz","doi":"10.1145/3148238","DOIUrl":"https://doi.org/10.1145/3148238","url":null,"abstract":"Data quality and especially the assessment of data quality have been intensively discussed in research and practice alike. To support an economically oriented management of data quality and decision making under uncertainty, it is essential to assess the data quality level by means of well-founded metrics. However, if not adequately defined, these metrics can lead to wrong decisions and economic losses. Therefore, based on a decision-oriented framework, we present a set of five requirements for data quality metrics. These requirements are relevant for a metric that aims to support an economically oriented management of data quality and decision making under uncertainty. We further demonstrate the applicability and efficacy of these requirements by evaluating five data quality metrics for different data quality dimensions. Moreover, we discuss practical implications when applying the presented requirements.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"35 1","pages":"1 - 32"},"PeriodicalIF":0.0,"publicationDate":"2018-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90116160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience","authors":"Kyu Han Koh, Eric Fouh, Mohammed F. Farghally, Hossameldin Shahin, C. Shaffer","doi":"10.1145/3148240","DOIUrl":"https://doi.org/10.1145/3148240","url":null,"abstract":"We present lessons learned related to data collection and analysis from 5 years of experience with the eTextbook system OpenDSA. The use of such cyberlearning systems is expanding rapidly in both formal and informal educational settings. Although the precise issues related to any such project are idiosyncratic based on the data collection technology and goals of the project, certain types of data collection problems will be common. We begin by describing the nature of the data transmitted between the student’s client machine and the database server, and our initial database schema for storing interaction log data. We describe many problems that we encountered, with the nature of the problems categorized as syntactic-level data collection issues, issues with relating events to users, or issues with tracking users over time. Relating events to users and tracking the time spent on tasks are both prerequisites to converting syntactic-level interaction streams to semantic-level behavior needed for higher-order analysis of the data. Finally, we describe changes made to our database schema that helped to resolve many of the issues that we had encountered. These changes help advance our ultimate goal of encouraging a change from ineffective learning behavior by students to more productive behavior.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"25 1","pages":"1 - 10"},"PeriodicalIF":0.0,"publicationDate":"2018-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78228987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges in Enabling Quality of Analytics in the Cloud","authors":"Hong Linh Truong, A. Murguzur, Erica Y. Yang","doi":"10.1145/3138806","DOIUrl":"https://doi.org/10.1145/3138806","url":null,"abstract":"Currently, domain scientists (DSs) face challenges in managing quality across multiple data analytics contexts (DACs). We identify and define quality of analytics (QoA) in dynamic and diverse environments, e.g., based on cloud computing resources for big data sources, as a composition of quality of data (data quality), performance, and cost, to name just the main factors. QoA is a complex matter and not just about quality of data or performance, which are typically considered separately when evaluating existing data analytics frameworks/algorithms. Frequently, the DS needs to utilize multiple frameworks to run different (sub)analytics, and, at the same time, the sub-analytics executed in these frameworks exchange inputs and outputs each other. In these cases, we observe different DACs, where a DAC refers to a particular situation in which the DS works with a specific framework to run a sub-analytics carried out by pipeline(s) or tasks in a pipeline. Each DAC has a set of interactions in the following categories:","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"6 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2018-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82724178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validating Data Quality Actions in Scoring Processes","authors":"C. Cappiello, C. Cerletti, C. Fratto, B. Pernici","doi":"10.1145/3141248","DOIUrl":"https://doi.org/10.1145/3141248","url":null,"abstract":"Data quality has gained momentum among organizations upon the realization that poor data quality might cause failures and/or inefficiencies, thus compromising business processes and application results. However, enterprises often adopt data quality assessment and improvement methods based on practical and empirical approaches without conducting a rigorous analysis of the data quality issues and outcome of the enacted data quality improvement practices. In particular, data quality management, especially the identification of the data quality dimensions to be monitored and improved, is performed by knowledge workers on the basis of their skills and experience. Control methods are therefore designed on the basis of expected and evident quality problems; thus, these methods may not be effective in dealing with unknown and/or unexpected problems. This article aims to provide a methodology, based on fault injection, for validating the data quality actions used by organizations. We show how it is possible to check whether the adopted techniques properly monitor the real issues that may damage business processes. At this stage, we focus on scoring processes, i.e., those in which the output represents the evaluation or ranking of a specific object. We show the effectiveness of our proposal by means of a case study in the financial risk management area.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"48 1","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2018-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88882387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editor-in-Chief (January 2014-May 2017) Farewell Report","authors":"L. Raschid","doi":"10.1145/3143313","DOIUrl":"https://doi.org/10.1145/3143313","url":null,"abstract":"","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"140 1","pages":"1 - 2"},"PeriodicalIF":0.0,"publicationDate":"2017-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80381236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Foreword from the New JDIQ Editor-in-Chief","authors":"T. Catarci","doi":"10.1145/3143316","DOIUrl":"https://doi.org/10.1145/3143316","url":null,"abstract":"","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"21 1","pages":"1 - 2"},"PeriodicalIF":0.0,"publicationDate":"2017-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87262206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Quality Challenges in Social Spam Research","authors":"Nour El-Mawass, Saad S. Alaboodi","doi":"10.1145/3090057","DOIUrl":"https://doi.org/10.1145/3090057","url":null,"abstract":"","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"45 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2017-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90582679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cluster-Based Quality-Aware Adaptive Data Compression for Streaming Data","authors":"Aseel Basheer, Kewei Sha","doi":"10.1145/3122863","DOIUrl":"https://doi.org/10.1145/3122863","url":null,"abstract":"Wireless sensor networks (WSNs) are widely applied in data collection applications. Energy efficiency is one of the most important design goals of WSNs. In this article, we examine the tradeoffs between the energy efficiency and the data quality. First, four attributes used to evaluate data quality are formally defined. Then, we propose a novel data compression algorithm, Quality-Aware Adaptive data Compression (QAAC), to reduce the amount of data communication to save energy. QAAC utilizes an adaptive clustering algorithm to build clusters from dataset; then a code for each cluster is generated and stored in a Huffman encoding tree. The encoding algorithm encodes the original dataset based on the Haffman encoding tree. An improvement algorithm is also designed to reduce the information loss when data are compressed. After the encoded data, the Huffman encoding tree and parameters used in the improvement algorithm have been received at the sink, a decompression algorithm is used to retrieve the approximation of the original dataset. The performance evaluation shows that QAAC is efficient and achieves a much higher compression ratio than lossy and lossless compression algorithms, while it has much smaller information loss than lossy compression algorithms.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"70 1","pages":"1 - 33"},"PeriodicalIF":0.0,"publicationDate":"2017-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80940529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}