Journal of Data and Information Quality (JDIQ): Latest Articles

Social-minded Measures of Data Quality
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-07-16 DOI: 10.1145/3404193
E. Pitoura
Abstract: For decades, research in data-driven algorithmic systems has focused on improving efficiency (making data access faster and lighter) and effectiveness (providing relevant results to users). As data-driven decision making becomes prevalent, there is an increasing need for new measures for evaluating the quality of data systems. In this article, we make the case for social-minded measures, that is, measures that evaluate the effect of a system in society. We focus on three such measures, namely diversity (ensuring that all relevant aspects are represented), lack of bias (processing data without unjustifiable concentration on a particular side), and fairness (non-discriminating treatment of data and people).
Pages: 1-8
Citations: 17
Data Preparation for Duplicate Detection
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-06-13 DOI: 10.1145/3377878
Ioannis K. Koumarelas, Lan Jiang, Felix Naumann
Abstract: Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
Pages: 1-24
Citations: 12
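The second phase described in this abstract (leave-one-out removal of redundant preparations, judged by AUC-PR) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it ranks record pairs directly by a `difflib` string similarity instead of training a classifier, approximates AUC-PR by average precision, and the function names (`average_precision`, `score`, `prune_redundant`) are hypothetical.

```python
from difflib import SequenceMatcher


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    if positives == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, (_, is_duplicate) in enumerate(ranked, start=1):
        if is_duplicate:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives


def score(preps, pairs, labels):
    """AUC-PR of pair similarities after applying all preparations in order."""
    def apply(text):
        for prep in preps:
            text = prep(text)
        return text

    sims = [SequenceMatcher(None, apply(a), apply(b)).ratio() for a, b in pairs]
    return average_precision(sims, labels)


def prune_redundant(preps, pairs, labels):
    """Drop preparations one by one while leaving one out does not lower AUC-PR."""
    kept = list(preps)
    changed = True
    while changed and kept:
        changed = False
        base = score(kept, pairs, labels)
        for prep in list(kept):
            rest = [p for p in kept if p is not prep]
            if score(rest, pairs, labels) >= base:
                kept = rest  # prep is redundant under this score
                changed = True
                break
    return kept


# Hypothetical gold-standard sample: record pairs with duplicate labels.
pairs = [("Anna  Smith", "anna smith"), ("Bob Lee", "Carol Wu"), ("J. Doe ", "j. doe")]
labels = [1, 0, 1]
candidates = [str.lower, str.strip, lambda s: " ".join(s.split())]
selected = prune_redundant(candidates, pairs, labels)
```

Because a preparation is removed only when the score does not drop, the selected subset never scores lower than the full candidate list on the sample.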
Anatomy of Metadata for Data Curation
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-06-13 DOI: 10.1145/3371925
L. Visengeriyeva, Ziawasch Abedjan
Abstract: Real-world datasets often suffer from various data quality problems. Several data cleaning solutions have been proposed so far. However, data cleaning remains a manual and iterative task that requires domain and technical expertise. Exploiting metadata promises to improve the tedious process of data preparation, because data errors are detectable through metadata. This article investigates the intrinsic connection between metadata and data errors. In this work, we establish a mapping that reflects the connection between data quality issues and extractable metadata using qualitative and quantitative techniques. Additionally, we present a taxonomy based on a closed grammar that covers all existing metadata and allows the composition of novel types of metadata. We provide a case study to show the practical application of the grammar for generating new metadata for data quality assessment.
Pages: 1-30
Citations: 10
Characterizing Disinformation Risk to Open Data in the Post-Truth Era
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-06-02 DOI: 10.1145/3328747
Adrienne Colborne, M. Smit
Abstract: Curated, labeled, high-quality data is a valuable commodity for tasks such as business analytics and machine learning. Open data is a common source of such data; for example, retail analytics draws on open demographic data, and weather forecast systems draw on open atmospheric and ocean data. Open data is released openly by governments to achieve various objectives, such as transparency, informing citizen engagement, or supporting private enterprise. Critical examination of ongoing social changes, including the post-truth phenomenon, suggests the quality, integrity, and authenticity of open data may be at risk. We introduce this risk through various lenses, describe some of the types of risk we expect using a threat model approach, identify approaches to mitigate each risk, and present real-world examples of cases where the risk has already caused harm. As an initial assessment of awareness of this disinformation risk, we compare our analysis to perspectives captured during open data stakeholder consultations in Canada.
Pages: 1-13
Citations: 3
Mining Expressive Rules in Knowledge Graphs
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-05-06 DOI: 10.1145/3371315
N. Ahmadi, Viet-Phi Huynh, Venkata Vamsikrishna Meduri, Stefano Ortona, Paolo Papotti
Abstract: We describe RuDiK, an algorithm and a system for mining declarative rules over RDF knowledge graphs (KGs). RuDiK can discover rules expressing both positive relationships between KG elements, e.g., “if two persons share at least one parent, they are likely to be siblings,” and negative patterns identifying data contradictions, e.g., “if two persons are married, one cannot be the child of the other” or “the birth year for a person cannot be bigger than her graduation year.” While the first kind of rule identifies new facts in the KG, the second kind enables the detection of incorrect triples and the generation of (training) negative examples for learning algorithms. High-quality rules are also critical for any reasoning task involving the KGs. Our approach increases the expressive power of the supported rule language w.r.t. the existing systems. RuDiK discovers rules containing (i) comparisons among literal values and (ii) selection conditions with constants. Richer rules increase the accuracy and the coverage over the facts in the KG for the task at hand. This is achieved with aggressive pruning of the search space and with disk-based algorithms, which enable the execution of the system on commodity machines. Also, RuDiK is robust to errors and missing data in the input graph. It discovers approximate rules with a measure of support that is aware of the quality issues. Our experimental evaluation with real-world KGs shows that RuDiK does better than existing solutions in terms of scalability and that it can identify effective rules for different target applications.
Pages: 1-27
Citations: 18
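To make the two negative rules quoted in this abstract concrete, the sketch below checks them against a tiny in-memory set of triples. It only applies already-known rules; it does not reproduce RuDiK's rule mining. The predicate names (`spouse`, `child`, `birthYear`, `graduationYear`), the sample data, and both function names are assumptions made for illustration.

```python
def married_and_parent(triples):
    """Negative rule: two married persons cannot be parent and child."""
    facts = set(triples)
    found = set()
    for s, p, o in facts:
        if p == "spouse" and ((s, "child", o) in facts or (o, "child", s) in facts):
            found.add(tuple(sorted((s, o))))  # report each pair once
    return sorted(found)


def born_after_graduation(triples):
    """Negative rule with a literal comparison: birthYear > graduationYear."""
    birth = {s: o for s, p, o in triples if p == "birthYear"}
    grad = {s: o for s, p, o in triples if p == "graduationYear"}
    return sorted(s for s in birth if s in grad and birth[s] > grad[s])


# Hypothetical knowledge graph as (subject, predicate, object) triples.
kg = [
    ("alice", "spouse", "bob"),
    ("bob", "child", "alice"),          # contradicts the spouse triple
    ("carol", "birthYear", 1990),
    ("carol", "graduationYear", 1985),  # graduated before being born
    ("dave", "birthYear", 1980),
    ("dave", "graduationYear", 2002),
]

suspicious_pairs = married_and_parent(kg)
suspicious_people = born_after_graduation(kg)
```

Subjects flagged this way are candidates for incorrect triples, which is the use the abstract describes for negative rules.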
What Are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-05-06 DOI: 10.1145/3369875
A. Haller, Javier D. Fernández, Maulik R. Kamdar, A. Polleres
Abstract: Linked Open Data promises to provide guiding principles to publish interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable, and reusable datasets. We argue that while as such, Linked Data may be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality issues even when knowledge graphs are published as Linked Data. First, to define boundaries of single coherent knowledge graphs within Linked Data, a principled notion of what a dataset is, or, respectively, what links within and between datasets are, has been missing. Second, we argue that to enable FAIR knowledge graphs, Linked Data misses standardised findability and accessibility mechanisms via a single entry link. To address the first issue, we (i) propose a rigorous definition of a naming authority for a Linked Data dataset, (ii) define different link types for data in Linked datasets, (iii) provide an empirical analysis of linkage among the datasets of the Linked Open Data cloud, and (iv) analyse the dereferenceability of those links. We base our analyses and link computations on a scalable mechanism implemented on top of the HDT format, which allows us to analyse quantity and quality of different link types at scale.
Pages: 1-34
Citations: 13
Data Quality and Explainable AI
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-04-30 DOI: 10.1145/3386687
L. Bertossi, Floris Geerts
Abstract: In this work, we provide some insights and develop some ideas, with few technical details, about the role of explanations in Data Quality in the context of data-based machine learning models (ML). In this direction, there are, as expected, roles for causality, and explainable artificial intelligence. The latter area not only sheds light on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, in particular, in ML, and also for suggesting repair actions. More generally, explanations can be used as a basis for defining dirty data in the context of ML, and measuring or quantifying them. We think of dirtiness as relative to the ML task at hand, e.g., classification.
Pages: 1-9
Citations: 25
The Information Resilience Framework
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-04-24 DOI: 10.1145/3388786
K. Blay, S. Yeomans, P. Demian, Danny Murguia
Abstract: The quality of information is crucial to the success of asset delivery, management, and performance in the Digitised Architecture, Engineering, Construction, and Operations (DAECO) sector. The exposure and sensitivity of information to threats during its lifecycle leaves it vulnerable, affecting the intrinsic, relational, and security dimensions of information quality. A resilient information lifecycle perspective that identifies capabilities and requirements is therefore needed to assure information quality amid threats. This research develops and presents an information resilience (IR) framework by drawing on the theories of resilience, information quality, and vulnerability. In developing the framework, the critical incident technique was employed in interviewing 30 professionals (average of 40 minutes) in addition to reviewing seven project documents across three digitally-driven infrastructure projects (making up 324 pages of data). The validated capabilities and requirements identified from this study have been collated into the framework, which highlights the need for cognitive-driven capabilities and process-driven requirements in DAECO.
Pages: 1-25
Citations: 5
Content-based Union and Complement Metrics for Dataset Search over RDF Knowledge Graphs
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-04-24 DOI: 10.1145/3372750
M. Mountantonakis, Yannis Tzitzikas
Abstract: RDF Knowledge Graphs (or Datasets) contain valuable information that can be exploited for a variety of real-world tasks. However, due to the enormous size of the available RDF datasets, it is difficult to discover the most valuable datasets for a given task. For improving dataset Discoverability, Interlinking, and Reusability, there is a trend for Dataset Search systems. Such systems are mainly based on metadata and ignore the contents; however, in tasks related to data integration and enrichment, the contents of datasets have to be considered. This is important for data integration but also for data enrichment; for instance, dataset owners quite often want to enrich the content of their dataset by selecting datasets that provide complementary information for it. The above tasks require content-based union and complement metrics between any subset of datasets; however, there is a lack of such approaches. For making feasible the computation of such metrics at very large scale, we propose an approach relying on (a) a set of pre-constructed (and periodically refreshed) semantics-aware indexes, and (b) “lattice-based” incremental algorithms that exploit the posting lists of such indexes, as well as set theory properties, for enabling efficient responses at query time. Finally, we discuss the efficiency of the proposed methods by presenting comparative results, and we report measurements for 400 real RDF datasets (containing over 2 billion triples), by exploiting the proposed metrics.
Pages: 1-31
Citations: 23
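The union and complement metrics named in this abstract can be illustrated with plain set operations. The sketch below is a naive brute-force version that enumerates dataset combinations; it does not use the semantics-aware indexes or lattice-based incremental algorithms the paper proposes. The dataset names, their contents, and the function names are hypothetical.

```python
from itertools import combinations


def union_size(datasets, names):
    """Number of distinct elements covered by the named datasets together."""
    return len(set().union(*(datasets[n] for n in names)))


def best_union(datasets, k):
    """The k datasets whose union covers the most elements (brute force)."""
    return max(
        combinations(sorted(datasets), k),
        key=lambda names: union_size(datasets, names),
    )


def complement(datasets, target, names):
    """Elements the named datasets offer that the target does not yet contain."""
    offered = set().union(*(datasets[n] for n in names))
    return offered - datasets[target]


# Hypothetical datasets, each reduced to a set of entity identifiers.
datasets = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
}

top_pair = best_union(datasets, 2)
new_for_a = complement(datasets, "A", ["B", "C"])
```

Enumerating all combinations grows exponentially with the number of datasets, which is the scalability problem the indexed approach in the paper is meant to address.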
Robustness of Word and Character N-gram Combinations in Detecting Deceptive and Truthful Opinions
Journal of Data and Information Quality (JDIQ) Pub Date : 2020-01-15 DOI: 10.1145/3349536
A. Siagian, M. Aritsugi
Abstract: Opinions in reviews about the quality of products or services can be important information for readers. Unfortunately, such opinions may include deceptive ones posted for some business reasons. To keep the opinions as a valuable and trusted source of information, we propose an approach to detecting deceptive and truthful opinions. Specifically, we explore the use of word and character n-gram combinations, function words, and word syntactic n-grams (word sn-grams) as features for classifiers to deal with this task. We also consider applying word correction to our utilized dataset. Our experiments show that classification results of using the word and character n-gram combination features could perform better than those of employing other features. Although the experiments indicate that applying the word correction might be insignificant, we note that the deceptive opinions tend to have a smaller number of error words than the truthful ones. To examine robustness of our features, we then perform cross-classification tests. Our latter experimental results suggest that using the word and character n-gram combination features could work well in detecting deceptive and truthful opinions. Interestingly, the latter experimental results also indicate that using the word sn-grams as combination features could give good performance.
Pages: 1-24
Citations: 7
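As a rough illustration of the "word and character n-gram combination" features this abstract refers to, the sketch below counts both kinds of n-gram in one feature bag. The tokenization (lowercasing, whitespace split), the `w:` and `c:` prefixes, the default n values, and the function names are assumptions; the paper's actual feature extraction and classifiers are not reproduced here.

```python
from collections import Counter


def word_ngrams(text, n):
    """Contiguous sequences of n whitespace-separated, lowercased tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous sequences of n characters, spaces included."""
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def combined_features(text, word_n=1, char_n=3):
    """Single bag of counts holding both word and character n-grams."""
    features = Counter()
    features.update(f"w:{gram}" for gram in word_ngrams(text, word_n))
    features.update(f"c:{gram}" for gram in char_ngrams(text, char_n))
    return features


features = combined_features("Great hotel great")
```

The prefixes keep a word unigram such as `w:the` distinct from the identical character trigram `c:the`, so the two feature families can share one vector space for a downstream classifier.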