Journal of Data and Information Quality (JDIQ): Latest Publications

An Introduction to Dynamic Data Quality Challenges
Journal of Data and Information Quality (JDIQ) | Pub Date: 2017-01-04 | DOI: 10.1145/2998575
Alan G. Labouseur, C. Matheus

Abstract: We live in an evolving world. As time passes, data changes in content and structure, and thus becomes dynamic. Data quality, therefore, also becomes dynamic, because it is an aggregate characteristic of the data itself. Our evolving world and the Internet of Things (IoT) thus present renewed challenges in data quality. IoT data is teeming with multivendor and multiprovider applications, devices, microservices, and automated processes built on social media, public and private datasets, digitized records, sensor logs, web logs, and much more. From intelligent traffic systems to smart healthcare devices, modern enterprises are inundated with a daily deluge of dynamic big data. The primary characteristics of big data are volume, velocity, and variety [Abadi et al. 2014]. Techniques for managing volume and velocity have been under development for decades. While some work has been done on variety, integrating and analyzing data from diverse sources and formats still presents challenges. For example, much of the big data deluge is structured and much of it is not. This single dimension of variety inherent in today's IoT clearly illustrates that there is no "silver bullet" and one size does not fit all [Abadi et al. 2014; Stonebraker and Cetintemel 2005, 2015]. There are many other dimensions of variety beyond structure: we must consider possibilities arising from analyzing data in a dizzying range of data types, found in varying time frames of differing granularity, from diverse sources in our evolving and streaming world. Structure is but one example, illustrative of the many more general challenges that we use in this article to introduce dynamic data quality.

Citations: 11
The Challenge of Test Data Quality in Data Processing
Journal of Data and Information Quality (JDIQ) | Pub Date: 2017-01-04 | DOI: 10.1145/3012004
Christoph Becker, Kresimir Duretec, A. Rauber

Abstract (author-version notice only): © ACM, 2016. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is forthcoming and will be published in JDIQ in 2016.

Citations: 8
From Content to Context
Journal of Data and Information Quality (JDIQ) | Pub Date: 2017-01-04 | DOI: 10.1145/2996198
Ganesan Shankaranarayanan, R. Blake

Abstract: Research in data and information quality has made significant strides over the last 20 years. It has become a unified body of knowledge incorporating techniques, methods, and applications from a variety of disciplines, including information systems, computer science, operations management, organizational behavior, psychology, and statistics. With organizations viewing "Big Data", social media data, data-driven decision-making, and analytics as critical, data quality has never been more important. We believe that data quality research is reaching the threshold of significant growth and a metamorphosis, from focusing on measuring and assessing data quality (content) toward a focus on usage and context. At this stage, it is vital to understand the identity of this research area in order to recognize its current state and to identify the growing number of research opportunities within it. Using Latent Semantic Analysis (LSA) to analyze the abstracts of 972 peer-reviewed journal and conference articles published over the past 20 years, this article identifies the core topics and themes that define the identity of data quality research. It further explores their trends over time, pointing to the data quality dimensions that have, and have not, been well studied, and offering insights into topics that may provide significant opportunities in this area.

Citations: 19
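The LSA technique named in the abstract above reduces a term-document matrix to a small number of latent topics via truncated SVD. The sketch below is a minimal illustration with NumPy on a toy corpus; the corpus, vocabulary handling, and topic count are assumptions for illustration, not the authors' actual setup.

```python
import numpy as np

# Toy corpus standing in for the 972 article abstracts (an assumption
# for illustration; the paper's real corpus and preprocessing differ).
docs = [
    "data quality measurement assessment",
    "data quality dimensions measurement assessment",
    "context usage decision making",
    "decision context analytics",
]

# Build a term-document matrix of raw counts.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# LSA core step: truncated SVD of the term-document matrix. The left
# singular vectors group co-occurring terms into latent topics.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # number of latent topics to keep
for t in range(k):
    top = np.argsort(-np.abs(U[:, t]))[:3]
    print(f"topic {t}:", [vocab[i] for i in top])
```

On this toy data the first component groups the measurement-oriented terms and the second groups the context-oriented ones, mirroring the content-to-context distinction the article draws.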
Reproducibility Challenges in Information Retrieval Evaluation
Journal of Data and Information Quality (JDIQ) | Pub Date: 2017-01-04 | DOI: 10.1145/3020206
N. Ferro

Abstract: Information Retrieval (IR) is concerned with ranking information resources with respect to user information needs, delivering a wide range of key applications for industry and society, such as Web search engines [Croft et al. 2009], intellectual property and patent search [Lupu and Hanbury 2013], and many others. The performance of IR systems is determined not only by their efficiency but also, and most importantly, by their effectiveness: their ability to retrieve and better rank relevant information resources while suppressing the retrieval of non-relevant ones. Due to the many sources of uncertainty, such as vague user information needs, unstructured information sources, or the subjective notion of relevance, experimental evaluation is the only means to assess the performance of IR systems from the effectiveness point of view. Experimental evaluation relies on the Cranfield paradigm, which makes use of experimental collections consisting of documents, sampled from a real domain of interest; topics, representing real user information needs in that domain; and relevance judgements, determining which documents are relevant to which topics [Harman 2011]. To share the effort and optimize the use of resources, experimental evaluation is usually carried out in publicly open and large-scale evaluation campaigns at the international level, like the Text REtrieval Conference (TREC) in the United States [Harman and Voorhees 2005], the Conference and Labs of the Evaluation Forum (CLEF) in Europe [Ferro 2014], the NII Testbeds and Community for Information access Research (NTCIR) in Japan and Asia, and the Forum for Information Retrieval Evaluation (FIRE) in India. These initiatives produce huge amounts of scientific data every year.

Citations: 49
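The Cranfield paradigm described above combines relevance judgements (qrels) with a system's ranked output (a run) to compute effectiveness scores. The following sketch shows two standard measures on made-up toy data; the qrels and run here are illustrative assumptions, not TREC or CLEF output.

```python
qrels = {  # topic -> set of relevant document ids (toy data)
    "t1": {"d1", "d3", "d7"},
    "t2": {"d2"},
}
run = {  # topic -> documents ranked by the system, best first (toy data)
    "t1": ["d3", "d5", "d1", "d9", "d7"],
    "t2": ["d4", "d2", "d8"],
}

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

for topic in qrels:
    p5 = precision_at_k(run[topic], qrels[topic], 5)
    ap = average_precision(run[topic], qrels[topic])
    print(topic, f"P@5={p5:.2f}", f"AP={ap:.3f}")
```

Reproducibility in this setting hinges on the collection, the qrels, and the measure implementations all being available unchanged, which is exactly the scientific-data management problem the editorial raises.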
Replacing Mechanical Turkers? Challenges in the Evaluation of Models with Semantic Properties
Journal of Data and Information Quality (JDIQ) | Pub Date: 2016-10-12 | DOI: 10.1145/2935752
Fred Morstatter, Huan Liu

Abstract: Some machine-learning algorithms offer more than just predictive power: they also provide insight into the underlying data. Examples include topic modeling algorithms such as Latent Dirichlet Allocation (LDA) [Blei et al. 2003], whose topics are often inspected as part of the analysis that many researchers perform on their data. More recently, word embedding algorithms such as Word2Vec [Mikolov et al. 2013] have produced models with semantic properties. These algorithms are immensely useful; they tell us something about the environment from which they generate their predictions. One pressing challenge is how to evaluate the quality of the semantic information produced by these algorithms. When we employ algorithms for their semantic properties, it is important that these properties can be understood by a human, yet there are currently no established approaches to carry out this evaluation automatically. The evaluation (if done at all) is usually carried out via user studies. While this type of evaluation is sound, it is expensive in both time and money. Recruiting crowdsourced workers to complete tasks on crowdsourcing sites adds considerable time to the research process, and each individual task costs real currency that could be spent on other parts of the research endeavor. Together, these costs make such experiments difficult to perform and greatly reduce the probability that future researchers will be able to reproduce them.

Citations: 1
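One family of automated proxies for the human judgements discussed above is corpus-based topic coherence, which scores a topic's top words by how often they co-occur in documents. The sketch below implements the UMass-style coherence score on a toy corpus; the documents and topics are illustrative assumptions, not the authors' method or data.

```python
import math
from itertools import combinations

# Toy documents and two candidate "topics" (top words from a model).
# Both are illustrative stand-ins, not output of a real LDA run.
docs = [
    {"game", "team", "score", "win"},
    {"game", "team", "player"},
    {"market", "stock", "price"},
    {"stock", "price", "trade", "market"},
]
coherent_topic = ["game", "team", "score"]
mixed_topic = ["game", "stock", "score"]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(1 for d in docs if word in d)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(topic_words):
    """UMass coherence: pairwise log co-occurrence ratio, higher = more coherent."""
    score = 0.0
    for w1, w2 in combinations(topic_words, 2):
        score += math.log((co_doc_freq(w1, w2) + 1) / doc_freq(w2))
    return score

print(umass_coherence(coherent_topic), umass_coherence(mixed_topic))
```

A coherence measure like this is cheap and reproducible, but as the editorial notes, whether such automated scores truly track human judgements of interpretability remains an open question.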
Ontology-Based Data Quality Management for Data Streams
Journal of Data and Information Quality (JDIQ) | Pub Date: 2016-10-06 | DOI: 10.1145/2968332
Sandra Geisler, C. Quix, Sven Weber, M. Jarke

Abstract: Data Stream Management Systems (DSMS) provide real-time data processing in an effective way, but there is always a tradeoff between data quality (DQ) and performance. We propose an ontology-based data quality framework for relational DSMS that includes DQ measurement and monitoring in a transparent, modular, and flexible way. We follow a threefold approach that takes the characteristics of relational data stream management into account: (1) Query Metrics respect changes in data quality due to query operations; (2) Content Metrics allow the semantic evaluation of data in the streams; and (3) Application Metrics allow easy user-defined computation of data quality values to account for application specifics. Additionally, a quality monitor allows us to observe data quality values and take counteractions to balance data quality and performance. The framework has been designed along a DQ management methodology suited for data streams, and it has been evaluated in the domains of transportation systems and health monitoring.

Citations: 30
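To make the idea of stream DQ monitoring concrete, here is a minimal sketch of one windowed metric (completeness: the share of non-null readings in the last N tuples) with an alert threshold. The class, field semantics, and threshold are illustrative assumptions, not the paper's ontology-based framework.

```python
from collections import deque

class CompletenessMonitor:
    """Windowed completeness over a data stream (None marks a missing reading)."""

    def __init__(self, window_size=5, threshold=0.8):
        self.window = deque(maxlen=window_size)  # keeps only the last N flags
        self.threshold = threshold

    def observe(self, value):
        """Record one stream value as present (True) or missing (False)."""
        self.window.append(value is not None)

    def completeness(self):
        if not self.window:
            return 1.0
        return sum(self.window) / len(self.window)

    def alert(self):
        """True when completeness drops below the configured threshold."""
        return self.completeness() < self.threshold

# Feed a toy sensor stream with gaps through the monitor.
mon = CompletenessMonitor(window_size=4, threshold=0.75)
for reading in [21.0, None, 22.5, None, None, 23.1]:
    mon.observe(reading)
print(mon.completeness(), mon.alert())
```

A monitor like this corresponds roughly to the paper's "quality monitor" role: it observes a DQ value continuously and signals when a counteraction (e.g., load shedding or source switching) may be needed.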
EXPERIENCE
Journal of Data and Information Quality (JDIQ) | Pub Date: 2016-05-26 | DOI: 10.1145/2893482
Peter H. Aiken

Abstract: In a manner similar to most organizations, BigCompany (BigCo) was determined to benefit strategically from its widely recognized and vast quantities of data. (U.S. government agencies make regular visits to BigCo to learn from its experiences in this area.) When faced with an explosion in data volume, increases in complexity, and a need to respond to changing conditions, BigCo struggled to respond using a traditional, information technology (IT) project-based approach. Because BigCo was not data knowledgeable, it did not realize that traditional approaches could not work. Two full years into the initiative, BigCo was far from achieving its initial goals. How much more time, money, and effort would be required before results were achieved? Moreover, could the results be achieved in time to support a larger, critical, technology-driven challenge that also depended on solving the data challenges? While these questions remain unaddressed, these considerations increase our collective understanding of data assets as separate from IT projects. Only by reconceiving data as a strategic asset can organizations begin to address these new challenges. Transformation to a data-driven culture requires far more than technology, which remains just one of three required "stool legs" (people and process being the other two). Seven prerequisites to effectively leveraging data are necessary, but insufficient awareness of them exists in most organizations; hence the widespread misfires in these areas, especially in so-called big data initiatives. Refocusing on foundational data management practices is required for all organizations, regardless of their organizational or data strategies.

Citations: 7
Data Standards Challenges for Interoperable and Quality Data
Journal of Data and Information Quality (JDIQ) | Pub Date: 2016-05-16 | DOI: 10.1145/2903723
Hongwei Zhu, Yang W. Lee, A. Rosenthal

Abstract: Data standards are agreed-on specifications about data objects and their relationships, used to enable semantic interoperability of data originating from multiple sources and to help improve data quality. Despite their importance and the large number of data standards being created by standards development organizations [Cargill and Bolin 2007], little is understood about the quality of data standards and the costly and complex process of developing, maintaining, and using them [Lyytinen and King 2006]. Failures of standards efforts are common in practice [Bernstein and Haas 2008; Rosenthal et al. 2004], and a two-decade-old call for re-theorizing data standards still applies today [Wybo and Goodhue 1995]. Recent studies present important findings and opportunities for future research. We identify four representative works that (1) empirically confirmed the value of data standards for interoperability and business performance [Zhao and Xia 2014], (2) presented rules to identify and exclude suboptimal standards approaches under certain circumstances [Rosenthal et al. 2014], (3) explained the difficulties and the gap between standards development and implementation in the U.S. mortgage industry [Markus et al. 2006], and (4) proposed a set of characteristics of the quality of data standards [Folmer 2012]. While we benefit from this and other work on data standards, many questions remain unanswered. What is a "good" data standard? How do we measure its quality? What are the best processes and mechanisms for developing and maintaining standards that optimally address multiple objectives? How do we best manage the evolution of data standards? What kinds of data standards are most effective, or, more generally, what are the effects of data standards? Addressing these questions will reduce failures and improve the ability of data standards to produce interoperable and quality data.

Citations: 4
Automatic Discovery of Abnormal Values in Large Textual Databases
Journal of Data and Information Quality (JDIQ) | Pub Date: 2016-04-19 | DOI: 10.1145/2889311
P. Christen, Ross W. Gayler, Khoi-Nguyen Tran, Jeffrey Fisher, Dinusha Vatsalan

Abstract: Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media posts and bibliographic records. With online services, individuals are increasingly required to enter their personal details, for example when purchasing products online or registering for government services, while many social network and e-commerce sites allow users to post short comments. Many online sites leave open the possibility for people to enter unintended or malicious abnormal values, such as names with errors, bogus values, profane comments, or random character sequences. In other applications, such as online bibliographic databases or comparative online shopping sites, databases are increasingly populated in (semi-)automatic ways through Web crawls, a practice that can result in low-quality data being added automatically to a database. In this article, we develop three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases. Following recent work in categorical outlier detection, our assumption is that "normal" values are those that occur frequently in a database, while an individual abnormal value is rare. Our techniques are unsupervised and address the challenge of discovering abnormal values as an outlier detection problem. The first technique is a basic but efficient q-gram set based technique, the second is based on a probabilistic language model, and the third employs morphological word features to train a one-class support vector machine classifier. Our aim is to investigate and develop techniques that are fast, efficient, and automatic. The output of our techniques can help in the development of rule-based data cleaning and information extraction systems, or be used as training data for further supervised data cleaning procedures. We evaluate our techniques on four large real-world datasets from different domains: two US voter registration databases containing personal details, the 2013 KDD Cup dataset of bibliographic records, and the SNAP Memetracker dataset of phrases from social networking sites. Our results show that our techniques can efficiently and automatically discover abnormal textual values, allowing an organization to conduct efficient data exploration and improve the quality of its textual databases without requiring explicit training data.

Citations: 15
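The q-gram intuition in the abstract above ("normal" values occur frequently, abnormal ones are rare) can be sketched in a few lines: score a value by the rarity of its character q-grams relative to the database. The toy name data and the exact scoring formula below are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def qgrams(value, q=2):
    """Character q-grams of a value, padded so boundaries form q-grams too."""
    v = f"#{value.lower()}#"
    return [v[i:i + q] for i in range(len(v) - q + 1)]

# Toy database of surname values (an assumption for illustration).
database = ["smith", "smyth", "smithe", "jones", "johns", "jonas"]
freq = Counter(g for v in database for g in qgrams(v))
total = sum(freq.values())

def abnormality(value, q=2):
    """Mean rarity of the value's q-grams: higher means more unusual."""
    grams = qgrams(value, q)
    return sum(1 - freq[g] / total for g in grams) / len(grams)

for v in ["smith", "xq9#z"]:
    print(v, round(abnormality(v), 3))
```

Because the scores need only q-gram counts, they can be computed in a single pass over the database and then used to flag candidates for review, in the spirit of the unsupervised setting the article targets.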
Challenges for Context-Driven Time Series Forecasting
Journal of Data and Information Quality (JDIQ) | Pub Date: 2016-04-19 | DOI: 10.1145/2896822
R. Ulbricht, H. Donker, Claudio Hartmann, M. Hahmann, Wolfgang Lehner

Abstract: Predicting time series is a crucial task for organizations, since decisions are often based on uncertain information. Many forecasting models are designed from a generic statistical point of view, yet each real-world application requires domain-specific adaptations to obtain high-quality results. All such specifics are summarized by the term context. In contrast to current approaches, we want to integrate context as the primary driver in the forecasting process. We introduce context-driven time series forecasting, focusing on two exemplary domains: renewable energy and sparse sales data. In view of this, we discuss the challenge of context integration in the individual process steps.

Citations: 1
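The contrast the abstract draws between generic statistical models and context-driven ones can be illustrated minimally: condition the forecast on a context attribute instead of pooling all history. The weekend flag, sales figures, and grouping rule below are toy assumptions, not the authors' approach.

```python
# Toy sales history: (day_index, is_weekend, sales). Illustrative data only.
history = [
    (0, False, 100), (1, False, 104), (2, False, 98),
    (3, True, 40), (4, True, 44),
    (5, False, 102), (6, True, 42),
]

def naive_forecast(history):
    """Context-free baseline: overall mean of past sales."""
    values = [s for _, _, s in history]
    return sum(values) / len(values)

def context_forecast(history, is_weekend):
    """Context-driven: mean of past sales sharing the same context."""
    values = [s for _, w, s in history if w == is_weekend]
    return sum(values) / len(values)

print(round(naive_forecast(history), 1))          # blends two regimes
print(round(context_forecast(history, True), 1))  # weekend-specific level
```

Even this trivial example shows why pooling hurts: the context-free mean sits between the weekday and weekend levels and describes neither, which is the kind of domain-specific adaptation the article argues should drive the forecasting process.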