Journal of Data and Information Quality (JDIQ)最新文献_第7页

Adaptive and Cost-Effective Collection of High-Quality Data for Critical Infrastructure and Emergency Management in Smart Cities—Framework and Challenges 智慧城市关键基础设施和应急管理高质量数据的自适应和成本效益收集——框架和挑战

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-05-29 DOI: 10.1145/3190579

E. Bertino, M. Jahanshahi

{"title":"Adaptive and Cost-Effective Collection of High-Quality Data for Critical Infrastructure and Emergency Management in Smart Cities—Framework and Challenges","authors":"E. Bertino, M. Jahanshahi","doi":"10.1145/3190579","DOIUrl":"https://doi.org/10.1145/3190579","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2018 ACM 1936-1955/2018/05-ART1 $15.00 https://doi.org/10.1145/3190579 ACM Journal of Data and Information Quality, Vol. 10, No. 1, Article 1. Publication date: May 2018. 1:2 E. Bertino and M. R. Jahanshahi Fig. 1. A spatially incomplete object. agents (mobile phones, small drones, robots, sensors); 5G networks and edge computing processing [2]; crowdsourcing. In this article, we first briefly discuss relevant data quality requirements related to applications in the area of critical infrastructure and emergency management, although this framework can be extended to other applications. We then present a comprehensive framework for a real-time, adaptive, and cost-effective collection of high-quality data for such applications that leverage many of the above technologies, and elaborate on a few research challenges. 2 DATA QUALITY REQUIREMENTS Data quality is usually characterized by many different dimensions [3]. In our context, e.g., objects extracted from image data, key requirements include: —Spatial Completeness: The objects of interest should be “fully covered” by the image data. For example, an image reporting only half of a building crack would not have satisfactory spatial completeness (see Figure 1 for an example of a spatially incomplete object). —Temporal Completeness: The temporal evolution of the objects of interest should be covered as it is critical for accurate prediction. —Precision: The object images should be sharp and have high resolution. —Traceability: Information about the entire process, according to which data of interest was collected, processed, and transmitted, should be recorded; this is critical for identifying errors that lead to poor quality data about the objects of interest. —Minimality: The presence of non-relevant objects should be minimized. It is, however, important to remark that other quality requirements, such as currentness and consistency, are also relevant in our context. 3 DATA COLLECTION FRAMEWORK Our framework (see Figure 2) is based on two conceptual parties: data collection coordinator (referred to as base station (BS)); and data collectors (e.g., agents in charge of data gathering). The data collection coordinator is the interface system that coordinates the data acquisition tasks and data quality assessment. It interfaces on one side with the data users (e.g., end-users and applications) and on the other with data collectors. Given a data acquisition task and geographical area of interest, it allocates a number of data collectors, based on the capabilities of collectors, for the execution of the task, by also trying to optimize the cost of data acquisition and minimize ACM Journal of Data and Information Quality, Vol. 10, No. 1, Article 1. Publication date: May 2018. Adaptive and Cost-Effective ","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"95 1","pages":"1 - 6"},"PeriodicalIF":0.0,"publicationDate":"2018-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87018908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Visual Interactive Creation, Customization, and Analysis of Data Quality Metrics 数据质量度量的可视化交互创建、定制和分析

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-05-29 DOI: 10.1145/3190578

C. Bors, T. Gschwandtner, Simone Kriglstein, S. Miksch, M. Pohl

{"title":"Visual Interactive Creation, Customization, and Analysis of Data Quality Metrics","authors":"C. Bors, T. Gschwandtner, Simone Kriglstein, S. Miksch, M. Pohl","doi":"10.1145/3190578","DOIUrl":"https://doi.org/10.1145/3190578","url":null,"abstract":"During data preprocessing, analysts spend a significant part of their time and effort profiling the quality of the data along with cleansing and transforming the data for further analysis. While quality metrics—ranging from general to domain-specific measures—support assessment of the quality of a dataset, there are hardly any approaches to visually support the analyst in customizing and applying such metrics. Yet, visual approaches could facilitate users’ involvement in data quality assessment. We present MetricDoc, an interactive environment for assessing data quality that provides customizable, reusable quality metrics in combination with immediate visual feedback. Moreover, we provide an overview visualization of these quality metrics along with error visualizations that facilitate interactive navigation of the data to determine the causes of quality issues present in the data. In this article, we describe the architecture, design, and evaluation of MetricDoc, which underwent several design cycles, including heuristic evaluation and expert reviews as well as a focus group with data quality, human-computer interaction, and visual analytics experts.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"12 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2018-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84487131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

Machine Reading of Biomedical Data Dictionaries 生物医学数据词典的机器阅读

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-05-11 DOI: 10.1145/3177874

N. Ashish, Arihant Patawari

引用次数: 1

InfoClean InfoClean

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-04-12 DOI: 10.1145/3190577

Fei Chiang, Dhruv Gairola

{"title":"InfoClean","authors":"Fei Chiang, Dhruv Gairola","doi":"10.1145/3190577","DOIUrl":"https://doi.org/10.1145/3190577","url":null,"abstract":"Data quality has become a pervasive challenge for organizations as they wrangle with large, heterogeneous datasets to extract value. Given the proliferation of sensitive and confidential information, it is crucial to consider data privacy concerns during the data cleaning process. For example, in medical database applications, varying levels of privacy are enforced across the attribute values. Attributes such as a patient’s country or city of residence may be less sensitive than the patient’s prescribed medication. Traditional data cleaning techniques assume the data is openly accessible, without considering the differing levels of information sensitivity. In this work, we take the first steps toward a data cleaning model that integrates privacy as part of the data cleaning process. We present a privacy-aware data cleaning framework that differentiates the information content among the attribute values during the data cleaning process to resolve data inconsistencies while minimizing the amount of information disclosed. Our data repair algorithm includes a set of data disclosure operations that considers the information content of the underlying attribute values, while maximizing data utility. Our evaluation using real datasets shows that our algorithm scales well, and achieves improved performance and comparable repair accuracy against existing data cleaning solutions.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"29 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78796294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13

SPMDL

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-04-12 DOI: 10.1145/3185049

M. Alshayeb, Yasser Shaaban, Jarallah AlGhamdi

引用次数: 4

Challenge Paper 挑战的论文

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-04-12 DOI: 10.1145/3165712

A. Gal, Arik Senderovich, M. Weidlich

{"title":"Challenge Paper","authors":"A. Gal, Arik Senderovich, M. Weidlich","doi":"10.1145/3165712","DOIUrl":"https://doi.org/10.1145/3165712","url":null,"abstract":"Queues represent a setting where agents compete over a scarce resource: People wait for public transportation, jobs wait to be processed, patients await treatment at a hospital, and so on. While data logs record many aspects of our lives, information about queues is rarely recorded. Queue mining (Senderovich et al. 2015) is the process of revealing queue information from data logs for the purpose of discovering queueing models, conformance checking, and optimization. As such, queue mining enables bottleneck detection and delay prediction (Gal et al. 2017). A queueing network is the most general form of a queueing model, represented as a directed graph with nodes being the queueing stations (corresponding to types of resources), edges corresponding to routing between stations, and node attributes corresponding to station dynamics (e.g., arrival patterns, service time distributions, station capacity, and service policy—for example, first-come first-served). Customers arrive into a queueing station, wait (enqueued) before being served by the station, and then leave to the next station (or exit the system). Queueing networks are often assumed to have a single customer type and an immediate Markovian routing (after completion at a station a customer appears in the next station with some probability). Also, simple queueing networks","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"219 1","pages":"1 - 5"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82731396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Experience 经验

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-04-12 DOI: 10.1145/3190576

Fathoni A. Musyaffa, C. Engels, Maria-Esther Vidal, F. Orlandi, Sören Auer

{"title":"Experience","authors":"Fathoni A. Musyaffa, C. Engels, Maria-Esther Vidal, F. Orlandi, Sören Auer","doi":"10.1145/3190576","DOIUrl":"https://doi.org/10.1145/3190576","url":null,"abstract":"Public administrations are continuously publishing open data, increasing the amount of government open data over time. The published data includes budgets and spending as part of fiscal data; publishing these data is an important part of transparent and accountable governance. However, open fiscal data should also meet open data publication guidelines. When requirements in data guidelines are not met, effective data analysis over published datasets cannot be performed effectively. In this article, we present Open Fiscal Data Publication (OFDP), a framework to assess the quality of open fiscal datasets. We also present an extensive open fiscal data assessment and common data quality issues found; additionally, open fiscal data publishing guidelines are presented. We studied and surveyed main quality factors for open fiscal datasets. Moreover, the collected quality factors have been scored according to the results of a questionnaire to score quality factors within the OFDP assessment framework. We gather and comprehensively analyze a representative set of 77 fiscal datasets from several public administrations across different regions at different levels (e.g., supranational, national, municipality). We characterize quality issues commonly arising in these datasets. Our assessment shows that there are many quality factors in fiscal data publication that still need to be taken care of so that the data can be analyzed effectively. Our proposed guidelines allow for publishing open fiscal data where these quality issues are avoided.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"12 1","pages":"1 - 10"},"PeriodicalIF":0.0,"publicationDate":"2018-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77147680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

The Web of False Information 虚假信息网

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-04-10 DOI: 10.1145/3309699

Savvas Zannettou, Michael Sirivianos, Jeremy Blackburn, N. Kourtellis

{"title":"The Web of False Information","authors":"Savvas Zannettou, Michael Sirivianos, Jeremy Blackburn, N. Kourtellis","doi":"10.1145/3309699","DOIUrl":"https://doi.org/10.1145/3309699","url":null,"abstract":"A new era of Information Warfare has arrived. Various actors, including state-sponsored ones, are weaponizing information on Online Social Networks to run false-information campaigns with targeted manipulation of public opinion on specific topics. These false-information campaigns can have dire consequences to the public: mutating their opinions and actions, especially with respect to critical world events like major elections. Evidently, the problem of false information on the Web is a crucial one and needs increased public awareness as well as immediate attention from law enforcement agencies, public institutions, and in particular, the research community. In this article, we make a step in this direction by providing a typology of the Web’s false-information ecosystem, composed of various types of false-information, actors, and their motives. We report a comprehensive overview of existing research on the false-information ecosystem by identifying several lines of work: (1) how the public perceives false information; (2) understanding the propagation of false information; (3) detecting and containing false information on the Web; and (4) false information on the political stage. In this work, we pay particular attention to political false information as: (1) it can have dire consequences to the community (e.g., when election results are mutated) and (2) previous work shows that this type of false information propagates faster and further when compared to other types of false information. Finally, for each of these lines of work, we report several future research directions that can help us better understand and mitigate the emerging problem of false-information dissemination on the Web.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"278 1","pages":"1 - 37"},"PeriodicalIF":0.0,"publicationDate":"2018-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82812774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 137

Toward Veracity Assessment in RDF Knowledge Bases RDF知识库的准确性评估研究

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-02-20 DOI: 10.1145/3177873

Diego Esteves, A. Rula, Aniketh Janardhan Reddy, Jens Lehmann

引用次数: 10

Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases 生物数据库重复数据删除序列聚类方法比较分析

Journal of Data and Information Quality (JDIQ) Pub Date : 2018-01-27 DOI: 10.1145/3131611

Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, J. Zobel, Karin M. Verspoor

{"title":"Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases","authors":"Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, J. Zobel, Karin M. Verspoor","doi":"10.1145/3131611","DOIUrl":"https://doi.org/10.1145/3131611","url":null,"abstract":"The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"5 1","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2018-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85091079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9