Large-Scale Data Quality Challenges, Framework and Evaluation in Metro Systems

IF 7.5 Zone 3, Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data Pub Date : 2024-10-03 DOI:10.1109/TBDATA.2024.3474215

Tailan Yuan;Wen Xiong;Siyuan Liu

Volume 11, Issue 3, pp. 1447-1463. Journal Article. https://ieeexplore.ieee.org/document/10705079/

Citations: 0

Abstract

Data quality is a fundamental challenge for downstream data mining tasks. While numerous studies have addressed data quality issues in various contexts, there is a notable lack of systematic research on data quality in metro systems. Metro systems generate a vast volume of multisource heterogeneous datasets daily, and many data mining tasks have been developed for operational and management purposes. Therefore, investigating data quality problems in metro systems is crucial. In this paper, we systematically explore data quality issues in metro systems. First, we present a comprehensive analysis method to examine data quality problems such as missing data, noise, and weak semantics. Second, we design five metrics to measure data quality and propose a set of quality improvement approaches. These approaches include a travel pattern-based missing value imputation method, a heuristic trajectory noise filtering method, and a data semantics enhancement method. Additionally, we develop an automated pipeline solution in which the data quality enhancement algorithms are integrated with the data processing pipeline. Finally, we provide a case study to illustrate the benefits of our data quality improvement methods. We conducted extensive experiments to validate our methods on a set of large-scale datasets collected from a metro system, including Wi-Fi signal data and electronic fence data. The results indicate that 1) the proposed imputation method surpasses other baselines by 26.47% to 44.82%; 2) the proposed noise filtering method outperforms other baselines by an average of 12.22%; and 3) the proposed data semantics enrichment method exceeds the baseline method by 37.34% in terms of maximum accuracy.

Source journal

IEEE Transactions on Big Data Multiple-

CiteScore

11.80

Self-citation rate

2.80%

Articles published

114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.