Large-Scale Data Quality Challenges, Framework and Evaluation in Metro Systems
Authors: Tailan Yuan; Wen Xiong; Siyuan Liu
Journal: IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1447-1463 (Journal Article)
Published: 2024-10-03
DOI: 10.1109/TBDATA.2024.3474215
URL: https://ieeexplore.ieee.org/document/10705079/
Impact factor: 7.5; JCR: Q1 (Computer Science, Information Systems)
Citations: 0
Abstract
Data quality is a fundamental challenge for downstream data mining tasks. While numerous studies have addressed data quality issues in various contexts, there is a notable lack of systematic research on data quality in metro systems. Metro systems generate a vast volume of multisource heterogeneous datasets daily, and many data mining tasks have been developed for operational and management purposes. Therefore, investigating data quality problems in metro systems is crucial. In this paper, we systematically explore data quality issues in metro systems. First, we present a comprehensive analysis method to examine data quality problems such as missing data, noise, and weak semantics. Second, we design five metrics to measure data quality and propose a set of quality improvement approaches. These approaches include a travel pattern-based missing value imputation method, a heuristic trajectory noise filtering method, and a data semantics enhancement method. Additionally, we develop an automated pipeline solution in which the data quality enhancement algorithms are integrated with the data processing pipeline. Finally, we provide a case study to illustrate the benefits of our data quality improvement methods. We conducted extensive experiments to validate our methods on a set of large-scale datasets collected from a metro system, which include Wi-Fi signal data and electronic fence data. The results indicate that 1) the proposed imputation method surpasses other baselines by 26.47% to 44.82%; 2) the proposed noise filtering method outperforms other baselines by an average of 12.22%; and 3) the proposed data semantics enhancement method exceeds the baseline method by 37.34% in terms of maximum accuracy.
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.