Ming Sheng, Shuliang Wang, Yong Zhang, Rui Hao, Ye Liang, Yi Luo, Wenhan Yang, Jincheng Wang, Yinan Li, Wenkui Zheng, Wenyao Li
{"title":"基于 Lakehouse 的多源异构医疗数据增强框架。","authors":"Ming Sheng, Shuliang Wang, Yong Zhang, Rui Hao, Ye Liang, Yi Luo, Wenhan Yang, Jincheng Wang, Yinan Li, Wenkui Zheng, Wenyao Li","doi":"10.1007/s13755-024-00295-6","DOIUrl":null,"url":null,"abstract":"<p><p>Obtaining high-quality data sets from raw data is a key step before data exploration and analysis. Nowadays, in the medical domain, a large amount of data is in need of quality improvement before being used to analyze the health condition of patients. There have been many researches in data extraction, data cleaning and data imputation, respectively. However, there are seldom frameworks integrating with these three techniques, making the dataset suffer in accuracy, consistency and integrity. In this paper, a multi-source heterogeneous data enhancement framework based on a lakehouse MHDP is proposed, which includes three steps of data extraction, data cleaning and data imputation. In the data extraction step, a data fusion technique is offered to handle multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the SOTA algorithm SAITS, are applied for different situations. We evaluate our framework via three tasks: clustering, classification and strategy prediction. The experimental results prove the effectiveness of our data enhancement framework.</p>","PeriodicalId":46312,"journal":{"name":"Health Information Science and Systems","volume":"12 1","pages":"37"},"PeriodicalIF":4.7000,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11226589/pdf/","citationCount":"0","resultStr":"{\"title\":\"A multi-source heterogeneous medical data enhancement framework based on lakehouse.\",\"authors\":\"Ming Sheng, Shuliang Wang, Yong Zhang, Rui Hao, Ye Liang, Yi Luo, Wenhan Yang, Jincheng Wang, Yinan Li, Wenkui Zheng, Wenyao Li\",\"doi\":\"10.1007/s13755-024-00295-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Obtaining high-quality data sets from raw data is a key step before data exploration and analysis. Nowadays, in the medical domain, a large amount of data is in need of quality improvement before being used to analyze the health condition of patients. There have been many researches in data extraction, data cleaning and data imputation, respectively. However, there are seldom frameworks integrating with these three techniques, making the dataset suffer in accuracy, consistency and integrity. In this paper, a multi-source heterogeneous data enhancement framework based on a lakehouse MHDP is proposed, which includes three steps of data extraction, data cleaning and data imputation. In the data extraction step, a data fusion technique is offered to handle multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the SOTA algorithm SAITS, are applied for different situations. We evaluate our framework via three tasks: clustering, classification and strategy prediction. The experimental results prove the effectiveness of our data enhancement framework.</p>\",\"PeriodicalId\":46312,\"journal\":{\"name\":\"Health Information Science and Systems\",\"volume\":\"12 1\",\"pages\":\"37\"},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2024-07-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11226589/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Health Information Science and Systems\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s13755-024-00295-6\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/12/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Health Information Science and Systems","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s13755-024-00295-6","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
引用次数: 0
摘要
从原始数据中获取高质量的数据集是数据探索和分析前的关键一步。如今,在医疗领域,大量数据在用于分析患者健康状况之前都需要提高质量。在数据提取、数据清洗和数据估算方面分别有许多研究。然而,很少有将这三种技术整合在一起的框架,使得数据集的准确性、一致性和完整性受到影响。本文提出了一种基于湖库 MHDP 的多源异构数据增强框架,包括数据提取、数据清洗和数据估算三个步骤。在数据提取步骤中,提供了一种数据融合技术来处理多模态和多源异构数据。在数据清洗步骤中,我们提出了 HoloCleanX,它提供了一种便捷的交互式程序。在数据估算步骤中,我们针对不同情况应用了多重估算(MI)和 SOTA 算法 SAITS。我们通过三个任务来评估我们的框架:聚类、分类和策略预测。实验结果证明了我们的数据增强框架的有效性。
A multi-source heterogeneous medical data enhancement framework based on lakehouse.
Obtaining high-quality data sets from raw data is a key step before data exploration and analysis. Nowadays, in the medical domain, a large amount of data is in need of quality improvement before being used to analyze the health condition of patients. There have been many researches in data extraction, data cleaning and data imputation, respectively. However, there are seldom frameworks integrating with these three techniques, making the dataset suffer in accuracy, consistency and integrity. In this paper, a multi-source heterogeneous data enhancement framework based on a lakehouse MHDP is proposed, which includes three steps of data extraction, data cleaning and data imputation. In the data extraction step, a data fusion technique is offered to handle multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the SOTA algorithm SAITS, are applied for different situations. We evaluate our framework via three tasks: clustering, classification and strategy prediction. The experimental results prove the effectiveness of our data enhancement framework.
期刊介绍:
Health Information Science and Systems is a multidisciplinary journal that integrates artificial intelligence/computer science/information technology with health science and services, embracing information science research coupled with topics related to the modeling, design, development, integration and management of health information systems, smart health, artificial intelligence in medicine, and computer aided diagnosis, medical expert systems. The scope includes: i.) smart health, artificial Intelligence in medicine, computer aided diagnosis, medical image processing, medical expert systems ii.) medical big data, medical/health/biomedicine information resources such as patient medical records, devices and equipments, software and tools to capture, store, retrieve, process, analyze, optimize the use of information in the health domain, iii.) data management, data mining, and knowledge discovery, all of which play a key role in decision making, management of public health, examination of standards, privacy and security issues, iv.) development of new architectures and applications for health information systems.