Cleaning and Harmonizing Medical Image Data for Reliable AI: Lessons Learned from Longitudinal Oral Cancer Natural History Study Data

Zhiyun Xue, Tochi Oguguo, Kelly J Yu, Tseng-Cheng Chen, Chun-Hung Hua, Chung Jan Kang, Chih-Yen Chien, Ming-Hsui Tsai, Cheng-Ping Wang, Anil K Chaturvedi, Sameer Antani

Proceedings of SPIE--the International Society for Optical Engineering, vol. 12931, 2024-02-01 (Epub 2024-04-02). DOI: 10.1117/12.3005875

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11107840/pdf/
Abstract
For deep learning-based machine learning, large and sufficiently diverse data are crucial, but their quality is equally important. In real-world applications, however, raw source data commonly contain incorrect, noisy, inconsistent, improperly formatted, and sometimes missing elements, particularly when the datasets are large and sourced from many sites. In this paper, we present our work toward preparing image data for the development of AI-driven approaches to studying various aspects of the natural history of oral cancer. Specifically, we focus on two aspects: 1) cleaning the image data; and 2) extracting the annotation information. Data cleaning includes removing duplicates, identifying missing data, correcting errors, standardizing data sets, and removing personally sensitive information, toward combining data sourced from different study sites; these steps are often collectively referred to as data harmonization. Annotation information extraction includes identifying crucial or valuable text that was manually entered by clinical providers in image paths/names and standardizing the label text. Both are important for successful deep learning algorithm development and data analysis. We provide details on the data under consideration, describe the challenges and issues we observed that motivated our work, and present the specific approaches and methods we used to clean and standardize the image data and extract labeling information. Further, we discuss ways to increase the efficiency of the process and the lessons learned. Research ideas on automating the process with ML-driven techniques are also presented and discussed. Our intent in reporting and discussing this work in detail is to provide insights into automating or, at minimum, increasing the efficiency of these critical yet often under-reported processes.