Learning from an Imbalanced and Limited Dataset and an Application to Medical Imaging

2019 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM); Pub Date: 2019-08-01; DOI: 10.1109/PACRIM47961.2019.8985057

Xiaoli Qin, F. Bui, Ha H. Nguyen


Citations: 2

Abstract



Learning from an Imbalanced and Limited Dataset and an Application to Medical Imaging

Chest X-rays (CXRs) are routinely acquired in medical imaging to diagnose lung diseases. For many patients, however, accurate and timely radiologic interpretation of the acquired CXRs is not always feasible because medical personnel and resources are limited. A computer-aided diagnosis (CAD) system based on machine learning would be an effective way to make disease diagnosis more efficient. In practice, though, it is challenging to obtain a sufficiently large, balanced, and annotated dataset of CXRs for effectively training a CAD system. In this paper, we present a comprehensive comparative study on learning from imbalanced and limited CXRs to detect pneumonia, tackling two main questions: (1) Is data sampling an effective method for improving the performance of learning models? (2) Are there quantifiable differences between learning models with different sampling techniques? With respect to data sampling, we investigate two general categories of techniques that modify an imbalanced dataset to deliver a balanced data distribution: (i) undersampling the majority class; and (ii) oversampling/augmentation of the minority class. With respect to learning models, we focus on support vector machines (SVMs) and deep convolutional neural networks (CNNs). Using a publicly available CXR dataset, we demonstrate that both SVM and CNN learning models exhibit improved performance when the data sampling strategy is properly selected.
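The two sampling categories named in the abstract can be illustrated with a minimal sketch. It assumes the simplest variants, random undersampling and random oversampling by duplication; the abstract does not say which specific techniques the authors used, so this is not their method. The function names, class labels, and file names are hypothetical, and image augmentation (also mentioned under category (ii)) is not shown.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(samples, labels):
    """Collect samples into one list per class label."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        out_samples.extend(rng.sample(items, target))  # without replacement
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        # Keep every original sample, then pad with random duplicates.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy imbalanced data (hypothetical): 8 majority-class vs 3 minority-class images.
samples = [f"img_{i}.png" for i in range(11)]
labels = ["majority"] * 8 + ["minority"] * 3

under_samples, under_labels = random_undersample(samples, labels)
over_samples, over_labels = random_oversample(samples, labels)

print(Counter(under_labels))  # 3 samples per class
print(Counter(over_labels))   # 8 samples per class
```

Undersampling discards majority-class data, which can hurt when the dataset is already limited; oversampling keeps all data but repeats minority samples, which may encourage overfitting. In either case the balancing would normally be applied to the training split only, so that evaluation reflects the original class distribution.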
