Compound Figure Separation of Biomedical Images: Mining Large Datasets for Self-supervised Learning.

The journal of machine learning for biomedical imaging. Pub Date: 2022-08-01. Epub Date: 2022-09-04.

Tianyuan Yao, Chang Qu, Jun Long, Quan Liu, Ruining Deng, Yuanhan Tian, Jiachen Xu, Aadarsh Jha, Zuhayr Asad, Shunxing Bao, Mengyang Zhao, Agnes B Fogo, Bennett A Landman, Haichun Yang, Catie Chang, Yuankai Huo



Abstract

With the rapid development of self-supervised learning (e.g., contrastive learning), the value of large-scale image collections, even without annotations, for training more generalizable AI models has been widely recognized in medical image analysis. However, collecting task-specific unannotated data at scale can be challenging for individual labs. Existing online resources, such as digital books, publications, and search engines, offer a new source of large-scale images. However, published images in healthcare (e.g., radiology and pathology) include a considerable number of compound figures with subplots. To extract and separate compound figures into usable individual images for downstream learning, we propose a simple compound figure separation (SimCFS) framework that does not use the traditionally required detection bounding box annotations and that introduces a new loss function and a hard case simulation. Our technical contribution is four-fold: (1) we introduce a simulation-based training framework that minimizes the need for resource-intensive bounding box annotations; (2) we propose a new side loss that is optimized for compound figure separation; (3) we propose an intra-class image augmentation method to simulate hard cases; and (4) to the best of our knowledge, this is the first study that evaluates the efficacy of leveraging self-supervised learning with compound image separation. In our experiments, the proposed SimCFS achieved state-of-the-art performance on the ImageCLEF 2016 Compound Figure Separation Database. A self-supervised model pretrained with a contrastive learning algorithm on the large-scale mined figures improved the accuracy of downstream image classification tasks. The source code of SimCFS is made publicly available at https://github.com/hrlblab/ImageSeperation.
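The idea behind simulation-based training, as the abstract describes it, is that compound figures composed from individual images come with their bounding boxes for free. The sketch below illustrates that idea only; it is not the SimCFS implementation. The function name `simulate_compound_figure`, the fixed grid layout, the cell size, and the gap width are all assumptions made for illustration, and the random arrays stand in for real single-panel images.

```python
import numpy as np


def simulate_compound_figure(images, rows, cols, cell=64, gap=4, background=255):
    """Tile grayscale images into a rows x cols grid (hypothetical helper).

    Returns the composed canvas and one (x_min, y_min, x_max, y_max) box per
    panel, so the detection labels are produced by construction.
    """
    if len(images) != rows * cols:
        raise ValueError("need exactly rows * cols images")

    height = rows * cell + (rows + 1) * gap
    width = cols * cell + (cols + 1) * gap
    canvas = np.full((height, width), background, dtype=np.uint8)

    boxes = []
    for index, image in enumerate(images):
        if image.shape != (cell, cell):
            raise ValueError("every image must have shape (cell, cell)")
        row, col = divmod(index, cols)
        y0 = gap + row * (cell + gap)
        x0 = gap + col * (cell + gap)
        canvas[y0:y0 + cell, x0:x0 + cell] = image
        boxes.append((x0, y0, x0 + cell, y0 + cell))
    return canvas, boxes


# Stand-in panels: random noise instead of real radiology or pathology images.
rng = np.random.default_rng(0)
panels = [rng.integers(0, 200, size=(64, 64), dtype=np.uint8) for _ in range(6)]
figure, boxes = simulate_compound_figure(panels, rows=2, cols=3)
```

Each `(figure, boxes)` pair could then serve as a training sample for any off-the-shelf detector. A realistic simulator would presumably vary layouts, panel sizes, and spacing, which this fixed-grid sketch does not attempt.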


