CPIA dataset: a large-scale comprehensive pathological image analysis dataset for self-supervised learning pre-training

IF 4.9 | Zone 2 (Medicine) | JCR Q1, ENGINEERING, BIOMEDICAL
Nan Ying, Yanli Lei, Tianyi Zhang, Shangqing Lyu, Sicheng Chen, Zeyu Liu, Yunlu Feng, Yu Zhao, Guanglei Zhang
Biomedical Signal Processing and Control, vol. 110, Article 108148. Published 2025-05-29. Journal Article.
DOI: 10.1016/j.bspc.2025.108148
https://www.sciencedirect.com/science/article/pii/S1746809425006597
Citations: 0

Abstract

Pathological image analysis is a crucial field in computer-aided diagnosis. Transfer learning from models initialized on natural images has improved downstream performance on pathological tasks. However, the lack of sophisticated domain-specific pathological initialization limits the potential of such models. Self-supervised learning (SSL) enables pre-training without sample-level labels, overcoming the challenge of expensive annotation. The field therefore calls for a comprehensive dataset, analogous to ImageNet in computer vision. This work introduces a large-scale comprehensive pathological image analysis (CPIA) dataset for SSL pre-training. The CPIA dataset contains 148,962,586 images covering over 48 organs/tissues and approximately 100 kinds of diseases, and comprises two main data types: whole slide images (WSIs) and region-of-interest (ROI) images. Furthermore, we establish a standard multi-scale pathological data processing workflow informed by the diagnostic habits of senior pathologists. The CPIA dataset facilitates comprehensive pathological understanding and enables exploratory pattern discovery. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) SSL pre-training baselines are trained and evaluated on downstream tasks. The CPIA dataset information and code are available at https://github.com/zhanglab2021/CPIA_Dataset.
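The abstract mentions a multi-scale processing workflow but gives no details of it. As a rough illustration only, and not the authors' pipeline, the sketch below shows one common way a large slide-sized image can be tiled at several scales. The function name, the patch size of 224, and the downsampling factors (1, 4, 16) are all hypothetical choices made for this example.

```python
from typing import Dict, Iterator, NamedTuple, Tuple


class Tile(NamedTuple):
    scale: int  # downsampling factor relative to full resolution (hypothetical)
    x: int      # left edge, in full-resolution pixel coordinates
    y: int      # top edge, in full-resolution pixel coordinates
    span: int   # side length covered at full resolution


def multiscale_tiles(width: int, height: int, patch: int = 224,
                     scales: Tuple[int, ...] = (1, 4, 16)) -> Iterator[Tile]:
    """Yield non-overlapping square tiles covering the image at each scale.

    A tile at scale s covers patch * s full-resolution pixels per side, so
    downsampling it by s gives a patch x patch image. Partial tiles at the
    right and bottom edges are dropped.
    """
    for s in scales:
        span = patch * s
        for y in range(0, height - span + 1, span):
            for x in range(0, width - span + 1, span):
                yield Tile(s, x, y, span)


def count_by_scale(width: int, height: int, patch: int = 224,
                   scales: Tuple[int, ...] = (1, 4, 16)) -> Dict[int, int]:
    """Count how many tiles each scale contributes for one image."""
    counts = {s: 0 for s in scales}
    for tile in multiscale_tiles(width, height, patch, scales):
        counts[tile.scale] += 1
    return counts


if __name__ == "__main__":
    # A 7168 x 3584 image is 32 x 16 patches of 224 pixels at full resolution.
    print(count_by_scale(7168, 3584))  # {1: 512, 4: 32, 16: 2}
```

Under these assumed settings, coarser scales contribute far fewer tiles per slide than the finest one, which is one reason a multi-scale dataset's image count is dominated by its highest-magnification patches.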
Source journal
Biomedical Signal Processing and Control (Engineering & Technology: Engineering, Biomedical)
CiteScore: 9.80
Self-citation rate: 13.70%
Articles published: 822
Review time: 4 months
Journal description: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.