CPIA dataset: a large-scale comprehensive pathological image analysis dataset for self-supervised learning pre-training

IF 4.9 | Zone 2 (Medicine) | JCR Q1: ENGINEERING, BIOMEDICAL

Biomedical Signal Processing and Control Pub Date : 2025-05-29 DOI:10.1016/j.bspc.2025.108148

Nan Ying , Yanli Lei , Tianyi Zhang , Shangqing Lyu , Sicheng Chen , Zeyu Liu , Yunlu Feng , Yu Zhao , Guanglei Zhang

Biomedical Signal Processing and Control, volume 110, Article 108148. Journal Article. https://www.sciencedirect.com/science/article/pii/S1746809425006597

Abstract

Pathological image analysis is a crucial field in computer-aided diagnosis. Transfer learning with models initialized on natural images has improved performance on downstream pathological tasks. However, the lack of sophisticated domain-specific initialization for pathology limits the potential of these models. Self-supervised learning (SSL) enables pre-training without sample-level labels, overcoming the challenge of expensive annotation. The field therefore calls for a comprehensive dataset, similar to ImageNet in computer vision. This work introduces a large-scale comprehensive pathological image analysis (CPIA) dataset for SSL pre-training. The CPIA dataset contains 148,962,586 images covering over 48 organs/tissues and approximately 100 kinds of diseases, and includes two main data types: whole slide images (WSIs) and region-of-interest (ROI) images. Furthermore, we establish a standard multi-scale pathological data processing workflow informed by the diagnostic habits of senior pathologists. The CPIA dataset facilitates comprehensive pathological understanding and enables exploratory pattern discovery. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) baselines for SSL pre-training and downstream evaluation are conducted. The CPIA dataset information and code are available at https://github.com/zhanglab2021/CPIA_Dataset.
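
The abstract mentions a multi-scale processing workflow but gives no details. As a rough illustration of the general idea only, the sketch below tiles one image region at several field-of-view sizes and shrinks every tile to a common output size. It is not the authors' pipeline: the function names, the patch sizes (512, 256, 128), the output size, and the block-averaging resize are all assumptions made for this example, and a random array stands in for real slide data.

```python
import numpy as np

def tile_image(image, patch_size, stride=None):
    """Yield (top, left, patch) for every full patch_size x patch_size tile."""
    stride = stride or patch_size
    height, width = image.shape[:2]
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            yield top, left, image[top:top + patch_size, left:left + patch_size]

def downsample(patch, factor):
    """Shrink an H x W x C patch by an integer factor using block averaging."""
    h, w = patch.shape[:2]
    h, w = h - h % factor, w - w % factor
    blocks = patch[:h, :w].reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3)).astype(patch.dtype)

def multi_scale_patches(image, patch_sizes=(512, 256, 128), out_size=128):
    """Tile at several field-of-view sizes (hypothetical values) and
    resize every tile to out_size x out_size."""
    result = {}
    for size in patch_sizes:
        factor = size // out_size
        result[size] = [downsample(p, factor) for _, _, p in tile_image(image, size)]
    return result

# Synthetic stand-in for an RGB region cropped from a slide (assumption).
region = np.random.default_rng(0).integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
patches = multi_scale_patches(region)
counts = {size: len(items) for size, items in patches.items()}
```

Larger patch sizes give fewer tiles with more tissue context each, and smaller sizes give more tiles with finer detail. Resizing all of them to one output size lets a single SSL backbone consume every scale.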

Source journal

Biomedical Signal Processing and Control (Engineering & Technology: Engineering, Biomedical)

CiteScore: 9.80
Self-citation rate: 13.70%
Publication volume: 822
Review time: 4 months

About the journal: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.