Kazakhstani HER2 breast cancer digital image dataset: The ADEL dataset.

Impact factor: 1.4; JCR: Q3 (Multidisciplinary Sciences)

Data in Brief. Pub Date: 2025-09-11. eCollection Date: 2025-10-01. DOI: 10.1016/j.dib.2025.112052

Gauhar Dunenova, Aidos Sarsembayev, Alexandr Ivankov, Dilyara Kaidarova, Zhanna Kalmatayeva, Elvira Satbayeva, Natalya Glushkova

Data in Brief, volume 62, article 112052. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12478050/pdf/

Citations: 0

Abstract

Breast cancer remains a leading cause of cancer-related mortality among women worldwide, and HER2-positive subtypes require precise diagnosis to guide targeted therapy. Digital pathology and AI-based tools offer promising solutions, but their development depends heavily on high-quality labeled or annotated digital datasets. In this study, we present a dataset of digital images of breast cancer tissue samples with immunohistochemical expression of human epidermal growth factor receptor 2 (HER2) in classes 0, 1+, 2+, and 3+.

Breast cancer tissue samples were formalin-fixed and paraffin-embedded (FFPE), followed by the preparation of paraffin blocks and 5-µm sections. Immunohistochemical staining was performed on a Ventana Benchmark Ultra automated immunostainer with PATHWAY anti-HER2/neu (4B5) rabbit monoclonal antibodies and the ULTRA VIEW detection system. Digital images were acquired with a fully automated digital system (KFB PRO 120 scanner) at INVIVO LLP at 40x magnification with one focusing layer. The resulting files range from 50 MB to 2 GB, depending on the size of the tissue sample fixed on the original slide.

The dataset consists of 418 subfolders, each corresponding to one source image and containing a number of tiles that depends on the size of that image. The original images were preprocessed with a conversion script that transformed SVS files into JPEG sub-images. A non-overlapping sliding-window approach, intended to suit machine learning applications, was used to generate the sub-images: a square window of 1000 × 1000 pixels cropped each sub-image at a 1:1 aspect ratio. The stride of the sliding window was set to a value that was a multiple of the image resolution (as determined during preprocessing). As a result, the number of sub-images generated from each original SVS image varies with its size.

Clinical labeling of the data was provided by reference laboratory pathologists with expertise in advanced oncological morphology evaluations. The dataset supports training and validation of machine learning models for the diagnosis, recognition, and classification of breast cancer using the available labeling, and it can also serve educational purposes for residents and pathologists.
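The tiling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' conversion script: the function names are hypothetical, the stride is assumed to equal the 1000-pixel window (the simplest non-overlapping case), and partial tiles at the right and bottom edges are assumed to be dropped, which the abstract does not state. Reading pixels from an SVS file would additionally need a slide-reading library such as OpenSlide; only the window coordinates are computed here.

```python
from typing import Iterator, Tuple

TILE = 1000  # window edge in pixels, as stated in the abstract


def tile_origins(width: int, height: int,
                 tile: int = TILE, stride: int = TILE) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) top-left corners of full square tiles, row by row.

    Assumption: tiles that would extend past the image border are skipped.
    """
    if stride < tile:
        # A stride smaller than the window would make neighbouring tiles overlap.
        raise ValueError("stride must be >= tile for non-overlapping windows")
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            yield (x, y)


def count_tiles(width: int, height: int,
                tile: int = TILE, stride: int = TILE) -> int:
    """Number of full tiles produced for an image of the given size."""
    return sum(1 for _ in tile_origins(width, height, tile, stride))
```

For a hypothetical 2500 × 3200 pixel slide, count_tiles(2500, 3200) gives 2 columns × 3 rows = 6 tiles, which shows why the number of sub-images per subfolder varies with the size of the source image.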

Source journal

Data in Brief (Multidisciplinary Sciences)

CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days

About the journal: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.