Haematology dimension reduction, a large scale application to regular care haematology data
Huibert-Jan Joosse, Chontira Chumsaeng-Reijers, Albert Huisman, Imo E Hoefer, Wouter W van Solinge, Saskia Haitjema, Bram van Es
BMC Medical Informatics and Decision Making, 25(1):75. Published 2025-02-12. DOI: 10.1186/s12911-025-02899-8 (https://doi.org/10.1186/s12911-025-02899-8)
Abstract
Background: The routine diagnostic process increasingly entails the processing of high-volume and high-dimensional data that cannot be directly visualised. This processing may introduce scaling issues that limit the use of these types of data in research as well as in integrated diagnostics in routine care. Here, we investigate whether we can use existing dimension reduction techniques to provide visualisations and analyses for a complete blood count (CBC) while maintaining representativeness of the original data. We considered over 3 million CBC measurements encompassing over 70 parameters of cell frequency, size and complexity from the UMC Utrecht UPOD database. We evaluated PCA as an example of a linear dimension reduction technique and UMAP, TriMap and PaCMAP as non-linear dimension reduction techniques. We assessed their technical performance using quality metrics for dimension reduction, and their biological representation by evaluating preservation of diurnal, age and sex patterns, cluster preservation and the identification of leukemia patients.
Results: We found that, for clinical hematology data, PCA performs systematically better than UMAP, TriMap and PaCMAP in representing the underlying data. Biological relevance was retained for periodicity in the data. However, we also observed a decrease in predictive performance of the reduced data for both age and sex, as well as an overestimation of clusters within the reduced data. Finally, we were able to identify the diverging patterns for leukemia patients after use of dimensionality reduction methods.
Conclusions: We conclude that for hematology data, the use of unsupervised dimension reduction techniques should be limited to data visualization applications, as implementing them in diagnostic pipelines may lead to decreased quality of integrated diagnostics in routine care.
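The abstract does not say which quality metrics were used to judge how well a reduced representation reflects the original data. As a minimal sketch of one common option, the snippet below reduces data with PCA and measures how many of each point's k nearest neighbours survive the reduction. Everything in it is an assumption made for illustration: the synthetic data stands in for CBC measurements, and the helper names `pca_reduce`, `knn_indices` and `neighbour_preservation` are hypothetical, not the authors' code.

```python
import numpy as np


def pca_reduce(X, n_components=2):
    """Project X (samples x features) onto its first principal components."""
    Xc = X - X.mean(axis=0)  # centre each feature
    # SVD of the centred data: rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def knn_indices(X, k):
    """Indices of the k nearest neighbours of every row (Euclidean, excluding self)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]


def neighbour_preservation(X_high, X_low, k=10):
    """Mean fraction of each point's k nearest neighbours kept after reduction."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlap)) / k


# Synthetic stand-in for CBC data (an assumption, not real measurements):
# 500 samples, 70 features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 70))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 70))

score_2d = neighbour_preservation(X, pca_reduce(X, 2))
score_3d = neighbour_preservation(X, pca_reduce(X, 3))
print(f"2 components: {score_2d:.2f}, 3 components: {score_3d:.2f}")
```

A score of 1 means every local neighbourhood is intact and lower values mean the reduction has rearranged nearby points. The same function could in principle score embeddings from UMAP, TriMap or PaCMAP, which would allow a like-for-like comparison of the kind the abstract describes.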
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.