The Identification of Breast Cancer Subtypes by Raman Spectroscopy Integrated With Machine Learning Algorithms: Analyzing the Influence of Baseline
Authors: Chao Yang, Kaisaier Aizezi, Juan Li, Xiaoting Wang, Fengling Li, Wen Lei, Jingjing Xia, Ayitila Maimaitijiang
Journal of Raman Spectroscopy, vol. 56, no. 7, pp. 556-566. Published 2025-03-24. Journal Article.
DOI: 10.1002/jrs.6799
URL: https://onlinelibrary.wiley.com/doi/10.1002/jrs.6799
Impact factor: 2.4. JCR: Q2 (Spectroscopy). Category: Chemistry, Region 3.
Citation count: 0
Abstract
How the baseline of a Raman spectrum affects data models has remained unexplored. In this research, we used three spectral datasets (raw, preprocessed, and baseline data) to construct identification models for breast cancer molecular subtypes with four machine learning algorithms, and analyzed the influence of baseline data on model performance. In the identification models, regardless of whether they pertained to normal or breast cancer cells, preprocessed data consistently yielded the best performance, followed by raw data and then baseline data. Although baseline data gave the worst classification performance, when coupled with an artificial neural network it still attained a recognition accuracy of approximately 92.50 ± 5.30% in the binary classification and 90.60 ± 1.52% in the five-class classification. These results suggest that baseline data carry information that contributes notably to model performance. Looking ahead, the concept of food by-product processing could be borrowed to maximize the utilization of baseline data. Furthermore, when integrated with feature visualization strategies, the UVE-SPA and ICO approaches, which used only 30 and 258 variables, respectively, yielded model results comparable to those of the preprocessed data (858 variables), attaining an accuracy of 96.00 ± 1.87%. This underscores the pivotal role of the selected Raman spectral regions in distinguishing breast cancer molecular subtypes. Beyond the standard protein, lipid, and nucleic acid regions, the selected features encompassed cysteine, phenylalanine, and carotenoid, all of which, according to established research, are significant in the development and progression of cancer. This project examined the impact of the Raman baseline on model outcomes, providing data to improve future Raman spectroscopy modeling and prompting discussion of the untapped potential of baseline data.
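To make the three-dataset idea concrete, the following is a minimal Python sketch of how one raw spectrum might be split into a baseline part and a baseline-corrected ("preprocessed") part, and how fold accuracies might be summarized as mean ± standard deviation. Everything in it is an assumption for illustration: the abstract does not name the baseline estimation method, so a simple rolling-minimum estimator is swapped in here, and the function names (`rolling_min_baseline`, `split_datasets`, `summarize_accuracy`), the synthetic spectrum, and the fold accuracies are all hypothetical and are not taken from the paper.

```python
import math
import statistics


def rolling_min_baseline(spectrum, half_window):
    """Estimate a slowly varying baseline (assumed method, not the paper's).

    Step 1 takes the local minimum around each point, which follows the broad
    background and ignores narrow peaks. Step 2 smooths the result with a
    rolling mean over the same window.
    """
    n = len(spectrum)
    minima = [
        min(spectrum[max(0, i - half_window): i + half_window + 1])
        for i in range(n)
    ]
    return [
        statistics.fmean(minima[max(0, i - half_window): i + half_window + 1])
        for i in range(n)
    ]


def split_datasets(raw, half_window=25):
    """Return the three views of one spectrum: raw, baseline, preprocessed."""
    baseline = rolling_min_baseline(raw, half_window)
    preprocessed = [r - b for r, b in zip(raw, baseline)]
    return {"raw": list(raw), "baseline": baseline, "preprocessed": preprocessed}


def summarize_accuracy(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    return statistics.fmean(fold_accuracies), statistics.stdev(fold_accuracies)


def synthetic_spectrum(n_points=300):
    """Hypothetical spectrum: sloping background plus one Gaussian peak."""
    background = [50.0 + 0.1 * x for x in range(n_points)]
    peak = [
        40.0 * math.exp(-((x - 150) ** 2) / (2 * 4.0 ** 2))
        for x in range(n_points)
    ]
    return [b + p for b, p in zip(background, peak)]


raw = synthetic_spectrum()
datasets = split_datasets(raw)

# Made-up fold accuracies for illustration only, not results from the paper.
mean_acc, std_acc = summarize_accuracy([0.90, 0.95, 0.92, 0.94, 0.91])
print(f"accuracy: {100 * mean_acc:.2f} ± {100 * std_acc:.2f}%")
```

In this sketch each of the three lists in `datasets` would be fed to the same classifiers, so that any accuracy difference can be attributed to the data view rather than to the model. The rolling-minimum estimator is chosen only because it needs no external libraries; a polynomial or penalized least-squares baseline fit could be substituted without changing the surrounding structure.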
About the journal:
The Journal of Raman Spectroscopy is an international journal dedicated to the publication of original research at the cutting edge of all areas of science and technology related to Raman spectroscopy. The journal seeks to be the central forum for documenting the evolution of the broadly-defined field of Raman spectroscopy that includes an increasing number of rapidly developing techniques and an ever-widening array of interdisciplinary applications.
Such topics include time-resolved, coherent and non-linear Raman spectroscopies, nanostructure-based surface-enhanced and tip-enhanced Raman spectroscopies of molecules, resonance Raman to investigate the structure-function relationships and dynamics of biological molecules, linear and nonlinear Raman imaging and microscopy, biomedical applications of Raman, theoretical formalism and advances in quantum computational methodology of all forms of Raman scattering, Raman spectroscopy in archaeology and art, advances in remote Raman sensing and industrial applications, and Raman optical activity of all classes of chiral molecules.