Open Raman spectral library for biomolecule identification
Marcelo Terán, José Javier Ruiz, Pablo Loza-Alvarez, David Masip, David Merino
Chemometrics and Intelligent Laboratory Systems, volume 264, Article 105476. Published 2025-06-27. DOI: 10.1016/j.chemolab.2025.105476
https://www.sciencedirect.com/science/article/pii/S0169743925001613
Abstract
Raman spectroscopy combined with Multivariate Curve Resolution (MCR) analysis is widely used in biomedical applications. However, assigning biomolecules to the components extracted by MCR can be challenging because no open Raman spectral library for biomolecules exists. Raman experts typically identify unmixed component spectra as biomolecules by comparing them with reference spectra from the literature, a process that can be time-consuming and subject to human bias. In this work, we created an open Raman spectral database of 140 biomolecules by implementing an algorithm that digitizes the spectral plots and the most relevant peaks from articles available in the literature. We also implemented two search algorithms. The first uses the spectral linear kernel or cosine similarity on the full spectra. The second is based on peak matching: it relies on the intersection over union of the matched peaks, with a defined tolerance for deciding when two peaks match. Our experimental validation showed 100 % top-10 accuracy in molecule identification (e.g. collagen) and 100 % accuracy in molecule-type identification (e.g. protein), both on pure biomolecule measurements and when replicating results from prior studies. Objectively narrowing the identification to the 10 top-ranked candidates and providing type identification can significantly reduce both the time required for visual identification and the need to purchase reference component samples. We publish our spectral library as an open-source tool so that the research community can expand it collaboratively. It is available at: https://github.com/mteranm/ramanbiolib.
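The two search strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the ramanbiolib API: the function names, the 5 cm^-1 default tolerance, the one-to-one greedy matching rule, and the assumption that all spectra are sampled on a common wavenumber grid are hypothetical choices made only for illustration.

```python
import numpy as np


def cosine_similarity(query, reference):
    """Cosine similarity between two spectra given as intensity vectors.

    Assumption: both spectra are sampled on the same wavenumber grid.
    """
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    # An all-zero spectrum has no direction, so define its similarity as 0.
    return float(q @ r / denom) if denom > 0 else 0.0


def peak_iou(query_peaks, reference_peaks, tol=5.0):
    """Intersection over union of matched peak positions (in cm^-1).

    Two peaks match when they differ by at most `tol`. Each peak is
    matched at most once (an illustrative choice, not the paper's rule).
    """
    q = sorted(query_peaks)
    r = sorted(reference_peaks)
    i = j = matched = 0
    # Two-pointer sweep over the sorted peak lists.
    while i < len(q) and j < len(r):
        if abs(q[i] - r[j]) <= tol:
            matched += 1
            i += 1
            j += 1
        elif q[i] < r[j]:
            i += 1
        else:
            j += 1
    union = len(q) + len(r) - matched
    return matched / union if union else 0.0


def rank_candidates(library, score_fn, top_k=10):
    """Return the top_k (name, score) pairs, highest score first.

    `library` maps a molecule name to its reference entry;
    `score_fn` maps a reference entry to a similarity score.
    """
    scored = [(name, score_fn(entry)) for name, entry in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy peak lists, invented for the example (not real reference data).
    toy_library = {
        "molecule_a": [1003, 1655, 2900],
        "molecule_b": [700, 1100, 1300],
    }
    query = [1000, 1450, 1660]
    for name, score in rank_candidates(toy_library, lambda p: peak_iou(query, p)):
        print(name, round(score, 3))
```

Returning a ranked shortlist rather than a single best hit mirrors the top-10 evaluation in the abstract: the search narrows the candidates, and the final assignment remains a visual check by the analyst.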
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.