Towards a more accurate and reliable evaluation of machine learning protein-protein interaction prediction model performance in the presence of unavoidable dataset biases.
Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P Vieira, Jorge Vieira, Hugo López-Fernández
{"title":"Towards a more accurate and reliable evaluation of machine learning protein-protein interaction prediction model performance in the presence of unavoidable dataset biases.","authors":"Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P Vieira, Jorge Vieira, Hugo López-Fernández","doi":"10.1515/jib-2024-0054","DOIUrl":null,"url":null,"abstract":"<p><p>The characterization of protein-protein interactions (PPIs) is fundamental to understand cellular functions. Although machine learning methods in this task have historically reported prediction accuracies up to 95 %, including those only using raw protein sequences, it has been highlighted that this could be overestimated due to the use of random splits and metrics that do not take into account potential biases in the datasets. Here, we propose a per-protein utility metric, pp_MCC, able to show a drop in the performance in both random and unseen-protein splits scenarios. We tested ML models based on sequence embeddings. The pp_MCC metric evidences a reduced performance even in a random split, reaching levels similar to those shown by the raw MCC metric computed over an unseen protein split, and drops even further when the pp_MCC is used in an unseen protein split scenario. Thus, the metric is able to give a more realistic performance estimation while allowing to use random splits, which could be interesting for more protein-centric studies. Given the low adjusted performance obtained, there seems to be room for improvement when using only primary sequence information, suggesting the need of inclusion of complementary protein data, accompanied with the use of the pp_MCC metric.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Integrative Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1515/jib-2024-0054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
The characterization of protein-protein interactions (PPIs) is fundamental to understand cellular functions. Although machine learning methods in this task have historically reported prediction accuracies up to 95 %, including those only using raw protein sequences, it has been highlighted that this could be overestimated due to the use of random splits and metrics that do not take into account potential biases in the datasets. Here, we propose a per-protein utility metric, pp_MCC, able to show a drop in the performance in both random and unseen-protein splits scenarios. We tested ML models based on sequence embeddings. The pp_MCC metric evidences a reduced performance even in a random split, reaching levels similar to those shown by the raw MCC metric computed over an unseen protein split, and drops even further when the pp_MCC is used in an unseen protein split scenario. Thus, the metric is able to give a more realistic performance estimation while allowing to use random splits, which could be interesting for more protein-centric studies. Given the low adjusted performance obtained, there seems to be room for improvement when using only primary sequence information, suggesting the need of inclusion of complementary protein data, accompanied with the use of the pp_MCC metric.