Optimizing machine-learning models for mutagenicity prediction through better feature selection
Nicolas K Shinada, Naoki Koyama, Megumi Ikemori, Tomoki Nishioka, Seiji Hitaoka, Atsushi Hakura, Shoji Asakura, Yukiko Matsuoka, Sucheendra K Palaniappan
Mutagenesis, pages 191-202. Published 2022-10-26. DOI: 10.1093/mutage/geac010 (https://doi.org/10.1093/mutage/geac010)
Citation count: 1
Abstract
Assessing a compound's mutagenicity using machine learning is an important activity in the drug discovery and development process. Traditional methods of mutagenicity detection, such as the Ames test, are expensive, time-consuming, and labor-intensive. In this context, in silico methods that predict a compound's mutagenicity with high accuracy are important. Machine-learning (ML) models have increasingly been proposed to improve the accuracy of mutagenicity prediction. Although these models are used in practice, there is scope to improve their accuracy further. We hypothesize that choosing the right features to train the model can lead to better accuracy. We systematically consider and evaluate a combination of novel structural and molecular features that have the greatest impact on model accuracy. We rigorously evaluate these features against multiple classification models, from classical ML models to deep neural network models. The performance of the models was assessed using 5- and 10-fold cross-validation. We show that our approach, which uses the molecule structure, molecular properties, and structural alerts as feature sets, outperforms state-of-the-art methods for mutagenicity prediction on the Hansen et al. benchmark dataset, with an area under the receiver operating characteristic curve of 0.93. More importantly, our framework shows how combining features can improve model accuracy.
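The evaluation idea in the abstract, combining several feature sets and scoring them with 5- and 10-fold cross-validation by area under the ROC curve, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the three blocks only stand in for structure, property, and structural-alert features, and the nearest-centroid scorer is a toy substitute, not one of the models evaluated in the paper. Function names such as `make_toy_data` and `cross_val_auc` are hypothetical.

```python
import random
from statistics import mean


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in data: three feature blocks with weak class signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]  # balanced labels: 0 = non-mutagen, 1 = mutagen
    blocks = {"structure": [], "properties": [], "alerts": []}
    for label in y:
        # Binary "fingerprint-like" bits, slightly more likely for positives.
        blocks["structure"].append(
            [float(rng.random() < 0.4 + 0.2 * label) for _ in range(8)])
        # Continuous "property-like" values with a shifted mean for positives.
        blocks["properties"].append(
            [rng.gauss(0.5 * label, 1.0) for _ in range(4)])
        # Sparse binary "alert-like" flags, more frequent for positives.
        blocks["alerts"].append(
            [float(rng.random() < 0.1 + 0.3 * label) for _ in range(3)])
    return blocks, y


def combine(blocks, names):
    """Concatenate the chosen feature blocks row by row."""
    n = len(blocks[names[0]])
    return [sum((blocks[name][i] for name in names), []) for i in range(n)]


def roc_auc(labels, scores):
    """AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, k, seed=0):
    """Deal each class round-robin into k folds so every fold has both classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def centroid_scores(train_X, train_y, test_X):
    """Toy scorer: higher when a row is closer to the positive-class centroid."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c_pos = centroid([x for x, lab in zip(train_X, train_y) if lab == 1])
    c_neg = centroid([x for x, lab in zip(train_X, train_y) if lab == 0])
    return [sq_dist(x, c_neg) - sq_dist(x, c_pos) for x in test_X]


def cross_val_auc(X, y, k, seed=0):
    """Mean held-out AUC over k stratified folds."""
    aucs = []
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        scores = centroid_scores([X[i] for i in train],
                                 [y[i] for i in train],
                                 [X[i] for i in fold])
        aucs.append(roc_auc([y[i] for i in fold], scores))
    return mean(aucs)


blocks, y = make_toy_data()
feature_sets = (["structure"], ["properties"], ["alerts"],
                ["structure", "properties", "alerts"])
for names in feature_sets:
    X = combine(blocks, names)
    for k in (5, 10):
        print("+".join(names), f"{k}-fold AUC:", round(cross_val_auc(X, y, k), 3))
```

The sketch compares each block alone against the concatenation of all three. On real data, the blocks would come from a cheminformatics toolkit and the scorer would be replaced by a proper classifier, and features on different scales would typically be normalized before being combined.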
About the journal:
Mutagenesis is an international multi-disciplinary journal designed to bring together research aimed at the identification, characterization and elucidation of the mechanisms of action of physical, chemical and biological agents capable of producing genetic change in living organisms and the study of the consequences of such changes.