{"title":"A Hitchhiker's Guide to Deep Chemical Language Processing for Bioactivity Prediction","authors":"Rıza Özçelik, Francesca Grisoni","doi":"arxiv-2407.12152","DOIUrl":null,"url":null,"abstract":"Deep learning has significantly accelerated drug discovery, with 'chemical\nlanguage' processing (CLP) emerging as a prominent approach. CLP learns from\nmolecular string representations (e.g., Simplified Molecular Input Line Entry\nSystems [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods\nakin to natural language processing. Despite their growing importance, training\npredictive CLP models is far from trivial, as it involves many 'bells and\nwhistles'. Here, we analyze the key elements of CLP training, to provide\nguidelines for newcomers and experts alike. Our study spans three neural\nnetwork architectures, two string representations, three embedding strategies,\nacross ten bioactivity datasets, for both classification and regression\npurposes. This 'hitchhiker's guide' not only underscores the importance of\ncertain methodological choices, but it also equips researchers with practical\nrecommendations on ideal choices, e.g., in terms of neural network\narchitectures, molecular representations, and hyperparameter optimization.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"35 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.12152","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP learns from molecular string representations (e.g., the Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, predictive CLP models are far from trivial to train, as the process involves many 'bells and whistles'. Here, we analyze the key elements of CLP training to provide guidelines for newcomers and experts alike. Our study spans three neural network architectures, two string representations, and three embedding strategies, across ten bioactivity datasets, for both classification and regression tasks. This 'hitchhiker's guide' not only underscores the importance of certain methodological choices, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
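
For readers unfamiliar with the two notations named above, the minimal sketch below shows how the same molecule can be expressed as SMILES and as SELFIES. It assumes the open-source `rdkit` and `selfies` Python packages and uses aspirin as an illustrative molecule; this is a didactic example, not the paper's own code or datasets.

```python
# Minimal illustration of the two string representations discussed in
# the abstract, assuming the open-source `rdkit` and `selfies` packages.
from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, chosen only as an example

# Canonicalize the SMILES string with RDKit, so the same molecule
# always maps to the same string.
canonical = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

# Convert SMILES to SELFIES and back. SELFIES is designed so that
# every string decodes to a syntactically valid molecule.
selfies_str = sf.encoder(canonical)
roundtrip = sf.decoder(selfies_str)

print(canonical)    # canonical SMILES, e.g. CC(=O)Oc1ccccc1C(=O)O
print(selfies_str)  # token-based SELFIES, e.g. [C][C][=Branch1]...
print(roundtrip)    # SMILES recovered from the SELFIES string
```

Either string can then be tokenized and fed to a sequence model, which is the setting the study benchmarks across architectures, representations, and embedding strategies.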