S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search
Gengmo Zhou, Zhen Wang, Feng Yu, Guolin Ke, Zhewei Wei, Zhifeng Gao
arXiv:2409.07462, arXiv - QuanBio - Biomolecules, published 2024-08-27
Abstract
Virtual screening is an essential technique in the early phases of drug
discovery, aimed at identifying promising drug candidates from vast molecular
libraries. Recently, ligand-based virtual screening has garnered significant
attention due to its efficacy in conducting extensive database screenings
without relying on specific protein-binding site information. Obtaining binding
affinity data for complexes is highly expensive, resulting in a limited amount
of available data that covers a relatively small chemical space. Moreover,
these datasets contain a significant amount of inconsistent noise. It is
challenging to identify an inductive bias that consistently maintains the
integrity of molecular activity during data augmentation. To tackle these
challenges, we propose S-MolSearch, to our knowledge the first framework that
leverages molecular 3D information and affinity information in semi-supervised
contrastive learning for ligand-based virtual screening. Drawing on the
principles of inverse optimal transport, S-MolSearch efficiently processes both
labeled and unlabeled data, training molecular structural encoders while
generating soft labels for the unlabeled data. This design allows S-MolSearch
to adaptively utilize unlabeled data within the learning process. Empirically,
S-MolSearch demonstrates superior performance on the widely used benchmarks
LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual
screening methods in enrichment factor at the 0.5%, 1%, and 5% thresholds.
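
The enrichment factor (EF) reported above can be illustrated with a minimal sketch, assuming the common definition: the fraction of actives among the top-ranked x% of the library, divided by the fraction of actives in the whole library. The function name, the rounding rule for the size of the top slice, and the toy data are illustrative assumptions and are not taken from the paper.

```python
def enrichment_factor(scores, labels, fraction):
    """Enrichment factor at a given top fraction (e.g. 0.01 for EF 1%).

    scores   -- predicted similarity/activity scores, higher means more likely active
    labels   -- 1 for a known active, 0 for a decoy/inactive
    fraction -- top fraction of the ranked library to inspect
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")

    # Assumption: the top slice size is rounded to the nearest integer,
    # with a minimum of one molecule. Other conventions (floor, ceil) exist.
    n_top = max(1, round(fraction * n_total))

    # Rank molecules by score, best first, and count actives in the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    # Hit rate in the top slice relative to the hit rate of random selection.
    return (hits_top / n_top) / (n_actives / n_total)


# Toy library of 10 molecules with 2 actives (hypothetical data).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# Top 20% = 2 molecules, 1 of them active: (1/2) / (2/10) = 2.5
ef_20 = enrichment_factor(scores, labels, 0.2)
```

An EF of 1 corresponds to random selection, so values above 1 mean the ranking concentrates actives near the top of the list. At the 0.5%, 1%, and 5% thresholds the same function would be called with `fraction` set to 0.005, 0.01, and 0.05.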