S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search
Gengmo Zhou, Zhen Wang, Feng Yu, Guolin Ke, Zhewei Wei, Zhifeng Gao
arXiv - QuanBio - Biomolecules, 2024-08-27. DOI: arxiv-2409.07462

Virtual screening is an essential technique in the early phases of drug
discovery, aimed at identifying promising drug candidates from vast molecular
libraries. Recently, ligand-based virtual screening has garnered significant
attention due to its efficacy in conducting extensive database screenings
without relying on specific protein-binding site information. Obtaining binding
affinity data for complexes is highly expensive, resulting in a limited amount
of available data that covers a relatively small chemical space. Moreover,
these datasets contain a significant amount of inconsistent noise. It is
challenging to identify an inductive bias that consistently maintains the
integrity of molecular activity during data augmentation. To tackle these
challenges, we propose S-MolSearch, to our knowledge the first framework that
leverages molecular 3D information and affinity information in semi-supervised
contrastive learning for ligand-based virtual screening. Drawing on the
principles of inverse optimal transport, S-MolSearch efficiently processes both
labeled and unlabeled data, training molecular structural encoders while
generating soft labels for the unlabeled data. This design allows S-MolSearch
to adaptively utilize unlabeled data within the learning process. Empirically,
S-MolSearch demonstrates superior performance on the widely used benchmarks
LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual
screening methods in enrichment factor at the 0.5%, 1%, and 5% thresholds.
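
The abstract reports results as enrichment factors but does not spell out the formula. As a minimal sketch, assuming the commonly used definition (the hit rate among the top-ranked fraction of the library divided by the hit rate of the whole library), the metric could be computed as follows. The function name `enrichment_factor`, its parameters, and the toy data are hypothetical and not taken from the paper; tie handling and the rounding of the cutoff vary between benchmark implementations.

```python
import math


def enrichment_factor(scores, labels, fraction):
    """Enrichment factor at `fraction` (e.g. 0.01 for EF 1%).

    Assumed definition: (actives in top fraction / size of top fraction)
    divided by (total actives / library size).
    `scores`: higher means more likely active; `labels`: 1 = active, 0 = inactive.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    # Size of the screened subset; rounding up is one possible convention.
    n_top = max(1, math.ceil(fraction * n_total))
    # Rank molecules from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


# Toy library (hypothetical): 200 molecules already sorted by score,
# with actives at ranks 0, 1, 50 and 150.
scores = [200 - i for i in range(200)]
labels = [1 if i in (0, 1, 50, 150) else 0 for i in range(200)]

ef_1 = enrichment_factor(scores, labels, 0.01)
ef_5 = enrichment_factor(scores, labels, 0.05)
print(f"EF 1%: {ef_1:.1f}, EF 5%: {ef_5:.1f}")
```

In this toy library the top 1% (2 molecules) contains 2 of the 4 actives, so the enrichment factor is 50; at 5% (10 molecules) the same 2 actives give an enrichment factor of 10. A value of 1 would correspond to random ranking.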