Shreyas V, Jose Siguenza, Karan Bania, Bharath Ramsundar
{"title":"Open-Source Molecular Processing Pipeline for Generating Molecules","authors":"Shreyas V, Jose Siguenza, Karan Bania, Bharath Ramsundar","doi":"arxiv-2408.06261","DOIUrl":"https://doi.org/arxiv-2408.06261","url":null,"abstract":"Generative models for molecules have shown considerable promise for use in computational chemistry, but remain difficult to use for non-experts. For this reason, we introduce open-source infrastructure for easily building generative molecular models into the widely used DeepChem [Ramsundar et al., 2019] library with the aim of creating a robust and reusable molecular generation pipeline. In particular, we add high quality PyTorch [Paszke et al., 2019] implementations of the Molecular Generative Adversarial Networks (MolGAN) [Cao and Kipf, 2022] and Normalizing Flows [Papamakarios et al., 2021]. Our implementations show strong performance comparable with past work [Kuznetsov and Polykovskiy, 2021, Cao and Kipf, 2022].","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tianhao Yu, Cai Yao, Zhuorui Sun, Feng Shi, Lin Zhang, Kangjie Lyu, Xuan Bai, Andong Liu, Xicheng Zhang, Jiali Zou, Wenshou Wang, Chris Lai, Kai Wang
{"title":"LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library","authors":"Tianhao Yu, Cai Yao, Zhuorui Sun, Feng Shi, Lin Zhang, Kangjie Lyu, Xuan Bai, Andong Liu, Xicheng Zhang, Jiali Zou, Wenshou Wang, Chris Lai, Kai Wang","doi":"arxiv-2408.06150","DOIUrl":"https://doi.org/arxiv-2408.06150","url":null,"abstract":"In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing for organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the clever utilization of METiS's in-house de novo lipid library as well as the power of dry-wet lab integration.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhaoyu Liu, Jingxun Chen, Mingkun Xu, David H. Gracias, Ken-Tye Yong, Yuanyuan Wei, Ho-Pui Ho
{"title":"Programmable Lipid Nanoparticles for Precision Drug Delivery: A Four-Domain Model Perspective","authors":"Zhaoyu Liu, Jingxun Chen, Mingkun Xu, David H. Gracias, Ken-Tye Yong, Yuanyuan Wei, Ho-Pui Ho","doi":"arxiv-2408.05695","DOIUrl":"https://doi.org/arxiv-2408.05695","url":null,"abstract":"Programmable lipid nanoparticles (LNPs) offer precise spatiotemporal control over drug distribution and release, a critical advancement for treating complex diseases like cancer and genetic disorders. While existing reviews offer extensive insights into LNP development, this work introduces a novel model that dissects key components of LNP design, providing a framework to enhance the rational design of programmable LNPs. This review introduces a novel Four-Domain Model - Architecture, Interface, Payload, and Dispersal - providing a modular perspective that emphasizes the programmability of LNPs. We explore the dynamics between LNP components and their environment throughout their lifecycle, focusing on thermodynamic stability during synthesis, storage, delivery, and drug release. Through these four distinct but interconnected domains, we introduce the concept of input stimuli, functional components, and output responses. This modular approach offers new perspectives for the rational design of programmable nanocarriers for exquisite control over payload release while minimizing off-target effects. Advances in bioinspired design principles could lead to LNPs that mimic natural biological systems, enhancing their biocompatibility and functionality. This review summarizes recent advancements, identifies challenges, and offers outlooks for programmable LNPs, emphasizing their potential to evolve into more intelligent, naturally integrated systems that enhance scalability and reduce side effects.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"419 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minkyu Jeon, Rishwanth Raghu, Miro Astore, Geoffrey Woollard, Ryan Feathers, Alkin Kaz, Sonya M. Hanson, Pilar Cossio, Ellen D. Zhong
{"title":"CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM","authors":"Minkyu Jeon, Rishwanth Raghu, Miro Astore, Geoffrey Woollard, Ryan Feathers, Alkin Kaz, Sonya M. Hanson, Pilar Cossio, Ellen D. Zhong","doi":"arxiv-2408.05526","DOIUrl":"https://doi.org/arxiv-2408.05526","url":null,"abstract":"Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. As this technique can capture dynamic biomolecular complexes, 3D reconstruction methods are increasingly being developed to resolve this intrinsic structural heterogeneity. However, the absence of standardized benchmarks with ground truth structures and validation metrics limits the advancement of the field. Here, we propose CryoBench, a suite of datasets, metrics, and performance benchmarks for heterogeneous reconstruction in cryo-EM. We propose five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from simple motions and random configurations of antibody complexes and from tens of thousands of structures sampled from a molecular dynamics simulation. We also design datasets containing compositional heterogeneity from mixtures of ribosome assembly states and 100 common complexes present in cells. We then perform a comprehensive analysis of state-of-the-art heterogeneous reconstruction tools including neural and non-neural methods and their sensitivity to noise, and propose new metrics for quantitative comparison of methods. We hope that this benchmark will be a foundational resource for analyzing existing methods and new algorithmic development in both the cryo-EM and machine learning communities.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stephen Zhewen Lu, Ziqing Lu, Ehsan Hajiramezanali, Tommaso Biancalani, Yoshua Bengio, Gabriele Scalia, Michał Koziarski
{"title":"Cell Morphology-Guided Small Molecule Generation with GFlowNets","authors":"Stephen Zhewen Lu, Ziqing Lu, Ehsan Hajiramezanali, Tommaso Biancalani, Yoshua Bengio, Gabriele Scalia, Michał Koziarski","doi":"arxiv-2408.05196","DOIUrl":"https://doi.org/arxiv-2408.05196","url":null,"abstract":"High-content phenotypic screening, including high-content imaging (HCI), has gained popularity in the last few years for its ability to characterize novel therapeutics without prior knowledge of the protein target. When combined with deep learning techniques to predict and represent molecular-phenotype interactions, these advancements hold the potential to significantly accelerate and enhance drug discovery applications. This work focuses on the novel task of HCI-guided molecular design. Generative models for molecule design could be guided by HCI data, for example with a supervised model that links molecules to phenotypes of interest as a reward function. However, limited labeled data, combined with the high-dimensional readouts, can make training these methods challenging and impractical. We consider an alternative approach in which we leverage an unsupervised multimodal joint embedding to define a latent similarity as a reward for GFlowNets. The proposed model learns to generate new molecules that could produce phenotypic effects similar to those of the given image target, without relying on pre-annotated phenotypic labels. We demonstrate that the proposed method generates molecules with high morphological and structural similarity to the target, increasing the likelihood of similar biological activity, as confirmed by an independent oracle model.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProS2Vi: a Python Tool for Visualizing Proteins Secondary Structure","authors":"Luckman Qasim, Laleh Alisaraie","doi":"arxiv-2408.03436","DOIUrl":"https://doi.org/arxiv-2408.03436","url":null,"abstract":"The Protein Secondary Structure Visualizer ProS2Vi is a novel Python-based visualization tool designed to enhance the analysis and accessibility of protein secondary structures calculated and identified by the Dictionary of Secondary Structure of Proteins (DSSP) algorithm. Leveraging robust Python libraries such as Biopython for data handling, Flask for the graphical user interface, and Jinja2 and wkhtmltopdf for visualization, ProS2Vi offers a modern and intuitive representation of the DSSP-assigned secondary structure of each residue in any protein's amino acid sequence. Significant features of ProS2Vi include customizable icon colors, the number of residues per line, and the ability to export visualizations as scalable PDFs, enhancing both visual appeal and functional versatility through a user-friendly GUI. We have designed ProS2Vi specifically for secure and local operation, which significantly increases security when dealing with novel protein data.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiangbin Zheng, Han Zhang, Qianqing Xu, An-Ping Zeng, Stan Z. Li
{"title":"MetaEnzyme: Meta Pan-Enzyme Learning for Task-Adaptive Redesign","authors":"Jiangbin Zheng, Han Zhang, Qianqing Xu, An-Ping Zeng, Stan Z. Li","doi":"arxiv-2408.10247","DOIUrl":"https://doi.org/arxiv-2408.10247","url":null,"abstract":"Enzyme design plays a crucial role in both industrial production and biology. However, this field faces challenges due to the lack of comprehensive benchmarks and the complexity of enzyme design tasks, leading to a dearth of systematic research. Consequently, computational enzyme design is relatively overlooked within the broader protein domain and remains in its early stages. In this work, we address these challenges by introducing MetaEnzyme, a staged and unified enzyme design framework. We begin by employing a cross-modal structure-to-sequence transformation architecture as the feature-driven starting point to obtain an initial robust protein representation. Subsequently, we leverage domain adaptive techniques to generalize specific enzyme design tasks under low-resource conditions. MetaEnzyme focuses on three fundamental low-resource enzyme redesign tasks: functional design (FuncDesign), mutation design (MutDesign), and sequence generation design (SeqDesign). Through a novel unified paradigm and enhanced representation capabilities, MetaEnzyme demonstrates adaptability to diverse enzyme design tasks, yielding outstanding results. Wet lab experiments further validate these findings, reinforcing the efficacy of the redesign process.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142216199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Free energy, rates, and mechanism of transmembrane dimerization in lipid bilayers from dynamically unbiased molecular dynamics simulations","authors":"Emil Jackel, Gianmarco Lazzeri, Roberto Covino","doi":"arxiv-2408.01407","DOIUrl":"https://doi.org/arxiv-2408.01407","url":null,"abstract":"The assembly of proteins in membranes plays a key role in many crucial cellular pathways. Despite their importance, characterizing transmembrane assembly remains challenging for experiments and simulations. Equilibrium molecular dynamics simulations do not cover the time scales required to sample the typical transmembrane assembly. Hence, most studies rely on enhanced sampling schemes that steer the dynamics of transmembrane proteins along a collective variable that should encode all slow degrees of freedom. However, given the complexity of the condensed-phase lipid environment, this is far from trivial, with the consequence that free energy profiles of dimerization can be poorly converged. Here, we introduce an alternative approach, which relies only on simulating short, dynamically unbiased trajectory segments, avoiding using collective variables or biasing forces. By merging all trajectories, we obtain free energy profiles, rates, and mechanisms of transmembrane dimerization with the same set of simulations. We showcase our algorithm by sampling the spontaneous association and dissociation of a transmembrane protein in a lipid bilayer, using the popular coarse-grained Martini force field. Our algorithm represents a promising way to investigate assembly processes in biologically relevant membranes, overcoming some of the challenges of conventional methods.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thuong Le Hoai Pham, Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl, Mohammad Sadegh Nasr, Cody Tyler Reynolds, Jai Prakash Yadav Veerla, Helen H Shang, Justyn Jaworski, Alison Ravenscraft, Joseph Anthony Buonomo, Jacob M. Luber
{"title":"Peptide Sequencing Via Protein Language Models","authors":"Thuong Le Hoai Pham, Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl, Mohammad Sadegh Nasr, Cody Tyler Reynolds, Jai Prakash Yadav Veerla, Helen H Shang, Justyn Jaworski, Alison Ravenscraft, Joseph Anthony Buonomo, Jacob M. Luber","doi":"arxiv-2408.00892","DOIUrl":"https://doi.org/arxiv-2408.00892","url":null,"abstract":"We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing relies on mass spectrometry, with some novel Edman degradation-based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and fine-tune a ProtBert-derived transformer-based model for a new downstream task, predicting these masked residues, which provides an approximation of the complete sequence. Evaluating on three bacterial Escherichia species, we achieve per-amino-acid accuracy up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessment using AlphaFold and TM-score validates the biological relevance of our predictions. The model also demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, potentially accelerating advancements in proteomics and structural biology by providing a probabilistic reconstruction of the complete protein sequence from limited experimental data.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Wu, Yu Pan, Fuqi Zhou, Jinghui Yan, Chuanlu Liu
{"title":"A Vectorization Method Induced By Maximal Margin Classification For Persistent Diagrams","authors":"An Wu, Yu Pan, Fuqi Zhou, Jinghui Yan, Chuanlu Liu","doi":"arxiv-2407.21298","DOIUrl":"https://doi.org/arxiv-2407.21298","url":null,"abstract":"Persistent homology is an effective method for extracting topological information, represented as persistent diagrams, of spatial structure data. Hence it is well-suited for the study of protein structures. Attempts to incorporate persistent homology in machine learning methods of protein function prediction have resulted in several techniques for vectorizing persistent diagrams. However, current vectorization methods are excessively artificial and cannot ensure the effective utilization of information or the rationality of the methods. To address this problem, we propose a more geometrical vectorization method of persistent diagrams based on maximal margin classification for Banach space, and additionally propose a framework that utilizes topological data analysis to identify proteins with specific functions. We evaluated our vectorization method using a binary classification task on proteins and compared it with the statistical methods that exhibit the best performance among thirteen commonly used vectorization methods. The experimental results indicate that our approach surpasses the statistical methods in both robustness and precision.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}