{"title":"piCRISPR: Physically informed deep learning models for CRISPR/Cas9 off-target cleavage prediction","authors":"Florian Störtz, Jeffrey K. Mak, Peter Minary","doi":"10.1016/j.ailsci.2023.100075","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100075","url":null,"abstract":"<div><p>CRISPR/Cas programmable nuclease systems have become ubiquitous in the field of gene editing. With progressing development, applications in <em>in vivo</em> therapeutic gene editing are increasingly within reach, yet limited by possible adverse side effects from unwanted edits. Recent years have thus seen continuous development of off-target prediction algorithms trained on <em>in vitro</em> cleavage assay data gained from immortalised cell lines. It has been shown that in contrast to experimental epigenetic features, computed physically informed features are so far underutilised despite bearing considerably larger correlation with cleavage activity. Here, we implement state-of-the-art deep learning algorithms and feature encodings for off-target prediction with emphasis on <em>physically informed</em> features that capture the biological environment of the cleavage site, hence terming our approach piCRISPR. Features were gained from the large, diverse crisprSQL off-target cleavage dataset. We find that our best-performing models highlight the importance of sequence context and chromatin accessibility for cleavage prediction and compare favourably with literature standard prediction performance. We further show that our novel, environmentally sensitive features are crucial to accurate prediction on sequence-identical locus pairs, making them highly relevant for clinical guide design. 
The source code and trained models can be found ready to use at <span>github.com/florianst/picrispr</span>.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49774977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trends and challenges in chemoinformatics research in Latin America","authors":"Jazmín Miranda-Salas , Carlos Peña-Varas , Ignacio Valenzuela Martínez , Dionisio A. Olmedo , William J. Zamora , Miguel Angel Chávez-Fumagalli , Daniela Q. Azevedo , Rachel Oliveira Castilho , Vinicius G. Maltarollo , David Ramírez , José L. Medina-Franco","doi":"10.1016/j.ailsci.2023.100077","DOIUrl":"10.1016/j.ailsci.2023.100077","url":null,"abstract":"<div><p>Chemoinformatics is an independent inter-discipline with a broad impact in drug design and discovery, medicinal chemistry, biochemistry, analytical and organic chemistry, natural products, and several other areas in chemistry. Through collaborations, scientific exchanges, and participation in international research networks, Latin American scientists have contributed to the development of this subject. The aim of this perspective is to discuss the status and progress of the chemoinformatic discipline in Latin America. We team up to provide an author´s perspective on the topics that have been investigated and published over the past twelve years, collaborations between Latin America researchers and others worldwide, contributions to open-access chemoinformatic tools such as web servers, and educational-related resources and events, such as scientific conferences. We conclude that linking and fostering collaboration within each nation as well as among other Latin American nations and globally is made possible by open science and the democratization of science. 
We also outline strategic actions that can boost the development and practice of chemoinformatics in the region and enhance the interaction between Latin American countries and the rest of the world.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42646088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian optimization for ternary complex prediction (BOTCP)","authors":"Arjun Rao , Tin M. Tunjic , Michael Brunsteiner , Michael Müller, Hosein Fooladi, Chiara Gasbarri, Noah Weber","doi":"10.1016/j.ailsci.2023.100072","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100072","url":null,"abstract":"<div><p>Proximity-inducing compounds (PICs) are an emergent drug technology through which a protein of interest (POI), often a drug target, is brought into the vicinity of a second protein which modifies the POI’s function, abundance or localisation, giving rise to a therapeutic effect. One of the best-known examples for such compounds are heterobifunctional molecules known as proteolysis targeting chimeras (PROTACs). PROTACs reduce the abundance of the target protein by establishing proximity to an E3 ligase which labels the protein for degradation via the ubiquitin-proteasomal pathway. Design of PROTACs in silico requires the computational prediction of the ternary complex consisting of POI, PROTAC molecule, and the E3 ligase.</p><p>We present a novel machine learning-based method for predicting PROTAC-mediated ternary complex structures using Bayesian optimization. We show how a fitness score combining an estimation of protein-protein interactions with PROTAC conformation energy calculations enables the sample-efficient exploration of candidate structures. Furthermore, our method presents two novel scores for filtering and reranking which take PROTAC stability (Autodock-Vina based PROTAC stability score) and protein interaction restraints (the TCP-AIR score) into account. 
We evaluate our method using DockQ scores on a number of available ternary complex structures (including previously unevaluated cases) and demonstrate that even with a clustering that requires members to have a high similarity, i.e., with smaller clusters, we can assign high ranks to those clusters that contain poses close to the experimentally determined native structure of the ternary complexes. We also demonstrate the resultant improved yield of near-native poses in these clusters.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49775003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing microplate layouts using artificial intelligence","authors":"María Andreína Francisco Rodríguez, Jordi Carreras Puigvert, Ola Spjuth","doi":"10.1016/j.ailsci.2023.100073","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100073","url":null,"abstract":"<div><p>Microplates are indispensable in large-scale biomedical experiments but the physical location of samples and controls on the microplate can significantly affect the resulting data and quality metric values. We introduce a new method based on constraint programming for designing microplate layouts that reduces unwanted bias and limits the impact of batch effects after error correction and normalisation. We demonstrate that our method applied to dose-response experiments leads to more accurate regression curves and lower errors when estimating <span><math><msub><mtext>IC</mtext><mn>50</mn></msub></math></span>/<span><math><msub><mtext>EC</mtext><mn>50</mn></msub></math></span>, and for drug screening leads to increased precision, when compared to random layouts. It also reduces the risk of inflated scores from common microplate quality assessment metrics such as <span><math><msup><mi>Z</mi><mo>′</mo></msup></math></span> factor and SSMD. 
We make our method available via a suite of tools (PLAID) including a reference constraint model, a web application, and Python notebooks to evaluate and compare designs when planning microplate experiments.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49774976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep metric learning for the classification of MALDI-TOF spectral signatures from multiple species of neotropical disease vectors","authors":"Fernando Merchan , Kenji Contreras , Rolando A. Gittens , Jose R. Loaiza , Javier E. Sanchez-Galan","doi":"10.1016/j.ailsci.2023.100071","DOIUrl":"10.1016/j.ailsci.2023.100071","url":null,"abstract":"<div><p>Deep Learning techniques have significant advantages for mass spectral classification, such as parallelized signal correction and feature extraction. Deep Metric Learning models combine Metric Learning to determine the degree of similarity or difference between a set of mass spectra with the generalization power of Deep Learning to improve feature extraction even further. The two most popular of these models combine multiple neural networks with identical architectures and are commonly called Siamese (SNN) and Triplet Neural Networks (TNN). Herein, using both SNNs and TNNs, we intended to taxonomically categorize two sets of previously-validated mass spectra that corresponded to 30 species of Neotropical arthropods in the Culicidae and Ixodidae families, some of which are disease vectors. SNNs and TNNs were highly effective at correctly classifying 826 spectra from 12 mosquito species and 310 spectra from 18 species of hard ticks, with both algorithms achieving minimal average loss during cross-validation. SNNs produced accuracy rates for ticks and mosquitoes of 91.22% and 94.46%, respectively, while accuracy rates of 93% and 99% were obtained with TNNs. 
Our results indicate that Deep Metric Learning is a practical machine learning tool for quickly and precisely classifying MALDI-TOF-generated mass spectra of Neotropical and public-health-relevant arthropod species.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41748999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conformal efficiency as a metric for comparative model assessment befitting federated learning","authors":"Wouter Heyndrickx , Adam Arany , Jaak Simm , Anastasia Pentina , Noé Sturm , Lina Humbeck , Lewis Mervin , Adam Zalewski , Martijn Oldenhof , Peter Schmidtke , Lukas Friedrich , Regis Loeb , Arina Afanasyeva , Ansgar Schuffenhauer , Yves Moreau , Hugo Ceulemans","doi":"10.1016/j.ailsci.2023.100070","DOIUrl":"10.1016/j.ailsci.2023.100070","url":null,"abstract":"<div><p>In a drug discovery setting, pharmaceutical companies own substantial but confidential datasets. The MELLODDY project developed a privacy-preserving federated machine learning solution and deployed it at an unprecedented scale. Each partner built models for their own private assays that benefitted from a shared representation. Established predictive performance metrics such as AUC ROC or AUC PR are constrained to unseen labeled chemical space and cannot gage performance gains in unlabeled chemical space. Federated learning indirectly extends labeled space, but in a privacy-preserving context, a partner cannot use this label extension for performance assessment. Metrics that estimate uncertainty on a prediction can be calculated even where no label is known. Practically, the chemical space covered with predictions above an uncertainty threshold, reflects the applicability domain of a model. After establishing a link to established performance metrics, we propose the efficiency from the conformal prediction framework (‘conformal efficiency’) as a proxy to the applicability domain size. A documented extension of the applicability domain would qualify as a tangible benefit from federated learning. In interim assessments, MELLODDY partners reported a median increase in conformal efficiency of the federated over the single-partner model of 5.5% (with increases up to 9.7%). 
Subject to distributional conditions, that efficiency increase can be directly interpreted as the expected increase in conformal i.e. low uncertainty predictions. In conclusion, we present the first indication that privacy-preserving federated machine learning across massive drug-discovery datasets from ten pharma partners indeed extends the applicability domain of property prediction models.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42954871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pharmaceutical patent landscaping: A novel approach to understand patents from the drug discovery perspective","authors":"Yojana Gadiya , Philip Gribbon , Martin Hofmann-Apitius , Andrea Zaliani","doi":"10.1016/j.ailsci.2023.100069","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100069","url":null,"abstract":"<div><p>Patents play a crucial role in the drug discovery process by providing legal protection for discoveries and incentivising investments in research and development. By identifying patterns within patent data resources, researchers can gain insight into the market trends and priorities of the pharmaceutical and biotechnology industries, as well as provide additional perspectives on more fundamental aspects such as the emergence of potential new drug targets. In this paper, we used the patent enrichment tool, PEMT, to extract, integrate, and analyse patent literature for rare diseases (RD) and Alzheimer's disease (AD). This is followed by a systematic review of the underlying patent landscape to decipher trends and applications in patents for these diseases. To do so, we discuss prominent organisations involved in drug discovery research in AD and RD. This allows us to gain an understanding of the importance of AD and RD from specific organisational (pharmaceutical or university) perspectives. Next, we analyse the historical focus of patents in relation to individual therapeutic targets and correlate them with market scenarios allowing the identification of prominent targets for a disease. Lastly, we identified drug repurposing activities within the two diseases with the help of patents. This resulted in identifying existing repurposed drugs and novel potential therapeutic approaches applicable to the indication areas. The study demonstrates the expanded applicability of patent documents from legal to drug discovery, design, and research, thus, providing a valuable resource for future drug discovery efforts. 
Moreover, this study is an attempt towards understanding the importance of data underlying patent documents and raising the need for preparing the data for machine learning-based applications.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49774974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Elucidating dynamic cell lineages and gene networks in time-course single cell differentiation","authors":"Mengrui Zhang , Yongkai Chen , Dingyi Yu , Wenxuan Zhong , Jingyi Zhang , Ping Ma","doi":"10.1016/j.ailsci.2023.100068","DOIUrl":"10.1016/j.ailsci.2023.100068","url":null,"abstract":"<div><p>Single cell RNA sequencing (scRNA-seq) technologies provide researchers with an unprecedented opportunity to exploit cell heterogeneity. For example, the sequenced cells belong to various cell lineages, which may have different cell fates in stem and progenitor cells. Those cells may differentiate into various mature cell types in a cell differentiation process. To trace the behavior of cell differentiation, researchers reconstruct cell lineages and predict cell fates by ordering cells chronologically into a trajectory with a pseudo-time. However, in scRNA-seq experiments, there are no cell-to-cell correspondences along with the time to reconstruct the cell lineages, which creates a significant challenge for cell lineage tracing and cell fate prediction. Therefore, methods that can accurately reconstruct the dynamic cell lineages and predict cell fates are highly desirable.</p><p>In this article, we develop an innovative machine-learning framework called Cell Smoothing Transformation (CellST) to elucidate the dynamic cell fate paths and construct gene networks in cell differentiation processes. Unlike the existing methods that construct one single bulk cell trajectory, CellST builds cell trajectories and tracks behaviors for each individual cell. Additionally, CellST can predict cell fates even for less frequent cell types. 
Based on the individual cell fate trajectories, CellST can further construct dynamic gene networks to model gene-gene relationships along the cell differentiation process and discover critical genes that potentially regulate cells into various mature cell types.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10328540/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9800573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data science and data analytics in life science research","authors":"Jürgen Bajorath","doi":"10.1016/j.ailsci.2023.100067","DOIUrl":"10.1016/j.ailsci.2023.100067","url":null,"abstract":"","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43783253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural products subsets: Generation and characterization","authors":"Ana L. Chávez-Hernández, José L. Medina-Franco","doi":"10.1016/j.ailsci.2023.100066","DOIUrl":"10.1016/j.ailsci.2023.100066","url":null,"abstract":"<div><p>Natural products are attractive for drug discovery applications because of their distinctive chemical structures, such as an overall large fraction of sp<sup>3</sup> carbon atoms, chiral centers (both features associated with structural complexity), large chemical scaffolds, and diversity of functional groups. Furthermore, natural products are used in <em>de novo</em> design and have inspired the development of pseudo-natural products using generative models. Public databases such as the Collection of Open NatUral ProdUcTs and the Universal Natural Product database (UNPD) are rich sources of structures to be used in generative models and other applications. In this work, we report the selection and characterization of the most diverse compounds of natural products from the UNPD using the MaxMin algorithm. The subsets generated with 14,994, 7,497, and 4,998 compounds are publicly available at <span>https://github.com/DIFACQUIM/Natural-products-subsets-generation</span>. 
We anticipate that the subsets will be particularly useful in building generative models based on natural products by research groups, particularly those with limited access to extensive supercomputer resources.</p></div>","PeriodicalId":72304,"journal":{"name":"Artificial intelligence in the life sciences","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43292936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}