Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad446
Matthew D Smith, Marshall A Case, Emily K Makowski, Peter M Tessier
{"title":"Position-Specific Enrichment Ratio Matrix scores predict antibody variant properties from deep sequencing data.","authors":"Matthew D Smith, Marshall A Case, Emily K Makowski, Peter M Tessier","doi":"10.1093/bioinformatics/btad446","DOIUrl":"10.1093/bioinformatics/btad446","url":null,"abstract":"<p><strong>Motivation: </strong>Deep sequencing of antibody and related protein libraries after phage or yeast-surface display sorting is widely used to identify variants with increased affinity, specificity, and/or improvements in key biophysical properties. Conventional approaches for identifying optimal variants typically use the frequencies of observation in enriched libraries or the corresponding enrichment ratios. However, these approaches disregard the vast majority of deep sequencing data and often fail to identify the best variants in the libraries.</p><p><strong>Results: </strong>Here, we present a method, Position-Specific Enrichment Ratio Matrix (PSERM) scoring, that uses entire deep sequencing datasets from pre- and post-selections to score each observed protein variant. The PSERM scores are the sum of the site-specific enrichment ratios observed at each mutated position. We find that PSERM scores are much more reproducible and correlate more strongly with experimentally measured properties than frequencies or enrichment ratios, including for multiple antibody properties (affinity and non-specific binding) for a clinical-stage antibody (emibetuzumab). 
We expect that this method will be broadly applicable to diverse protein engineering campaigns.</p><p><strong>Availability and implementation: </strong>All deep sequencing datasets and code to perform the analyses presented within are available via https://github.com/Tessier-Lab-UMich/PSERM_paper.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":"39 9","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10477941/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10628969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
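The PSERM abstract above describes a score that sums site-specific enrichment ratios over the mutated positions of a variant. The following is a minimal sketch of that idea, not the authors' implementation (their code is in the repository linked in the abstract). The use of log2 ratios, the pseudocount of 1, the fixed-length aligned sequences, and the function names `site_frequencies`, `build_pserm`, and `pserm_score` are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Per-position residue frequencies for equal-length sequences.

    The pseudocount is an assumption; it keeps unseen residues from
    producing zero frequencies (and therefore infinite log ratios).
    """
    n_positions = len(seqs[0])
    total = len(seqs) + pseudocount * len(alphabet)
    freqs = []
    for pos in range(n_positions):
        counts = Counter(seq[pos] for seq in seqs)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in alphabet})
    return freqs


def build_pserm(pre_seqs, post_seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Matrix of log2(post-selection / pre-selection) frequency per site and residue.

    Taking the log2 of the ratio is an assumption of this sketch.
    """
    pre = site_frequencies(pre_seqs, alphabet, pseudocount)
    post = site_frequencies(post_seqs, alphabet, pseudocount)
    return [
        {aa: math.log2(post[i][aa] / pre[i][aa]) for aa in alphabet}
        for i in range(len(pre))
    ]


def pserm_score(seq, matrix, wild_type):
    """Sum the matrix entries at positions where seq differs from wild_type."""
    return sum(
        matrix[i][aa]
        for i, (aa, wt) in enumerate(zip(seq, wild_type))
        if aa != wt
    )
```

Because every read from the pre- and post-selection libraries contributes to the per-site frequencies, a variant can be scored even if that exact full-length sequence was observed only rarely, which is the property the abstract contrasts with whole-sequence frequencies and enrichment ratios.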
Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad548
Seong-Joon Park, Sunghwan Kim, Jaeho Jeong, Albert No, Jong-Seon No, Hosung Park
{"title":"Reducing cost in DNA-based data storage by sequence analysis-aided soft information decoding of variable-length reads.","authors":"Seong-Joon Park, Sunghwan Kim, Jaeho Jeong, Albert No, Jong-Seon No, Hosung Park","doi":"10.1093/bioinformatics/btad548","DOIUrl":"10.1093/bioinformatics/btad548","url":null,"abstract":"<p><strong>Motivation: </strong>DNA-based data storage is one of the most attractive research areas for future archival storage. However, it faces the problems of high writing and reading costs for practical use. There have been many efforts to resolve this problem, but existing schemes are not fully suitable for DNA-based data storage, and more cost reduction is needed.</p><p><strong>Results: </strong>We propose whole encoding and decoding procedures for DNA storage. The encoding procedure consists of a carefully designed single low-density parity-check code as an inter-oligo code, which corrects errors and dropouts efficiently. We apply new clustering and alignment methods that operate on variable-length reads to aid the decoding performance. We use edit distance and quality scores during the sequence analysis-aided decoding procedure, which can discard abnormal reads and utilize high-quality soft information. 
We store 548.83 KB of an image file in DNA oligos and achieve a writing cost reduction of 7.46% and a significant reading cost reduction of 26.57% and 19.41% compared with the two previous works.</p><p><strong>Availability and implementation: </strong>Data and codes for all the algorithms proposed in this study are available at: https://github.com/sjpark0905/DNA-LDPC-codes.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":"39 9","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500082/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10631513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FunTaxIS-lite: a simple and light solution to investigate protein functions in all living organisms.","authors":"Federico Bianca, Emilio Ispano, Ermanno Gazzola, Enrico Lavezzo, Paolo Fontana, Stefano Toppo","doi":"10.1093/bioinformatics/btad549","DOIUrl":"https://doi.org/10.1093/bioinformatics/btad549","url":null,"abstract":"<p><strong>Motivation: </strong>Defining the full domain of protein functions belonging to an organism is a complex challenge that is due to the huge heterogeneity of the taxonomy, where single or small groups of species can bear unique functional characteristics. FunTaxIS-lite provides a solution to this challenge by determining taxon-based constraints on Gene Ontology (GO) terms, which specify the functions that an organism can or cannot perform. The tool employs a set of rules to generate and spread the constraints across both the taxon hierarchy and the GO graph.</p><p><strong>Results: </strong>The taxon-based constraints produced by FunTaxIS-lite extend those provided by the Gene Ontology Consortium by an average of 300%. 
The implementation of these rules significantly reduces errors in function predictions made by automatic algorithms and can assist in correcting inconsistent protein annotations in databases.</p><p><strong>Availability and implementation: </strong>FunTaxIS-lite is available on https://www.medcomp.medicina.unipd.it/funtaxis-lite and from https://github.com/MedCompUnipd/FunTaxIS-lite.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":"39 9","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500080/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10631519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An extensive benchmark study on biomedical text generation and mining with ChatGPT.","authors":"Qijie Chen, Haotong Sun, Haoyang Liu, Yinghui Jiang, Ting Ran, Xurui Jin, Xianglu Xiao, Zhimin Lin, Hongming Chen, Zhangmin Niu","doi":"10.1093/bioinformatics/btad557","DOIUrl":"10.1093/bioinformatics/btad557","url":null,"abstract":"<p><strong>Motivation: </strong>In recent years, the development of natural language processing (NLP) technologies and deep learning hardware has led to significant improvement in large language models (LLMs). ChatGPT, the state-of-the-art LLM built on GPT-3.5 and GPT-4, shows excellent capabilities in general language understanding and reasoning. Researchers also tested the GPTs on a variety of NLP-related tasks and benchmarks and obtained excellent results. With exciting performance on daily chat, researchers began to explore the capacity of ChatGPT on expertise that requires professional education for humans, and we are interested in the biomedical domain.</p><p><strong>Results: </strong>To evaluate the performance of ChatGPT on biomedical-related tasks, this article presents a comprehensive benchmark study on the use of ChatGPT for biomedical corpora, including article abstracts, clinical trial descriptions, biomedical questions, and so on. Typical NLP tasks like named entity recognition, relation extraction, sentence similarity, question answering, and document classification are included. Overall, ChatGPT achieved a BLURB score of 58.50 while the state-of-the-art model had a score of 84.30. Through a series of experiments, we demonstrated the effectiveness and versatility of ChatGPT in biomedical text understanding, reasoning and generation, and the limitations of ChatGPT built on GPT-3.5.</p><p><strong>Availability and implementation: </strong>All the datasets are available from the BLURB benchmark https://microsoft.github.io/BLURB/index.html. 
The prompts are described in the article.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":" ","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10562950/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10173923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepMHCI: an anchor position-aware deep interaction model for accurate MHC-I peptide binding affinity prediction.","authors":"Wei Qu, Ronghui You, Hiroshi Mamitsuka, Shanfeng Zhu","doi":"10.1093/bioinformatics/btad551","DOIUrl":"10.1093/bioinformatics/btad551","url":null,"abstract":"<p><strong>Motivation: </strong>Computationally predicting major histocompatibility complex class I (MHC-I) peptide binding affinity is an important problem in immunological bioinformatics, which is also crucial for the identification of neoantigens for personalized therapeutic cancer vaccines. Recent cutting-edge deep learning-based methods for this problem cannot achieve satisfactory performance, especially for non-9-mer peptides. This is because such methods generate the input by simply concatenating the two given sequences: a peptide and (the pseudo sequence of) an MHC class I molecule, which cannot precisely capture the anchor positions of the MHC binding motif for the peptides with variable lengths. We thus developed an anchor position-aware and high-performance deep model, DeepMHCI, with a position-wise gated layer and a residual binding interaction convolution layer. This allows the model to control the information flow in peptides to be aware of anchor positions and model the interactions between peptides and the MHC pseudo (binding) sequence directly with multiple convolutional kernels.</p><p><strong>Results: </strong>The performance of DeepMHCI has been thoroughly validated by extensive experiments on four benchmark datasets under various settings, such as 5-fold cross-validation, validation with the independent testing set, external HPV vaccine identification, and external CD8+ epitope identification. 
Experimental results with visualization of binding motifs demonstrate that DeepMHCI outperformed all competing methods, especially on non-9-mer peptides binding prediction.</p><p><strong>Availability and implementation: </strong>DeepMHCI is publicly available at https://github.com/ZhuLab-Fudan/DeepMHCI.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":" ","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516514/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10217795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad570
Jackson Callaghan, Colleen H Xu, Jiwen Xin, Marco Alvarado Cano, Anders Riutta, Eric Zhou, Rohan Juneja, Yao Yao, Madhumita Narayan, Kristina Hanspers, Ayushi Agrawal, Alexander R Pico, Chunlei Wu, Andrew I Su
{"title":"BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs.","authors":"Jackson Callaghan, Colleen H Xu, Jiwen Xin, Marco Alvarado Cano, Anders Riutta, Eric Zhou, Rohan Juneja, Yao Yao, Madhumita Narayan, Kristina Hanspers, Ayushi Agrawal, Alexander R Pico, Chunlei Wu, Andrew I Su","doi":"10.1093/bioinformatics/btad570","DOIUrl":"10.1093/bioinformatics/btad570","url":null,"abstract":"<p><strong>Summary: </strong>Knowledge graphs are an increasingly common data structure for representing biomedical information. These knowledge graphs can easily represent heterogeneous types of information, and many algorithms and tools exist for querying and analyzing graphs. Biomedical knowledge graphs have been used in a variety of applications, including drug repurposing, identification of drug targets, prediction of drug side effects, and clinical decision support. Typically, knowledge graphs are constructed by centralization and integration of data from multiple disparate sources. Here, we describe BioThings Explorer, an application that can query a virtual, federated knowledge graph derived from the aggregated information in a network of biomedical web services. BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries. 
Because there is no large, centralized knowledge graph to maintain, BioThings Explorer is distributed as a lightweight application that dynamically retrieves information at query time.</p><p><strong>Availability and implementation: </strong>More information can be found at https://explorer.biothings.io and code is available at https://github.com/biothings/biothings_explorer.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":" ","pages":""},"PeriodicalIF":4.4,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11015316/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10287315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad507
Sarah G Ayton, Víctor Treviño
{"title":"MuTATE-an R package for comprehensive multi-objective molecular modeling.","authors":"Sarah G Ayton, Víctor Treviño","doi":"10.1093/bioinformatics/btad507","DOIUrl":"https://doi.org/10.1093/bioinformatics/btad507","url":null,"abstract":"<p><strong>Motivation: </strong>Comprehensive multi-omics studies have driven advances in disease modeling for effective precision medicine but pose a challenge for existing machine-learning approaches, which have limited interpretability across clinical endpoints. Automated, comprehensive disease modeling requires a machine-learning approach that can simultaneously identify disease subgroups and their defining molecular biomarkers by explaining multiple clinical endpoints. Current tools are restricted to individual endpoints or limited variable types, necessitate advanced computation skills, and require resource-intensive manual expert interpretation.</p><p><strong>Results: </strong>We developed Multi-Target Automated Tree Engine (MuTATE) for automated and comprehensive molecular modeling, which enables user-friendly multi-objective decision tree construction and visualization of relationships between molecular biomarkers and patient subgroups characterized by multiple clinical endpoints. MuTATE incorporates multiple targets throughout model construction and allows for target weights, enabling construction of interpretable decision trees that provide insights into disease heterogeneity and molecular signatures. MuTATE eliminates the need for manual synthesis of multiple non-explainable models, making it highly efficient and accessible for bioinformaticians and clinicians. The flexibility and versatility of MuTATE make it applicable to a wide range of complex diseases, including cancer, where it can improve therapeutic decisions by providing comprehensive molecular insights for precision medicine. 
MuTATE has the potential to transform biomarker discovery and subtype identification, leading to more effective and personalized treatment strategies in precision medicine, and advancing our understanding of disease mechanisms at the molecular level.</p><p><strong>Availability and implementation: </strong>MuTATE is freely available at GitHub (https://github.com/SarahAyton/MuTATE) under the GPLv3 license.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":"39 9","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10500092/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10287680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad564
Jeffrey M Dick, Xun Kang
{"title":"chem16S: community-level chemical metrics for exploring genomic adaptation to environments.","authors":"Jeffrey M Dick, Xun Kang","doi":"10.1093/bioinformatics/btad564","DOIUrl":"10.1093/bioinformatics/btad564","url":null,"abstract":"<p><strong>Summary: </strong>The chem16S package combines taxonomic classifications of 16S rRNA gene sequences with amino acid compositions of prokaryotic reference proteomes to generate community reference proteomes. Taxonomic classifications from the RDP Classifier or data objects created by the phyloseq R package are supported. Users can calculate and visualize a variety of chemical metrics in order to explore the effects of redox, salinity, and other physicochemical variables on the genomic adaptation of protein sequences at the community level.</p><p><strong>Availability and implementation: </strong>Development of chem16S is hosted at https://github.com/jedick/chem16S. Version 1.0.0 is freely available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=chem16S.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":"39 9","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505500/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10304977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad565
Mercedeh Movassagh, Steven J Schiff, Joseph N Paulson
{"title":"mbQTL: an R/Bioconductor package for microbial quantitative trait loci (QTL) estimation.","authors":"Mercedeh Movassagh, Steven J Schiff, Joseph N Paulson","doi":"10.1093/bioinformatics/btad565","DOIUrl":"10.1093/bioinformatics/btad565","url":null,"abstract":"<p><strong>Motivation: </strong>In recent years, significant strides have been made in the field of genomics, with the commencement of large-scale studies aimed at collecting host mutational profiles and microbiome data. The amalgamation of host gene mutational profiles in both healthy and diseased subjects with microbial abundance data holds immense promise in providing insights into several crucial research questions, including the development and progression of diseases, as well as individual responses to therapeutic interventions. With the advent of sequencing methods such as 16S ribosomal RNA (rRNA) sequencing and whole genome sequencing, there is increasing evidence of interplay between human genetics and microbial communities. Quantitative trait loci associated with microbial abundance (mbQTLs) are genetic variants that influence the abundance of microbial populations within the host.</p><p><strong>Results: </strong>Here, we introduce mbQTL, the first R package integrating 16S ribosomal RNA (rRNA) sequencing and single-nucleotide variation (SNV) and single-nucleotide polymorphism (SNP) data. 
We describe various statistical methods implemented for the identification of microbe-SNV pairs, relevant statistical measures, and plot functionality for interpretation.</p><p><strong>Availability and implementation: </strong>mbQTL is available on Bioconductor at https://bioconductor.org/packages/mbQTL/.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":" ","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516520/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10231044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics | Pub Date: 2023-09-02 | DOI: 10.1093/bioinformatics/btad573
Björn Wallner
{"title":"AFsample: improving multimer prediction with AlphaFold using massive sampling.","authors":"Björn Wallner","doi":"10.1093/bioinformatics/btad573","DOIUrl":"10.1093/bioinformatics/btad573","url":null,"abstract":"<p><strong>Summary: </strong>The AlphaFold2 neural network model has revolutionized structural biology with unprecedented performance. We demonstrate that by stochastically perturbing the neural network by enabling dropout at inference combined with massive sampling, it is possible to improve the quality of the generated models. We generated ∼6000 models per target compared with 25 default for AlphaFold-Multimer, with v1 and v2 multimer network models, with and without templates, and increased the number of recycles within the network. The method was benchmarked in CASP15, and compared with AlphaFold-Multimer v2 it improved the average DockQ from 0.41 to 0.55 using identical input and was ranked at the very top in the protein assembly category when compared with all other groups participating in CASP15. The simplicity of the method should facilitate the adaptation by the field, and the method should be useful for anyone interested in modeling multimeric structures, alternate conformations, or flexible structures.</p><p><strong>Availability and implementation: </strong>AFsample is available online at http://wallnerlab.org/AFsample.</p>","PeriodicalId":8903,"journal":{"name":"Bioinformatics","volume":" ","pages":""},"PeriodicalIF":5.8,"publicationDate":"2023-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10534052/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10253205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}