Thomas C Smits, Sehi L'Yi, Andrew P Mar, Nils Gehlenborg
{"title":"AltGosling: Automatic Generation of Text Descriptions for Accessible Genomics Data Visualization.","authors":"Thomas C Smits, Sehi L'Yi, Andrew P Mar, Nils Gehlenborg","doi":"10.1093/bioinformatics/btae670","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae670","url":null,"abstract":"<p><strong>Motivation: </strong>Biomedical visualizations are key to accessing biomedical knowledge and detecting new patterns in large datasets. Interactive visualizations are essential for biomedical data scientists and are omnipresent in data analysis software and data portals. Without appropriate descriptions, these visualizations are not accessible to all people with blindness and low vision, who often rely on screen reader accessibility technologies to access visual information on digital devices. Screen readers require descriptions to convey image content. However, many images lack informative descriptions due to unawareness and difficulty writing such descriptions. Describing complex and interactive visualizations, like genomics data visualizations, is even more challenging. Automatic generation of descriptions could be beneficial, yet current alt text generating models are limited to basic visualizations and cannot be used for genomics.</p><p><strong>Results: </strong>We present AltGosling, an automated description generation tool focused on interactive data visualizations of genome-mapped data, created with the grammar-based genomics toolkit Gosling. The logic-based algorithm of AltGosling creates various descriptions including a tree-structured navigable panel. We co-designed AltGosling with a blind screen reader user (co-author). We show that AltGosling outperforms state-of-the-art large language models and common image-based neural networks for alt text generation of genomics data visualizations. 
As a first of its kind in genomic research, we lay the groundwork to increase accessibility in the field.</p><p><strong>Availability and implementation: </strong>The source code, examples, and interactive demo are accessible under the MIT License at https://github.com/gosling-lang/altgosling. The package is available at https://www.npmjs.com/package/altgosling.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142634347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FAPM: Functional Annotation of Proteins using Multi-Modal Models Beyond Structural Modeling.","authors":"Wenkai Xiang, Zhaoping Xiong, Huan Chen, Jiacheng Xiong, Wei Zhang, Zunyun Fu, Mingyue Zheng, Bing Liu, Qian Shi","doi":"10.1093/bioinformatics/btae680","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae680","url":null,"abstract":"<p><strong>Motivation: </strong>Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and \"tail labels\" with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels.</p><p><strong>Results: </strong>We introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM's flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation. 
The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142634393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jaime Moreno, Henrik Nielsen, Ole Winther, Felix Teufel
{"title":"Predicting the subcellular location of prokaryotic proteins with DeepLocPro.","authors":"Jaime Moreno, Henrik Nielsen, Ole Winther, Felix Teufel","doi":"10.1093/bioinformatics/btae677","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae677","url":null,"abstract":"<p><strong>Motivation: </strong>Protein subcellular location prediction is a widely explored task in bioinformatics because of its importance in proteomics research. We propose DeepLocPro, an extension to the popular method DeepLoc, tailored specifically to archaeal and bacterial organisms.</p><p><strong>Results: </strong>DeepLocPro is a multiclass subcellular location prediction tool for prokaryotic proteins, trained on experimentally verified data curated from UniProt and PSORTdb. DeepLocPro compares favorably to the PSORTb 3.0 ensemble method, surpassing its performance across multiple metrics in our benchmark experiment.</p><p><strong>Availability: </strong>The DeepLocPro prediction tool is available online at https://ku.biolib.com/deeplocpro and https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142634363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhijian Huang, Yucheng Wang, Song Chen, Yaw Sing Tan, Lei Deng, Min Wu
{"title":"DeepRSMA: a cross-fusion based deep learning method for RNA-small molecule binding affinity prediction.","authors":"Zhijian Huang, Yucheng Wang, Song Chen, Yaw Sing Tan, Lei Deng, Min Wu","doi":"10.1093/bioinformatics/btae678","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae678","url":null,"abstract":"<p><strong>Motivation: </strong>RNA is implicated in numerous aberrant cellular functions and disease progressions, highlighting the crucial importance of RNA-targeted drugs. To accelerate the discovery of such drugs, it is essential to develop an effective computational method for predicting RNA-small molecule affinity (RSMA). Recently, deep learning based computational methods have been promising due to their powerful nonlinear modeling ability. However, the leveraging of advanced deep learning methods to mine the diverse information of RNAs, small molecules and their interaction still remains a great challenge.</p><p><strong>Results: </strong>In this study, we present DeepRSMA, an innovative cross-attention-based deep learning method for RSMA prediction. To effectively capture fine-grained features from RNA and small molecules, we developed nucleotide-level and atomic-level feature extraction modules for RNA and small molecules, respectively. Additionally, we incorporated both sequence and graph views into these modules to capture features from multiple perspectives. Moreover, a Transformer-based cross-fusion module is introduced to learn the general patterns of interactions between RNAs and small molecules. To achieve effective RSMA prediction, we integrated the RNA and small molecule representations from the feature extraction and cross-fusion modules. Our results show that DeepRSMA outperforms baseline methods in multiple test settings. 
The interpretability analysis and the case study on spinal muscular atrophy (SMA) demonstrate that DeepRSMA has the potential to guide RNA-targeted drug design.</p><p><strong>Availability: </strong>The code and data are publicly available at https://github.com/Hhhzj-7/DeepRSMA.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142634357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marcio Soares Ferreira, Sebastian Stricker, Tomas Fitzgerald, Jack Monahan, Fanny Defranoux, Philip Watson, Bettina Welz, Omar Hammouda, Joachim Wittbrodt, Ewan Birney
{"title":"FEHAT: Efficient, Large scale and Automated Heartbeat Detection in Medaka Fish Embryos.","authors":"Marcio Soares Ferreira, Sebastian Stricker, Tomas Fitzgerald, Jack Monahan, Fanny Defranoux, Philip Watson, Bettina Welz, Omar Hammouda, Joachim Wittbrodt, Ewan Birney","doi":"10.1093/bioinformatics/btae664","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae664","url":null,"abstract":"<p><p>High resolution imaging of model organisms allows the quantification of important physiological measurements. In the case of fish with transparent embryos, these videos can visualise key physiological processes, such as heartbeat. High throughput systems can provide enough measurements for the robust investigation of developmental processes as well as the impact of system perturbations on physiological state. However, few analytical schemes have been designed to handle thousands of high-resolution videos without the need for some level of human intervention. We developed a software package, named FEHAT, to provide a fully automated solution for the analytics of large numbers of heart rate imaging datasets obtained from developing Medaka fish embryos in 96 well plate format imaged on an Acquifer machine. FEHAT uses image segmentation to define regions of the embryo showing changes in pixel intensity over time, followed by the classification of the most likely position of the heart and Fourier Transformations to estimate the heart rate. Here we describe some important features of the FEHAT software, showcase its performance across a large set of medaka fish embryos, and compare its performance to established, less automated solutions. 
FEHAT provides reliable heart rate estimates across a range of temperature-based perturbations and can be applied to tens of thousands of embryos without the need for any human intervention.</p><p><strong>Availability: </strong>Data used in this manuscript will be made available on request.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142607234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sean J McIlwain, Anna Hoefges, Amy K Erbe, Paul M Sondel, Irene M Ong
{"title":"Ranking Antibody Binding Epitopes and Proteins Across Samples from Whole Proteome Tiled Linear Peptides.","authors":"Sean J McIlwain, Anna Hoefges, Amy K Erbe, Paul M Sondel, Irene M Ong","doi":"10.1093/bioinformatics/btae637","DOIUrl":"10.1093/bioinformatics/btae637","url":null,"abstract":"<p><strong>Introduction: </strong>Ultradense peptide binding arrays that can probe millions of linear peptides comprising the entire proteomes of human or mouse, or hundreds of thousands of microbes, are powerful tools for studying the antibody repertoire in serum samples to understand adaptive immune responses.</p><p><strong>Motivation: </strong>There are few tools for exploring high-dimensional, significant and reproducible antibody targets for ultradense peptide binding arrays at the linear peptide, epitope (grouping of adjacent peptides), and protein level across multiple samples/subjects (i.e. epitope spread or immunogenic regions of proteins) for understanding the heterogeneity of immune responses.</p><p><strong>Results: </strong>We developed HERON (Hierarchical antibody binding Epitopes and pROteins from liNear peptides), an R package, which identifies immunogenic epitopes, using meta-analyses and spatial clustering techniques to explore antibody targets at various resolution and confidence levels, that can be found consistently across a specified number of samples through the entire proteome to study antibody responses for diagnostics or treatment. Our approach estimates significance values at the linear peptide (probe), epitope, and protein level to identify top candidates for validation. 
We test the performance of predictions on all three levels using correlation between technical replicates and comparison of epitope calls on two datasets, which shows HERON's competitiveness in estimating false discovery rates and finding general and sample-level regions of interest for antibody binding.</p><p><strong>Availability: </strong>The HERON R package is available at Bioconductor https://bioconductor.org/packages/release/bioc/html/HERON.html.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142585164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingyao Zhou, Jiayi Cox, Bin Zhou, Steven Zhu, Yang Zhong, Glen Spraggon
{"title":"Afpdb - an efficient structure manipulation package for AI protein design.","authors":"Yingyao Zhou, Jiayi Cox, Bin Zhou, Steven Zhu, Yang Zhong, Glen Spraggon","doi":"10.1093/bioinformatics/btae654","DOIUrl":"https://doi.org/10.1093/bioinformatics/btae654","url":null,"abstract":"<p><strong>Motivation: </strong>The advent of AlphaFold and other protein Artificial Intelligence (AI) models has transformed protein design, necessitating efficient handling of large-scale data and complex workflows. Using existing programming packages that predate recent AI advancements often leads to inefficiencies in human coding and slow code execution. To address this gap, we developed the Afpdb package.</p><p><strong>Results: </strong>Afpdb, built on AlphaFold's NumPy architecture, offers a high-performance core. It uses RFDiffusion's contig syntax to streamline residue and atom selection, making coding simpler and more readable. Integrating PyMOL's visualization capabilities, Afpdb allows automatic visual quality control. With over 180 methods commonly used in protein AI design, which are otherwise hard to find, Afpdb enhances productivity in structural biology by supporting the development of concise, high-performance code.</p><p><strong>Availability: </strong>Code and documentation are available on GitHub (https://github.com/data2code/afpdb) and PyPI (https://pypi.org/project/afpdb). 
An interactive tutorial is accessible through Google Colab.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142585154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laiyi Fu, Yanxin Xie, Shunkang Ling, Ying Wang, Binzhong Wang, Hejun Du, Qinke Peng, Hequan Sun
{"title":"findGSEP: estimating genome size of polyploid species using k-mer frequencies.","authors":"Laiyi Fu, Yanxin Xie, Shunkang Ling, Ying Wang, Binzhong Wang, Hejun Du, Qinke Peng, Hequan Sun","doi":"10.1093/bioinformatics/btae647","DOIUrl":"10.1093/bioinformatics/btae647","url":null,"abstract":"<p><strong>Summary: </strong>Estimating genome size using k-mer frequencies, which plays a fundamental role in designing genome sequencing and analysis projects, has remained challenging for polyploid species, i.e., ploidy p > 2. To address this, we introduce \"findGSEP,\" which is designed based on iterative curve fitting of k-mer frequencies. Precisely, it first disentangles up to p normal distributions by analyzing k-mer frequencies in whole genome sequencing of the focal species. Second, it computes the sizes of genomic regions related to 1∼p (homologous) chromosome(s) using each respective curve fitting, from which it infers the full polyploid and average haploid genome size. \"findGSEP\" can handle any level of ploidy p, and infer more accurate genome size than other well-known tools, as shown by tests using simulated and real genomic sequencing data of various species including octoploids.</p><p><strong>Availability and implementation: </strong>\"findGSEP\" was implemented as a web server, which is freely available at http://146.56.237.198:3838/findGSEP/. Also, \"findGSEP\" was implemented as an R package for parallel processing of multiple samples. 
Source code and a tutorial on its installation and usage are available at https://github.com/sperfu/findGSEP.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11552620/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142549519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models.","authors":"Mukai Wang, Simon Fontaine, Hui Jiang, Gen Li","doi":"10.1093/bioinformatics/btae661","DOIUrl":"10.1093/bioinformatics/btae661","url":null,"abstract":"<p><strong>Motivation: </strong>Microbiome differential abundance analysis (DAA) remains a challenging problem despite multiple methods proposed in the literature. The excessive zeros and compositionality of metagenomics data are two main challenges for DAA.</p><p><strong>Results: </strong>We propose a novel method called \"Analysis of Microbiome Differential Abundance by Pooling Tobit Models\" (ADAPT) to overcome these two challenges. ADAPT interprets zero counts as left-censored observations to avoid unfounded assumptions and complex models. ADAPT also encompasses a theoretically justified way of selecting non-differentially abundant microbiome taxa as a reference to reveal differentially abundant taxa while avoiding false discoveries. We generate synthetic data using independent simulation frameworks to show that ADAPT has more consistent false discovery rate control and higher statistical power than competitors. We use ADAPT to analyze 16S rRNA sequencing of saliva samples and shotgun metagenomics sequencing of plaque samples collected from infants in the COHRA2 study. The results provide novel insights into the association between the oral microbiome and early childhood dental caries.</p><p><strong>Availability and implementation: </strong>The R package ADAPT can be installed from Bioconductor at https://bioconductor.org/packages/release/bioc/html/ADAPT.html or from Github at https://github.com/mkbwang/ADAPT. 
The source codes for simulation studies and real data analysis are available at https://github.com/mkbwang/ADAPT_example.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142607231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shubo Tian, Qingyu Chen, Donald C Comeau, W John Wilbur, Zhiyong Lu
{"title":"PubMed Computed Authors in 2024: an open resource of disambiguated author names in biomedical literature.","authors":"Shubo Tian, Qingyu Chen, Donald C Comeau, W John Wilbur, Zhiyong Lu","doi":"10.1093/bioinformatics/btae672","DOIUrl":"10.1093/bioinformatics/btae672","url":null,"abstract":"<p><strong>Summary: </strong>Over 55% of author names in PubMed are ambiguous: the same name is shared by different individual researchers. This poses significant challenges on precise literature retrieval for author name queries, a common behavior in biomedical literature search. In response, we present a comprehensive dataset of disambiguated authors. Specifically, we complement the automatic PubMed Computed Authors algorithm with the latest ORCID data for improved accuracy. As a result, the enhanced algorithm achieves high performance in author name disambiguation, and subsequently our dataset contains more than 21 million disambiguated authors for over 35 million PubMed articles and is incrementally updated on a weekly basis. More importantly, we make the dataset publicly available for the community such that it can be utilized in a wide variety of potential applications beyond assisting PubMed's author name queries. Finally, we propose a set of guidelines for best practices of authors pertaining to use of their names.</p><p><strong>Availability and implementation: </strong>The PubMed Computed Authors dataset is publicly available for bulk download at: https://ftp.ncbi.nlm.nih.gov/pub/lu/ComputedAuthors/. 
Additionally, it can be queried through a web API at: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/authors/.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11588201/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142634380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}