Bioinformatics advances | Pub Date: 2024-07-29 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae108
Kaitlyn E Wade, Lianghong Chen, Chutong Deng, Gen Zhou, Pingzhao Hu
{"title":"Investigating alignment-free machine learning methods for HIV-1 subtype classification.","authors":"Kaitlyn E Wade, Lianghong Chen, Chutong Deng, Gen Zhou, Pingzhao Hu","doi":"10.1093/bioadv/vbae108","DOIUrl":"10.1093/bioadv/vbae108","url":null,"abstract":"<p><strong>Motivation: </strong>Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods involve creating numerical representations for genetic sequences and applying statistical or machine learning methods. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification.</p><p><strong>Results: </strong>We present a comprehensive analysis of sequence vectorization methods across machine learning methods. We report a <i>k</i>-mer-based XGBoost model with a balanced accuracy of 0.84, indicating that it has good overall performance for both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based support vector machine that achieves promising results on precision and balanced accuracy. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show promise. 
Our results could help to develop improved HIV-1 subtype classification methods, leading to improved individual patient outcomes, and the development of subtype-specific treatments.</p><p><strong>Availability and implementation: </strong>Source code is available at https://www.github.com/kwade4/HIV_Subtypes.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11371153/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142127524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-25 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae105
Yushu Shi, Liangliang Zhang, Kim-Anh Do, Robert R Jenq, Christine B Peterson
{"title":"survivalContour: visualizing predicted survival via colored contour plots.","authors":"Yushu Shi, Liangliang Zhang, Kim-Anh Do, Robert R Jenq, Christine B Peterson","doi":"10.1093/bioadv/vbae105","DOIUrl":"10.1093/bioadv/vbae105","url":null,"abstract":"<p><strong>Summary: </strong>Advances in survival analysis have facilitated unprecedented flexibility in data modeling, yet there remains a lack of tools for illustrating the influence of continuous covariates on predicted survival outcomes. We propose the utilization of a colored contour plot to depict the predicted survival probabilities over time. Our approach is capable of supporting conventional models, including the Cox and Fine-Gray models. However, its capability shines when coupled with cutting-edge machine learning models such as random survival forests and deep neural networks.</p><p><strong>Availability and implementation: </strong>We provide a Shiny app at https://biostatistics.mdanderson.org/shinyapps/survivalContour/ and an R package available at https://github.com/YushuShi/survivalContour as implementations of this tool.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11290613/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141861796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-23 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae107
Saul Pierotti, Bettina Welz, Mireia Osuna-López, Tomas Fitzgerald, Joachim Wittbrodt, Ewan Birney
{"title":"Genotype imputation in F2 crosses of inbred lines.","authors":"Saul Pierotti, Bettina Welz, Mireia Osuna-López, Tomas Fitzgerald, Joachim Wittbrodt, Ewan Birney","doi":"10.1093/bioadv/vbae107","DOIUrl":"10.1093/bioadv/vbae107","url":null,"abstract":"<p><strong>Motivation: </strong>Crosses among inbred lines are a fundamental tool for the discovery of genetic loci associated with phenotypes of interest. In organisms for which large reference panels or SNP chips are not available, imputation from low-pass whole-genome sequencing is an effective method for obtaining genotype data from a large number of individuals. To date, a structured analysis of the conditions required for optimal genotype imputation has not been performed.</p><p><strong>Results: </strong>We report a systematic exploration of the effect of several design variables on imputation performance in F2 crosses of inbred medaka lines using the imputation software STITCH. We determined that, depending on the number of samples, imputation performance reaches a plateau when increasing the per-sample sequencing coverage. We also systematically explored the trade-offs between cost, imputation accuracy, and sample numbers. We developed a computational pipeline to streamline the process, enabling other researchers to perform a similar cost-benefit analysis on their population of interest.</p><p><strong>Availability and implementation: </strong>The source code for the pipeline is available at https://github.com/birneylab/stitchimpute. 
While our pipeline has been developed and tested for an F2 population, the software can also be used to analyse populations with a different structure.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11286293/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141794100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-22 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae106
Jehad Aldahdooh, Ziaurrehman Tanoli, Jing Tang
{"title":"Mining drug-target interactions from biomedical literature using chemical and gene descriptions-based ensemble transformer model.","authors":"Jehad Aldahdooh, Ziaurrehman Tanoli, Jing Tang","doi":"10.1093/bioadv/vbae106","DOIUrl":"10.1093/bioadv/vbae106","url":null,"abstract":"<p><strong>Motivation: </strong>Drug-target interactions (DTIs) play a pivotal role in drug discovery, which aims to identify potential drug targets and elucidate their mechanisms of action. In recent years, the application of natural language processing (NLP), particularly when combined with pre-trained language models, has gained considerable momentum in the biomedical domain, with the potential to mine vast amounts of texts to facilitate the efficient extraction of DTIs from the literature.</p><p><strong>Results: </strong>In this article, we approach DTI mining as an entity-relationship extraction problem, utilizing different pre-trained transformer language models, such as BERT, to extract DTIs. Our results indicate that an ensemble approach that combines gene descriptions from the Entrez Gene database with chemical descriptions from the Comparative Toxicogenomics Database (CTD) is critical for achieving optimal performance. The proposed model achieves an <i>F</i>1 score of 80.6 on the hidden DrugProt test set, which is the top-ranked performance among all the submitted models in the official evaluation. Furthermore, we conduct a comparative analysis to evaluate the effectiveness of various gene textual descriptions sourced from the Entrez Gene and UniProt databases to gain insights into their impact on performance. 
Our findings highlight the potential of NLP-based text mining using gene and chemical descriptions to improve drug-target extraction tasks.</p><p><strong>Availability and implementation: </strong>Datasets utilized in this study are accessible at https://dtis.drugtargetcommons.org/.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11293871/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141876854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-18 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae091
{"title":"Correction to: Quantitative transcriptomic and epigenomic data analysis: a primer.","authors":"","doi":"10.1093/bioadv/vbae091","DOIUrl":"https://doi.org/10.1093/bioadv/vbae091","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.1093/bioadv/vbae019.].</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11257713/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141725177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-17 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae104
Michael Predl, Kilian Gandolf, Michael Hofer, Thomas Rattei
{"title":"ScyNet: Visualizing interactions in community metabolic models.","authors":"Michael Predl, Kilian Gandolf, Michael Hofer, Thomas Rattei","doi":"10.1093/bioadv/vbae104","DOIUrl":"10.1093/bioadv/vbae104","url":null,"abstract":"<p><strong>Motivation: </strong>Genome-scale community metabolic models are used to gain mechanistic insights into interactions between community members. However, existing tools for visualizing metabolic models only cater to the needs of single-organism models.</p><p><strong>Results: </strong>ScyNet is a Cytoscape app for visualizing community metabolic models, generating networks with reduced complexity by focusing on interactions between community members. ScyNet can incorporate the state of a metabolic model via fluxes or flux ranges, which is demonstrated on a previously published simplified cystic fibrosis airway community model.</p><p><strong>Availability and implementation: </strong>ScyNet is freely available under an MIT licence and can be retrieved via the Cytoscape App Store (apps.cytoscape.org/apps/scynet). The source code is available on GitHub (github.com/univieCUBE/ScyNet).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11315608/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141918224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-13 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae103
Chiara Rodella, Symela Lazaridi, Thomas Lemmin
{"title":"TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms.","authors":"Chiara Rodella, Symela Lazaridi, Thomas Lemmin","doi":"10.1093/bioadv/vbae103","DOIUrl":"10.1093/bioadv/vbae103","url":null,"abstract":"<p><strong>Motivation: </strong>Understanding protein thermostability is essential for numerous biotechnological applications, but traditional experimental methods are time-consuming, expensive, and error-prone. Recently, deep learning (DL) techniques from natural language processing (NLP) were extended to the field of biology, since the primary sequence of proteins can be viewed as a string of amino acids that follow a physicochemical grammar.</p><p><strong>Results: </strong>In this study, we developed TemBERTure, a DL framework that predicts thermostability class and melting temperature from protein sequences. Our findings emphasize the importance of data diversity for training robust models, especially by including sequences from a wider range of organisms. Additionally, we suggest using attention scores from DL models to gain deeper insights into protein thermostability. Analyzing these scores in conjunction with the 3D protein structure can enhance understanding of the complex interactions among amino acid properties, their positioning, and the surrounding microenvironment. 
By addressing the limitations of current prediction methods and introducing new exploration avenues, this research paves the way for more accurate and informative protein thermostability predictions, ultimately accelerating advancements in protein engineering.</p><p><strong>Availability and implementation: </strong>TemBERTure model and the data are available at: https://github.com/ibmm-unibe-ch/TemBERTure.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11262459/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141749771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-11 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae098
Zehua T Zhou, Gregory L Owens, Wesley A Larson, Runyang Nicolas Lou, Peter H Sudmant
{"title":"loco-pipe: an automated pipeline for population genomics with low-coverage whole-genome sequencing.","authors":"Zehua T Zhou, Gregory L Owens, Wesley A Larson, Runyang Nicolas Lou, Peter H Sudmant","doi":"10.1093/bioadv/vbae098","DOIUrl":"10.1093/bioadv/vbae098","url":null,"abstract":"<p><strong>Summary: </strong>We developed loco-pipe, a Snakemake pipeline that seamlessly streamlines a set of essential population genomic analyses for low-coverage whole genome sequencing (lcWGS) data. loco-pipe is highly automated, easily customizable, massively parallelized, and thus is a valuable tool for both new and experienced users of lcWGS.</p><p><strong>Availability and implementation: </strong>loco-pipe is published under the GPLv3. It is freely available on GitHub (github.com/sudmantlab/loco-pipe) and archived on Zenodo (doi.org/10.5281/zenodo.10425920).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11246161/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141617759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-07-04 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae100
Matthias Mattanovich, Viktor Hesselberg-Thomsen, Annette Lien, Dovydas Vaitkus, Victoria Sara Saad, Douglas McCloskey
{"title":"INCAWrapper: a Python wrapper for INCA for seamless data import, -export, and -processing.","authors":"Matthias Mattanovich, Viktor Hesselberg-Thomsen, Annette Lien, Dovydas Vaitkus, Victoria Sara Saad, Douglas McCloskey","doi":"10.1093/bioadv/vbae100","DOIUrl":"10.1093/bioadv/vbae100","url":null,"abstract":"<p><strong>Motivation: </strong>INCA is a powerful tool for metabolic flux analysis; however, importing and exporting data and results can be tedious, which limits the use of INCA in automated workflows.</p><p><strong>Results: </strong>The INCAWrapper enables the use of INCA purely through Python, which allows the use of INCA in common data science workflows.</p><p><strong>Availability and implementation: </strong>The INCAWrapper is implemented in Python and can be found at https://github.com/biosustain/incawrapper. It is freely available under an MIT License. To run INCA, the user needs their own MATLAB and INCA licenses. INCA is freely available for noncommercial use at mfa.vueinnovations.com.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11245311/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141617758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2024-06-28 | eCollection Date: 2024-01-01 | DOI: 10.1093/bioadv/vbae087
Jaewoo Lee, Mehita Achuthan, Lucas Chen, Paulina Carmona-Mora
{"title":"A customizable secure DIY web application for accessing, sharing, and browsing aggregate experimental results and metadata.","authors":"Jaewoo Lee, Mehita Achuthan, Lucas Chen, Paulina Carmona-Mora","doi":"10.1093/bioadv/vbae087","DOIUrl":"10.1093/bioadv/vbae087","url":null,"abstract":"<p><strong>Summary: </strong>A problem spanning many research fields is that processed data and research results are often scattered, which makes data access, analysis, extraction, and team sharing more challenging. We have developed a platform for researchers to easily manage tabular data with features like browsing, bookmarking, and linking to external open knowledge bases. The source code, originally designed for genomics research, is customizable for use by other fields or data, providing a no- to low-cost DIY system for research teams.</p><p><strong>Availability and implementation: </strong>The source code of our DIY app is available at https://github.com/Carmona-MoraUCD/Human-Genomics-Browser. It can be downloaded and run by anyone with a web browser, Python 3, and Node.js on their machine. The web application is licensed under the MIT license.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11257709/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141725176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}