Konstantinos A. Kyritsis, Nikolaos Pechlivanis, Fotis Psomopoulos
{"title":"Software pipelines for RNA-Seq, ChIP-Seq and germline variant calling analyses in common workflow language (CWL)","authors":"Konstantinos A. Kyritsis, Nikolaos Pechlivanis, Fotis Psomopoulos","doi":"10.3389/fbinf.2023.1275593","DOIUrl":"https://doi.org/10.3389/fbinf.2023.1275593","url":null,"abstract":"Background: Automating data analysis pipelines is a key requirement to ensure reproducibility of results, especially when dealing with large volumes of data. Here we assembled automated pipelines for the analysis of High-throughput Sequencing (HTS) data originating from RNA-Seq, ChIP-Seq and Germline variant calling experiments. We implemented these workflows in Common Workflow Language (CWL) and evaluated their performance by: i) reproducing the results of two previously published studies on Chronic Lymphocytic Leukemia (CLL), and ii) analyzing whole genome sequencing data from four Genome in a Bottle Consortium (GIAB) samples, comparing the detected variants against their respective gold standard truth sets. Findings: We demonstrated that CWL-implemented workflows clearly achieved high accuracy in reproducing previously published results, discovering significant biomarkers and detecting germline SNP and small INDEL variants. Conclusion: CWL pipelines are characterized by reproducibility and reusability; combined with containerization, they provide the ability to overcome issues of software incompatibility and laborious configuration requirements. In addition, they are flexible and can be used immediately or adapted to the specific needs of an experiment or study. The CWL-based workflows developed in this study, along with version information for all software tools, are publicly available on GitHub (https://github.com/BiodataAnalysisGroup/CWL_HTS_pipelines) under the MIT License. 
They are suitable for the analysis of short-read (such as Illumina-based) data and constitute an open resource that can facilitate automation, reproducibility and cross-platform compatibility for standard bioinformatic analyses.","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135475666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insights on poster preparation practices in life sciences","authors":"Helena Klara Jambor","doi":"10.3389/fbinf.2023.1216139","DOIUrl":"https://doi.org/10.3389/fbinf.2023.1216139","url":null,"abstract":"Posters are intended to spark scientific dialogue and are omnipresent at biological conferences. Guides and how-to articles help life scientists in preparing informative visualizations in poster format. However, posters shown at conferences are at present often overloaded with data and text and lack visual structure. Here, I surveyed life scientists themselves to understand how they are currently preparing posters and which parts they struggle with. Biologists spend on average two entire days preparing one poster, with half of the time devoted to visual design aspects. Most receive no design or software training and also receive little to no feedback when preparing their visualizations. In conclusion, training in visualization principles and tools for poster preparation would likely improve the quality of conference posters. This would also benefit other common visuals such as figures and slides, and improve the science communication of researchers overall.","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"126 1-2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135270825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Famke Bäuerle, Gwendolyn O Döbel, Laura Camus, Simon Heilbronner, Andreas Dräger
{"title":"Genome-scale metabolic models consistently predict <i>in vitro</i> characteristics of <i>Corynebacterium striatum</i>.","authors":"Famke Bäuerle, Gwendolyn O Döbel, Laura Camus, Simon Heilbronner, Andreas Dräger","doi":"10.3389/fbinf.2023.1214074","DOIUrl":"10.3389/fbinf.2023.1214074","url":null,"abstract":"<p><p><b>Introduction:</b> Genome-scale metabolic models (GEMs) are organism-specific knowledge bases which can be used to unravel pathogenicity or improve production of specific metabolites in biotechnology applications. However, the validity of predictions for bacterial proliferation in <i>in vitro</i> settings is hardly investigated. <b>Methods:</b> The present work combines <i>in silico</i> and <i>in vitro</i> approaches to create and curate strain-specific genome-scale metabolic models of <i>Corynebacterium striatum</i>. <b>Results:</b> We introduce five newly created strain-specific genome-scale metabolic models (GEMs) of high quality, satisfying all contemporary standards and requirements. All these models have been benchmarked using the community standard test suite Metabolic Model Testing (MEMOTE) and were validated by laboratory experiments. For the curation of those models, the software infrastructure <i>refineGEMs</i> was developed to work on these models in parallel and to comply with the quality standards for GEMs. The model predictions were confirmed by experimental data and a new comparison metric based on the doubling time was developed to quantify bacterial growth. <b>Discussion:</b> Future modeling projects can rely on the proposed software, which is independent of specific environmental conditions. The validation approach based on the growth rate calculation is now accessible and closely aligned with biological questions. The curated models are freely available via BioModels and a GitHub repository and can be used. 
The open-source software refineGEMs is available from https://github.com/draeger-lab/refinegems.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1214074"},"PeriodicalIF":0.0,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10626998/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71489591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erik Burlingame, Luke Ternes, Jia-Ren Lin, Yu-An Chen, Eun Na Kim, Joe W Gray, Young Hwan Chang
{"title":"3D multiplexed tissue imaging reconstruction and optimized region of interest (ROI) selection through deep learning model of channels embedding.","authors":"Erik Burlingame, Luke Ternes, Jia-Ren Lin, Yu-An Chen, Eun Na Kim, Joe W Gray, Young Hwan Chang","doi":"10.3389/fbinf.2023.1275402","DOIUrl":"10.3389/fbinf.2023.1275402","url":null,"abstract":"<p><p><b>Introduction:</b> Tissue-based sampling and diagnosis are defined as the extraction of information from certain limited spaces and its diagnostic significance of a certain object. Pathologists deal with issues related to tumor heterogeneity since analyzing a single sample does not necessarily capture a representative depiction of cancer, and a tissue biopsy usually only presents a small fraction of the tumor. Many multiplex tissue imaging platforms (MTIs) make the assumption that tissue microarrays (TMAs) containing small core samples of 2-dimensional (2D) tissue sections are a good approximation of bulk tumors although tumors are not 2D. However, emerging whole slide imaging (WSI) or 3D tumor atlases that use MTIs like cyclic immunofluorescence (CyCIF) strongly challenge this assumption. In spite of the additional insight gathered by measuring the tumor microenvironment in WSI or 3D, it can be prohibitively expensive and time-consuming to process tens or hundreds of tissue sections with CyCIF. Even when resources are not limited, the criteria for region of interest (ROI) selection in tissues for downstream analysis remain largely qualitative and subjective as stratified sampling requires the knowledge of objects and evaluates their features. Despite the fact that TMAs fail to adequately approximate whole tissue features, a theoretical subsampling of tissue exists that can best represent the tumor in the whole slide image. 
<b>Methods:</b> To address these challenges, we propose deep learning approaches to learn multi-modal image translation tasks from two aspects: 1) generative modeling approach to reconstruct 3D CyCIF representation and 2) co-embedding CyCIF image and Hematoxylin and Eosin (H&E) section to learn multi-modal mappings by a cross-domain translation for minimum representative ROI selection. <b>Results and discussion:</b> We demonstrate that generative modeling enables a 3D virtual CyCIF reconstruction of a colorectal cancer specimen given a small subset of the imaging data at training time. By co-embedding histology and MTI features, we propose a simple convex optimization for objective ROI selection. We demonstrate the potential application of ROI selection and the efficiency of its performance with respect to cellular heterogeneity.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1275402"},"PeriodicalIF":2.8,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620917/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71489590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Protein quality assessment with a loss function designed for high-quality decoys.","authors":"Soumyadip Roy, Asa Ben-Hur","doi":"10.3389/fbinf.2023.1198218","DOIUrl":"https://doi.org/10.3389/fbinf.2023.1198218","url":null,"abstract":"<p><p><b>Motivation:</b> The prediction of a protein 3D structure is essential for understanding protein function, drug discovery, and disease mechanisms; with the advent of methods like AlphaFold that are capable of producing very high-quality decoys, ensuring the quality of those decoys can provide further confidence in the accuracy of their predictions. <b>Results:</b> In this work, we describe Q<sub><i>ϵ</i></sub>, a graph convolutional network (GCN) that utilizes a minimal set of atom and residue features as inputs to predict the global distance test total score (GDTTS) and local distance difference test (lDDT) score of a decoy. To improve the model's performance, we introduce a novel loss function based on the <i>ϵ</i>-insensitive loss function used for SVM regression. This loss function is specifically designed for evaluating the characteristics of the quality assessment problem and provides predictions with improved accuracy over standard loss functions used for this task. Despite using only a minimal set of features, it matches the performance of recent state-of-the-art methods like DeepUMQA. 
<b>Availability:</b> The code for Q<sub><i>ϵ</i></sub> is available at https://github.com/soumyadip1997/qepsilon.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1198218"},"PeriodicalIF":0.0,"publicationDate":"2023-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10616882/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71429770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Chinese text and DNA shift encoding scheme based on biomass plasmid storage.","authors":"Xu Yang, Langwen Lai, Xiaoli Qiang, Ming Deng, Yuhao Xie, Xiaolong Shi, Zheng Kou","doi":"10.3389/fbinf.2023.1276934","DOIUrl":"10.3389/fbinf.2023.1276934","url":null,"abstract":"<p><p>DNA, as the storage medium in organisms, can address the shortcomings of existing electromagnetic storage media, such as low information density, high maintenance power consumption, and short storage time. Current research on DNA storage mainly focuses on designing corresponding encoders to convert binary data into DNA base data that meets biological constraints. We have created a new Chinese character code table that enables exceptionally high information storage density for storing Chinese characters (compared to traditional UTF-8 encoding). To meet biological constraints, we have devised a DNA shift coding scheme with low algorithmic complexity, which can encode any strand of DNA even if it has an excessively long homopolymer. The designed DNA sequence will be stored in a double-stranded plasmid of 744bp, ensuring high reliability during storage. Additionally, the plasmid's resistance to environmental interference ensures long-term stable information storage. Moreover, it can be replicated at a lower cost.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1276934"},"PeriodicalIF":2.8,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10602677/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71415731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New alignment method for remote protein sequences by the direct use of pairwise sequence correlations and substitutions.","authors":"Kejue Jia, Mesih Kilinc, Robert L Jernigan","doi":"10.3389/fbinf.2023.1227193","DOIUrl":"10.3389/fbinf.2023.1227193","url":null,"abstract":"<p><p>Understanding protein sequences and how they relate to the functions of proteins is extremely important. One of the most basic operations in bioinformatics is sequence alignment, and usually the first things learned from alignments are which positions are the most conserved; often these are critical parts of the structure, such as enzyme active site residues. In addition, the contact pairs in a protein usually correspond closely to the correlations between residue positions in the multiple sequence alignment, and these usually change in a systematic and coordinated way: if one position changes, then the other member of the pair also changes to compensate. In the present work, these correlated pairs are taken as anchor points for a new type of sequence alignment. The main advantage of the method here is that it combines the remote homolog detection from our method PROST with pairwise sequence substitutions in the rigorous method from Kleinjung et al. 
We show a few examples of some resulting sequence alignments, and how they can lead to improvements in alignments for function, even for a disordered protein.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1227193"},"PeriodicalIF":0.0,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10602800/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71415730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas Krannich, Marina Herrera Sarrias, Hiba Ben Aribi, Moustafa Shokrof, Alfredo Iacoangeli, Ammar Al-Chalabi, Fritz J Sedlazeck, Ben Busby, Ahmad Al Khleifat
{"title":"VariantSurvival: a tool to identify genotype-treatment response.","authors":"Thomas Krannich, Marina Herrera Sarrias, Hiba Ben Aribi, Moustafa Shokrof, Alfredo Iacoangeli, Ammar Al-Chalabi, Fritz J Sedlazeck, Ben Busby, Ahmad Al Khleifat","doi":"10.3389/fbinf.2023.1277923","DOIUrl":"10.3389/fbinf.2023.1277923","url":null,"abstract":"<p><p><b>Motivation:</b> For a number of neurological diseases, such as Alzheimer's disease, amyotrophic lateral sclerosis, and many others, certain genes are known to be involved in the disease mechanism. A common question is whether a structural variant in any such gene may be related to drug response in clinical trials and how this relationship can contribute to the lifecycle of drug development. <b>Results:</b> To this end, we introduce VariantSurvival, a tool that identifies changes in survival relative to structural variants within target genes. VariantSurvival matches annotated structural variants with genes that are clinically relevant to neurological diseases. A Cox regression model determines the change in survival between the placebo and clinical trial groups with respect to the number of structural variants in the drug target genes. We demonstrate the functionality of our approach with the exemplary case of the <i>SETX</i> gene. 
VariantSurvival has a user-friendly and lightweight graphical user interface built on the shiny web application package.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1277923"},"PeriodicalIF":2.8,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10598652/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"54232718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepRaccess: high-speed RNA accessibility prediction using deep learning.","authors":"Kaisei Hara, Natsuki Iwano, Tsukasa Fukunaga, Michiaki Hamada","doi":"10.3389/fbinf.2023.1275787","DOIUrl":"10.3389/fbinf.2023.1275787","url":null,"abstract":"<p><p>RNA accessibility is a useful RNA secondary structural feature for predicting RNA-RNA interactions and translation efficiency in prokaryotes. However, conventional accessibility calculation tools, such as Raccess, are computationally expensive and require considerable computational time to perform transcriptome-scale analysis. In this study, we developed DeepRaccess, which predicts RNA accessibility based on deep learning methods. DeepRaccess was trained to take artificial RNA sequences as input and to predict the accessibility of these sequences as calculated by Raccess. Simulation and empirical dataset analyses showed that the accessibility predicted by DeepRaccess was highly correlated with the accessibility calculated by Raccess. In addition, we confirmed that DeepRaccess could predict protein abundance in <i>E. coli</i> with moderate accuracy from the sequences around the start codon. We also demonstrated that DeepRaccess achieved a speed-up of tens to hundreds of times in a GPU environment. 
The source codes and the trained models of DeepRaccess are freely available at https://github.com/hmdlab/DeepRaccess.</p>","PeriodicalId":73066,"journal":{"name":"Frontiers in bioinformatics","volume":"3 ","pages":"1275787"},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10597636/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50163995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}