Selina Glaser, Helene Kretzmer, Iris Tatjana Kolassa, Matthias Schlesner, Anja Fischer, Isabell Fenske, Reiner Siebert, Ole Ammerpohl
{"title":"Navigating Illumina DNA methylation data: biology versus technical artefacts.","authors":"Selina Glaser, Helene Kretzmer, Iris Tatjana Kolassa, Matthias Schlesner, Anja Fischer, Isabell Fenske, Reiner Siebert, Ole Ammerpohl","doi":"10.1093/nargab/lqae181","DOIUrl":"10.1093/nargab/lqae181","url":null,"abstract":"<p><p>Illumina-based BeadChip arrays have revolutionized genome-wide DNA methylation profiling, pushing it into diagnostics. However, comprehensive quality assessment remains challenging within a wide range of available tissue materials and sample preparation methods. This study tackles two critical issues: differentiating between biological effects and technical artefacts in suboptimal quality samples and the impact of the first sample on the Illumina-like normalization algorithm. We introduce three quality control scores based on global DNA methylation distribution (DB-Score), bin distance from copy number variation analysis (BIN-Score) and consistently methylated CpGs (CM-Score) that rely on biological features rather than internal array controls. These scores, designed to be adjustable for different analysis tools and sample cohort characteristics, were explored and benchmarked across independent cohorts. Additionally, we reveal deviations in beta values caused by different sample rankings with the Illumina-like normalization algorithm, verified these with whole-genome methylation sequencing data and showed effects on differential DNA methylation analysis. Our findings underscore the necessity of consistently utilizing a pre-defined normalization sample within the ranking process to boost reproducibility of the Illumina-like normalization algorithm. Overall, our study delivers valuable insights, practical recommendations and R functions designed to enhance reproducibility and quality assurance of DNA methylation analysis, particularly for challenging sample types.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae181"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655293/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
John T Chamberlin, Austin E Gillen, Aaron R Quinlan
{"title":"Improved characterization of 3' single-cell RNA-seq libraries with paired-end avidity sequencing.","authors":"John T Chamberlin, Austin E Gillen, Aaron R Quinlan","doi":"10.1093/nargab/lqae175","DOIUrl":"10.1093/nargab/lqae175","url":null,"abstract":"<p><p>Prevailing poly(dT)-primed 3' single-cell RNA-seq protocols generate barcoded cDNA fragments containing the reverse transcriptase priming site or in principle the polyadenylation site. Direct sequencing across this site was historically difficult because of DNA sequencing errors induced by the homopolymeric primer at the 'barcode' end. Here, we evaluate the capability of 'avidity base chemistry' DNA sequencing from Element Biosciences to sequence through the primer and enable accurate paired-end read alignment and precise quantification of polyadenylation sites. We find that the Element Aviti instrument sequences through the thymine homopolymer into the subsequent cDNA sequence without detectable loss of accuracy. The additional sequence enables direct and independent assignment of reads to polyadenylation sites, which bypasses the complexities and limitations of conventional approaches but does not consistently improve read mapping rates compared to single-end alignment. We also characterize low-level artifacts and demonstrate necessary adjustments to adapter trimming and sequence alignment regardless of platform, particularly in the context of extended read lengths. Our analyses confirm that Element avidity sequencing is an effective alternative to Illumina sequencing for standard single-cell RNA-seq, particularly for polyadenylation site measurement but do not rule out the potential for similar performance from other emerging platforms.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae175"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655283/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simon Malesys, Rachel Torchet, Bertrand Saunier, Nicolas Maillet
{"title":"AntiBody Sequence Database.","authors":"Simon Malesys, Rachel Torchet, Bertrand Saunier, Nicolas Maillet","doi":"10.1093/nargab/lqae171","DOIUrl":"10.1093/nargab/lqae171","url":null,"abstract":"<p><p>Antibodies play a crucial role in the humoral immune response against health threats, such as viral infections. Although the theoretical number of human immunoglobulins is well over a trillion, the total number of unique antibody protein sequences accessible in databases is much lower than the number found in a single individual. Training AI (Artificial Intelligence) models, for example to assist in developing serodiagnoses or antibody-based therapies, requires building datasets according to strict criteria to include as many standardized antibody sequences as possible. However, the available sequences are scattered across partially redundant databases, making it difficult to compile them into single non-redundant datasets. Here, we introduce ABSD (AntiBody Sequence Database, https://absd.pasteur.cloud), which contains data from major publicly available resources, creating the largest standardized, automatically updated and non-redundant source of public antibody sequences. This user-friendly and open website enables users to generate lists of antibodies based on selected criteria and download the unique sequence pairs of their variable regions.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae171"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655285/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IDclust: Iterative clustering for unsupervised identification of cell types with single cell transcriptomics and epigenomics.","authors":"Pacôme Prompsy, Mélissa Saichi, Félix Raimundo, Céline Vallot","doi":"10.1093/nargab/lqae174","DOIUrl":"10.1093/nargab/lqae174","url":null,"abstract":"<p><p>The increasing diversity of single-cell datasets require systematic cell type characterization. Clustering is a critical step in single-cell analysis, heavily influencing downstream analyses. However, current unsupervised clustering algorithms rely on biologically irrelevant parameters that require manual optimization and fail to capture hierarchical relationships between clusters. We developed IDclust, a framework that identifies clusters with significant biological features at multiple resolutions using biologically meaningful thresholds like fold change, adjusted <i>P</i>-value and fraction of expressing cells. By iteratively processing and clustering subsets of the dataset, IDclust guarantees that all clusters found have significantly different features and stops only when no more interpretable cluster is found. It also creates a hierarchy of clusters, enabling visualization of the hierarchical relationships between different clusters. Analyzing multiple single-cell transcriptomic reference datasets, IDclust achieves superior clustering accuracy compared to state of the art algorithms. We showcase its utility by identifying previously unannotated clusters and identifying branching patterns in scATAC datasets. Using it's unsupervised nature and ability to analyze different -omics, we compare the resolution of different histone marks in multi-omic paired-tag dataset. Overall, IDclust automates single-cell exploration, facilitates cell type annotation and provides a biologically interpretable basis for clustering.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae174"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655290/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chiara Tarracchini, Federico Fontana, Silvia Petraro, Gabriele Andrea Lugli, Leonardo Mancabelli, Francesca Turroni, Marco Ventura, Christian Milani
{"title":"Optimal Representative Strain selector-a comprehensive pipeline for selecting next-generation reference strains of bacterial species.","authors":"Chiara Tarracchini, Federico Fontana, Silvia Petraro, Gabriele Andrea Lugli, Leonardo Mancabelli, Francesca Turroni, Marco Ventura, Christian Milani","doi":"10.1093/nargab/lqae173","DOIUrl":"10.1093/nargab/lqae173","url":null,"abstract":"<p><p>Although it is common practice to use historically established 'reference strains' or 'type strains' for laboratory experiments, this approach often overlooks how effectively these strains represent the full ecological, genetic and functional diversity of the species within a specific ecological niche. In this context, this study proposes the Optimal Representative Strain (ORS) selector tool (https://zenodo.org/doi/10.5281/zenodo.13772191), an innovative bioinformatic pipeline capable of evaluating how a strain represents its whole species from a genetic and functional perspective, in addition to considering its ecological distribution in a particular ecological niche. Based on publicly available genomes, the strain that best fits all these three microbiological aspects is designated as an optimal representative strain. Moreover, a user-friendly software called Local Alternative Optimal Representative Strain selector was developed to allow researchers to screen their local library of bacterial strains for an optimal available alternative based on the reference optimal representative strain. Five different bacterial species, i.e. <i>Lacticaseibacillus paracasei</i>, <i>Lactobacillus delbrueckii</i>, <i>Streptococcus thermophilus</i>, <i>Bacteroides thetaiotaomicron</i> and <i>Lactococcus lactis</i>, were tested in three different environments to evaluate the performance of the bioinformatic pipeline in selecting optimal representative strains.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae173"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655286/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cell- and tissue-specific glycosylation pathways informed by single-cell transcriptomics.","authors":"Panagiotis Chrysinas, Shriramprasad Venkatesan, Isaac Ang, Vishnu Ghosh, Changyou Chen, Sriram Neelamegham, Rudiyanto Gunawan","doi":"10.1093/nargab/lqae169","DOIUrl":"10.1093/nargab/lqae169","url":null,"abstract":"<p><p>While single-cell studies have made significant impacts in various subfields of biology, they lag in the Glycosciences. To address this gap, we analyzed single-cell glycogene expressions in the Tabula Sapiens dataset of human tissues and cell types using a recent glycosylation-specific gene ontology (GlycoEnzOnto). At the median sequencing (count) depth, ∼40-50 out of 400 glycogenes were detected in individual cells. Upon increasing the sequencing depth, the number of detectable glycogenes saturates at ∼200 glycogenes, suggesting that the average human cell expresses about half of the glycogene repertoire. Hierarchies in glycogene and glycopathway expressions emerged from our analysis: nucleotide-sugar synthesis and transport exhibited the highest gene expressions, followed by genes for core enzymes, glycan modification and extensions, and finally terminal modifications. Interestingly, the same cell types showed variable glycopathway expressions based on their organ or tissue origin, suggesting nuanced cell- and tissue-specific glycosylation patterns. Probing deeper into the transcription factors (TFs) of glycogenes, we identified distinct groupings of TFs controlling different aspects of glycosylation: core biosynthesis, terminal modifications, etc. We present webtools to explore the interconnections across glycogenes, glycopathways and TFs regulating glycosylation in human cell/tissue types. Overall, the study presents an overview of glycosylation across multiple human organ systems.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae169"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655298/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TSS-Captur: a user-friendly pipeline for characterizing unclassified RNA transcripts.","authors":"Mathias Witte Paz, Thomas Vogel, Kay Nieselt","doi":"10.1093/nargab/lqae168","DOIUrl":"10.1093/nargab/lqae168","url":null,"abstract":"<p><p>RNA-seq and its 5'-enrichment methods for prokaryotes have enabled the precise identification of transcription start sites (TSSs), improving gene expression analysis. Computational methods are applied to these data to identify TSSs and classify them based on proximal annotated genes. While some TSSs cannot be classified at all (orphan TSSs), other TSSs are found on the reverse strand of known genes (antisense TSSs) but are not associated with the direct transcription of any known gene. Here, we introduce TSS-Captur, a novel pipeline, which uses computational approaches to characterize genomic regions starting from experimentally confirmed but unclassified TSSs. By analyzing TSS data, TSS-Captur characterizes unclassified signals, complementing prokaryotic genome annotation tools. TSS-Captur categorizes extracted transcripts as either messenger RNA for genes with coding potential or non-coding RNA (ncRNA) for non-translated genes. Additionally, it predicts the transcription termination site for each putative transcript. For ncRNA genes, the secondary structure is computed. Moreover, all putative promoter regions are analyzed to identify enriched motifs. An interactive report allows seamless data exploration. We validated TSS-Captur with a <i>Campylobacter jejuni</i> dataset and characterized unlabeled ncRNAs in <i>Streptomyces coelicolor</i>. TSS-Captur is available both as a web-application and as a command-line tool.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae168"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655288/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yixiao Zhai, Tong Zhou, Yanming Wei, Quan Zou, Yansu Wang
{"title":"ReAlign-N: an integrated realignment approach for multiple nucleic acid sequence alignment, combining global and local realignments.","authors":"Yixiao Zhai, Tong Zhou, Yanming Wei, Quan Zou, Yansu Wang","doi":"10.1093/nargab/lqae170","DOIUrl":"10.1093/nargab/lqae170","url":null,"abstract":"<p><p>Ensuring accurate multiple sequence alignment (MSA) is essential for comprehensive biological sequence analysis. However, the complexity of evolutionary relationships often results in variations that generic alignment tools may not adequately address. Realignment is crucial to remedy this issue. Currently, there is a lack of realignment methods tailored for nucleic acid sequences, particularly for lengthy sequences. Thus, there's an urgent need for the development of realignment methods better suited to address these challenges. This study presents ReAlign-N, a realignment method explicitly designed for multiple nucleic acid sequence alignment. ReAlign-N integrates both global and local realignment strategies for improved accuracy. In the global realignment phase, ReAlign-N incorporates K-Band and innovative memory-saving technology into the dynamic programming approach, ensuring high efficiency and minimal memory requirements for large-scale realignment tasks. The local realignment stage employs full matching and entropy scoring methods to identify low-quality regions and conducts realignment through MAFFT. Experimental results demonstrate that ReAlign-N consistently outperforms initial alignments on simulated and real datasets. Furthermore, compared to ReformAlign, the only existing multiple nucleic acid sequence realignment tool, ReAlign-N, exhibits shorter running times and occupies less memory space. The source code and test data for ReAlign-N are available on GitHub (https://github.com/malabz/ReAlign-N).</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae170"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655299/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Richard Mayne, Pakorn Aiewsakun, Dann Turner, Evelien M Adriaenssens, Peter Simmonds
{"title":"GRAViTy-V2: a grounded viral taxonomy application.","authors":"Richard Mayne, Pakorn Aiewsakun, Dann Turner, Evelien M Adriaenssens, Peter Simmonds","doi":"10.1093/nargab/lqae183","DOIUrl":"10.1093/nargab/lqae183","url":null,"abstract":"<p><p>Taxonomic classification of viruses is essential for understanding their evolution. Genomic classification of viruses at higher taxonomic ranks, such as order or phylum, is typically based on alignment and comparison of amino acid sequence motifs in conserved genes. Classification at lower taxonomic ranks, such as genus or species, is usually based on nucleotide sequence identities between genomic sequences. Building on our whole-genome analytical classification framework, we here describe Genome Relationships Applied to Viral Taxonomy Version 2 (GRAViTy-V2), which encompasses a greatly expanded range of features and numerous optimisations, packaged as an application that may be used as a general-purpose virus classification tool. Using 28 datasets derived from the ICTV 2022 taxonomy proposals, GRAViTy-V2 output was compared against human expert-curated classifications used for assignments in the 2023 round of ICTV taxonomy changes. GRAViTy-V2 produced taxonomies equivalent to manually-curated versions down to the family level and in almost all cases, to genus and species levels. The majority of discrepant results arose from errors in coding sequence annotations in INDSC records, or from inclusion of incomplete genome sequences in the analysis. Analysis times ranged from 1-506 min (median 3.59) on datasets with 17-1004 genomes and mean genome length of 3000-1 000 000 bases.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae183"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655284/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianshu Zhao, Jean Pierre Both, Konstantinos T Konstantinidis
{"title":"Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.","authors":"Jianshu Zhao, Jean Pierre Both, Konstantinos T Konstantinidis","doi":"10.1093/nargab/lqae172","DOIUrl":"10.1093/nargab/lqae172","url":null,"abstract":"<p><p>Dimension reduction (DR or embedding) algorithms such as t-SNE and UMAP have many applications in big data visualization but remain slow for large datasets. Here, we further improve the UMAP-like algorithms by (i) combining several aspects of t-SNE and UMAP to create a new DR algorithm; (ii) replacing its rate-limiting step, the K-nearest neighbor graph (K-NNG), with a Hierarchical Navigable Small World (HNSW) graph; and (iii) extending the functionality to DNA/RNA sequence data by combining HNSW with locality sensitive hashing algorithms (e.g. MinHash) for distance estimations among sequences. We also provide additional features including computation of local intrinsic dimension and hubness, which can reflect structures and properties of the underlying data that strongly affect the K-NNG accuracy, and thus the quality of the resulting embeddings. Our library, called annembed, is implemented, and fully parallelized in Rust and shows competitive accuracy compared to the popular UMAP-like algorithms. Additionally, we showcase the usefulness and scalability of our library with three real-world examples: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data and metagenomic contig (or population) binning. Therefore, annembed can facilitate DR for several tasks for biological data analysis where distance computation is expensive or when there are millions to billions of data points to process.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae172"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655291/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}