Thibaut Goldsborough, Alan O'Callaghan, Fiona Inglis, Leo Leplat, Andrew Filby, Hakan Bilen, Peter Bankhead
{"title":"A novel channel invariant architecture for the segmentation of cells and nuclei in multiplexed images using InstanSeg","authors":"Thibaut Goldsborough, Alan O'Callaghan, Fiona Inglis, Leo Leplat, Andrew Filby, Hakan Bilen, Peter Bankhead","doi":"10.1101/2024.09.04.611150","DOIUrl":"https://doi.org/10.1101/2024.09.04.611150","url":null,"abstract":"The quantitative analysis of bioimaging data increasingly depends on the accurate segmentation of cells and nuclei, a significant challenge for the analysis of high-plex imaging data. Current deep learning-based approaches to segment cells in multiplexed images require reducing the input to a small and fixed number of input channels, discarding imaging information in the process. We present ChannelNet, a novel deep learning architecture for generating three-channel representations of multiplexed images irrespective of the number or ordering of imaged biomarkers. When combined with InstanSeg, ChannelNet sets a new benchmark for the segmentation of cells and nuclei on public multiplexed imaging datasets. We provide an open implementation of our method and integrate it in open source software. Our code and models are available at https://github.com/instanseg/instanseg.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Danny Salem, Anuradha Surendra, Graeme SV McDowell, Miroslava Cuperlovic-Culf
{"title":"Projection Statistics ProST Online statistical assessment of group separation in data projection analysis","authors":"Danny Salem, Anuradha Surendra, Graeme SV McDowell, Miroslava Cuperlovic-Culf","doi":"10.1101/2024.09.04.611273","DOIUrl":"https://doi.org/10.1101/2024.09.04.611273","url":null,"abstract":"Motivation: Unsupervised data projection for the determination of trends in the data, visualization of multidimensional data in a reduced dimension space or feature space reduction through combination of data is a major step in data mining. Methods such as Principal Component Analysis or t-distributed Stochastic Neighbor Embedding are regularly used as one of the first steps in computational biology or omics investigation. However, the significance of the separation of sample groups by these methods generally relies on visual assessment. User-friendly applications for different projection methods, each focusing on distinct data properties, are needed, as well as a rigorous method for statistical determination of the significance of separation of groups of interest in each dataset.\nResults: We present Projection STatistics (ProST), a user-friendly solution for data projection analysis providing three unsupervised (PCA, t-SNE and UMAP) and one supervised (LDA) approach. For each method we include a novel statistical investigation of the significance of group separation with Mann-Whitney U-rank or t-test analysis as well as necessary preprocessing steps. 
ProST provides an unbiased, objective application of the determination of the significance of the separation of measurement groups through either linear or manifold projection analysis with methods ranging from a focus on the separation of points based on major variances or on point proximity based on distance.\nAvailability: The ProST software application is freely available at https://complimet.ca/shiny/ProST/ with source code provided on https://github.com/complimet/prost.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dinghai Zheng, Jun Wang, Logan Persyn, Yue Liu, Fernando Ulloa Montoya, Can Cenik, Vikram Agarwal
{"title":"Predicting the translation efficiency of messenger RNA in mammalian cells","authors":"Dinghai Zheng, Jun Wang, Logan Persyn, Yue Liu, Fernando Ulloa Montoya, Can Cenik, Vikram Agarwal","doi":"10.1101/2024.08.11.607362","DOIUrl":"https://doi.org/10.1101/2024.08.11.607362","url":null,"abstract":"The degree to which translational control is specified by mRNA sequence is poorly understood in mammalian cells. Here, we constructed and leveraged a compendium of 3,819 ribosomal profiling datasets, distilling them into a transcriptome-wide atlas of translation efficiency (TE) measurements encompassing >140 human and mouse cell types. We subsequently developed RiboNN, a multitask deep convolutional neural network, and classic machine learning models to predict TEs in hundreds of cell types from sequence-encoded mRNA features, achieving state-of-the-art performance (r=0.79 in human and r=0.78 in mouse for mean TE across cell types). While the majority of earlier models solely considered 5′ UTR sequence, RiboNN integrates contributions from the full-length mRNA sequence, learning that the 5′ UTR, CDS, and 3′ UTR respectively possess ~67%, 31%, and 2% per-nucleotide information density in the specification of mammalian TEs. Interpretation of RiboNN revealed that the spatial positioning of low-level di- and tri-nucleotide features (i.e., including codons) largely explains model performance, capturing mechanistic principles such as how ribosomal processivity and tRNA abundance control translational output. RiboNN is predictive of the translational behavior of base-modified therapeutic RNA, and can explain evolutionary selection pressures in human 5′ UTRs. 
Finally, it detects a common language governing mRNA regulatory control and highlights the interconnectedness of mRNA translation, stability, and localization in mammalian organisms.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yue Liu, Ian Hoskins, Michael Geng, Qiuxia Zhao, Jonathan Chacko, Kangsheng Qi, Logan Persyn, Jun Wang, Dinghai Zheng, Yochen Zhong, Shilpa Rao, Dayea Park, Elif Sarinay Cenik, Vikram Agarwal, Hakan Ozadam, Can Cenik
{"title":"Translation efficiency covariation across cell types is a conserved organizing principle of mammalian transcriptomes","authors":"Yue Liu, Ian Hoskins, Michael Geng, Qiuxia Zhao, Jonathan Chacko, Kangsheng Qi, Logan Persyn, Jun Wang, Dinghai Zheng, Yochen Zhong, Shilpa Rao, Dayea Park, Elif Sarinay Cenik, Vikram Agarwal, Hakan Ozadam, Can Cenik","doi":"10.1101/2024.08.11.607360","DOIUrl":"https://doi.org/10.1101/2024.08.11.607360","url":null,"abstract":"Characterization of shared patterns of RNA expression between genes across conditions has led to the discovery of regulatory networks and novel biological functions. However, it is unclear if such coordination extends to translation, a critical step in gene expression. Here, we uniformly analyzed 3,819 ribosome profiling datasets from 117 human and 94 mouse tissues and cell lines. We introduce the concept of Translation Efficiency Covariation (TEC), identifying coordinated translation patterns across cell types. We nominate potential mechanisms driving shared patterns of translation regulation. TEC is conserved across human and mouse cells and helps uncover gene functions. Moreover, our observations indicate that proteins that physically interact are highly enriched for positive covariation at both translational and transcriptional levels. 
Our findings establish translational covariation as a conserved organizing principle of mammalian transcriptomes.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"119 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonathan Cesare Mcleod, Changhyun Lim, Tanner Stokes, Jalil-Ahmad Sharif, Vagif Zeynalli, Lucas Wiens, Alysha C D'Souza, Lauren Colenso-Semple, James McKendry, Robert W Morton, Cameron J Mitchell, Sara Y Oikawa, Claes Wahlestedt, Paul Chapple, Chris McGlory, James A Timmons, Stuart M Phillips
{"title":"Network-based modelling reveals cell-type enriched patterns of non-coding RNA regulation during human skeletal muscle remodelling","authors":"Jonathan Cesare Mcleod, Changhyun Lim, Tanner Stokes, Jalil-Ahmad Sharif, Vagif Zeynalli, Lucas Wiens, Alysha C D'Souza, Lauren Colenso-Semple, James McKendry, Robert W Morton, Cameron J Mitchell, Sara Y Oikawa, Claes Wahlestedt, Paul Chapple, Chris McGlory, James A Timmons, Stuart M Phillips","doi":"10.1101/2024.08.11.606848","DOIUrl":"https://doi.org/10.1101/2024.08.11.606848","url":null,"abstract":"Most human genes are non-protein-coding RNA (ncRNA). A handful of ncRNAs have characterised functions, including important epigenetic roles in development and disease. Neither ncRNA nor multinucleated muscle is ideally suited to sequencing technologies. We therefore used customised RNA profiling methods and quantitative network modelling to study cell-type specific ncRNA transcriptome responses during load-induced skeletal muscle hypertrophy. We completed five independent supervised exercise-training studies (n=144) and 61% of individuals accrued muscle mass beyond normal technical variation (lean mass responders, LMR). The remainder were defined as having no measurable lean mass response (NMLMR). Fifty ncRNA genes (FDR <1%) were differentially regulated in LMR, and in total we identified 110 ncRNAs for further study. A network model of the human muscle transcriptome was built (n=437 samples), assigning ncRNAs to protein coding modules representing functional pathways or single-cell types. We identified that the known hypertrophy-related ncRNA, CYTOR, was leukocyte-associated in vivo in humans (FDR = 4.9 × 10^-7; Fold Enrichment [FE] = 6.6). Other ncRNA modules included PPP1CB-DT, which was segregated with myofibril assembly genes (FDR = 8.15 × 10^-8; FE = 47.5), while EEF1A1P24 and TMSB4XP8 were associated with vascular remodelling and angiogenesis genes (FDR = 2.77 × 10^-5; FE = 3.6). 
MYREM was positively associated with hypertrophy, and we established its myonuclear expression pattern in vivo in humans using spatial transcriptomics probes. We show that single-cell type associations of ncRNA are identifiable from bulk transcriptomic data and that hypertrophy-linked ncRNA genes appear to mediate their association with muscle growth via multiple cell types.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"57 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vicky Jung, Kenneth Lopez Perez, Lexin Chen, Kate Huddleston, Ramon Alain Miranda Quintana
{"title":"Efficient clustering of large molecular libraries","authors":"Vicky Jung, Kenneth Lopez Perez, Lexin Chen, Kate Huddleston, Ramon Alain Miranda Quintana","doi":"10.1101/2024.08.10.607459","DOIUrl":"https://doi.org/10.1101/2024.08.10.607459","url":null,"abstract":"The widespread use of Machine Learning (ML) techniques in chemical applications has come with the pressing need to analyze extremely large molecular libraries. In particular, clustering remains one of the most common tools to dissect the chemical space. Unfortunately, most current approaches present unfavorable time and memory scaling, which makes them unsuitable to handle million- and billion-sized sets. Here, we propose to bypass these problems with a time- and memory-efficient clustering algorithm, BitBIRCH. This method uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure O(N) time scaling. BitBIRCH leverages the instant similarity (iSIM) formalism to process binary fingerprints, allowing the use of Tanimoto similarity, and reducing memory requirements. Our tests show that BitBIRCH is already > 1,000 times faster than standard implementations of the Taylor-Butina clustering for libraries with 1,500,000 molecules. BitBIRCH increases efficiency without compromising the quality of the resulting clusters. 
We explore strategies to handle large sets, which we applied in the clustering of one billion molecules under 5 hours using a parallel/iterative BitBIRCH approximation.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steven M Foltz, Yize Li, Lijun Yao, Nadezhda V Terekhanova, Amila Weerasinghe, Qingsong Gao, Guanlan Dong, Moses Schindler, Song Cao, Hua Sun, Reyka G Jayasinghe, Robert S Fulton, Catrina C Fronick, Justin King, Daniel R Kohnen, Mark A Fiala, Ken Chen, John F DiPersio, Ravi Vij, Li Ding
{"title":"Somatic mutation phasing and haplotype extension using linked-reads in multiple myeloma","authors":"Steven M Foltz, Yize Li, Lijun Yao, Nadezhda V Terekhanova, Amila Weerasinghe, Qingsong Gao, Guanlan Dong, Moses Schindler, Song Cao, Hua Sun, Reyka G Jayasinghe, Robert S Fulton, Catrina C Fronick, Justin King, Daniel R Kohnen, Mark A Fiala, Ken Chen, John F DiPersio, Ravi Vij, Li Ding","doi":"10.1101/2024.08.09.607342","DOIUrl":"https://doi.org/10.1101/2024.08.09.607342","url":null,"abstract":"Somatic mutation phasing informs our understanding of cancer-related events, like driver mutations. We generated linked-read whole genome sequencing data for 23 samples across disease stages from 14 multiple myeloma (MM) patients and systematically assigned somatic mutations to haplotypes using linked-reads. Here, we report the reconstructed cancer haplotypes and phase blocks from several MM samples and show how phase block length can be extended by integrating samples from the same individual. We also uncover phasing information in genes frequently mutated in MM, including DIS3, HIST1H1E, KRAS, NRAS, and TP53, phasing 79.4% of 20,705 high-confidence somatic mutations. In some cases, this enabled us to interpret clonal evolution models at higher resolution using pairs of phased somatic mutations. For example, our analysis of one patient suggested that two NRAS hotspot mutations occurred on the same haplotype but were independent events in different subclones. 
Given sufficient tumor purity and data quality, our framework illustrates how haplotype-aware analysis of somatic mutations in cancer can be beneficial for some cancer cases.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling Fine-scale Spatial Structures and Amplifying Gene Expression Signals in Ultra-Large ST slices with HERGAST","authors":"Yuqiao Gong, Xin Yuan, Qiong Jiao, Zhangsheng Yu","doi":"10.1101/2024.08.09.607422","DOIUrl":"https://doi.org/10.1101/2024.08.09.607422","url":null,"abstract":"We propose HERGAST, a system for spatial structure identification and signal amplification in ultra-large-scale and ultra-high-resolution spatial transcriptomics data. To handle ultra-large ST data, we adopt a divide-and-conquer strategy and devise a Divide-Iterate-Conquer framework specifically for spatial transcriptomics data analysis, which can also be adopted by other computational methods to extend them to ultra-large-scale ST data analysis. To tackle the potential oversmoothing problem arising from data splitting, we construct a heterogeneous graph network to incorporate both local and global spatial relationships. In simulation, HERGAST consistently outperformed other methods across all settings, with an average gain of more than 10%. In real-world data, HERGAST's high-precision spatial clustering enabled the identification of SPP1+ macrophages intermingled in tumors in colorectal cancer, while the enhanced gene expression signal enabled the discovery of unique spatial expression patterns of key genes in breast cancer.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"57 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AllerTrans: An Improved Protein Allergenicity Prediction Model Using Deep Learning","authors":"Faezeh Sarlakifar, Hamed Malek, Najaf Allahyari Fard, Zahra Khotanlou","doi":"10.1101/2024.08.09.607419","DOIUrl":"https://doi.org/10.1101/2024.08.09.607419","url":null,"abstract":"Recognizing the potential allergenicity of proteins is essential for ensuring their safety. Allergens are a major concern in determining protein safety, especially with the increasing use of recombinant proteins in new medical products. These proteins need careful allergenicity assessment to guarantee their safety. However, traditional laboratory testing for allergenicity is expensive and time-consuming. To address this challenge, bioinformatics offers efficient and cost-effective alternatives for predicting protein allergenicity. In this study, we developed an enhanced deep-learning model to predict the potential allergenicity of proteins based on their primary structure represented as protein sequences. Our approach utilizes two protein language models to extract distinct feature vectors for each sequence, which are then input into a deep neural network model for classification. Each feature vector represents a specific aspect of the protein sequence, and combining them enhances the final result and balances the model's sensitivity and specificity. The model classifies proteins into allergenic or non-allergenic classes. 
Our proposed model demonstrates admissible improvement across all evaluation metrics compared to the AlgPred 2.0 model, achieving a sensitivity of 97.91%, specificity of 97.69%, accuracy of 97.80%, and an impressive area under the ROC curve of 99% on the AlgPred 2.0 dataset using standard five-fold cross-validation.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenbin Jiang, Yueting Xiong, Jin Xiao, Jingyi Wang, Zhenjian Jiang, Ling Luo, Quan Yuan, Ningshao Xia, Rongshan Yu
{"title":"Comprehensive assembly of monoclonal and mixed antibody sequences","authors":"Wenbin Jiang, Yueting Xiong, Jin Xiao, Jingyi Wang, Zhenjian Jiang, Ling Luo, Quan Yuan, Ningshao Xia, Rongshan Yu","doi":"10.1101/2024.08.09.607415","DOIUrl":"https://doi.org/10.1101/2024.08.09.607415","url":null,"abstract":"The elucidation of antibody sequence information is crucial for understanding antigen binding and advancing therapeutic and research applications. However, complete de novo assembly of monoclonal antibody sequences remains challenging due to accuracy and robustness limitations. To address this issue, we introduce Fusion, an innovative de novo assembler that integrates overlapping peptides and template information into complete sequences using a beam search strategy. We demonstrate Fusion's performance by reconstructing multiple human and murine antibodies with highest accuracy (100% and over 99%, respectively). Biological validation of the recombinantly expressed AFS98 antibody with unknown sequences further supports its effectiveness. Furthermore, current methods are applicable only to traditional monoclonal antibody sequencing assembly, presenting a significant bottleneck in achieving higher throughput. In contrast, Fusion can assemble peptide sequences from mixtures of two or three monoclonal antibodies into complete individual sequences with the same accuracy as traditional sequencing, significantly enhancing throughput. To our knowledge, this is the first study enabling high-throughput sequencing of multiple antibodies using only bottom-up mass spectrometry. The duration, expense, and reagent consumption of mass spectrometry detection are comparable to those required for sequencing a single monoclonal antibody. 
In summary, Fusion's superior performance in handling complex antibody sequencing represents a significant advancement in antibody research.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"130 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141934759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}