{"title":"Perfect collinearity not created equal: measuring and visualizing the severity of multi-collinearity of modern omics data.","authors":"Wei Q Deng, Radu V Craiu, Lei Sun","doi":"10.1515/sagmb-2025-0043","DOIUrl":"10.1515/sagmb-2025-0043","url":null,"abstract":"<p><p>Multi-collinearity frequently occurs in modern statistical applications and when ignored, can negatively impact model selection and statistical inference. Though perfect collinearity is always present in \"<i>n</i> < <i>p</i>\" data, we demonstrate that perfect collinearity arises differently, from diverse data redundancy patterns and/or data dimensions. Classic tools and measures that were developed for \"<i>n</i> > <i>p</i>\" data cannot be used to distinguish or visualize these patterns in the high-dimensional regime. Here we propose 1) new individualized measures that can be used to visualize patterns of perfect collinearity, and subsequently 2) global measures to assess the overall burden of multi-collinearity irrespective of data dimensions. We applied these measures to the human X chromosome data to understand similarity and differences in linkage disequilibrium structure due to sex and genetic features. The measures can highlight gene regions of excessive multi-collinearity and contrast the severity of perfect collinearity between different sexes. 
The utility of these measures for high-dimensional statistical applications is also discussed.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"24 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12909097/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146208306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI-driven risk prediction and categorization in cystic fibrosis leveraging AttentiveLSTM and Fox Wolf Optimizer.","authors":"Ashwini A Pandagale, Lalit V Patil","doi":"10.1515/sagmb-2025-0042","DOIUrl":"https://doi.org/10.1515/sagmb-2025-0042","url":null,"abstract":"<p><p>Cystic fibrosis (CF), a genetic disorder stemming from CFTR gene mutations, requires accurate risk prediction to improve management. Modulator therapies have advanced treatment but remain limited, as they don't cover all gene variants and face accessibility issues. To address these challenges, a novel Cystic Fibrosis Risk Prediction Framework (CGRPF) is proposed. CGRPF utilizes mean imputation for missing data, the Fox Wolf Optimizer (FWO) for effective feature selection, and an AttentiveLSTM to capture temporal patterns in time-series data, aiding chronic disease prediction. Fully connected layers and a softmax layer enhance model performance and ensure calibrated classification into high, medium, and low-risk categories. Tested on the CFTR-2 dataset, CGRPF achieved strong performance metrics - 97 % accuracy, 91 % precision, 97 % recall, 93 % F1-score, outperforming state-of-the-art models in CF risk prediction.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"25 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146208249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corrigendum to: Choice of baseline hazards in joint modeling of longitudinal and time-to-event cancer survival data.","authors":"Anand Hari, Edakkalathoor George Jinto, Divya Dennis, Kumarapillai Mohanan Nair Jagathnath Krishna, Preethi Sara George, Roshni Sivasevan, Aleyamma Mathew","doi":"10.1515/sagmb-2025-0073","DOIUrl":"10.1515/sagmb-2025-0073","url":null,"abstract":"","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"24 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145514911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast (CNN + MCWS-transformer) based architecture for protein function prediction.","authors":"Abhipsa Mahala, Ashish Ranjan, Rojalina Priyadarshini, Raj Vikram, Prabhat Dansena","doi":"10.1515/sagmb-2024-0027","DOIUrl":"https://doi.org/10.1515/sagmb-2024-0027","url":null,"abstract":"<p><p>The transformer model for sequence mining has brought a paradigmatic shift to many domains, including biological sequence mining. However, transformers suffer from quadratic complexity, i.e., O(<i>l</i> <sup>2</sup>), where <i>l</i> is the sequence length, which affects the training and prediction time. Therefore, the work herein introduces a simple, generalized, and fast transformer architecture for improved protein function prediction. The proposed architecture uses a combination of CNN and global-average pooling to effectively shorten the protein sequences. The shortening process helps reduce the quadratic complexity of the transformer, resulting in a complexity of O((<i>l</i>/2)<sup>2</sup>). This architecture is utilized to develop a PFP solution at the sub-sequence level. Furthermore, focal loss is employed to ensure balanced training for the hard-classified examples. The multi sub-sequence-based proposed solution utilizing an average-pooling layer (with stride = 2) produced improvements of +2.50 % (BP) and +3.00 % (MF) when compared to Global-ProtEnc Plus. 
The corresponding improvements when compared to the Lite-SeqCNN are: +4.50 % (BP) and +2.30 % (MF).</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"24 1","pages":""},"PeriodicalIF":0.9,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144530716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirically adjusted fixed-effects meta-analysis methods in genomic studies.","authors":"Wimarsha T Jayanetti, Sinjini Sikdar","doi":"10.1515/sagmb-2023-0041","DOIUrl":"10.1515/sagmb-2023-0041","url":null,"abstract":"<p><p>In recent years, meta-analyzing summary results from multiple studies has become a common practice in genomic research, leading to a significant improvement in the power of statistical detection compared to an individual genomic study. Meta-analysis methods that combine statistical estimates across studies are known to be statistically more powerful than those combining statistical significance measures. An approach combining effect size estimates based on a fixed-effects model, called METAL, has become extremely popular for performing the former type of meta-analysis. In this article, we discuss the limitations of METAL due to its dependence on the theoretical null distribution, leading to incorrect significance testing results. Through various simulation studies and a real genomic data application, we show how modifying the <i>z</i>-scores in METAL, using an empirical null distribution, can significantly improve the results, especially in the presence of hidden confounders. For the estimation of the null distribution, we consider two different approaches, and we highlight the scenarios when one null estimation approach outperforms the other. 
This article will allow researchers to gain an insight into the importance of using an empirical null distribution in the fixed-effects meta-analysis as well as in choosing the appropriate empirical null distribution estimation approach.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"23 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142331020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A CNN-CBAM-BIGRU model for protein function prediction.","authors":"Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy","doi":"10.1515/sagmb-2024-0004","DOIUrl":"10.1515/sagmb-2024-0004","url":null,"abstract":"<p><p>Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics. Traditionally, this challenge has proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool, achieving significant success in protein function prediction. Their strength lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model: CNN-CBAM+BiGRU. It incorporates a Convolutional Block Attention Module (CBAM) alongside BiGRUs. CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data, leading to more accurate feature extraction. BiGRUs, a type of Recurrent Neural Network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model integrates the strengths of both CNN-CBAM and BiGRU. This study's findings, validated through experimentation, showcase the effectiveness of this combined approach. For the human dataset, the suggested method outperforms the CNN-BIGRU+ATT model by +1.0 % for cellular components, +1.1 % for molecular functions, and +0.5 % for biological processes. 
For the yeast dataset, the suggested method outperforms the CNN-BIGRU+ATT model by +2.4 % for the cellular component, +1.2 % for molecular functions, and +0.6 % for biological processes.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"23 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141471963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A heavy-tailed model for analyzing miRNA-seq raw read counts.","authors":"Annika Krutto, Therese Haugdahl Nøst, Magne Thoresen","doi":"10.1515/sagmb-2023-0016","DOIUrl":"10.1515/sagmb-2023-0016","url":null,"abstract":"<p><p>This article addresses the limitations of existing statistical models in analyzing and interpreting highly skewed miRNA-seq raw read count data that can range from zero to millions. A heavy-tailed model using discrete stable distributions is proposed as a novel approach to better capture the heterogeneity and extreme values commonly observed in miRNA-seq data. Additionally, the parameters of the discrete stable distribution are proposed as an alternative target for differential expression analysis. An R package for computing and estimating the discrete stable distribution is provided. The proposed model is applied to miRNA-seq raw counts from the Norwegian Women and Cancer Study (NOWAC) and the Cancer Genome Atlas (TCGA) databases. The goodness-of-fit is compared with the popular Poisson and negative binomial distributions, and the discrete stable distributions are found to give a better fit for both datasets. In conclusion, the use of discrete stable distributions is shown to potentially lead to more accurate modeling of the underlying biological processes.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"23 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141176757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible model-based non-negative matrix factorization with application to mutational signatures.","authors":"Ragnhild Laursen, Lasse Maretty, Asger Hobolth","doi":"10.1515/sagmb-2023-0034","DOIUrl":"10.1515/sagmb-2023-0034","url":null,"abstract":"<p><p>Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. Our novel estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. 
We illustrate our framework in a large simulation study where we compare to state-of-the-art methods, and show results for three data sets of somatic mutation counts from patients with cancer in the breast, liver and urinary tract.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"23 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140945949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Choice of baseline hazards in joint modeling of longitudinal and time-to-event cancer survival data.","authors":"Anand Hari, Edakkalathoor George Jinto, Divya Dennis, Kumarapillai Mohanan Nair Jagathnath Krishna, Preethi S George, Sivasevan Roshni, Aleyamma Mathew","doi":"10.1515/sagmb-2023-0038","DOIUrl":"10.1515/sagmb-2023-0038","url":null,"abstract":"<p><p>Longitudinal time-to-event analysis is a statistical method to analyze data where covariates are measured repeatedly. In survival studies, the risk for an event is estimated using the Cox proportional hazards model or the extended Cox model for exogenous time-dependent covariates. However, these models are inappropriate for endogenous time-dependent covariates like longitudinally measured biomarkers such as carcinoembryonic antigen (CEA). Joint models that can simultaneously model the longitudinal covariates and time-to-event data have been proposed as an alternative. The present study highlights the importance of choosing the baseline hazards to get more accurate risk estimation. The study used colon cancer patient data to illustrate and compare four different joint models that differ in the choice of baseline hazards [piecewise-constant Gauss-Hermite (GH), piecewise-constant pseudo-adaptive GH, Weibull Accelerated Failure time model with GH and B-spline GH]. We conducted a simulation study to assess the model consistency with varying sample size (<i>N</i> = 100, 250, 500) and censoring (20 %, 50 %, 70 %) proportions. In colon cancer patient data, based on Akaike information criteria (AIC) and Bayesian information criteria (BIC), piecewise-constant pseudo-adaptive GH was found to be the best-fitting model. Despite differences in model fit, the hazards obtained from the four models were similar. The study identified composite stage as a prognostic factor for time-to-event and the longitudinal outcome, CEA as a dynamic predictor for overall survival in colon cancer patients. 
Based on the simulation study, Piecewise-PH-aGH was found to be the best model, with the lowest AIC and BIC values and the highest coverage probability (CP). The bias and RMSE for all the models showed competitive performance; however, Piecewise-PH-aGH showed the least bias and RMSE in most of the combinations and took the shortest computation time, which shows its computational efficiency. This study is the first of its kind to discuss the choice of baseline hazards.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"23 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140913025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the feasibility of statistical inference using synthetic antibody-antigen datasets.","authors":"Thomas Minotto, Philippe A Robert, Ingrid Hobæk Haff, Geir K Sandve","doi":"10.1515/sagmb-2023-0027","DOIUrl":"10.1515/sagmb-2023-0027","url":null,"abstract":"<p><p>Simulation frameworks are useful to stress-test predictive models when data is scarce, or to assert model sensitivity to specific data distributions. Such frameworks often need to recapitulate several layers of data complexity, including emergent properties that arise implicitly from the interaction between simulation components. Antibody-antigen binding is a complex mechanism by which an antibody sequence wraps itself around an antigen with high affinity. In this study, we use a synthetic simulation framework for antibody-antigen folding and binding on a 3D lattice that includes full details on the spatial conformation of both molecules. We investigate how emergent properties arise in this framework, in particular the physical proximity of amino acids, their presence on the binding interface, or the binding status of a sequence, and relate that to the individual and pairwise contributions of amino acids in statistical models for binding prediction. We show that weights learnt from a simple logistic regression model align with some but not all features of amino acids involved in the binding, and that predictive sequence binding patterns can be enriched. 
In particular, main effects correlated with the capacity of a sequence to bind any antigen, while statistical interactions were related to sequence specificity.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"23 1","pages":""},"PeriodicalIF":0.4,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140337377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}