Qianqian Song, Taobo Hu, Baosheng Liang, Shihai Li, Yang Li, Jinbo Wu, Shu Wang, Xiaohua Zhou
{"title":"cascAGS: Comparative Analysis of SNP Calling Methods for Human Genome Data in the Absence of Gold Standard.","authors":"Qianqian Song, Taobo Hu, Baosheng Liang, Shihai Li, Yang Li, Jinbo Wu, Shu Wang, Xiaohua Zhou","doi":"10.1007/s12539-024-00653-8","DOIUrl":"https://doi.org/10.1007/s12539-024-00653-8","url":null,"abstract":"The development of third-generation sequencing has accelerated the boom of single nucleotide polymorphism (SNP) calling methods, but evaluating accuracy remains challenging owing to the absence of the SNP gold standard. Definitions of the without-gold-standard setting and of performance metrics, together with methods for estimating them, are urgently needed. Additionally, the possible correlations between different SNP loci should also be further explored. To address these challenges, we first introduced the concepts of a gold standard and an imperfect gold standard under the consistency framework and gave the corresponding definitions of sensitivity and specificity. A latent class model (LCM) was established to estimate the sensitivity and specificity of callers. Furthermore, we incorporated different dependency structures into LCM to investigate their impact on sensitivity and specificity. The performance of LCM was illustrated by comparing the accuracy of BCFtools, DeepVariant, FreeBayes, and GATK on various datasets. Through estimations across multiple datasets, the results indicate that LCM is well suited for evaluating callers without the SNP gold standard, and accurate inclusion of the dependency between variations is crucial for better performance ranking. DeepVariant has a higher sum of sensitivity and specificity than other callers, followed by GATK and BCFtools. FreeBayes has low sensitivity but high specificity. Notably, appropriate sequencing coverage is another important factor for precise evaluation of callers. Most importantly, a web interface for assessing and comparing different callers was developed to simplify the evaluation process.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142499766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hatice Busra Luleci, Selcen Ari Yuka, Alper Yilmaz
{"title":"Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.","authors":"Hatice Busra Luleci, Selcen Ari Yuka, Alper Yilmaz","doi":"10.1007/s12539-024-00659-2","DOIUrl":"https://doi.org/10.1007/s12539-024-00659-2","url":null,"abstract":"k-mer frequencies are crucial for understanding DNA sequence patterns and structure, with applications in motif discovery, genome classification, and short read assembly. However, the exponential increase in the dimension of frequency tables with increasing k-mer length poses storage challenges. In this study, we present a novel method for compressing k-mer data without information loss, aiming to optimize storage and analysis processes. We employed Chaos Game Representation (CGR) to map k-mers to coordinates and used these components to generate raster images of k-mers. The CGR maps were partitioned and labeled based on substrings, with each substring mapped to a subframe, creating a fractal-like structure. The entire k-mer frequency set of each genomic sequence was represented as a single image, with each pixel corresponding to a specific k-mer and its occurrence. This approach reduced file size by up to 16-fold compared to plain text and 3-fold compared to binary format. Furthermore, we demonstrated the feasibility of performing alignment-free similarity analyses on images derived from k-mer frequencies of whole genome sequences from 14 plant species. Our results highlight the potential of this method as a fast and efficient tool for accessing, processing, and analyzing large biological sequence datasets, enabling the retrieval of k-mer frequencies and image reconstruction.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142464357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cell Fate Dynamics Reconstruction Identifies TPT1 and PTPRZ1 Feedback Loops as Master Regulators of Differentiation in Pediatric Glioblastoma-Immune Cell Networks.","authors":"Abicumaran Uthamacumaran","doi":"10.1007/s12539-024-00657-4","DOIUrl":"https://doi.org/10.1007/s12539-024-00657-4","url":null,"abstract":"Pediatric glioblastoma is a complex dynamical disease that is difficult to treat due to its multiple adaptive behaviors driven largely by phenotypic plasticity. Integrated data science and network theory pipelines offer novel approaches to studying glioblastoma cell fate dynamics, particularly phenotypic transitions over time. Here we used various single-cell trajectory inference algorithms to infer signaling dynamics regulating pediatric glioblastoma-immune cell networks. We identified GATA2, PTPRZ1, TPT1, MTRNR2L1/2, OLIG1/2, SOX11, FXYD6, SEZ6L, PDGFRA, EGFR, S100B, WNT, TNF-α, and NF-kB as critical transition genes or signals regulating glioblastoma-immune network dynamics, revealing potential clinically relevant targets. Further, we reconstructed glioblastoma cell fate attractors and found complex bifurcation dynamics within glioblastoma phenotypic transitions, suggesting that a causal pattern may be driving glioblastoma evolution and cell fate decision-making. Together, our findings have implications for developing targeted therapies against glioblastoma, and the continued integration of quantitative approaches and artificial intelligence (AI) to understand pediatric glioblastoma tumor-immune interactions.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142464356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"misORFPred: A Novel Method to Mine Translatable sORFs in Plant Pri-miRNAs Using Enhanced Scalable k-mer and Dynamic Ensemble Voting Strategy.","authors":"Haibin Li, Jun Meng, Zhaowei Wang, Yushi Luan","doi":"10.1007/s12539-024-00661-8","DOIUrl":"https://doi.org/10.1007/s12539-024-00661-8","url":null,"abstract":"The primary microRNAs (pri-miRNAs) have been observed to contain translatable small open reading frames (sORFs) that can encode peptides as an independent element. Relevant studies have proven that these sORFs are significant in regulating the expression of biological traits. The existing methods for predicting the coding potential of sORFs frequently overlook these data or categorize them as negative samples, impeding the identification of additional translatable sORFs in pri-miRNAs. In light of this, a novel method named misORFPred has been proposed. Specifically, an enhanced scalable k-mer (ESKmer) that simultaneously integrates the composition information within a sequence and distance information between sequences is designed to extract the nucleotide sequence features. After feature selection, the optimal features and several machine learning classifiers are combined to construct the ensemble model, where a newly devised dynamic ensemble voting strategy (DEVS) is proposed to dynamically adjust the weights of base classifiers and adaptively select the optimal base classifiers for each unlabeled sample. Cross-validation results suggest that ESKmer and DEVS are essential for this classification task and could boost model performance. Independent testing results indicate that misORFPred outperforms the state-of-the-art methods. Furthermore, we run misORFPred on the genomes of various plant species and perform a thorough analysis of the predicted outcomes. Taken together, misORFPred is a powerful tool for identifying the translatable sORFs in plant pri-miRNAs and can provide highly trusted candidates for subsequent biological experiments.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142464358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Plant lncRNA-miRNA Interaction Prediction Based on Counterfactual Heterogeneous Graph Attention Network.","authors":"Yu He, ZiLan Ning, XingHui Zhu, YinQiong Zhang, ChunHai Liu, SiWei Jiang, ZheMing Yuan, HongYan Zhang","doi":"10.1007/s12539-024-00652-9","DOIUrl":"https://doi.org/10.1007/s12539-024-00652-9","url":null,"abstract":"Identifying interactions between long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) provides a new perspective for understanding regulatory relationships in plant life processes. Recently, computational methods based on graph neural networks (GNNs) have been widely employed to predict lncRNA-miRNA interactions (LMIs), which compensate for the inadequacy of biological experiments. However, the low semantic content and noise of the graph limit the performance of existing GNN-based methods. In this paper, we develop a novel Counterfactual Heterogeneous Graph Attention Network (CFHAN) to improve robustness against noise and the prediction of plant LMIs. Firstly, we construct a lncRNA-miRNA (L-M) heterogeneous network based on real-world data. Secondly, CFHAN utilizes node-level attention, semantic-level attention, and counterfactual links to enhance node embedding learning. Finally, these embeddings are used as inputs to a Multilayer Perceptron (MLP) to predict the interactions between lncRNAs and miRNAs. Evaluating our method on a benchmark dataset of plant LMIs, CFHAN outperforms five state-of-the-art methods, and achieves an average AUC and average ACC of 0.9953 and 0.9733, respectively. This demonstrates CFHAN's ability to predict plant LMIs and exhibits promising cross-species prediction ability, offering valuable insights for experimental LMI research.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Molecular Fragment Representation Learning Framework for Drug-Drug Interaction Prediction.","authors":"Jiaxi He, Yuping Sun, Jie Ling","doi":"10.1007/s12539-024-00658-3","DOIUrl":"https://doi.org/10.1007/s12539-024-00658-3","url":null,"abstract":"The concurrent use of multiple drugs may result in drug-drug interactions, increasing the risk of adverse reactions. Hence, it is particularly crucial to propose computational methods for precisely identifying unknown drug-drug interactions, which is of great significance for drug development and health. However, most recent studies have limited the drug-drug interaction prediction task to identifying interactions between substructures, overlooking molecular hierarchical information. Moreover, the extracted substructures in these methods are always restricted to have the same number of atoms as contained in the molecular graph, which does not align with real-world facts. In this study, a molecular fragment representation learning framework for drug-drug interaction prediction is introduced. Initially, a fragment extraction module is designed to acquire a series of molecular fragments. Subsequently, to capture more comprehensive features, molecular hierarchical information is effectively integrated, enabling drug-drug interaction prediction by identifying pairwise interactions between molecular fragments of each drug. Comprehensive evaluations demonstrate that the proposed method achieved state-of-the-art performance on both the DrugBank and Twosides datasets, particularly achieving an improved accuracy of over 20% for unseen drugs in both datasets. Furthermore, case studies and visual analysis confirm that the proposed method can accurately identify crucial substructures influencing the interactions, which are basically consistent with functional group structures in reality. In conclusion, this method not only enhances the performance of drug-drug interaction prediction but also offers high interpretability. Source code is freely available at https://github.com/kennysyp/MFR-DDI .","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI Prediction of Structural Stability of Nanoproteins Based on Structures and Residue Properties by Mean Pooled Dual Graph Convolutional Network.","authors":"Daixi Li, Yuqi Zhu, Wujie Zhang, Jing Liu, Xiaochen Yang, Zhihong Liu, Dongqing Wei","doi":"10.1007/s12539-024-00662-7","DOIUrl":"https://doi.org/10.1007/s12539-024-00662-7","url":null,"abstract":"The structural stability of proteins is an important topic in various fields such as biotechnology, pharmaceuticals, and enzymology. Specifically, understanding the structural stability of proteins is crucial for protein design. Artificial design, while pursuing high thermodynamic stability and rigidity of proteins, inevitably sacrifices biological functions closely related to protein flexibility. The highest thermodynamic stability is not always optimal for proteins to perform their biological functions. Extensive theoretical and experimental screening is often required to obtain stable protein structures. Thus, it becomes critically important to develop a stability prediction model based on the balance between protein stability and bioactivity. To design protein drugs with better functionality in a broader structural space, a novel protein structural stability predictor called PSSP has been developed in this study. PSSP is a mean pooled dual graph convolutional network (GCN) model based on sequence characteristics and secondary structure, distance matrix, graph, and residue properties of a nanoprotein to provide rapid prediction and judgment. This model exhibits excellent robustness in predicting the structural stability of nanoproteins. Compared with previous artificial intelligence algorithms, the results indicate that this model can provide a rapid and accurate assessment of the structural stability of artificially designed proteins, which shows great promise for promoting the robust development of protein design.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142377868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minghui Du, Yuxiang Ren, Yang Zhang, Wenwen Li, Hongtao Yang, Huiying Chu, Yongshan Zhao
{"title":"CSEL-BGC: A Bioinformatics Framework Integrating Machine Learning for Defining the Biosynthetic Evolutionary Landscape of Uncharacterized Antibacterial Natural Products.","authors":"Minghui Du, Yuxiang Ren, Yang Zhang, Wenwen Li, Hongtao Yang, Huiying Chu, Yongshan Zhao","doi":"10.1007/s12539-024-00656-5","DOIUrl":"https://doi.org/10.1007/s12539-024-00656-5","url":null,"abstract":"The sluggish pace of new antibacterial drug development reflects a vulnerability in the face of the current severe threat posed by bacterial resistance. Microbial natural products (NPs), as a reservoir of immense chemical potential, have emerged as the most promising avenue for the discovery of next-generation antibacterial agents. Directly accessing the antibacterial activity of potential products derived from biosynthetic gene clusters (BGCs) would significantly expedite the process. To tackle this issue, we propose a CSEL-BGC framework that integrates machine learning (ML) techniques. This framework involves the development of a novel cascade-stacking ensemble learning (CSEL) model and the establishment of a groundbreaking model evaluation system. Based on this framework, we predict 6,666 BGCs with antibacterial activity from 3,468 complete bacterial genomes and elucidate a biosynthetic evolutionary landscape to reveal their antibacterial potential. This provides crucial insights for interpreting the synthesis and secretion mechanisms of unknown NPs.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142346017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"scCrab: A Reference-Guided Cancer Cell Identification Method based on Bayesian Neural Networks.","authors":"Heyang Hua, Wenxin Long, Yan Pan, Siyu Li, Jianyu Zhou, Haixin Wang, Shengquan Chen","doi":"10.1007/s12539-024-00655-6","DOIUrl":"https://doi.org/10.1007/s12539-024-00655-6","url":null,"abstract":"Cancer is a significant global public health concern, where early detection can greatly enhance curative outcomes. Therefore, the identification of cancer cells holds significant importance as the primary method for cancer diagnosis. The advancement of single-cell RNA sequencing (scRNA-seq) technology has made it possible to address the problem of cancer cell identification at the single-cell level more efficiently with computational methods, as opposed to the time-consuming and less reproducible manual identification methods. However, existing computational methods have shown suboptimal identification performance and a lack of capability to incorporate external reference data as prior information. Here, we propose scCrab, a reference-guided automatic cancer cell identification method, which performs ensemble learning based on a Bayesian neural network (BNN) with multi-head self-attention mechanisms and a linear regression model. Through a series of experiments on various datasets, we systematically validated the superior performance of scCrab in both intra- and inter-dataset predictions. Besides, we demonstrated the robustness of scCrab to dropout rate and sample size, and conducted ablation experiments to investigate the contributions of each component in scCrab. Furthermore, as a dedicated model for cancer cell identification, scCrab effectively captures cancer-related biological significance during the identification process.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142346020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empowering Graph Neural Network-Based Computational Drug Repositioning with Large Language Model-Inferred Knowledge Representation.","authors":"Yaowen Gu, Zidu Xu, Carl Yang","doi":"10.1007/s12539-024-00654-7","DOIUrl":"https://doi.org/10.1007/s12539-024-00654-7","url":null,"abstract":"Computational drug repositioning, through predicting drug-disease associations (DDA), offers significant potential for discovering new drug indications. Current methods incorporate graph neural networks (GNN) on drug-disease heterogeneous networks to predict DDAs, achieving notable performance compared to traditional machine learning and matrix factorization approaches. However, these methods depend heavily on network topology, hampered by incomplete and noisy network data, and overlook the wealth of biomedical knowledge available. Correspondingly, large language models (LLMs) excel in graph search and relational reasoning, which can possibly enhance the integration of comprehensive biomedical knowledge into drug and disease profiles. In this study, we first investigate the contribution of LLM-inferred knowledge representation in drug repositioning and DDA prediction. A zero-shot prompting template was designed for LLM to extract high-quality knowledge descriptions for drug and disease entities, followed by embedding generation from language models to transform the discrete text into a continuous numerical representation. Then, we proposed LLM-DDA with three different model architectures (LLM-DDA<sub>Node Feat</sub>, LLM-DDA<sub>Dual GNN</sub>, LLM-DDA<sub>GNN-AE</sub>) to investigate the best fusion mode for LLM-based embeddings. Extensive experiments on four DDA benchmarks show that LLM-DDA<sub>GNN-AE</sub> achieved the optimal performance compared to 11 baselines with the overall relative improvement in AUPR of 23.22%, F1-Score of 17.20%, and precision of 25.35%. Meanwhile, selected case studies involving Prednisone and Allergic Rhinitis highlighted the model's capability to identify reliable DDAs and knowledge descriptions, supported by existing literature. This study showcases the utility of LLMs in drug repositioning with its generality and applicability in other biomedical relation prediction tasks.","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142346018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}