{"title":"Bi-SeqCNN: A Novel Light-Weight Bi-Directional CNN Architecture for Protein Function Prediction","authors":"Vikash Kumar;Akshay Deepak;Ashish Ranjan;Aravind Prakash","doi":"10.1109/TCBB.2024.3426491","DOIUrl":"10.1109/TCBB.2024.3426491","url":null,"abstract":"Deep learning approaches, such as convolutional neural networks (CNNs) and deep recurrent neural networks (RNNs), have been the backbone for predicting protein function, with promising state-of-the-art (SOTA) results. RNNs offer a strong sequential processing mechanism through their in-built ability to (i) focus on past information, (ii) collect both short- and long-range dependency information, and (iii) process sequences bi-directionally. CNNs, however, are confined to focusing on short-term information from both the past and the future, although they offer parallelism. Therefore, a novel bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs is introduced and used to develop a protein function prediction framework, Bi-SeqCNN. This is a sub-sequence-based framework. Further, Bi-SeqCNN$^+$ is an ensemble approach that improves the prediction results. To our knowledge, this is the first time bi-directional CNNs have been employed for general temporal data analysis and not just for protein sequences. The proposed architecture produces improvements of up to +5.5% over contemporary SOTA methods on three benchmark protein sequence datasets. 
Moreover, it is substantially lighter and attains these results with 0.50–0.70 times fewer parameters than the SOTA methods.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1922-1933"},"PeriodicalIF":3.6,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141590210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wentao Zhu;Zhiqiang Du;Ziang Xu;Defu Yang;Minghan Chen;Qianqian Song
{"title":"SCRN: Single-Cell Gene Regulatory Network Identification in Alzheimer's Disease","authors":"Wentao Zhu;Zhiqiang Du;Ziang Xu;Defu Yang;Minghan Chen;Qianqian Song","doi":"10.1109/TCBB.2024.3424400","DOIUrl":"10.1109/TCBB.2024.3424400","url":null,"abstract":"Alzheimer's disease (AD) is the most common neurodegenerative disease, and it consumes considerable medical resources with an increasing number of patients every year. Mounting evidence shows that regulatory disruptions altering the intrinsic activity of genes in brain cells contribute to AD pathogenesis. To gain insights into the underlying gene regulation in AD, we proposed a graph learning method, Single-Cell based Regulatory Network (SCRN), to identify the regulatory mechanisms based on single-cell data. SCRN implements the γ-decaying heuristic link prediction based on graph neural networks and can identify reliable gene regulatory networks using locally closed subgraphs. In this work, we first performed UMAP dimension reduction analysis on single-cell RNA sequencing (scRNA-seq) data of AD and normal samples. Then we used SCRN to construct the gene regulatory network based on three well-recognized AD genes (APOE, CX3CR1, and P2RY12). Enrichment analysis of the regulatory network revealed significant pathways including NGF signaling, ERBB2 signaling, and hemostasis. 
These findings demonstrate the feasibility of using SCRN to uncover potential biomarkers and therapeutic targets related to AD.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1886-1896"},"PeriodicalIF":3.6,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141558630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marzieh Emadi;Farsad Zamani Boroujeni;Jamshid Pirgazi
{"title":"Improved Fuzzy Cognitive Maps for Gene Regulatory Networks Inference Based on Time Series Data","authors":"Marzieh Emadi;Farsad Zamani Boroujeni;Jamshid Pirgazi","doi":"10.1109/TCBB.2024.3423383","DOIUrl":"10.1109/TCBB.2024.3423383","url":null,"abstract":"Microarray data provide a wealth of information about gene expression levels. Because of the large volume of such data, their analysis requires adequate computational methods for identifying and analyzing gene regulatory networks; however, researchers in this field face numerous challenges, such as the large number of genes to consider alongside the limited number of samples and the noisy nature of the data. In this paper, a hybrid method based on fuzzy cognitive maps and compressed sensing is used to identify interactions between genes. To infer the gene regulatory network, Ensemble Kalman filtered compressed sensing is used to learn the fuzzy cognitive map. Combining the Ensemble Kalman filter with compressed sensing makes the fuzzy cognitive map robust against noise. The proposed algorithm is evaluated using several metrics and compared with several well-known methods such as LASSOFCM, KFRegular, and CMI2NI. 
The experimental results show that the proposed method outperforms methods proposed in recent years in terms of SSmean, Data Error and accuracy.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1816-1829"},"PeriodicalIF":3.6,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141534365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Zhang, Junyong Zhu, Sheng Wang, Jie Hou, Dong Si, Renzhi Cao
{"title":"AnglesRefine: Refinement of 3D Protein Structures Using Transformer Based on Torsion Angles.","authors":"Lei Zhang, Junyong Zhu, Sheng Wang, Jie Hou, Dong Si, Renzhi Cao","doi":"10.1109/TCBB.2024.3422288","DOIUrl":"10.1109/TCBB.2024.3422288","url":null,"abstract":"The goal of protein structure refinement is to enhance the precision of predicted protein models, particularly at the residue level of the local structure. Existing refinement approaches primarily rely on physics, whereas molecular simulation methods are resource-intensive and time-consuming. In this study, we employ deep learning methods to extract structural constraints from protein structure residues to assist in protein structure refinement. We introduce a novel method, AnglesRefine, which focuses on a protein's secondary structure and employs a transformer to refine various protein structure angles (psi, phi, omega, CA_C_N_angle, C_N_CA_angle, N_CA_C_angle), ultimately generating a superior protein model based on the refined angles. We evaluate our approach against other cutting-edge methods using the CASP11-14 and CASP15 datasets. Experimental outcomes indicate that our method generally surpasses other techniques on the CASP11-14 test dataset, while performing comparably or marginally better on the CASP15 test dataset. Our method consistently demonstrates the least likelihood of model quality degradation; for example, its degradation percentage is less than 10%, whereas that of other methods is about 50%. 
Furthermore, as our approach eliminates the need for conformational search and sampling, it significantly reduces computational time compared to existing refinement methods.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"PP ","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141497925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hailin Feng;Chenchen Ke;Quan Zou;Zhechen Zhu;Tongcun Liu
{"title":"Prediction of Potential miRNA-Disease Associations Based on a Masked Graph Autoencoder","authors":"Hailin Feng;Chenchen Ke;Quan Zou;Zhechen Zhu;Tongcun Liu","doi":"10.1109/TCBB.2024.3421924","DOIUrl":"10.1109/TCBB.2024.3421924","url":null,"abstract":"Biomedical evidence has demonstrated the relevance of microRNA (miRNA) dysregulation in complex human diseases, and determining the relationship between miRNAs and diseases can aid in the early detection and prevention of diseases. Traditional biological experimental methods have the disadvantages of high cost and low efficiency, which are well compensated by computational methods. However, many computational methods focus excessively on neighbor relationships, ignore the structural information of the graph, and overlook the redundant information in the graph structure. This study proposed a computational model based on a graph-masking autoencoder named MGAEMDA. MGAEMDA is an asymmetric framework in which the encoder maps partially observed graphs into latent representations. The decoder reconstructs the masked structural information based on the edge and node levels and combines it with linear matrices to obtain the result. The empirical results on the two datasets reveal that the MGAEMDA model performs better than its counterparts. 
We also demonstrated the predictive performance of MGAEMDA using a case study of four diseases, and all the top 30 predicted miRNAs were validated in the database, providing further evidence of the excellent performance of the model.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1874-1885"},"PeriodicalIF":3.6,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141491810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guangyu Wang;Ying Chu;Qianqian Wang;Limei Zhang;Lishan Qiao;Mingxia Liu
{"title":"Graph Convolutional Network With Self-Supervised Learning for Brain Disease Classification","authors":"Guangyu Wang;Ying Chu;Qianqian Wang;Limei Zhang;Lishan Qiao;Mingxia Liu","doi":"10.1109/TCBB.2024.3422152","DOIUrl":"10.1109/TCBB.2024.3422152","url":null,"abstract":"Brain functional network (BFN) analysis has become a popular method for identifying neurological diseases at their early stages and revealing sensitive biomarkers related to these diseases. Due to the fact that BFN is a graph with complex structure, graph convolutional networks (GCNs) can be naturally used in the identification of BFN, and can generally achieve an encouraging performance if given large amounts of training data. In practice, however, it is very difficult to obtain sufficient brain functional data, especially from subjects with brain disorders. As a result, GCNs usually fail to learn a reliable feature representation from limited BFNs, leading to overfitting issues. In this paper, we propose an improved GCN method to classify brain diseases by introducing a self-supervised learning (SSL) module for assisting the graph feature representation. We conduct experiments to classify subjects with mild cognitive impairment (MCI) and autism spectrum disorder (ASD) respectively from normal controls (NCs). 
Experimental results on two benchmark databases demonstrate that our proposed scheme tends to obtain higher classification accuracy than the baseline methods.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1830-1841"},"PeriodicalIF":3.6,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141491809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Structured Matrix Approximation for Robustness to Incomplete Biosequence Data","authors":"Chris Salahub;Jeffrey Uhlmann","doi":"10.1109/TCBB.2024.3420903","DOIUrl":"10.1109/TCBB.2024.3420903","url":null,"abstract":"We propose a general method for optimally approximating an arbitrary matrix $\\mathbf{M}$ by a structured matrix $\\mathbf{T}$ (circulant, Toeplitz/Hankel, etc.) and examine its use for estimating the spectra of genomic linkage disequilibrium matrices. This application is prototypical of a variety of genomic and proteomic problems that demand robustness to incomplete biosequence information. We perform a simulation study and corroborative test of our method using real genomic data from the Mouse Genome Database (Bult et al., 2019). The results confirm the predicted utility of the method and provide strong evidence of its potential value to a wide range of bioinformatics applications. Our optimal general matrix approximation method is expected to be of independent interest to an even broader range of applications in applied mathematics and engineering.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"2592-2597"},"PeriodicalIF":3.6,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141476497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ense-i6mA: Identification of DNA N6-Methyladenine Sites Using XGB-RFE Feature Selection and Ensemble Machine Learning","authors":"Xueqiang Fan;Bing Lin;Jun Hu;Zhongyi Guo","doi":"10.1109/TCBB.2024.3421228","DOIUrl":"10.1109/TCBB.2024.3421228","url":null,"abstract":"DNA N6-methyladenine (6mA) is an important epigenetic modification that plays a vital role in various cellular processes. Accurate identification of 6mA sites is fundamental to elucidating the biological functions and mechanisms of the modification. However, experimental methods for detecting 6mA sites are costly and time-consuming. In this study, we propose a novel computational method, called Ense-i6mA, to predict 6mA sites. Firstly, five encoding schemes, i.e., one-hot encoding, gcContent, Z-Curve, K-mer nucleotide frequency, and K-mer nucleotide frequency with gap, are employed to extract DNA sequence features. Secondly, eXtreme gradient boosting coupled with recursive feature elimination is applied to remove noisy features, which avoids over-fitting and reduces computing time and complexity. Then, the best subset of features is fed into base-classifiers composed of Extra Trees, eXtreme Gradient Boosting, Light Gradient Boosting Machine, and Support Vector Machine. Finally, to minimize generalization errors, the prediction probabilities of the base-classifiers are aggregated by averaging to infer the final 6mA site predictions. We conduct experiments on two species, i.e., Arabidopsis thaliana and Drosophila melanogaster, to compare the performance of Ense-i6mA against recent 6mA site prediction methods. 
The experimental results demonstrate that the proposed Ense-i6mA achieves area under the receiver operating characteristic curve values of 0.967 and 0.968, accuracies of 91.4% and 92.0%, and Matthews correlation coefficient values of 0.829 and 0.842 on two benchmark datasets, respectively, and outperforms several existing state-of-the-art methods.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1842-1854"},"PeriodicalIF":3.6,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141476496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Haplotype Frequency Inference From Pooled Genetic Data With a Latent Multinomial Model","authors":"Yong See Foo;Jennifer Flegg","doi":"10.1109/TCBB.2024.3420430","DOIUrl":"10.1109/TCBB.2024.3420430","url":null,"abstract":"In genetic association studies, haplotype data provide more refined information than data about separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may only provide results of pooled data. Methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases where the approximation fails because the normal covariance matrix is near-singular. As an alternative to approximate methods, in this paper we propose two exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, where the pooled results are considered integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference than existing approximate methods for synthetic data and for haplotype data from the 1000 Genomes Project. 
We also demonstrate how our methods can be applied to time-series of pooled genetic data, as a proof of concept of how our methods are relevant to more complex hierarchical settings, such as spatiotemporal models.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1864-1873"},"PeriodicalIF":3.6,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141467650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tropical Density Estimation of Phylogenetic Trees","authors":"Ruriko Yoshida;David Barnhill;Keiji Miura;Daniel Howe","doi":"10.1109/TCBB.2024.3420815","DOIUrl":"10.1109/TCBB.2024.3420815","url":null,"abstract":"Much evidence from biological theory and empirical data indicates that gene trees, i.e., phylogenetic trees reconstructed from different genes (loci), do not necessarily have exactly the same tree topologies. Such incongruence between gene trees might be caused by some “unusual” evolutionary events, such as meiotic sexual recombination in eukaryotes or horizontal transfers of genetic material in prokaryotes. However, most gene trees are constrained by the tree topology of the underlying species tree, that is, the phylogenetic tree depicting the evolutionary history of the set of species under consideration. In order to discover “outlying” gene trees that do not follow the “main distribution(s)” of trees, we propose to apply the “tropical metric” with the max-plus algebra from tropical geometry to non-parametric estimation of the distribution of gene trees over the space of phylogenetic trees. In this research we apply the “tropical metric,” a well-defined metric over the space of phylogenetic trees under the max-plus algebra, to non-parametric estimation of the gene tree distribution over the tree space. The kernel density estimator (KDE) is one of the most popular non-parametric estimators of a distribution from a given sample, and we propose an analogue of the classical KDE in the setting of tropical geometry with the tropical metric, which measures the length of an intrinsic geodesic between trees over the tree space. We estimate the probability of an observed tree by empirical frequencies of nearby trees, with the level of influence determined by the tropical metric. 
Then, with simulated data generated from the multispecies coalescent model, we show that the non-parametric estimation of the gene tree distribution using the tropical metric performs better than one using the Billera-Holmes-Vogtmann (BHV) metric developed by Weyenberg et al. in terms of computational times and accuracy. We then apply it to Apicomplexa data.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1855-1863"},"PeriodicalIF":3.6,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10577088","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141467651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}