GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae009
Mohammad Torabi, Georgios D Mitsis, Jean-Baptiste Poline
{"title":"On the variability of dynamic functional connectivity assessment methods.","authors":"Mohammad Torabi, Georgios D Mitsis, Jean-Baptiste Poline","doi":"10.1093/gigascience/giae009","DOIUrl":"10.1093/gigascience/giae009","url":null,"abstract":"<p><strong>Background: </strong>Dynamic functional connectivity (dFC) has become an important measure for understanding brain function and a potential biomarker. However, various methodologies have been developed for assessing dFC, and it is unclear how the choice of method affects the results. In this work, we aimed to study the variability in results across commonly used dFC methods.</p><p><strong>Methods: </strong>We implemented 7 dFC assessment methods in Python and used them to analyze the functional magnetic resonance imaging data of 395 subjects from the Human Connectome Project. We measured the similarity of dFC results yielded by different methods using several metrics to quantify overall, temporal, spatial, and intersubject similarity.</p><p><strong>Results: </strong>Our results showed a range of weak to strong similarity between the results of different methods, indicating considerable overall variability. Somewhat surprisingly, the observed variability in dFC estimates was found to be comparable to the expected functional connectivity variation over time, emphasizing the impact of methodological choices on the final results. Our findings revealed 3 distinct groups of methods with significant intergroup variability, each exhibiting distinct assumptions and advantages.</p><p><strong>Conclusions: </strong>Overall, our findings shed light on the impact of dFC assessment analytical flexibility and highlight the need for multianalysis approaches and careful method selection to capture the full range of dFC variation. They also emphasize the importance of distinguishing neural-driven dFC variations from physiological confounds and developing validation frameworks under a known ground truth.
To facilitate such investigations, we provide an open-source Python toolbox, PydFC, which facilitates multianalysis dFC assessment, with the goal of enhancing the reliability and interpretability of dFC studies.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11000510/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140863530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae070
Yuanting Shen, Lidan Tao, Rengang Zhang, Gang Yao, Minjie Zhou, Weibang Sun, Yongpeng Ma
{"title":"Genomic insights into endangerment and conservation of the garlic-fruit tree (Malania oleifera), a plant species with extremely small populations.","authors":"Yuanting Shen, Lidan Tao, Rengang Zhang, Gang Yao, Minjie Zhou, Weibang Sun, Yongpeng Ma","doi":"10.1093/gigascience/giae070","DOIUrl":"10.1093/gigascience/giae070","url":null,"abstract":"<p><strong>Background: </strong>Advanced whole-genome sequencing techniques can capture nearly all nucleotide variation in a genome and thus can provide deep insights into protecting endangered species. However, the use of genomic data to design conservation strategies is still rare, particularly for endangered plants. Here we performed a comprehensive conservation genomic analysis for Malania oleifera, an endangered tree species with a high amount of nervonic acid. We used whole-genome resequencing data of 165 samples, covering 16 populations across the entire distribution range, to investigate the causes of its extremely small population sizes and to evaluate the possible genomic offsets and changes in ecological niche suitability under future climate change.</p><p><strong>Results: </strong>Although M. oleifera maintains relatively high genetic diversity among endangered woody plants (θπ = 3.87 × 10⁻³), high levels of inbreeding have been observed, which have reduced genetic diversity in 3 populations (JM, NP, and BM2) and caused the accumulation of deleterious mutations. Repeated bottleneck events, recent inbreeding (∼490 years ago), and anthropogenic disturbance to wild habitats have aggravated the fragmentation of M. oleifera and made it endangered. Due to the significant effect of higher average annual temperature, populations at low altitudes exhibit a greater genomic offset. Furthermore, ecological niche modeling shows the suitable habitats for M.
oleifera will decrease by 71.15% and 98.79% in 2100 under scenarios SSP126 and SSP585, respectively.</p><p><strong>Conclusions: </strong>The basic realizations concerning the threats to M. oleifera provide scientific foundation for defining management and adaptive units, as well as prioritizing populations for genetic rescue. Meanwhile, we highlight the importance of integrating genomic offset and ecological niche modeling to make targeted conservation actions under future climate change. Overall, our study provides a paradigm for genomics-directed conservation.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417964/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142283910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae064
Yan Lu, Fang Luo, An Zhou, Cun Yi, Hao Chen, Jian Li, Yunhai Guo, Yuxiang Xie, Wei Zhang, Datao Lin, Yaming Yang, Zhongdao Wu, Yi Zhang, Shuhua Xu, Wei Hu
{"title":"Whole-genome sequencing of the invasive golden apple snail Pomacea canaliculata from Asia reveals rapid expansion and adaptive evolution.","authors":"Yan Lu, Fang Luo, An Zhou, Cun Yi, Hao Chen, Jian Li, Yunhai Guo, Yuxiang Xie, Wei Zhang, Datao Lin, Yaming Yang, Zhongdao Wu, Yi Zhang, Shuhua Xu, Wei Hu","doi":"10.1093/gigascience/giae064","DOIUrl":"10.1093/gigascience/giae064","url":null,"abstract":"<p><p>Pomacea canaliculata, an invasive species native to South America, is recognized for its broad geographic distribution and adaptability to a variety of ecological conditions. The details concerning the evolution and adaptation of P. canaliculata remain unclear due to a lack of whole-genome resequencing data. We examined 173 P. canaliculata genomes representing 17 geographic populations in East and Southeast Asia. Interestingly, P. canaliculata showed a higher level of genetic diversity than other mollusks, and our analysis suggested that the dispersal of P. canaliculata could have been driven by climate changes and human activities. Notably, we identified a set of genes associated with low temperature adaptation, including Csde1, a cold shock protein coding gene. Further RNA sequencing analysis and reverse transcription quantitative polymerase chain reaction experiments demonstrated the gene's dynamic pattern and biological functions during cold exposure. Moreover, both positive selection and balancing selection are likely to have contributed to the rapid environmental adaptation of P. canaliculata populations. In particular, genes associated with energy metabolism and stress response were undergoing positive selection, while a large number of immune-related genes showed strong signatures of balancing selection. Our study has advanced our understanding of the evolution of P.
canaliculata and has provided a valuable resource concerning an invasive species.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417965/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142283912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae068
Danilo Bzdok, Guy Wolf, Jakub Kopal
{"title":"Harnessing population diversity: in search of tools of the trade.","authors":"Danilo Bzdok, Guy Wolf, Jakub Kopal","doi":"10.1093/gigascience/giae068","DOIUrl":"https://doi.org/10.1093/gigascience/giae068","url":null,"abstract":"<p><p>Big neuroscience datasets are not big small datasets when it comes to quantitative data analysis. Neuroscience has now witnessed the advent of many population cohort studies that deep-profile participants, yielding hundreds of measures, capturing dimensions of each individual's position in the broader society. Indeed, there is a rebalancing from small, strictly selected, and thus homogenized cohorts toward ever larger, more representative, and thus diverse cohorts. This shift in cohort composition is prompting the revision of incumbent modeling practices. Major sources of population stratification increasingly overshadow the subtle effects that neuroscientists are typically studying. In our opinion, as we sample individuals from ever more diverse backgrounds, we will require a new stack of quantitative tools to realize diversity-aware modeling. We here take inventory of candidate analytical frameworks. Better incorporating driving factors behind population structure will allow refining our understanding of how brain-behavior relationships depend on human subgroups.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427908/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142344886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae080
Shizhuo Zhang, Jiyun Han, Juntao Liu
{"title":"Protein-protein and protein-nucleic acid binding site prediction via interpretable hierarchical geometric deep learning.","authors":"Shizhuo Zhang, Jiyun Han, Juntao Liu","doi":"10.1093/gigascience/giae080","DOIUrl":"10.1093/gigascience/giae080","url":null,"abstract":"<p><p>Identification of protein-protein and protein-nucleic acid binding sites provides insights into biological processes related to protein functions and technical guidance for disease diagnosis and drug design. However, accurate predictions by computational approaches remain highly challenging due to the limited knowledge of residue binding patterns. The binding pattern of a residue should be characterized by the spatial distribution of its neighboring residues combined with their physicochemical information interaction, which previous methods cannot yet achieve. Here, we design GraphRBF, a hierarchical geometric deep learning model to learn residue binding patterns from big data. To achieve this, GraphRBF describes physicochemical information interactions by designing an enhanced graph neural network and characterizes residue spatial distributions by introducing a prioritized radial basis function neural network. After training and testing, GraphRBF shows great improvements over existing state-of-the-art methods and strong interpretability of its learned representations. When applied to the SARS-CoV-2 Omicron spike protein, GraphRBF successfully identifies known epitopes of the protein. Moreover, it predicts multiple potential binding regions for new nanobodies or even new drugs with strong evidence.
A user-friendly online server for GraphRBF is freely available at http://liulab.top/GraphRBF/server.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11528319/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142557605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giad113
Sheeba Samuel, Daniel Mietchen
{"title":"Computational reproducibility of Jupyter notebooks from biomedical publications.","authors":"Sheeba Samuel, Daniel Mietchen","doi":"10.1093/gigascience/giad113","DOIUrl":"10.1093/gigascience/giad113","url":null,"abstract":"<p><strong>Background: </strong>Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications.</p><p><strong>Approach: </strong>We address computational reproducibility at 2 levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining the article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over the course of 2 years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central has grown in a highly dynamic fashion.</p><p><strong>Results: </strong>Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. 
For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions.</p><p><strong>Conclusions: </strong>We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10783158/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139416803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giad111
Akshay Akshay, Mitali Katoch, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C Burkhard, Rosalyn M Adam, Katia Monastyrskaya, Ali Hashemi Gheinani
{"title":"Machine Learning Made Easy (MLme): a comprehensive toolkit for machine learning-driven data analysis.","authors":"Akshay Akshay, Mitali Katoch, Navid Shekarchizadeh, Masoud Abedi, Ankush Sharma, Fiona C Burkhard, Rosalyn M Adam, Katia Monastyrskaya, Ali Hashemi Gheinani","doi":"10.1093/gigascience/giad111","DOIUrl":"10.1093/gigascience/giad111","url":null,"abstract":"<p><strong>Background: </strong>Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.</p><p><strong>Results: </strong>To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, specifically focusing on classification problems at present. By integrating 4 essential functionalities (namely, Data Exploration, AutoML, CustomML, and Visualization), MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding efforts. To demonstrate the applicability of MLme, we conducted rigorous testing on 6 distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across different datasets, reaffirming the versatility and effectiveness of the tool.
Additionally, by utilizing MLme's feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.</p><p><strong>Conclusion: </strong>MLme serves as a valuable resource for leveraging ML to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10783149/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139416804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae042
Chao Zhang, Lin Liu, Ying Zhang, Mei Li, Shuangsang Fang, Qiang Kang, Ao Chen, Xun Xu, Yong Zhang, Yuxiang Li
{"title":"spatiAlign: an unsupervised contrastive learning model for data integration of spatially resolved transcriptomics.","authors":"Chao Zhang, Lin Liu, Ying Zhang, Mei Li, Shuangsang Fang, Qiang Kang, Ao Chen, Xun Xu, Yong Zhang, Yuxiang Li","doi":"10.1093/gigascience/giae042","DOIUrl":"10.1093/gigascience/giae042","url":null,"abstract":"<p><strong>Background: </strong>Integrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times.</p><p><strong>Findings: </strong>We propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells, to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space.</p><p><strong>Conclusions: </strong>In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. 
Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":3.5,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11258913/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141727100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GigaScience | Pub Date: 2024-01-02 | DOI: 10.1093/gigascience/giae078
Xing Liu, Chi Qu, Chuandong Liu, Na Zhu, Huaqiang Huang, Fei Teng, Caili Huang, Bingying Luo, Xuanzhu Liu, Min Xie, Feng Xi, Mei Li, Liang Wu, Yuxiang Li, Ao Chen, Xun Xu, Sha Liao, Jiajun Zhang
{"title":"StereoSiTE: a framework to spatially and quantitatively profile the cellular neighborhood organized iTME.","authors":"Xing Liu, Chi Qu, Chuandong Liu, Na Zhu, Huaqiang Huang, Fei Teng, Caili Huang, Bingying Luo, Xuanzhu Liu, Min Xie, Feng Xi, Mei Li, Liang Wu, Yuxiang Li, Ao Chen, Xun Xu, Sha Liao, Jiajun Zhang","doi":"10.1093/gigascience/giae078","DOIUrl":"https://doi.org/10.1093/gigascience/giae078","url":null,"abstract":"<p><strong>Background: </strong>Spatial transcriptome (ST) technologies are emerging as powerful tools for studying tumor biology. However, existing tools for analyzing ST data are limited, as they mainly rely on algorithms developed for single-cell RNA sequencing data and do not fully utilize the spatial information. While some algorithms have been developed for ST data, they are often designed for specific tasks, lacking a comprehensive analytical framework for leveraging spatial information.</p><p><strong>Results: </strong>In this study, we present StereoSiTE, an analytical framework that combines open-source bioinformatics tools with custom algorithms to accurately infer the functional spatial cell interaction intensity (SCII) within the cellular neighborhood (CN) of interest. We applied StereoSiTE to decode ST datasets from xenograft models and found that the CN efficiently distinguished different cellular contexts, while the SCII analysis provided more precise insights into intercellular interactions by incorporating spatial information. By applying StereoSiTE to multiple samples, we successfully identified a CN region dominated by neutrophils, suggesting their potential role in remodeling the immune tumor microenvironment (iTME) after treatment. 
Moreover, the SCII analysis within the CN region revealed neutrophil-mediated communication, supported by pathway enrichment, transcription factor regulon activities, and protein-protein interactions.</p><p><strong>Conclusions: </strong>StereoSiTE represents a promising framework for unraveling the mechanisms underlying treatment response within the iTME by leveraging CN-based tissue domain identification and SCII-inferred spatial intercellular interactions. The software is designed to be scalable, modular, and user-friendly, making it accessible to a wide range of researchers.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503478/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142498592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning a generalized graph transformer for protein function prediction in dissimilar sequences.","authors":"Yiwei Fu, Zhonghui Gu, Xiao Luo, Qirui Guo, Luhua Lai, Minghua Deng","doi":"10.1093/gigascience/giae093","DOIUrl":"https://doi.org/10.1093/gigascience/giae093","url":null,"abstract":"<p><strong>Background: </strong>In the face of a growing disparity between high-throughput sequence data and low-throughput experimental studies, the emerging field of deep learning stands as a promising alternative. Generally, many data-driven approaches are capable of facilitating fast and accurate predictions of protein functions. Nevertheless, the inherent statistical nature of deep learning techniques may limit their generalization capabilities when applied to novel nonhomologous proteins that diverge significantly from existing ones.</p><p><strong>Results: </strong>In this work, we propose a novel, generalized approach named Graph Adversarial Learning with Alignment (GALA) for protein function prediction. Our GALA method integrates a graph transformer architecture with an attention pooling module to extract embeddings from both protein sequences and structures, facilitating unified learning of protein representations. Notably, GALA incorporates a domain discriminator conditioned on both learnable representations and predicted probabilities, which undergoes adversarial learning to ensure representation invariance across diverse environments. To optimize the model with abundant label information, we generate label embeddings in the hidden space, explicitly aligning them with protein representations. Benchmarked on datasets derived from the PDB database and Swiss-Prot database, GALA achieves performance comparable to several state-of-the-art methods.
Moreover, GALA demonstrates strong biological interpretability by identifying significant functional residues associated with Gene Ontology terms through class activation mapping.</p><p><strong>Conclusions: </strong>GALA, which leverages adversarial learning and label embedding alignment to acquire domain-invariant protein representations, exhibits outstanding generalizability in function prediction for proteins from previously unseen sequence space. By incorporating the structures predicted by AlphaFold2, GALA demonstrates significant potential for function annotation in newly discovered sequences. A detailed implementation of GALA is available at https://github.com/fuyw-aisw/GALA.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142828050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}