Bioinformatics Advances: latest articles

Improving protein function prediction by learning and integrating representations of protein sequences and function labels.
IF 2.4
Bioinformatics advances Pub Date : 2024-08-17 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae120
Frimpong Boadu, Jianlin Cheng
Motivation: As fewer than 1% of proteins have protein function information determined experimentally, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in protein function annotation databases such as UniProt.
Results: We introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definition and hierarchical relationships, which are combined to predict protein function via cross-attention. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy, but also delivers robust performance in predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms.
Availability and implementation: https://github.com/BioinfoMachineLearning/TransFew.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11374024/pdf/
Citations: 0
In the twilight zone of protein sequence homology: do protein language models learn protein structure?
IF 2.4
Bioinformatics advances Pub Date : 2024-08-17 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae119
Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu
Motivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.
Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11344590/pdf/
Citations: 0
splicekit: an integrative toolkit for splicing analysis from short-read RNA-seq.
IF 2.4
Bioinformatics advances Pub Date : 2024-08-17 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae121
Gregor Rot, Arne Wehling, Roland Schmucki, Nikolaos Berntenis, Jitao David Zhang, Martin Ebeling
Motivation: Analysis of alternative splicing using short-read RNA-seq data is a complex process that involves several steps: alignment of reads to the reference genome, identification of alternatively spliced features, motif discovery, analysis of RNA-protein binding near donor and acceptor splice sites, and exploratory data visualization. To the best of our knowledge, there is currently no integrative open-source software dedicated to this task.
Results: Here, we introduce splicekit, a Python package that provides and integrates a set of existing and novel splicing analysis tools for conducting splicing analysis.
Availability and implementation: The software splicekit is open-source and available on GitHub (https://github.com/bedapub/splicekit) and via the Python Package Index.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11364168/pdf/
Citations: 0
C2CDB: an advanced platform integrating comprehensive information and analysis tools of cancer-related circRNAs.
IF 2.4
Bioinformatics advances Pub Date : 2024-08-16 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae112
Yuanli Zuo, Wenrong Liu, Yang Jin, Yitong Pan, Ting Fan, Xin Fu, Jiawei Guo, Shuangyan Tan, Juan He, Yang Yang, Zhang Li, Chenyu Yang, Yong Peng
Motivation: Circular RNAs (circRNAs) play important roles in gene expression and their involvement in tumorigenesis is emerging. A circRNA-related database is a powerful tool for researchers to investigate circRNAs. However, existing databases lack an advanced platform integrating comprehensive information and analysis tools of cancer-related circRNAs.
Results: We developed a comprehensive platform called CircRNA to Cancer Database (C2CDB), encompassing 318 158 cancer-related circRNAs expressed in tumors and adjacent tissues across 30 types of cancers. C2CDB provides basic details such as sequence and expression levels of circRNAs, as well as crucial insights into biological mechanisms, including miRNA binding, RNA-binding protein interaction, coding potential, base modification, mutation, and secondary structure. Moreover, C2CDB collects an extensive compilation of published literature on cancer circRNAs, extracting and presenting pivotal content encompassing biological functions, underlying mechanisms, and molecular tools in these studies. Additionally, C2CDB offers integrated tools to analyse three potential mechanisms: circRNA-miRNA ceRNA interaction, circRNA encoding, and circRNA biogenesis, giving investigators convenient access to highly reliable information. To enhance clarity and organization, C2CDB has meticulously curated and integrated the previously chaotic nomenclature of circRNAs, addressing the prevailing confusion and ambiguity surrounding their designations.
Availability and implementation: C2CDB is freely available at http://pengyonglab.com/c2cdb.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11379471/pdf/
Citations: 0
Current and future directions in network biology.
IF 2.4
Bioinformatics advances Pub Date : 2024-08-14 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae099
Marinka Zitnik, Michelle M Li, Aydin Wells, Kimberly Glass, Deisy Morselli Gysi, Arjun Krishnan, T M Murali, Predrag Radivojac, Sushmita Roy, Anaïs Baudot, Serdar Bozdag, Danny Z Chen, Lenore Cowen, Kapil Devkota, Anthony Gitter, Sara J C Gosline, Pengfei Gu, Pietro H Guzzi, Heng Huang, Meng Jiang, Ziynet Nesibe Kesimoglu, Mehmet Koyuturk, Jian Ma, Alexander R Pico, Nataša Pržulj, Teresa M Przytycka, Benjamin J Raphael, Anna Ritz, Roded Sharan, Yang Shen, Mona Singh, Donna K Slonim, Hanghang Tong, Xinan Holly Yang, Byung-Jun Yoon, Haiyuan Yu, Tijana Milenković
Summary: Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology, focusing on molecular/cellular networks but also on other biological network types such as biomedical knowledge graphs, patient similarity networks, brain networks, and social/contact networks relevant to disease spread. In more detail, we highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on future directions of network biology. Additionally, we discuss scientific communities, educational initiatives, and the importance of fostering diversity within the field. This article establishes a roadmap for an immediate and long-term vision for network biology.
Availability and implementation: Not applicable.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11321866/pdf/
Citations: 0
Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision.
IF 2.4
Bioinformatics advances Pub Date : 2024-08-12 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae117
Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal
Summary: This article presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a subquadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pretrain the foundation model using reference genome sequences and apply it in the following downstream tasks: (i) identification of enhancers, promoters, and splice sites, (ii) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (iii) identification of biological function annotations of genomic sequences, and (iv) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
Availability and implementation: The source code used to develop and fine-tune the foundation model has been released on GitHub (https://github.itap.purdue.edu/Clan-labs/ENBED).
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11341122/pdf/
Citations: 0
gwid: an R package and Shiny application for Genome-Wide analysis of IBD data.
IF 2.4
Bioinformatics advances Pub Date : 2024-07-31 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae115
Soroush Mahmoudiandehkordi, Mehdi Maadooliat, Steven J Schrodi
Summary: Genome-wide identity by descent (gwid) is an R package developed for the analysis of identity-by-descent (IBD) data pertaining to dichotomous traits. This package offers a set of tools to assess differential IBD levels for the two states of a binary trait, yielding informative and meaningful results. Furthermore, it provides convenient functions to visualize the outcomes of these analyses, enhancing the interpretability and accessibility of the results. To assess the performance of the package, we conducted an evaluation using real genotype data derived from the SNPs to investigate rheumatoid arthritis susceptibility from the Marshfield Clinic Personalized Medicine Research Project.
Availability and implementation: gwid is available as an open-source R package. Release versions can be accessed on CRAN (https://cran.r-project.org/package=gwid) for all major operating systems. The development version is maintained on GitHub (https://github.com/soroushmdg/gwid) and full documentation with examples and workflow templates is provided via the package website (http://tinyurl.com/gwid-tutorial). An interactive R Shiny dashboard is also developed (https://tinyurl.com/gwid-shiny).
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11379470/pdf/
Citations: 0
Evolution and subfamilies of HERVL human endogenous retrovirus.
IF 2.4
Bioinformatics advances Pub Date : 2024-07-30 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae110
Huan Zhang, Martin C Frith
Background: Endogenous retroviruses (ERVs), which blur the boundary between virus and transposable element, are genetic material derived from retroviruses and have important implications for evolution. This study examines the diversity and evolution of human endogenous retroviruses (HERVs) of the HERVL family, which has long terminal repeats (LTRs) named MLT2.
Results: By probability-based sequence comparison, we uncover systematic annotation errors that conceal the true complexity and diversity of transposable elements (TEs) in the human genome. Our analysis identifies new subfamilies within the MLT2 group, proposes a refined classification scheme, and constructs new consensus sequences. We present an evolutionary analysis including phylogenetic trees that elucidate the relationships between these subfamilies and their contributions to human evolution. The results underscore the significance of accurate TE annotation in understanding genome evolution, highlighting the potential for misclassified TEs to impact interpretations of genomic studies.
Availability and implementation: Not applicable.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11319637/pdf/
Citations: 0
Introducing field-programmable gate arrays in genotype phasing and imputation.
IF 2.4
Bioinformatics advances Pub Date : 2024-07-30 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae114
Lars Wienbrandt, David Ellinghaus
Summary: We recently developed EagleImp, a free software that combines genotype phasing and imputation in a single tool. By introducing algorithmic and technical improvements we accelerated the classical two-step approach using Eagle2 and PBWT. Here, we demonstrate how to use field-programmable gate arrays (FPGAs) to accelerate EagleImp even further by a factor of up to 93% without loss of phasing and imputation quality. Due to the speed advantage over a non-accelerated processor-based implementation, the FPGA extension of EagleImp allows the user to choose a more resource-intensive parameter setting in exchange for computation time to further improve phasing and imputation quality.
Availability and implementation: EagleImp and its FPGA extension are freely available at https://github.com/ikmb/eagleimp and https://github.com/ikmb/eagleimp-fpga.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11333566/pdf/
Citations: 0
Systematic analysis on the horse-shoe-like effect in PCA plots of scRNA-seq data.
IF 2.4
Bioinformatics advances Pub Date : 2024-07-29 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae109
Najeebullah Shah, Qiuchen Meng, Ziheng Zou, Xuegong Zhang
Motivation: In single-cell studies, principal component analysis (PCA) is widely used to reduce the dimensionality of a dataset and visualize it in 2D or 3D PC plots. Scientists often focus on different clusters within a PC plot, overlooking specific phenomena, such as the horse-shoe-like effect, that may reveal hidden knowledge about the underlying biological dataset. This phenomenon remains largely unexplored in single-cell studies.
Results: In this study, we investigated the horse-shoe-like effect in PC plots using simulated and real scRNA-seq datasets. We systematically explain the horse-shoe-like phenomenon from various inter-related perspectives. Initially, we establish an intuitive understanding with the help of simulated datasets. Then, we generalize the acquired knowledge to real biological scRNA-seq data. Experimental results provide logical explanations and understanding for the appearance of the horse-shoe-like effect in PC plots. Furthermore, we identify a potential problem with the well-known 'distance saturation property' theory, to which the horse-shoe phenomenon has been attributed. Finally, we analyse a mathematical model for the horse-shoe effect that suggests trigonometric solutions for the estimated eigenvectors. We observe significant resemblance after comparing the results of the mathematical model with simulated and real scRNA-seq datasets.
Availability and implementation: The code for reproducing the results of this study is available at: https://github.com/najeebullahshah/PCA-Horse-Shoe.
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11316618/pdf/
Citations: 0