EnsembleSE: identification of super-enhancers based on ensemble learning.

IF 2.5, Zone 3 (Biology), Q3 BIOTECHNOLOGY & APPLIED MICROBIOLOGY

Briefings in Functional Genomics Pub Date : 2025-01-15 DOI:10.1093/bfgp/elaf003

Wenying He, Jialu Xu, Yun Zuo, Yude Bai, Fei Guo

Volume 24. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12008123/pdf/

Citations: 0

Abstract

EnsembleSE: identification of super-enhancers based on ensemble learning.

Super-enhancers (SEs) are typically located in the regulatory regions of genes, where they drive high-level gene expression. Identifying SEs is crucial for a deeper understanding of gene regulatory networks, disease mechanisms, and the development and physiological processes of organisms, and thus has a profound impact on research and applications in the life sciences. Traditional experimental methods for identifying SEs are costly and time-consuming. Existing methods that predict SEs from sequence data alone use deep learning for feature representation and have achieved good results. However, they overlook biological features related to physicochemical properties, leading to low interpretability. Additionally, their complex model structures often require extensive labeled data for training, which limits their further application to biological data. In this paper, we integrate the strengths of different models and propose an ensemble model based on an integration strategy to enhance generalization ability. We design a multi-angle feature representation method that combines local structure and global information to extract high-dimensional abstract relationships and key low-dimensional biological features from sequences. This enhances the effectiveness and interpretability of the model's input features, providing technical support for discovering cell-specific and species-specific patterns of SEs. We evaluated performance on both mouse and human datasets using five metrics, including area under the receiver operating characteristic curve, accuracy, and others. Compared to the latest models, EnsembleSE achieved an average improvement of 4.5% in F1 score and 8.05% in recall on a unified test set, demonstrating the robustness and adaptability of the model. Source code is available at https://github.com/2103374200/EnsembleSE-main.
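The abstract does not name the specific feature extractors, base models, or integration rule, so the following is only an illustrative sketch of the general ideas it mentions, not EnsembleSE's implementation. It assumes k-mer frequencies as a stand-in for local sequence features, GC content as a stand-in for a simple interpretable global feature, and a weighted average of per-model probabilities (soft voting) as the ensemble step. All function names are hypothetical. The last helper computes recall and F1, two of the metrics the abstract reports.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer counts over A/C/G/T (assumed local-composition feature)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def gc_content(seq):
    """Fraction of G or C bases (assumed global, interpretable feature)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def soft_vote(probabilities, weights=None):
    """Weighted average of positive-class probabilities from several models."""
    weights = weights or [1.0] * len(probabilities)
    return sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)


def recall_and_f1(y_true, y_pred):
    """Recall and F1 for binary labels, with 1 as the positive (SE) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1
```

In this sketch, a sequence's input vector would be `kmer_frequencies(seq) + [gc_content(seq)]`. Each base model would emit a probability for that vector, and `soft_vote` would combine them before thresholding. The authors' actual pipeline is in the repository linked above.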

Source journal

Briefings in Functional Genomics BIOTECHNOLOGY & APPLIED MICROBIOLOGY-GENETICS & HEREDITY

CiteScore

6.30

Self-citation rate

2.50%

Articles published

Review time

6-12 weeks

About the journal: Briefings in Functional Genomics publishes high-quality, peer-reviewed articles that focus on the use, development or exploitation of genomic approaches, and their application to all areas of biological research. As well as exploring thematic areas where these techniques and protocols are being used, articles review the impact that these approaches have had, or are likely to have, on their field. Subjects covered by the Journal include but are not restricted to: the identification and functional characterisation of coding and non-coding features in genomes, microarray technologies, gene expression profiling, next-generation sequencing, pharmacogenomics, phenomics, SNP technologies, transgenic systems, mutation screens and genotyping. Articles range in scope and depth from the introductory level to specific details of protocols and analyses, encompassing bacterial, fungal, plant, animal and human data. The editorial board welcomes the submission of review articles for publication. Essential criteria for the publication of papers are that they do not contain primary data, and that they are high-quality, clearly written review articles which provide a balanced, highly informative and up-to-date perspective to researchers in the field of functional genomics.