Prediction of Super-enhancers Based on Mean-shift Undersampling
Authors: Han Cheng, Shumei Ding, Cangzhi Jia
Journal: Current Bioinformatics (impact factor 2.4; JCR Q3, Biochemical Research Methods)
Published: 2023-12-01
DOI: https://doi.org/10.2174/0115748936268302231110111456
Background: Super-enhancers are clusters of enhancers defined by the binding occupancy of master transcription factors, chromatin regulators, or chromatin marks. Super-enhancers have been reported to be transcriptionally more active and more cell-type-specific than regular enhancers, so it is necessary to distinguish them from regular enhancers. A variety of computational methods have been proposed as auxiliary tools for identifying super-enhancers. However, most of these methods rely on ChIP-seq data; when such data are unavailable, the predictors either cannot run or fail to achieve satisfactory performance.
Objective: The aim of this study is to propose a stacking computational model, based on the fusion of multiple features, to identify super-enhancers in both human and mouse.
Methods: This work adopted mean-shift to cluster the majority-class samples and selected five balanced datasets for mouse and three balanced datasets for human to train the stacking model. Five types of sequence information are used as input to XGBoost classifiers, and the average of the probability outputs of the classifiers is taken as the final classification result.
Results: The results of 10-fold cross-validation and cross-cell-line validation show that our method outperforms other existing methods. The source code and datasets are available at https://github.com/Cheng-Han-max/SE_voting.
Conclusion: The analysis of feature importance indicates that Mismatch features account for the largest share of the top 20 most important features.
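The Methods summary can be illustrated with a minimal sketch. It is not the authors' code (that is in the repository linked above). It assumes a flat-kernel mean shift written in plain Python, and it assumes that a balanced subset is formed by drawing majority-class samples round-robin from the clusters; the abstract does not say how samples are picked from each cluster. The function names `mean_shift`, `balanced_subset`, and `average_probabilities` are hypothetical, and the probability lists stand in for the outputs of trained XGBoost classifiers.

```python
import math
import random


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns one cluster label per input point."""
    modes = [list(p) for p in points]
    for _ in range(max_iter):
        moved = 0.0
        new_modes = []
        for m in modes:
            # Original points inside the window around the current mode.
            near = [p for p in points if math.dist(m, p) <= bandwidth] or [m]
            centre = [sum(c) / len(near) for c in zip(*near)]
            moved = max(moved, math.dist(m, centre))
            new_modes.append(centre)
        modes = new_modes
        if moved < tol:
            break
    # Modes that converged to nearly the same location share a label.
    centres, labels = [], []
    for m in modes:
        for k, c in enumerate(centres):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels


def balanced_subset(majority, n_minority, bandwidth, seed=0):
    """Indices of about n_minority majority samples, spread over clusters.

    Assumption: round-robin draws across clusters (not stated in the abstract).
    """
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(mean_shift(majority, bandwidth)):
        clusters.setdefault(lab, []).append(idx)
    for members in clusters.values():
        rng.shuffle(members)
    target = min(n_minority, len(majority))
    chosen = []
    while len(chosen) < target:
        for members in clusters.values():
            if members and len(chosen) < target:
                chosen.append(members.pop())
    return sorted(chosen)


def average_probabilities(prob_lists, threshold=0.5):
    """Average the per-classifier probabilities, then threshold them."""
    n = len(prob_lists)
    avg = [sum(ps) / n for ps in zip(*prob_lists)]
    return avg, [int(p >= threshold) for p in avg]


# Toy majority class with two well-separated groups of three samples each.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
subset = balanced_subset(majority, n_minority=2, bandwidth=1.0)

# Placeholder probabilities from two hypothetical classifiers.
avg, predicted = average_probabilities([[0.9, 0.2], [0.7, 0.4]])
```

Calling `balanced_subset` with different seeds yields different balanced subsets, which is one plausible way to obtain several balanced training sets from a single majority class. The bandwidth is a free parameter here and would need tuning on real feature vectors.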
About the journal:
Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely, in-depth/mini-reviews, research papers and guest edited thematic issues written by leaders in the field, covering a wide range of the integration of biology with computer and information science.
The journal focuses on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications for key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.