An Effective Algorithm Based on Sequence and Property Information for N4-methylcytosine Identification in Multiple Species

Impact factor: 1.0; Region 4 (Chemistry); JCR Q4 (CHEMISTRY, ORGANIC)

Letters in Organic Chemistry Pub Date : 2024-01-26 DOI:10.2174/0115701786277281231228093405

Lichao Zhang, Xueting Wang, Kang Xiao, Liang Kong

Citations: 0

Abstract

N4-methylcytosine (4mC) is one of the most important epigenetic modifications; it plays a significant role in biological processes and helps explain biological functions. Although biological experiments can identify potential 4mC sites, they are constrained by the experimental environment and are labor-intensive. A computational model for identifying 4mC sites is therefore needed. Several computational methods have been proposed, but two problems remain: (1) a more accurate algorithm is required to improve prediction, especially as measured by the Matthews correlation coefficient (MCC); (2) a simpler method is needed for clinical research aimed at designing medicines or treating disease. To address these problems, this study proposes an effective algorithm that uses comprehensible encodings in multiple species. Because nucleotide arrangement and nucleotide property information can reflect sequence structure and function, several feature vectors were developed from nucleotide energy information, trinucleotide energy information, and nucleotide chemical property information. The effect of each feature was then analyzed to select the optimal feature vectors for multiple species. Finally, the optimal feature vectors were fed into the CatBoost algorithm to construct the identification model. In the evaluation, the proposed model obtained the highest MCC, exceeding previous models by 2.5% to 11.1%, 1.4% to 17.8%, 1.1% to 7.6%, and 2.3% to 18.0% on the A. thaliana, C. elegans, D. melanogaster, and E. coli datasets, respectively. These results indicate that the proposed method can identify 4mC sites in multiple species, with its advantage most evident in MCC, and that it could provide a reasonable supplement to biological research.
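As a rough illustration of two ingredients named in the abstract, the sketch below encodes a DNA sequence by nucleotide chemical properties and computes MCC from binary labels. The abstract does not give the encoding table, so the (ring structure, functional group, hydrogen bonding) triples here follow a convention widely used in the sequence-analysis literature and are an assumption, as are the helper names `encode_ncp` and `mcc`. The energy-based features and the CatBoost model itself are not reproduced.

```python
import math

# Assumed chemical-property triples: (ring structure, functional group,
# hydrogen bonding). This is a common convention, not confirmed as the
# paper's exact encoding.
NCP = {
    "A": (1, 1, 1),  # purine, amino, weak hydrogen bonding
    "C": (0, 1, 0),  # pyrimidine, amino, strong hydrogen bonding
    "G": (1, 0, 0),  # purine, keto, strong hydrogen bonding
    "T": (0, 0, 1),  # pyrimidine, keto, weak hydrogen bonding
}


def encode_ncp(seq):
    """Flatten a DNA sequence into three binary values per nucleotide."""
    features = []
    for base in seq.upper():
        features.extend(NCP[base])  # raises KeyError for non-ACGT symbols
    return features


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = 4mC site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a full pipeline, vectors such as those returned by `encode_ncp` would be concatenated with the other feature groups and passed to a gradient-boosting classifier, with `mcc` computed on held-out predictions. MCC is informative here because it uses all four cells of the confusion matrix, so it does not reward a model for predicting only the majority class.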

Source journal

Letters in Organic Chemistry (Chemistry: Organic Chemistry)

CiteScore: 1.30

Self-citation rate: 12.50%

Articles published: 135

Review time: 7 months

Journal description: Aims & Scope. Letters in Organic Chemistry publishes original letters (short articles), research articles, mini-reviews, and thematic issues based on mini-reviews and short articles, in all areas of organic chemistry, including synthesis, bioorganic, medicinal, natural products, organometallic, supramolecular, molecular recognition, and physical organic chemistry. The emphasis is on publishing quality papers rapidly by taking full advantage of the latest technology for both submission and review of manuscripts. The journal is essential reading for organic chemists in both academia and industry.