A Unique Approach for Comparison of Protein Sequence Using PCA Analysis

J. Pal, Shinjini Ghosh, B. Maji, D. K. Bhattacharya
2021 International Conference on Technological Advancements and Innovations (ICTAI), published 2021-11-10. DOI: 10.1109/ICTAI53825.2021.9673245 (https://doi.org/10.1109/ICTAI53825.2021.9673245)

Abstract
Physicochemical properties of amino acids play a significant role in the comparison of protein sequences. In the literature, arbitrary and random combinations of these properties have been considered for protein sequence comparison. In the present paper, protein sequences are compared using only five known physical properties of the amino acids. Principal component analysis (PCA) is applied to the numerical values of these physical properties for the twenty amino acids to reduce their dimensions. As a result, 20 TP values corresponding to the amino acids are obtained. Protein sequences are represented using these 20 TP values. Cumulative sums of the represented sequences are then taken to obtain a non-degenerate representation of each protein sequence. A new form of descriptor is obtained using a generalized form of three moment vectors consisting of first-, second-, and third-order moments. Distance matrices are then computed using Euclidean distance as the distance measure. Finally, phylogenetic trees based on these distance matrices are constructed using the UPGMA algorithm. The proposed method is applied to 9 ND4, 9 ND6, 16 ND5, 12 Baculovirus, and 24 TF protein sequences. The results obtained by this new method are on par with the biological reference and comparable with results obtained earlier on the same species by other methods.
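
The pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the abstract does not list the five physical properties or their values, so the property table here is seeded random placeholder data; the 20 values are assumed to be first-principal-component scores, one per amino acid; and the moment descriptor (means of powers of the length-normalized cumulative sum) is a hypothetical stand-in for the paper's generalized moment vectors. The function names and toy sequences are likewise made up for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def pca_scores(props):
    """Return the first-principal-component score of each row of `props`."""
    # Standardize each property column, then take the leading right-singular vector.
    z = (props - props.mean(axis=0)) / props.std(axis=0)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]  # shape (20,): one value per amino acid (assumed reading)


def descriptor(seq, value_of):
    """Three-component moment descriptor of a sequence (hypothetical form)."""
    encoded = np.array([value_of[a] for a in seq], dtype=float)
    cumulative = np.cumsum(encoded) / len(seq)  # assumed length normalization
    # First-, second-, and third-order raw moments of the cumulative-sum series.
    return np.array([np.mean(cumulative ** k) for k in (1, 2, 3)])


def distance_matrix(descriptors):
    """Pairwise Euclidean distances between descriptor vectors."""
    d = np.asarray(descriptors)
    diff = d[:, None, :] - d[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# PLACEHOLDER data: random numbers, not real physical properties of amino acids.
rng = np.random.default_rng(0)
props = rng.normal(size=(20, 5))
value_of = dict(zip(AMINO_ACIDS, pca_scores(props)))

# Toy sequences invented for the example (not from the paper's datasets).
seqs = {"toy_a": "MKVLAAGIVG", "toy_b": "MKVLSAGIVG", "toy_c": "WYPPHDEQNR"}
descs = [descriptor(s, value_of) for s in seqs.values()]
D = distance_matrix(descs)
print(np.round(D, 3))
```

The resulting matrix `D` is what a tree-building step would consume. UPGMA is average-linkage hierarchical clustering, so one option, assuming SciPy is available, is to pass the condensed form of `D` (from `scipy.spatial.distance.squareform`) to `scipy.cluster.hierarchy.linkage` with `method="average"`.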