Uncovering the Effects of Data Variation on Protein Sequence Classification Using Deep Learning

F. Mostafa, Y. Afify, R. Ismail, N. Badr
Journal: International Journal of Intelligent Computing and Information Sciences
Publication date: 2022-05-18
DOI: 10.21608/ijicis.2022.123177.1168 (https://doi.org/10.21608/ijicis.2022.123177.1168)
Citations: 1

Abstract

UNCOVERING THE EFFECTS OF DATA VARIATION ON PROTEIN SEQUENCE CLASSIFICATION USING DEEP LEARNING
As the number of proteins grows, bioinformaticians face increasing difficulty in analyzing and studying protein similarity. Protein sequence analysis helps in predicting protein functions, so it is critical to be able to categorize proteins appropriately based on their sequences. Features can be extracted from protein sequences using a variety of methods. The goal of this study is to investigate how different variations of the data affect the classification performance of a deep learning model that uses 3D data. First, several research questions were formulated regarding the impact of the following criteria on the proposed deep learning classification process: dataset size, IMF importance, feature size, and preprocessing. Second, comprehensive experiments were conducted to answer the research questions. Six feature extraction methods were used to create 3D features in two sizes (7x7x7 and 9x9x9), which were then fed into a convolutional neural network (CNN). Three datasets that differ in type, size, and balance state were used. Accuracy, precision, recall, and F1-score were used as the standard assessment metrics. The experimental results support four conclusions. First, the 7x7x7 feature matrix has a positive correlation between its dimensions, which improved the results. Second, using the sum of the first three IMF components had a better impact than using only the first IMF component. Third, the classification process did not benefit from feature normalization on the small datasets, unlike on the large dataset. Finally, the dataset size had a significant impact on training the CNN model, with training accuracy reaching 84.03%.
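The four assessment metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how per-class scores are combined for multi-class data, so macro averaging is an assumption here, and the function name `macro_metrics` and the toy labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score.

    Macro averaging (an unweighted mean over classes) is an assumption;
    the paper's abstract does not state which averaging was used.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this class
        else:
            fp[pred] += 1           # predicted class gets a false positive
            fn[truth] += 1          # true class gets a false negative

    labels = set(y_true) | set(y_pred)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    scores = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
    for name, value in scores.items():
        print(f"{name}: {value:.4f}")
```

On imbalanced datasets such as those mentioned in the abstract, macro averaging weights every class equally, so it can differ noticeably from accuracy; a weighted average would be an equally plausible choice.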