Uncovering the Effects of Data Variation on Protein Sequence Classification Using Deep Learning

International Journal of Intelligent Computing and Information Sciences Pub Date : 2022-05-18 DOI:10.21608/ijicis.2022.123177.1168

F. Mostafa, Y. Afify, R. Ismail, N. Badr


Citations: 1

Abstract



As the number of known proteins grows, bioinformaticians find it increasingly difficult to analyze and study protein similarity. Protein sequence analysis helps predict protein function, so the analysis process must be able to categorize proteins correctly from their sequences. Features can be extracted from protein sequences by a variety of methods. The goal of this study is to investigate how variations in the data affect the classification performance of a deep learning model that uses 3D data. First, several research questions were formulated about the impact of four factors on the proposed deep learning classification process: dataset size, IMF importance, feature size, and preprocessing. Second, comprehensive experiments were conducted to answer these questions. Six feature extraction methods were used to create 3D features in two sizes (7x7x7 and 9x9x9), which were then fed into a convolutional neural network (CNN). Three datasets that differ in type, size, and class balance were used. The assessment metrics were the standard ones: accuracy, precision, recall, and F1-score. The experiments support four conclusions. First, the 7x7x7 feature matrix has a positive correlation between its dimensions, which improved the results. Second, using the sum of the first three IMF components gave better results than using the first IMF component alone. Third, normalizing the features did not benefit classification on the small datasets, unlike on the large dataset. Finally, dataset size had a significant impact on training the CNN model, with training accuracy reaching 84.03%.
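The abstract names three mechanical steps without giving details: packing features into a 7x7x7 or 9x9x9 cube, normalizing features, and scoring with accuracy, precision, recall, and F1-score. The following is a minimal sketch of those steps in plain Python, not the authors' implementation. It assumes a row-major layout for the cube, min-max scaling as the normalization, and macro averaging across classes; none of these choices is stated in the abstract, and the function names `to_cube`, `min_max_normalize`, and `macro_metrics` are hypothetical.

```python
from typing import Dict, List, Sequence


def to_cube(features: Sequence[float], side: int) -> List[List[List[float]]]:
    """Reshape a flat vector of length side**3 into a side x side x side cube.

    Assumption: row-major order, so cube[i][j][k] == features[(i*side + j)*side + k].
    """
    if len(features) != side ** 3:
        raise ValueError(f"expected {side ** 3} values, got {len(features)}")
    return [
        [list(features[(i * side + j) * side:(i * side + j + 1) * side])
         for j in range(side)]
        for i in range(side)
    ]


def min_max_normalize(features: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1] (assumed scheme); a constant vector maps to zeros."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(x - lo) / (hi - lo) for x in features]


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 over labels in y_true."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy data only: 343 = 7**3 placeholder values, not real protein features.
    cube = to_cube(min_max_normalize(list(range(343))), side=7)
    print(len(cube), len(cube[0]), len(cube[0][0]))
    print(macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Macro averaging weights every class equally, which is one common choice when datasets differ in class balance, as the three datasets here do; a weighted or micro average would give different numbers on imbalanced data.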


Source journal

International Journal of Intelligent Computing and Information Sciences

Self-citation rate

0.00%

Publication count