A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Pub Date: 2020-09-21. DOI: 10.1145/3388440.3414909

Mengmeng Kuang, H. Ting


Abstract


A data-centric pipeline using convolutional neural network to select better multiple sequence alignment method

Multiple sequence alignment (MSA) is widely used to infer the evolutionary relationships among input sequences and the functional or structural roles of aligned residues. Traditionally, the MSA problem has been tackled by algorithm-centric approaches that apply classical algorithms (such as dynamic programming and divide-and-conquer) and proven strategies (such as progressive, non-progressive, consistency-based, and iterative-refinement strategies). Different single-algorithm MSA methods achieve different accuracies on protein families of different similarity levels. Therefore, to combine the advantages of different MSA methods, we present a new data-centric pipeline that uses a convolutional neural network (CNN) [3] to choose the better MSA method for protein families of different similarity. An MSA closely resembles a 2D picture with a clear hierarchical structure: its conserved regions and the corresponding conserved columns can be seen as boxes and lines in a picture. CNNs are known to be very good at recognizing imperfect, noisy pictures, which suggests that they may also perform well at recognizing draft MSAs. Briefly, the method first uses a quick MSA method to construct large-scale draft MSAs from simulated protein families produced by the protein simulation tool INDELible [1]. The key step is to train a CNN classifier that takes a draft MSA as input and outputs the better MSA method. In our research, we simulated more than 640,000 protein families, with the number of sequences per family ranging from 3 to 64. The fastest (but less accurate) mode of a well-known MSA tool, Mafft(FFT-NS-1) [2], was used with default parameters to construct draft MSAs from those families. We regard these MSAs as two-color images, with one color for the aligned residues and the other for the gaps. Two CNN layers and a fully connected layer with a dropout of 0.5 were used to train the decision model. The preliminary results suggest more than 85% classification accuracy in choosing the better alignment solution between the newest versions of Mafft(L-INS-i) and Mafft(G-INS-i). Currently, we are improving the performance of this pipeline by selecting better categories for protein families and fine-tuning the decision model.
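The two-color encoding described in the abstract can be illustrated with a short sketch. This is not the authors' code: the function names, the gap characters, the fixed 64-row canvas (chosen here only because families have at most 64 sequences), and the choice to pad unused cells with the gap color are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

GAP_CHARS = "-."  # assumption: characters treated as gaps in the draft MSA


def parse_aligned_fasta(text: str) -> Dict[str, str]:
    """Parse aligned FASTA text into {header: aligned_sequence}."""
    records: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Fall back to a generated name if the header is empty.
            name = line[1:].strip() or f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def msa_to_two_color_image(
    rows: List[str], height: int = 64, width: Optional[int] = None
) -> List[List[int]]:
    """Encode an MSA as a binary grid: 1 = aligned residue, 0 = gap.

    The grid is padded with 0 up to height x width (hypothetical choice),
    so padding is indistinguishable from gaps in this sketch.
    """
    if not rows:
        raise ValueError("alignment is empty")
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned rows must have the same length")
    width = length if width is None else width
    if len(rows) > height or length > width:
        raise ValueError("alignment does not fit in the requested canvas")
    image = [[0] * width for _ in range(height)]
    for i, row in enumerate(rows):
        for j, ch in enumerate(row):
            if ch not in GAP_CHARS:
                image[i][j] = 1
    return image


if __name__ == "__main__":
    draft = ">a\nMK-LV\n>b\nM--LV\n>c\nMKALV\n"
    rows = list(parse_aligned_fasta(draft).values())
    for line in msa_to_two_color_image(rows, height=4, width=6):
        print(line)
```

A grid like this could then be fed to a small CNN as a single-channel image. How the authors handled alignments of different widths is not stated in the abstract, so the padding scheme above should be read as one possible choice.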
