Fast and scalable protein motif sequence clustering based on Hadoop framework

Erfan Farhangi, Nasser Ghadiri, Mahsa Asadi, M. Nikbakht, Sylvain Pitre
Published in: 2017 3th International Conference on Web Research (ICWR), 2017-04-01
DOI: 10.1109/ICWR.2017.7959300 (https://doi.org/10.1109/ICWR.2017.7959300)
Citations: 5

Abstract

In recent years, large amounts of scattered, unstructured data have accumulated on the web. As this data grows explosively, there is an increasing need for effective methods, such as clustering, to analyze it and extract information. Biological data forms an important part of the unstructured data on the web, and protein sequence databases are considered a primary source of biological data. Clustering can help organize sequences into homologous and functionally similar groups and can speed up data processing and analysis. Proteins are responsible for most of the activities in cells, and the majority of proteins carry out their function through interaction with other proteins. Hence, prediction of protein interactions is an important research area in the biomedical sciences. Motifs are fragments that occur frequently in protein sequences. A well-known method for determining protein interactions is based on motif clustering. Existing motif clustering methods share a limitation on the number of clusters. However, given the vast number of motifs and the need for a large number of clusters, an efficient, scalable and fast method is needed to cluster such a large number of sequences. In this paper, we propose a novel approach to clustering a large number of motifs. Our approach includes extracting motifs from protein sequences, feature selection, preprocessing, dimension reduction, and running BigFCM (a large-scale fuzzy clustering algorithm) on several distributed nodes with the Hadoop framework to take advantage of MapReduce programming. Experimental results show very good performance of our approach.
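The abstract names fuzzy clustering run as MapReduce jobs but gives no algorithmic detail. As a rough illustration of the general idea only, the sketch below implements plain fuzzy c-means (not BigFCM, whose details are not described here) with the center update split into a map step over data partitions and a reduce step that sums the partial results. All function names, the fuzzifier `m = 2.0`, and the use of numeric feature vectors are assumptions made for this example; it runs in a single process rather than on Hadoop.

```python
# Illustrative sketch only: plain fuzzy c-means, NOT the BigFCM algorithm.
# All names and parameter values here are hypothetical.
from functools import reduce


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in each cluster (standard FCM rule)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def map_partition(points, centers, m=2.0):
    """Map step: partial weighted sums for one partition of the data."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    weights = [0.0] * len(centers)
    for point in points:
        for k, u in enumerate(memberships(point, centers, m)):
            w = u ** m
            weights[k] += w
            for j in range(dim):
                sums[k][j] += w * point[j]
    return sums, weights


def combine(a, b):
    """Reduce step: add two partial results element-wise."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    weights = [x + y for x, y in zip(a[1], b[1])]
    return sums, weights


def fcm(partitions, centers, m=2.0, iterations=20):
    """Repeat map + reduce for a fixed number of rounds, updating centers."""
    for _ in range(iterations):
        partials = [map_partition(part, centers, m) for part in partitions]
        sums, weights = reduce(combine, partials)
        # Keep the old center if a cluster received no weight at all.
        centers = [[s / w for s in row] if w > 0 else old
                   for row, w, old in zip(sums, weights, centers)]
    return centers


if __name__ == "__main__":
    # Two partitions stand in for data held on two nodes.
    partitions = [[[0.0, 0.0], [0.1, 0.0]], [[5.0, 5.0], [5.1, 5.0]]]
    print(fcm(partitions, centers=[[0.0, 1.0], [4.0, 4.0]]))
```

Because each partition contributes only per-cluster sums and weights, the amount of data passed to the reduce step depends on the number of clusters and feature dimensions, not on the number of sequences, which is the property that makes this update shape suitable for distributed execution.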