Clustering of Human Intestine Microbiomes with K-Means

2018 21st Saudi Computer Society National Computer Conference (NCC) Pub Date : 2018-04-01 DOI:10.1109/NCG.2018.8593154

Wesam Sami Taie, Yasser Omar, A. Badr


Citations: 5

Abstract



Clustering of Human Intestine Microbiomes with K-Means

Most studies report that microbiota make up 1–3% of human body mass. The human gut and intestine host several types of microorganisms that are important for human health and disease. Understanding the behavior of the human gut and intestinal microbiomes therefore increases the chance of detecting and predicting disease earlier, so that precautions for treatment can be taken. Because time is an important factor in collecting more information about gut and intestinal microbiota, the proposed work uses the 16S rRNA metagenomic approach, a well-suited, knowledge-based way to understand the human microbiota much faster. A nucleotide database of bacterial 16S rRNA gene sequences isolated from human intestinal and fecal samples was used to develop a microbiota microarray that includes the human intestine microbiomes, their protein information, and the weight of each protein in the dataset, calculated using two techniques: the K-Means clustering algorithm and the Needleman-Wunsch algorithm. The main contribution of this work is avoiding the time cost of the Needleman-Wunsch sequence alignment algorithm when assigning weights to a large set of 56,117 proteins. In the validation experiments, the microarray correctly identified genomic DNA from all 18 bacterial species used. The analytical study of this approach on the dataset shows that calculating the alignment distance for a large number of sequences becomes more efficient and faster when features are first extracted and used to cluster the dataset into 8 clusters, which reduces the runtime for the full dataset from 2 years to 3 days. The microbiota microarray will be clustered using a genetic algorithm that takes into account the protein weights assigned by the Needleman-Wunsch algorithm, grouping the proteins of the human intestine microbiomes into k clusters in order to identify proteins of unknown structure and to obtain the interactions between all proteins using a protein-protein interaction model.
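The idea of clustering first and aligning afterwards can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which features were extracted or how the clusters were used, so the symbol-composition feature, the scoring values (match 1, mismatch -1, gap -1), the deterministic centroid seeding, and all function names below are assumptions chosen for illustration. The sketch groups sequences with a plain k-means and then runs Needleman-Wunsch only on pairs that share a cluster.

```python
from collections import Counter
from itertools import combinations


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (scoring values are illustrative)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: prefix of b vs gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a vs gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def composition(seq, alphabet="ACGT"):
    """Assumed feature vector: fraction of each alphabet symbol in seq."""
    counts = Counter(seq)
    n = max(len(seq), 1)  # avoid division by zero for an empty sequence
    return [counts[ch] / n for ch in alphabet]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def within_cluster_scores(seqs, labels):
    """Align only pairs that share a cluster label; returns {(i, j): score}."""
    return {
        (i, j): needleman_wunsch_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
        if labels[i] == labels[j]
    }


# Toy data, not taken from the paper's dataset.
seqs = ["AAAAAC", "GGGGGT", "AAAACA", "GGGTGG"]
labels = kmeans([composition(s) for s in seqs], k=2)
scores = within_cluster_scores(seqs, labels)
print(f"aligned {len(scores)} of {pair_count(len(seqs))} possible pairs")
```

For scale, aligning every pair among 56,117 sequences would take 56117 × 56116 / 2, roughly 1.57 billion alignments. Restricting alignment to within-cluster pairs lowers that count, by an amount that depends on the cluster sizes, which the abstract does not report.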

