Using Horizontally Scale Infrastructure in Searching for Similarity in Genome Data of Ecosystems

A. Tskhai, S. Murzintsev
DOI: 10.1109/CSGB.2018.8544736
URL: https://doi.org/10.1109/CSGB.2018.8544736
Published in: 2018 11th International Multiconference Bioinformatics of Genome Regulation and Structure\Systems Biology (BGRS\SB)
Publication date: 2018-08-01

Abstract

A special-purpose computer system has been developed for processing reference information (for example, from ENSEMBL, GenBank, and KEGG), namely for rapidly comparing the genomes of organisms in order to discover recurring sets of nucleotides. Because processing the source information produces a large amount of data, the system was moved to a non-relational database, which is more flexible and scalable. The approach is based on the distributed non-relational database MongoDB and the Winnowing data-processing algorithm. To identify genetic similarity with a non-relational database, the fingerprints of structural genomic variations are represented in "key-value" form. The developed model was implemented in software. Computing experiments were performed: (1) loading data into a database using one and three shards (servers where the data is stored and the information is searched and processed); (2) searching for matches between genomes and the genome database using one and three shards; (3) measuring the speed of searching for genomes in the database; (4) measuring the rate of loading genomes into the database. The experiments confirmed that the proposed method of searching for genetic similarity is feasible, for example for analyzing deviations at the gene level. The work can be continued in the following directions: (1) determining the moment at which a node must be added to the cluster as the number of deviations considered and the number of genomes in the database of organisms increase; (2) studying genomic disorders to assess the probability of genetic abnormalities at the stage of recognizing a potentially unfavorable development of the situation.
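
The abstract names Winnowing and a "key-value" representation but gives no implementation details, so the following is only a minimal sketch of how the two could fit together. It assumes the textbook form of Winnowing (hash every k-mer, then keep the minimum hash of each sliding window). The function names, the parameters `k` and `w`, and the MD5-based hash are all hypothetical choices made for illustration. An in-memory dictionary stands in for the MongoDB collection, and sharding is not modelled.

```python
import hashlib
from collections import defaultdict


def kmer_hash(kmer: str) -> int:
    """Deterministic 32-bit hash of a k-mer (first 4 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(kmer.encode("ascii")).digest()[:4], "big")


def winnow(sequence: str, k: int = 5, w: int = 4) -> list[tuple[int, int]]:
    """Return (hash, position) fingerprints selected by Winnowing.

    Each window of w consecutive k-mer hashes contributes its minimum hash.
    Ties go to the rightmost minimum, and a position is recorded only once.
    Sequences shorter than k + w - 1 yield no fingerprints.
    """
    hashes = [kmer_hash(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]
    fingerprints: list[tuple[int, int]] = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        pos = start + (w - 1 - window[::-1].index(smallest))
        if not fingerprints or fingerprints[-1][1] != pos:
            fingerprints.append((smallest, pos))
    return fingerprints


def build_index(genomes: dict[str, str], k: int = 5, w: int = 4) -> dict[int, set[str]]:
    """Key-value index: fingerprint hash -> names of genomes containing it.

    Assumption: a plain dict replaces the database collection used in the paper.
    """
    index: dict[int, set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for h, _ in winnow(seq, k, w):
            index[h].add(name)
    return index


def similarity(query: str, index: dict[int, set[str]], k: int = 5, w: int = 4) -> dict[str, float]:
    """Fraction of the query's distinct fingerprints found in each stored genome."""
    query_hashes = {h for h, _ in winnow(query, k, w)}
    hits: dict[str, int] = defaultdict(int)
    for h in query_hashes:
        for name in index.get(h, ()):
            hits[name] += 1
    return {name: count / len(query_hashes) for name, count in hits.items()}


if __name__ == "__main__":
    genomes = {
        "organism_a": "ACGTACGTTAGCCGATAGCTAGGCTA",
        "organism_b": "TTGACCATGGCATCGATTTACGGAC",
    }
    index = build_index(genomes)
    print(similarity("TTTTACGTACGTTAGCGGGG", index))
```

With these settings, any substring of at least `k + w - 1` nucleotides shared by two sequences produces at least one common fingerprint. This is the property that makes a lookup by fingerprint key sufficient to find candidate matches. In a sharded deployment the fingerprint hash would be a natural candidate for the shard key, although the abstract does not say which key the authors used.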