ClustDB: A High-Performance Tool for Large Scale Sequence Matching

17th International Workshop on Database and Expert Systems Applications (DEXA'06)
Pub Date: 2006-09-04
DOI: 10.1109/DEXA.2006.40

J. Kleffe, Friedrich Möller, B. Wittig


Citations: 6

Abstract



High-throughput sampling of expressed sequence tags (ESTs) has generated huge collections of transcripts that are difficult to compare with each other using existing tools for sequence matching. The major problem is lack of computer memory. We therefore present a new exact and memory-efficient algorithm for the simultaneous identification of matching substrings in large sets of sequences. Applied to more than six million human ESTs in GenBank as of 2005-04-06, comprising more than 3.3 billion base pairs, it takes less than four hours to find all of the more than seven million clusters of multiple substrings of at least 50 nucleotides in length, using a standard PC with 2 GB of RAM and a 2.8 GHz processor. The corresponding program, ClustDB, is able to handle at least eight times more data than VMATCH, the most memory-efficient exact software known today. Our program is freely available for academic use.
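To make the task concrete, the following is a minimal sketch of what "finding clusters of matching substrings" means on a toy input. It is not the ClustDB algorithm, which the abstract does not describe; it is a naive hash-based baseline, and the function name `matching_substring_clusters` and its parameters are hypothetical. It stores every window of the input in memory, which is exactly the cost a memory-efficient method has to avoid at the scale of billions of base pairs.

```python
from collections import defaultdict


def matching_substring_clusters(sequences, min_len):
    """Group exact substrings of length min_len that occur more than once.

    Naive illustrative baseline (an assumption, not the paper's method):
    every window of every sequence is kept in a dictionary, so memory use
    grows with the total input size.
    """
    occurrences = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        # Slide a window of min_len characters over the sequence.
        for start in range(len(seq) - min_len + 1):
            window = seq[start:start + min_len]
            occurrences[window].append((seq_id, start))
    # A cluster is a substring together with at least two occurrences.
    return {sub: locs for sub, locs in occurrences.items() if len(locs) > 1}


if __name__ == "__main__":
    toy = ["ACGTACGT", "TTACGTAA", "GGGG"]
    for sub, locs in sorted(matching_substring_clusters(toy, 4).items()):
        print(sub, locs)
```

Each cluster maps a shared substring to the (sequence index, start position) pairs where it occurs. In the setting of the abstract, `min_len` would be 50 and the input would be millions of ESTs, at which point a dictionary of all windows no longer fits in 2 GB of RAM.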
