Francisco Claude, A. Fariña, Miguel A. Martínez-Prieto, G. Navarro
2010 IEEE International Conference on BioInformatics and BioEngineering. Published 2010-05-31. DOI: https://doi.org/10.1109/BIBE.2010.22
Compressed q-Gram Indexing for Highly Repetitive Biological Sequences
The study of compressed storage schemes for highly repetitive sequence collections has recently been boosted by the availability of cheaper sequencing technologies and the flood of data they promise to generate. The goals of such schemes range from simply retrieving whole individual sequences to the more advanced one of providing fast searches in the collection. In this paper we study alternatives for implementing a particularly popular index, namely, one able to find all the positions in the collection of substrings of fixed length ($q$-grams). We introduce two novel techniques and show that they constitute practical alternatives for handling this scenario. They excel particularly in two cases: when $q$ is small (up to 6), and when the collection is extremely repetitive (less than 0.01% mutations).
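To make the query concrete, the following is a minimal sketch of a plain, uncompressed $q$-gram index: a table mapping each $q$-gram to every position where it occurs in the collection. It is not one of the two compressed techniques introduced in the paper, whose details the abstract does not give; it only illustrates the operation those techniques support. The names `build_qgram_index` and `locate` and the toy collection are assumptions made for this illustration.

```python
from collections import defaultdict


def build_qgram_index(sequences, q):
    """Map each q-gram to the list of (sequence_id, offset) pairs where it occurs.

    Hypothetical helper for illustration: a plain inverted index, with no
    compression of the position lists.
    """
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        # Slide a window of length q over the sequence; sequences shorter
        # than q contribute nothing because the range is empty.
        for offset in range(len(seq) - q + 1):
            index[seq[offset:offset + q]].append((seq_id, offset))
    return dict(index)


def locate(index, qgram):
    """Return all occurrences of qgram, or an empty list if it never occurs."""
    return index.get(qgram, [])


# Toy collection: two nearly identical sequences differing in the last base.
collection = ["ACGTACGT", "ACGTACGA"]
index = build_qgram_index(collection, q=4)

print(locate(index, "ACGT"))  # [(0, 0), (0, 4), (1, 0)]
print(locate(index, "ACGA"))  # [(1, 4)]
print(locate(index, "TTTT"))  # []
```

In a highly repetitive collection most position lists in such a table are near-copies of one another across sequences, which is the redundancy a compressed representation would aim to exploit.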