{"title":"Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings","authors":"Jongik Kim, Chen Li, Xiaohui Xie","doi":"10.1109/ICDE.2016.7498238","DOIUrl":null,"url":null,"abstract":"Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"26 1","pages":"169-180"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2016.7498238","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19
Abstract
Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.