{"title":"大规模并行处理器上基于架构感知优化的HMMER 3.0细粒度加速","authors":"Hanyu Jiang, N. Ganesan","doi":"10.1109/IPDPSW.2015.107","DOIUrl":null,"url":null,"abstract":"HMMER search used for protein Motif finding which is a probabilistic method based on profile hidden Markov models, is one of popular tools for protein homology sequence search. The current version of HMMER (version 3.0) is highly optimized for performance on multi-core and SSE-supported systems while maintaining accuracy. The computational workhorse of the HMMER 3.0 task-pipeline, the MSV and P7Viterbi stages together consume about 95% of the execution time. These two stages can prove to be a significant bottleneck for the current implementation, and can be accelerated via architecture-aware reformulation of the algorithm, along with hybrid task and data level parallelism. In this work we target the core-segments of HMMER3 hmmsearch tool viz. The MSV and the P7Viterbi and present a fine grained parallelization scheme designed and implemented on Graphics Processing Units (GPUs). This three-tiered approach, parallelizes scoring of a sequence across each warp, multiple sequences within each block and multiple blocks within the device. At the fine-grained level, this technique naturally takes advantage of the concurrency of threads within a warp, and completely eliminates the overhead of synchronization. The HMM used for the MSV and P7Viterbi segments share several core features, with few differences. Hence the techniques developed for acceleration of the MSV segment can also be readily applied to the P7Viterbi segment. However, the presence of additional D-D transitions in the HMM for P7Viterbi induces sequential dependencies. This is handled by implementing the Lazy-F procedure as in HMMER 3.0 but for SIMT architectures in a warp-synchronous fashion. Finally, we also study scalability across multiple devices of early Fermi Architecture. Compared to the core-segments, MSV and P7Viterbi of the optimized HMMER3 task pipeline, our implementation achieves up to 5.4-fold speedup for MSV, 2.9-fold speedup for P7viterbi and 3.8-fold speedup for combined pipeline of them on a single Kepler GPU while preserving the sensitivity and accuracy of HMMER 3.0. Multi-GPU implementation on Fermi architecture yields up to 7.8× speedup.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"82 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Fine-Grained Acceleration of HMMER 3.0 via Architecture-Aware Optimization on Massively Parallel Processors\",\"authors\":\"Hanyu Jiang, N. Ganesan\",\"doi\":\"10.1109/IPDPSW.2015.107\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"HMMER search used for protein Motif finding which is a probabilistic method based on profile hidden Markov models, is one of popular tools for protein homology sequence search. The current version of HMMER (version 3.0) is highly optimized for performance on multi-core and SSE-supported systems while maintaining accuracy. The computational workhorse of the HMMER 3.0 task-pipeline, the MSV and P7Viterbi stages together consume about 95% of the execution time. 
These two stages can prove to be a significant bottleneck for the current implementation, and can be accelerated via architecture-aware reformulation of the algorithm, along with hybrid task and data level parallelism. In this work we target the core-segments of HMMER3 hmmsearch tool viz. The MSV and the P7Viterbi and present a fine grained parallelization scheme designed and implemented on Graphics Processing Units (GPUs). This three-tiered approach, parallelizes scoring of a sequence across each warp, multiple sequences within each block and multiple blocks within the device. At the fine-grained level, this technique naturally takes advantage of the concurrency of threads within a warp, and completely eliminates the overhead of synchronization. The HMM used for the MSV and P7Viterbi segments share several core features, with few differences. Hence the techniques developed for acceleration of the MSV segment can also be readily applied to the P7Viterbi segment. However, the presence of additional D-D transitions in the HMM for P7Viterbi induces sequential dependencies. This is handled by implementing the Lazy-F procedure as in HMMER 3.0 but for SIMT architectures in a warp-synchronous fashion. Finally, we also study scalability across multiple devices of early Fermi Architecture. Compared to the core-segments, MSV and P7Viterbi of the optimized HMMER3 task pipeline, our implementation achieves up to 5.4-fold speedup for MSV, 2.9-fold speedup for P7viterbi and 3.8-fold speedup for combined pipeline of them on a single Kepler GPU while preserving the sensitivity and accuracy of HMMER 3.0. Multi-GPU implementation on Fermi architecture yields up to 7.8× speedup.\",\"PeriodicalId\":340697,\"journal\":{\"name\":\"2015 IEEE International Parallel and Distributed Processing Symposium Workshop\",\"volume\":\"82 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-05-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Parallel and Distributed Processing Symposium Workshop\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2015.107\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2015.107","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HMMER search, a probabilistic method based on profile hidden Markov models used for protein motif finding, is one of the most popular tools for protein homology sequence search. The current version, HMMER 3.0, is highly optimized for performance on multi-core, SSE-capable systems while maintaining accuracy. The computational workhorses of the HMMER 3.0 task pipeline, the MSV and P7Viterbi stages, together consume about 95% of the execution time. These two stages are a significant bottleneck of the current implementation and can be accelerated through an architecture-aware reformulation of the algorithm combined with hybrid task- and data-level parallelism. In this work we target the core segments of the HMMER3 hmmsearch tool, namely MSV and P7Viterbi, and present a fine-grained parallelization scheme designed and implemented on Graphics Processing Units (GPUs). This three-tiered approach parallelizes the scoring of a single sequence across each warp, multiple sequences within each block, and multiple blocks within the device. At the fine-grained level, the technique naturally exploits the concurrency of threads within a warp and completely eliminates synchronization overhead. The HMMs used for the MSV and P7Viterbi segments share several core features, with few differences, so the techniques developed to accelerate the MSV segment can be readily applied to the P7Viterbi segment. However, the additional D-D transitions in the P7Viterbi HMM induce sequential dependencies; these are handled by implementing the Lazy-F procedure of HMMER 3.0 for SIMT architectures in a warp-synchronous fashion. Finally, we also study scalability across multiple devices of the earlier Fermi architecture. Compared to the core segments MSV and P7Viterbi of the optimized HMMER3 task pipeline, our implementation achieves up to 5.4-fold speedup for MSV, 2.9-fold speedup for P7Viterbi, and 3.8-fold speedup for the combined pipeline on a single Kepler GPU, while preserving the sensitivity and accuracy of HMMER 3.0. The multi-GPU implementation on the Fermi architecture yields up to a 7.8× speedup.
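The abstract describes, but does not show, the first tier of the scheme: one warp scoring one target sequence. The following is a minimal CUDA sketch of that idea only, not the authors' implementation or HMMER 3.0 code. It assumes a simplified MSV-style recurrence and a model width that fits in a single warp; every identifier here (msv_warp_sketch, msc, seq_off, seq_len) is hypothetical and introduced purely for illustration.

// Minimal sketch (hypothetical, not the paper's code): one warp scores one
// digitized sequence against a simplified MSV-style profile. Each lane holds
// one model position k, and the diagonal dependency M[i][k] <- M[i-1][k-1]
// is passed between lanes with a warp shuffle, so no block-level
// synchronization (__syncthreads) is needed.
#include <cstdio>
#include <cuda_runtime.h>

#define NEG_INF (-1e30f)

__global__ void msv_warp_sketch(const int   *seqs,     // concatenated digitized sequences
                                const int   *seq_len,  // length of each sequence
                                const int   *seq_off,  // start offset of each sequence
                                const float *msc,      // match scores, msc[residue * M + k]
                                int          M,        // model positions, assumed <= 32 here
                                float       *scores)   // best score per sequence
{
    int lane = threadIdx.x & 31;                              // model position for this lane
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // one warp per sequence

    const int *x = seqs + seq_off[warp];
    int L = seq_len[warp];

    float m_prev = NEG_INF;   // M[i-1][k] held by this lane
    float best   = NEG_INF;   // running maximum over all cells (simplified "xE")

    for (int i = 0; i < L; ++i) {
        // Fetch M[i-1][k-1] from the lane to the left; lane 0 restarts from 0,
        // a stand-in for the begin-state contribution in the full MSV model.
        float diag = __shfl_up_sync(0xffffffffu, m_prev, 1);
        if (lane == 0) diag = 0.0f;

        float m_cur = NEG_INF;
        if (lane < M)
            m_cur = fmaxf(diag, 0.0f) + msc[x[i] * M + lane];

        best   = fmaxf(best, m_cur);
        m_prev = m_cur;
    }

    // Warp-wide max reduction of the per-lane best scores.
    for (int off = 16; off > 0; off >>= 1)
        best = fmaxf(best, __shfl_down_sync(0xffffffffu, best, off));

    if (lane == 0) scores[warp] = best;
}

Because the diagonal dependency is resolved entirely with intra-warp shuffles, the threads of a warp proceed in lockstep without any explicit barrier, which illustrates the synchronization-free, warp-level execution the abstract refers to; the remaining two tiers (multiple sequences per block, multiple blocks per device) would come from launching many such warps per block and many blocks per grid.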