Asynchronous Automata Processing on GPUs

Proceedings of the ACM on Measurement and Analysis of Computing Systems · Pub Date: 2023-02-27 · DOI: 10.1145/3579453

Hongyuan Liu, Sreepathi Pai, Adwait Jog


Citation count: 1

Abstract

Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: (1) input-stream level, (2) automaton level, and (3) state level. Among these, only state-level parallelism is intrinsic to automata, while the other two depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism, originating from the input symbols, that has not been leveraged before. Making the matching process asynchronous, i.e., having parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length. When the task does not have enough parallelism to utilize all the GPU cores, a detailed evaluation across 12 applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize the GPU cores, AsyncAP still achieves a 2.4× speedup.
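The idea of starting the match from different input locations can be illustrated with a small sketch. This is not AsyncAP's implementation: the toy NFA for the pattern `ab+c`, the function names (`match_serial`, `match_from`, `match_async`), and the use of a CPU thread pool as a stand-in for GPU threads are all assumptions made for illustration. The sketch only shows why, for an automaton whose start state is re-activated at every input symbol, each start position is an independent unit of work whose reports can be merged afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy NFA for the pattern "ab+c" (not taken from the paper).
# TRANSITIONS[state][symbol] -> set of next states
TRANSITIONS = {
    0: {"a": {1}},
    1: {"b": {2}},
    2: {"b": {2}, "c": {3}},
}
START, ACCEPTING = 0, {3}


def step(states, symbol):
    """Advance a set of active states by one input symbol."""
    nxt = set()
    for s in states:
        nxt |= TRANSITIONS.get(s, {}).get(symbol, set())
    return nxt


def match_serial(text):
    """Baseline: one pass over the input, re-activating START at every symbol."""
    active, reports = set(), set()
    for i, ch in enumerate(text):
        active = step(active | {START}, ch)
        if active & ACCEPTING:
            reports.add(i)  # a match ends at position i
    return reports


def match_from(text, start):
    """Work for a single start position: run until no state is active."""
    active, reports = {START}, set()
    for i in range(start, len(text)):
        active = step(active, text[i])
        if not active:
            break  # this start position can no longer produce a match
        if active & ACCEPTING:
            reports.add(i)
    return reports


def match_async(text, workers=4):
    """Process every start position independently, then merge the reports."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda s: match_from(text, s), range(len(text))))
    return set().union(*partial)


if __name__ == "__main__":
    text = "xabbcabcab"
    print(sorted(match_serial(text)))
    print(sorted(match_async(text)))
```

Both functions report the same match end positions because an NFA step distributes over the union of active-state sets, so the serial active set is the union of the per-start-position runs. Most start positions die out after one or two symbols, which keeps the extra work small in this toy case; how a GPU engine schedules and bounds that work is outside the scope of this sketch.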

