SIMD-accelerated regular expression matching

International Workshop on Data Management on New Hardware
Pub Date: 2016-06-26
DOI: 10.1145/2933349.2933357

Evangelia A. Sitaridi, Orestis Polychroniou, K. A. Ross

Citations: 19

Abstract

String processing tasks are common in analytical queries powering business intelligence. Besides substring matching, provided in SQL by the like operator, popular DBMSs also support regular expressions as selective filters. Substring matching can be optimized by using specialized SIMD instructions on mainstream CPUs, reaching the performance of numeric column scans. However, generic regular expressions are harder to evaluate, being dependent on both the DFA size and the irregularity of the input. Here, we optimize matching string columns against regular expressions using SIMD-vectorized code. Our approach avoids accessing the strings in lockstep without branching, to exploit cases when some strings are accepted or rejected early by looking at the first few characters. On common string lengths, our implementation is up to 2X faster than scalar code on a mainstream CPU and up to 5X faster on the Xeon Phi co-processor, improving regular expression support in DBMSs.
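The idea of not processing strings in lockstep can be illustrated with a small scalar emulation. The sketch below is not the authors' SIMD implementation: the DFA (for the hypothetical pattern `^ab.*$`), the names `match_column`, `TRANSITIONS`, and `lanes`, and the lane-refill policy are all assumptions chosen for illustration. Each "lane" holds a different string and advances one character per step; a lane is released as soon as its string reaches an accepting or rejecting sink state, so it can pick up the next string rather than wait for the longest string in its group.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical DFA for ^ab.*$ (an assumption, not taken from the paper):
# 0 = start, 1 = saw "a", 2 = accepting sink, 3 = rejecting sink.
START, ACCEPT_SINK, REJECT_SINK = 0, 2, 3
TRANSITIONS: Dict[Tuple[int, str], int] = {(0, "a"): 1, (1, "b"): ACCEPT_SINK}
ACCEPTING: Set[int] = {ACCEPT_SINK}
SINKS: Set[int] = {ACCEPT_SINK, REJECT_SINK}


def match_column(column: List[str], lanes: int = 4) -> Tuple[List[bool], int]:
    """Match every string against the DFA, emulating `lanes` SIMD lanes.

    Returns the per-string match flags and the number of lane-wide steps.
    """
    results = [False] * len(column)
    lane_str: List[Optional[int]] = [None] * lanes  # string index held by each lane
    lane_pos = [0] * lanes                          # next character offset per lane
    lane_state = [START] * lanes                    # current DFA state per lane
    next_str = 0
    steps = 0
    while True:
        # Refill idle lanes with the next unprocessed strings.
        for l in range(lanes):
            if lane_str[l] is None and next_str < len(column):
                lane_str[l], lane_pos[l], lane_state[l] = next_str, 0, START
                next_str += 1
        if all(i is None for i in lane_str):
            break
        steps += 1
        # One "vector" step: every busy lane consumes at most one character.
        for l in range(lanes):
            i = lane_str[l]
            if i is None:
                continue
            s = column[i]
            if lane_pos[l] < len(s):
                key = (lane_state[l], s[lane_pos[l]])
                lane_state[l] = TRANSITIONS.get(key, REJECT_SINK)
                lane_pos[l] += 1
            # Release the lane early on a sink state, or at end of string.
            if lane_state[l] in SINKS or lane_pos[l] == len(s):
                results[i] = lane_state[l] in ACCEPTING
                lane_str[l] = None
    return results, steps
```

For example, `match_column(["abcdef", "xyzxyz", "a", "ab", "", "abab"], lanes=2)` decides most strings after one or two characters. A real vectorized version would replace the inner loop with gather and masked SIMD instructions; that mapping is left out here.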



