Token-based dictionary pattern matching for text analytics

2013 23rd International Conference on Field programmable Logic and Applications. Pub Date: 2013-10-24. DOI: 10.1109/FPL.2013.6645535

R. Polig, K. Atasu, C. Hagleitner

Citations: 9

Abstract

When performing text analytics queries on unstructured text data, a large share of the processing time is spent on regular expressions and dictionary matching. In this paper we present a compilable architecture for token-bound pattern matching with support for token pattern sequence detection. The architecture can match several hundred dictionaries, each containing thousands of elements, at high throughput. A programmable state machine serves as the pattern detection engine to achieve deterministic performance while keeping storage requirements low. For the detection of token sequences, dedicated circuitry is compiled from a non-deterministic automaton. A cascaded result lookup ensures efficient storage while allowing multi-token elements to be detected and multiple dictionary hits to be reported. We implemented the architecture on an Altera Stratix IV GX530 and were able to process up to 16 documents in parallel at a peak throughput of 9.7 Gb/s.
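To make the ideas in the abstract concrete, the following is a minimal software sketch of token-bound dictionary matching. It is not the authors' hardware design: the abstract does not describe the state machine encoding or the result lookup tables, so every name here (`tokenize`, `compile_dictionaries`, `match`) and the word-level tokenization rule are assumptions chosen for illustration. Tokens are first mapped to integer ids, a set of active states is then advanced over the token stream in the manner of a non-deterministic automaton, and a final lookup maps each accepting state to every dictionary that contains the matched element.

```python
import re


def tokenize(text):
    # Assumption: tokens are lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def compile_dictionaries(dictionaries):
    """Build the three illustrative tables from {dictionary name: [entries]}."""
    token_ids = {}   # stage 1: token string -> integer token id
    trie = [{}]      # stage 2: state -> {token id -> next state}; state 0 is the root
    results = {}     # result lookup: accepting state -> set of dictionary names
    for name, entries in dictionaries.items():
        for entry in entries:
            tokens = tokenize(entry)
            if not tokens:
                continue  # skip entries that contain no tokens
            state = 0
            for tok in tokens:
                tid = token_ids.setdefault(tok, len(token_ids))
                if tid not in trie[state]:
                    trie.append({})
                    trie[state][tid] = len(trie) - 1
                state = trie[state][tid]
            results.setdefault(state, set()).add(name)
    return token_ids, trie, results


def match(text, compiled):
    """Return (dictionary, start_token, end_token) hits; end is exclusive."""
    token_ids, trie, results = compiled
    active = []  # partial matches in flight, as (state, start_token) pairs
    hits = []
    for i, tok in enumerate(tokenize(text)):
        tid = token_ids.get(tok)
        next_active = []
        if tid is not None:
            # Advance every partial match and also try to start a new one here.
            for state, start in active + [(0, i)]:
                nxt = trie[state].get(tid)
                if nxt is not None:
                    next_active.append((nxt, start))
                    for name in sorted(results.get(nxt, ())):
                        hits.append((name, start, i + 1))
        active = next_active  # an unknown token drops all partial matches
    return hits


dictionaries = {
    "cities": ["New York", "York"],
    "states": ["New York"],
    "companies": ["New York Times"],
}
compiled = compile_dictionaries(dictionaries)
hits = match("The New York Times is in New York.", compiled)
for hit in hits:
    print(hit)
```

In this sketch the element "New York" is reported once for `cities` and once for `states`, which illustrates multiple dictionary hits for a single multi-token element, and matches only ever begin and end on token boundaries.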


