Title: Token-based dictionary pattern matching for text analytics
Authors: R. Polig, K. Atasu, C. Hagleitner
Published in: 2013 23rd International Conference on Field programmable Logic and Applications
Publication date: 2013-10-24
DOI: 10.1109/FPL.2013.6645535 (https://doi.org/10.1109/FPL.2013.6645535)
Citations: 9
Abstract
When performing text analytics queries on unstructured text data, a large share of the processing time is spent on regular expressions and dictionary matching. In this paper we present a compilable architecture for token-bound pattern matching with support for token pattern sequence detection. The architecture can match against several hundred dictionaries, each containing thousands of elements, at high throughput. A programmable state machine is used as the pattern detection engine to achieve deterministic performance while maintaining low storage requirements. For the detection of token sequences, dedicated circuitry is compiled based on a non-deterministic automaton. A cascaded result lookup ensures efficient storage while allowing multi-token elements to be detected and multiple dictionary hits to be reported. We implemented the architecture on an Altera Stratix IV GX530 and were able to process up to 16 documents in parallel at a peak throughput rate of 9.7 Gb/s.
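To make the terms in the abstract concrete, the following is a minimal software sketch of token-bound dictionary matching. It is not the hardware design from the paper: the function names (`build_matcher`, `find_matches`), the `\w+` tokenizer, the lowercase normalization, and the sample dictionaries are all assumptions chosen for illustration. It shows three ideas only: matches must align with token boundaries, multi-token elements are tracked with a set of simultaneously active partial matches (in the spirit of a non-deterministic automaton), and one match can report hits in several dictionaries.

```python
import re
from collections import defaultdict

# Assumption: a token is a run of word characters; the paper's tokenizer may differ.
TOKEN_RE = re.compile(r"\w+")


def build_matcher(dictionaries):
    """Index dictionary elements by their token sequence.

    Returns (entries, prefixes):
      entries  maps a tuple of tokens to the set of dictionaries containing it
      prefixes holds every proper prefix of a multi-token element
    """
    entries = defaultdict(set)
    prefixes = set()
    for name, phrases in dictionaries.items():
        for phrase in phrases:
            tokens = tuple(t.lower() for t in TOKEN_RE.findall(phrase))
            if not tokens:
                continue
            entries[tokens].add(name)
            for i in range(1, len(tokens)):
                prefixes.add(tokens[:i])
    return entries, prefixes


def find_matches(text, entries, prefixes):
    """Return (start, end, dictionary_names) for every token-aligned match."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in TOKEN_RE.finditer(text)]
    hits = []
    active = []  # partial multi-token matches: (index of first token, tokens so far)
    for i, (tok, _start, end) in enumerate(tokens):
        # Extend every active partial match and also start a new one at this token.
        candidates = [(s, seq + (tok,)) for s, seq in active] + [(i, (tok,))]
        active = []
        for s, seq in candidates:
            if seq in entries:  # complete element: report all dictionaries at once
                hits.append((tokens[s][1], end, sorted(entries[seq])))
            if seq in prefixes:  # may still grow into a longer element
                active.append((s, seq))
    return hits


# Hypothetical sample data, not taken from the paper.
dictionaries = {
    "cities": ["New York", "York"],
    "states": ["New York"],
    "names": ["York"],
}
entries, prefixes = build_matcher(dictionaries)
text = "I love New York, not Yorkshire."
hits = find_matches(text, entries, prefixes)
for start, end, names in hits:
    print(text[start:end], names)
```

In this sketch "Yorkshire" produces no hit for "York" because matching is bound to whole tokens, while "New York" is reported for both `cities` and `states`. A hardware implementation would replace the Python set and dictionary lookups with state-machine transitions and a result memory; those details are outside what the abstract specifies.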