One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution

L. Cicolini, F. Carloni, Marco D. Santambrogio, Davide Conficconi
Published in: 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 193-206
Publication date: 2024-03-02
DOI: 10.1109/CGO57630.2024.10444810 (https://doi.org/10.1109/CGO57630.2024.10444810)
Citations: 0

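Abstract

Regular Expression (RE) matching is crucial for identifying strings that exhibit certain morphological properties in a data stream, which makes it paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, because of their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. Most of them, however, lack a comprehensive ruleset optimization approach that truly pushes pattern-matching performance when multiple REs are considered together. Exploiting the morphological similarities among REs within the same dataset reduces the memory needed to store the patterns and drastically improves dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automata (MFSA), which extends the Finite State Automata (FSA) model to improve RE parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework that manages RE merging and optimization to produce MFSA(s). Furthermore, we extend the iNFAnt algorithm for MFSA execution with the novel iMFAnt engine. Our evaluation investigates the impact of the MFSA size reduction and compares the execution throughput with that of multiple FSAs in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states while introducing limited compilation-time overhead. In addition, the best iMFAnt configuration achieves a geomean $5.99\times$ throughput improvement and a $4.05\times$ speedup against single and multiple parallel FSAs.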
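To make the idea of sharing states across a ruleset concrete, here is a minimal sketch in Python. It is not the MFSA construction or the iMFAnt engine, neither of which is detailed in the abstract. It assumes the patterns are non-empty literal strings and merges only their common prefixes (a trie), which is a much simpler stand-in for merging full REs. The names `separate_state_count` and `MergedAutomaton` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Set, Tuple


def separate_state_count(patterns: List[str]) -> int:
    """States needed if every literal pattern gets its own chain automaton.

    A chain for a pattern of length n has n + 1 states.
    """
    return sum(len(p) + 1 for p in patterns)


class MergedAutomaton:
    """Prefix-sharing automaton over literal patterns (illustrative only)."""

    def __init__(self, patterns: List[str]) -> None:
        # transitions[state] maps an input symbol to the next state.
        self.transitions: List[Dict[str, int]] = [{}]
        # accepting[state] holds the ids of the rules that end in that state.
        self.accepting: Dict[int, Set[int]] = {}
        for rule_id, pattern in enumerate(patterns):
            self._add(rule_id, pattern)

    def _add(self, rule_id: int, pattern: str) -> None:
        state = 0
        for symbol in pattern:
            nxt = self.transitions[state].get(symbol)
            if nxt is None:
                # No shared prefix past this point: allocate a new state.
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][symbol] = nxt
            state = nxt
        self.accepting.setdefault(state, set()).add(rule_id)

    @property
    def num_states(self) -> int:
        return len(self.transitions)

    def match(self, text: str) -> Set[Tuple[int, int]]:
        """Return (rule_id, end_index) for every occurrence of any pattern."""
        matches: Set[Tuple[int, int]] = set()
        active: Set[int] = set()
        for i, symbol in enumerate(text):
            active.add(0)  # unanchored search: a match may start anywhere
            next_active: Set[int] = set()
            for state in active:
                nxt = self.transitions[state].get(symbol)
                if nxt is not None:
                    next_active.add(nxt)
                    for rule_id in self.accepting.get(nxt, ()):
                        matches.add((rule_id, i))
            active = next_active
        return matches


# Toy ruleset with shared prefixes (made-up data, not from the paper).
rules = ["GATTACA", "GATTTC", "GATC"]
merged = MergedAutomaton(rules)
before = separate_state_count(rules)   # 8 + 7 + 5 = 20 states
after = merged.num_states              # 11 states once prefixes are shared
print(f"states: {before} -> {after} ({1 - after / before:.0%} fewer)")
print(sorted(merged.match("AGATCGATTACA")))
```

On this toy ruleset a single pass over the input reports matches for all three rules, and the merged automaton needs 11 states instead of 20. The figures are specific to this example and say nothing about the 71.95% average compression reported in the abstract, which comes from the authors' own framework and datasets.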