{"title":"LaFA: lookahead finite automata for scalable regular expression detection","authors":"M. Bando, S. Artan, Jonathan Chao","doi":"10.1145/1882486.1882496","DOIUrl":"https://doi.org/10.1145/1882486.1882496","url":null,"abstract":"Although Regular Expressions (RegExes) have been widely used in network security applications, their inherent complexity often limits the total number of RegExes that can be detected using a single chip for a reasonable throughput. This limit on the number of RegExes impairs the scalability of today's RegEx detection systems. The scalability of existing schemes is generally limited by the traditional per character state processing and state transition detection paradigm. The main focus of existing schemes is in optimizing the number of states and the required transitions, but not the suboptimal character-based detection method. Furthermore, the potential benefits of reduced number of operations and states using out-of-sequence detection methods have not been explored. In this paper, we propose Looka-head Finite Automata (LaFA) to perform scalable RegEx detection using very small amount of memory. LaFA's memory requirement is very small due to the following three areas of effort described in this paper: (1) Different parts of a RegEx, namely RegEx components, are detected using different detectors, each of which is specialized and optimized for the detection of a certain RegEx component. (2) We systematically reorder the RegEx component detection sequence, which provides us with new possibilities for memory optimization. (3) Many redundant states in classical finite automata are identified and eliminated in LaFA. Our simulations show that LaFA requires an order of magnitude less memory compared to today's state-of-the-art RegEx detection systems. A single commodity Field Programmable Gate Array (FPGA) chip can accommodate up to twenty-five thousand (25k) RegExes. Based on the throughput of our LaFA prototype on FPGA, we estimated that a 34-Gbps throughput can be achieved.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124911447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lock-free, cache-efficient shared ring buffer for multi-core architectures","authors":"P. Lee, T. Bu, Girish P. Chandranmenon","doi":"10.1145/1882486.1882508","DOIUrl":"https://doi.org/10.1145/1882486.1882508","url":null,"abstract":"We propose MCRingBuffer, a lock-free, cache-efficient shared ring buffer that provides fast data accesses among threads running in multi-core architectures. MCRingBuffer seeks to reduce the cost of inter-core communication by allowing concurrent lock-free data accesses and improving the cache locality of accessing control variables used for thread synchronization. Evaluation on an Intel Xeon multi-core machine shows that MCRingBuffer achieves a throughput gain of up to 4.9x over existing concurrent lock-free ring buffers. A motivating application of MCRingBuffer is parallel network traffic monitoring, in which MCRingBuffer facilitates multi-core architectures to process packets at line rate.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124281099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory optimization for packet classification algorithms","authors":"J. Blaho, J. Korenek, V. Pus","doi":"10.1145/1882486.1882524","DOIUrl":"https://doi.org/10.1145/1882486.1882524","url":null,"abstract":"We propose novel method how to reduce data structure size for the family of packet classification algorithms at the cost of additional pipelined processing with only small amount of logic resources. The reduction significantly decreases overhead given by the crossproduct nature of classification rules. Therefore the data structure can be compressed to 10% on average. As high compression ratio is achieved, fast on-chip memory can be used to store data structures and hardware architectures can process network traffic at significantly higher speed.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126335768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junchen Jiang, Yi Tang, B. Liu, Xiaofei Wang, Yang Xu
{"title":"SPC-FA: synergic parallel compact finite automaton to accelerate multi-string matching with low memory","authors":"Junchen Jiang, Yi Tang, B. Liu, Xiaofei Wang, Yang Xu","doi":"10.1145/1882486.1882523","DOIUrl":"https://doi.org/10.1145/1882486.1882523","url":null,"abstract":"Deterministic Finite Automaton (DFA) is well-known for its constant matching speed in worst case, and widely used in multi-string matching, which is a critical technique in high performance Network Intrusion Detection System (NIDS) design. Existing DFA-based researches achieve high throughput at the expense of extremely high memory cost, so they fail to be used in situations like embedded systems where very tight memory resource is available. In this paper, we propose a memory-efficient multi-string matching acceleration scheme named Synergic Parallel Compact (SPC) Match Engine, which can provide a high matching speedup with no extra memory cost than the traditional DFA. Our scheme can be understood as consisting of k SPC-FAs, each of which can process one character from the input stream, causing achieving a constant speedup factor k with reduced memory occupation. Experimental evaluations with Snort and ClamAV rulesets show that a speedup of 9X can be practically achieved by a single SPC Match Engine instance with a reduced memory size than the up-to-date DFA-based compression approaches.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125252059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guangdeng Liao, Danhua Guo, L. Bhuyan, Steve R. King
{"title":"Software techniques to improve virtualized I/O performance on multi-core systems","authors":"Guangdeng Liao, Danhua Guo, L. Bhuyan, Steve R. King","doi":"10.1145/1477942.1477971","DOIUrl":"https://doi.org/10.1145/1477942.1477971","url":null,"abstract":"Virtualization technology is now widely deployed on high performance networks such as 10-Gigabit Ethernet (10GE). It offers useful features like functional isolation, manageability and live migration. Unfortunately, the overhead of network I/O virtualization significantly degrades the performance of network-intensive applications. Two major factors of loss in I/O performance result from the extra driver domain to process I/O requests and the extra scheduler inside the virtual machine monitor (VMM) for scheduling domains.\u0000 In this paper we first examine the negative effect of virtualization in multi-core platforms with 10GE networking. We study virtualization overhead and develop two optimizations for the VMM scheduler to improve I/O performance. The first solution uses cache-aware scheduling to reduce inter-domain communication cost. The second solution steals scheduler credits to favor I/O VCPUs in the driver domain. We also propose two optimizations to improve packet processing in the driver domain. First we re-design a simple bridge for more efficient switching of packets. Second we develop a patch to make transmit (TX) queue length in the driver domain configurable and adaptable to 10GE networks. Using all the above techniques, our experiments show that virtualized I/O bandwidth can be increased by 96%. Our optimizations also improve the efficiency by saving 36% in core utilization per gigabit. All the optimizations are based on pure software approaches and do not hinder live migration. We believe that the findings from our study will be useful to guide future VMM development.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130674550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient regular expression evaluation: theory to practice","authors":"M. Becchi, P. Crowley","doi":"10.1145/1477942.1477950","DOIUrl":"https://doi.org/10.1145/1477942.1477950","url":null,"abstract":"Several algorithms and techniques have been proposed recently to accelerate regular expression matching and enable deep packet inspection at line rate. This work aims to provide a comprehensive practical evaluation of existing techniques, extending them and analyzing their compatibility. The study focuses on two hardware architectures: memory-based ASICs and FPGAs.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124213606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Packet prediction for speculative cut-through switching","authors":"Paul Congdon, M. Farrens, P. Mohapatra","doi":"10.1145/1477942.1477957","DOIUrl":"https://doi.org/10.1145/1477942.1477957","url":null,"abstract":"The amount of intelligent packet processing in an Ethernet switch continues to grow, in order to support of embedded applications such as network security, load balancing and quality of service assurance. This increased packet processing is contributing to greater per-packet latency through the switch.\u0000 In addition, there is a growing interest in using Ethernet switches in low latency environments such as high-performance clusters, storage area networks and real-time media distribution. In this paper we propose Packet Prediction for Speculative Cut-through Switching (PPSCS), a novel approach to reducing the latency of modern Ethernet switches without sacrificing feature rich policy-based forwarding enabled by deep packet inspection.\u0000 PPSCS exploits the temporal nature of network communications to predict the flow classification of incoming packets and begin the speculative forwarding of packets before complex lookup operations are complete.\u0000 Simulation studies using actual network traces indicate that correct prediction rates of up to 97% are achievable using only a small amount of prediction circuitry per port. These studies also indicate that PPSCS can reduce the latency in traditional store-and-forward switches by nearly a factor of 8, and reduce the latency of cut-through switches by a factor of 3.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132332975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive scheduling to maximize NIC throughput in a COTS router","authors":"Qinghua Ye, M. MacGregor","doi":"10.1145/1477942.1477960","DOIUrl":"https://doi.org/10.1145/1477942.1477960","url":null,"abstract":"In routers based on commodity off-the-shelf (COTS) hardware and open-source operating systems, there are correlations between the transmission and reception capabilities of individual network interface cards (NICs) and multiple NICs on the same bus because of NIC/bus bottlenecks. To manage the adverse effects of this correlation, we propose an adaptive scheduling mechanism based on system state information, which balances transmission and reception rates and increases the overall forwarding rate.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132683933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compact architecture for high-throughput regular expression matching on FPGA","authors":"Y. Yang, Weirong Jiang, V. Prasanna","doi":"10.1145/1477942.1477948","DOIUrl":"https://doi.org/10.1145/1477942.1477948","url":null,"abstract":"In this paper we present a novel architecture for high-speed and high-capacity regular expression matching (REM) on FPGA. The proposed REM architecture, based on nondeterministic finite automaton (RE-NFA), efficiently constructs regular expression matching engines (REME) of arbitrary regular patterns and character classes in a uniform structure, utilizing both logic slices and block memory (BRAM) available on modern FPGA devices. The resulting circuits take advantage of synthesis and routing optimizations to achieve high operating speed and area efficiency. The uniform structure of our RE-NFA design can be stacked in a simple way to produce multi-character input circuits to scale up throughput further. An n-state m-character input REME takes only O (n X log2 m) time to construct and occupies no more than O (n X m) logic units. The REMEs can be staged and pipelined in large numbers to achieve high parallelism without sacrificing clock frequency.\u0000 Using the proposed RE-NFA architecture, we are able to implement 3 copies of two-character input REMEs, each with 760 regular expressions, 18715 states and 371 character classes, onto a single Xilinx Virtex 4 LX-100-12 device. Each copy processes 2 characters per clock cycle at 300 MHz, resulting in a concurrent throughput of 14.4 Gbps for 760 REMEs. Compared with the automatic NFA-to-VHDL REME compilation [13], our approach achieves over 9x throughput efficiency (Gbps*state/LUT). Compared with state-of-the-art REMEs on FPGA, our approach also indicates up to 70% better throughput efficiency.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122102973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low power architecture for high speed packet classification","authors":"A. Kennedy, Xiaojun Wang, Z. Liu, B. Liu","doi":"10.1145/1477942.1477967","DOIUrl":"https://doi.org/10.1145/1477942.1477967","url":null,"abstract":"Today's routers need to perform packet classification at wire speed in order to provide critical services such as traffic billing, priority routing and blocking unwanted Internet traffic. With everincreasing ruleset size and line speed, the task of implementing wire speed packet classification with reduced power consumption remains difficult. Software approaches are unable to classify packets at wire speed as line rates reach OC-768, while state of the art hardware approaches such as TCAM still consume large amounts of power.\u0000 This paper presents a low power architecture for a high speed packet classifier which can meet OC-768 line rate. The architecture consists of an adaptive clocking unit which dynamically changes the clock speed of an energy efficient packet classifier to match fluctuations in traffic on a router line card. It achieves this with the help of a scheme developed to keep clock frequencies at the lowest speed capable of servicing the line card while reducing frequency switches. The low power architecture has been tested on OC-48, OC-192 and OC-768 packet traces created from real life network traces obtained from NLANR while classifying packets using synthetic rulesets containing up to 25,000 rules. Simulation results of our classifier implemented on a Cyclone 3 and Stratix 3 FPGA, and as an ASIC show that power savings of between 17--88% can be achieved, using our adaptive clocking unit rather than a fixed clock speed.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117123698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}