{"title":"Automated task distribution in multicore network processors using statistical analysis","authors":"A. Mallik, Yu Zhang, G. Memik","doi":"10.1145/1323548.1323563","DOIUrl":"https://doi.org/10.1145/1323548.1323563","url":null,"abstract":"Chip multiprocessor designs are the most common types of architectures seen in Network Processors. As the Network Processors are used to implement increasingly complicated applications, task distribution among the cores is becoming an important problem. In this paper, we propose a new task allocation scheme for such architectures. This scheme relies on the inherent modular nature of the networking applications and intelligently distributes modules among different execution cores. Additionally, we selectively replicate modules to parallelize execution of tasks having longer processing time. We have developed a technique that uses the probability distribution of the execution times of different modules in the networking applications. The proposed schemes result in resource utilization of up to 95%, 89%, and 84% on average for the processors with 2, 4, and 8 cores, respectively. The schemes are highly scalable and can improve the throughput by 6.72 times for 8 core processors, aggregated over four representative applications. The combination of selective replication of modules and variation-aware task allocation result in up to 12.5% (9.9% on average) performance improvement as compared to a scheme based on just mean processing time.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132324825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flow-slice: a novel load-balancing scheme for multi-path switching systems","authors":"Lei Shi, B. Liu, Changhua Sun, Zhengyu Yin, L. Bhuyan, H. J. Chao","doi":"10.1145/1323548.1323559","DOIUrl":"https://doi.org/10.1145/1323548.1323559","url":null,"abstract":"Multi-Path Switching systems (MPS) are intensively used in the state-of-the-art core routers. One of the most intractable issues is how to load-balance traffic across its multiple paths while not disturbing the intra-flow packet orders. In this paper, based on the studies of tens of real Internet traces, we develop a novel scheme, namely Flow-Slice (FS), which cuts off each flow into flow-slices at every intra-flow interval larger than a slicing threshold set to 1ms 4ms and balances the load on the finer granularity. Through theoretical analyses and comprehensive trace-driven simulations, we show that FS achieves impressive load-balancing performance with little hardware cost while limiting the packet out-of-order chances to a negligible level (below 10 -6).","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130216976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A programmable message classification engine for session initiation protocol (SIP)","authors":"A. Acharya, Xiping Wang, Charles P. Wright","doi":"10.1145/1323548.1323578","DOIUrl":"https://doi.org/10.1145/1323548.1323578","url":null,"abstract":"Session Initiation Protocol (SIP) has begun to be widely deployed for multiple services such as VoIP, Instant Messaging and Presence. Each of these services uses different SIP messages, and depending on the value of a service, e.g. revenue, the associated messages may need to be prioritized accordingly. Even within the same service, different messages may be assigned different priorities. In this paper, we present the design and implementation of a programmable classification engine for SIP messages in the Linux kernel. This design uses a novel algorithm that in addition to classifying messages can extract and maintain state information across multiple messages. We apply the classifier for overload control using operator-specified rules for categorizing messages and associated actions, augmented with a protocol-level understanding of SIP message structure. When faced with loads beyond their capacity (e.g., during catastrophic situations and major network outages), SIP servers must drop messages. It is therefore desirable that the server process high-value messages in preference to lower-value messages. We evaluated our in-kernel classifier implementation with an open source SIP server (SER) for such an overload scenario. The workload consists of a mix of call setup and call handoff messages, and the classifier is programmed with rules that prioritize handoffs over call setups. We show that, while SER can process about 40K messages/sec (in a FIFO manner), our classifier can examine and prioritize 105K messages/sec during overload. With the classifier operating at peak throughput, SER's processing rate drops to 31.6K messages/sec, but all of the available high-value messages are processed.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116204381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia","authors":"S. Sushanth Kumar, B. Chandrasekaran, J. Turner, G. Varghese","doi":"10.1145/1323548.1323574","DOIUrl":"https://doi.org/10.1145/1323548.1323574","url":null,"abstract":"The importance of network security has grown tremendously and a collection of devices have been introduced, which can improve the security of a network. Network intrusion detection systems (NIDS) are among the most widely deployed such system; popular NIDS use a collection of signatures of known security threats and viruses, which are used to scan each packet's payload. Today, signatures are often specified as regular expressions; thus the core of the NIDS comprises of a regular expressions parser; such parsers are traditionally implemented as finite automata. Deterministic Finite Automata (DFA) are fast, therefore they are often desirable at high network link rates. DFA for the signatures, which are used in the current security devices, however require prohibitive amounts of memory, which limits their practical use.\u0000 In this paper, we argue that the traditional DFA based NIDS has three main limitations: first they fail to exploit the fact that normal data streams rarely match any virus signature; second, DFAs are extremely inefficient in following multiple partially matching signatures and explodes in size, and third, finite automaton are incapable of efficiently keeping track of counts. We propose mechanisms to solve each of these drawbacks and demonstrate that our solutions can implement a NIDS much more securely and economically, and at the same time substantially improve the packet throughput.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133180532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards high-performance flow-level packet processing on multi-core network processors","authors":"Yaxuan Qi, Bosheng Xu, Fei He, Baohua Yang, Jianming Yu, Jun Li","doi":"10.1145/1323548.1323552","DOIUrl":"https://doi.org/10.1145/1323548.1323552","url":null,"abstract":"There is a growing interest in designing high-performance network devices to perform packet processing at flow level. Applications such as stateful access control, deep inspection and flow-based load balancing all require efficient flow-level packet processing. In this paper, we present a design of high-performance flow-level packet processing system based on multi-core network processors. Main contribution of this paper includes: a) A high performance flow classification algorithm optimized for network processors; b) An efficient flow state management scheme leveraging memory hierarchy to support large number of concurrent flows; c) Two hardware-optimized order-preserving strategies that preserve internal and external per-flow packet order. Experimental results show that: a) The proposed flow classification algorithm, AggreCuts, outperforms the well-known HiCuts algorithm in terms of classification rate and memory usage; b) The presented SigHash scheme can manage over 10M concurrent flow states on the Intel IXP2850 NP with extremely low collision rate; c) The performance of internal packet order-preserving scheme using SRAM queue-array is about 70% of that of external packet order-preserving scheme realized by ordered-thread execution.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130802514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-latency scheduling in large switches","authors":"W. Olesinski, N. Gura, H. Eberle, A. Mejia","doi":"10.1145/1323548.1323566","DOIUrl":"https://doi.org/10.1145/1323548.1323566","url":null,"abstract":"Scheduling in large switches is challenging. Arbiters must operate at high rates to keep up with the high switching rates demanded by multi-gigabit-per-second link rates and short cells. Low-latency requirements of some applications also challenge the design of schedulers. In this paper, we propose the Parallel Wrapped Wave Front Arbiter with Fast Scheduler (PWWFA-FS). We analyze its performance, present simulation results, discuss its implementation, and show how this scheme can provide low latency under light load while scaling to large switches with multi-terabit-per-second throughput and hundreds of ports.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115019740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing interoperability and stateful analysis of cooperative network intrusion detection systems","authors":"M. Colajanni, Daniele Gozzi, Mirco Marchetti","doi":"10.1145/1323548.1323576","DOIUrl":"https://doi.org/10.1145/1323548.1323576","url":null,"abstract":"A traditional Network Intrusion Detection System (NIDS) is based on a centralized architecture that does not satisfy the needs of most modern network infrastructures characterized by high traffic volumes and complex topologies. The of decentralized NIDS based on multiple sensors is that each of them gets just a partial view of the network traffic and this prevents a stateful and fully reliable traffic analysis. We propose a novel cooperation mechanism that the previous issues through an innovative state management and state migration framework. It allows multiple decentralized sensors to share their internal state, thus accomplishing innovative and powerful traffic analysis. The advanced functionalities and performance of the proposed cooperative framework for network intrusion detection systems are demonstrated through a fully operative prototype.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129593929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On LID assignment in infiniBand networks","authors":"Wickus Nienaber, Xin Yuan, Z. Duan","doi":"10.1145/1323548.1323567","DOIUrl":"https://doi.org/10.1145/1323548.1323567","url":null,"abstract":"To realize a path in an InfiniBand network, an address, known as Local IDentifier (LID)in the InfiniBand specification, must be assigned to the destination and used in the forwarding tables of intermediate switches to direct the traffic following the path. Hence, path computation in InfiniBand networks has two tasks: (1)computing the paths, and (2 )assigning LIDs to destinations (and using the LIDs in the forwarding tables to realize the paths). We will refer to the task of computing paths as routing and the task of assigning LIDs as LID assignment Existing path computation methods for InfiniBand networks integrate these two tasks in one phase. In this paper, we propose to separate routing and LID assignment into two phases so as to achieve the best performance for both routing and LID assignment. Since the routing component has been extensively studied and is fairly well understood, this paper focuses on LID assignment whose major issue is to minimize the number of LIDs required to support a routing. We prove that the problem of realizing a routing with a minimum number of LIDs is NP-complete, develop a number of heuristics for this problem, and evaluate the performance of the heuristics through simulation. Our results demonstrate that by separating routing from LID assignment and using the schemes that are known to achieve good performance for routing and LID assignment separately, more effective path computation methods than existing ones can be developed.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129098526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of adaptive communication channel buffers for low-power area-efficient network-on-chip architecture","authors":"Avinash Karanth Kodi, Ashwini Sarathy, A. Louri","doi":"10.1145/1323548.1323561","DOIUrl":"https://doi.org/10.1145/1323548.1323561","url":null,"abstract":"Network-on-Chip (NoC)architectures provide a scalable solution to the wire delay constraints in deep submicron VLSI designs. Recent research into the ptimization of NoC architectures has shown that the design of buffers in the NoC routers influences the power consumption, area overhead and performance of the entire network. In this paper, we propose a low-power area-efficient NoC architecture by reducing the number of router buffers. As a reduction in the number of buffers degrades the network's performance, we propose to use the existing repeaters along the inter-router links as adaptive channel buffers for storing data when required. We evaluate the proposed adaptive communication channel buffers under static and dynamic buffer allocation in 8 x 8 mesh and folded torus network topologies. Simulation results show that reducing the router buffer size in half and using the adaptive channel buffers reduces the buffer power by 40-52% and leads to a 17-20% savings in overall network power with a 50% reduction in router area. The design with dynamic buffer allocation shows a marginal 1-5% drop in performance, while static buffer allocation shows a 10-20% drop in performance, for various traffic patterns.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129116925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"To CMP or not to CMP: analyzing packet classification on modern and traditional parallel architectures","authors":"Randy Smith, Dan Gibson, Shijin Kong","doi":"10.1145/1323548.1323558","DOIUrl":"https://doi.org/10.1145/1323548.1323558","url":null,"abstract":"Packet classification is central to modern network functionality, yet satisfactory memory usage and performance remains elusive at the highest speeds. The recent emergence of low-cost, highly parallel architectures provides a promising platform on which to realize increased classification performance. We analyze two classic algorithms (ABV and HiCuts) in multiple parallel contexts. Our results show that performance depends strongly on many factors, including algorithm choice, hardware platform, and parallelization scheme. We find that there is no clear \"best solution,\" but in the best cases hardware constraints are mitigated by the parallelization scheme and vice versa, yielding near-linear speedups as the degree of parallelization increases.","PeriodicalId":329300,"journal":{"name":"Symposium on Architectures for Networking and Communications Systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124679979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}