S. Seetharama, M. Cohen, S. Sengupta, D. Panda, L. Paraschis
{"title":"Tutorials - HOTI 2012","authors":"S. Seetharama, M. Cohen, S. Sengupta, D. Panda, L. Paraschis","doi":"10.1109/HOTI.2012.25","DOIUrl":"https://doi.org/10.1109/HOTI.2012.25","url":null,"abstract":"These tutorials cover the following topics: Hands-on Tutorial on Software-Defined Networking; Interconnection Networks for Cloud Data Centers; Designing Scientific, Enterprise, and Cloud Computing Systems with InfiniBand and High-Speed Ethernet: Current Status and Trends; The Evolution of Network Architecture towards Cloud-Centric Applications.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126318563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted Differential Scheduler","authors":"H. Eberle, W. Olesinski","doi":"10.1109/HOTI.2012.12","DOIUrl":"https://doi.org/10.1109/HOTI.2012.12","url":null,"abstract":"The Weighted Differential Scheduler (WDS) is a new scheduling discipline for accessing shared resources. The work described here was motivated by the need for a simple weighted scheduler for a network switch where multiple packet flows are competing for an output port. The scheme can be implemented with simple arithmetic logic and finite state machines. We describe several versions of WDS that can merge two or more flows. An analysis reveals that WDS has lower jitter than any other weighted scheduler known to us.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130798638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Gutierrez, N. Hjelm, Manjunath Gorentla Venkata, R. Graham
{"title":"Performance Evaluation of Open MPI on Cray XE/XK Systems","authors":"S. Gutierrez, N. Hjelm, Manjunath Gorentla Venkata, R. Graham","doi":"10.1109/HOTI.2012.11","DOIUrl":"https://doi.org/10.1109/HOTI.2012.11","url":null,"abstract":"Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of Open MPI, however, lack support for the Cray XE6 and XK6 architectures -- both of which use the Gemini System Interconnect. In this paper, we present extensions to natively support these architectures within Open MPI, describe and propose solutions for performance and scalability bottlenecks, and provide an extensive evaluation of our implementation, which is the first completely open-source MPI implementation for the Cray XE/XK system families used at 49,152 processes. Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to the vendor-supplied MPI's. Micro-benchmark results show short-data 1-byte and 1,024-byte message latencies of 1.20 μs and 4.13 μs, which are 10.00% and 39.71% better than the vendor-supplied MPI's, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, which is similar to the vendor-supplied MPI's bandwidth at the same message size. Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales up to 49,152 cores -- where we exhibited similar performance and scaling characteristics when compared to the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI, which is on par with the vendor-supplied MPI's achieved parallel efficiency.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124460078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monia Ghobadi, Geoffrey Salmon, Y. Ganjali, Martin Labrecque, J. Steffan
{"title":"Caliper: Precise and Responsive Traffic Generator","authors":"Monia Ghobadi, Geoffrey Salmon, Y. Ganjali, Martin Labrecque, J. Steffan","doi":"10.1109/HOTI.2012.16","DOIUrl":"https://doi.org/10.1109/HOTI.2012.16","url":null,"abstract":"This paper presents Caliper, a highly accurate packet injection tool that generates precise and responsive traffic. Caliper takes live packets generated on a host computer and transmits them onto a gigabit Ethernet network with precise inter-transmission times. Existing software traffic generators rely on generic Network Interface Cards which, as we demonstrate, do not provide high-precision timing guarantees. Hence, performing valid and convincing experiments becomes difficult or impossible in the context of time-sensitive network experiments. Our evaluations show that Caliper is able to reproduce packet inter-transmission times from a given arbitrary distribution while capturing the closed-loop feedback of TCP sources. Specifically, we demonstrate that Caliper provides three orders of magnitude better precision compared to a commodity NIC: with requested traffic rates up to the line rate, Caliper incurs an error of 8 ns or less in packet transmission times. Furthermore, we explore Caliper's ability to integrate with existing network simulators to project simulated traffic characteristics into a real network environment. Caliper is freely available online.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"51 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128871285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rx Stack Accelerator for 10 GbE Integrated NIC","authors":"F. Abel, C. Hagleitner, Fabrice Verplanken","doi":"10.1109/HOTI.2012.18","DOIUrl":"https://doi.org/10.1109/HOTI.2012.18","url":null,"abstract":"The miniaturization of CMOS technology has reached a scale at which server processors are starting to integrate multi-gigabit network interface controllers (NIC). While transistors are becoming cheap and abundant in solid-state circuits, they remain at a premium on a processor die if they do not contribute to increasing the number of cores and caches. Therefore, an integrated NIC (iNIC) must provide high networking performance under high logic density and low power dissipation. This paper describes the design of an integrated accelerator to offload computation-intensive protocol-processing tasks. The accelerator combines the concepts of the transport-triggered architecture with a programmable finite-state machine to deliver high instruction-level parallelism, efficient multiway branching and flexibility. The flexibility is key to adapt to protocol changes and address new applications. This accelerator was used in the construction of a 10 GbE iNIC in 45-nm CMOS technology. The ratio of performance (15 Mfps - 20 Gb/s Tput per port) to area (0.7 mm²) and the power consumption (0.15 W) of this accelerator were core enablers for constructing a processor compute complex with four iNICs.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128009425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jérôme Vienne, Jitong Chen, Md. Wasi-ur-Rahman, Nusrat S. Islam, H. Subramoni, D. Panda
{"title":"Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems","authors":"Jérôme Vienne, Jitong Chen, Md. Wasi-ur-Rahman, Nusrat S. Islam, H. Subramoni, D. Panda","doi":"10.1109/HOTI.2012.19","DOIUrl":"https://doi.org/10.1109/HOTI.2012.19","url":null,"abstract":"Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever increasing communication demands being placed on them by HPC applications and cloud computing middleware (e.g., Hadoop). The PCIe interfaces can now deliver speeds up to 128 Gbps (Gen3) and high performance interconnects (10/40 GigE, InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE RDMA over Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user in the HPC / cloud computing domain can expect by utilizing newer generations of these interconnects over older ones or how one type of interconnect (such as IB) performs in comparison to another (such as RoCE). In this paper we evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC applications and cloud computing middleware. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for HPC as well as cloud computing applications.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129790033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Neeser, Nikolaos Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, Kenneth M. Valk, C. Basso
{"title":"Occupancy Sampling for Terabit CEE Switches","authors":"F. Neeser, Nikolaos Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, Kenneth M. Valk, C. Basso","doi":"10.1109/HOTI.2012.14","DOIUrl":"https://doi.org/10.1109/HOTI.2012.14","url":null,"abstract":"One consequential feature of Converged Enhanced Ethernet (CEE) is losslessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN). We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch; however, as future switches scale to higher port counts and link speeds, purely output-queued or shared-memory architectures lead to excessive memory bandwidth requirements. Moreover, PFC typically requires dedicated buffers per input. Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's side effects. We install QCN congestion points (CPs) at input buffers with virtual output queues and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims. Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method not requiring any per-flow state. For CPs with arbitrarily scheduled buffers, QCN-OS is shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114370577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeffrey Fong, Xiang Wang, Yaxuan Qi, Jun Li, Weirong Jiang
{"title":"ParaSplit: A Scalable Architecture on FPGA for Terabit Packet Classification","authors":"Jeffrey Fong, Xiang Wang, Yaxuan Qi, Jun Li, Weirong Jiang","doi":"10.1109/HOTI.2012.17","DOIUrl":"https://doi.org/10.1109/HOTI.2012.17","url":null,"abstract":"Packet classification is a fundamental enabling function for various applications in switches, routers and firewalls. Due to their performance and scalability limitations, current packet classification solutions are insufficient in addressing the challenges from the growing network bandwidth and the increasing number of new applications. This paper presents a scalable parallel architecture, named ParaSplit, for high-performance packet classification. We propose a rule set partitioning algorithm based on range-point conversion to reduce the overall memory requirement. We further optimize the partitioning by applying the Simulated Annealing technique. We implement the architecture on a Field Programmable Gate Array (FPGA) to achieve high throughput by exploiting the abundant parallelism in the hardware. Evaluation using real-life data sets including OpenFlow-like 11-tuple rules shows that ParaSplit achieves significant reduction in memory requirement, compared with state-of-the-art algorithms such as HyperSplit [6] and EffiCuts [8]. Because of the memory efficiency of ParaSplit, our FPGA design can support in the on-chip memory multiple engines, each of which contains up to 10K complex rules. As a result, the architecture with multiple ParaSplit engines in parallel can achieve up to Terabit per second throughput for large and complex rule sets on a single FPGA device.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130154550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Lockwood, Adwait Gupte, Nishit Mehta, Michaela Blott, T. English, K. Vissers
{"title":"A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)","authors":"J. Lockwood, Adwait Gupte, Nishit Mehta, Michaela Blott, T. English, K. Vissers","doi":"10.1109/HOTI.2012.15","DOIUrl":"https://doi.org/10.1109/HOTI.2012.15","url":null,"abstract":"Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters. The high and unpredictable latency of these systems has led the trading world to explore alternative \"hybrid\" architectures with hardware acceleration. In this paper, we survey existing solutions and describe how FPGAs are being used in electronic trading to approach the goal of zero latency. We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10Gb/s Ethernet line rate with a fixed end-to-end latency of 1μs - up to two orders of magnitude lower than comparable software implementations.","PeriodicalId":197180,"journal":{"name":"2012 IEEE 20th Annual Symposium on High-Performance Interconnects","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134451197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}