A new server I/O architecture for high speed networks

Guangdeng Liao, Xia Zhu, L. Bhuyan
{"title":"A new server I/O architecture for high speed networks","authors":"Guangdeng Liao, Xia Zhu, L. Bhuyan","doi":"10.1109/HPCA.2011.5749734","DOIUrl":null,"url":null,"abstract":"Traditional architectural designs are normally focused on CPUs and have been often decoupled from I/O considerations. They are inefficient for high-speed network processing with a bandwidth of 10Gbps and beyond. Long latency I/O interconnects on mainstream servers also substantially complicate the NIC designs. In this paper, we start with fine-grained driver and OS instrumentation to fully understand the network processing overhead over 10GbE on mainstream servers. We obtain several new findings: 1) besides data copy identified by previous works, the driver and buffer release are two unexpected major overheads (up to 54%); 2) the major source of the overheads is memory stalls and data relating to socket buffer (SKB) and page data structures are mainly responsible for the stalls; 3) prevailing platform optimizations like Direct Cache Access (DCA) are insufficient for addressing the network processing bottlenecks. Motivated by the studies, we propose a new server I/O architecture where DMA descriptor management is shifted from NICs to an on-chip network engine (NEngine), and descriptors are extended with information about data incurring memory stalls. NEngine relies on data lookups and preloads data to eliminate the stalls during network processing. Moreover, NEngine implements efficient packet movement inside caches to address the remaining issues in data copy. The new architecture allows DMA engine to have very fast access to descriptors and keeps packets in CPU caches instead of NIC buffers, significantly simplifying NICs. Experimental results demonstrate that the new server I/O architecture improves the network processing efficiency by 47% and web server throughput by 14%, while substantially reducing the NIC hardware complexity.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"94 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2011.5749734","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 49

Abstract

Traditional architectural designs normally focus on CPUs and have often been decoupled from I/O considerations. They are inefficient for high-speed network processing at bandwidths of 10 Gbps and beyond. Long-latency I/O interconnects on mainstream servers also substantially complicate NIC designs. In this paper, we start with fine-grained driver and OS instrumentation to fully understand the network processing overhead over 10GbE on mainstream servers. We obtain several new findings: 1) besides the data copy identified by prior work, the driver and buffer release are two unexpected major overheads (up to 54%); 2) the major source of these overheads is memory stalls, and data relating to the socket buffer (SKB) and page data structures is mainly responsible for the stalls; 3) prevailing platform optimizations like Direct Cache Access (DCA) are insufficient to address the network processing bottlenecks. Motivated by these studies, we propose a new server I/O architecture in which DMA descriptor management is shifted from NICs to an on-chip network engine (NEngine), and descriptors are extended with information about the data incurring memory stalls. NEngine relies on data lookups and preloads this data to eliminate the stalls during network processing. Moreover, NEngine implements efficient packet movement inside caches to address the remaining issues in data copy. The new architecture allows the DMA engine to access descriptors very quickly and keeps packets in CPU caches instead of NIC buffers, significantly simplifying NICs. Experimental results demonstrate that the new server I/O architecture improves network processing efficiency by 47% and web server throughput by 14%, while substantially reducing NIC hardware complexity.
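To make the descriptor-extension idea concrete, the following is a minimal sketch in C under stated assumptions: the struct layout, field names, and the nengine_preload() helper are hypothetical illustrations, not the paper's actual hardware interface. It pictures a conventional receive DMA descriptor augmented with the addresses of the stall-prone SKB and page metadata, which an engine in the NEngine style could preload into CPU caches before the driver's receive path dereferences them.

#include <stdint.h>

/* Hypothetical receive descriptor: a conventional DMA descriptor
 * (buffer address, length, status) extended with the addresses of
 * the data structures the paper identifies as the main sources of
 * memory stalls: the socket buffer (SKB) and the backing page. */
struct rx_desc_ext {
    uint64_t buf_addr;   /* packet buffer, as in a standard descriptor */
    uint16_t buf_len;
    uint16_t status;
    /* extension: hints about data that would otherwise stall the CPU */
    uint64_t skb_addr;   /* address of the SKB for this buffer */
    uint64_t page_addr;  /* address of the page struct backing the buffer */
};

/* Preload one cache line; __builtin_prefetch is a GCC/Clang builtin. */
static inline void preload_line(const void *addr)
{
#if defined(__GNUC__)
    __builtin_prefetch(addr, 0 /* read */, 3 /* high temporal locality */);
#else
    (void)addr;
#endif
}

/* Sketch of what the engine does with the extended descriptor:
 * touch the SKB, page, and packet-header lines before the driver
 * runs, so its receive path hits in cache instead of stalling. */
static void nengine_preload(const struct rx_desc_ext *d)
{
    preload_line((const void *)(uintptr_t)d->skb_addr);
    preload_line((const void *)(uintptr_t)d->page_addr);
    preload_line((const void *)(uintptr_t)d->buf_addr);
}

In the paper's design this preloading is performed by on-chip hardware rather than by software prefetch instructions; the software prefetch above merely stands in for that behavior to show where the extended descriptor fields would be consumed.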