Roadblocks of I/O Parallelization: Removing H/W Contentions by Static Role Assignment in VNFs

2020 IEEE 9th International Conference on Cloud Networking (CloudNet). Pub Date: 2020-11-09. DOI: 10.1109/CloudNet51028.2020.9335803

Masahiro Asada, Ryota Kawashima, Hiroki Nakayama, Tsunemasa Hayashi, H. Matsuo



Abstract

Achieving 100 Gbps+ throughput with commodity servers is a challenging goal, even with the state-of-the-art Data Plane Development Kit (DPDK). The fundamental performance of CPU and memory is now the bottleneck, and simple code optimization of Network Functions (NFs) cannot be the solution. Hardware accelerators, including FPGAs, are attracting attention as a way to boost performance; however, relying on specific hardware features degrades the manageability of NFV nodes. The commonly used Receive Side Scaling (RSS) provides a means of H/W-level parallelization, but it does not accelerate per-flow throughput. Existing software-based approaches distribute the processing load of NFs, but I/O is still serialized for each datapath. In our previous study, we tackled I/O parallelization and encountered contention whose cause was unclear. Specifically, per-thread CPU cycle consumption grew in proportion to the parallelization level, although the overhead of conceivable mutual exclusion (e.g., CAS operations) was trivial. In this paper, we pursue the cause of the issue and upgrade our I/O parallelization scheme. Our careful investigation of NFV-node internals, ranging from the application layer to the device driver layer, indicates that hidden H/W-level contention involving DMA heavily consumes CPU cycles. We propose a contention-avoidance design based on thread role assignment and show that it can flatten per-thread CPU cycle consumption.
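The abstract's remark that RSS parallelizes at the hardware level without accelerating per-flow throughput can be illustrated with a minimal sketch. It is not taken from the paper: the queue count, the helper names `rss_queue` and `spread`, and the use of `zlib.crc32` as the hash are all assumptions made for illustration (real NICs typically use a Toeplitz hash and an indirection table). The point is only that a hash over the flow 5-tuple sends every packet of one flow to the same queue, so a single heavy flow is still handled by one core.

```python
import zlib
from collections import Counter

NUM_RX_QUEUES = 4  # hypothetical number of NIC receive queues / worker cores


def rss_queue(src_ip, dst_ip, src_port, dst_port, proto=6):
    """Map a flow 5-tuple to an RX queue index.

    zlib.crc32 is only a stand-in for the NIC's hash function;
    real hardware typically uses a Toeplitz hash and an indirection table.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_RX_QUEUES


def spread(packets):
    """Count how many packets land on each RX queue."""
    return Counter(rss_queue(*pkt) for pkt in packets)


# One heavy flow: 1000 packets sharing the same 5-tuple.
single_flow = [("10.0.0.1", "10.0.0.2", 40000, 80)] * 1000
# Many flows: 1000 packets that differ only in source port.
many_flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 80) for i in range(1000)]

single_spread = spread(single_flow)
many_spread = spread(many_flows)

print("single flow:", dict(single_spread))
print("many flows: ", dict(many_spread))
```

In this sketch the single flow occupies exactly one queue regardless of `NUM_RX_QUEUES`, while the multi-flow traffic is distributed across queues. Parallelizing the I/O of one datapath, as the paper sets out to do, therefore needs a mechanism other than flow hashing.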
