{"title":"NVSwap: Latency-Aware Paging using Non-Volatile Main Memory","authors":"Yekang Wu, Xuechen Zhang","doi":"10.1109/nas51552.2021.9605418","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605418","url":null,"abstract":"Page relocation (paging) from DRAM to swap devices is an important task of a virtual memory system in operating systems. Existing Linux paging mechanisms have two main deficiencies: (1) they may incur a high I/O latency due to write interference on solid-state disks and aggressive memory page reclaiming rate under high memory pressure and (2) they do not provide predictable latency bound for latency-sensitive applications because they cannot control the allocation of system resources among concurrent processes sharing swap devices.In this paper, we present the design and implementation of a latency-aware paging mechanism called NVSwap. It supports a hybrid swap space using both regular secondary storage devices (e.g., solid-state disks) and non-volatile main memory (NVMM). The design is more cost-effective than using only NVMM as swap spaces. Furthermore, NVSwap uses NVMM as a persistent paging buffer to serve the page-out requests and hide the latency of paging between the regular swap device and DRAM. It supports in-situ paging for pages in the persistent paging buffer avoiding the slow I/O path. Finally, NVSwap allows users to specify latency bounds for individual processes or a group of related processes and enforces the bounds by dynamically controlling the resource allocation of NVMM and page reclaiming rate in memory among scheduling units. We have implemented a prototype of NVSwap in the Linux kernel-4.4.241 based on Intel Optane DIMMs. Our results demonstrate that NVSwap reduces paging latency by up to 99% and provides performance guarantee and isolation among concurrent applications sharing swap devices.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114173714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"[Copyright notice]","authors":"","doi":"10.1109/nas51552.2021.9605439","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605439","url":null,"abstract":"","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128911315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing AI Model Inference Applications Running in the SGX Environment","authors":"Shixiong Jing, Qinkun Bao, Pei Wang, Xulong Tang, Dinghao Wu","doi":"10.1109/nas51552.2021.9605445","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605445","url":null,"abstract":"Intel Software Guard Extensions (SGX) is a set of extensions built into Intel CPUs for the trusted computation. It creates a hardware-assisted secure container, within which programs are protected from data leakage and data manipulations by privileged software and hypervisors. With the trend that more and more machine learning based programs are moving to cloud computing, SGX can be used in cloud-based Machine Learning applications to protect user data from malicious privileged programs.However, applications running in SGX suffer from several overheads, including frequent context switching, memory page encryption/decryption, and memory page swapping, which significantly degrade the execution efficiency. In this paper, we aim to i) comprehensively explore the execution of general AI applications running on SGX, ii) systematically characterize the data reuses at both page granularity and cacheline granularity, and iii) provide optimization insights for efficient deployment of machine learning based applications on SGX. To the best of our knowledge, our work is the first to study machine learning applications on SGX and explore the potential of data reuses to reduce the runtime overheads in SGX.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"133 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115171338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decoupling Control and Data Transmission in RDMA Enabled Cloud Data Centers","authors":"Qingyue Liu, P. Varman","doi":"10.1109/nas51552.2021.9605415","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605415","url":null,"abstract":"Advances in storage, processing, and networking hardware are changing the structure of distributed applications. RDMA networks provide multiple communication mechanisms that enable novel hybrid protocols specialized to different data transfer requirements. In this paper, we present a distributed communication scheme that separates control and data communication channels directly at the RNIC rather than the application level. We develop a new communication artifact, a remote random access buffer, to efficiently implement this separation. Data messages are sent silently to the receiver, which is informed of the location of the data by a subsequent control message. Experiments on an RDMA-enabled cluster with micro benchmarks and two distributed applications validate the performance benefits of our approach.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127909522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"E2E Visual Analytics: Achieving >10X Edge/Cloud Optimizations","authors":"Chaunté W. Lacewell, Nilesh A. Ahuja, Pablo Muñoz, Parual Datta, Ragaad Altarawneh, Vui Seng Chua, Nilesh Jain, Omesh Tickoo, R. Iyer","doi":"10.1109/nas51552.2021.9605404","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605404","url":null,"abstract":"As visual analytics continues to rapidly grow, there is a critical need to improve the end-to-end efficiency of visual processing in edge/cloud systems. In this paper, we cover algorithms, systems and optimizations in three major areas for edge/cloud visual processing: (1) addressing storage and retrieval efficiency of visual data and meta-data by employing and optimizing visual data management systems, (2) addressing compute efficiency of visual analytics by taking advantage of co-optimization between the compression and analytics domains and (3) addressing networking (bandwidth) efficiency of visual data compression by tailoring it based on analytics tasks. We describe techniques in each of the above areas and measure its efficacy on state-of-the-art platforms (Intel Xeon), workloads and datasets. Our results show that we can achieve >10X improvements in each area based on novel algorithms, systems, and co-design optimizations. We also outline future research directions based on our findings which outline areas of further performance and efficiency advantages in end-to-end visual analytics.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"26 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128226692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Adapting the Cache Block Size in SSD Caches","authors":"Nikolaus Jeremic, Helge Parzyjegla, Gero Mühl","doi":"10.1109/nas51552.2021.9605462","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605462","url":null,"abstract":"SSD-based block-level caches can notably increase the performance of HDD-based storage systems. However, this demands a sensible choice of the cache block size, which depends strongly on the workload characteristics. Many workloads will most likely favor either small or large cache blocks. Unfortunately, choosing the appropriate cache block size is difficult due to the diversity and dynamics of storage workloads. Thus, adapting the cache block size to the workload characteristics at run time has the potential to substantially improve the cache performance compared to using a fixed cache block size. However, changing the used cache block size for all cached data is very costly and neglects that distinct parts of the data may exhibit different access patterns, which favor distinct cache block sizes.In this paper, we experimentally study the performance impact of the cache block size and fine-grained adaptation, i.e., for individual parts of the data, between small and large cache blocks in write-back SSD caches. Based on our results, we make two major observations on the performance impact of the cache block size and its adaptation. First, using an inappropriate cache block size can reduce the overall throughput by up to 84% compared to using the most suitable cache block size. Second, fine-grained adaptation between small and large cache blocks is highly beneficial as it avoids such a performance deterioration, whereas it can increase the overall throughput by up to 126% in comparison to using the more suitable fixed cache block size.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131060944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Storage Device Characteristics of A RISC-V Little-core SoC","authors":"Tao Lu","doi":"10.1109/nas51552.2021.9605430","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605430","url":null,"abstract":"Low-power system-on-chips (SoCs) dominate the Internet of Things (IoT) ecosystem, which consists of billions of devices that can generate Zettabytes of data. SoC directly interacts with big data, but there is little research on its storage performance and power consumption characteristics, especially the lack of quantitative evaluation. In this paper, we study the storage characteristics of a low-power RISC-V SoC FPGA. Specifically, we deploy a PCIe SSD to study the performance of storage devices under little cores. We quantitatively evaluate device bandwidth, IOPS throughput, and power consumption. In addition, we compare the same device on the low-power RISC-V SoC and a high-performance x86 server to observe the similarities and differences of the storage device behavior on different computing platforms.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131409951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Latency and Quality in Web Search","authors":"Liang Zhou, K. Ramakrishnan","doi":"10.1109/nas51552.2021.9605375","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605375","url":null,"abstract":"Selecting the right time budget for a search query is challenging because a proper balance between the search latency, quality and efficiency has to be maintained. State-of-the-art approaches leverage a centralized sample index at the aggregator to select the Index Serving Nodes (ISNs) to maintain quality and responsiveness. In this paper, we propose Cottage, a coordinated framework between the aggregator and ISNs for latency and quality optimization in web search. Cottage has two separate neural network models at each ISN to predict the quality contribution and latency, respectively. Then, these prediction results are sent back to the aggregator for latency and quality optimizations. The key task is integration of the predictions at the aggregator in determining an optimal dynamic time budget for identifying slow and low quality ISNs to improve latency and search efficiency. Our experiments on the Solr search engine prove that Cottage can reduce the average query latency by 54% and achieve a good P@10 search quality of 0.947.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120894459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flow Scheduling in a Heterogeneous NFV Environment using Reinforcement Learning","authors":"Chun Jen Lin, Yan Luo, Liang-Min Wang, Li-De Chen","doi":"10.1109/nas51552.2021.9605395","DOIUrl":"https://doi.org/10.1109/nas51552.2021.9605395","url":null,"abstract":"Network function virtualization (NFV) allows net-work functions executed on general-purpose servers or virtual machines (VMs) instead of proprietary hardware, greatly improving the flexibility and scalability of network services. Recent trends in using programmable accelerators to speed up NFV performance introduce challenges in flow scheduling in a dynamic NFV environment. Reinforcement learning (RL) trains machine learning models for decision making to maximize returns in uncertain environments such as NFV. In this paper, we study the allocation of heterogeneous processors (CPUs and FPGAs) to minimize the delays of flows in the system. We conduct extensive simulations to evaluate the performance of reinforcement learning based scheduling algorithms such as Advantage Actor Critic (A2C), Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), and compare with greedy policies. The results show that RL based schedulers can effectively learn from past experiences and converge to the optimal greedy policy. We also analyze in-depth how the policies lead to different processor utilization and flow processing time, and provide insights into these policies.","PeriodicalId":135930,"journal":{"name":"2021 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127829280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}