Distributed Graph Processing System and Processing-in-memory Architecture with Precise Loop-carried Dependency Guarantee
Youwei Zhuo, Jingji Chen, G. Rao, Qinyi Luo, Yanzhi Wang, Hailong Yang, D. Qian, Xuehai Qian
ACM Transactions on Computer Systems (TOCS), June 2021. DOI: https://doi.org/10.1145/3453681
Abstract: To hide the complexity of the underlying system, graph processing frameworks ask programmers to specify graph computations in user-defined functions (UDFs) of a graph-oriented programming model. Due to the nature of distributed execution, current frameworks cannot precisely enforce the semantics of UDFs, leading to unnecessary computation and communication. This exemplifies a gap between the programming model and runtime execution. This article proposes novel graph processing frameworks for distributed systems and Processing-in-memory (PIM) architectures that precisely enforce loop-carried dependencies; i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped. Our approach instruments the UDFs to express the loop-carried dependency; the distributed execution framework then enforces the precise semantics by performing dependency propagation dynamically. Enforcing a loop-carried dependency requires sequential processing of each vertex's neighbors, which are distributed across different nodes. We propose circulant scheduling in the framework to allow different nodes to process disjoint sets of edges/vertices in parallel while satisfying the sequential requirement. The technique achieves an excellent trade-off between precise semantics and parallelism: the benefits of eliminating unnecessary computation and communication offset the reduced parallelism. We implement a new distributed graph processing framework, SympleGraph, and two variants of runtime systems, GraphS and GraphSR, for PIM-based graph processing architectures, which significantly outperform the state of the art.

UNIQ
Chaim Baskin, Natan Liss, Eli Schwartz, Evgenii Zheltonozhskii, R. Giryes, A. Bronstein, A. Mendelson
ACM Transactions on Computer Systems (TOCS), March 2021. DOI: https://doi.org/10.1145/3444943
Abstract: We present a novel method for neural network quantization. Our method, named UNIQ, emulates a non-uniform k-quantile quantizer and adapts the model to perform well with quantized weights by injecting noise into the weights at training time. As a by-product of injecting noise into the weights, we find that activations can also be quantized to as low as 8 bits with only a minor accuracy degradation. Our non-uniform quantization approach provides a novel alternative to the existing uniform quantization techniques for neural networks. We further propose a novel complexity metric, the number of bit operations performed (BOPs), and we show that this metric has a linear relation with logic utilization and power. We suggest evaluating the trade-off of accuracy vs. complexity (BOPs). The proposed method, when evaluated on ResNet18/34/50 and MobileNet on ImageNet, outperforms the prior state of the art in both the low-complexity and high-accuracy regimes. We demonstrate the practical applicability of this approach by implementing our non-uniformly quantized CNN on an FPGA.

KylinX
Yiming Zhang, Chengfei Zhang, Yaozheng Wang, Kai Yu, Guangtao Xue, J. Crowcroft
ACM Transactions on Computer Systems (TOCS), February 2021. DOI: https://doi.org/10.1145/3436512
Abstract: Unikernel specializes a minimalistic LibOS and a target application into a standalone single-purpose virtual machine (VM) running on a hypervisor, referred to as a (virtual) appliance. Compared to traditional VMs, Unikernel appliances have a smaller memory footprint and lower overhead while guaranteeing the same level of isolation. On the downside, Unikernel strips the process abstraction from its monolithic appliance and thus sacrifices flexibility, efficiency, and applicability. In this article, we examine whether there is a balance that embraces the best of both Unikernel appliances (strong isolation) and processes (high flexibility/efficiency). We present KylinX, a dynamic library operating system for simplified and efficient cloud virtualization that provides the pVM (process-like VM) abstraction. A pVM takes the hypervisor as an OS and the Unikernel appliance as a process, allowing both page-level and library-level dynamic mapping. At the page level, KylinX supports pVM fork plus a set of APIs for inter-pVM communication (IpC), which is compatible with conventional UNIX IPC. At the library level, KylinX allows shared libraries to be linked to a Unikernel appliance at runtime. KylinX enforces mapping restrictions against potential threats. We implement a prototype of KylinX by modifying MiniOS and the Xen tools. Extensive experimental results show that KylinX achieves performance similar to conventional processes both in microbenchmarks (fork, IpC, library update, etc.) and in applications (Redis, web server, and DNS server), while retaining the strong isolation benefit of VMs/Unikernels.

Highly Concurrent Latency-tolerant Register Files for GPUs
Mohammad Sadrosadati, Amirhossein Mirhosseini, Ali Hajiabadi, Seyed Borna Ehsani, Hajar Falahati, H. Sarbazi-Azad, M. Drumond, B. Falsafi, Rachata Ausavarungnirun, O. Mutlu
ACM Transactions on Computer Systems (TOCS), January 2021. DOI: https://doi.org/10.1145/3419973
Abstract: Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes hierarchical register files that reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts while prefetching the registers can greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
{"title":"Shooting Down the Server Front-End Bottleneck","authors":"Rakesh Kumar, Boris Grot","doi":"10.1145/3484492","DOIUrl":"https://doi.org/10.1145/3484492","url":null,"abstract":"The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction footprints. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between metadata storage cost and performance. Temporal Stream prefetchers deliver high performance but require a prohibitive amount of metadata to accommodate the temporal history. Meanwhile, BTB-directed prefetchers incur low cost by using the existing in-core branch prediction structures but fall short on performance due to BTB’s inability to capture the massive control flow working set of server applications. This work overcomes the fundamental limitation of BTB-directed prefetchers, which is capturing a large control flow working set within an affordable BTB storage budget. We re-envision the BTB organization to maximize its control flow coverage by observing that an application’s instruction footprint can be mapped as a combination of its unconditional branch working set and, for each unconditional branch, a spatial encoding of the cache blocks around the branch target. Effectively capturing a map of the application’s instruction footprint in the BTB enables highly effective BTB-directed prefetching that outperforms the state-of-the-art prefetchers by up to 10% for equivalent storage budget.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134511900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Apache Nemo: A Framework for Optimizing Distributed Data Processing
Won Wook Song, Youngseok Yang, Jeongyoon Eo, J. Seo, J. Kim, Sanha Lee, Gyewon Lee, Taegeon Um, Haeyoon Cho, Byung-Gon Chun
ACM Transactions on Computer Systems (TOCS), November 2020. DOI: https://doi.org/10.1145/3468144
Abstract: Optimizing the scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficiently fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored to a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.

AI Tax
Daniel Richins, Dharmisha Doshi, Matthew Blackmore, A. Nair, Neha Pathapati, Ankit Patel, Brainard Daguman, Daniel Dobrijalowski, R. Illikkal, Kevin Long, David Zimmerman, V. Reddi
ACM Transactions on Computer Systems (TOCS), July 2020. DOI: https://doi.org/10.1145/3440689
Abstract: Artificial intelligence and machine learning are experiencing widespread adoption in industry and academia. This has been driven by rapid advances in the applications and accuracy of AI through increasingly complex algorithms and models; this, in turn, has spurred research into specialized hardware AI accelerators. Given the rapid pace of advances, it is easy to forget that accelerators are often developed and evaluated in a vacuum, without considering the full application environment. This article emphasizes the need for a holistic, end-to-end analysis of artificial intelligence (AI) workloads and reveals the "AI tax." We deploy and characterize Face Recognition in an edge data center. The application is an AI-centric edge video analytics application built using popular open-source infrastructure and machine learning (ML) tools. Despite using state-of-the-art AI and ML algorithms, the application relies heavily on pre- and post-processing code. As AI-centric applications benefit from the acceleration promised by accelerators, we find that they impose stresses on the hardware and software infrastructure: storage and network bandwidth become major bottlenecks with increasing AI acceleration. By specializing for AI applications, we show that a purpose-built edge data center can be designed for the stresses of accelerated AI at 15% lower TCO than one derived from homogeneous servers and infrastructure.

Transactuations
Tanakorn Leesatapornwongsa, Aritra Sengupta, Masoud Saeida Ardekani, G. Petri, Cesar A. Stuardo
ACM Transactions on Computer Systems (TOCS), May 2020. DOI: https://doi.org/10.1145/3380907
Abstract: A large class of IoT applications read sensors, execute application logic, and actuate actuators. However, the lack of high-level programming abstractions compromises correctness, especially in the presence of failures and unwanted interleaving between applications. A key problem arises when operations on IoT devices or the application itself fail, leading to inconsistencies between the physical state and the application state, breaking application semantics, and causing undesired consequences. Transactions are a well-established abstraction for correctness, but they assume properties that are absent in an IoT context. In this article, we study one such environment, the smart home, and establish the inconsistencies that manifest from failures. We propose an abstraction called transactuation that empowers developers to build reliable applications. Our runtime, Relacs, implements the abstraction atop a real smart-home platform. We evaluate the programmability, performance, and effectiveness of transactuations to demonstrate their potential as a powerful abstraction and execution model.

SILK+: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores Running Heterogeneous Workloads
Oana Balmau, Florin Dinu, W. Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, Diego Didona
ACM Transactions on Computer Systems (TOCS), May 2020. DOI: https://doi.org/10.1145/3380905
Abstract: Log-Structured Merge Key-Value stores (LSM KVs) are designed to offer good write performance by capturing client writes in memory and only later flushing them to storage. Writes are later compacted into a tree-like data structure on disk to improve read performance and to reduce storage space use. It has been widely documented that compactions severely hamper throughput, and various optimizations have successfully dealt with this problem. These techniques include, among others, rate-limiting flushes and compactions, selecting among compactions for maximum effect, and limiting compactions to the highest level by so-called fragmented LSMs. In this article, we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue and, in fact, in some cases exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes, and compactions. Another major cause of tail latency is the heterogeneous nature of the workloads in terms of operation mix and item sizes, whereby a few computationally heavy requests slow down the vast majority of smaller requests. We introduce the notion of an Input/Output (I/O) bandwidth scheduler for an LSM-based KV store to reduce the tail latency caused by interference from flushing and compactions and by workload heterogeneity. We explore three techniques as part of this I/O scheduler: (1) opportunistically allocating more bandwidth to internal operations during periods of low load, (2) prioritizing flushes and compactions at the lower levels of the tree, and (3) separating client requests by size and by data access path. SILK+ is a new open-source LSM KV that incorporates this notion of an I/O scheduler.
{"title":"Corrigendum to “Derecho","authors":"Sagar Jha","doi":"10.1145/3395604","DOIUrl":"https://doi.org/10.1145/3395604","url":null,"abstract":"","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115415198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}