MARDU
Christopher Jelesnianski, Jinwoo Yom, Changwoo Min, Yeongjin Jang
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398280

Defense techniques such as Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR) were role models in preventing early return-oriented programming (ROP) attacks: by keeping performance and scalability at the forefront, they became widely adopted. As code-reuse attacks evolved in complexity, defenses lost touch with pragmatic design, either being narrow in scope or imposing unrealistic overheads. We present MARDU, an on-demand, system-wide re-randomization technique that maintains strong security guarantees while providing better overall performance and the scalability most defenses lack. We achieve code sharing with diversification by re-randomizing reactively and scalably, rather than continuously or only once; enabling code sharing further minimizes the required tracking, patching, and memory overheads. Our evaluation shows that MARDU incurs a low performance overhead of 5.5% on SPEC and a minimal degradation of 4.4% in NGINX, demonstrating its applicability to both compute-intensive and scalable real-world applications.

Automatic Core Specialization for AVX-512 Applications
Mathias Gottschlag, Peter Brantsch, Frank Bellosa
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398282

Advanced Vector Extensions (AVX) instructions operate on wide SIMD vectors. Due to the resulting high power consumption, recent Intel processors reduce their frequency when executing complex AVX2 and AVX-512 instructions. This frequency reduction slows down subsequent non-AVX code in two situations: when that code executes in parallel on the sibling hyperthread of the same core, or, because restoring the non-AVX frequency is delayed, when it directly follows the AVX2/AVX-512 code. As a result, heterogeneous workloads consisting of AVX-512 and non-AVX code are frequently slowed down by 10% on average. In this work, we describe a method to mitigate the frequency-reduction slowdown for workloads involving AVX-512 instructions in both situations. Our approach employs core specialization: it partitions the CPU cores into AVX-512 cores and non-AVX-512 cores, and only the former execute AVX-512 instructions, so that the impact of potential frequency reductions is limited to those cores. To migrate threads to AVX-512 cores, we configure the non-AVX-512 cores to raise an exception when executing AVX-512 instructions, and we use a heuristic to determine when to migrate threads back to non-AVX-512 cores. Our approach reduces the frequency-reduction overhead by 70% for an assortment of common benchmarks.
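The paper's migration step is implemented in-kernel by trapping AVX-512 instructions; as a rough user-space approximation of the same partitioning idea, a thread can pin itself to a designated AVX-512 core set before entering a vector-heavy region and restore its affinity afterwards. The core set {0} below is a hypothetical partition chosen for illustration, not the paper's mechanism.

```python
import os

# Hypothetical partition: treat core {0} as the "AVX-512 core" set.
# (The paper traps AVX-512 instructions in the kernel and migrates the
# thread; this sketch only approximates the migration via CPU affinity.)
AVX512_CORES = {0}

def run_on_avx512_cores(fn, *args):
    """Pin the calling process to the AVX-512 core set, run fn,
    then restore the original affinity mask."""
    old_mask = os.sched_getaffinity(0)            # 0 == calling process
    target = (AVX512_CORES & old_mask) or old_mask  # fall back if cores unavailable
    os.sched_setaffinity(0, target)
    try:
        return fn(*args)                          # the AVX-512-heavy region
    finally:
        os.sched_setaffinity(0, old_mask)         # non-AVX code runs elsewhere again
```

This confines any frequency reduction triggered by the vector code to the designated cores, at the cost of a possible queueing delay when many threads contend for them.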

Polymorphic Compressed Replication of Columnar Data in Scale-Up Hybrid Memory Systems
M. Zarubin, Patrick Damme, Dirk Habich, Wolfgang Lehner
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398283

In-memory database systems adopting a columnar storage model play a crucial role in data analytics. While these systems keep data entirely in memory for efficiency, the data must also be stored on a non-volatile medium for persistence and fault tolerance. Traditionally, slow block-level devices such as HDDs or SSDs are used, but they can now be replaced by fast, byte-addressable NVRAM. Hybrid memory systems consisting of DRAM and NVRAM thus offer column-oriented database systems a great opportunity to persistently store and efficiently process columnar data exclusively in main memory. However, possible DRAM and NVRAM failures still necessitate the protection of primary data. While data replication is a suitable means, it aggravates the NVRAM endurance problem through increased write activity. To tackle that challenge and reduce the overhead of replication, we propose a novel Polymorphic Compressed Replication (PCR) mechanism that represents replicas using lightweight compression algorithms to reduce NVRAM writes, while supporting different compressed formats for the replicas of one column to facilitate different database operations during query processing. To show feasibility and applicability, we developed an in-memory column-store prototype that transparently employs PCR through an abstract user-space library. Experiments conducted on this prototype show the effectiveness of the proposed PCR mechanism.

Re-Animator: Versatile High-Fidelity Storage-System Tracing and Replaying
I. Akgun, G. Kuenning, E. Zadok
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398276

Modern applications use storage systems in complex and often surprising ways. Tracing system calls is a common approach to understanding applications' behavior, allowing offline analysis and enabling replay in other environments. But current system-call tracing tools have drawbacks: (1) they often omit some information, such as raw data buffers, needed for full analysis; (2) they have high overheads; (3) they often use non-portable trace formats; and (4) they may not offer useful and scalable analysis and replay tools. We have developed Re-Animator, a powerful system-call tracing tool that focuses on storage-related calls and collects maximal information, capturing complete data buffers and writing all traces in the standard DataSeries format. We also created a prototype replayer that focuses on calls related to file-system state. We evaluated our system on long-running server applications such as key-value stores and databases. Our tracer has an average overhead of only 1.8-2.3x, and the overhead can be as low as 5% for I/O-bound applications. Our replayer verifies that its actions are correct, and faithfully reproduces the logical file-system state generated by the original application.

NVMFS-IOzone: Performance Evaluation for the New NVMM-based File Systems
S. Li, Dagang Li, D. Wu, Xiaogang Chen
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398281

With the emergence of NVM (non-volatile memory) technologies, file systems based on NVMM (non-volatile main memory) have attracted more and more attention. Compared to traditional file systems, most NVMM-based file systems bypass the page cache and the I/O software stack. With the new mmap interface known as DAX-mmap (DAX: direct access), the CPU can access the NVMM much faster by loading from and storing to it directly. However, existing file-system benchmark tools are designed for traditional file systems and do not support these new features, so the results they return are often inaccurate. In this paper, we propose a new benchmark tool called NVMFS-IOzone, whose behavior is redesigned to reflect the new features of NVMM-based file systems. Intel's NVM-lib is used instead of the traditional msync() to keep data consistent when evaluating the performance of the DAX-mmap interface. Experimental results show that the new benchmark tool reveals a hidden improvement of 1.4-2.1x in NVMM-based file systems, which cannot be seen with traditional evaluation tools. Data paths for direct load/store to NVMM and for bypassing the CPU cache are also provided to support multidimensional evaluation of NVMM-based file systems. Furthermore, embedded clean-up steps have been added to NVMFS-IOzone to make consistent evaluation convenient, which benefits both NVMM-based and non-NVMM-based file-system benchmarking, even for quick and casual tests. The entire experimental evaluation is based on real physical NVM rather than simulated NVM, and the experimental results confirm the effectiveness of our design.
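To illustrate the data path being benchmarked: under DAX-mmap an application stores directly into the mapping and then forces the data to stable media, where PMDK's pmem_persist() plays the role that msync() plays in the sketch below. The sketch uses an ordinary temporary file as a stand-in for an NVMM-backed DAX file, so the persistence path is only approximated.

```python
import mmap
import os
import tempfile

def mmap_write_persist(path, data):
    """Store bytes directly into a shared mapping, then flush.
    On a real DAX file system the stores would hit persistent memory
    and pmem_persist() would replace the msync()-style flush used here."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, len(data))
        with mmap.mmap(fd, len(data)) as m:
            m[:] = data        # CPU store into the mapping, no read()/write()
            m.flush()          # msync(): force the dirty range to stable media
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "nvmm_standin.bin")
mmap_write_persist(path, b"hello, persistent world")
```

On real NVMM the flush step collapses to cache-line writebacks plus a fence, which is precisely why benchmarks built around the block-I/O path underreport NVMM file-system performance.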

Supporting Transactions for Bulk NFSv4 Compounds
Wei Su, A. Aurora, Ming Chen, E. Zadok
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398275

More applications nowadays use network and cloud storage, and modern network file-system protocols (e.g., NFSv4, SMB) support compounding operations: packing multiple operations into one request. Compounding is known to improve overall throughput and latency by reducing the number of network round trips; it has been reported that by utilizing compounds, NFSv4 performance, especially on high-latency networks, can be improved by orders of magnitude. Alas, with more operations packed into a single message, partial failures become more likely: some server-side operations succeed while others fail to execute. This places a greater challenge on client-side applications to recover from such failures. To solve this and simplify application development, we designed and built TC-NFS, an NFSv4-based network file system with transactional compound execution. We evaluated TC-NFS with different workloads, compounding degrees, and network latencies. Compared to an existing NFSv4 system that fully utilizes compounds, our end-to-end transactional support adds as little as ~1.1% overhead, but as much as ~25x overhead for some intense micro- and macro-workloads.

Memory Elasticity Benchmark
Liran Funaro, Orna Agmon Ben-Yehuda, A. Schuster
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398277

Cloud computing handles a vast share of the world's computing, but it is not as efficient as it could be due to its lack of support for memory elasticity. An environment that supports memory elasticity can dynamically change the size of an application's memory while it is running, thereby optimizing the entire system's use of memory. This requires, however, that at least some applications be memory-elastic: able to cope with memory-size changes enforced on them, making the most of all the memory available to them at any one time. The performance of an ideal memory-elastic application would not be hindered by frequent memory changes; instead, it would depend on global values, such as the total memory it receives over time. Memory elasticity has not been achieved thus far due to a circular dependency: on the one hand, it is difficult to develop computer systems for memory elasticity without proper benchmarking driven by actual applications; on the other, application developers have no incentive to make their applications memory-elastic when real-world systems neither support this property nor reward it economically. To overcome this challenge, we propose a suite of memory-elastic benchmarks and a methodology for evaluating an application's memory-elasticity characteristics. We validate this methodology by using it to accurately predict the performance of an application, with a maximal deviation of 8% on average. The proposed benchmarks and methodology have the potential to help bootstrap computer systems and applications towards memory elasticity.

Scaling Shared Memory Multiprocessing Applications in Non-cache-coherent Domains
Ho-Ren Chuang, Robert Lyerly, Stefan Lankes, B. Ravindran
Proceedings of the 13th ACM International Systems and Storage Conference, May 30, 2020. https://doi.org/10.1145/3383669.3398278

Due to the slowdown of Moore's Law, systems designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult: developers were forced to use programming models that exposed multiple memory regions, requiring them to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming. In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.

BioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic Data
R. Kaplan, L. Yavits, R. Ginosar
Proceedings of the 13th ACM International Systems and Storage Conference. Publication date listed as January 17, 2019. https://doi.org/10.1145/3383669.3398279

Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic-programming algorithm, such as Smith-Waterman. Traditional von Neumann architectures have limited parallelism and cannot provide an efficient solution for large-scale genomic data, so approximate heuristic methods (e.g., BLAST) are commonly used; however, they are suboptimal and still compute-intensive. In this work, we present BioSEAL, a biological sequence alignment accelerator. BioSEAL is a massively parallel, non-von Neumann processing-in-memory architecture for large-scale DNA and protein sequence alignment, based on resistive content-addressable memory capable of energy-efficient, high-performance associative processing. We present an associative processing algorithm for entire-database sequence alignment on BioSEAL and compare its performance and power consumption with state-of-the-art solutions. We show that BioSEAL can achieve up to 57x speedup and 156x better energy efficiency compared with existing solutions for genome sequence alignment and protein sequence database search.
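For reference, the Smith-Waterman kernel that BioSEAL parallelizes is a short dynamic-programming recurrence; a minimal sketch follows. The scoring parameters (match +2, mismatch -1, gap -1) are illustrative assumptions, not the paper's configuration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score of sequences a and b.
    H[i][j] is the best score of any alignment ending at a[i-1], b[j-1];
    the max(0, ...) floor is what makes the alignment local."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0,                  # restart the alignment
                          diag,               # extend along the diagonal
                          H[i-1][j] + gap,    # gap in b
                          H[i][j-1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best

# Identical sequences: four matches at +2 each.
print(smith_waterman("ACGT", "ACGT"))  # prints 8
```

Every cell of H depends only on its three neighbors, which is what makes the recurrence amenable to the massively parallel associative processing BioSEAL performs in memory.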