{"title":"Shared address translation revisited","authors":"Xiaowan Dong, S. Dwarkadas, A. Cox","doi":"10.1145/2901318.2901327","DOIUrl":"https://doi.org/10.1145/2901318.2901327","url":null,"abstract":"Modern operating systems avoid duplication of code and data when they are mapped by multiple processes by sharing physical memory through mechanisms like copy-on-write. Nonetheless, a separate copy of the virtual address translation structures, such as page tables, are still maintained for each process, even if they are identical. This duplication can lead to inefficiencies in the address translation process and interference within the memory hierarchy. In this paper, we show that on Android platforms, sharing address translation structures, specifically, page tables and TLB entries, for shared libraries can improve performance. For example, at a low level, sharing address translation structures reduces the cost of fork by more than half by reducing page table construction overheads. At a higher level, application launch and IPC are faster due to page fault elimination coupled with better cache and TLB performance when context switching.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81958422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A high performance file system for non-volatile main memory","authors":"Jiaxin Ou, J. Shu, Youyou Lu","doi":"10.1145/2901318.2901324","DOIUrl":"https://doi.org/10.1145/2901318.2901324","url":null,"abstract":"Emerging non-volatile main memories (NVMMs) provide data persistence at the main memory level. To avoid the double-copy overheads among the user buffer, the OS page cache, and the storage layer, state-of-the-art NVMM-aware file systems bypass the OS page cache which directly copy data between the user buffer and the NVMM storage. However, one major drawback of existing NVMM technologies is the slow writes. As a result, such direct access for all file operations can lead to suboptimal system performance. In this paper, we propose HiNFS, a high performance file system for non-volatile main memory. Specifically, HiNFS uses an NVMM-aware Write Buffer policy to buffer the lazy-persistent file writes in DRAM and persists them to NVMM lazily to hide the long write latency of NVMM. However, HiNFS performs direct access to NVMM for eager-persistent file writes, and directly reads file data from both DRAM and NVMM as they have similar read performance, in order to eliminate the double-copy overheads from the critical path. To ensure read consistency, HiNFS uses a combination of the DRAM Block Index and Cacheline Bitmap to track the latest data between DRAM and NVMM. Finally, HiNFS employs a Buffer Benefit Model to identify the eager-persistent file writes before issuing the write operations. Using software NVMM emulators, we evaluate HiNFS's performance with various workloads. Comparing with state-of-the-art NVMM-aware file systems - PMFS and EXT4-DAX, surprisingly, our results show that HiNFS improves the system throughput by up to 184% for filebench microbenchmarks and reduces the execution time by up to 64% for data-intensive traces and macro-benchmarks, demonstrating the benefits of hiding the long write latency of NVMM.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85409993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study of modern Linux API usage and compatibility: what to support when you're supporting","authors":"Chia-che Tsai, Bhushan Jain, N. A. Abdul, Donald E. Porter","doi":"10.1145/2901318.2901341","DOIUrl":"https://doi.org/10.1145/2901318.2901341","url":null,"abstract":"This paper presents a study of Linux API usage across all applications and libraries in the Ubuntu Linux 15.04 distribution. We propose metrics for reasoning about the importance of various system APIs, including system calls, pseudo-files, and libc functions. Our metrics are designed for evaluating the relative maturity of a prototype system or compatibility layer, and this paper focuses on compatibility with Linux applications. This study uses a combination of static analysis to understand API usage and survey data to weight the relative importance of applications to end users. This paper yields several insights for developers and researchers, which are useful for assessing the complexity and security of Linux APIs. For example, every Ubuntu installation requires 224 system calls, 208 ioctl, fcntl, and prctl codes and hundreds of pseudo files. For each API type, a significant number of APIs are rarely used, if ever. Moreover, several security-relevant API changes, such as replacing access with faccessat, have met with slow adoption. Finally, hundreds of libc interfaces are effectively unused, yielding opportunities to improve security and efficiency by restructuring libc.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"2017 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82832484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partial-parallel-repair (PPR): a distributed technique for repairing erasure coded storage","authors":"S. Mitra, R. Panta, Moo-Ryong Ra, S. Bagchi","doi":"10.1145/2901318.2901328","DOIUrl":"https://doi.org/10.1145/2901318.2901328","url":null,"abstract":"With the explosion of data in applications all around us, erasure coded storage has emerged as an attractive alternative to replication because even with significantly lower storage overhead, they provide better reliability against data loss. Reed-Solomon code is the most widely used erasure code because it provides maximum reliability for a given storage overhead and is flexible in the choice of coding parameters that determine the achievable reliability. However, reconstruction time for unavailable data becomes prohibitively long mainly because of network bottlenecks. Some proposed solutions either use additional storage or limit the coding parameters that can be used. In this paper, we propose a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation to small partial operations and schedules them on multiple nodes already involved in the data reconstruction. Then a distributed protocol progressively combines these partial results to reconstruct the unavailable data blocks and this technique reduces the network pressure. Theoretically, our technique can complete the network transfer in ⌈(log2(k + 1))⌉ time, compared to k time needed for a (k, m) Reed-Solomon code. Our experiments show that PPR reduces repair time and degraded read time significantly. Moreover, our technique is compatible with existing erasure codes and does not require any additional storage overhead. We demonstrate this by overlaying PPR on top of two prior schemes, Local Reconstruction Code and Rotated Reed-Solomon code, to gain additional savings in reconstruction time.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84774946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and general distributed transactions using RDMA and HTM","authors":"Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, Haibo Chen","doi":"10.1145/2901318.2901349","DOIUrl":"https://doi.org/10.1145/2901318.2901349","url":null,"abstract":"Recent transaction processing systems attempt to leverage advanced hardware features like RDMA and HTM to significantly boost performance, which, however, pose several limitations like requiring priori knowledge of read/write sets of transactions and providing no availability support. In this paper, we present DrTM+R, a fast in-memory transaction processing system that retains the performance benefit from advanced hardware features, while supporting general transactional workloads and high availability through replication. DrTM+R addresses the generality issue by designing a hybrid OCC and locking scheme, which leverages the strong atomicity of HTM and the strong consistency of RDMA to preserve strict serializability with high performance. To resolve the race condition between the immediate visibility of records updated by HTM transactions and the unready replication of such records, DrTM+R leverages an optimistic replication scheme that uses seqlock-like versioning to distinguish the visibility of tuples and the readiness of record replication. Evaluation using typical OLTP workloads like TPC-C and SmallBank shows that DrTM+R scales well on a 6-node cluster and achieves over 5.69 and 94 million transactions per second without replication for TPC-C and SmallBank respectively. Enabling 3-way replication on DrTM+R only incurs at most 41% overhead before reaching network bottleneck, and is still an order-of-magnitude faster than a state-of-the-art distributed transaction system (Calvin).","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78651637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PSLO: enforcing the Xth percentile latency and throughput SLOs for consolidated VM storage","authors":"Ning Li, Hong Jiang, D. Feng, Zhan Shi","doi":"10.1145/2901318.2901330","DOIUrl":"https://doi.org/10.1145/2901318.2901330","url":null,"abstract":"It is desirable but challenging to simultaneously support latency SLO at a pre-defined percentile, i.e., the Xth percentile latency SLO, and throughput SLO for consolidated VM storage. Ensuring the Xth percentile latency contributes to accurately differentiating service levels in the metric of the application-level latency SLO compliance, especially for the application built on multiple VMs. However, the Xth percentile latency SLO and throughput SLO enforcement are the opposite sides of the same coin due to the conflicting requirements for the level of IO concurrency. To address this challenge, this paper proposes PSLO, a framework supporting the Xth percentile latency and throughput SLOs under consolidated VM environment by precisely coordinating the level of IO concurrency and arrival rate for each VM issue queue. It is noted that PSLO can take full advantage of the available IO capacity allowed by SLO constraints to improve throughput or reduce latency with the best effort. We design and implement a PSLO prototype in the real VM consolidation environment created by Xen. Our extensive trace-driven prototype evaluation shows that our system is able to optimize the Xth percentile latency and throughput for consolidated VMs under SLO constraints.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86101352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vScale: automatic and efficient processor scaling for SMP virtual machines","authors":"Luwei Cheng, J. Rao, F. Lau","doi":"10.1145/2901318.2901321","DOIUrl":"https://doi.org/10.1145/2901318.2901321","url":null,"abstract":"SMP virtual machines (VMs) have been deployed extensively in clouds to host multithreaded applications. A widely known problem is that when CPUs are oversubscribed, the scheduling delays due to VM preemption give rise to many performance problems because of the impact of these delays on thread synchronization and I/O efficiency. Dynamically changing the number of virtual CPUs (vCPUs) by considering the available physical CPU (pCPU) cycles has been shown to be a promising approach. Unfortunately, there are currently no efficient mechanisms to support such vCPU-level elasticity. We present vScale, a cross-layer design to enable SMP-VMs to adaptively scale their vCPUs, at the cost of only microseconds. vScale consists of two extremely light-weight mechanisms: i) a generic algorithm in the hypervisor scheduler to compute VMs' CPU extendability, based on their proportional shares and CPU consumptions, and ii) an efficient method in the guest OS to quickly reconfigure the vCPUs. vScale can be tightly integrated with existing OS/hypervisor primitives and has very little management complexity. With our prototype in Xen/Linux, we evaluate vScale's performance with several representative multithreaded applications, including NPB suite, PARSEC suite and Apache web server. The results show that vScale can significantly reduce the VM's waiting time, and thus can accelerate many applications, especially synchronization-intensive ones and I/O-intensive ones.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"94 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73746677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server","authors":"Henggang Cui, H. Zhang, G. Ganger, Phillip B. Gibbons, E. Xing","doi":"10.1145/2901318.2901323","DOIUrl":"https://doi.org/10.1145/2901318.2901323","url":null,"abstract":"Large-scale deep learning requires huge computational resources to train a multi-layer neural network. Recent systems propose using 100s to 1000s of machines to train networks with tens of layers and billions of connections. While the computation involved can be done more efficiently on GPUs than on more traditional CPU cores, training such networks on a single GPU is too slow and training on distributed GPUs can be inefficient, due to data movement overheads, GPU stalls, and limited GPU memory. This paper describes a new parameter server, called GeePS, that supports scalable deep learning across GPUs distributed among multiple machines, overcoming these obstacles. We show that GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code). Moreover, GeePS achieves a higher training throughput with just four GPU machines than that a state-of-the-art CPU-only system achieves with 108 machines.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"74 3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80812248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hold 'em or fold 'em?: aggregation queries under performance variations","authors":"Gautam Kumar, G. Ananthanarayanan, S. Ratnasamy, I. Stoica","doi":"10.1145/2901318.2901351","DOIUrl":"https://doi.org/10.1145/2901318.2901351","url":null,"abstract":"Systems are increasingly required to provide responses to queries, even if not exact, within stringent time deadlines. These systems parallelize computations over many processes and aggregate them hierarchically to get the final response (e.g., search engines and data analytics). Due to large performance variations in clusters, some processes are slower. Therefore, aggregators are faced with the question of how long to wait for outputs from processes before combining and sending them upstream. Longer waits increase the response quality as it would include outputs from more processes. However, it also increases the risk of the aggregator failing to provide its result by the deadline. This leads to all its results being ignored, degrading response quality. Our algorithm, Cedar, proposes a solution to this quandary of deciding wait durations at aggregators. It uses an online algorithm to learn distributions of durations at each level in the hierarchy and collectively optimizes the wait duration. Cedar's solution is theoretically sound, fully distributed, and generically applicable across systems that use aggregation trees since it is agnostic to the causes of performance variations. Evaluation using production latency distributions from Google, Microsoft and Facebook using deployment and simulation shows that Cedar improves average response quality by over 100%.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82476787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TFC: token flow control in data center networks","authors":"Jiao Zhang, Fengyuan Ren, Ran Shu, Peng Cheng","doi":"10.1145/2901318.2901336","DOIUrl":"https://doi.org/10.1145/2901318.2901336","url":null,"abstract":"Services in modern data center networks pose growing performance demands. However, the widely existed special traffic patterns, such as micro-burst, highly concurrent flows, on-off pattern of flow transmission, exacerbate the performance of transport protocols. In this work, an clean-slate explicit transport control mechanism, called Token Flow Control (TFC), is proposed for data center networks to achieve high link utilization, ultra-low latency, fast convergence, and rare packets dropping. TFC uses tokens to represent the link bandwidth resource and define the concept of effective flows to stand for consumers. The total tokens will be explicitly allocated to each consumer every time slot. TFC excludes in-network buffer space from the flow pipeline and thus achieves zero-queueing. Besides, a packet delay function is added at switches to prevent packets dropping with highly concurrent flows. The performance of TFC is evaluated using both experiments on a small real testbed and large-scale simulations. The results show that TFC achieves high throughput, fast convergence, near zero-queuing and rare packets loss in various scenarios.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88238550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}