{"title":"The Unwritten Contract of Solid State Drives","authors":"Jun He, Sudarsun Kannan, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau","doi":"10.1145/3064176.3064187","DOIUrl":"https://doi.org/10.1145/3064176.3064187","url":null,"abstract":"We perform a detailed vertical analysis of application performance atop a range of modern file systems and SSD FTLs. We formalize the \"unwritten contract\" that clients of SSDs should follow to obtain high performance, and conduct our analysis to uncover application and file system designs that violate the contract. Our analysis, which utilizes a highly detailed SSD simulation underneath traces taken from real workloads and file systems, provides insight into how to better construct applications, file systems, and FTLs to realize robust and sustainable performance.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125048944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GfxDoctor: A Holistic Graphics Energy Profiler for Mobile Devices","authors":"Ning Ding, Y. C. Hu","doi":"10.1145/3064176.3064206","DOIUrl":"https://doi.org/10.1145/3064176.3064206","url":null,"abstract":"Graphics is one of the major energy drain sources in smartphone apps. To optimize the app graphics energy, however, developers face the challenge of a highly complex graphics rendering process, which involves multiple system layers including the app, the framework, the GPU, and the asynchronous interactions among them. Current diagnostic tools can profile the resource usage from certain layers, but fall short in stitching together profiling information across all the layers, which is needed to provide developers with the visual effect-energy tradeoff at the app source-code level. In this paper, we design and implement a holistic graphics energy diagnosis tool, GfxDoctor, that helps developers to systematically diagnose energy inefficiencies in app graphics at the app source-code level, by precisely quantifying (1) the visual effect of each UI update, and (2) the aggregate energy drain spent in traversing the entire frame rendering stack due to each UI update. GfxDoctor overcomes three challenges faced in deriving per-UI-update visual effect and energy accounting: asynchrony across system layers, UI update batching, and \"black-box\" GPU, with two key techniques -- lightweight view-frame-ID-based information flow tracking, and OpenGL record-and-replay plus frame diffing. We show the effectiveness of GfxDoctor by profiling a randomly sampled set of 30 popular Android apps which reveals three types of graphics energy bugs happening in 8 out of the 30 apps. Removing these bugs reduces the app energy drain by 46% to 90%.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115761535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mosaic: Processing a Trillion-Edge Graph on a Single Machine","authors":"Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woon-Hak Kang, Mohan Kumar, Taesoo Kim","doi":"10.1145/3064176.3064191","DOIUrl":"https://doi.org/10.1145/3064176.3064191","url":null,"abstract":"Processing a one trillion-edge graph has recently been demonstrated by distributed graph engines running on clusters of tens to hundreds of nodes. In this paper, we employ a single heterogeneous machine with fast storage media (e.g., NVMe SSD) and massively parallel coprocessors (e.g., Xeon Phi) to reach similar dimensions. By fully exploiting the heterogeneous devices, we design a new graph processing engine, named Mosaic, for a single machine. We propose a new locality-optimizing, space-efficient graph representation---Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations in fast host processors and edge-centric operations in massively parallel coprocessors. Our evaluation shows that for smaller graphs, Mosaic consistently outperforms other state-of-the-art out-of-core engines by 3.2-58.6x and shows comparable performance to distributed graph engines. Furthermore, Mosaic can complete one iteration of the Pagerank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2×.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"82 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130874624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Empirical Study on the Correctness of Formally Verified Distributed Systems","authors":"Pedro Fonseca, Kaiyuan Zhang, Xi Wang, A. Krishnamurthy","doi":"10.1145/3064176.3064183","DOIUrl":"https://doi.org/10.1145/3064176.3064183","url":null,"abstract":"Recent advances in formal verification techniques enabled the implementation of distributed systems with machine-checked proofs. While results are encouraging, the importance of distributed systems warrants a large scale evaluation of the results and verification practices. This paper thoroughly analyzes three state-of-the-art, formally verified implementations of distributed systems: IronFleet, Verdi, and Chapar. Through code review and testing, we found a total of 16 bugs, many of which produce serious consequences, including crashing servers, returning incorrect results to clients, and invalidating verification guarantees. These bugs were caused by violations of a wide range of assumptions on which the verified components relied. Our results revealed that these assumptions referred to a small fraction of the trusted computing base, mostly at the interface of verified and unverified components. Based on our observations, we have built a testing toolkit called PK, which focuses on testing these parts and is able to automate the detection of 13 (out of 16) bugs.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127006333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct Inter-Process Communication (dIPC): Repurposing the CODOMs Architecture to Accelerate IPC","authors":"L. Vilanova, Marc Jordà, N. Navarro, Yoav Etsion, M. Valero","doi":"10.1145/3064176.3064197","DOIUrl":"https://doi.org/10.1145/3064176.3064197","url":null,"abstract":"In current architectures, page tables are the fundamental mechanism that allows contemporary OSs to isolate user processes, binding each thread to a specific page table. A thread cannot therefore directly call another process's function or access its data; instead, the OS kernel provides data communication primitives and mediates process synchronization through inter-process communication (IPC) channels, which impede system performance. Alternatively, the recently proposed CODOMs architecture provides memory protection across software modules. Threads can cross module protection boundaries inside the same process using simple procedure calls, while preserving memory isolation. We present dIPC (for \"direct IPC\"), an OS extension that repurposes and extends the CODOMs architecture to allow threads to cross process boundaries. It maps processes into a shared address space, and eliminates the OS kernel from the critical path of inter-process communication. dIPC is 64.12× faster than local remote procedure calls (RPCs), and 8.87× faster than IPC in the L4 microkernel. We show that applying dIPC to a multi-tier OLTP web server improves performance by up to 5.12× (2.13× on average), and reaches over 94% of the ideal system efficiency.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128177894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Abstracting Multi-Core Topologies with MCTOP","authors":"Georgios Chatzopoulos, R. Guerraoui, T. Harris, Vasileios Trigonakis","doi":"10.1145/3064176.3064194","DOIUrl":"https://doi.org/10.1145/3064176.3064194","url":null,"abstract":"Portability and efficiency are usually antagonists in multi-core computing. In order to develop efficient code, one needs to take into account the topology of the target multi-cores (e.g., for locality). This clearly hampers code portability. In this paper, we show that you can have the cake and eat it too. We introduce MCTOP, an abstraction of multi-core topologies augmented with important low-level hardware information, such as memory bandwidths and communication latencies. We show how to automatically generate MCTOP using libmctop, our library that leverages the determinism of cache-coherence protocols to infer the topology of multi-cores using only latency measurements. MCTOP enables developers to accurately and portably define high-level performance optimization policies. We illustrate several such policies through four examples: (i-ii) thread placement in OpenMP and in a MapReduce library, (iii) a topology-aware mergesort algorithm, as well as (iv) automatic backoff schemes for locks. We illustrate the portability of these optimizations on five processors from Intel, AMD, and Oracle, with low effort.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Saturn: a Distributed Metadata Service for Causal Consistency","authors":"Manuel Bravo, L. Rodrigues, P. V. Roy","doi":"10.1145/3064176.3064210","DOIUrl":"https://doi.org/10.1145/3064176.3064210","url":null,"abstract":"This paper presents the design, implementation, and evaluation of Saturn, a metadata service for geo-replicated systems. Saturn can be used in combination with several distributed and replicated data services to ensure that remote operations are made visible in an order that respects causality, a requirement central to many consistency criteria. Saturn addresses two key unsolved problems inherent to previous approaches. First, it eliminates the tradeoff between throughput and data freshness, when deciding what metadata to use for tracking causality. Second, it enables genuine partial replication, a key property to ensure scalability when the number of geo-locations increases. Saturn addresses these challenges while keeping metadata size constant, independently of the number of clients, servers, data partitions, and locations. By decoupling metadata management from data dissemination, and by using clever metadata propagation techniques, it ensures that the throughput and visibility latency of updates on a given item are (mostly) shielded from operations on other items or locations. We evaluate Saturn in Amazon EC2 using realistic benchmarks under both full and partial geo-replication. Results show that weakly consistent datastores can lean on Saturn to upgrade their consistency guarantees to causal consistency with a negligible penalty on performance.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127957724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The lock holder and the lock waiter pre-emption problems: nip them in the bud using informed spinlocks (I-Spinlock)","authors":"Boris Teabe, Vlad Nitu, A. Tchana, D. Hagimont","doi":"10.1145/3064176.3064180","DOIUrl":"https://doi.org/10.1145/3064176.3064180","url":null,"abstract":"In native Linux systems, the spinlock implementation relies on the assumption that both the lock holder thread and lock waiter threads cannot be preempted. However, in a virtualized environment, these threads are scheduled on top of virtual CPUs (vCPU) that can be preempted by the hypervisor at any time, thus forcing lock waiter threads on other vCPUs to busy wait and to waste CPU cycles. This leads to the well-known Lock Holder Preemption (LHP) and Lock Waiter Preemption (LWP) issues. In this paper, we propose I-Spinlock (for Informed Spinlock), a new spinlock implementation for virtualized environments. Its main principle is to allow a thread to acquire a lock if and only if the remaining time-slice of its vCPU is sufficient to enter and leave the critical section. This is possible if the spinlock primitive is aware (informed) of its time-to-preemption (by the hypervisor). We implemented I-Spinlock in the Xen virtualization system. We show that our solution is compliant with both para-virtual and hardware virtualization modes. We performed extensive performance evaluations with various reference benchmarks and compared our solution to previous solutions. The evaluations demonstrate that I-Spinlock outperforms other solutions, and more significantly when the number of cores increases.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122994826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROS: A Rack-based Optical Storage System with Inline Accessibility for Long-Term Data Preservation","authors":"Wenrui Yan, Jie Yao, Q. Cao, C. Xie, Hong Jiang","doi":"10.1145/3064176.3064207","DOIUrl":"https://doi.org/10.1145/3064176.3064207","url":null,"abstract":"The combination of the explosive growth in digital data and the need to preserve much of this data in the long term has made it imperative to find a way to store massive amounts of data that is more cost-effective than HDD arrays and more easily accessible than tape libraries. While modern optical discs are capable of guaranteeing more than 50-year data preservation without migration, individual optical discs' lack of performance and capacity relative to HDDs or tapes has significantly limited their use in datacenters. This paper presents a Rack-scale Optical disc library System, or ROS for short, that provides a PB-level total capacity and inline accessibility on thousands of optical discs built within a 42U Rack. A rotatable roller and robotic arm separating and fetching the discs are designed to improve disc placement density and simplify the mechanical structure. A hierarchical storage system based on SSD, hard disks and optical discs is presented to hide the delay of mechanical operation. In addition, an optical library file system is proposed to schedule mechanical operation and organize data on the tiered storage with a POSIX user interface to provide an illusion of inline data accessibility. We evaluate ROS on a few key performance metrics including operation delays of the mechanical structure and software overhead in a prototype PB-level ROS system. The results show that ROS stacked on Samba and FUSE can provide almost 323MB/s read and 236MB/s write throughput, about 53ms file write and 15ms read latency via 10GbE network for external users, exhibiting its inline accessibility. Besides, ROS is able to effectively hide and virtualize internal complex operational behaviors and be easily deployable in datacenters.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132296195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Atomic In-place Updates for Non-volatile Main Memories with Kamino-Tx","authors":"Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee, Yanqi Zhou, R. Alagappan, K. Strauss, S. Swanson","doi":"10.1145/3064176.3064215","DOIUrl":"https://doi.org/10.1145/3064176.3064215","url":null,"abstract":"Data structures for non-volatile memories have to be designed such that they can be atomically modified using transactions. Existing atomicity methods require data to be copied in the critical path which significantly increases the latency of transactions. These overheads are further amplified for transactions on byte-addressable persistent memories where often the byte ranges modified for data structure updates are significantly smaller compared to the granularity at which data can be efficiently copied and logged. We propose Kamino-Tx that provides a new way to perform transactional updates on non-volatile byte-addressable memories (NVM) without requiring any copying of data in the critical path. Kamino-Tx maintains an additional copy of data off the critical path to achieve atomicity. But in doing so Kamino-Tx has to overcome two important challenges of safety and minimizing NVM storage overhead. We propose a more dynamic approach to maintaining the additional copy of data to reduce storage overheads. To further mitigate the storage overhead of using Kamino-Tx in a replicated setting, we develop Kamino-Tx-Chain, a variant of Chain Replication where replicas perform in-place updates and do not maintain data copies locally; replicas in Kamino-Tx-Chain leverage other replicas as copies to roll back or forward for atomicity. Our results show that using Kamino-Tx increases throughput by up to 9.5x for unreplicated systems and up to 2.2x for replicated settings.","PeriodicalId":262089,"journal":{"name":"Proceedings of the Twelfth European Conference on Computer Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129555222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}