{"title":"Fast and strongly-consistent per-item resilience in key-value stores","authors":"Konstantin Taranov, G. Alonso, T. Hoefler","doi":"10.1145/3190508.3190536","DOIUrl":"https://doi.org/10.1145/3190508.3190536","url":null,"abstract":"In-memory key-value stores (KVSs) provide different forms of resilience through basic r-way replication and complex erasure codes such as Reed-Solomon. Each storage scheme exhibits different tradeoffs in terms of reliability and resources used (memory, network load, latency, storage required, etc.). Unfortunately, most KVSs support only a single such storage scheme, forcing designers to employ different KVSs for different applications. To address this problem, we have designed a strongly consistent in-memory KVS, Ring, that empowers its users to set the level of resilience on a KV pair basis while still maintaining overall consistency and without compromising efficiency. At the heart of Ring lies a novel encoding scheme, Stretched Reed-Solomon coding, that combines hash key distributions of heterogeneous replication and erasure coding schemes. Ring utilizes RDMA to ensure low latencies and offload communication tasks. Its latency, bandwidth, and throughput are comparable to state-of-the-art systems that do not support changing resilience and, thus, have much higher memory overheads. We show use cases that demonstrate significant memory savings and discuss trade-offs between reliability, performance, and cost. Our work demonstrates how future applications that consciously manage resilience of KV pairs can reduce the overall operational cost and significantly improve the performance of KVS deployments.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129201827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimus: an efficient dynamic resource scheduler for deep learning clusters","authors":"Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, Chuanxiong Guo","doi":"10.1145/3190508.3190517","DOIUrl":"https://doi.org/10.1145/3190508.3190517","url":null,"abstract":"Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is the key to the maximal performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically specify a fixed amount of resources for each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job. Based on the models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133210313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hyperledger fabric: a distributed operating system for permissioned blockchains","authors":"Elli Androulaki, Artem Barger, V. Bortnikov, C. Cachin, K. Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, S. Muralidharan, Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith A. Smith, A. Sorniotti, C. Stathakopoulou, M. Vukolic, S. Cocco, Jason Yellick","doi":"10.1145/3190508.3190538","DOIUrl":"https://doi.org/10.1145/3190508.3190538","url":null,"abstract":"Fabric is a modular and extensible open-source system for deploying and operating permissioned blockchains and one of the Hyperledger projects hosted by the Linux Foundation (www.hyperledger.org). Fabric is the first truly extensible blockchain system for running distributed applications. It supports modular consensus protocols, which allows the system to be tailored to particular use cases and trust models. Fabric is also the first blockchain system that runs distributed applications written in standard, general-purpose programming languages, without systemic dependency on a native cryptocurrency. This stands in sharp contrast to existing blockchain platforms that require \"smart-contracts\" to be written in domain-specific languages or rely on a cryptocurrency. Fabric realizes the permissioned model using a portable notion of membership, which may be integrated with industry-standard identity management. To support such flexibility, Fabric introduces an entirely novel blockchain design and revamps the way blockchains cope with non-determinism, resource exhaustion, and performance attacks. This paper describes Fabric, its architecture, the rationale behind various design decisions, its most prominent implementation aspects, as well as its distributed application programming model. We further evaluate Fabric by implementing and benchmarking a Bitcoin-inspired digital currency. We show that Fabric achieves end-to-end throughput of more than 3500 transactions per second in certain popular deployment configurations, with sub-second latency, scaling well to over 100 peers.","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131095867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Thirteenth EuroSys Conference","authors":"","doi":"10.1145/3190508","DOIUrl":"https://doi.org/10.1145/3190508","url":null,"abstract":"","PeriodicalId":334267,"journal":{"name":"Proceedings of the Thirteenth EuroSys Conference","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123583832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}