Operating Systems Review (ACM): Latest Articles (Page 2)

Enabling Practical Cloud Performance Debugging with Unsupervised Learning

Operating Systems Review (ACM) Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544503

Yu Gan, Mingyu Liang, Sundar Dev, David Lo, Christina Delimitrou

Citations: 1

Pharos

Operating Systems Review (ACM) Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544505

Srinivas Vippagunta, Ken Finnigan, K. Pusukuri

Citations: 1

One Profile Fits All

Operating Systems Review (ACM) Pub Date : 2022-06-14 DOI: 10.1145/3544497.3544502

Muhammed Ugur, Cheng Jiang, Alex Erf, Tanvir Ahmed Khan, Baris Kasikci

Citations: 1

VAIF: Variance-driven Automated Instrumentation Framework

Operating Systems Review (ACM) Pub Date : 2022-01-01 DOI: 10.1145/3544497.3544504

Mert Toslali, E. Ates, Darby Huye, Alex Ellis, Zhao Zhang, Lan Liu, Samantha Puterman, A. Coskun, Raja R. Sambasivan

Citations: 0

Moneo: Monitoring Fine-grained Metrics Nonintrusively in AI Infrastructure

Operating Systems Review (ACM) Pub Date : 2022-01-01 DOI: 10.1145/3544497.3544501

Yuting Jiang, Yifan Xiong, L. Qu, Cheng Luo, Chen Tian, Peng Cheng, Y. Xiong

Abstract: Cloud-based AI infrastructure is becoming increasingly important, especially for large-scale distributed training. To improve its efficiency and serviceability, real-time monitoring of the infrastructure and workload profiling have proved effective in practice. However, the cloud environment poses great challenges: service providers cannot interfere with their tenants' workloads or touch user data, so previous instrumentation-based monitoring approaches cannot be applied, nor can workload traces be collected. In this paper, we propose Moneo, a non-intrusive, cloud-friendly monitoring system for AI infrastructure. Moneo intelligently collects the key architecture-level metrics at a finer granularity in real time without instrumenting or tracing the workloads, and it has been deployed in a real production cloud, Azure. We analyze the results reported by Moneo for typical large-scale distributed AI workloads from real deployments. The results demonstrate that Moneo can help service providers understand the real resource usage patterns and networking requirements of various AI workloads, yielding findings that help improve the efficiency of cloud infrastructure and optimize the software stack for the characteristic resource usage of different AI workloads. This is a revised version of the symposium paper [23] originally presented at IEEE ICC 2022.

Volume 56, issue 1, pages 18-25.

Citations: 2

IODA

Operating Systems Review (ACM) Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483573

Huaicheng Li, Martin L. Putra, Ronald Shi, Xing Lin, G. Ganger, Haryadi S. Gunawi

Abstract: Predictable latency on flash storage is a long-pursued goal, yet unpredictability persists because of unavoidable disturbance from many well-known SSD internal activities. To combat this issue, the recent NVMe IO Determinism (IOD) interface advocates host-level control over SSD internal management tasks. While promising, challenges remain in how to exploit it for truly predictable performance. We present IODA, an I/O-deterministic flash array design built on top of small but powerful extensions to the IOD interface for easy deployment. IODA exploits data redundancy in the context of IOD for a strong latency predictability contract. In IODA, SSDs are expected to quickly fail an I/O on purpose to allow predictable I/Os through proactive data reconstruction. In the case of concurrent internal operations, IODA introduces busy-remaining-time exposure and a predictable-latency-window formulation to guarantee predictable data reconstructions. Overall, IODA adds only 5 new fields to the NVMe interface and a small modification to the flash firmware, while keeping most of the complexity in the host OS. Our evaluation shows that IODA improves the 95th to 99.99th percentile latencies by up to 75x. IODA is also the nearest to the ideal, no-disturbance case compared to 7 state-of-the-art preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.

Volume 112, issue 1.
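
The fail-fast-then-reconstruct idea in the abstract above can be illustrated with a minimal sketch. This is not IODA's implementation or the NVMe IOD interface: the `Device` class, its `busy` flag, and `read_chunk` are hypothetical names, and single XOR parity (RAID-5 style) is assumed as the form of redundancy. A busy device returns immediately instead of stalling, and the host rebuilds the chunk from the remaining devices.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class Device:
    """Hypothetical stand-in for one SSD holding one chunk of a stripe."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.busy = False  # stands in for internal activity such as GC

    def read(self) -> Optional[bytes]:
        # Fail fast (return None) instead of making the caller wait.
        return None if self.busy else self.data


def read_chunk(stripe: List[Device], index: int) -> bytes:
    """Read one chunk; if its device fails fast, rebuild it from the rest."""
    data = stripe[index].read()
    if data is not None:
        return data
    others = [dev.read() for i, dev in enumerate(stripe) if i != index]
    if any(block is None for block in others):
        # Single parity can only cover one unavailable device per stripe.
        raise RuntimeError("more than one busy device in the stripe")
    return xor_blocks(others)


if __name__ == "__main__":
    chunks = [b"abcd", b"wxyz", b"1234"]
    stripe = [Device(c) for c in chunks] + [Device(xor_blocks(chunks))]
    stripe[1].busy = True
    print(read_chunk(stripe, 1))  # prints b'wxyz'
```

The sketch simply raises an error when two devices are busy at once; the abstract indicates that IODA addresses concurrent internal operations with busy-remaining-time exposure and predictable latency windows, which are not modeled here.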

Citations: 21

dSpace

Operating Systems Review (ACM) Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483559

Silvery Fu, S. Ratnasamy

Citations: 23

Caracal

Operating Systems Review (ACM) Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483591

Dai Qin, Angela Demke Brown, Ashvin Goel

Abstract: Deterministic databases offer several benefits: they ensure serializable execution while avoiding concurrency-control-related aborts, and they scale well in distributed environments. Today, most deterministic database designs use partitioning to scale up and avoid contention. However, partitioning requires significant programmer effort, leads to poor performance under skewed workloads, and incurs unnecessary overheads in certain uncontended workloads. We present the design of Caracal, a novel shared-memory, deterministic database that performs well under both skew and contention. Our deterministic scheme batches transactions in epochs and executes the transactions in an epoch in a predetermined order. The scheme reduces contention by batching concurrency-control operations. It also allows analyzing the transactions in the epoch to determine contended keys accurately. Certain transactions can then be split into independent contended and uncontended pieces and run deterministically and in parallel, further reducing contention. Based on these ideas, we present two novel optimizations, batch append and split-on-demand, for managing contention. With these optimizations, Caracal scales well and outperforms existing deterministic schemes in most workloads by 1.9x to 9.7x.

Volume 53, issue 1.
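
Two ideas from the abstract above, executing an epoch in a predetermined order and analyzing the epoch to find contended keys, can be sketched as follows. This is an illustrative toy, not Caracal's design: `Txn`, `run_epoch`, `contended_keys`, and the `threshold` parameter are hypothetical, execution is serial rather than parallel, and batch append and split-on-demand are not modeled.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Store = Dict[str, int]


@dataclass(frozen=True)
class Txn:
    txn_id: int                      # hypothetical serial number fixing the order
    write_keys: Tuple[str, ...]      # keys the transaction declares it will write
    apply: Callable[[Store], None]   # the transaction's logic


def contended_keys(epoch: List[Txn], threshold: int) -> Set[str]:
    """Keys written by at least `threshold` transactions in this epoch."""
    counts = Counter(key for txn in epoch for key in txn.write_keys)
    return {key for key, n in counts.items() if n >= threshold}


def run_epoch(store: Store, epoch: List[Txn]) -> None:
    """Apply the epoch in txn_id order, independent of arrival order."""
    for txn in sorted(epoch, key=lambda t: t.txn_id):
        txn.apply(store)


def add(key: str, n: int) -> Callable[[Store], None]:
    def apply(store: Store) -> None:
        store[key] = store.get(key, 0) + n
    return apply


def double(key: str) -> Callable[[Store], None]:
    def apply(store: Store) -> None:
        store[key] = store.get(key, 0) * 2
    return apply


if __name__ == "__main__":
    epoch = [
        Txn(2, ("x",), double("x")),
        Txn(1, ("x",), add("x", 1)),
        Txn(3, ("y",), add("y", 5)),
    ]
    store: Store = {}
    run_epoch(store, epoch)
    print(store)                      # prints {'x': 2, 'y': 5}
    print(contended_keys(epoch, 2))   # prints {'x'}
```

Because the order is fixed by `txn_id` before anything runs, two replicas that receive the same epoch in different arrival orders end in the same state, and no transaction has to abort because of a concurrency-control conflict.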

Citations: 2

Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications

Operating Systems Review (ACM) Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483546

Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, K. Huang, Yatpang Cheung, Yiding Zhou, K. Veeraraghavan, Biren Damani, Pol Mauri Ruiz, V. Mehta, Chunqiang Tang

Abstract: Sharding is widely used to scale an application. Despite a decade of effort to build generic sharding frameworks that can be reused across different applications, the extent of their success remains unclear. We attempt to answer a fundamental question: what barriers prevent a sharding framework from being adopted by the majority of sharded applications? We analyze hundreds of sharded applications at Facebook and identify two major barriers: 1) lack of support for geo-distributed applications, which account for most of Facebook's applications, and 2) inability to maintain application availability during planned events such as software upgrades, which happen ≈1000 times more frequently than unplanned failures. A sharding framework that does not help applications address these fundamental challenges is not sufficiently attractive for most applications to adopt. Other adoption barriers include the burden of supporting many complex applications in a one-size-fits-all sharding framework and the difficulty of supporting sophisticated shard-placement requirements. Theoretically, a constraint solver can handle complex placement requirements, but in practice it is not scalable enough to perform near-realtime shard placement at a global scale. We have overcome these adoption barriers in Facebook's sharding framework, called Shard Manager. Currently, Shard Manager is used by hundreds of applications running on over one million machines, which account for about 54% of all sharded applications at Facebook.

Volume 25, issue 1.

Citations: 15

Cuckoo Trie: Exploiting Memory-Level Parallelism for Efficient DRAM Indexing

Operating Systems Review (ACM) Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483551

Adar Zeitak, Adam Morrison

Citations: 4