{"title":"PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation","authors":"Zhuobin Huang, Xingda Wei, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen","doi":"arxiv-2405.12079","DOIUrl":"https://doi.org/arxiv-2405.12079","url":null,"abstract":"Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is\u0000an OS-level GPU C/R system: It can transparently checkpoint or restore\u0000processes that use the GPU, without requiring any cooperation from the\u0000application, a key feature required by modern systems like the cloud. Moreover,\u0000POS is the first OS-level C/R system that can concurrently execute C/R with the\u0000application execution: a critical feature that can be trivially achieved when\u0000the processes only running on the CPU, but becomes challenging when the\u0000processes use GPU. The problem is how to ensure consistency during concurrent\u0000execution with the lack of application semantics due to transparency. CPU\u0000processes can leverage OS and hardware paging to fix inconsistency without\u0000application semantics. Unfortunately, GPU bypasses OS and paging for high\u0000performance. POS fills the semantic gap by speculatively extracting buffer\u0000access information of GPU kernels during runtime. Thanks to the simple and\u0000well-structured nature of GPU kernels, our speculative extraction (with runtime\u0000validation) achieves 100% accuracy on applications from training to inference\u0000whose domains span from vision, large language models, and reinforcement\u0000learning. Based on the extracted semantics, we systematically overlap C/R with\u0000application execution, and achieves orders of magnitude higher performance\u0000under various tasks compared with the state-of-the-art OS-level GPU C/R,\u0000including training fault tolerance, live GPU process migration, and cold starts\u0000acceleration in GPU-based serverless computing.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141147277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Machine Learning for Accurate IoT Device Identification in Dynamic Wireless Contexts","authors":"Bhagyashri Tushir, Vikram K Ramanna, Yuhong Liu, Behnam Dezfouli","doi":"arxiv-2405.17442","DOIUrl":"https://doi.org/arxiv-2405.17442","url":null,"abstract":"Identifying IoT devices is crucial for network monitoring, security\u0000enforcement, and inventory tracking. However, most existing identification\u0000methods rely on deep packet inspection, which raises privacy concerns and adds\u0000computational complexity. More importantly, existing works overlook the impact\u0000of wireless channel dynamics on the accuracy of layer-2 features, thereby\u0000limiting their effectiveness in real-world scenarios. In this work, we define\u0000and use the latency of specific probe-response packet exchanges, referred to as\u0000\"device latency,\" as the main feature for device identification. Additionally,\u0000we reveal the critical impact of wireless channel dynamics on the accuracy of\u0000device identification based on device latency. Specifically, this work\u0000introduces \"accumulation score\" as a novel approach to capturing fine-grained\u0000channel dynamics and their impact on device latency when training machine\u0000learning models. We implement the proposed methods and measure the accuracy and\u0000overhead of device identification in real-world scenarios. The results confirm\u0000that by incorporating the accumulation score for balanced data collection and\u0000training machine learning algorithms, we achieve an F1 score of over 97% for\u0000device identification, even amidst wireless channel dynamics, a significant\u0000improvement over the 75% F1 score achieved by disregarding the impact of\u0000channel dynamics on data collection and device latency.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Task Scheduling in Heterogeneous Computing Environments: A Comparative Analysis of CPU, GPU, and ASIC Platforms Using E2C Simulator","authors":"Ali Mohammadjafari, Poorya Khajouie","doi":"arxiv-2405.08187","DOIUrl":"https://doi.org/arxiv-2405.08187","url":null,"abstract":"Efficient task scheduling in heterogeneous computing environments is\u0000imperative for optimizing resource utilization and minimizing task completion\u0000times. In this study, we conducted a comprehensive benchmarking analysis to\u0000evaluate the performance of four scheduling algorithms First Come, First-Served\u0000(FCFS), FCFS with No Queuing (FCFS-NQ), Minimum Expected Completion Time\u0000(MECT), and Minimum Expected Execution Time (MEET) across varying workload\u0000scenarios. We defined three workload scenarios: low, medium, and high, each\u0000representing different levels of computational demands. Through rigorous\u0000experimentation and analysis, we assessed the effectiveness of each algorithm\u0000in terms of total completion percentage, energy consumption, wasted energy, and\u0000energy per completion. Our findings highlight the strengths and limitations of\u0000each algorithm, with MECT and MEET emerging as robust contenders, dynamically\u0000prioritizing tasks based on comprehensive estimates of completion and execution\u0000times. Furthermore, MECT and MEET exhibit superior energy efficiency compared\u0000to FCFS and FCFS-NQ, underscoring their suitability for resource-constrained\u0000environments. This study provides valuable insights into the efficacy of task\u0000scheduling algorithms in heterogeneous computing environments, enabling\u0000informed decision-making to enhance resource allocation, minimize task\u0000completion times, and improve energy efficiency","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-consistency root emulation for unprivileged container image build","authors":"Reid PriedhorskyLos Alamos National Laboratory, Michael JenningsLos Alamos National Laboratory, Megan Phinney","doi":"arxiv-2405.06085","DOIUrl":"https://doi.org/arxiv-2405.06085","url":null,"abstract":"Do Linux distribution package managers need the privileged operations they\u0000request to actually happen? Apparently not, at least for building container\u0000images for HPC applications. We use this observation to implement a root\u0000emulation mode using a Linux seccomp filter that intercepts some privileged\u0000system calls, does nothing, and returns success to the calling program. This\u0000approach provides no consistency whatsoever but appears sufficient to build all\u0000Dockerfiles we examined, simplifying fully-unprivileged workflows needed for\u0000HPC application containers.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140929077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"uTNT: Unikernels for Efficient and Flexible Internet Probing","authors":"Maxime Letemple, Gaulthier Gain, Sami Ben Mariem, Laurent Mathy, Benoit Donnet","doi":"arxiv-2405.04036","DOIUrl":"https://doi.org/arxiv-2405.04036","url":null,"abstract":"The last twenty years have seen the development and popularity of network\u0000measurement infrastructures. Internet measurement platforms have become common\u0000and have demonstrated their relevance in Internet understanding and security\u0000observation. However, despite their popularity, those platforms lack of\u0000flexibility and reactivity, as they are usually used for longitudinal\u0000measurements. As a consequence, they may miss detecting events that are\u0000security or Internet-related. During the same period, operating systems have\u0000evolved to virtual machines (VMs) as self-contained units for running\u0000applications, with the recent rise of unikernels, ultra-lightweight VMs\u0000tailored for specific applications, eliminating the need for a host OS. In this\u0000paper, we advocate that measurement infrastructures could take advantage of\u0000unikernels to become more flexible and efficient. We propose uTNT, a\u0000proof-of-concept unikernel-based implementation of TNT, a traceroute extension\u0000able to reveal MPLS tunnels. This paper documents the full toolchain for\u0000porting TNT into a unikernel and evaluates uTNT performance with respect to\u0000more traditional approaches. The paper also discusses a use case in which uTNT\u0000could find a suitable usage. uTNT source code is publicly available on Gitlab.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140928991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention","authors":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar","doi":"arxiv-2405.04437","DOIUrl":"https://doi.org/arxiv-2405.04437","url":null,"abstract":"Efficient use of GPU memory is essential for high throughput LLM inference.\u0000Prior systems reserved memory for the KV-cache ahead-of-time, resulting in\u0000wasted capacity due to internal fragmentation. Inspired by OS-based virtual\u0000memory systems, vLLM proposed PagedAttention to enable dynamic memory\u0000allocation for KV-cache. This approach eliminates fragmentation, enabling\u0000high-throughput LLM serving with larger batch sizes. However, to be able to\u0000allocate physical memory dynamically, PagedAttention changes the layout of\u0000KV-cache from contiguous virtual memory to non-contiguous virtual memory. This\u0000change requires attention kernels to be rewritten to support paging, and\u0000serving framework to implement a memory manager. Thus, the PagedAttention model\u0000leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management.\u0000In contrast to PagedAttention, vAttention retains KV-cache in contiguous\u0000virtual memory and leverages low-level system support for demand paging, that\u0000already exists, to enable on-demand physical memory allocation. Thus,\u0000vAttention unburdens the attention kernel developer from having to explicitly\u0000support paging and avoids re-implementation of memory management in the serving\u0000framework. We show that vAttention enables seamless dynamic memory management\u0000for unchanged implementations of various attention kernels. vAttention also\u0000generates tokens up to 1.97x faster than vLLM, while processing input prompts\u0000up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention\u0000and FlashInfer.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140928988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Online Gradient-Based Caching Policy with Logarithmic Complexity and Regret Guarantees","authors":"Damiano Carra, Giovanni Neglia","doi":"arxiv-2405.01263","DOIUrl":"https://doi.org/arxiv-2405.01263","url":null,"abstract":"The commonly used caching policies, such as LRU or LFU, exhibit optimal\u0000performance only for specific traffic patterns. Even advanced Machine\u0000Learning-based methods, which detect patterns in historical request data,\u0000struggle when future requests deviate from past trends. Recently, a new class\u0000of policies has emerged that makes no assumptions about the request arrival\u0000process. These algorithms solve an online optimization problem, enabling\u0000continuous adaptation to the context. They offer theoretical guarantees on the\u0000regret metric, which is the gap between the gain of the online policy and the\u0000gain of the optimal static cache allocation in hindsight. Nevertheless, the\u0000high computational complexity of these solutions hinders their practical\u0000adoption. In this study, we introduce a groundbreaking gradient-based online\u0000caching policy, the first to achieve logarithmic computational complexity\u0000relative to catalog size along with regret guarantees. This means our algorithm\u0000can efficiently handle large-scale data while minimizing the performance gap\u0000between real-time decisions and optimal hindsight choices. As requests arrive,\u0000our policy dynamically adjusts the probabilities of including items in the\u0000cache, which drive cache update decisions. Our algorithm's streamlined\u0000complexity is a key advantage, enabling its application to real-world traces\u0000featuring millions of requests and items. This is a significant achievement, as\u0000traces of this scale have been out of reach for existing policies with regret\u0000guarantees. To the best of our knowledge, our experimental results show for the\u0000first time that the regret guarantees of gradient-based caching policies bring\u0000significant benefits in scenarios of practical interest.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"837 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140840024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating Spectre-PHT using Speculation Barriers in Linux BPF","authors":"Luis Gerhorst, Henriette Herzog, Peter Wägemann, Maximilian Ott, Rüdiger Kapitza, Timo Hönig","doi":"arxiv-2405.00078","DOIUrl":"https://doi.org/arxiv-2405.00078","url":null,"abstract":"High-performance IO demands low-overhead communication between user- and\u0000kernel space. This demand can no longer be fulfilled by traditional system\u0000calls. Linux's extended Berkeley Packet Filter (BPF) avoids user-/kernel\u0000transitions by just-in-time compiling user-provided bytecode and executing it\u0000in kernel mode with near-native speed. To still isolate BPF programs from the\u0000kernel, they are statically analyzed for memory- and type-safety, which imposes\u0000some restrictions but allows for good expressiveness and high performance.\u0000However, to mitigate the Spectre vulnerabilities disclosed in 2018, defenses\u0000which reject potentially-dangerous programs had to be deployed. We find that\u0000this affects 24% to 54% of programs in a dataset with 844 real-world BPF\u0000programs from popular open-source projects. To solve this, users are forced to\u0000disable the defenses to continue using the programs, which puts the entire\u0000system at risk. To enable secure and expressive untrusted Linux kernel extensions, we propose\u0000Berrify, an enhancement to the kernel's Spectre defenses that reduces the\u0000number of BPF application programs rejected from 54% to zero. We measure\u0000Berrify's overhead for all mainstream performance-sensitive applications of BPF\u0000(i.e., event tracing, profiling, and packet processing) and find that it\u0000improves significantly upon the status-quo where affected BPF programs are\u0000either unusable or enable transient execution attacks on the kernel.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140840091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dirigent: Lightweight Serverless Orchestration","authors":"Lazar Cvetković, François Costa, Mihajlo Djokic, Michal Friedman, Ana Klimovic","doi":"arxiv-2404.16393","DOIUrl":"https://doi.org/arxiv-2404.16393","url":null,"abstract":"While Function as a Service (FaaS) platforms can initialize function\u0000sandboxes on worker nodes in 10-100s of milliseconds, the latency to schedule\u0000functions in real FaaS clusters can be orders of magnitude higher. We find that\u0000the current approach of building FaaS cluster managers on top of legacy\u0000orchestration systems like Kubernetes leads to high scheduling delay at high\u0000sandbox churn, which is typical in FaaS clusters. While generic cluster\u0000managers use hierarchical abstractions and multiple internal components to\u0000manage and reconcile state with frequent persistent updates, this becomes a\u0000bottleneck for FaaS, where cluster state frequently changes as sandboxes are\u0000created on the critical path of requests. Based on our root cause analysis of\u0000performance issues in existing FaaS cluster managers, we propose Dirigent, a\u0000clean-slate system architecture for FaaS orchestration with three key\u0000principles. First, Dirigent optimizes internal cluster manager abstractions to\u0000simplify state management. Second, it eliminates persistent state updates on\u0000the critical path of function invocations, leveraging the fact that FaaS\u0000abstracts sandboxes from users to relax exact state reconstruction guarantees.\u0000Finally, Dirigent runs monolithic control and data planes to minimize internal\u0000communication overheads and maximize throughput. We compare Dirigent to\u0000state-of-the-art FaaS platforms and show that Dirigent reduces 99th percentile\u0000per-function scheduling latency for a production workload by 2.79x compared to\u0000AWS Lambda and can spin up 2500 sandboxes per second at low latency, which is\u00001250x more than with Knative.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"244 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taming Server Memory TCO with Multiple Software-Defined Compressed Tiers","authors":"Sandeep Kumar, Aravinda Prasad, Sreenivas Subramoney","doi":"arxiv-2404.13886","DOIUrl":"https://doi.org/arxiv-2404.13886","url":null,"abstract":"Memory accounts for 33 - 50% of the total cost of ownership (TCO) in modern\u0000data centers. We propose a novel solution to tame memory TCO through the novel\u0000creation and judicious management of multiple software-defined compressed\u0000memory tiers. As opposed to the state-of-the-art solutions that employ a 2-Tier solution, a\u0000single compressed tier along with DRAM, we define multiple compressed tiers\u0000implemented through a combination of different compression algorithms, memory\u0000allocators for compressed objects, and backing media to store compressed\u0000objects. These compressed memory tiers represent distinct points in the access\u0000latency, data compressibility, and unit memory usage cost spectrum, allowing\u0000rich and flexible trade-offs between memory TCO savings and application\u0000performance impact. A key advantage with ntier is that it enables aggressive\u0000memory TCO saving opportunities by placing warm data in low latency compressed\u0000tiers with a reasonable performance impact while simultaneously placing cold\u0000data in the best memory TCO saving tiers. We believe our work represents an\u0000important server system configuration and optimization capability to achieve\u0000the best SLA-aware performance per dollar for applications hosted in production\u0000data center environments. We present a comprehensive and rigorous analytical cost model for performance\u0000and TCO trade-off based on continuous monitoring of the application's data\u0000access profile. Guided by this model, our placement model takes informed\u0000actions to dynamically manage the placement and migration of application data\u0000across multiple software-defined compressed tiers. On real-world benchmarks,\u0000our solution increases memory TCO savings by 22% - 40% percentage points while\u0000maintaining performance parity or improves performance by 2% - 10% percentage\u0000points while maintaining memory TCO parity compared to state-of-the-art 2-Tier\u0000solutions.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}