"Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment"
J. Schuchart, Poornima Nookala, M. Javanmard, T. Hérault, Edward F. Valeev, G. Bosilca, R. Harrison
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00086

Abstract: We present and evaluate TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control and data flow-graph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared-memory parallel environments; a few support distributed-memory environments, either by discovering the entire DAG of tasks on all processes or by introducing explicit communication. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed-memory execution over two different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to state-of-the-art implementations.
"GSpecPal: Speculation-Centric Finite State Machine Parallelization on GPUs"
Yuguang Wang, Robbie Watling, Junqiao Qiu, Zhenlin Wang
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00053

Abstract: Finite state machines (FSMs) play a critical role in many real-world applications, ranging from pattern matching to network security. In recent years, significant research effort has gone into accelerating FSM computations on different parallel platforms, including multicores, GPUs, and DRAM-based accelerators. A popular direction is speculation-centric parallelization. Despite its abundance and promising results, the benefit of speculation-centric FSM parallelization on GPUs depends heavily on high speculation accuracy and is greatly limited by inefficient sequential recovery. Inspired by the speculative data forwarding used in thread-level speculation (TLS), this work addresses the existing bottlenecks by introducing speculative recovery with two heuristics for thread scheduling, which can effectively remove redundant computations and increase GPU thread utilization. To maximize the performance of running FSMs on GPUs, this work integrates different speculative parallelization schemes into a latency-sensitive framework, GSpecPal, along with a scheme selector that aims to automatically configure the optimal GPU-based parallelization for a given FSM. Evaluation on a set of real-world FSMs with diverse characteristics confirms the effectiveness of GSpecPal. Experimental results show that GSpecPal obtains a 7.2× speedup on average (up to 20×) over the state of the art on an Nvidia GeForce RTX 3090 GPU.
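The speculation scheme the abstract describes can be illustrated with a minimal sequential sketch. This is a generic CPU analogue of speculative FSM parallelization, not GSpecPal's GPU code: the chunking, the guessed start state, and all names below are illustrative assumptions. Each chunk after the first starts from a guessed state; a validation pass stitches the results together and re-executes only the mispredicted chunks.

```python
# Minimal sketch of speculation-centric FSM parallelization.
# 'trans' maps (state, symbol) -> next state; chunks partition the input.

def run_fsm(trans, state, chunk):
    """Run the FSM sequentially over one input chunk."""
    for sym in chunk:
        state = trans[(state, sym)]
    return state

def speculative_run(trans, start, chunks, guess):
    """Process chunks 'in parallel': every chunk i > 0 starts from a guessed
    state; a sequential validation pass re-executes mispredicted chunks."""
    # Parallel phase (simulated here by a list comprehension): each worker
    # runs its chunk from its predicted start state.
    spec_out = [run_fsm(trans, start if i == 0 else guess, c)
                for i, c in enumerate(chunks)]
    # Validation phase: if the prediction matched the true incoming state,
    # reuse the speculative result; otherwise recover by re-running the chunk.
    state = start
    for i, c in enumerate(chunks):
        pred = start if i == 0 else guess
        state = spec_out[i] if pred == state else run_fsm(trans, state, c)
    return state
```

When the guess is accurate, validation reuses every speculative result; the sequential-recovery cost the paper targets appears here as the `run_fsm` re-execution on a mismatch.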
"PowerSpector: Towards Energy Efficiency with Calling-Context-Aware Profiling"
Xin You, Hailong Yang, Zhibo Xuan, Zhongzhi Luan, D. Qian
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00126

Abstract: Energy efficiency has become one of the major concerns in high-performance computing systems as they approach exascale. On mainstream systems, dynamic voltage and frequency scaling (DVFS) and uncore frequency scaling (UFS) are two popular techniques for trading off performance against power consumption to achieve better energy efficiency. However, existing system software is oblivious to application characteristics and thus misses the opportunity for fine-grained power management. Meanwhile, manually instrumenting applications with power management code is prohibitive due to the heavy engineering effort and is hardly portable across platforms. In this paper, we propose PowerSpector, a fine-grained code profiling and optimization tool with calling-context awareness that automatically explores opportunities for optimizing energy efficiency. The design of PowerSpector consists of three phases: significant region detection, performance profiling and power modeling, and frequency optimization. The first phase automatically identifies the regions profitable for frequency optimization. The second phase guides the core/uncore frequency optimization with power models. The third phase automatically injects frequency optimization code targeting each significant code region across different calling contexts. Experimental results demonstrate that PowerSpector achieves 1.13× (1.00×), 1.28× (1.09×), and 1.17× (1.06×) improvements in energy efficiency compared to static (region-based) tuning on Haswell, Broadwell, and Skylake platforms, respectively.
"QoS-awareness of Microservices with Excessive Loads via Inter-Datacenter Scheduling"
Jiuchen Shi, Jiawen Wang, Kaihua Fu, Quan Chen, Deze Zeng, M. Guo
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00039

Abstract: User-facing applications often experience excessive loads and are shifting towards a microservice software architecture. When the local datacenter does not have enough resources to host the excessive loads, a reasonable solution is to move some microservices of the applications to remote datacenters. However, it is nontrivial to make the appropriate migration decision, as the microservices show different characteristics and the local datacenter shows different resource contention situations. We therefore propose ELIS, an inter-datacenter scheduling system that ensures the required quality of service (QoS) of a microservice application under excessive loads while minimizing the resource usage of the remote datacenter. ELIS comprises a resource manager and a reward-based microservice migrator. The resource manager finds near-optimal resource configurations for different microservices to minimize resource usage while ensuring QoS. The microservice migrator migrates some microservices to remote datacenters when local resources cannot accommodate the excessive loads. Our experimental results show that ELIS ensures the required QoS of user-facing applications at excessive loads. Meanwhile, it reduces overall/remote resource usage by 13.1% and 58.1% on average, respectively.
"CSC: Collaborative System Configuration for I/O-Intensive Applications in Multi-Tenant Clouds"
Haowei Huang, Pu Pang, Quan Chen, Jieru Zhao, Wenli Zheng, Minyi Guo
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00131

Abstract: I/O-intensive applications are important workloads in public clouds. Multiple cloud applications co-run on the same physical machine in different virtual machines (VMs), and the shared resources (e.g., disk bandwidth) are often isolated for fairness. Our investigation shows that the performance of an I/O-intensive application is impacted both by the disk bandwidth allocation and by the page cache settings in the guest operating system. However, no prior work considers adjusting the page cache settings for better performance when the disk bandwidth allocation is adjusted. We therefore propose CSC, a system that collaboratively identifies the appropriate disk bandwidth allocation and page cache settings in the guest operating system of each VM. CSC aims to improve the system-wide I/O throughput of the physical machine while also improving the I/O throughput of each individual I/O-intensive application in the VMs. CSC comprises an online disk bandwidth allocator and an adaptive dirty-page setting optimizer. The bandwidth allocator monitors disk bandwidth utilization and periodically re-allocates some bandwidth from free VMs to busy VMs. After the re-allocation, the optimizer identifies the appropriate dirty-page settings in the guest operating system of the VMs using Bayesian optimization. Experimental results show that CSC improves the performance of I/O-intensive applications by 9.5% on average (up to 17.29%) when 5 VMs are co-located, while fairness is guaranteed.
"A Swap Dominated Tensor Re-Generation Strategy for Training Deep Learning Models"
Lijie Wen, Zan Zong, Li Lin, Leilei Lin
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00101

Abstract: As the depth of neural networks and the scale of data grow, the difficulty of network training also increases. When GPU memory is insufficient, it is challenging to train deeper models. Recent research combines tensor swapping and recomputation techniques to optimize memory usage. However, the complex dependencies of the DNN graph limit the improvement of single-GPU memory optimization. Improper swap decisions even bring negative effects, because the source of a recomputation may have been swapped out. In this paper, we propose a novel swap-dominated tensor re-generation strategy, called STR, which combines swap and recomputation techniques to find the optimal execution plan for DNN training when memory is limited. We formalize our memory optimization problem with constraints that describe the dependencies of the operator calculations and the bandwidth usage of swapping. A host checkpoint mechanism is designed to make full use of the swapped tensors, which reduces the cost of recomputation. We also present an approximation method based on a recursive source-tracing procedure to improve optimization efficiency. We implement a prototype of STR as a plugin on TensorFlow. Experimental results show that STR improves throughput by up to 21.3% compared with the state-of-the-art hybrid optimization strategy.
"DGSF: Disaggregated GPUs for Serverless Functions"
Henrique Fingler, Zhiting Zhu, Esther Yoon, Zhipeng Jia, E. Witchel, C. Rossbach
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00077

Abstract: Ease of use and transparent access to elastic resources have attracted many applications away from traditional platforms toward serverless functions. Many of these applications, such as machine learning, could benefit significantly from GPU acceleration. Unfortunately, GPUs remain inaccessible from serverless functions in modern production settings. We present DGSF, a platform that transparently enables serverless functions to use GPUs through general-purpose APIs such as CUDA. DGSF solves provisioning and utilization challenges with disaggregation, serving the needs of a potentially large number of functions through virtual GPUs backed by a small pool of physical GPUs on dedicated servers. Disaggregation allows the provider to decouple GPU provisioning from other resources and enables significant benefits through consolidation. We describe how DGSF solves GPU disaggregation challenges, including supporting API transparency, hiding the latency of communication with remote GPUs, and load-balancing access to heavily shared GPUs. Evaluation of our prototype on six workloads shows that DGSF's API remoting optimizations can improve the runtime of a function by up to 50% relative to unoptimized DGSF. Such optimizations, which aggressively remove GPU runtime and object management latency from the critical path, can enable functions running over DGSF to have a lower end-to-end time than when running on a GPU natively. By enabling GPU sharing, DGSF can reduce function queueing latency by up to 53%. We use DGSF to augment AWS Lambda with GPU support, showing similar benefits.
"The Universal Gossip Fighter"
A. Gorbunova, R. Guerraoui, Anne-Marie Kermarrec, A. Kucherenko, Rafael Pinot
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00116

Abstract: The notion of an adversary is a staple of distributed computing. An adversary typically models "hostile" assumptions about the underlying distributed environment, e.g., a network that can drop messages, an operating system that can delay processes, or an attacker that can hack machines. So far, the goal of distributed computing researchers has mainly been to develop a distributed algorithm that can face a given adversary, the abstraction characterizing worst-case scenarios. This paper initiates the study of the somewhat opposite approach: given a distributed algorithm, the adversary is the abstraction we seek to implement. More specifically, we consider the problem of controlling the spread of messages in a large-scale system, conveying the practical motivation of limiting the dissemination of fake news or viruses. Essentially, we assume a general class of gossip protocols, called all-to-all gossip protocols, and devise a practical method to hinder the dissemination. We present the Universal Gossip Fighter (UGF). Just like classical adversaries in distributed computing, UGF can observe the status of a dissemination and decide to stop some processes or delay some messages. The originality of UGF lies in the fact that it is universal, i.e., it applies to any all-to-all gossip protocol. We show that any gossip protocol attacked by UGF ends up exhibiting quadratic message complexity (in the total number of processes) if it achieves sublinear dissemination time. We also show that if a gossip protocol aims to achieve a message complexity α times smaller than quadratic, then the time complexity rises exponentially in relation to α. We convey the practical relevance of our theoretical findings by implementing UGF and conducting a set of empirical experiments that confirm some of our results.
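As a rough illustration of the all-to-all gossip pattern this paper reasons about, the following is a minimal push-gossip simulation. It is a generic textbook protocol, not UGF or the paper's implementation; the function name and parameters are illustrative. One source spreads a rumor, and each informed process pushes it to one uniformly random peer per round; because the informed set can at most double per round, dissemination needs at least log₂ n rounds.

```python
import random

def push_gossip(n, seed=0):
    """Simulate push gossip among n processes: process 0 starts with the
    rumor; each round, every informed process sends it to one uniformly
    random peer. Returns (rounds, messages) until all n are informed."""
    rng = random.Random(seed)
    informed = {0}
    rounds = messages = 0
    while len(informed) < n:
        rounds += 1
        # Snapshot the informed set so processes informed this round
        # only start pushing in the next round.
        for _ in list(informed):
            informed.add(rng.randrange(n))
            messages += 1
    return rounds, messages
```

An adversary in the paper's sense would sit between the `randrange` choice and the `informed.add`, dropping or delaying pushes; the simulation shows the message/time baseline such an attack degrades.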
"PokéMem: Taming Wild Memory Consumers in Apache Spark"
Minhyeok Kweun, Goeun Kim, Byungsoo Oh, Seong-In Jung, Taegeon Um, Woo-Yeon Lee
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00015

Abstract: Apache Spark is a widely used in-memory processing system due to its high performance. For fast data processing, Spark manages in-memory data, such as cached or shuffling (aggregation and sorting) data, in its own managed memory pools. However, despite its sophisticated memory management scheme, we found that Spark still suffers from out-of-memory (OOM) exceptions and high garbage collection (GC) overheads when wild memory consumers, which are not tracked by Spark and execute external code, use a large amount of memory. To resolve these problems, we propose PokéMem, an enhanced Spark that incorporates wild memory consumers into the managed ones to prevent them from stealthily taking up excessive memory. Our main idea is to open the black box of unmanaged memory regions in external code by providing customized data collections. PokéMem enables fine-grained control of the objects created within running tasks by spilling and reloading the objects of custom data collections based on memory pressure and access patterns. To further reduce memory pressure, PokéMem exploits pre-built memory estimation models to predict the external code's memory usage and proactively acquires memory before the external code executes, and it also monitors JVM heap usage to avoid critical memory pressure. With the help of these techniques, our evaluations show that PokéMem outperforms vanilla Spark with up to 3× faster execution and 3.9× smaller GC overheads, and successfully runs workloads that vanilla Spark fails to run due to OOM exceptions.
"FlashWalker: An In-Storage Accelerator for Graph Random Walks"
Fuping Niu, Jianhui Yue, Jiangqiu Shen, Xiaofei Liao, Haikun Liu, Hai Jin
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2022. DOI: 10.1109/ipdps53621.2022.00107

Abstract: Graph random walks are widely used in graph processing, as they are a fundamental component of graph analysis, ranging from vertex ranking to graph embedding. Unlike traditional graph processing workloads, random walks feature massive processing parallelism and poor graph data reuse, and are limited by low I/O efficiency. Prior designs for random walks mitigate slow I/O operations. However, the state-of-the-art random walk processing systems are bound by slow disk I/O bandwidth, which is confirmed by our experiments with real-world graphs. To address this issue, we propose FlashWalker, an in-storage accelerator for random walks that moves walk updating close to the graph data stored in flash memory by exploiting the significant parallelism inside an SSD. Featuring a heterogeneous and parallel processing system, FlashWalker includes a board-level accelerator, channel-level accelerators, and chip-level accelerators. To address the challenges posed by tight resource constraints when processing large-scale graphs, we propose novel designs: storing a few popular subgraphs in accelerators, pre-walking for dense walks, two optimizations to search the subgraph mapping table, and a subgraph scheduling algorithm. We implement FlashWalker in RTL, showing small circuit area overhead. Our evaluation shows FlashWalker reduces the execution time of random walk algorithms by up to 660.50× compared with GraphWalker, the state-of-the-art system for random walk algorithms.
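The workload FlashWalker accelerates can be sketched in a few lines. This is a generic host-side random-walk loop (the adjacency-list format and function name are illustrative assumptions, unrelated to the paper's RTL design); it shows why the access pattern is massively parallel yet reuse-poor: every walker makes an independent, unpredictable jump per step, touching a new neighbor list each time.

```python
import random

def random_walks(adj, n_walks, walk_len, seed=0):
    """Run n_walks unbiased random walks of up to walk_len steps over a
    graph given as an adjacency list. Each walk is independent (parallel),
    and each step reads a different vertex's neighbor list (poor reuse)."""
    rng = random.Random(seed)
    walks = []
    for start in range(n_walks):
        v = start % len(adj)  # spread walkers across vertices
        path = [v]
        for _ in range(walk_len):
            if not adj[v]:
                break  # dead end: terminate this walk early
            v = rng.choice(adj[v])  # uniform jump to a random neighbor
            path.append(v)
        walks.append(path)
    return walks
```

On a disk-resident graph, each `adj[v]` lookup becomes a random I/O, which is the bottleneck an in-storage accelerator sidesteps by updating walks next to the flash chips.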