{"title":"SecFortress: Securing Hypervisor using Cross-layer Isolation","authors":"Qihang Zhou, Xiaoqi Jia, Shengzhi Zhang, Nan Jiang, Jiayun Chen, Weijuan Zhang","doi":"10.1109/ipdps53621.2022.00029","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00029","url":null,"abstract":"Virtualization is the corner stone of cloud computing, but the hypervisor, the crucial software component that enables virtualization, is known to suffer from various attacks. It is challenging to secure the hypervisor due to at least two reasons. On one hand, commercial hypervisors are usually integrated into a privileged Operating System (OS), which brings in a larger attack surface. On the other hand, multiple Virtual Machines (VM) share a single hypervisor, thus a malicious VM could leverage the hypervisor as a bridge to launch “cross-VM” attacks. In this work, we propose SecFortress, a dependable hypervisor design that decouples the virtualization layer into a mediator, an outerOS, and multiple HypBoxes through a cross-layer isolation approach. SecFortress extends the nested kernel approach to de-privilege the outerOS from accessing the mediator's memory and creates an isolated hypervisor instance, HypBox, to confine the impacts from the untrusted VMs. We implemented SecFortress based on KVM and evaluated its effectiveness and efficiency through case studies and performance evaluation. Experimental results show that SecFortress can significantly improve the security of the hypervisor with negligible runtime overhead.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123032502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory Access Granularity Aware Lossless Compression for GPUs","authors":"S. Lal, M. Renz, Julian Hartmer, B. Juurlink","doi":"10.1109/ipdps53621.2022.00108","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00108","url":null,"abstract":"High-bandwidth off-chip memory has played a key role in the success of Graphics Processing Units (GPUs) as an accelerator. However, as memory bandwidth scaling continues to lag behind the computational power, it remains a key bottleneck in computing systems. While memory compression has shown immense potential to increase the effective memory bandwidth by compressed data transfers between on-chip and off-chip memory, the large memory access granularity (MAG) of off-chip memory limits compression techniques from achieving a high effective compression ratio. Unfortunately, state-of-the-art lossless memory compression techniques do not take the large MAG of off-chip memory into account. A recent study has used MAG-aware approximation to increase the effective compression ratio, however, not all applications can tolerate errors, which limits its applicability. We propose extensions and GPU-specific optimizations to adapt a lossless memory compression technique to a MAG size to increase the effective compression ratio and performance gain. Our technique is based on the well-known Base-Delta-Immediate (BDI) compression technique that compresses a memory block to a common base and multiple deltas. We leverage the key observation that deltas often contain enough leading zeros to compress a block to a multiple of MAG without any loss of information. We show that MAG-aware BDI provides, on average, 48 % higher effective compression ratio, 10% (up to 27%) higher speedup, and 16% bandwidth reduction compared to normal BDI. While BDI, FPC, and CPACK have a similar compression ratio, MAG-aware BDI outperforms FPC, CPACK, and SLC by 56%, 47%, and 33%, respectively.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124779238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PINT: Parallel INTerval-Based Race Detector","authors":"Yifan Xu, Anchengcheng Zhou, Kunal Agrawal, I. Lee","doi":"10.1109/ipdps53621.2022.00087","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00087","url":null,"abstract":"A race detector for task-parallel code typically consists of two main components - a reachability analysis component that checks whether two instructions are logically in parallel and an access history component that keeps track of memory locations accessed by previous instructions. Race detectors from prior work typically utilize a hashmap to maintain the access history, which provides asymptotically optimal overhead per operation but can incur significant overhead in practice, since the detector needs to insert into and query the hashmap for every memory access. An exception is STINT by Xu et al., which race detects task-parallel code by coalescing memory accesses into intervals, or continuous memory locations accessed within a sequence of instructions without any parallel construct. STINT utilizes a treap to manage access history that allows for insertions and queries of non-overlapping intervals. While a treap incurs higher asymptotic overhead per operation, this strategy works well in practice as the race detector performs operation on the access history with much lower frequency compared to the strategy that utilizes a hashmap. STINT only executes task-parallel code sequentially, however, due to the unique design of their treap that ensures no overlapping intervals exist in the tree. Parallelizing STINT efficiently is non-trivial, as it would require a concurrent treap that ensures no overlapping interval, which is challenging to design and likely incurs high synchronization overhead. This work proposes PINT, a race detector that, like STINT, race detects task-parallel code at the interval granularity and utilizes the same treap design to maintain access history. PINT executes the computation in parallel, however, while keeping the parallelization / synchronization overhead low. A key insight is that, PINT separates out operations needed for race detection into the core part (e.g., reachability maintenance) and the access history part. Doing so allows PINT to parallelize the core part efficiently and perform the access history part asynchronously, thereby incurring low overhead.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114735907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Phase Task-Based HPC Applications: Quickly Learning how to Run Fast","authors":"Lucas Leandro Nesi, L. Schnorr, Arnaud Legrand","doi":"10.1109/ipdps53621.2022.00042","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00042","url":null,"abstract":"Parallel applications performance strongly depends on the number of resources. Although adding new nodes usually reduces execution time, excessive amounts are often detrimental as they incur substantial communication overhead, which is difficult to anticipate. Characteristics like network contention, data distribution methods, synchronizations, and how communications and computations overlap generally impact the performance. Finding the correct number of resources can thus be particularly tricky for multi-phase applications as each phase may have very different needs, and the popularization of hybrid ($C$ PU+GPU) machines and heterogeneous partitions makes it even more difficult. In this paper, we study and propose, in the context of a task-based GeoStatistic application, strategies for the application to actively learn and adapt to the best set of heterogeneous nodes it has access to. We propose strategies that use the Gaussian Process method with trends, bound mechanisms for reducing the search space, and heterogeneous behavior modeling. We compare these methods with traditional exploration strategies in 16 different machines scenarios. In the end, the proposed strategies are able to gain up to ≈51% compared to the standard case of using all the nodes while having low overhead.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120843675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DEAN: A Lightweight and Resource-efficient Blockchain Protocol for Reliable Edge Computing","authors":"Abdullah Al-Mamun, Haoting Shen, Dongfang Zhao","doi":"10.1109/ipdps53621.2022.00125","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00125","url":null,"abstract":"Edge computing draws a lot of recent research interests because of the performance improvement by offloading many workloads from the remote data center to nearby edge nodes. Nonetheless, one open challenge of this emerging paradigm lies in the potential security issues on edge nodes. This paper proposes a cooperative protocol, namely DEAN, equipped with a unique resource-efficient quorum building mechanism to adopt blockchain seamlessly in an edge computing infrastructure to prevent data manipulation and allow fair data sharing with quick recovery under resource constraints of limited storage, computing, and network capacity. Specifically, DEAN leverages a parallel mechanism equipped with three independent core components, effectively achieving low resource consumption while allowing secured parallel block processing on edge nodes. We have implemented a system prototype based on DEAN and experimentally verified its effectiveness with a comparison with four popular blockchain implementations: Ethereum, Parity, IOTA, and Hyperledger Fabric. Experimental results show that the system prototype exhibits high resilience to arbitrary failures. Performance-wise, DEAN-based blockchain implementation out-performs the state-of-the-art blockchain systems with up to 88.6 x higher throughput and 26 x lower latency.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"113 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120909028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility","authors":"Jialin Li, Huang Ye, Shaobo Tian, Xinyuan Li, Jian Zhang","doi":"10.1109/ipdps53621.2022.00089","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00089","url":null,"abstract":"General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, the thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share memory. This paper presents a fine-grained prefetching scheme that improves the thread-level parallelism by balancing the usage of such resources. The gain and loss on instruction and thread level parallelism are analyzed and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize the performance of DGEMM for a family of problem sizes. Experiments show about 1.10X performance speedup on a wide range of matrix sizes for both single and batched matrix-matrix multiplication.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121361111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges and Opportunities in Designing High-Performance and Scalable Middleware for HPC and AI: Past, Present, and Future","authors":"D. Panda","doi":"10.1109/ipdps53621.2022.00009","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00009","url":null,"abstract":"This talk focuses on challenges and opportunities emerging over the years (past, present, and future) in designing middleware for HPC and AI (Deep/Machine Learning) workloads on modern high-end computing systems. The talk initially presents the challenges in designing HPC runtime environments with MPI+X programming models by considering support for dense multi-core CPUs, high-performance interconnects, GPUs, and emerging DPUs. Advanced designs and solutions (such as RDMA, in-network computing, GPUDirect RDMA, on-the-fly compression) to exploit novel features of these emerging technologies and their benefits in the context of MVAPICH2 libraries are presented. Next, the talk focuses on MPI-driven solutions for the Deep/Machine Learning domains to extract performance and scalability for popular Deep Learning frameworks, large out-of-core models, GPUs, and DPUs. MPI-driven solutions to accelerate data science applications like Dask are highlighted. Challenges and experiences in deploying this middleware to the HPC cloud environments for Azure, AWS, and Oracle Cloud are presented. The talk concludes with an overview of the newly established NSF-AI Institute ICICLE (https://icicle.osu.edu/) to address challenges in designing future high-performance edge-to-HPC/ cloud middleware for AI-driven data-intensive applications.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129206835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct solution of larger coupled sparse/dense linear systems using low-rank compression on single-node multi-core machines in an industrial context","authors":"E. Agullo, M. Felsöci, G. Sylvand","doi":"10.1109/ipdps53621.2022.00012","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00012","url":null,"abstract":"While hierarchically low-rank compression methods are now commonly available in both dense and sparse direct solvers, their usage for the direct solution of coupled sparse/dense linear systems has been little investigated. The solution of such systems is though central for the simulation of many important physics problems such as the simulation of the propagation of acoustic waves around aircrafts. Indeed, the heterogeneity of the jet flow created by reactors often requires a Finite Element Method (FEM) discretization, leading to a sparse linear system, while it may be reasonable to assume as homogeneous the rest of the space and hence model it with a Boundary Element Method (BEM) discretization, leading to a dense system. In an industrial context, these simulations are often operated on modern multicore workstations with fully-featured linear solvers. Exploiting their low-rank compression techniques is thus very appealing for solving larger coupled sparse/dense systems (hence ensuring a finer solution) on a given multicore workstation, and - of course - possibly do it fast. The standard method performing an efficient coupling of sparse and dense direct solvers is to rely on the Schur complement functionality of the sparse direct solver. However, to the best of our knowledge, modern fully-featured sparse direct solvers offering this functionality return the Schur complement as a non compressed matrix. In this paper, we study the opportunity to process larger systems in spite of this constraint. For that we propose two classes of algorithms, namely multi-solve and multi-factorization, consisting in composing existing parallel sparse and dense methods on well chosen submatrices. An experimental study conducted on a 24 cores machine equipped with 128 GiB of RAM shows that these algorithms, implemented on top of state-of-the-art sparse and dense direct solvers, together with proper low-rank assembly schemes, can respectively process systems of 9 million and 2.5 million total unknowns instead of 1.3 million unknowns with a standard coupling of compressed sparse and dense solvers.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114256859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FAM-Graph: Graph Analytics on Disaggregated Memory","authors":"Daniel Zahka, Ada Gavrilovska","doi":"10.1109/ipdps53621.2022.00017","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00017","url":null,"abstract":"Disaggregated memory is being proposed as a way to provide efficient memory scaling for data intensive applications. High performance interconnect technologies, such as CXL, make disaggregated, fabric-attached-memory (FAM) a viable secondary tier of memory. Previous work on remote memory relies on extending kernel level paging to utilize FAM as an additional storage tier after local memory. These approaches have the advantage of exposing remote memory in application transparent ways that do not require code changes, but they incur large overheads due to the mismatch between the abstraction of a flat virtual address space and the reality of the tiered nature of FAM. In this paper, we present an alternative approach to remote memory based on application-specific objects. We design FAM-Graph - a semi-external graph processing system that leverages application-level properties, such as read only edge data, to efficiently tier data between local and remote memory, and prefetch remote data for local computation. Using several graph algorithms and datasets, we demonstrate that FAM-Graph achieves end-to-end performance within factors of 1–6× of Galois, the state of the art shared memory graph processing system, while using up to 20× less local memory. When Galois is used in conjunction with an OS-level FAM solution, we show that FAM-Graph achieves better end-to-end performance by up to 9× when both systems are configured with the same amount of local memory.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134374199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A General Offloading Approach for Near-DRAM Processing-In-Memory Architectures","authors":"Dan Chen, Hai Jin, Long Zheng, Yu Huang, Pengcheng Yao, Chuangyi Gui, Qinggang Wang, Haifeng Liu, Haiheng He, Xiaofei Liao, Ran Zheng","doi":"10.1109/ipdps53621.2022.00032","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00032","url":null,"abstract":"Processing-in-memory (PIM) is promising to solve the well-known data movement challenge by performing in-situ computations near the data. Leveraging PIM features is pretty profitable to boost the energy efficiency of applications. Early studies mainly focus on improving the programmability for computation offloading on PIM architectures. They lack a comprehensive analysis of computation locality and hence fail to accelerate a wide variety of applications. In this paper, we present a general-purpose instruction-level offloading technique for near-DRAM PIM architectures, namely IOTPIM, to exploit PIM features comprehensively. IOTPIM is novel with two technical advances: 1) a new instruction offloading policy that fully considers the locality of the whole on-chip cache hierarchy, and 2) an offloading performance benefit prediction model that directly predicts offloading performance benefits of an instruction based on the input dataset characterizes, preserving low analysis overheads. The evaluation demonstrates that IOTPIM can be applied to accelerate a wide variety of applications, including graph processing, machine learning, and image processing. IOT-PIM outperforms the state-of-the-art PIM offloading techniques by 1.28×-1.51× while ensuring offloading accuracy as high as 91.89% on average.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133904465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}