Mazen Al-Wadi, Rujia Wang, David A. Mohaisen, C. Hughes, S. Hammond, Amro Awad
{"title":"Minerva: Rethinking Secure Architectures for the Era of Fabric-Attached Memory Architectures","authors":"Mazen Al-Wadi, Rujia Wang, David A. Mohaisen, C. Hughes, S. Hammond, Amro Awad","doi":"10.1109/ipdps53621.2022.00033","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00033","url":null,"abstract":"Fabric-attached memory (FAM) is proposed to enable the seamless integration of directly accessible memory modules attached to the shared system fabric, which will provide future systems with flexible memory integration options, mitigate underutilization, and facilitate data sharing. Recently proposed interconnects, such as Gen-Z and Compute Express Link (CXL), define security, correctness, and performance requirements of fabric-attached devices, including memory. These initiatives are supported by most major system and processor vendors, bringing widespread adoption of FAM-enabled systems one step closer to reality and security concerns to the forefront. This paper discusses the challenges for adapting secure memory implementations to FAM-enabled systems for the first time in literature. Specifically, we observe that handling the security metadata used to protect fabric-attached memories needs to be done deliberately to eliminate unintentional integrity check failures and/or security vulnerabilities, caused by an inconsistent view of the shared security metadata across nodes. 
Our scheme, Minerva, elegantly adapts secure memory implementations to support FAM-enabled systems with negligible performance overheads (3.8% of an ideal scheme), compared to the performance overhead (99.5% of an ideal scheme) for a scheme that uses conventional invalidation-based cache coherence to ensure the consistency of security metadata across nodes.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125441309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yulei Jia, Guangping Xu, C. Sung, Salwa Mostafa, Yulei Wu
{"title":"HRaft: Adaptive Erasure Coded Data Maintenance for Consensus in Distributed Networks","authors":"Yulei Jia, Guangping Xu, C. Sung, Salwa Mostafa, Yulei Wu","doi":"10.1109/IPDPS53621.2022.00130","DOIUrl":"https://doi.org/10.1109/IPDPS53621.2022.00130","url":null,"abstract":"Distributed data services usually rely on consensus protocols like Paxos and Raft to provide fault-tolerance and data consistency across global and local-distributed data centers. Erasure coding replication has appealing storage and network cost saving compared with full copy replication, which helps consensus protocols achieve low latency, high fault tolerance, and high throughput for data access. Applying erasure coding in consensus protocols directly will degrade the liveness level when the number of failure servers reaches a certain level. To address the challenge, CRaft just stores full copy replication instead of erasure coding replication when the number of failed servers reaches a certain threshold. In such situation, CRaft will be downgraded sharply to the same storage and network costs as Raft. To overcome the shortcoming of CRaft, we propose a protocol, called HRaft, which can adapt the placement of data blocks in order to always have enough blocks to recover the stored value when servers fail. By replenishing some coded blocks in healthy servers instead of full copy replication, it can avoid switching to the full replication when a certain threshold on the number of failures is reached. We designed and implemented a key-value (KV) storage prototype to validate the proposed protocol and evaluate its performance. 
The experimental results show HRaft can significantly reduce storage and network costs and improve write performance while keeping the liveness level compared to CRaft.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121535939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TagTree: Global Tagging Index with Efficient Querying for Time Series Databases","authors":"Jin Xue, Zhiqi Wang, Tianyu Wang, Z. Shao","doi":"10.1109/ipdps53621.2022.00127","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00127","url":null,"abstract":"Modern time series databases come with a tag-based query interface that allows users to select time series, which are essentially sequences of timestamped data values, based on a set of specific tags. A tagging index is an important component that can efficiently provide such tag-based services. However, existing methods store tag information in external databases or time-partitioned data structures, which has a negative impact on query performance. In this paper, we present a novel abstraction for efficient queries of tag information in time series databases: a hybrid tagging index that manages all tags in one place. By managing tag information globally in a single disk-based data structure, we can fundamentally relieve memory pressure and eliminate I/O overhead of duplicate metadata from existing methods. Furthermore, the tagging index is internally partitioned by time to support time range based queries and data retention which are essential to time series databases. We implement the proposed tagging index as a standalone module which can be integrated with time series storage engines. 
Experiments on the TSBS benchmark show our proposed method can significantly speed up queries by on average 84.0% and 87.2% compared to Prometheus (using a time-partitioned segment method) and Graphite (using an external database for tag management), respectively.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"68 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123116855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seong-Bin Park, Hajin Kim, Tanveer Ahmad, Nauman Ahmed, Z. Al-Ars, H. P. Hofstee, Youngsok Kim, Jinho Lee
{"title":"SALoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs","authors":"Seong-Bin Park, Hajin Kim, Tanveer Ahmad, Nauman Ahmed, Z. Al-Ars, H. P. Hofstee, Youngsok Kim, Jinho Lee","doi":"10.1109/ipdps53621.2022.00076","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00076","url":null,"abstract":"Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is an approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has been conducted on GPU acceleration of a sequence alignment, we identify several shortcomings that limit exploiting the full computational capability of modern GPUs. This paper presents SALoBa, a GPU-accelerated sequence alignment library focused on seed extension. Based on the analysis of previous work with real-world sequencing data, we propose techniques to exploit the data locality and improve workload balancing. The experimental results reveal that SALoBa significantly improves the seed extension kernel compared to state-of-the-art GPU-based methods.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126275471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Tensor Train Rounding using Gram SVD","authors":"Hussam Al Daas, Grey Ballard, Lawton Manning","doi":"10.1109/ipdps53621.2022.00095","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00095","url":null,"abstract":"Tensor Train (TT) is a low-rank tensor representation consisting of a series of three-way cores whose dimensions specify the TT ranks. Formal tensor train arithmetic often causes an artificial increase in the TT ranks. Thus, a key operation for applications that use the TT format is rounding, which truncates the TT ranks subject to an approximation error guarantee. Truncation is performed via SVD of a highly structured matrix, and current rounding methods require careful orthogonalization to compute an accurate SVD. We propose a new algorithm for TT-Rounding based on the Gram SVD algorithm that avoids the expensive orthogonalization phase. Our algorithm performs less computation and can be parallelized more easily than existing approaches, at the expense of a slight loss of accuracy. We demonstrate that our implementation of the rounding algorithm is efficient, scales well, and consistently outperforms the existing state-of-the-art parallel implementation in our experiments.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125251930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Computation Offloading for Green Things-Edge-Cloud Computing with Local Caching","authors":"Xianzhong Tian, Huixiao Meng, Yanjun Li, Pingting Miao, Pengcheng Xu","doi":"10.1109/ipdps53621.2022.00103","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00103","url":null,"abstract":"With the increasing popularity of the internet of things (IoT) and 5G, emerging things-edge-cloud computing (TEC) paradigm provides a flexible way for execution of delay-sensitive and computation-intensive applications running on the user equipment (UE). By offloading these workloads to the mobile edge computing (MEC) or mobile cloud computing (MCC) server, the quality of experience, e.g., the execution delay, could be greatly improved. Nevertheless, conventional battery-powered devices face the challenge of battery exhaustion for task offloading. Using renewable energy via energy harvesting (EH) technologies has become a promising way to power these devices. In this paper, we investigate a multi-user green TEC system with EH UEs, each has a task buffer with limited capacity. A joint offloading decision and resource allocation problem is formulated, which addresses the long-term average execution delay, the task dropping and the long-term average energy cost constraint. A low-complexity online algorithm is proposed leveraging Lyapunov optimization framework and matroid theory, which jointly decides the offloading decision, the MEC server CPU frequencies and the transmit power for computation offloading. A unique advantage of this algorithm is that the decisions depend only on the current system state without requiring distribution information of the arrival tasks, wireless channel state, and EH processes. The implementation of the algorithm only requires to solve a deterministic problem in each time slot. 
Simulation results show that our proposed algorithm makes a best trade-off between minimizing the long-term average generalized delay and satisfying the long-term average energy cost constraint. Impacts of various parameters on the delay and energy cost performance are also discussed.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122656343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Behrooz Zarebavani, Kazem Cheshmi, Bangtian Liu, M. Strout, M. Dehnavi
{"title":"HDagg: Hybrid Aggregation of Loop-carried Dependence Iterations in Sparse Matrix Computations","authors":"Behrooz Zarebavani, Kazem Cheshmi, Bangtian Liu, M. Strout, M. Dehnavi","doi":"10.1109/ipdps53621.2022.00121","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00121","url":null,"abstract":"This paper proposes a novel aggregation algorithm, called Hybrid DAG Aggregation (HDagg), that groups iterations of sparse matrix computations with loop-carried dependence to improve their parallel execution on multicore processors. Prior approaches to optimize sparse matrix computations fail to provide an efficient balance between locality, load balance, and synchronization and are primarily optimized for codes with a tree-structure data dependence. HDagg is optimized for sparse matrix computations whose data dependence graphs (DAGs) do not have a tree structure, such as incomplete matrix factorization algorithms. It uses a hybrid approach to aggregate vertices and wavefronts in the DAG of a sparse computation to create well-balanced parallel workloads with good locality. Across three sparse kernels, triangular solver, incomplete Cholesky, and incomplete LU, HDagg outperforms existing sparse libraries such as MKL with an average speedup of 3.56× and is faster than state-of-the-art inspector-executor approaches that optimize sparse computations, i.e. 
DAGP, LBC, wavefront parallelism techniques, and SpMP by an average speedup of 3.87×, 3.41×, 1.95×, and 1.43× respectively.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115677706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Lu, Nannan Zhao, Ji-guang Wan, Changhong Fei, Wei Zhao, Tongliang Deng
{"title":"RLRP: High-Efficient Data Placement with Reinforcement Learning for Modern Distributed Storage Systems","authors":"Kai Lu, Nannan Zhao, Ji-guang Wan, Changhong Fei, Wei Zhao, Tongliang Deng","doi":"10.1109/ipdps53621.2022.00064","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00064","url":null,"abstract":"Modern distributed storage systems with massive data and storage nodes pose higher requirements to the data placement strategy. Furthermore, with the emergence of new storage devices, heterogeneous storage architecture has become increasingly common and popular. However, traditional strategies show great limitations in the face of these requirements; in particular, they do not yet adequately consider the distinct characteristics of heterogeneous storage nodes, which leads to suboptimal performance. In this paper, we present and evaluate RLRP, a deep reinforcement learning (RL) based replica placement strategy. RLRP constructs placement and migration agents through the Deep-Q-Network (DQN) model to achieve fair distribution and adaptive data migration. Besides, RLRP provides optimal performance for heterogeneous environments by an attentional Long Short-term Memory (LSTM) model. Finally, RLRP adopts Stagewise Training and Model fine-tuning to accelerate the training of RL models with large-scale state and action space. RLRP is implemented on Park and the evaluation results indicate RLRP is a highly efficient data placement strategy for modern distributed storage systems. RLRP can reduce read latency by 10%∼50% in heterogeneous environments compared with existing strategies. 
In addition, RLRP is used in the real-world system Ceph, which improves the read performance of Ceph by 30%∼40%.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114662767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-based Acceleration of Bidirectional Recurrent Neural Networks on Multi-core Architectures","authors":"Robin Kumar Sharma, Marc Casas","doi":"10.1109/ipdps53621.2022.00096","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00096","url":null,"abstract":"This paper proposes a novel parallel execution model for Bidirectional Recurrent Neural Networks (BRNNs), B-Par (Bidirectional-Parallelization), which exploits data and control dependencies for forward and reverse input computations. B-Par divides BRNN workloads across different parallel tasks by defining input and output dependencies for each RNN cell in both forward and reverse orders. B-Par does not require per-layer barriers to synchronize the parallel execution of BRNNs. We evaluate B-Par considering the TIDIGITS speech database and the Wikipedia data-set. Our experiments indicate that B-Par outperforms the state-of-the-art deep learning frameworks TensorFlow-Keras and Pytorch by achieving up to 2.34× and 9.16× speed-ups, respectively, on modern multi-core CPU architectures while preserving accuracy. Moreover, we analyze in detail aspects like task granularity, locality, or parallel efficiency to illustrate the benefits of B-Par.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123664203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coloring the Vertices of 9-pt and 27-pt Stencils with Intervals","authors":"Dante Durrman, Erik Saule","doi":"10.1109/ipdps53621.2022.00098","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00098","url":null,"abstract":"Graph coloring is commonly used to schedule computations on parallel systems. Given a good estimation of the computational requirement for each task, one can refine the model by adding a weight to each vertex. Instead of coloring each vertex with a single color, the problem is to color each vertex with an interval of colors. In this paper, we are interested in studying this problem for particular classes of graphs, namely stencil graphs. Stencil graphs appear naturally in the parallelisation of applications where the location of an object in a space affects the state of neighboring objects. Rectilinear decompositions of a space generate conflict graphs that are 9-pt stencils for 2D problems and 27-pt stencils for 3D problems. We show that the 5-pt stencil and 7-pt stencil relaxations of the problem can be solved in polynomial time. We prove that the decision problem on 27-pt stencil is NP-Complete. We discuss approximation algorithms with a ratio of 2 for the 9-pt stencil case, and 4 for the 27-pt stencil case. We identify two lower bounds for the problem that are used to design heuristics. We evaluate the effectiveness of several different algorithms experimentally on a set of real instances. 
Furthermore, these algorithms are integrated into a real application to demonstrate the soundness of the approach.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121635889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}