2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Articles, Page 3

Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00071

D. Yan, Wei Wang, X. Chu

Abstract: Half-precision matrix multiply has played a key role in the training of deep learning models. The newly designed NVIDIA Tensor Cores offer native instructions for half-precision small matrix multiply, on which Half-precision General Matrix Multiply (HGEMM) routines are built and made accessible through high-level APIs. In this paper, we, for the first time, demystify how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, as well as the throughput and latency of Tensor Core operations. We further benchmark the memory system of Turing GPUs and conduct a quantitative analysis of the performance. Our analysis shows that the bandwidth of DRAM, L2 cache and shared memory is the new bottleneck for HGEMM, whose performance was previously believed to be bound by computation. Based on our newly discovered features of Tensor Cores, we apply a series of optimization techniques to the Tensor Core-based HGEMM, including blocking size optimization, data layout redesign, data prefetching, and instruction scheduling. Extensive evaluation results show that our optimized HGEMM routine achieves an average of 1.73× and 1.46× speedup over the native implementation of cuBLAS 10.1 on NVIDIA Turing RTX2070 and T4 GPUs, respectively. The code of our implementation is written in native hardware assembly (SASS).

Pages: 634-643

Citations: 44
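
The abstract's claim that memory bandwidth, rather than computation, bounds HGEMM can be illustrated with a back-of-the-envelope arithmetic-intensity estimate for a blocked matrix multiply. The sketch below is not the authors' SASS code or their performance model. The function names, the simple traffic model (each k-step of a tile loads one column slice of A and one row slice of B, and traffic for C is ignored), and the peak rate in the usage note are assumptions made for illustration.

```python
def tile_intensity(bm: int, bn: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte loaded for a bm x bn output tile.

    Assumed model: for each step along k, the tile loads bm elements of A
    and bn elements of B, and performs one multiply and one add for each
    of the bm * bn output elements. bytes_per_elem=2 corresponds to fp16.
    """
    flops_per_k_step = 2 * bm * bn
    bytes_per_k_step = bytes_per_elem * (bm + bn)
    return flops_per_k_step / bytes_per_k_step


def required_bandwidth(peak_flops: float, bm: int, bn: int) -> float:
    """Bytes per second the memory level must deliver to keep the
    compute units busy at peak_flops under the model above."""
    return peak_flops / tile_intensity(bm, bn)


if __name__ == "__main__":
    # Hypothetical peak rate, chosen only to make the arithmetic easy.
    peak = 64e12
    for bm, bn in [(64, 64), (128, 128), (256, 128)]:
        print(bm, bn, tile_intensity(bm, bn), required_bandwidth(peak, bm, bn))
```

Under this model a 128×128 tile performs 64 FLOPs per byte loaded, so a hypothetical 64 TFLOP/s peak would need about 1 TB/s of operand bandwidth from the memory level feeding the tile. Larger tiles lower that requirement, which is one reason the choice of blocking size matters.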

Mitigating Large Response Time Fluctuations through Fast Concurrency Adapting in Clouds

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00046

Jianshu Liu, Shungeng Zhang, Qingyang Wang, Jinpeng Wei

Abstract: Dynamically reallocating computing resources to handle bursty workloads is a common practice for web applications (e.g., e-commerce) in clouds. However, our empirical analysis of a standard n-tier benchmark application (RUBBoS) shows that simply scaling an n-tier application by reallocating hardware resources, without quickly adapting soft resources (e.g., server threads, connections), may lead to large response time fluctuations. This is because soft resources control the workload concurrency of component servers in the system: adding or removing hardware resources such as Virtual Machines (VMs) can implicitly change the workload concurrency of dependent servers, causing either under- or over-utilization of the critical hardware resource in the system. To quickly identify the optimal soft resource allocation of each server in the system and stabilize response time fluctuation, we propose a novel Scatter-Concurrency-Throughput (SCT) model based on the monitoring of each server's real-time concurrency and throughput. We then implement a Concurrency-aware system Scaling (ConScale) framework, which integrates the SCT model to quickly adapt the soft resource allocations of key servers during the system scaling process. Our experiments using six realistic bursty workload traces show that ConScale can effectively mitigate the response time fluctuations of the target web application compared to state-of-the-art cloud scaling strategies such as EC2-AutoScaling.

Pages: 368-377

Citations: 1
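
The idea of choosing a soft resource setting from monitored concurrency and throughput can be sketched as follows. This is not the authors' SCT model, whose details the abstract does not give. The selection rule used here (the smallest concurrency whose mean throughput is within a tolerance of the best observed mean), the function name, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def pick_concurrency(samples, tolerance=0.05):
    """Pick a concurrency level from (concurrency, throughput) samples.

    Assumed rule: group samples by concurrency, average the throughput in
    each group, and return the smallest concurrency whose average is at
    least (1 - tolerance) times the best average. Raises ValueError if
    samples is empty.
    """
    buckets = defaultdict(list)
    for concurrency, throughput in samples:
        buckets[concurrency].append(throughput)
    if not buckets:
        raise ValueError("no samples")
    averages = {c: mean(values) for c, values in buckets.items()}
    best = max(averages.values())
    return min(c for c, avg in averages.items() if avg >= (1 - tolerance) * best)


if __name__ == "__main__":
    # Hypothetical measurements: throughput rises, saturates, then drops
    # as concurrency grows past the point of over-utilization.
    observed = [(1, 100), (2, 190), (4, 360), (8, 500),
                (8, 520), (16, 515), (32, 400)]
    print(pick_concurrency(observed))
```

Preferring the smallest near-optimal concurrency is a design choice of this sketch: it keeps queues short while giving up little throughput.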

The Case of Performance Variability on Dragonfly-based Systems

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00096

A. Bhatele, Jayaraman J. Thiagarajan, Taylor L. Groves, Rushil Anirudh, Staci A. Smith, B. Cook, D. Lowenthal

Citations: 16

Scheduling Malleable Jobs Under Topological Constraints

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00041

E. Bampis, Konstantinos Dogeas, A. Kononov, Giorgio Lucarelli, Fanny Pascual

Abstract: Bleuse et al. (EuroPar 2018) introduced a general model for interference-aware scheduling in large-scale parallel platforms. They considered two different types of communications: the flows induced by data exchanges during computations and the flows related to Input/Output operations. Rather than taking these communications into account explicitly, they restrict the possible allocations of a job by external topological constraints. In their work, jobs are considered to be rigid: a job requires a specific number of machines in order to be executed. Here, we first adopt the same framework for the platform and the aforementioned topological constraints. We show that there is no polynomial-time approximation algorithm under the rigid setting with ratio smaller than 3/2, unless P = NP. Then, we focus on the malleable setting. We show that in the proportional-malleable setting, where the work of every job remains constant independently of the number of machines on which it is executed, the scheduling problem remains NP-hard even in the uniform case, where the maximum number of machines is the same for all the jobs. Then, we propose a 2-approximation algorithm for this case. Furthermore, we present an approximation algorithm solving the more general case where the maximum number of machines is job-dependent and the work of the jobs increases with the number of machines used, due to the communication overhead.

Pages: 316-325

Citations: 4
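
To make the proportional-malleable setting concrete: a job with work w that runs on p machines takes time w / p, as long as p does not exceed the job's maximum number of machines. The sketch below computes two elementary lower bounds on the makespan that follow from this definition. It is not the paper's 2-approximation algorithm, and it ignores the topological constraints entirely. The function name and the example instances are assumptions for illustration.

```python
from fractions import Fraction


def makespan_lower_bound(works, max_machines, m):
    """Lower bound on the makespan of proportional-malleable jobs.

    works[j]        : total work of job j (constant, whatever p is used)
    max_machines[j] : most machines job j may use at once
    m               : machines in the platform

    Two bounds are combined:
      * area bound: all work spread perfectly over m machines;
      * job bound : each job run at its maximum allowed parallelism.
    """
    area = Fraction(sum(works), m)
    longest = max(Fraction(w, min(d, m)) for w, d in zip(works, max_machines))
    return max(area, longest)


if __name__ == "__main__":
    # Uniform case: every job may use at most 2 of the 4 machines.
    print(makespan_lower_bound([12, 6, 6], [2, 2, 2], 4))
    # Here one long job dominates the area bound.
    print(makespan_lower_bound([20, 2], [2, 2], 4))
```

Exact fractions are used so that the bounds compare without floating-point rounding.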

Aarohi: Making Real-Time Node Failure Prediction Feasible

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00115

Anwesha Das, F. Mueller, B. Rountree

Abstract: Large-scale production systems are well known to encounter node failures, which affect compute capacity and energy. Both in HPC systems and enterprise data centers, combating failures is becoming challenging with increasing hardware and software complexity. Several data mining solutions for logs have been investigated in the context of anomaly detection in such systems. However, with subsequent proactive failure mitigation, the existing log mining solutions are not sufficiently fast for real-time anomaly detection. Machine learning (ML)-based training can produce high accuracy, but the inference scheme needs to be enhanced with rapid parsers to assess anomalies in real time. This work tackles online anomaly prediction in computing systems by exploiting context-free grammar-based rapid event analysis. We present our framework Aarohi, which describes an effective way to predict failures online. Aarohi is designed to be generic and scalable, making it suitable as a real-time predictor. Aarohi obtains lead times of more than 3 minutes to node failures, with an average prediction time of 0.31 ms for a chain length of 18. The overall improvement obtained w.r.t. the existing state of the art is over a factor of 27.4×. Our compiler-based approach provides new research directions for lead time optimization with a significant prediction speedup required for the deployment of proactive fault-tolerant solutions in practice.

Pages: 1092-1101

Citations: 15
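
The notion of matching a chain of log events online, one event at a time, can be sketched with a much simpler mechanism than the one the abstract describes. The class below is an ordered-subsequence matcher, not a context-free grammar parser and not Aarohi's implementation. The class name and the event names are hypothetical.

```python
class ChainMatcher:
    """Streaming matcher for one ordered chain of event identifiers.

    Events that are not the next expected chain element are skipped, so
    unrelated log lines may be interleaved with the chain. Each call to
    feed() does a constant amount of work.
    """

    def __init__(self, chain):
        self.chain = list(chain)
        if not self.chain:
            raise ValueError("chain must contain at least one event")
        self.position = 0  # index of the next chain element to match

    def feed(self, event):
        """Consume one event; return True when the chain has just completed."""
        if event == self.chain[self.position]:
            self.position += 1
            if self.position == len(self.chain):
                self.position = 0  # reset to look for the next occurrence
                return True
        return False


if __name__ == "__main__":
    # Hypothetical event identifiers, for illustration only.
    matcher = ChainMatcher(["ecc_warn", "link_retry", "node_hang"])
    stream = ["boot", "ecc_warn", "noise", "link_retry",
              "ecc_warn", "node_hang", "idle"]
    for index, event in enumerate(stream):
        if matcher.feed(event):
            print("chain completed at event", index)
```

In a predictor, a warning would be raised before the final element of the chain arrives; the sketch reports only full completion to keep the state machine minimal.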

G-PBFT: A Location-based and Scalable Consensus Protocol for IoT-Blockchain Applications

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00074

Laphou Lao, Xiaohai Dai, Bin Xiao, Songtao Guo

Abstract: IoT-blockchain applications offer the advantages of managing massive numbers of IoT devices and achieving advanced data security and data credibility. However, there are still some challenges when deploying IoT applications on blockchain systems due to the limited storage, power, and computing capability of IoT devices. Applying current consensus protocols to IoT applications may be vulnerable to Sybil node attacks or suffer from high computational cost and low scalability. In this paper, we propose G-PBFT (Geographic-PBFT), a new location-based and scalable consensus protocol designed for IoT-blockchain applications. The principle of G-PBFT is based on the fact that most IoT-blockchain applications rely on fixed IoT devices for data collection and processing. Fixed IoT devices have more computational power than other mobile IoT devices, e.g., mobile phones and sensors, and are less likely to become malicious nodes. G-PBFT exploits geographic information of fixed IoT devices to reach consensus, thus avoiding Sybil attacks. In G-PBFT, we select those fixed, loyal, and capable nodes as endorsers, reducing the overhead for validating and recording transactions. As a result, G-PBFT achieves high consensus efficiency and low traffic intensity. Moreover, G-PBFT uses a new era switch mechanism to handle the dynamics of the IoT network. To evaluate our protocol, we conduct extensive experiments to compare the performance of G-PBFT against existing consensus protocols with over 200 participating nodes in a blockchain system. Experimental results demonstrate that G-PBFT significantly reduces consensus time and network overhead, and is scalable for IoT applications.

Pages: 664-673

Citations: 58
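
G-PBFT's efficiency argument rests on running consensus among a smaller set of endorsers. The arithmetic behind that can be sketched with the standard PBFT bounds (n ≥ 3f + 1 replicas, quorums of 2f + 1) and one common way of counting normal-case messages. The function names and the committee size in the usage note are illustrative assumptions. They are not taken from the paper, and G-PBFT's own message pattern may differ.

```python
def max_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 (Byzantine replicas tolerated)."""
    if n < 1:
        raise ValueError("need at least one replica")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Matching messages a replica waits for in PBFT: 2f + 1."""
    return 2 * max_faults(n) + 1


def normal_case_messages(n: int) -> int:
    """Messages for one request in PBFT's three phases, counting:
      pre-prepare: primary sends to the n - 1 backups;
      prepare    : each of the n - 1 backups multicasts to n - 1 others;
      commit     : each of the n replicas multicasts to n - 1 others.
    Client request and replies are not counted.
    """
    pre_prepare = n - 1
    prepare = (n - 1) * (n - 1)
    commit = n * (n - 1)
    return pre_prepare + prepare + commit


if __name__ == "__main__":
    for n in (4, 22, 200):
        print(n, max_faults(n), quorum_size(n), normal_case_messages(n))
```

With this accounting, 200 participating nodes exchange 79,600 messages per request, while a hypothetical committee of 22 endorsers exchanges 924. The quadratic growth is why shrinking the set of nodes that run the protocol reduces traffic.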

IPDPS 2020 List Reviewer Page

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/ipdps47924.2020.00010

Citations: 0

The Impossibility of Fast Transactions

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00120

K. Antoniadis, Diego Didona, R. Guerraoui, W. Zwaenepoel

Citations: 2

HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00017

H. Devarajan, Anthony Kougkas, Xian-He Sun

Citations: 10

Sturgeon: Preference-aware Co-location for Improving Utilization of Power Constrained Computers

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00079

Pu Pang, Quan Chen, Deze Zeng, Chao Li, Jingwen Leng, Wenli Zheng, M. Guo

Abstract: Large-scale datacenters often host latency-sensitive services that have stringent Quality-of-Service (QoS) requirements and experience diurnal load patterns. Co-locating best-effort applications that have no QoS requirement with latency-sensitive services has been widely used to improve resource utilization with careful shared resource management. However, existing co-location techniques tend to cause power overload on power constrained computers because they ignore power consumption. To this end, we propose Sturgeon, a runtime system that proactively manages resources between colocated applications in a power constrained environment, to ensure the QoS of latency-sensitive services while maximizing resource utilization. Our investigation shows that, at a given load, there are multiple feasible resource configurations that meet both the QoS requirement and the power budget, while one of them yields the maximum throughput of best-effort applications. To find such a configuration, we establish models to accurately predict the performance and power consumption of the colocated applications. Sturgeon monitors the QoS periodically in order to eliminate the potential QoS violations caused by unpredictable interference. The experimental results show that Sturgeon improves the throughput of best-effort applications by 24.96% compared to the state-of-the-art technique, while guaranteeing the 95%-ile latency within the QoS target.

Pages: 718-727

Citations: 5
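
The search the abstract describes, choosing among feasible configurations the one that maximizes best-effort throughput, can be sketched as a constrained maximization over predicted values. This is not Sturgeon's implementation. The exhaustive search, the configuration shape (cores for the latency-sensitive service, frequency for best-effort cores), and the three toy prediction models are all hypothetical stand-ins for the models the authors build.

```python
from itertools import product

TOTAL_CORES = 8  # hypothetical machine size


def toy_latency_ms(config):
    """Hypothetical model: latency falls as the service gets more cores."""
    ls_cores, _ = config
    return 120 / ls_cores


def toy_power_watts(config):
    """Hypothetical model: power grows with cores and best-effort frequency."""
    ls_cores, be_freq = config
    be_cores = TOTAL_CORES - ls_cores
    return 10 * ls_cores + 6 * be_cores * be_freq


def toy_be_throughput(config):
    """Hypothetical model: best-effort throughput is cores times frequency."""
    ls_cores, be_freq = config
    return (TOTAL_CORES - ls_cores) * be_freq


def best_configuration(configs, latency_of, power_of, throughput_of,
                       qos_target, power_budget):
    """Return the feasible configuration with the highest predicted
    best-effort throughput, or None if no configuration is feasible."""
    feasible = [c for c in configs
                if latency_of(c) <= qos_target and power_of(c) <= power_budget]
    if not feasible:
        return None
    return max(feasible, key=throughput_of)


if __name__ == "__main__":
    candidates = list(product(range(1, TOTAL_CORES), (1.0, 1.5, 2.0)))
    print(best_configuration(candidates, toy_latency_ms, toy_power_watts,
                             toy_be_throughput, qos_target=40,
                             power_budget=100))
```

Tightening the power budget in this toy setup shifts the choice to a lower best-effort frequency rather than to fewer best-effort cores, which illustrates why power has to be part of the search and not an afterthought.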