2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers

Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00071
D. Yan, Wei Wang, X. Chu
Abstract: Half-precision matrix multiply has played a key role in the training of deep learning models. The newly designed NVIDIA Tensor Cores offer native instructions for half-precision small matrix multiply, on which Half-precision General Matrix Multiply (HGEMM) routines are built and exposed through high-level APIs. In this paper, we, for the first time, demystify in great detail how Tensor Cores on the NVIDIA Turing architecture work, including the instructions used, the registers and data layout required, and the throughput and latency of Tensor Core operations. We further benchmark the memory system of Turing GPUs and conduct a quantitative analysis of the performance. Our analysis shows that the bandwidth of DRAM, L2 cache, and shared memory is the new bottleneck for HGEMM, whose performance was previously believed to be bound by computation. Based on our newly discovered features of Tensor Cores, we apply a series of optimization techniques to the Tensor Core-based HGEMM, including blocking size optimization, data layout redesign, data prefetching, and instruction scheduling. Extensive evaluation results show that our optimized HGEMM routine achieves an average of 1.73× and 1.46× speedup over the native implementation of cuBLAS 10.1 on NVIDIA Turing RTX2070 and T4 GPUs, respectively. The code of our implementation is written in native hardware assembly (SASS).
Pages: 634-643
Citations: 44
Mitigating Large Response Time Fluctuations through Fast Concurrency Adapting in Clouds
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00046
Jianshu Liu, Shungeng Zhang, Qingyang Wang, Jinpeng Wei
Abstract: Dynamically reallocating computing resources to handle bursty workloads is a common practice for web applications (e.g., e-commerce) in clouds. However, our empirical analysis of a standard n-tier benchmark application (RUBBoS) shows that simply scaling an n-tier application by reallocating hardware resources, without quickly adapting soft resources (e.g., server threads, connections), may lead to large response time fluctuations. This is because soft resources control the workload concurrency of component servers in the system: adding or removing hardware resources such as Virtual Machines (VMs) can implicitly change the workload concurrency of dependent servers, causing either under- or over-utilization of the critical hardware resource in the system. To quickly identify the optimal soft resource allocation of each server in the system and stabilize response time fluctuations, we propose a novel Scatter-Concurrency-Throughput (SCT) model based on monitoring each server's real-time concurrency and throughput. We then implement a Concurrency-aware system Scaling (ConScale) framework, which integrates the SCT model to quickly adapt the soft resource allocations of key servers during the system scaling process. Our experiments using six realistic bursty workload traces show that ConScale can effectively mitigate the response time fluctuations of the target web application compared to state-of-the-art cloud scaling strategies such as EC2-AutoScaling.
Pages: 368-377
Citations: 1
The Case of Performance Variability on Dragonfly-based Systems
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00096
A. Bhatele, Jayaraman J. Thiagarajan, Taylor L. Groves, Rushil Anirudh, Staci A. Smith, B. Cook, D. Lowenthal
Abstract: The performance of a parallel code running on a large supercomputer can vary significantly from one run to another, even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate performance variability arising from network effects on supercomputers that use a dragonfly topology, specifically Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.
Pages: 896-905
Citations: 16
Scheduling Malleable Jobs Under Topological Constraints
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00041
E. Bampis, Konstantinos Dogeas, A. Kononov, Giorgio Lucarelli, Fanny Pascual
Abstract: Bleuse et al. (EuroPar 2018) introduced a general model for interference-aware scheduling in large-scale parallel platforms. They considered two different types of communications: the flows induced by data exchanges during computations and the flows related to Input/Output operations. Rather than taking these communications into account explicitly, they restrict the possible allocations of a job by external topological constraints. In their work, jobs are considered to be rigid: a job requires a specific number of machines in order to be executed. Here, we first adopt the same framework for the platform and the aforementioned topological constraints. We show that there is no polynomial-time approximation algorithm under the rigid setting with ratio smaller than 3/2, unless P = NP. Then, we focus on the malleable setting. We show that in the proportional-malleable setting, where the work of every job remains constant independently of the number of machines on which it is executed, the scheduling problem remains NP-hard even in the uniform case, where the maximum number of machines is the same for all the jobs. Then, we propose a 2-approximation algorithm for this case. Furthermore, we present an approximation algorithm solving the more general case where the maximum number of machines is job-dependent and the work of the jobs increases with the number of machines used, due to the communication overhead.
Pages: 316-325
Citations: 4
Aarohi: Making Real-Time Node Failure Prediction Feasible
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00115
Anwesha Das, F. Mueller, B. Rountree
Abstract: Large-scale production systems are well known to encounter node failures, which affect compute capacity and energy. Both in HPC systems and enterprise data centers, combating failures is becoming challenging with increasing hardware and software complexity. Several data mining solutions for logs have been investigated in the context of anomaly detection in such systems. However, with subsequent proactive failure mitigation, the existing log mining solutions are not sufficiently fast for real-time anomaly detection. Machine learning (ML)-based training can produce high accuracy, but the inference scheme needs to be enhanced with rapid parsers to assess anomalies in real time. This work tackles online anomaly prediction in computing systems by exploiting context-free grammar-based rapid event analysis. We present our framework Aarohi, which describes an effective way to predict failures online. Aarohi is designed to be generic and scalable, making it suitable as a real-time predictor. Aarohi obtains lead times of more than 3 minutes to node failures, with an average prediction time of 0.31 ms for a chain length of 18. The overall improvement obtained with respect to the existing state of the art is over a factor of 27.4×. Our compiler-based approach provides new research directions for lead time optimization, with the significant prediction speedup required for deploying proactive fault-tolerant solutions in practice.
Pages: 1092-1101
Citations: 15
G-PBFT: A Location-based and Scalable Consensus Protocol for IoT-Blockchain Applications
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00074
Laphou Lao, Xiaohai Dai, Bin Xiao, Songtao Guo
Abstract: IoT-blockchain applications have the advantages of managing massive numbers of IoT devices and achieving advanced data security and data credibility. However, there are still some challenges when deploying IoT applications on blockchain systems due to the limited storage, power, and computing capability of IoT devices. Applying current consensus protocols to IoT applications may be vulnerable to Sybil node attacks or suffer from high computational cost and low scalability. In this paper, we propose G-PBFT (Geographic-PBFT), a new location-based and scalable consensus protocol designed for IoT-blockchain applications. The principle of G-PBFT is based on the fact that most IoT-blockchain applications rely on fixed IoT devices for data collection and processing. Fixed IoT devices have more computational power than other mobile IoT devices, e.g., mobile phones and sensors, and are less likely to become malicious nodes. G-PBFT exploits geographic information of fixed IoT devices to reach consensus, thus avoiding Sybil attacks. In G-PBFT, we select those fixed, loyal, and capable nodes as endorsers, reducing the overhead for validating and recording transactions. As a result, G-PBFT achieves high consensus efficiency and low traffic intensity. Moreover, G-PBFT uses a new era switch mechanism to handle the dynamics of the IoT network. To evaluate our protocol, we conduct extensive experiments to compare the performance of G-PBFT against existing consensus protocol with over 200 participating nodes in a blockchain system. Experimental results demonstrate that G-PBFT significantly reduces consensus time and network overhead, and is scalable for IoT applications.
Pages: 664-673
Citations: 58
IPDPS 2020 List Reviewer Page
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/ipdps47924.2020.00010
Citations: 0
The Impossibility of Fast Transactions
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00120
K. Antoniadis, Diego Didona, R. Guerraoui, W. Zwaenepoel
Abstract: We prove that transactions cannot be fast in an asynchronous fault-tolerant system. Our result holds in any system where we require transactions to ensure monotonic writes, or any stronger consistency model, such as causal consistency. Thus, our result unveils an important, and so far unknown, limitation of fast transactions: they are impossible if we want to tolerate the failure of even one server.
Pages: 1143-1154
Citations: 2
HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00017
H. Devarajan, Anthony Kougkas, Xian-He Sun
Abstract: In the era of data-intensive computing, accessing data with high throughput and low latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency. However, existing solutions do not consider the new deep memory and storage hierarchy, and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach can resolve challenges such as cache pollution and redundancy. In this study, we present HFetch, a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. We demonstrate the benefits of such an approach. Results show 10-35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.
Pages: 62-72
Citations: 10
Sturgeon: Preference-aware Co-location for Improving Utilization of Power Constrained Computers
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI: 10.1109/IPDPS47924.2020.00079
Pu Pang, Quan Chen, Deze Zeng, Chao Li, Jingwen Leng, Wenli Zheng, M. Guo
Abstract: Large-scale datacenters often host latency-sensitive services that have stringent Quality-of-Service (QoS) requirements and experience a diurnal load pattern. Co-locating best-effort applications that have no QoS requirement with latency-sensitive services has been widely used to improve resource utilization with careful shared resource management. However, existing co-location techniques tend to cause power overload on power-constrained computers because they ignore power consumption. To this end, we propose Sturgeon, a runtime system that proactively manages resources between co-located applications in a power-constrained environment, to ensure the QoS of latency-sensitive services while maximizing resource utilization. Our investigation shows that, at a given load, there are multiple feasible resource configurations that meet both the QoS requirement and the power budget, while one of them yields the maximum throughput of best-effort applications. To find such a configuration, we establish models to accurately predict the performance and power consumption of the co-located applications. Sturgeon monitors the QoS periodically in order to eliminate potential QoS violations caused by unpredictable interference. The experimental results show that Sturgeon improves the throughput of best-effort applications by 24.96% compared to the state-of-the-art technique, while guaranteeing the 95%-ile latency within the QoS target.
Pages: 718-727
Citations: 5