2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Publications

Scaling and Optimizing the Gysela Code on a Cluster of Many-Core Processors
G. Latu, Y. Asahi, Julien Bigot, Tamas B. Fehér, V. Grandgirard
DOI: 10.1109/CAHPC.2018.8645933 | Published: 2018-09-01
Abstract: The current generation of the Xeon Phi Knights Landing (KNL) processor provides a highly multi-threaded environment on which regular programming models such as MPI/OpenMP can be used. Many factors impact the performance achieved by applications on these devices: one key point is the efficient exploitation of SIMD vector units, and another is the memory access pattern. Work has been conducted to adapt a plasma turbulence application, namely Gysela, to this architecture. A set of different techniques has been used: standard vectorization techniques, auto-tuning of one computation kernel, and switching to a high-order scheme. As a result, KNL execution times have been reduced by up to a factor of 3. This effort has also yielded a speedup of 2x on the Broadwell architecture and 3x on Skylake. Good scalability curves up to a few thousand cores have been obtained in a strong-scaling experiment. Incremental work meant a large payoff without resorting to low-level intrinsics.
Citations: 5
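The abstract above mentions auto-tuning of one computation kernel. As a minimal editorial sketch (not the Gysela tooling, which targets Fortran/C kernels on KNL), the idea of brute-force auto-tuning can be illustrated in Python: time a toy kernel over a set of candidate tile sizes and keep the fastest. The kernel, tile sizes, and timing harness below are all hypothetical.

```python
# Minimal sketch of brute-force kernel auto-tuning (illustrative only).
import time
import numpy as np

def stencil_kernel(a, tile):
    # Toy "kernel": blocked 1D 3-point stencil; the tile size changes how
    # the loop is traversed, which is the knob an auto-tuner explores.
    out = np.empty_like(a)
    n = len(a)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        for i in range(max(start, 1), min(end, n - 1)):
            out[i] = 0.25 * a[i - 1] + 0.5 * a[i] + 0.25 * a[i + 1]
    out[0], out[-1] = a[0], a[-1]
    return out

def time_once(kernel, data, tile):
    t0 = time.perf_counter()
    kernel(data, tile)
    return time.perf_counter() - t0

def autotune(kernel, data, candidates, repeats=3):
    # Time every candidate configuration and keep the fastest one.
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        t = min(time_once(kernel, data, tile) for _ in range(repeats))
        if t < best_time:
            best_tile, best_time = tile, t
    return best_tile, best_time

if __name__ == "__main__":
    data = np.random.rand(1 << 16)
    tile, elapsed = autotune(stencil_kernel, data, [256, 1024, 4096, 16384])
    print(f"best tile size: {tile} ({elapsed * 1e3:.2f} ms)")
```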
Towards Green Scientific Data Compression Through High-Level I/O Interfaces
Yevhen Alforov, T. Ludwig, Anastasiia Novikova, Michael Kuhn, J. Kunkel
DOI: 10.1109/CAHPC.2018.8645921 | Published: 2018-09-01
Abstract: Every HPC system today has to cope with a deluge of data generated by scientific applications, simulations, or large-scale experiments. The upscaling of supercomputer systems and infrastructures generally results in a dramatic increase in their energy consumption. In this paper, we argue that techniques like data compression can lead to significant gains in power efficiency by reducing both network and storage requirements. However, any data reduction is highly data-specific and should comply with established requirements; an unsuitable or inappropriate compression strategy can consume more resources and energy than necessary. To that end, we propose a novel methodology for on-the-fly, intelligent determination of energy-efficient data reduction for a given data set by leveraging state-of-the-art compression algorithms and metadata at application-level I/O. We motivate our work by analyzing the energy and storage saving needs of data sets from real-life scientific HPC applications, and review the various lossless compression techniques that can be applied. We find that the resulting data reduction can decrease the data volume transferred and stored by as much as 80% in some cases, leading to significant savings in storage and networking costs.
Citations: 4
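To make the "choose a suitable compressor per data set" idea concrete, here is a minimal sketch, assuming a simple probe-and-pick policy over standard Python lossless compressors. It is not the paper's decision logic (which also exploits application-level metadata and energy measurements); compressor choices, the probe data, and the time budget are assumptions.

```python
# Sketch: pick a lossless compressor for a data sample by measuring
# compression ratio and time on a small probe (illustrative only).
import bz2, lzma, time, zlib

import numpy as np

COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def probe(sample: bytes):
    # Return {name: (compression_ratio, seconds)} for each candidate.
    results = {}
    for name, fn in COMPRESSORS.items():
        t0 = time.perf_counter()
        compressed = fn(sample)
        elapsed = time.perf_counter() - t0
        results[name] = (len(sample) / len(compressed), elapsed)
    return results

def choose(results, max_seconds=1.0):
    # Simple policy: best ratio among compressors fast enough on the probe;
    # an energy-aware policy would additionally weigh measured power.
    fast_enough = {k: v for k, v in results.items() if v[1] <= max_seconds}
    return max(fast_enough, key=lambda k: fast_enough[k][0]) if fast_enough else None

if __name__ == "__main__":
    sample = np.sin(np.linspace(0, 100, 200_000)).astype("float32").tobytes()
    stats = probe(sample)
    print(stats, "->", choose(stats))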
Large Scale Language Modeling: Converging on 40GB of Text in Four Hours
Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro
DOI: 10.1109/CAHPC.2018.8645935 | Published: 2018-08-03
Abstract: Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets [1] and then transfer the knowledge gained from these models to a variety of tasks [2]. Following [3], in this work we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) on natural language tasks. By utilizing mixed-precision arithmetic and a 32k batch size distributed across 128 NVIDIA Tesla V100 GPUs, we are able to train a character-level 4096-dimension multiplicative LSTM (mLSTM) [4] for unsupervised text reconstruction over 3 epochs of the 40 GB Amazon Reviews dataset [5] in four hours. This runtime compares favorably with previous work that took one month to train the same size and configuration for one epoch over the same dataset [3]. Converging large-batch RNN models can be challenging. Recent work has suggested scaling the learning rate as a function of batch size, but we find that simply doing so leads either to significantly worse convergence or to immediate divergence for this problem. We provide a learning rate schedule that allows our model to converge with a 32k batch size. Since our model converges over the Amazon Reviews dataset in hours, and our compute requirement of 128 Tesla V100 GPUs, while substantial, is commercially available, this work opens up large-scale unsupervised NLP training to most commercial applications and deep learning researchers. Our code is publicly available at https://github.com/NVIDIA/sentiment-discovery; a model can be trained over most public or private text datasets overnight.
Citations: 26
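The abstract contrasts a plain linear learning-rate scaling rule with a schedule tuned for a 32k batch. The sketch below illustrates the general shape of such a policy (linear scaling of the base rate plus warmup and decay); it is an editorial illustration, and all constants (base_lr, warmup_steps, total_steps) are placeholders, not the schedule from the paper.

```python
# Sketch of a large-batch learning-rate policy: linear scaling with the
# batch size plus a warmup ramp and linear decay. Constants are placeholders.
def lr_at_step(step, *, base_lr=1.25e-4, base_batch=128, batch=32768,
               warmup_steps=500, total_steps=10_000):
    scaled_lr = base_lr * (batch / base_batch)   # linear scaling rule
    if step < warmup_steps:
        # Ramp up from a small LR to avoid early divergence at large batch.
        return scaled_lr * (step + 1) / warmup_steps
    # Decay linearly to zero over the remaining steps.
    remaining = max(total_steps - step, 0)
    return scaled_lr * remaining / (total_steps - warmup_steps)

if __name__ == "__main__":
    print([round(lr_at_step(s), 6) for s in (0, 250, 500, 5000, 9999)])
```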
T-SNE-CUDA: GPU-Accelerated T-SNE and its Applications to Modern Data
David Chan, Roshan Rao, Forrest Huang, J. Canny
DOI: 10.1109/CAHPC.2018.8645912 | Published: 2018-07-31
Abstract: Modern datasets and models are notoriously difficult to explore and analyze due to their inherent high dimensionality and massive numbers of samples. Existing visualization methods that employ dimensionality reduction to two or three dimensions are often inefficient and/or ineffective for these datasets. This paper introduces T-SNE-CUDA, a GPU-accelerated implementation of t-distributed Stochastic Neighbor Embedding (t-SNE) for visualizing datasets and models. T-SNE-CUDA significantly outperforms current implementations with 50-700x speedups on the CIFAR-10 and MNIST datasets. These speedups enable, for the first time, visualization of the neural network activations on the entire ImageNet dataset, a feat that was previously computationally intractable. We also demonstrate visualization performance in the NLP domain by visualizing the GloVe embedding vectors. From these visualizations, we can draw interesting conclusions about using the L2 metric in these embedding spaces. T-SNE-CUDA is publicly available at https://github.com/CannyLab/tsne-cuda.
Citations: 78
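Since the library is public, a short usage sketch may help. This follows the scikit-learn-style interface the project documents (constructor arguments may differ between releases, so treat the exact parameters as assumptions), and falls back to scikit-learn's CPU t-SNE when tsnecuda is not installed. The input matrix is a random placeholder.

```python
# Sketch of embedding a feature matrix with tsne-cuda (or scikit-learn as a
# CPU fallback). Data and parameters are placeholders for illustration.
import numpy as np

X = np.random.rand(5_000, 64).astype("float32")  # placeholder features

try:
    from tsnecuda import TSNE          # GPU-accelerated implementation
except ImportError:
    from sklearn.manifold import TSNE  # CPU fallback for illustration

embedding = TSNE(n_components=2, perplexity=30, learning_rate=200).fit_transform(X)
print(embedding.shape)  # (5000, 2)
```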
An Argument in Favor of Strong Scaling for Deep Neural Networks with Small Datasets
R. L. F. Cunha, E. Rodrigues, Matheus Palhares Viana, Dario Augusto Borges Oliveira
DOI: 10.1109/SBAC-PAD.2018.00057 | Published: 2018-07-24
Abstract: In recent years, with the popularization of deep learning frameworks and large datasets, researchers have started parallelizing their models in order to train faster. This is crucially important because they typically explore many hyperparameters in order to find the best ones for their applications. This process is time-consuming and, consequently, speeding up training improves productivity. One approach to parallelizing deep learning models followed by many researchers is based on weak scaling: the minibatches increase in size as new GPUs are added to the system. In addition, new learning rate schedules have been proposed to fix optimization issues that occur with large minibatch sizes. In this paper, however, we show that the recommendations provided by recent work do not apply to models that lack large datasets. In fact, we argue in favor of using strong scaling for achieving reliable performance in such cases. We evaluated our approach with up to 32 GPUs and show that weak scaling not only fails to reach the same accuracy as the sequential model, it also fails to converge most of the time. Meanwhile, strong scaling has good scalability while achieving exactly the same accuracy as a sequential implementation.
Citations: 2
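The weak-vs-strong distinction in this abstract reduces to how the minibatch is partitioned across GPUs. A minimal sketch of that bookkeeping follows; the batch size and GPU count are placeholders, not the paper's experimental setup.

```python
# Sketch contrasting weak and strong scaling of the minibatch across GPUs
# (illustrative only; numbers are placeholders).
def per_gpu_batch(strategy, base_batch=64, num_gpus=8):
    if strategy == "weak":
        # Weak scaling: each GPU keeps the base batch, so the effective
        # global batch grows with the number of GPUs (changing optimization).
        return base_batch, base_batch * num_gpus
    if strategy == "strong":
        # Strong scaling: the global batch stays at the sequential value and
        # is split across GPUs, preserving the sequential optimization path.
        assert base_batch % num_gpus == 0, "batch must divide evenly"
        return base_batch // num_gpus, base_batch
    raise ValueError(strategy)

for s in ("weak", "strong"):
    local, global_ = per_gpu_batch(s)
    print(f"{s:6s}: per-GPU batch = {local:3d}, global batch = {global_}")
```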
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Behzad Salami, O. Unsal, A. Cristal
DOI: 10.1109/CAHPC.2018.8645906 | Published: 2018-06-14
Abstract: Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data, which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology nodes scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) models of NN accelerators, in particular fault characterization and mitigation. Following a High-Level Synthesis (HLS) approach, we first characterize the vulnerability of various components of the RTL NN. We observe that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate values) and NN layers, and ii) architectural-level specifications, i.e., the data representation model and the parallelism degree of the underlying accelerator. Second, motivated by the characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, performing 47.3% better than state-of-the-art methods.
Citations: 59
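The fault characterization described above rests on injecting bit flips into accelerator state and observing the accuracy impact. As an editorial illustration only (not the paper's RTL/HLS fault model), a software-level sketch of bit-flip injection into fixed-point weights might look like this; the word width, weight format, and flip counts are assumptions.

```python
# Sketch of bit-flip fault injection into fixed-point NN weights to study
# sensitivity (illustrative; not the paper's RTL fault-injection setup).
import numpy as np

def inject_bit_flips(weights_q, num_flips, word_bits=16, rng=None):
    # weights_q: int16 array holding fixed-point (e.g., Q8.8) weights.
    rng = rng or np.random.default_rng(0)
    faulty = weights_q.copy()
    flat = faulty.reshape(-1)
    idx = rng.integers(0, flat.size, size=num_flips)    # which words to hit
    bits = rng.integers(0, word_bits, size=num_flips)   # which bit in each word
    flat[idx] ^= (1 << bits).astype(flat.dtype)         # XOR flips the chosen bit
    return faulty

if __name__ == "__main__":
    w = (np.random.randn(256, 256) * 256).astype(np.int16)  # toy Q8.8 weights
    w_faulty = inject_bit_flips(w, num_flips=100)
    print("mean abs perturbation:", np.abs(w_faulty - w).mean())
```

In a characterization study, one would run the NN with `w_faulty` and compare its accuracy against the fault-free baseline across many injection trials.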