{"title":"Efficient dentry lookup with backward finding mechanism","authors":"N. Song, Hwajung Kim, Hyuck Han, H. Yeom","doi":"10.1145/3149457.3149468","DOIUrl":"https://doi.org/10.1145/3149457.3149468","url":null,"abstract":"As modern computer systems face the challenge of managing large data, filesystems must deal with a large number of files. This leads to amplified concerns of metadata and data operations. Filesystems in Linux manage the metadata of files by constructing in-memory structures such as directory entry (dentry) and inode. However, we found inefficiencies in metadata management mechanisms, especially in the path traversal mechanism of Linux file systems when searching for a dentry in the dentry cache. In this paper, we optimize metadata operations of path traversing by searching for the dentry in the backward manner. By using the backward finding mechanism, we can find the target dentry with reduced number of dentry cache lookups when compared with the original forward finding mechanism. However, this backward path lookup mechanism complicates permission guarantee of each path component. We addess this issue by proposing the use of a permission-granted list. We have evaluated our optimized techniques with several benchmarks including real-world workload. 
The experimental results show that our optimizations improve path lookup latency by up to 40% and overall throughput by up to 56% in real-world benchmarks which has a number of path-deepen files.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130886622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing","authors":"Ryohei Kobayashi, Yuma Oobata, N. Fujita, Y. Yamaguchi, T. Boku","doi":"10.1145/3149457.3149479","DOIUrl":"https://doi.org/10.1145/3149457.3149479","url":null,"abstract":"Field programmable gate arrays (FPGAs) have gained attention in high-performance computing (HPC) research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore's Law. In addition to FPGA performance improvements, OpenCL-based FPGA development toolchains have been developed and offered by FPGA vendors, which reduces the programming effort required as compared to the past. These improvements reveal the possibilities of realizing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is one of the keys to more improve the performance of modern heterogeneous supercomputers using accelerators like GPUs. In this paper, we propose high-performance inter-FPGA Ethernet communication using OpenCL and Verilog HDL mixed programming in order to demonstrate the feasibility of realizing this concept. OpenCL is used to program application algorithms and data movement control when Verilog HDL is used to implement low-level components for Ethernet communication. 
Experimental results using ping-pong programs showed that our proposed approach achieves a latency of 0.99 μs and as much as 4.97 GB/s between FPGAs over different nodes, thus confirming that the proposed method is effective at realizing this concept.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125179350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance improvement of the general-purpose CFD code FrontFlow/blue on the K computer","authors":"Kiyoshi Kumahata, K. Minami, Y. Yamade, C. Kato","doi":"10.1145/3149457.3149470","DOIUrl":"https://doi.org/10.1145/3149457.3149470","url":null,"abstract":"The general-purpose fluid simulation software FrontFlow/blue (FFB) is based on the finite element method (FEM). It was designed to accept extremely large-scale simulations and is an important application in the manufacturing field in Japan. Moreover, since this application is significant in both the manufacturing field and the development of the post-K supercomputer, it is employed as an important application for the new post-K supercomputer that is under development. The K computer is still the important infrastructure in Japan. And there are some supercomputers having the same architecture to the K computer. Therefore we continue to improve the performance of the FFB on the K computer. On significant subroutines, several improvement techniques, store order based loop modification decreasing total load and store operations, unrolled loop rerolling to employ SIMD load instruction, adjusting number of arrays in loop, using sector cache function, and so on, were employed. As a result, an improvement of 160% was obtained on a single CPU performance. 
This paper shows and discusses the detail of these improvements.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116833101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Forward Computation in Adjoint Method via Multi-level Blocking","authors":"T. Ikeda, S. Ito, H. Nagao, T. Katagiri, Toru Nagai, M. Ogino","doi":"10.1145/3149457.3149458","DOIUrl":"https://doi.org/10.1145/3149457.3149458","url":null,"abstract":"Data assimilation (DA) is a computational technique that integrates large-scale numerical simulations with observed data, and the adjoint method is classified as a non-sequential DA technique. The target model for the simulations in this paper is the phase-field model, which is often used to simulate the temporal evolution of the internal structures of materials. Since the phase-field method computes a continuous field, a naïve implementation of the adjoint method requires an enormous amount of computation time. One reason for the increase in computation time is that the amount of data required for simulations is much larger than the cache capacity of computers. To reduce memory access and achieve better performance, it is necessary to use computational blocking, which involves reusing data within the cache as much as possible. In this paper, we propose multi-level blocking to optimize forward computation in the adjoint method. The proposed multi-level blocking consists of spatial blocking, temporal blocking, and the blocking of multiple forward computations in the adjoint method. We investigated the effectiveness of the proposed multi-level blocking on the Fujitsu PRIMEHPC FX100 supercomputer. By applying spatial and temporal blocking, we attained a speed-up of 1.89 x in execution time without blocking and that of 1.48 x as the upper limit by applying blocking to multiple forward computations (MFB). 
We also attained a speed-up of 1.13 by applying multi-level blocking to execution time without blocking.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"9 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115676185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlexProtect: A SDN-based DDoS Attack Protection Architecture for Multi-tenant Data Centers","authors":"Ming-Hung Chen, Jyun-Yan Ciou, I. Chung, Cheng-Fu Chou","doi":"10.1145/3149457.3149476","DOIUrl":"https://doi.org/10.1145/3149457.3149476","url":null,"abstract":"With the recent advances in software-defined networking (SDN), the multi-tenant data centers provide more efficient and flexible cloud platform to their subscribers. However, as the number, scale, and diversity of distributed denial-of-service (DDoS) attack is dramatically escalated in recent years, the availability of those platforms is still under risk. We note that the state-of-art DDoS protection architectures did not fully utilize the potential of SDN and network function virtualization (NFV) to mitigate the impact of attack traffic on data center network. Therefore, in this paper, we exploit the flexibility of SDN and NFV to propose FlexProtect, a flexible distributed DDoS protection architecture for multi-tenant data centers. In FlexProtect, the detection virtual network functions (VNFs) are placed near the service provider and the defense VNFs are placed near the edge routers for effectively detection and avoid internal bandwidth consumption, respectively. Based on the architecture, we then propose FP-SYN, an anti-spoofing SYN flood protection mechanism. The emulation and simulation results with real-world data demonstrates that, compared with the traditional approach, the proposed architecture can significantly reduce 46% of the additional routing path and save 60% internal bandwidth consumption. 
Moreover, the proposed detection mechanism for anti-spoofing can achieve 98% accuracy.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115041706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Dynamic n-Tuple Computations in Many-Body Molecular Dynamics","authors":"Patrick E. Small, Kuang Liu, S. Tiwari, R. Kalia, A. Nakano, K. Nomura, P. Vashishta","doi":"10.1145/3149457.3149463","DOIUrl":"https://doi.org/10.1145/3149457.3149463","url":null,"abstract":"Computation on dynamic n-tuples of particles is ubiquitous in scientific computing, with an archetypal application in many-body molecular dynamics (MD) simulations. We propose a tuple-decomposition (TD) approach that reorders computations according to dynamically created lists of n-tuples. We analyze the performance characteristics of the TD approach on general purpose graphics processing units for MD simulations involving pair (n = 2) and triplet (n = 3) interactions. The results show superior performance of the TD approach over the conventional particle-decomposition (PD) approach. Detailed analyses reveal the register footprint as the key factor that dictates the performance. Furthermore, the TD approach is found to outperform PD for more intensive computations of quadruplet (n = 4) interactions in first principles-informed reactive MD simulations based on the reactive force-field (ReaxFF) method. This work thus demonstrates the viable performance portability of the TD approach across a wide range of applications.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116034316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Distributed and Parallel Asynchronous Unite and Conquer Method to Solve Large Scale Non-Hermitian Linear Systems","authors":"Xinzhe Wu, S. Petiton","doi":"10.1145/3149457.3154481","DOIUrl":"https://doi.org/10.1145/3149457.3154481","url":null,"abstract":"Parallel Krylov Subspace Methods are commonly used for solving large-scale sparse linear systems. Facing the development of extreme scale platforms, the minimization of synchronous global communication becomes critical to obtain good efficiency and scalability. This paper highlights a recent development of a hybrid (unite and conquer) method, which combines three computation algorithms together with asynchronous communication to accelerate the resolution of non-Hermitian linear systems and to improve its fault tolerance and reusability. Experimentation shows that our method has an up to 5x speedup and better scalability than the conventional methods for the resolution on hierarchical clusters with hundreds of nodes.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132131451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Portability Layer of an All-pairs Operation for Hierarchical N-Body Algorithm Framework Tapas","authors":"Motohiko Matsuda, Keisuke Fukuda, N. Maruyama","doi":"10.1145/3149457.3149471","DOIUrl":"https://doi.org/10.1145/3149457.3149471","url":null,"abstract":"Tapas is a C++ programming framework for developing hierarchical N-body algorithms such as Barnes-Hut and Fast Multipole Method, designed to experiment new implementations including even variations of tree traversals. A pairwise interaction calculation in N-body simulations, or an all-pairs operation, is an important part of Tapas for performance, which enables accelerations with GPUs. However, there is no commonly agreed all-pairs interface appropriate as a primitive, and moreover, it is not supported in existing data-parallel libraries for GPUs such as NVIDIA's Thrust. Thus, we designed an interface for an all-pairs operation that can be easily adopted in libraries and applications. Tapas's all-pairs has an extra function argument for flexibility, which corresponds to a consumer function of the result of an all-pairs that is missing in existing designs. This addition is not an ad hoc one, but it is guided by the consideration of algorithmic skeletons, which indicates the effect of the added argument cannot be substituted by the other arguments in general. 
The change is just adding an argument, but it gives flexibility to process the result, and the resulting implementation can attain almost the same performance as the tuned N-body implementation in the CUDA examples.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"51 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132801788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Source-to-Source Translation of Coarray Fortran with MPI for High Performance","authors":"H. Iwashita, M. Nakao, H. Murai, M. Sato","doi":"10.1145/3149457.3155888","DOIUrl":"https://doi.org/10.1145/3149457.3155888","url":null,"abstract":"Coarray Fortran (CAF) is a partitioned global address space (PGAS) language that is a part of standard Fortran 2008. We have implemented it as a source-to-source translator as a part of the Omni XcalebleMP compiler. Since the output is written in Fortran standard, the translator must utilize Fortran conventions such as the assumed-shape array and generic function in order to reduce both development costs and runtime overhead. The runtime library uses either GASNet, MPI-3, or Fujitsu's low-level Remote Direct Memory Access (RDMA) interface (FJ-RDMA) for one-sided communication. The Omni CAF translator and the runtime library support three types of memory managers that allocate coarray variables and register them to the communication library. The runtime library for the PUT/GET communication detects how contiguous and periodic the source and destination data are and performs communication aggregation. We measured fundamental performance levels by using EPCC Fortran Coarray microbenchmark and found our implementation of PUT/GET communication provides bandwidth as high as MPI_Send/Recv on two supercomputers. Although the small data latency was larger than the one of MPI_Send/Recv, we found that it could be reduced by using non-blocking communication for multiple coarray variables. 
As a result, when using 1024 processes, we achieved 27% and 42% higher performance than the original MPI code in the Himeno Benchmark classes L and XL, respectively.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121430475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPUhd: Augmenting YARN with GPU Resource Management","authors":"Daisuke Fukutomi, Yuki Iida, Takuya Azumi, S. Kato, N. Nishio","doi":"10.1145/3149457.3155313","DOIUrl":"https://doi.org/10.1145/3149457.3155313","url":null,"abstract":"This paper presents GPUhd, a graphics processing unit (GPU) resource management approach that combines Hadoop and a GPU to obtain scale-out and scale-up functionality. There are several researches that combine Hadoop and GPU. However, there are no researches that can schedule tasks in consideration of GPU resource on Hadoop. Moreover, these researches cannot use multiple distributed frameworks. GPUhd extends the Yet Another Resource Negotiator (YARN) management mechanism and distributed processing frameworks for the coordinated use of GPU resources in Hadoop. We extend the YARN scheduling algorithm to consider GPU resources and incorporate a resources monitoring function. GPU resources can be managed on the basis of existing development methods because GPUhd simply handles GPU resources as host memory and CPU resources. In addition, GPUhd achieves high-speed processing, e.g., the computational time required to calculate 2048 x 2048 matrix multiplication is approximately 25 times less than that required when using only a CPU with Hadoop. 
GPUhd achieves high scalability and excellent response times in a heterogeneous distributed environment.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124490680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}