{"title":"Efficient dentry lookup with backward finding mechanism","authors":"N. Song, Hwajung Kim, Hyuck Han, H. Yeom","doi":"10.1145/3149457.3149468","DOIUrl":"https://doi.org/10.1145/3149457.3149468","url":null,"abstract":"As modern computer systems face the challenge of managing large data, filesystems must deal with a large number of files. This leads to amplified concerns of metadata and data operations. Filesystems in Linux manage the metadata of files by constructing in-memory structures such as directory entry (dentry) and inode. However, we found inefficiencies in metadata management mechanisms, especially in the path traversal mechanism of Linux file systems when searching for a dentry in the dentry cache. In this paper, we optimize metadata operations of path traversing by searching for the dentry in the backward manner. By using the backward finding mechanism, we can find the target dentry with reduced number of dentry cache lookups when compared with the original forward finding mechanism. However, this backward path lookup mechanism complicates permission guarantee of each path component. We addess this issue by proposing the use of a permission-granted list. We have evaluated our optimized techniques with several benchmarks including real-world workload. 
The experimental results show that our optimizations improve path lookup latency by up to 40% and overall throughput by up to 56% in real-world benchmarks which has a number of path-deepen files.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130886622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing","authors":"Ryohei Kobayashi, Yuma Oobata, N. Fujita, Y. Yamaguchi, T. Boku","doi":"10.1145/3149457.3149479","DOIUrl":"https://doi.org/10.1145/3149457.3149479","url":null,"abstract":"Field programmable gate arrays (FPGAs) have gained attention in high-performance computing (HPC) research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore's Law. In addition to FPGA performance improvements, OpenCL-based FPGA development toolchains have been developed and offered by FPGA vendors, which reduces the programming effort required as compared to the past. These improvements reveal the possibilities of realizing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is one of the keys to more improve the performance of modern heterogeneous supercomputers using accelerators like GPUs. In this paper, we propose high-performance inter-FPGA Ethernet communication using OpenCL and Verilog HDL mixed programming in order to demonstrate the feasibility of realizing this concept. OpenCL is used to program application algorithms and data movement control when Verilog HDL is used to implement low-level components for Ethernet communication. 
Experimental results using ping-pong programs showed that our proposed approach achieves a latency of 0.99 μs and as much as 4.97 GB/s between FPGAs over different nodes, thus confirming that the proposed method is effective at realizing this concept.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125179350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance improvement of the general-purpose CFD code FrontFlow/blue on the K computer","authors":"Kiyoshi Kumahata, K. Minami, Y. Yamade, C. Kato","doi":"10.1145/3149457.3149470","DOIUrl":"https://doi.org/10.1145/3149457.3149470","url":null,"abstract":"The general-purpose fluid simulation software FrontFlow/blue (FFB) is based on the finite element method (FEM). It was designed to accept extremely large-scale simulations and is an important application in the manufacturing field in Japan. Moreover, since this application is significant in both the manufacturing field and the development of the post-K supercomputer, it is employed as an important application for the new post-K supercomputer that is under development. The K computer is still the important infrastructure in Japan. And there are some supercomputers having the same architecture to the K computer. Therefore we continue to improve the performance of the FFB on the K computer. On significant subroutines, several improvement techniques, store order based loop modification decreasing total load and store operations, unrolled loop rerolling to employ SIMD load instruction, adjusting number of arrays in loop, using sector cache function, and so on, were employed. As a result, an improvement of 160% was obtained on a single CPU performance. 
This paper shows and discusses the detail of these improvements.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116833101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Forward Computation in Adjoint Method via Multi-level Blocking","authors":"T. Ikeda, S. Ito, H. Nagao, T. Katagiri, Toru Nagai, M. Ogino","doi":"10.1145/3149457.3149458","DOIUrl":"https://doi.org/10.1145/3149457.3149458","url":null,"abstract":"Data assimilation (DA) is a computational technique that integrates large-scale numerical simulations with observed data, and the adjoint method is classified as a non-sequential DA technique. The target model for the simulations in this paper is the phase-field model, which is often used to simulate the temporal evolution of the internal structures of materials. Since the phase-field method computes a continuous field, a naïve implementation of the adjoint method requires an enormous amount of computation time. One reason for the increase in computation time is that the amount of data required for simulations is much larger than the cache capacity of computers. To reduce memory access and achieve better performance, it is necessary to use computational blocking, which involves reusing data within the cache as much as possible. In this paper, we propose multi-level blocking to optimize forward computation in the adjoint method. The proposed multi-level blocking consists of spatial blocking, temporal blocking, and the blocking of multiple forward computations in the adjoint method. We investigated the effectiveness of the proposed multi-level blocking on the Fujitsu PRIMEHPC FX100 supercomputer. By applying spatial and temporal blocking, we attained a speed-up of 1.89 x in execution time without blocking and that of 1.48 x as the upper limit by applying blocking to multiple forward computations (MFB). 
We also attained a speed-up of 1.13 by applying multi-level blocking to execution time without blocking.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"9 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115676185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlexProtect: A SDN-based DDoS Attack Protection Architecture for Multi-tenant Data Centers","authors":"Ming-Hung Chen, Jyun-Yan Ciou, I. Chung, Cheng-Fu Chou","doi":"10.1145/3149457.3149476","DOIUrl":"https://doi.org/10.1145/3149457.3149476","url":null,"abstract":"With the recent advances in software-defined networking (SDN), the multi-tenant data centers provide more efficient and flexible cloud platform to their subscribers. However, as the number, scale, and diversity of distributed denial-of-service (DDoS) attack is dramatically escalated in recent years, the availability of those platforms is still under risk. We note that the state-of-art DDoS protection architectures did not fully utilize the potential of SDN and network function virtualization (NFV) to mitigate the impact of attack traffic on data center network. Therefore, in this paper, we exploit the flexibility of SDN and NFV to propose FlexProtect, a flexible distributed DDoS protection architecture for multi-tenant data centers. In FlexProtect, the detection virtual network functions (VNFs) are placed near the service provider and the defense VNFs are placed near the edge routers for effectively detection and avoid internal bandwidth consumption, respectively. Based on the architecture, we then propose FP-SYN, an anti-spoofing SYN flood protection mechanism. The emulation and simulation results with real-world data demonstrates that, compared with the traditional approach, the proposed architecture can significantly reduce 46% of the additional routing path and save 60% internal bandwidth consumption. 
Moreover, the proposed detection mechanism for anti-spoofing can achieve 98% accuracy.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115041706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Dynamic n-Tuple Computations in Many-Body Molecular Dynamics","authors":"Patrick E. Small, Kuang Liu, S. Tiwari, R. Kalia, A. Nakano, K. Nomura, P. Vashishta","doi":"10.1145/3149457.3149463","DOIUrl":"https://doi.org/10.1145/3149457.3149463","url":null,"abstract":"Computation on dynamic n-tuples of particles is ubiquitous in scientific computing, with an archetypal application in many-body molecular dynamics (MD) simulations. We propose a tuple-decomposition (TD) approach that reorders computations according to dynamically created lists of n-tuples. We analyze the performance characteristics of the TD approach on general purpose graphics processing units for MD simulations involving pair (n = 2) and triplet (n = 3) interactions. The results show superior performance of the TD approach over the conventional particle-decomposition (PD) approach. Detailed analyses reveal the register footprint as the key factor that dictates the performance. Furthermore, the TD approach is found to outperform PD for more intensive computations of quadruplet (n = 4) interactions in first principles-informed reactive MD simulations based on the reactive force-field (ReaxFF) method. This work thus demonstrates the viable performance portability of the TD approach across a wide range of applications.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116034316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Distributed and Parallel Asynchronous Unite and Conquer Method to Solve Large Scale Non-Hermitian Linear Systems","authors":"Xinzhe Wu, S. Petiton","doi":"10.1145/3149457.3154481","DOIUrl":"https://doi.org/10.1145/3149457.3154481","url":null,"abstract":"Parallel Krylov Subspace Methods are commonly used for solving large-scale sparse linear systems. Facing the development of extreme scale platforms, the minimization of synchronous global communication becomes critical to obtain good efficiency and scalability. This paper highlights a recent development of a hybrid (unite and conquer) method, which combines three computation algorithms together with asynchronous communication to accelerate the resolution of non-Hermitian linear systems and to improve its fault tolerance and reusability. Experimentation shows that our method has an up to 5x speedup and better scalability than the conventional methods for the resolution on hierarchical clusters with hundreds of nodes.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132131451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Portability Layer of an All-pairs Operation for Hierarchical N-Body Algorithm Framework Tapas","authors":"Motohiko Matsuda, Keisuke Fukuda, N. Maruyama","doi":"10.1145/3149457.3149471","DOIUrl":"https://doi.org/10.1145/3149457.3149471","url":null,"abstract":"Tapas is a C++ programming framework for developing hierarchical N-body algorithms such as Barnes-Hut and Fast Multipole Method, designed to experiment new implementations including even variations of tree traversals. A pairwise interaction calculation in N-body simulations, or an all-pairs operation, is an important part of Tapas for performance, which enables accelerations with GPUs. However, there is no commonly agreed all-pairs interface appropriate as a primitive, and moreover, it is not supported in existing data-parallel libraries for GPUs such as NVIDIA's Thrust. Thus, we designed an interface for an all-pairs operation that can be easily adopted in libraries and applications. Tapas's all-pairs has an extra function argument for flexibility, which corresponds to a consumer function of the result of an all-pairs that is missing in existing designs. This addition is not an ad hoc one, but it is guided by the consideration of algorithmic skeletons, which indicates the effect of the added argument cannot be substituted by the other arguments in general. 
The change is just adding an argument, but it gives flexibility to process the result, and the resulting implementation can attain almost the same performance as the tuned N-body implementation in the CUDA examples.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"51 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132801788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Source-to-Source Translation of Coarray Fortran with MPI for High Performance","authors":"H. Iwashita, M. Nakao, H. Murai, M. Sato","doi":"10.1145/3149457.3155888","DOIUrl":"https://doi.org/10.1145/3149457.3155888","url":null,"abstract":"Coarray Fortran (CAF) is a partitioned global address space (PGAS) language that is a part of standard Fortran 2008. We have implemented it as a source-to-source translator as a part of the Omni XcalebleMP compiler. Since the output is written in Fortran standard, the translator must utilize Fortran conventions such as the assumed-shape array and generic function in order to reduce both development costs and runtime overhead. The runtime library uses either GASNet, MPI-3, or Fujitsu's low-level Remote Direct Memory Access (RDMA) interface (FJ-RDMA) for one-sided communication. The Omni CAF translator and the runtime library support three types of memory managers that allocate coarray variables and register them to the communication library. The runtime library for the PUT/GET communication detects how contiguous and periodic the source and destination data are and performs communication aggregation. We measured fundamental performance levels by using EPCC Fortran Coarray microbenchmark and found our implementation of PUT/GET communication provides bandwidth as high as MPI_Send/Recv on two supercomputers. Although the small data latency was larger than the one of MPI_Send/Recv, we found that it could be reduced by using non-blocking communication for multiple coarray variables. 
As a result, when using 1024 processes, we achieved 27% and 42% higher performance than the original MPI code in the Himeno Benchmark classes L and XL, respectively.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121430475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPUhd: Augmenting YARN with GPU Resource Management","authors":"Daisuke Fukutomi, Yuki Iida, Takuya Azumi, S. Kato, N. Nishio","doi":"10.1145/3149457.3155313","DOIUrl":"https://doi.org/10.1145/3149457.3155313","url":null,"abstract":"This paper presents GPUhd, a graphics processing unit (GPU) resource management approach that combines Hadoop and a GPU to obtain scale-out and scale-up functionality. There are several researches that combine Hadoop and GPU. However, there are no researches that can schedule tasks in consideration of GPU resource on Hadoop. Moreover, these researches cannot use multiple distributed frameworks. GPUhd extends the Yet Another Resource Negotiator (YARN) management mechanism and distributed processing frameworks for the coordinated use of GPU resources in Hadoop. We extend the YARN scheduling algorithm to consider GPU resources and incorporate a resources monitoring function. GPU resources can be managed on the basis of existing development methods because GPUhd simply handles GPU resources as host memory and CPU resources. In addition, GPUhd achieves high-speed processing, e.g., the computational time required to calculate 2048 x 2048 matrix multiplication is approximately 25 times less than that required when using only a CPU with Hadoop. 
GPUhd achieves high scalability and excellent response times in a heterogeneous distributed environment.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124490680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}