Xiong Xiao, S. Hirasawa, H. Takizawa, Hiroaki Kobayashi
{"title":"Toward Dynamic Load Balancing across OpenMP Thread Teams for Irregular Workloads","authors":"Xiong Xiao, S. Hirasawa, H. Takizawa, Hiroaki Kobayashi","doi":"10.15803/IJNC.7.2_387","DOIUrl":"https://doi.org/10.15803/IJNC.7.2_387","url":null,"abstract":"In the field of high performance computing, massively-parallel many-core processors such as Intel Xeon Phi coprocessors are becoming popular because they can significantly accelerate various applications. In order to efficiently parallelize applications for such many-core processors, several high-level programming models have been proposed. The de facto standard programming model mainly for shared-memory parallel processing is OpenMP. For hierarchical parallel processing, OpenMP version 4.0 or later allows programmers to create multiple thread teams. Each thread team contains a set of newly-created synchronizable threads. When multiple thread teams are used to execute an application, it is important to have dynamic load balancing across thread teams, since static load balancing easily leads to load imbalance across teams and thus degrades performance. In this paper, we first motivate our work by clarifying the benefit of using multiple thread teams to execute an irregular workload on a many-core processor. Then, we demonstrate that dynamic load balancing across those thread teams has the potential to significantly improve the performance of irregular workloads on a many-core processor, even when the scheduling overhead is taken into account. Although such a dynamic load balancing mechanism is not provided by the current OpenMP specification, the benefits of dynamic load balancing across thread teams are discussed through experiments using the Intel Xeon Phi coprocessor. We evaluate the performance gain of dynamic load balancing across thread teams using a ray tracing code. The results show that such a dynamic load balancing mechanism can improve the performance by up to 14% compared to static load balancing across teams, even when the scheduling overhead is taken into account.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126976703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preface: Special Issue on the Fourth International Symposium on Computing and Networking","authors":"K. Nakano","doi":"10.15803/IJNC.7.2_105","DOIUrl":"https://doi.org/10.15803/IJNC.7.2_105","url":null,"abstract":"The Fourth International Symposium on Computing and Networking (CANDAR 2016) was held in Hiroshima, Japan, from November 22nd to 25th, 2016. The organizers of CANDAR 2016 invited authors to submit extended versions of the presented papers. As a result, 28 articles were submitted to this special issue. This issue includes the extended versions of the 17 papers that were accepted. This issue owes a great deal to a number of people who devoted their time and expertise to handling the submitted papers. In particular, I would like to thank the guest editors for the excellent review process: Professor Ryusuke Egawa, Professor Akihiro Fujiwara, Professor Jose Gracia, Professor Katsunobu Imai, Professor Yasuaki Ito, Professor Yoshiaki Kakuda, Professor Michihiro Koibuchi, Professor Susumu Matsumae, Professor Toru Nakanishi, Professor Yasuyuki Nogami, Professor Satoshi Ohzahata, and Professor Tomoaki Tsumura. Words of gratitude are also due to the anonymous reviewers who carefully read the papers and provided detailed comments and suggestions to improve the quality of the submitted papers. This special issue would not have been possible without their efforts.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126048910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finite Computational Structures and Implementations: Semigroups and Morphic Relations","authors":"A. Egri-Nagy","doi":"10.15803/IJNC.7.2_318","DOIUrl":"https://doi.org/10.15803/IJNC.7.2_318","url":null,"abstract":"What is computable with limited resources? How can we verify the correctness of computations? How to measure computational power with precision? Despite the immense scientific and engineering progress in computing, we still have only partial answers to these questions. To make these problems more precise and easier to tackle, we describe an abstract algebraic definition of classical computation by generalizing traditional models to semigroups. This way implementations are morphic relations between semigroups. The mathematical abstraction also allows the investigation of different computing paradigms (e.g. cellular automata, reversible computing) in the same framework. While semigroup theory helps in clarifying foundational issues about computation, at the same time it has several open problems that require extensive computational efforts. This mutually beneficial relationship is the central tenet of the described research.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129851609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Komatsu, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi
{"title":"A Directive Generation Approach to High Code-Maintainability for Various HPC Systems","authors":"K. Komatsu, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi","doi":"10.15803/IJNC.7.2_405","DOIUrl":"https://doi.org/10.15803/IJNC.7.2_405","url":null,"abstract":"The emergence of various high-performance computing (HPC) systems compels users to write code that considers the characteristics of each HPC system. To describe the system-dependent information without drastic code modifications, directive sets such as the OpenMP directive set and the OpenACC directive set have proved useful. However, the code becomes complex when it must achieve high performance on various HPC systems, because different directive sets are required for different HPC systems. Thus, code maintainability and readability are degraded. This paper proposes a directive generation approach that generates various kinds of directive sets using user-defined rules. Instead of using several kinds of directive sets, users only have to write special placeholders that are utilized to specify a unique code pattern where several directives are inserted. Then, the special placeholders trigger the generation of appropriate directives for each system using a user-defined rule with the code transformation framework Xevolver. Because only special placeholders are inserted in the code, the proposed approach can preserve code maintainability and readability. The performance evaluations of directive-based implementations on various HPC systems show that the best implementation differs among the HPC systems. Then, a demonstration of transformation into multiple kinds of implementations shows that the proposed approach can successfully generate directives from a smaller number of special placeholders. Therefore, it is clarified that the proposed directive generation approach is effective in keeping the maintainability of a code to be executed on various HPC systems.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124041828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hiroki Tokura, Takumi Honda, Yasuaki Ito, K. Nakano, Mitsuya Nishino, Yushiro Hirota, M. Saeki
{"title":"An Efficient GPU Implementation of Bulk Computation of the Eigenvalue Problem for Many Small Real Non-symmetric Matrices","authors":"Hiroki Tokura, Takumi Honda, Yasuaki Ito, K. Nakano, Mitsuya Nishino, Yushiro Hirota, M. Saeki","doi":"10.15803/IJNC.7.2_227","DOIUrl":"https://doi.org/10.15803/IJNC.7.2_227","url":null,"abstract":"The main contribution of this paper is to present an efficient GPU implementation of bulk computation of eigenvalues for many small, non-symmetric, real matrices. This work is motivated by the necessity of such bulk computation in the design of control systems, which requires computing the eigenvalues of hundreds of thousands of non-symmetric real matrices of size up to 30x30. Several efforts have been devoted to accelerating the eigenvalue computation, including computer languages, systems, and environments that support matrix manipulation and offer specific libraries/function calls. Some of them are optimized for computing the eigenvalues of a very large matrix by parallel processing. However, such libraries/function calls are not aimed at accelerating the eigenvalue computation for a large number of small matrices. In our GPU implementation, we considered programming issues of the GPU architecture including warp divergence, coalesced access of the global memory, utilization of the shared memory, and so forth. In particular, we present two types of assignments of GPU threads to matrices and introduce three memory arrangements in the global memory. Furthermore, to hide CPU-GPU data transfer latency, overlapping computation on the GPU with the transfer is employed. Experimental results on NVIDIA TITAN X show that our GPU implementation attains a speed-up factor of up to 83.50 and 17.67 over the sequential CPU implementation and the parallel CPU implementation with eight threads on Intel Core i7-6700K, respectively.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134377464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Self-optimizing Routing Algorithm using Local Information in a 3-dimensional Virtual Grid Network with Theoretical and Practical Analysis","authors":"Yonghwan Kim, Y. Katayama","doi":"10.15803/IJNC.7.2_349","DOIUrl":"https://doi.org/10.15803/IJNC.7.2_349","url":null,"abstract":"In this paper, we present a self-optimizing routing algorithm using only local information, in a three-dimensional (3D) virtual grid network. A virtual grid network is a network model well known for easing algorithm design and reducing energy consumption. We consider a 3D virtual grid network which is obtained by virtually dividing a network into a set of unit cubes called cells. One specific node named a router is designated in each cell, and each router is connected with the routers in adjacent cells. This implies that each router can communicate with 6 routers. We consider the maintenance of an inter-cell communication path from a source node to a destination node and propose a distributed self-optimizing routing algorithm which transforms an arbitrary given path into an optimal (shortest) one from the source node to the destination node. Our algorithm is executed at each router and uses only local information (6 hops: 3 hops each backward and forward along the given path). Our algorithm can work in asynchronous networks without any global coordination among routers. We show that, when it works in synchronous networks, our algorithm transforms any arbitrary path into a shortest path in O(|P|) synchronous rounds, where |P| is the length of the initial path. Moreover, our experiments show that our algorithm converges in about |P|/2 synchronous rounds and the ratio becomes lower as |P| becomes larger.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131925203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-accelerated Exhaustive Verification of the Collatz Conjecture","authors":"Takumi Honda, Yasuaki Ito, K. Nakano","doi":"10.15803/IJNC.7.1_69","DOIUrl":"https://doi.org/10.15803/IJNC.7.1_69","url":null,"abstract":"The main contribution of this paper is to present an implementation that performs the exhaustive search to verify the Collatz conjecture using a GPU. Consider the following operation on an arbitrary positive number: if the number is even, divide it by two, and if the number is odd, triple it and add one. The Collatz conjecture asserts that, starting from any positive number m, repeated iteration of the operation eventually produces the value 1. We have implemented it on NVIDIA GeForce GTX TITAN X and evaluated the performance. The experimental results show that our GPU implementation can verify 1.31x10^12 64-bit numbers per second, while the sequential CPU implementation on Intel Core i7-4790 can verify 5.25x10^9 64-bit numbers per second. Thus, our implementation on the GPU attains a speed-up factor of 249 over the sequential CPU implementation. Additionally, we used the GPU to accelerate the computation of the delay, that is, the number of the above operations performed until a number reaches 1, which is one of the quantities of mathematical interest for the Collatz conjecture. Using a similar idea, we achieved a speed-up factor of 73.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128276415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Countermeasure to Eavesdropping on Data Packets by Utilizing Control Packet Overhearing for Radio overlapping Reduced Multipath Routing in Ad Hoc Networks","authors":"T. Murakami, Eitaro Kohno, Y. Kakuda","doi":"10.15803/IJNC.6.2_345","DOIUrl":"https://doi.org/10.15803/IJNC.6.2_345","url":null,"abstract":"Ad hoc networks are autonomously distributed wireless networks which consist of wireless terminals (hereinafter referred to as nodes). They do not rely on wireless network infrastructures such as base stations. Relaying nodes and their surrounding nodes are susceptible to data theft and eavesdropping because nodes communicate via radio waves. Previously, we proposed the secure dispersed data transfer method for encryption, decryption, and transfer of the original data packets. To use the secure dispersed data transfer method securely, we proposed using the node-disjoint multipath routing method. In this method, multiple versions of encrypted data packets are transferred along each disjoint multipath to counter data packet theft. We also proposed an enhanced version of the aforementioned routing method to reduce radio area overlap by using rebroadcasting of control packets to counter eavesdropping attacks. In this paper, we propose a multipath routing method to reduce radio area overlap through the introduction of control packet overhearing. We introduce control packet overhearing mechanisms to eliminate excess control packet counts and latency in the pathfinding process. Our main contributions are as follows: (1) our proposed method can reduce radio area overlap without each node's geographical location information (e.g., using GPS information); (2) our proposed method can also eliminate excess control packets and latency without degradation of the security. Furthermore, we conducted simulation experiments to evaluate our proposed method. We observed that our proposed method can construct the desired paths with a smaller number of control packets and a shorter latency in the pathfinding process. We also conducted additional experiments to discuss the applicable scope of our proposed method. As a result, we confirmed that our proposed method was more effective as the average number of adjacent nodes increased.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131419038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of User Information Disclosure Decision Method for Disaster Information Sharing System","authors":"Keita Kayaba, Akiko Takahashi","doi":"10.15803/IJNC.6.2_328","DOIUrl":"https://doi.org/10.15803/IJNC.6.2_328","url":null,"abstract":"During natural disasters, a significant amount of information is shared over the Internet. Therefore, it is desirable to provide disaster information based on information about individual users. However, there is a trade-off between the protection of user information and the quality of services that should be considered when providing disaster information. We propose a method that rationally determines the extent of user information to be disclosed. The effectiveness of the proposed method was evaluated experimentally. The experiments were conducted using the proposed method and a simple determination method wherein both the utility and intention of the user were considered relative to the extent of user information disclosure. In addition, the extent to which the trade-off was considered for each user was evaluated quantitatively.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132768303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Implementation of FDFM Approach for Euclidean Algorithms on the FPGA","authors":"Xin Zhou, K. Nakano, Yasuaki Ito","doi":"10.15803/IJNC.6.2_420","DOIUrl":"https://doi.org/10.15803/IJNC.6.2_420","url":null,"abstract":"The FDFM (Few DSP slices and Few block Memories) approach is an efficient approach which implements a processor core executing a particular algorithm using few DSP slices and few block RAMs in a single FPGA. Since a processor core based on the FDFM approach uses few hardware resources, hundreds of processor cores working in parallel can be implemented in an FPGA. The main contribution of this paper is to develop a processor core that executes the Euclidean algorithm, computing the GCD (Greatest Common Divisor) of two large numbers, in an FPGA. This processor core, which we call the GCD processor core, uses only one DSP slice and one block RAM, and 1280 GCD processor cores can be implemented in a Xilinx Virtex-7 family FPGA XC7VX485T-2. The experimental results show that the performance of this FPGA implementation using 1280 GCD processor cores is 0.0904us per GCD computation for two 1024-bit integers. Quite surprisingly, it is 3.8 times faster than the best GPU implementation and 316 times faster than a sequential implementation on the Intel Xeon CPU.","PeriodicalId":270166,"journal":{"name":"Int. J. Netw. Comput.","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129300326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}