{"title":"A Cost and Performance Analytical Model for Large-Scale On-Chip Interconnection Networks","authors":"Takanori Kurihara, Yamin Li","doi":"10.1109/CANDAR.2016.0083","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0083","url":null,"abstract":"As an interconnection topology, the two-dimensional mesh is widely used in network-on-chip (NoC) designs that integrate dozens of cores on a VLSI chip because of its very simple structure and ease of on-chip implementation. However, with the progress of IC technology, it has become possible to integrate a large-scale system containing more than one thousand processing elements or cores on a chip. In such a case, the mesh topology degrades performance because communication time among cores increases. This paper investigates topologies and IC layout schemes of mesh, torus, hypercube, and metacube for achieving good cost-performance tradeoffs. We propose an analytical model for evaluating the cost-performance ratio by considering the NoC's topology and layout. The model is parameterized with node degree, graph diameter, the number of routers, the router complexity, the bandwidth of the connection for the router, the number of processing cores, the total length of links, and the cost ratios of the link section and the router section. This model helps to find the optimal topology and layout for an NoC with a given network size. We found that when the network size is small, mesh has better cost-performance than the others; as the network size increases, torus and hypercube outperform mesh; and metacube has the best cost-performance among them.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128614829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Semantic Dataflow Logger Connecting Java Objects and Database Rows and Columns","authors":"Toshio Ito, Y. Kaneko","doi":"10.1109/CANDAR.2016.0027","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0027","url":null,"abstract":"As computer systems become more complicated, monitoring dataflows in a system becomes important for maintaining its performance. However, because conventional methods of dataflow monitoring are either too fine-grained or too coarse-grained, it is difficult to analyze application-specific performance metrics. In this paper, we propose a dataflow logger with suitable granularity for performance analysis. Our logger is implemented as a Java library, which tracks two types of dataflows: dataflows between objects inside a Java program, and dataflows between a Java object and a row and column in a relational database. That way, our logger can produce dataflow logs with rich semantics about the application's data model. We conduct an experiment with an example system and demonstrate that we can obtain dataflow logs useful for performance analysis. We also conduct a detailed overhead analysis of our logger. Although our logger slows down the example system by a factor of 13, we identify the major sources of the overhead and discuss possible solutions.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124617449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Polling-Based P2P File Sharing with High Success Rate and Low Communication Cost","authors":"Kouhei Ootani, S. Fujita","doi":"10.1109/CANDAR.2016.0060","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0060","url":null,"abstract":"This paper proposes a polling-based consistency maintenance scheme for the Peer-to-Peer (P2P) file sharing of editable contents. The proposed scheme achieves a high success rate of the acquisition of the latest copy of shared files with low communication cost. In the following we first show that when several peers acquire a copy of shared files from the same replica peer, the minimum success rate is achieved by the peer with the maximum query rate regardless of the polling and the update intervals. We then design a distributed algorithm to maintain the correspondence between client and replica peers to minimize the average polling rate while keeping the average success rate to a designated value. The performance of the proposed algorithm is evaluated by simulation.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124630877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topology-Aware Data Aggregation for High Performance Collective MPI-IO on a Multi-core Cluster System","authors":"Y. Tsujita, A. Hori, Toyohisa Kameyama, Y. Ishikawa","doi":"10.1109/CANDAR.2016.0022","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0022","url":null,"abstract":"Parallel I/O such as MPI-IO is one of the performance improvement solutions in parallel computing using MPI. ROMIO is a widely used MPI-IO implementation that aims to improve collective I/O performance through an optimization named two-phase I/O. The file I/O task is given to a subset of, or all of, the MPI processes, which are called aggregators. Multiple CPUs or CPU cores offer a chance to increase computing power by deploying multiple MPI processes per compute node, but such deployment leads to poor I/O performance due to ROMIO's topology-unaware aggregator layout. In our previous work, an optimized aggregator layout suited to striping accesses on a Lustre file system improved I/O performance; however, its unbalanced communication load, caused by unawareness of the MPI rank layout among compute nodes, led to ineffective data aggregation. To minimize data aggregation time for further I/O performance improvements, we introduce a topology-aware data aggregation scheme that takes the MPI rank layout across compute nodes into account. The proposal arranges the data collection sequence of the aggregators in order to mitigate network contention. The optimization achieved up to 67% improvement in I/O performance compared with the original ROMIO in HPIO benchmark runs using 768 processes on 64 compute nodes of the TSUBAME2.5 supercomputer at the Tokyo Institute of Technology. Even when the number of aggregators was half or 1/3 of the total number of processes, the optimization still kept I/O performance comparable to the maximum performance.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124013540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Last Path Caching: A Simple Way to Remove Redundant Memory Accesses of Path ORAM","authors":"Naoki Fujieda, Ryoichi Yamauchi, S. Ichikawa","doi":"10.1109/CANDAR.2016.0068","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0068","url":null,"abstract":"Oblivious RAM (ORAM) is a technique to hide the access pattern of data to untrusted memory along with their contents. Path ORAM is a recent lightweight ORAM protocol, whose derived access pattern involves some redundancy that can be removed without the loss of security. In this paper, we introduce last path caching, which removes the redundancy of Path ORAM with a simpler protocol than an existing scheme. By combining two caching strategies, our technique showed only 0.2% performance loss from the existing one, while keeping the determinacy of the derived access pattern.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127041947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Firing Squad Synchronization Problem on Higher-Dimensional CA with Multiple Updating Cycles","authors":"L. Manzoni, A. Porreca, H. Umeo","doi":"10.1109/CANDAR.2016.0053","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0053","url":null,"abstract":"Traditional cellular automata (CA) assume the presence of a single global clock regulating the update of all their cells. When this assumption is dropped, cells can update with different speeds, thus increasing the difficulty of solving synchronization problems. Here we solve the traditional and the generalized Firing Squad Synchronization Problem in dimension two and higher on multiple updating cycle CA.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129398428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Link Switching Method Based on Destination IP Address for Power Savings","authors":"Masato Nishiguchi, S. Kimura","doi":"10.1109/CANDAR.2016.0067","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0067","url":null,"abstract":"As the number of Internet users increases, network devices are required to achieve power savings. For this purpose, the authors proposed a Gigabit Ethernet link rate switching method based on the destination IP address for typical networks in small offices or at home. However, this method has a problem in that the communication is interrupted for a few seconds when the link rate is switched. To solve the problem, this paper proposes a communication link switching method. In this method, a client is assumed to connect via multiple network interfaces such as Gigabit Ethernet and a wireless LAN to the user's subnet. When a user starts communicating, the method selects one of the interfaces based on the destination IP address. The communication experiments demonstrate that the proposed method has improved power consumption and avoided any communication interruption time compared to our previous method.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127915479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPRtree: A Tree-Based Checkpointing Architecture for Heterogeneous FPGA Computing","authors":"H. Vu, S. Kajkamhaeng, Shinya Takamaeda-Yamazaki, Y. Nakashima","doi":"10.1109/CANDAR.2016.0024","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0024","url":null,"abstract":"FPGAs provide reconfigurability and high performance for parallel applications. Modern FPGAs can be integrated into computing systems as accelerators so that they can work with the host CPU to execute offloaded applications. This integration puts more pressure on the fault tolerance of computing systems, and the question of how to improve dependability becomes crucial. As in CPU-based systems, checkpoint/restart techniques are expected to be developed and applied to FPGA-based computing systems. Two issues arise in this situation: how to checkpoint and restart an FPGA, and how this checkpoint/restart model works with the checkpoint/restart model of the whole computing system. In this paper, first we propose a new checkpoint/restart architecture along with a checkpointing mechanism on FPGA. Second, we propose \"fine-grain\" management for checkpointing to reduce performance degradation. Third, we propose a technique to capture consistent snapshots of the FPGA and the rest of the computing system. For host software, we also provide the CPRtree stack, including API functions to manage checkpoint/restart procedures on FPGA. Our experimental results show that the checkpointing architecture causes up to 9.73% maximum clock frequency degradation, small breakdown, and small data footprint, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131421325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Service Identification by Packet Inspection Based on N-grams in Multiple Connections","authors":"Masaki Hara, Shinnosuke Nirasawa, A. Nakao, M. Oguchi, Shu Yamamoto, Saneyasu Yamaguchi","doi":"10.1109/CANDAR.2016.0123","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0123","url":null,"abstract":"Identifying the service of traffic from given IP network flows is essential for various purposes, such as QoS management and avoiding security issues. Typical methods identify the service based on IP addresses and port numbers. However, the accuracy achieved by these methods is not sufficient, so improvements are required. Deep Packet Inspection (DPI) is one of the most effective methods for improving identification accuracy. In this paper, we explore a method for identifying the service of a flow. We propose an identification method based on DPI that covers multiple connections in a service. We then present a performance evaluation and demonstrate that our method can suitably identify the service from given network flows.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131867667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Importance of Dynamic Load Balancing among OpenMP Thread Teams for Irregular Workloads","authors":"Xiong Xiao, S. Hirasawa, H. Takizawa, Hiroaki Kobayashi","doi":"10.1109/CANDAR.2016.0097","DOIUrl":"https://doi.org/10.1109/CANDAR.2016.0097","url":null,"abstract":"Recently, massively parallel many-core processors such as Intel Xeon Phi coprocessors have attracted researchers' attention because various applications are significantly accelerated by those processors. In the field of high-performance computing, OpenMP is a standard programming model commonly used to parallelize a kernel loop for many-core processors. For hierarchical parallel processing, OpenMP version 4.0 or later allows programmers to group threads into multiple thread teams. In this paper, we first show the performance gain of using multiple thread teams even for one many-core processor. Then, we demonstrate that dynamic load balancing among those thread teams has the potential to significantly improve the performance of irregular workloads on a many-core processor. Although the current OpenMP specification does not offer such a dynamic load balancing mechanism, we discuss possible benefits of dynamic load balancing among thread teams through experiments using the Intel Xeon Phi coprocessor.","PeriodicalId":322499,"journal":{"name":"2016 Fourth International Symposium on Computing and Networking (CANDAR)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123500144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}