{"title":"Matrix Multiply-Add in Min-plus Algebra on a Short-Vector SIMD Processor of Cell/B.E.","authors":"Kazuya Matsumoto, S. Sedukhin","doi":"10.1109/IC-NC.2010.29","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.29","url":null,"abstract":"It is well known that the all-pairs shortest paths problem has algorithmic characteristics similar to those of the classical matrix-matrix multiply-add (MMA) problem. One of the differences between the two problems is in the underlying algebra: the matrix multiply-add uses linear (+, x)-algebra whereas the all-pairs shortest paths problem uses (min, +)-algebra. This paper presents an implementation of a 64×64 matrix multiply-add kernel in (min, +)-algebra on a short-vector SIMD processor, the so-called Synergistic Processing Element (SPE), of the Cell Broadband Engine (Cell/B.E.). Our implementation for the shortest paths problem adopts an existing fast algorithm of matrix multiply-add with a reduction of the number of required registers. The MMA implementation in (min, +)-algebra achieves a speed of 8.502 Gflop/s, which is about one third of that of the (+, x)-algebra MMA and is very close to the theoretical estimate based on the required number of instructions.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124097264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation Framework for GPU Performance Based on OpenCL Standard","authors":"Martin Jurecko, J. Kocisová, J. Buša, T. Kasanický, M. Domiter, M. Zvada","doi":"10.1109/IC-NC.2010.32","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.32","url":null,"abstract":"There are many projects focused on performance measurements of GPUs but there is no unifying test framework that could be used for evaluating generic floating point intensive applications. This work describes the testing suite for evaluating GPUs that measures raw performance and numerical precision of a subset of OpenCL operations, and analyzes results obtained from several commonly available high-end GPUs.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123857511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proposition of Criteria for Aborting Transaction Based on Log Data Size in LogTM","authors":"Hiroki Asai, Tomoaki Tsumura, H. Matsuo","doi":"10.1109/IC-NC.2010.51","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.51","url":null,"abstract":"Lock-based synchronization techniques are commonly used in parallel programming on multi-core processors. However, locks can cause deadlocks and poor scalability. Hence, LogTM has been proposed and studied for lock-free synchronization. LogTM is a kind of hardware transactional memory. In LogTM, transactions are executed speculatively to ensure serializability and atomicity. LogTM stores original values in a log before they are modified by a transaction. If a transaction accesses a shared datum which has been accessed by another transaction running in parallel, LogTM detects this as a conflict, restores all data from the associated log, and restarts the transaction. This is called aborting. On abort, the cost of restoring data from a log increases in proportion to the size of the data in the log. However, LogTM selects which transaction should be aborted by their initiation times. Hence, if conflicts occur frequently, this may degrade performance. This paper proposes a criterion for selecting which transaction should be aborted that takes into account the data size in each log. In addition, another criterion which takes into account the degree of conflict is also proposed. The results of experiments with SPLASH-2 benchmark suite programs show that the proposed methods improve performance by up to 2.7%.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116255415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CODIE: Continuation-Based Overlapping Data-Transfers with Instruction Execution","authors":"T. Miyoshi, Kenji Kise, H. Irie, T. Yoshinaga","doi":"10.1109/IC-NC.2010.26","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.26","url":null,"abstract":"In this paper, a runtime system termed CODIE is proposed to execute the sequential parts of programs efficiently in a many-core architecture. All independent processing elements in a many-core architecture use a shared network and off-chip memory. Therefore, contention for such resources substantially degrades system performance. On the CODIE system, when a cache miss occurs, the system first initiates a data transfer operation. Next, the system creates a continuation for executing the instructions related to the missing data. The continuation is stored in a buffer, and the instructions not related to the missing data are executed subsequently. In other words, data transfer and instruction execution can be performed simultaneously. In this way, the overhead of updating a cache entry (increased by memory access contention) is tolerated. The results of the evaluation show that the proposed CODIE system realizes a 1.86x speedup of the execution of the sequential write/read program on the M-Core architecture at 36 cores and a 1.97x speedup of the execution of blackscholes (from the PARSEC benchmark suite) on the Cell/BE processor with 6 SPEs.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122034265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Traffic Analysis Using Cardinalities and Header Information","authors":"Y. Shomura, K. Yoshida, Akira Sato, Satoshi Matsumoto, K. Itano","doi":"10.1109/IC-NC.2010.36","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.36","url":null,"abstract":"Recently, the variety and vastness of computer networks have increased rapidly. To keep networks stable and reliable, network administrators have to understand the nature of network traffic flows. We have developed a cardinality-analysis method that analyzes cardinalities in TCP/IP headers. The cardinalities can be used to detect abnormal traffic such as DDoS attacks and Internet worms. However there is much unclassified traffic remaining. In this paper, we propose further analysis that consists of two parts: 1) select service port numbers and 2) analyze the volume of inflow and outflow for each service along with packet sizes. The method proposed can analyze the behavior of hosts and services in detail. We applied the proposed analysis to the traffic captured at the University of Tsukuba’s campus network and demonstrated the ability of classifying services into four groups: download type, upload type, both way type, and control or real time communication type, which normally can’t be classified by cardinality analysis.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121844093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE802.11b/g Standard: Theoretical Maximum Throughput","authors":"J. Bordim, A. V. Barbosa, Marcos F. Caetano, P. S. Barreto","doi":"10.1109/IC-NC.2010.40","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.40","url":null,"abstract":"Estimating the throughput of a WiFi connection can be quite complex, even when considering simplified scenarios. Indeed, the varying number of parameters specified in the standards makes it hard to understand their impact in terms of delay and throughput. The main contribution of this work is to present a simple scheme to compute the exact maximum throughput for an IEEE 802.11g network. The proposed scheme incorporates all the timings and settings, which allows one to calculate the throughput for different channel spacing and modulation techniques specified in the standard. Numerical and experimental results showing the accuracy of the proposed scheme are also presented.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"274 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127549366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Wireless TCP Issue in Cognitive Radio Networks","authors":"Yunlei Cheng, E. Wu, Gen-Huey Chen","doi":"10.1109/IC-NC.2010.37","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.37","url":null,"abstract":"In recent years, research has focused on the decay of TCP throughput over wireless links, and many wireless TCP solutions have been proposed to deal with this issue. On the other hand, following improvements in hardware technology, new network structures and mechanisms have been proposed to enhance wireless communications, for example, Cognitive Radio (CR) networks. However, this new network architecture causes a new problem, which is not solved by existing wireless TCPs. In this paper, we identify a new issue that impacts TCP performance over CR networks, which we call Bandwidth Variation. A cross-layer solution is proposed to deal with this new issue. Both numerical and simulation results are presented to demonstrate the effectiveness of the proposed solution.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128291062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Hessenberg Reduction for Nonsymmetric Eigenvalue Problems Using GPU","authors":"Jun-ichi Muramatsu, Shaoliang Zhang, Yusaku Yamamoto","doi":"10.1109/IC-NC.2010.52","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.52","url":null,"abstract":"Solution of large-scale dense nonsymmetric eigenvalue problems is required in many areas of scientific and engineering computing, such as vibration analysis of automobiles and analysis of electronic diffraction patterns. In this study, we focus on the Hessenberg reduction step and consider accelerating it using a GPU. Our main strategy is to use CUBLAS, an optimized BLAS library for GPUs. However, since Hessenberg reduction requires operations not supported by CUBLAS, we combine the CPU and GPU to perform the computation. We propose two approaches for combining the CPU and GPU: one that performs as much work as possible on the GPU and one that aggressively assigns computation on small matrices to the CPU. Experimental results show that the latter approach is considerably faster than the former. Compared with computation on a Core i7 processor with 4 cores, the latter approach with a Tesla C1060 GPU and the Core i7 processor achieves a 2.8 times speedup when computing the Hessenberg form of a 4,800 $\\times$ 4,800 real matrix.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"58 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132836842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementations of Parallel Computation of Euclidean Distance Map in Multicore Processors and GPUs","authors":"Duhu Man, K. Uda, Hironobu Ueyama, Yasuaki Ito, K. Nakano","doi":"10.1109/IC-NC.2010.55","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.55","url":null,"abstract":"Given a 2-D binary image of size $n \\times n$, the Euclidean Distance Map (EDM) is a 2-D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. It is known that a sequential algorithm can compute the EDM in $O(n^2)$ time and thus this algorithm is optimal. Also, work-time optimal parallel algorithms for the shared memory model have been presented. However, these algorithms are too complicated to implement on existing shared memory parallel machines. The main contribution of this paper is to develop a simple parallel algorithm for the EDM and implement it on two parallel platforms: multicore processors and a Graphics Processing Unit (GPU). More specifically, we have implemented our parallel algorithm on a Linux server with four Intel hexa-core processors (Intel Xeon X7460 2.66GHz). We have also implemented it on a modern GPU system, Tesla C1060. The experimental results have shown that, for an input binary image of size $10000 \\times 10000$, our implementation on the multi-core system achieves a speedup factor of 18 over the performance of a sequential algorithm using a single processor in the same system. Meanwhile, for the same input binary image, our implementation on the GPU achieves a speedup factor of 5 over the sequential algorithm implementation.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"36 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124989381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Canny Edge Detection Using a GPU","authors":"Kohei Ogawa, Yasuaki Ito, K. Nakano","doi":"10.1109/IC-NC.2010.13","DOIUrl":"https://doi.org/10.1109/IC-NC.2010.13","url":null,"abstract":"Recent GPUs, which have many processing units connected to a global memory, can be used for general-purpose parallel computation. Users can develop parallel programs running on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture). The main contribution of this paper is to implement a Canny edge detection algorithm on CUDA. The experimental results show that our implementation of the Canny edge detection algorithm on CUDA achieves a speedup factor of 61 over a conventional software implementation.","PeriodicalId":375145,"journal":{"name":"2010 First International Conference on Networking and Computing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123882676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}