{"title":"Video-Like Compression for High Efficiency Database Storage of Wireless Sensor Networks","authors":"Niang-Ying Huang, Chung-Yuan Su, Chi-Cheng Chuang, R.-I. Chang","doi":"10.1109/ICPP.2011.9","DOIUrl":"https://doi.org/10.1109/ICPP.2011.9","url":null,"abstract":"Wireless Sensor Networks (WSNs) consist of group sensor nodes which are placed in an area to monitor the changes of environment. Usually, sensing data are gathered and stored in a data server which maintains a database to organize and manage numerous of WSNs data. It allows researchers to retrieve these data for further study or analysis. Since the size of WSNs data is huge and the storage resource is limited, this database needs compression to lower the data size. In this paper, we propose a video-like compression method for high efficiency database storage of WSNs. First, the raw data are arranged according to the spatial correlation as an image frame. Then, several image frames with temporal correlation are maintained as a sequence of frames and lossless video compression is adopt for lowering the data size. Based on this idea, we also propose a data retrieve/query algorithm for parallel processing. The trade-off between space saving and query time is discussed after experiencing with real-world data. At last, we compare our proposed method to MySQL, a well-known database which compression is supported. The experimental results reveal that our method achieves over 96% of the space savings. It is over 13% more than that achieved by MySQL.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121781588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Virtual Machine Provisioning Based on Analytical Performance and QoS in Cloud Computing Environments","authors":"R. Calheiros, R. Ranjan, R. Buyya","doi":"10.1109/ICPP.2011.17","DOIUrl":"https://doi.org/10.1109/ICPP.2011.17","url":null,"abstract":"Cloud computing is the latest computing paradigm that delivers IT resources as services in which users are free from the burden of worrying about the low-level implementation or system administration details. However, there are significant problems that exist with regard to efficient provisioning and delivery of applications using Cloud-based IT resources. These barriers concern various levels such as workload modeling, virtualization, performance modeling, deployment, and monitoring of applications on virtualized IT resources. If these problems can be solved, then applications can operate more efficiently, with reduced financial and environmental costs, reduced under-utilization of resources, and better performance at times of peak load. In this paper, we present a provisioning technique that automatically adapts to workload changes related to applications for facilitating the adaptive management of system and offering end-users guaranteed Quality of Services (QoS) in large, autonomous, and highly dynamic environments. We model the behavior and performance of applications and Cloud-based IT resources to adaptively serve end-user requests. To improve the efficiency of the system, we use analytical performance (queueing network system model) and workload information to supply intelligent input about system requirements to an application provisioner with limited information about the physical infrastructure. Our simulation-based experimental results using production workload models indicate that the proposed provisioning technique detects changes in workload intensity (arrival pattern, resource demands) that occur over time and allocates multiple virtualized IT resources accordingly to achieve application QoS targets.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127709147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart","authors":"Xiangyong Ouyang, R. Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, D. Panda","doi":"10.1109/ICPP.2011.85","DOIUrl":"https://doi.org/10.1109/ICPP.2011.85","url":null,"abstract":"Checkpoint/Restart (C/R) mechanisms have been widely adopted by many MPI libraries [1 -- 3] to achieve fault-tolerance. However, a major limitation of such mechanisms is the intensive IO bottleneck caused by the need to dump the snapshots of all processes into persistent storage. Several studies have been conducted to minimize this overhead [4, 5], but most of these proposed optimizations are performed inside specific MPI stack or check pointing library or applications, hence they are not portable enough to be applied to other MPI stacks and applications. In this paper, we propose a filesystem based approach to alleviate this checkpoint IO bottleneck. We propose a new filesystem, named Checkpoint-Restart File system (CRFS), which is a lightweight user-level filesystem based on FUSE (File system in User space). CRFS is designed with Checkpoint/Restart I/O traffic in mind to efficiently handle the concurrent write requests. Any software component using standard filesystem interfaces can transparently benefit from CRFS's capabilities. CRFS intercepts the checkpoint file write system calls and aggregates them into fewer bigger chunks which are asynchronously written to the underlying filesystem for more efficient IO. CRFS manages a ?exible internal IO thread pool to throttle concurrent IO to alleviate IO contention for better IO performance. CRFS can be mounted over any standard filesystem like ext3, NFS and Lustre. We have implemented CRFS and evaluated its performance using three popular C/R capable MPI stacks: MVAPICH2, MPICH2 and OpenMPI. Experimental results show significant performance gains for all three MPI stacks. CRFS achieves up to 5.5X speedup in checkpoint writing performance to Lustre filesystem. Similar level of improvements are also obtained with ext3 and NFS filesystems. To the best of our knowledge, this is the first such portable and light-weight filesystem designed for generic Checkpoint/Restart data.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131872227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Data Allocation for Scratch-Pad Memory on Embedded Multi-core Systems","authors":"Yibo Guo, Qingfeng Zhuge, J. Hu, Meikang Qiu, E. Sha","doi":"10.1109/ICPP.2011.79","DOIUrl":"https://doi.org/10.1109/ICPP.2011.79","url":null,"abstract":"Multi-core systems have been a popular design for high-performance embedded systems. Scratch Pad Memory (SPM), a software-controlled on-chip memory, has been widely adopted in many embedded systems due to its small area and low energy consumption. Existing data allocation algorithms either cannot achieve optimal results or take exponential time to complete. In this paper, we propose one polynomial-time algorithms to solve the data allocation problem on multi-core system with exclusive data copy. According to the experimental results, the proposed optimal data allocation method alone reduces time cost of memory accesses by 16.45% on average compared with greedy algorithm. The proposed data allocation algorithm also can reduce the energy cost significantly.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117146167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Tridiagonal Solver for GPUs","authors":"Heehoon Kim, Shengzhao Wu, Li-Wen Chang, Wen-mei W. Hwu","doi":"10.1109/ICPP.2011.41","DOIUrl":"https://doi.org/10.1109/ICPP.2011.41","url":null,"abstract":"We present the design and evaluation of a scalable tridiagonal solver targeted for GPU architectures. We observed that two distinct steps are required to solve a large tridiagonal system in parallel: 1) breaking down a problem into multiple sub problems each of which is independent of other, and 2) solving the sub problems using an efficient algorithm. We propose a hybrid method of tiled parallel cyclic reduction(tiled PCR) and thread-level parallel Thomas algorithm(p-Thomas). Algorithm transition from tiled PCR to p-Thomas is determined by input system size and hardware capability in order to achieve optimal performance. The proposed method is scalable as it can cope with various input system sizes by properly adjusting algorithmtrasition point. Our method on a NVidia GTX480 shows up to 8.3x and 49x speedups over multithreaded and sequential MKL implementations on a 3.33GHz Intel i7 975 in double precision, respectively.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"55 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129438464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs","authors":"Teng Ma, G. Bosilca, Aurélien Bouteiller, Brice Goglin, J. Squyres, J. Dongarra","doi":"10.1109/ICPP.2011.29","DOIUrl":"https://doi.org/10.1109/ICPP.2011.29","url":null,"abstract":"Shared memory is among the most common approaches to implementing message passing within multicorenodes. However, current shared memory techniques donot scale with increasing numbers of cores and expanding memory hierarchies -- most notably when handling large data transfers and collective communication. Neglecting the underlying hardware topology, using copy-in/copy-out memory transfer operations, and overloading the memory subsystem using one-to-many types of operations are some of the most common mistakes in today's shared memory implementations. Unfortunately, they all negatively impact the performance and scalability of MPI libraries -- and therefore applications. In this paper, we present several kernel-assisted intra-node collective communication techniques that address these three issues on many-core systems. We also present a new OpenMPI collective communication component that uses the KNEMLinux module for direct inter-process memory copying. Our Open MPI component implements several novel strategies to decrease the number of intermediate memory copies and improve data locality in order to diminish both cache pollution and memory pressure. Experimental results show that our KNEM-enabled Open MPI collective component can outperform state-of-art MPI libraries (Open MPI and MPICH2) on synthetic benchmarks, resulting in a significant improvement for a typical graph application.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128275489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OCL-BodyScan: A Case Study for Application-centric Programming of Many-Core Processors","authors":"M. Raskovic, A. Varbanescu, W. Vlothuizen, M. Ditzel, H. Sips","doi":"10.1109/ICPP.2011.89","DOIUrl":"https://doi.org/10.1109/ICPP.2011.89","url":null,"abstract":"Application development for many-core processors is predominately hardware-centric: programmers design, implement, and optimize applications for a pre-chosen target platform. While this approach may deliver very good performance, it lacks portability, being inefficient for applications that aim to use multiple architectures or large-scale parallel platforms with heterogeneous many-core nodes. In this work, we focus on application portability. Therefore, we propose an application-centric approach for developing parallel workloads for many-cores, and we make use of OpenCL to preserve portability until the very last optimization stages. We validate our application-centric approach using 3D body scan, a data intensive application with soft real-time constraints. Thus, we design and implement OCL-body scan (the portable OpenCL-based version of 3D Body scan), and we evaluate its performance on three families of platforms - general purpose multi-cores, graphical processing units, and the Cell/B.E.. Our experiments show that our application-centric strategy enables portability and leads to good performance results. Additionally, typical platform-specific optimizations can be applied in the final implementation stages, leading to performance results similar to those obtained using the native tool-chains.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122351313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GSNP: A DNA Single-Nucleotide Polymorphism Detection System with GPU Acceleration","authors":"Mian Lu, Jiuxin Zhao, Qiong Luo, Bingqiang Wang, Shaohua Fu, Zhe Lin","doi":"10.1109/ICPP.2011.51","DOIUrl":"https://doi.org/10.1109/ICPP.2011.51","url":null,"abstract":"We have developed GSNP, a software package with GPU acceleration, for single-nucleotide polymorphism detection on DNA sequences generated from second-generation sequencing equipment. Compared with SOAPsnp, a popular, high-performance CPU-based SNP detection tool, GSNP has several distinguishing features: First, we design a sparse data representation format to reduce memory access as well as branch divergence. Second, we develop a multipass sorting network to efficiently sort a large number of small arrays on the GPU. Third, we compute a table of frequently used scores once to avoid repeated, expensive computation and to reduce random memory access. Fourth, we apply customized compression schemes to the output data to improve the I/O performance. As a result, on a server equipped with an Intel Xeon E5630 2.53 GHZ CPU and an NVIDIA Tesla M2050 GPU, it took GSNP about two hours to analyze a whole human genome dataset whereas the CPU-based, single-threaded SOAPsnp took three days for the same task on the same machine.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134145190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implications of Merging Phases on Scalability of Multi-core Architectures","authors":"M. Manivannan, B. Juurlink, P. Stenström","doi":"10.1109/ICPP.2011.74","DOIUrl":"https://doi.org/10.1109/ICPP.2011.74","url":null,"abstract":"Amdahl's Law dictates that in parallel applications serial sections establish an upper limit on the scalability. Asymmetric chip multiprocessors with a large core in addition to several small cores have been advocated for recently as a promising design paradigm because the large core can accelerate the execution of serial sections and hence mitigate the scalability bottlenecks due to large serial sections. This paper studies the scalability of a set of data mining workloads that have negligible serial sections. The formulation of Amdahl's Law, that optimistically assumes constant serial sections, estimates these workloads to scale to hundreds of cores in a chip multiprocessor (CMP). However the overhead in carrying out merging (or reduction) operations makes scalability to peak at lesser number. We establish this by extending theAmdahl's speedup model to factor in the impact of reduction operations on the speedup of applications on symmetric as well as asymmetric CMP designs. Our analytical model estimates that asymmetric CMPs with one large and many tiny cores are only optimal for applications with a low reduction overhead. However, as the overhead starts to increase, the balance is shifted towards using fewer but more capable cores. This eventually limits the performance advantage of asymmetric over symmetric CMPs.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133431028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory","authors":"A. Negi, J. Gil, M. Acacio, José M. García, P. Stenström","doi":"10.1109/ICPP.2011.63","DOIUrl":"https://doi.org/10.1109/ICPP.2011.63","url":null,"abstract":"Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several designs policies has been discussed at length in prior work within the framework of scalable chip-multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated and performance deviations due to the presence or absence of these optimizations remains unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights, related to the interplay between buffering mechanisms, system policies and workload characteristics, contained in this paper clearly distinguish gains in performance to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information would facilitate sound design decisions when incorporating HTMs into parallel architectures.","PeriodicalId":115365,"journal":{"name":"2011 International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115609581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}