{"title":"Real-Time Monitoring of Multicore SoCs through Specialized Hardware Agents on NoC Network Interfaces","authors":"Georgios Kornaros, D. Pnevmatikatos","doi":"10.1109/IPDPSW.2012.27","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.27","url":null,"abstract":"Network-on-chip based multicore systems need efficient management of a multitude of processing resources so that hardware and system software avoid making inefficient time and power decisions at runtime. Hardware event management is a necessary path to assist in high-speed management of captured events and enable efficient reaction mechanisms. This paper proposes different microarchitecture alternatives and describes an infrastructure for real-time monitoring and management of network-on-chip based systems. High-speed and energy-efficient circuit techniques are deployed for monitoring agents that reside at the network interfaces in order to be configured dynamically and communicate computed statistics to centralized hardware monitor managers of different functionality and complexity. An implementation of a pipelined centralized monitor manager is shown, with the capacity to maintain event ordering and process different types of concurrent events. A single event is served with a latency of seven clock cycles. 
The presented results of a quantitative evaluation provide guidelines for system-level designers, proving the need for flexible and at the same time efficient filters for real-time monitors inside complex NoC-based SoCs.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124038888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Optimized Reconfigurable System for Computing the Phylogenetic Likelihood Function on DNA Data","authors":"S. Berger, Nikolaos S. Alachiotis, A. Stamatakis","doi":"10.1109/IPDPSW.2012.43","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.43","url":null,"abstract":"The Phylogenetic Likelihood Function (PLF) is an important statistical function for evaluating phylogenetic trees. To this end, the PLF is the computational kernel of all state-of-the-art likelihood-based phylogenetic inference programs. Typically, it accounts for more than 85% of total execution time in such programs. We present a substantially improved hardware architecture for computing the PLF based on previous experiences with implementing the PLF on reconfigurable logic. Our new design is optimized for computing the PLF on four-state (DNA) input data. It is also adapted to the computational requirements of real-world tree inference programs and completely independent of the specific tree search algorithm at hand. Furthermore, we describe how our architecture can be modified and adapted to handle general n-state data, such as protein (20 states) or RNA secondary structure data (6, 7, or 16 states, depending on the model). Finally, we designed an interface mechanism such that our PLF hardware architecture can interact with the widely-used phylogenetic inference tool RAxML. 
We deploy FPGA technology to verify the correctness of the architecture and to evaluate performance.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127744097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fault-Tolerant Target-Tracking Strategy Based on Unreliable Sensing in Wireless Sensor Networks","authors":"Yi Xie, Guoming Tang, Daifei Wang, W. Xiao, Daquan Tang, Jiuyang Tang","doi":"10.1109/IPDPSW.2012.261","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.261","url":null,"abstract":"Focusing on the unreliable sensing phenomenon in wireless sensor networks and its impact on target-tracking accuracy, this paper first analyzes the uncertain area and its boundaries. The monitored area can then be divided into faces by these uncertain boundaries, and each face has an identical signature vector. On the other hand, for each target localization, whether the RSS of any pair of nodes is ordinal or flipped can be determined by multiple grouping samplings, and the sampling vector is built. Hence, the Fault-Tolerant Target-Tracking (FTTT) strategy is proposed, which transforms the tracking problem into a vector matching process in order to improve the tracking flexibility, increase the tracking accuracy and reduce the influence of in-the-field factors. In addition, a heuristic matching algorithm is introduced to reduce the computational complexity. Results have shown that FTTT is more flexible and has higher tracking accuracy than congenerous methods.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127995947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coverage-aware Geocast Routing in Urban Vehicular Networks","authors":"Ruobing Jiang, Yanmin Zhu","doi":"10.1109/IPDPSW.2012.317","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.317","url":null,"abstract":"Geocast routing in vehicular ad hoc networks plays an important role as the basis of applications such as traffic information sharing, emergency alarming, and geographic advertisement. It is quite challenging, however, to geocast packets through multi-hop relay vehicles because of the highly dynamic network topology, large-scale city road system and fast-moving vehicles. Our idea is to measure vehicles' coverage capability and forward packets to those vehicles with a higher probability of successfully delivering the packets. The idea is rooted in the widely accepted concept that vehicular trajectories improve packet routing and the fact that vehicular trajectories are nowadays available through widely used navigation systems. To accomplish the idea, the difficulty is to measure the coverage capability of a vehicle over a specific region with only partially available vehicular trajectories without accurate timing information. We propose a novel coverage graph to maintain collected trajectories of all the encountered vehicles and their most up-to-date timing information so that the extended coverage capability of each vehicle can be estimated. 
The coverage graph is constructed in a distributed way based on locally shared information and the packet forwarding decisions can be adaptively made to meet different routing objectives.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125992545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulating the Spread of Infectious Disease over Large Realistic Social Networks Using Charm++","authors":"K. Bisset, Ashwin M. Aji, Eric J. Bohm, L. Kalé, Tariq Kamal, M. Marathe, Jae-Seung Yeom","doi":"10.1109/IPDPSW.2012.65","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.65","url":null,"abstract":"Preventing and controlling outbreaks of infectious diseases such as pandemic influenza is a top public health priority. EpiSimdemics is an implementation of a scalable parallel algorithm to simulate the spread of contagion, including disease, fear and information, in large (10^8 individuals), realistic social contact networks using individual-based models. It also has a rich language for describing public policy and agent behavior. We describe CharmSimdemics and evaluate its performance on national scale populations. Charm++ is a machine independent parallel programming system, providing high-level mechanisms and strategies to facilitate the task of developing highly complex parallel applications. Our design includes mapping of application entities to tasks, leveraging the efficient and scalable communication, synchronization and load balancing strategies of Charm++. Our experimental results on a 768 core system show that the Charm++ version achieves up to a 4-fold increase in performance when compared to the MPI version.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132106319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LDPLFS: Improving I/O Performance without Application Modification","authors":"Steven A. Wright, S. Hammond, S. Pennycook, I. Miller, J. Herdman, S. Jarvis","doi":"10.1109/IPDPSW.2012.172","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.172","url":null,"abstract":"Input/Output (I/O) operations can represent a significant proportion of run-time when large scientific applications are run in parallel and at scale. In order to address the growing divergence between processing speeds and I/O performance, the Parallel Log-structured File System (PLFS) has been developed by EMC Corporation and the Los Alamos National Laboratory (LANL) to improve the performance of parallel file activities. Currently, PLFS requires the use of either (i) the FUSE Linux Kernel module, (ii) a modified MPI library with a customised ROMIO MPI-IO library, or (iii) an application rewrite to utilise the PLFS API directly. In this paper we present an alternative method of utilising PLFS in applications. This method employs a dynamic library to intercept the low-level POSIX operations and retarget them to use the equivalents offered by PLFS. We demonstrate our implementation of this approach, named LDPLFS, on a set of standard UNIX tools, as well as on a set of standard parallel I/O intensive mini-applications. The results demonstrate almost equivalent performance to a modified build of ROMIO and improvements over the FUSE-based approach. 
Furthermore, through our experiments we demonstrate decreased performance in PLFS when run at scale on the Lustre file system.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130422592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QoS-Oriented Data Dissemination in VANETs","authors":"Lifeng Zhang, Beihong Jin","doi":"10.1109/IPDPSW.2012.316","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.316","url":null,"abstract":"Data dissemination over long distances in urban scenarios is the foundation of many VANET applications, but rapid shifting in network topology, unstable quality of wireless communication and channel capacity constraints of VANETs pose many challenges to data dissemination. In response, we propose a connectivity-aware data delivery mechanism on the basis of an improved greedy broadcasting. Moreover, we present an in-network and hierarchical data aggregation mechanism to reduce the transferring of the redundant data which result from multi-source data collecting and multi-path data transmitting. Both mechanisms are intended to improve the qualities of data dissemination in VANETs either by enhancing the adaptability to varying traffic flows or by aggregating data in a hierarchical way.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"68 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130755740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SWAPP: A Framework for Performance Projections of HPC Applications Using Benchmarks","authors":"S. Sharkawi, Don DeSota, R. Panda, Stephen Stevens, V. Taylor, Xingfu Wu","doi":"10.1109/IPDPSW.2012.214","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.214","url":null,"abstract":"Surrogate-based Workload Application Performance Projection (SWAPP) is a framework for performance projections of High Performance Computing (HPC) applications using benchmark data. Performance projections of HPC applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement. SWAPP assumes that one has access to a base system and only benchmark data for a target system; the target system is not available for running the HPC application. Projections are developed using the performance profiles of the benchmarks and application on the base system and the benchmark data for the target system. SWAPP projects the performance of the compute and communication components separately, then combines the two projections to get the full application projection. In this paper SWAPP was used to project the performance of three NAS Multi-Zone benchmarks onto three systems (an IBM POWER6 575 cluster and an IBM Intel Westmere X5670, both using an InfiniBand interconnect, and an IBM Blue Gene/P with 3D Torus and Collective Tree interconnects); the base system is an IBM POWER5+ 575 cluster. 
The projected performance of the three benchmarks was within 11.44% average error magnitude and standard deviation of 2.64% for the three systems.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130827841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Offloading C++ Expression Templates to CUDA Enabled GPUs","authors":"Jie Chen, B. Joó, W. Watson, R. Edwards","doi":"10.1109/IPDPSW.2012.293","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.293","url":null,"abstract":"In the last few years, many scientific applications have been developed for powerful graphics processing units (GPUs) and have achieved remarkable speedups. This success can be partially attributed to high performance host callable GPU library routines that are offloaded to GPUs at runtime. These library routines are based on C/C++-like programming toolkits such as CUDA from NVIDIA and have the same calling signatures as their CPU counterparts. Recently, with the sufficient support of C++ templates from CUDA, the emergence of template libraries has enabled further advancement in code reusability and rapid software development for GPUs. However, Expression Templates (ET), which have been very popular for implementing data parallel scientific software for host CPUs because of their intuitive and mathematics-like syntax, have been underutilized by GPU development libraries. The lack of ET usage is caused by the difficulty of offloading expression templates from hosts to GPUs due to the inability to pass instantiated expressions to GPU kernels as well as the absence of the exact form of the expressions for the templates at the time of coding. This paper presents a general approach that enables automatic offloading of C++ expression templates to CUDA enabled GPUs by using the C++ metaprogramming technique and Just-In-Time (JIT) compilation methodology to generate and compile CUDA kernels for corresponding expression templates followed by executing the kernels with appropriate arguments. This approach allows developers to port applications to run on GPUs with virtually no code modifications. 
More specifically, this paper uses a large ET based data parallel physics library called QDP++ as an example to illustrate many aspects of the approach to offload expression templates automatically and to demonstrate very good speedups for typical QDP++ applications running on GPUs against running on CPUs using this method of offloading. In addition, this approach of automatically offloading expression templates could be applied to other many-core accelerators that provide C++ programming toolkits with support for C++ templates.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133139272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Efficient Associative Processor Solution to Guarantee Real-Time Requirements for Air Traffic Control Systems","authors":"M. Yuan, J. Baker, W. Meilander, K. Schaffer","doi":"10.1109/IPDPSW.2012.210","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.210","url":null,"abstract":"This paper proposes a solution to air traffic control (ATC) using an enhanced SIMD machine model called an Associative Processor (AP). Our solution differs from previous ATC systems that are designed for MIMD computers and have a great deal of difficulty meeting the predictability requirements for ATC, which are critical for meeting the strict certification standards required for safety critical software components. The proposed AP solution supports accurate predictions of worst case execution times and guarantees all deadlines are met. Furthermore, the software developed based on the AP model is much simpler and smaller in size than the current corresponding ATC software. As the associative processor is built from SIMD hardware, it is considerably cheaper and simpler than the MIMD hardware currently used to support ATC. We have designed a prototype for eight ATC real-time tasks on a ClearSpeed CSX600 accelerator that is used to emulate the AP. Performance is evaluated in terms of execution time and predictability and is compared to the fastest host-only version implemented using OpenMP on an 8-core multiprocessor (MIMD). Our extensive experiments show that the AP implementation meets all deadlines that can be statically scheduled. In contrast, some tasks miss their deadlines when implemented on MIMD. 
It is shown that the proposed AP solution will support accurate and meaningful predictions of worst case execution times and will guarantee that all deadlines are met.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134465180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}