{"title":"VisIO: Enabling Interactive Visualization of Ultra-Scale, Time Series Data via High-Bandwidth Distributed I/O Systems","authors":"Christopher Mitchell, J. Ahrens, Jun Wang","doi":"10.1109/IPDPS.2011.17","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.17","url":null,"abstract":"Petascale simulations compute at resolutions ranging into billions of cells and write terabytes of data for visualization and analysis. Interactive visualization of this time series is a desired step before starting a new run. The I/O subsystem and associated network often are a significant impediment to interactive visualization of time-varying data, as they are not configured or provisioned to provide necessary I/O read rates. In this paper, we propose a new I/O library for visualization applications: VisIO. Visualization applications commonly use N-to-N reads within their parallel enabled readers which provides an incentive for a shared-nothing approach to I/O, similar to other data-intensive approaches such as Hadoop. However, unlike other data-intensive applications, visualization requires: (1) interactive performance for large data volumes, (2) compatibility with MPI and POSIX file system semantics for compatibility with existing infrastructure, and (3) use of existing file formats and their stipulated data partitioning rules. VisIO, provides a mechanism for using a non-POSIX distributed file system to provide linear scaling of I/O bandwidth. In addition, we introduce a novel scheduling algorithm that helps to co-locate visualization processes on nodes with the requested data. Testing using VisIO integrated into Para View was conducted using the Hadoop Distributed File System (HDFS) on TACC's Longhorn cluster. A representative dataset, VPIC, across 128 nodes showed a 64.4% read performance improvement compared to the provided Lustre installation. Also tested, was a dataset representing a global ocean salinity simulation that showed a 51.4% improvement in read performance over Lustre when using our VisIO system. VisIO, provides powerful high-performance I/O services to visualization applications, allowing for interactive performance with ultra-scale, time-series data.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116140270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimum Cost Resource Allocation for Meeting Job Requirements","authors":"Venkatesan T. Chakaravarthy, G. Parija, Sambuddha Roy, Yogish Sabharwal, Amit Kumar","doi":"10.1109/IPDPS.2011.12","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.12","url":null,"abstract":"We consider the problem of allocating resources for completing a collection of jobs. Each resource is specified by a start-time, finish-time and the capacity of resource available and has an associated cost, and each job is specified by a start-time, finish-time and the amount of the resource required (demand) during this interval. A feasible solution is a multiset of resources (i.e., multiple units of each resource may be picked) such that at any point of time, the sum of the capacities offered by the resources is at least the total demand of the jobs active at that point of time. The cost of the solution is the sum of the costs of the resources included in the solution (taking into account the units of the resources). The goal is to find a feasible solution of minimum cost. This problem arises naturally in many scenarios. For example, given a set of jobs, we would like to allocate some resource such as machines, memory or bandwidth in order to complete all the jobs. This problem generalizes a covering version of the knapsack problem which is known to be NP-hard. We present a constant factor approximation algorithm for this problem based on a Primal-Dual approach.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116467604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reader Activation Scheduling in Multi-reader RFID Systems: A Study of General Case","authors":"Shaojie Tang, Cheng Wang, Xiangyang Li, Changjun Jiang","doi":"10.1109/IPDPS.2011.109","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.109","url":null,"abstract":"Radio frequency identification (RFID) is a technology where a reader device can \"sense'' the presence of a close by object by reading a tag device attached to the object. To guarantee the coverage quality, multiple RFID readers can be deployed in the given region. In this paper, we consider the problem of activation schedule for readers in a multi-reader environment. In particular, we try to design a schedule for readers to maximize the number of served tags per time-slot while avoiding various interferences. We first develop a centralized algorithm under the assumption that different readers may have different interference and interrogation radius. Next, we propose a novel algorithm which does not need any location information of the readers. Finally, we extend the previous algorithm in distributed manner in order to suit the case where no central entity exists. We conduct extensive simulations to study the performances of our proposed algorithm. And our evaluation results corroborate our theoretical analysis.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121941572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling Functionally Heterogeneous Systems with Utilization Balancing","authors":"Yuxiong He, Jie Liu, Hongyang Sun","doi":"10.1109/IPDPS.2011.113","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.113","url":null,"abstract":"Heterogeneous systems become popular in both client and cloud. A parallel program can incur operations on multiple processing resources such as CPU, GPU, and vector processor units. This paper investigates scheduling problems on functionally heterogeneous systems with the objective of minimizing the completion time of parallel jobs. We first present performance bounds of online scheduling and show that any online algorithm is at best around (K + 1)-competitive with respect to job completion time, where K is the total number of resource types. There exist \"bad\" jobs that prevent any online algorithms from obtaining good interleaving of heterogeneous tasks. This lower bound suggests that the relative performance of online algorithms versus an offline optimal could degrade linearly as types of heterogeneous resources increase. The limitation of online scheduling motivates our study of how additional offline or look ahead information can help improve scheduling performance. We propose a Multi-Queue Balancing algorithm (MQB) that effectively transforms the problem of minimizing completion time to one of maximizing utilization of heterogeneous resources. It promotes interleaving of heterogeneous tasks through balancing the task queues of different types. Our simulation results suggest that MQB reduces the execution time of online greedy algorithms up to 40% over various workloads and outperforms other offline schemes in most cases. Furthermore, MQB can use limited and approximated offline information to improve scheduling decisions.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115967298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication","authors":"A. Buluç, Samuel Williams, L. Oliker, J. Demmel","doi":"10.1109/IPDPS.2011.73","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.73","url":null,"abstract":"On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bit masked register blocks, which promises significant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bit masked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127241019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Library Generation for BLAS3 on GPUs","authors":"Huimin Cui, Lei Wang, Jingling Xue, Yang Yang, Xiaobing Feng","doi":"10.1109/IPDPS.2011.33","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.33","url":null,"abstract":"High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune library to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We focus on demonstrating improved performance and productivity obtained through using our framework to tune BLAS3 routines on three GPU platforms: up to 5.4x speedups over the CUBLAS achieved on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117303850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-Based Job Queue Management for Manycore Architectures and OpenMP Environments","authors":"Junghee Lee, C. Nicopoulos, Y. Lee, H. Lee, Jongman Kim","doi":"10.1109/IPDPS.2011.47","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.47","url":null,"abstract":"The seemingly interminable dwindle of technology feature sizes well into the nano-scale regime has afforded computer architects with an abundance of computational resources on a single chip. The Chip Multi-Processor (CMP) paradigm is now seen as the de facto architecture for years to come. However, in order to efficiently exploit the increasing number of on-chip processing cores, it is imperative to achieve and maintain efficient utilization of the resources at run time. Uneven and skewed distribution of workloads misuses the CMP resources and may even lead to such undesired effects as traffic and temperature hotspots. While existing techniques rely mostly on software for the undertaking of load balancing duties and exploit hardware mainly for synchronization, we will demonstrate that there are wider opportunities for hardware support of load balancing in CMP systems. Based on this fact, this paper proposes IsoNet, a conflict-free dynamic load distribution engine that exploits hardware aggressively to reinforce massively parallel computation in many core settings. Moreover, the proposed architecture provides extensive fault-tolerance against both CPU faults and intra-IsoNet faults. The hardware takes charge of both (1) the management of the list of jobs to be executed, and (2) the transfer of jobs between processing elements to maintain load balance. Experimental results show that, unlike the existing popular techniques of blocking and job stealing, IsoNet is scalable with as many as 1024 processing cores.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116309490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling Parallel Iterative Applications on Volatile Resources","authors":"H. Casanova, F. Dufossé, Y. Robert, F. Vivien","doi":"10.1109/IPDPS.2011.97","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.97","url":null,"abstract":"In this paper we study the execution of iterative applications on volatile processors such as those found on desktop grids. We develop master-worker scheduling schemes that attempt to achieve good trade-offs between worker speed and worker availability. A key feature of our approach is that we consider a communication model where the bandwidth capacity of the master for sending application data to workers is limited. This limitation makes the scheduling problem more difficult both in a theoretical sense and in a practical sense. Furthermore, we consider that a processor can be in one of three states: available, down, or temporarily preempted by its owner. This preempted state also complicates the scheduling problem. In practical settings, e.g., desktop grids, master bandwidth is limited and processors are temporarily reclaimed. Consequently, addressing the aforementioned difficulties is necessary for successfully deploying master-worker applications on volatile platforms. Our first contribution is to determine the complexity of the scheduling problem in its off-line version, i.e., when processor availability behaviors are known in advance. Even with this knowledge, the problem is NP-hard, and cannot be approximated within a factor $8/7$. Our second contribution is a closed-form formula for the expectation of the time needed by a worker to complete a set of tasks. This formula relies on a Markovian assumption for the temporal availability of processors, and is at the heart of some heuristics that aim at favoring \"reliable'' processors in a sensible manner. Our third contribution is a set of heuristics, which we evaluate in simulation. Our results provide guidance to selecting the best strategy as a function of processor state availability versus average task duration.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123969076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-Aware Replica Placement and Update Strategies in Tree Networks","authors":"A. Benoit, Paul Renaud-Goud, Y. Robert","doi":"10.1109/IPDPS.2011.11","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.11","url":null,"abstract":"This paper deals with optimal strategies to place replicas in tree networks, with the double objective to minimize the total cost of the servers, and/or to optimize power consumption. The client requests are known beforehand, and some servers are assumed to pre-exist in the tree. Without power consumption constraints, the total cost is an arbitrary function of the number of existing servers that are reused, and of the number of new servers. Whenever creating and operating a new server has higher cost than reusing an existing one (which is a very natural assumption), cost optimal strategies have to trade-off between reusing resources and load-balancing requests on new servers. We provide an optimal dynamic programming algorithm that returns the optimal cost, thereby extending known results without pre-existing servers. With power consumption constraints, we assume that servers operate under a set of $M$ different modes depending upon the number of requests that they have to process. In practice $M$ is a small number, typically $2$ or $3$, depending upon the number of allowed voltages. Power consumption includes a static part, proportional to the total number of servers, and a dynamic part, proportional to a constant exponent of the server mode, which depends upon the model for power. The cost function becomes a more complicated function that takes into account reuse and creation as before, but also upgrading or downgrading an existing server from one mode to another. We show that with an arbitrary number of modes, the power minimization problem is NP-complete, even without cost constraint, and without static power. Still, we provide an optimal dynamic programming algorithm that returns the minimal power, given a threshold value on the total cost, it has exponential complexity in the number of modes $M$, and its practical usefulness is limited to small values of~$M$. Still, experiments conducted with this algorithm show that it can process large trees in reasonable time, despite its worst-case complexity.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124661555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method","authors":"M. Hoemmen","doi":"10.1109/IPDPS.2011.93","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.93","url":null,"abstract":"Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, \"communication\" includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block (\"normalization\"). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5 -- 20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129593393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}