{"title":"We have it easy, but do we have it right?","authors":"Amer Diwan","doi":"10.1109/IISWC.2008.4636085","DOIUrl":"https://doi.org/10.1109/IISWC.2008.4636085","url":null,"abstract":"We show two severe problems with the state of the art in empirical computer system performance evaluation, observer effect and measurement context bias, and we outline the path toward a solution.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133225646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workshop 22 introduction: Workshop on Large-Scale Parallel Processing - LSPP","authors":"D. Kerbyson, R. Rajamony, C. Weems, J. Baker, H. Siegel, G. Almási, T. Boku, B. Chapman, H. Dietz, D. Katz, J. Levesque, J. Michalakes, C. Mendes, B. Mohr, Stathis Papaefstathiou, Michael Scherger, R. Walker, H. Wasserman, G. Wellein, P. Worley","doi":"10.1109/IPDPS.2008.4536110","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536110","url":null,"abstract":"The workshop on Large-Scale Parallel Processing is a forum that focuses on computer systems that utilize thousands of processors and beyond. This is a very active area given the goals by many worldwide to enhance science-by-simulation by installing large-scale peta-flop systems at the start of the next decade. Large-scale systems, referred to by some as extreme-scale and Ultra-scale, have many important research aspects that need detailed examination in order for their effective design, deployment, and utilization to take place. These include handling the substantial increase in multi-core on a chip, the ensuing interconnection hierarchy, communication, and synchronization mechanisms. The workshop aims to bring together researchers from different communities working on challenging problems in this area for a dynamic exchange of ideas. Work at early stages of development as well as work that has been demonstrated in practice is equally welcome.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116785507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lightweight scalable I/O utility for optimizing High-End Computing applications","authors":"Shujia Zhou, Bruce H. Van Aartser, T. Clune","doi":"10.1109/IPDPS.2008.4536462","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536462","url":null,"abstract":"Filesystem I/O continues to be a major performance bottleneck for many high-end computing (HEC) applications and in particular for Earth science models, which often generate a relatively large volume of data for a given amount of computational work. The severity of this I/O bottleneck rapidly increases with the number of processors utilized. Consequently, considerable computing resources are wasted, and the sustained performance of HEC applications such as climate and weather models is highly constrained. To alleviate much of this bottleneck, we have developed a lightweight software utility designed to improve performance of typical scientific applications by circumventing bandwidth limitations of typical HEC filesystems. The approach is to exploit the faster inter- processor bandwidth to move output data from compute nodes to designated I/O nodes as quickly as possible, thereby minimizing the I/O wait time. This utility has successfully demonstrated a significant performance improvement within a major NASA weather application.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128684053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of laboratory and computational techniques for optimal and quantitative understanding of cellular metabolic networks","authors":"Xiao-Jiang Feng, J. Rabinowitz, H. Rabitz","doi":"10.1109/IPDPS.2008.4536416","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536416","url":null,"abstract":"This paper summarizes the development of laboratory and computational techniques for systematic and reliable understanding of metabolic networks. By combining a filter-based cell culture system and an optimized metabolite extraction protocol, a broad array of cellular metabolites can be reliably quantified following nutrient and other environment perturbations. A nonlinear closed-loop procedure was also developed for optimal bionetwork model identification. Computational illustrations and laboratory applications clearly demonstrate the capabilities of these techniques in understanding cellular metabolism, especially when they are integrated in an optimal fashion.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128217649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ECG segmentation in a body sensor network using Hidden Markov Models","authors":"Huaming Li, Jindong Tan","doi":"10.1109/ISSMDBS.2008.4575075","DOIUrl":"https://doi.org/10.1109/ISSMDBS.2008.4575075","url":null,"abstract":"A novel approach for segmenting ECG signal in a body sensor network employing hidden Markov modeling (HMM) technique is presented. The parameter adaptation in traditional HMM methods is conservative and slow to respond to these beat interval changes. Inadequate and slow parameter adaptation is largely responsible for the low positive predictivity rate. To solve the problem, we introduce an active HMM parameter adaptation and ECG segmentation algorithm. Body sensor networks are used to pre-segment the raw ECG data by performing QRS detection. Instead of one single generic HMM, multiple individualized HMMs are used. Each HMM is only responsible for extracting the characteristic waveforms of the ECG signals with similar temporal features from the same group, so that the temporal parameter adaptation can be naturally achieved.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123790651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Steps toward activity-oriented computing","authors":"J. Sousa, V. Poladian, D. Garlan, B. Schmerl, P. Steenkiste","doi":"10.1109/IPDPS.2008.4536432","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536432","url":null,"abstract":"Most pervasive computing technologies focus on helping users with computer-oriented tasks. In this NSF-funded project, we instead focus on using computers to support user-centered \"activities\" that normally do not involve the use of computers. Examples may include everyday tasks around such as answering the doorbell or doing laundry. A focus on activity-based computing brings to the foreground a number of unique challenges. These include activity definition and representation, system design, interfaces for managing activities, and ensuring robust operation. Our project focuses on the first two challenges.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115237163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Avoiding communication in sparse matrix computations","authors":"J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick","doi":"10.1109/IPDPS.2008.4536305","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536305","url":null,"abstract":"The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the \"matrix powers kernel\" [x, Ax, A2x, ..., Akx], and show that by organizing computations around this kernel, we can achieve near-minimal communication costs. We consider communication very broadly as both network communication in parallel code and memory hierarchy access in sequential code. In particular, we introduce a parallel algorithm for which the number of messages (total latency cost) is independent of the power k, and a sequential algorithm, that reduces both the number and volume of accesses, so that it is independent of k in both latency and bandwidth costs. This is part of a larger project to develop \"communication-avoiding Krylov subspace methods,\" which also addresses the numerical issues associated with these methods. Our algorithms work for general sparse matrices that \"partition well\". We introduce parallel performance models of matrices arising from 2D and 3D problems and show predicted speedups over a conventional algorithm of up to 7times on a petaflop-scale machine and up to 22times on computation across the grid. Analogous sequential performance models of the same problems predict speedups over a conventional algorithm of up to 10times on an out-of-core implementation, and up to 2.5times when we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we validate the model on an out-of-core sequential implementation and measured a speedup of over 3times, which is close to the predicted speedup.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115646727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A software-hardware hybrid steering mechanism for clustered microarchitectures","authors":"Qiong Cai, J. M. Codina, José González, Antonio González","doi":"10.1109/IPDPS.2008.4536229","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536229","url":null,"abstract":"Clustered microarchitectures provide a promising paradigm to solve or alleviate the problems of increasing microprocessor complexity and wire delays. High- performance out-of-order processors rely on hardware-only steering mechanisms to achieve balanced workload distribution among clusters. However, the additional steering logic results in a significant increase on complexity, which actually decreases the benefits of the clustered design. In this paper, we address this complexity issue and present a novel software-hardware hybrid steering mechanism for out-of-order processors. The proposed software- hardware cooperative scheme makes use of the concept of virtual clusters. Instructions are distributed to virtual clusters at compile time using static properties of the program such as data dependences. Then, at runtime, virtual clusters are mapped into physical clusters by considering workload information. Experiments using SPEC CPU2000 benchmarks show that our hybrid approach can achieve almost the same performance as a state-of-the-art hardware-only steering scheme, while requiring low hardware complexity. In addition, the proposed mechanism outperforms state-of-the-art software-only steering mechanisms by 5% and 10% on average for 2-cluster and 4-cluster machines, respectively.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127184344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving software reliability and productivity via mining program source code","authors":"Tao Xie, Mithun P. Acharya, Suresh Thummalapenta, Kunal Taneja","doi":"10.1109/IPDPS.2008.4536384","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536384","url":null,"abstract":"A software system interacts with third-party libraries through various APIs. Insufficient documentation and constant refactorings of third-party libraries make API library reuse difficult and error prone. Using these library APIs often needs to follow certain usage patterns. These patterns aid developers in addressing commonly faced programming problems such as what checks should precede or follow API calls, how to use a given set of APIs for a given task, or what API method sequence should be used to obtain one object from another. Ordering rules (specifications) also exist between APIs, and these rules govern the secure and robust operation of the system using these APIs. These patterns and rules may not be well documented by the API developers. Furthermore, usage patterns and specifications might change with library refactorings, requiring changes in the software that reuse the library. To address these issues, we develop novel techniques (and their supporting tools) based on mining source code, assisting developers in productively reusing third party libraries to build reliable and secure software.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125184595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MVAPICH-Aptus: Scalable high-performance multi-transport MPI over InfiniBand","authors":"Matthew J. Koop, T. Jones, D. Panda","doi":"10.1109/IPDPS.2008.4536283","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536283","url":null,"abstract":"The need for computational cycles continues to exceed availability, driving commodity clusters to increasing scales. With upcoming clusters containing tens-of-thousands of cores, InfiniBand is a popular interconnect on these clusters, due to its low latency (1.5 musec) and high bandwidth (1.5 GB/sec). Since most scientific applications running on these clusters are written using the message passing interface (MPI) as the parallel programming model, the MPI library plays a key role in the performance and scalability of the system. Nearly all MPIs implemented over InfiniBand currently use the reliable connection (RC) transport of InfiniBand to implement message passing. Using this transport exclusively, however, has been shown to potentially reach a memory footprint of over 200 MB/task at 16 K tasks for the MPI library. The Unreliable Datagram (UD) transport, however, offers higher scalability, but at the cost of medium and large message performance. In this paper we present a multi-transport MPI design, MVAPICH-Aptus, that uses both the RC and UD transports of InfiniBand to deliver scalability and performance higher than that of a single-transport MPI design. Evaluation of our hybrid design on 512 cores shows a 12% improvement over an RC-based design and 4% better than a UD-based design for the SMG2000 application benchmark. In addition, for the molecular dynamics application NAMD we show a 10% improvement over an RC-only design. To the best of our knowledge, this is the first such analysis and design of optimized MPI using both UD and RC.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"208 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125909075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}