Guilherme Galante, Rogério Luís Rizzi, T. A. Diverio
{"title":"A Multigrid-Schwarz Method for the Solution of Hydrodynamics and Heat Transfer Problems in Unstructured Meshes","authors":"Guilherme Galante, Rogério Luís Rizzi, T. A. Diverio","doi":"10.1109/SBAC-PAD.2007.20","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.20","url":null,"abstract":"This paper presents a parallel multigrid-Schwarz method for the solution of hydrodynamics and heat transfer problems. In the proposed method, the solution is obtained by a multigrid method parallelized by domain decomposition techniques, more specifically by the additive Schwarz method. The experiments performed have shown that the proposed implementation is computationally efficient and has good scalability and good numerical quality.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117164752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyacinthe Nzigou Mamadou, T. Nanri, K. Murakami, Guilherme de Melo Baptista Domingues
{"title":"Performance Analysis and Linear Optimization Modeling of All-to-all Collective Communication Algorithms","authors":"Hyacinthe Nzigou Mamadou, T. Nanri, K. Murakami, Guilherme de Melo Baptista Domingues","doi":"10.1109/SBAC-PAD.2007.25","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.25","url":null,"abstract":"The performance of collective communication operations still represents a critical issue for high performance computing systems. Users of parallel machines need to have a good grasp of how different communication patterns and styles affect the performance of message-passing applications. This paper reports our contribution to the analysis of collective communication algorithms in the context of the MPI programming paradigm by extending a standard point-to-point communication model, P-LogP. We focus on MPI Alltoall since this function is one of the most communication intensive collective operations known. In order to reduce the gap between the predicted and the measured run-time, all the system parameters are also taken into account in the total performance estimation, by applying linear regression modeling to the empirical data. Results on InfiniBand clusters show that the final performance prediction models can accurately capture the entire system communication behavior of all algorithms, even for large messages and large numbers of processors.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132119098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
W. Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr, R. Ferreira, Dorgival Olavo Guedes Neto, A. D. Silva
{"title":"A Scalable Parallel Deduplication Algorithm","authors":"W. Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr, R. Ferreira, Dorgival Olavo Guedes Neto, A. D. Silva","doi":"10.1109/SBAC-PAD.2007.32","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.32","url":null,"abstract":"The identification of replicas in a database is fundamental to improving the quality of the information. Deduplication is the task of identifying replicas in a database that refer to the same real world entity. This process is not always trivial, because data may be corrupted during their gathering, storing or even manipulation. Problems such as misspelled names, data truncation, data input in a wrong format, lack of conventions (like how to abbreviate a name), missing data or even fraud may lead to the insertion of replicas in a database. The deduplication process may be very hard, if not impossible, to perform manually, since actual databases may have hundreds of millions of records. In this paper, we present our parallel deduplication algorithm, called FERAPARDA. By using probabilistic record linkage, we were able to successfully detect replicas in synthetic datasets with more than 1 million records in about 7 minutes using a 20-computer cluster, achieving an almost linear speedup. We believe that our results have no equivalent in the literature when it comes to the size of the data set and the processing time.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124819204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Register File Energy Optimization for Snooping Based Clustered VLIW Architectures","authors":"Rahul Nagpal, Y. Srikant","doi":"10.1109/SBAC-PAD.2007.35","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.35","url":null,"abstract":"Frequent accesses to the register file make it one of the major sources of energy consumption in ILP architectures. The large number of functional units connected to a large unified register file in VLIW architectures makes power dissipation in the register file even worse because of the need for a large number of ports. High power dissipation in the relatively small area occupied by a register file leads to a high power density in the register file and makes it one of the prime hot-spots. This makes it highly susceptible to the possibility of a catastrophic heatstroke. This in turn impacts the performance and cost because of the need for periodic cool down and sophisticated packaging and cooling techniques respectively. Clustered VLIW architectures partition the register file among clusters of functional units and reduce the number of ports required, thereby reducing the power dissipation. However, we observe that the aggregate accesses to register files in clustered VLIW architectures (and associated energy consumption) become very high compared to the centralized VLIW architectures and this can be attributed to a large number of explicit inter-cluster communications. Snooping based clustered VLIW architectures provide a very limited but very fast way of inter-cluster communication by allowing some of the functional units to directly read some of the operands from the register file of some of the other clusters. In this paper, we propose instruction scheduling algorithms that exploit the limited snooping capability to reduce the register file energy consumption on average by 12% and 18% and improve the overall performance by 5% and 11% for a 2-clustered and a 4-clustered machine respectively, over an earlier state-of-the-art clustered scheduling algorithm when evaluated in the context of snooping based clustered VLIW architectures.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125742886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impacts of Multiprocessor Configurations on Workloads in Bioinformatics","authors":"Youfeng Wu, M. Breternitz, V. Ying","doi":"10.1109/SBAC-PAD.2007.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.30","url":null,"abstract":"Bioinformatics is among the most active research areas in computer science. In this study, we investigate a suite of workloads in bioinformatics on two multiprocessor systems with different configurations, and examine the effects of the configurations on the performance of the workloads. Our results indicate that the configurations of the multiprocessor systems have significant impact on the performance and scalability of the workloads. For example, a number of workloads have significantly higher scalability on one of the systems, but poorer absolute performance than on the other system. However, traditional scalability failed to capture the impacts of the system configurations on the workloads. We present insights on what kinds of workloads will run faster on which systems and propose new metrics to capture the impacts of multiple processor configurations on the workloads. These findings not only provide an easy way to compare results running on different systems, but also enable re-configuration of the underlying systems to run specific workloads efficiently. We also show how processor mapping and loop spreading may help map the workloads to the underlying multiprocessor configuration and achieve consistent scalability for these workloads.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"31 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116339539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Queue Register File Optimization Algorithm for QueueCore Processor","authors":"A. Canedo, B. Abderazek, M. Sowa","doi":"10.1109/SBAC-PAD.2007.10","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.10","url":null,"abstract":"The queue computation model offers an attractive alternative for high-performance embedded computing given its characteristics of short instructions and high instruction level parallelism. A queue-based processor uses a FIFO queue to read and write operands through hardware pointers located at the head and tail of the queue. Queue length is the number of elements stored between the head and the tail pointers during computations. We have found that 95% of the statements in integer applications require a queue length of less than 32 words. The remaining 5% requires larger queue length sizes up to 230 queue words. In this paper we propose a compiler technique to optimize the queue utilization for the hungry statements that require a large amount of queue. We show that for SPEC CINT95 benchmarks, our technique optimizes the queue length without decreasing parallelism. However, our optimization has a penalty of a slight increase in code size.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134449856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bruno Coutinho, Dorgival Olavo Guedes Neto, Wagner Meira Jr, R. Ferreira
{"title":"Fault-tolerance in filter-labeled-stream applications","authors":"Bruno Coutinho, Dorgival Olavo Guedes Neto, Wagner Meira Jr, R. Ferreira","doi":"10.1109/SBAC-PAD.2007.31","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.31","url":null,"abstract":"Fault tolerance is a desirable feature in distributed high-performance systems, since applications tend to run for long periods of time and faults become more likely as the number of nodes in the system increases. However, most distributed environments lack any fault-tolerant features, since they tend to be hard to implement and use, and often hurt performance dramatically. In this paper we discuss how we successfully added fault-tolerance to the Anthill distributed programming environment by using an application-level checkpoint/rollback solution. The programming model offers an abstraction where the programmer can easily identify points during the execution where the communication pattern is well defined, forming a consistent cut where checkpoints may be saved consistently without requiring extra communication, avoiding any domino effect during recovery from faults. We present the new abstractions for fault tolerance, describe how the solution was implemented and present performance results that show the efficiency of the solution with both regular and irregular applications.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132949893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. R. Xavier, R. S. Oliveira, V. D. F. Vieira, R. D. Santos, Wagner Meira Jr
{"title":"Multi-level Parallelism in the Computational Modeling of the Heart","authors":"C. R. Xavier, R. S. Oliveira, V. D. F. Vieira, R. D. Santos, Wagner Meira Jr","doi":"10.1109/SBAC-PAD.2007.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.19","url":null,"abstract":"Computational modeling of the heart has proven to be a useful tool for the investigation and comprehension of the complex biophysical processes that underlie cardiac function. Unfortunately, large scale simulations, such as those resulting from the discretization of an entire heart, remain a computational challenge. In order to reduce simulation execution times, parallel implementations have traditionally exploited data parallelism via numerical schemes based on domain-decomposition. However, it has been verified that the parallel efficiency of these implementations severely degrades as the number of processors increases. In this work, we propose and implement a new parallel algorithm for the solution of cardiac models. By relaxing the coherence of the execution, a new level of parallelism could be identified and exploited: pipelining. A synchronous parallel algorithm that uses both pipelining and data decomposition techniques was implemented using the MPI library for communication. Numerical tests were performed on an 8-node Linux cluster. Our preliminary results indicate that the proposed algorithm is able to increase the parallel efficiency by up to 20% when compared to the traditional approach that uses pure data-level parallelism. In addition, the numerical precision was kept under control (relative errors under 4%) when the relaxed coherence execution was adopted.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125296246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Selector of Grid Resources based on the Semantic Integration of Multiple Ontologies","authors":"A. C. Silva, M. Dantas","doi":"10.1109/SBAC-PAD.2007.16","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.16","url":null,"abstract":"Differing resource descriptions from different virtual organizations in a grid environment exemplify the challenge of matching a specific resource that may have similar characteristics but diverse descriptions. The use of a semantic matching method, based on ontology descriptions, is an alternative that can be considered by a software package to tackle this problem. However, recent research indicates that fully automated systems are not able to recognize all possible relations between different ontologies. In other words, human interaction is necessary after the recognition phase, when preliminary results are obtained from an ontology matching operation. This interaction is important in order to build more logical knowledge for creating efficient queries. In this article, we present a prototype tool which was designed and implemented to reduce issues related to matching grid resources.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133528602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francisco Heron de Carvalho Junior, Ricardo C. Corrêa, G. A. Araújo, Jefferson de Carvalho Silva, R. Lins
{"title":"High-Level Service Connectors for Component-Based High Performance Computing","authors":"Francisco Heron de Carvalho Junior, Ricardo C. Corrêa, G. A. Araújo, Jefferson de Carvalho Silva, R. Lins","doi":"10.1109/SBAC-PAD.2007.34","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2007.34","url":null,"abstract":"Component-based programming has been applied to address the requirements of applications in high performance computing (HPC). The usual service connectors of commercial component models do not fit some requirements of HPC, mainly regarding the support of parallelism, however. This paper looks at extensions to the usual notion of service connector to meet such requirements, using the # component model as a substratum, evidencing its expressiveness.","PeriodicalId":261956,"journal":{"name":"19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134252871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}