{"title":"Design space exploration using T&D-Bench","authors":"S. Soares, F. Wagner","doi":"10.1109/CAHPC.2004.16","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.16","url":null,"abstract":"This paper presents T&D-Bench - teaching and design workbench, a software infrastructure for modeling and simulation of state-of-the-art processors. It combines features that simplify and accelerate the processor design process without restricting the designer's possibilities, thus representing a good tradeoff for educational and research purposes that is not found in other environments. In T&D-Bench, a new model is constructed by the designer using a script language to define microarchitecture, instruction set, and timing aspects of the processor. These scripts can be produced by a graphical front-end, and a Java simulator targeted at the modeled processor is automatically built from the scripts. This approach fits the requirements imposed by the educational environment well. Fine-tuning adjustments or the description of more complex processor mechanisms can be achieved by means of modifications in selected parts of the software infrastructure.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127479341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High performance communication system based on generic programming","authors":"A.L.G. Sanches, F. R. Secco, A. A. Fröhlich","doi":"10.1109/CAHPC.2004.19","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.19","url":null,"abstract":"This paper presents a high performance communication system based on generic programming. The system adapts itself according to the protocol being used for communication, simplifying the development of libraries. In order to validate the concepts, an MPI implementation has been developed and it is compared to a traditional implementation - MPICH-GM. It is demonstrated that the same functionality and interface can be offered with similar performance, but with much less programming effort. That is evidence that the large size of traditional MPI implementations is due to the limitations of conventional communication systems.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"9 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127501481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the dynamic behavior of workload execution in SVM systems","authors":"S. Petit, J. Sahuquillo, A. Pont, D. Kaeli","doi":"10.1109/CAHPC.2004.12","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.12","url":null,"abstract":"The overhead associated with software management of shared virtual memory (SVM) systems can seriously impact overall system performance. One way to remedy this situation is to design more efficient SVM consistency protocols. In this paper we study a number of parallel workload characteristics that can negatively impact the performance of SVM systems. We attempt to quantify the sources of performance loss in some parallel workloads. Our goal is to better understand these characteristics, enabling us to develop SVM protocols that can adjust to dynamics in workload behavior. This paper has three main contributions: i) we measure the contention for synchronization resources, showing how applications exhibit distinct phases during their execution, ii) we quantify the relationship between page size and fragmentation/false sharing while varying the sharing unit size, and iii) we study the synergies between the contention for synchronization resources and fragmentation/false sharing, providing hints for developing improved protocols.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122176999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving server performance on transaction processing workloads by enhanced data placement","authors":"J. Rubio, C. Lefurgy, L. John","doi":"10.1109/CAHPC.2004.22","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.22","url":null,"abstract":"Modern servers access large volumes of data while running commercial workloads. The data is typically spread among several storage devices (e.g. disks). Carefully placing the data across the storage devices can minimize costly remote accesses and improve performance. We propose the use of simulated annealing to arrive at an effective layout of data on disk. The proposed technique considers the configuration of the system and the cost of data movement. An initial layout, globally optimized across all queries, shows speedups of up to 13% for a group of DSS queries and up to 6% for selected OLTP queries. This technique can be re-applied at run-time to further improve performance beyond the initial, globally optimized data layout. This scheme monitors architecture parameters to prevent optimizations of multiple operations from conflicting with each other. Such a dynamic reorganization results in speedups of up to 23% for the DSS queries and up to 10% for the OLTP queries.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130624206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A performance evaluation of a quorum-based state-machine replication algorithm for computing grids","authors":"Jean-Michel Busca, M. Bertier, F. Belkouch, Pierre Sens, L. Arantes","doi":"10.1109/CAHPC.2004.4","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.4","url":null,"abstract":"Quorum systems are well-known tools that improve the performance and the availability of distributed systems. In this paper we explore their use as a means to achieve low response time for network services that are replicated and accessed over computing grids. To that end, we propose both a quorum construction and a quorum-based state-machine replication algorithm that tolerates crash failures in a partially synchronous model. We show through the evaluation of a real implementation that although simple, this quorum construction and replication algorithm exhibits a response time 20% lower than that of a regular active replication algorithm in appropriate conditions.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114762876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph partitioning with the Party library: helpful-sets in practice","authors":"B. Monien, Stefan Schamberger","doi":"10.1109/CAHPC.2004.18","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.18","url":null,"abstract":"Graph partitioning is an important subproblem in many applications. To partition a graph into more than two parts, there exist two different commonly used approaches: either the graph is partitioned directly into the desired number of partitions or the graph is first split into two partitions that are then further divided recursively. It has been shown that even optimal recursive bisection can lead to solutions \"very far from the optimal one\". However, for \"important graph classes\" recursive bisection solutions are known to be \"almost always\" within a constant factor of the optimal one. Thus, the question arises how well recursive bisection performs in practice. In this paper we describe enhancements to the Party graph partitioning library which is based on the helpful-set bisection heuristic and present results of extensive tests undertaken with it. We thereby compare Party with the two state-of-the-art libraries Metis and Jostle using a permutation based evaluation scheme. We show experimentally that there are indeed many cases where a recursive application of a good bisection heuristic is likely to find better solutions than up-to-date direct approaches.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131574106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel engine for graphical interactive molecular dynamics simulations","authors":"E. Rodrigues, A. J. Preto, S. Stephany","doi":"10.1109/CAHPC.2004.3","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.3","url":null,"abstract":"The current work proposes a parallel implementation for interactive molecular dynamics simulations (MD). The interactive capability is modeled by finite automata that are executed in the processing nodes. Any interaction implies communication between the user interface and the finite automata. The ADKS, an interactive sequential MD code that provides graphical output, was chosen as a case study. A parallel version of this code was developed using the MPI communication library to check its parallel performance without/with visualization. Performance results are discussed for both cases and the influence of visualization on the performance is also treated, including image update rate. In order to allow a modular approach, a new parallel version of the ADKS is being implemented employing the PyMPI Python extension.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117117839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-profile instruction based compression","authors":"E. W. Netto, R. Azevedo, P. Centoducatte, G. Araújo","doi":"10.1109/CAHPC.2004.26","DOIUrl":"https://doi.org/10.1109/CAHPC.2004.26","url":null,"abstract":"Code compression has been used to minimize the memory area requirement of embedded systems. Recently, performance improvement and energy consumption reduction are observed as a by-product of compression. In this paper we propose a novel technique for efficiently exploring the trade-offs involved in code compression. Our multiprofile approach to build dictionaries combines the best features of both static and dynamic program behaviors. The experiments with Mediabench and MiBench suites and the Leon (SPARCv8) processor reveal a compression ratio as low as 71% while performance speed-up reaches 1.5.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121965285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the combined scheduling of malleable and rigid jobs","authors":"J. Hungershofer","doi":"10.1109/SBAC-PAD.2004.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2004.27","url":null,"abstract":"The demand of the users of parallel systems for low response times contradicts the ambition of the system maintainers for a high utilization. A high utilization normally results in long waiting times for the users' jobs. To fulfill the concerns of both interest groups is a hard job to do. The usage of more flexible job models can be a way out of the dilemma. These models allow jobs to change their width at application start (moldable jobs) or even during execution (malleable jobs). We have analyzed the quality of schedules using job sets with moldable and malleable jobs and combinations of both. Tracefiles from supercomputer installations have been modified to contain varying fractions of moldable and malleable jobs. Using a special simulation environment for the more flexible job models, the jobs have been scheduled virtually. The results show that both interest groups mentioned above can be pleased if these job models are used and the average response times become significantly better.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129890300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cluster-based strategy for scheduling task on heterogeneous processors","authors":"Cristina Boeres, J. V. Filho, Vinod E. F. Rebello","doi":"10.1109/SBAC-PAD.2004.1","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2004.1","url":null,"abstract":"Efficient task scheduling is fundamental for parallel applications to achieve good performance on distributed systems. While extensive work exists for scheduling tasks on homogeneous processors, fewer algorithms exist for the more common problem of scheduling in heterogeneous processor environments. In this paper, we propose coupling a replication-based clustering heuristic for homogeneous processors, with a mechanism to map the generated clusters to the heterogeneous environment. Experimental results show that this strategy compares favourably in terms of the makespan with traditional list scheduling approaches to this problem, particularly when communication costs are high.","PeriodicalId":375288,"journal":{"name":"16th Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130063795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}