{"title":"Applying CUDA Architecture to Accelerate Full Search Block Matching Algorithm for High Performance Motion Estimation in Video Encoding","authors":"Eduarda Monteiro, B. Vizzotto, C. Diniz, B. Zatt, S. Bampi","doi":"10.1109/SBAC-PAD.2011.19","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.19","url":null,"abstract":"This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm on the CUDA architecture. We also compare the performance achieved by this solution with a theoretical model and two other implementations (sequential and parallel using the OpenMP library). We obtained an O(n^2/log^2 n) speed-up, which fits the proposed theoretical model considering different search areas. This represents up to a 600x gain compared to the serial implementation and 66x compared to the parallel OpenMP implementation.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128223404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workload Balancing Methodology for Data-Intensive Applications with Divisible Load","authors":"C. Rosas, A. Sikora, Josep Jorba, Eduardo César","doi":"10.1109/SBAC-PAD.2011.15","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.15","url":null,"abstract":"Data-intensive applications are those that explore, query, analyze, and, in general, process very large data sets. Generally in High Performance Computing (HPC), the main performance problem associated with these applications is load imbalance or inefficient resource utilization. This paper proposes a methodology for improving the performance of data-intensive applications based on performing multiple data partitions prior to execution and ordering the data chunks according to their processing times during application execution. As a first step, we consider that a single execution includes multiple related explorations on the same data set. Consequently, we propose to monitor the processing of each exploration and use the data gathered to dynamically tune the performance of the application. The tuning parameters included in the methodology are the partition factor of the data set, the distribution of these data chunks, and the number of processing nodes to be used by the application. The methodology has been initially tested using the well-known bioinformatics tool BLAST, obtaining encouraging results (up to a 40% improvement).","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132613734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Skycube Computation with Anthill","authors":"R. R. Veloso, L. Cerf, Chedy Raïssi, Wagner Meira Jr","doi":"10.1109/SBAC-PAD.2011.29","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.29","url":null,"abstract":"Recently, skyline queries have gained considerable attention and are among the most important tools for multi-criteria analysis. In order to process all possible combinations of criteria along with their inherent analysis, researchers introduced and studied the notion of the skycube. Simply put, a skycube is a pre-materialization of all possible subspaces with their associated skylines. An efficient skycube computation relies on the detection of redundancies in the different processing steps and enhanced result sharing between subspaces. Lately, the Orion algorithm was proposed to compute the skycube in a very efficient way. The approach relies on the derivation of skyline points over different subspaces. Nevertheless, because there are 2^{|D|} - 1 subspaces (where D is the set of dimensions) in a skycube, the running time still grows exponentially with the number of dimensions and easily becomes intractable on real-world datasets. In this study, we detail the distribution of Orion within a filter-stream framework and we conduct an extensive set of experiments on large datasets collected from Twitter to demonstrate the efficiency of our method.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131146172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Parallelism for Belief Propagation in Factor Graphs","authors":"N. Ma, Yinglong Xia, V. Prasanna","doi":"10.1109/SBAC-PAD.2011.34","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.34","url":null,"abstract":"We investigate data parallelism for belief propagation in acyclic factor graphs on multicore/manycore processors. Belief propagation is a key problem in exploring factor graphs, a probabilistic graphical model that has found applications in many domains. In this paper, we identify basic operations called node level primitives for updating the distribution tables in a factor graph. We develop algorithms for these primitives to explore data parallelism. We also propose a complete belief propagation algorithm to perform exact inference in such graphs. We implement the proposed algorithms on state-of-the-art multicore processors and show that the proposed algorithms exhibit good scalability using a representative set of factor graphs. On a 32-core Intel Nehalem-EX based system, we achieve 30× speedup for the primitives and 29× for the complete algorithm using factor graphs with large distribution tables.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"200 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122058143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictive and Distributed Routing Balancing on High-Speed Cluster Networks","authors":"Carlos Nunez Castillo, D. Lugones, Daniel Franco, E. Luque","doi":"10.1109/SBAC-PAD.2011.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.27","url":null,"abstract":"In high performance clusters, the communication needs of current parallel applications, such as traffic pattern and communication volume, change over time and are difficult to know in advance. Such needs often exceed or do not match available resources, causing resource use imbalance, network congestion, throughput reduction and message latency increase, thus degrading the overall system performance. Studies on parallel applications show repetitive behavior that can be characterized by a set of representative phases. This work presents a Predictive and Distributed Routing Balancing (PRDRB) technique, a new method developed to gradually control network congestion, based on path expansion, traffic distribution, application pattern repetitiveness and speculative adaptive routing, in order to maintain low latency values. PRDRB monitors message latencies on routers and logs solutions to congestion, to respond quickly in similar future situations. Traffic congestion experiments were conducted in order to evaluate the performance of the method, and improvements were observed.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117120012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Watershed: A High Performance Distributed Stream Processing System","authors":"Thatyene Louise Alves de Souza Ramos, R. S. Oliveira, Ana Paula de Carvalho, R. Ferreira, Wagner Meira Jr","doi":"10.1109/SBAC-PAD.2011.31","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.31","url":null,"abstract":"The task of extracting information from datasets that become larger on a daily basis, such as those collected from the web, is an increasing challenge, but also provides more interesting insights and analysis. Current analyses have gone beyond content and now focus on tracking and understanding users' relationships and interactions. Such computation is intensive both in terms of the processing demand imposed by the algorithms and the sheer amount of data that has to be handled. In this paper we introduce Watershed, a distributed computing framework designed to support the analysis of very large data streams online and in real-time. Data are obtained from streams by the system's processing components, transformed, and directed to other streams, creating large flows of information. The processing components are decoupled from each other and their connections are strictly data-driven. They can be dynamically inserted and removed, providing an environment in which it is feasible for different applications to share intermediate results or cooperate toward a global purpose. Our experiments demonstrate the flexibility in creating a set of data analysis algorithms and their composition into a powerful stream analysis environment.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126906920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MRU-Tour-based Replacement Algorithms for Last-Level Caches","authors":"A. Valero, J. Sahuquillo, S. Petit, P. López, J. Duato","doi":"10.1109/SBAC-PAD.2011.13","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.13","url":null,"abstract":"Memory hierarchy design is a major concern in current microprocessors. Much research work focuses on the Last-Level Cache (LLC), which is designed to hide the long miss penalty of accessing main memory. To reduce both capacity and conflict misses, LLCs are implemented as large memory structures with high associativities. To exploit temporal locality, LRU is the replacement algorithm usually implemented in caches. However, for a highly associative cache, its implementation is costly in terms of area and power consumption. Indeed, LRU is not well suited to the LLC because, as this cache level does not see all memory accesses, it cannot cope with temporal locality. In addition, blocks must descend to the LRU position of the stack before eviction, even when they are no longer useful. In this paper, we show that most of the blocks are not referenced again once they leave the MRU position. Moreover, the probability of being referenced again does not depend on the location on the LRU stack. Based on these observations, we define the number of MRU-Tours (MRUTs) of a block as the number of times that a block occupies the MRU position while it is stored in the cache, and propose the MRUT replacement algorithm, which selects the block to be replaced among the blocks that show only one MRUT. Variations of this algorithm have also been proposed to exploit both MRUT behavior and recency of information. Experimental results show that, compared to LRU, the proposal reduces the MPKI by up to 22%, while IPC is improved by 48%.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123862599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Power-Efficient Co-designed Out-of-Order Processor","authors":"Abhishek Deb, J. M. Codina, Antonio González","doi":"10.1109/SBAC-PAD.2011.9","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.9","url":null,"abstract":"A co-designed processor helps in cutting down both the complexity and power consumption by co-designing certain key performance enablers. In this paper, we propose a FIFO based co-designed out-of-order processor. Multiple FIFOs are added in order to dynamically schedule the micro-ops in a complexity-effective manner. We propose a commit logic that is able to commit the program state as a superblock commits atomically. This enables us to get rid of the Reorder Buffer (ROB) entirely. Instead, to maintain the correct program state, we propose a four/eight entry Superblock Ordering Buffer (SOB). We also propose the per superblock Register Rename Table (SRRT) that holds the register state pertaining to the superblock. Our proposed processor dissipates 6% less power and obtains 12% speedup for SPECFP; as a result, it consumes less energy. Furthermore, we propose an enhanced steering heuristic and an early release mechanism to increase the performance of a FIFO based out-of-order processor. We obtain performance improvements of nearly 25% and 70% for four-FIFO and two-FIFO configurations, respectively. We also show that our proposed steering heuristic based processor consumes 10% less energy than the previously proposed steering heuristic.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116003191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the Performance of the Hadoop Online Prototype","authors":"Emanuel Vianna, Giovanni V. Comarela, Tatiana Pontes, J. Almeida, Virgílio A. F. Almeida, K. Wilkinson, Harumi A. Kuno, U. Dayal","doi":"10.1109/SBAC-PAD.2011.24","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.24","url":null,"abstract":"MapReduce is an important paradigm to support modern data-intensive applications. In this paper we address the challenge of modeling performance of one implementation of MapReduce called Hadoop Online Prototype (HOP), with a specific target on the intra-job pipeline parallelism. We use a hierarchical model that combines a precedence model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among multiple tasks of the same job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time and resource utilization. We validate our solution against a queuing network simulator in various scenarios, finding that our performance model presents a close agreement, with maximum relative difference under 15%.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130389974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiently Managing Advance Reservations Using Lists of Free Blocks","authors":"Jörg Schneider, B. Linnert","doi":"10.1109/SBAC-PAD.2011.25","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2011.25","url":null,"abstract":"Advance reservation was identified as a key technology to enable guaranteed Quality of Service and co-allocation in the Grid. Nonetheless, most Grid and local resource management systems still use the queuing approach because of the additional complexity introduced by advance reservation. A planning based resource management system has to keep track of the reservations in the future and needs a good overview of the available capacity during the negotiation of incoming reservations. For advance reservation, the resource management problem becomes a two dimensional problem. In this paper, different data structures are investigated and discussed with respect to their fit for planning based resource management. As a result, the benefits of using lists of resource allocations or free blocks are shown. This general idea, widely used to manage continuous resources, is extended to cover not only the resource dimension but also the time dimension. The list of blocks approach is evaluated in a Grid level and a resource level resource management system. The extensive simulations showed a better runtime and higher reservation success rate compared with the currently favored approach of slotted time.","PeriodicalId":390734,"journal":{"name":"2011 23rd International Symposium on Computer Architecture and High Performance Computing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130997366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}