Ferad Zyulkyarov, Srdjan Stipic, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero
{"title":"Discovering and understanding performance bottlenecks in transactional applications","authors":"Ferad Zyulkyarov, Srdjan Stipic, T. Harris, O. Unsal, A. Cristal, I. Hur, M. Valero","doi":"10.1145/1854273.1854311","DOIUrl":"https://doi.org/10.1145/1854273.1854311","url":null,"abstract":"Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and tuning programs which use transactions.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115046329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. J. Lee, John Kim, D. Abts, Michael R. Marty, Jae W. Lee
{"title":"Approximating age-based arbitration in on-chip networks","authors":"M. J. Lee, John Kim, D. Abts, Michael R. Marty, Jae W. Lee","doi":"10.1145/1854273.1854359","DOIUrl":"https://doi.org/10.1145/1854273.1854359","url":null,"abstract":"The on-chip network of emerging many-core CMPs enables the sharing of numerous on-chip components. This on-chip network needs to ensure fairness when accessing the shared resources. In this work, we propose providing equality of service (EoS) in future many-core CMPs on-chip networks by leveraging distance, or hop count, to approximate the age of packets in the network. We propose probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of conventional round-robin arbiter. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics - fixed weight, constantly increasing weight, and variably increasing weight. By only modifying the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a complexity-effective mechanism for achieving EoS.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132112537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPACE: Sharing pattern-based directory coherence for multicore scalability","authors":"Hongzhou Zhao, Arrvindh Shriraman, S. Dwarkadas","doi":"10.1145/1854273.1854294","DOIUrl":"https://doi.org/10.1145/1854273.1854294","url":null,"abstract":"An important challenge in multicore processors is the maintenance of cache coherence in a scalable manner. Directory-based protocols save bandwidth and achieve scalability by associating information about sharer cores with every cache block. As the number of cores and cache sizes increase, the directory itself adds significant area and energy overheads.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133010893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ordered and unordered algorithms for parallel breadth first search","authors":"M. A. Hassaan, Martin Burtscher, K. Pingali","doi":"10.1145/1854273.1854341","DOIUrl":"https://doi.org/10.1145/1854273.1854341","url":null,"abstract":"We describe and evaluate ordered and unordered algorithms for shared-memory parallel breadth-first search. The unordered algorithm is based on viewing breadth-first search as a fixpoint computation, and in general, it may perform more work than the ordered algorithms while requiring less global synchronization.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133083623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Ganesan, Jungho Jo, W. Bircher, Dimitris Kaseridis, Zhibin Yu, L. John
{"title":"System-level Max POwer (SYMPO) - a systematic approach for escalating system-level power consumption using synthetic benchmarks","authors":"K. Ganesan, Jungho Jo, W. Bircher, Dimitris Kaseridis, Zhibin Yu, L. John","doi":"10.1145/1854273.1854282","DOIUrl":"https://doi.org/10.1145/1854273.1854282","url":null,"abstract":"To effectively design a computer system for the worst case power consumption scenario, system architects often use hand-crafted maximum power consuming benchmarks at the assembly language level. These stressmarks, also called power viruses, are very tedious to generate and require significant domain knowledge. In this paper, we propose SYMPO, an automatic SYstem level Max POwer virus generation framework, which maximizes the power consumption of the CPU and the memory system using genetic algorithm and an abstract workload generation framework. For a set of three ISAs, we show the efficacy of the power viruses generated using SYMPO by comparing the power consumption with that of MPrime torture test, which is widely used by industry to test system stability. Our results show that the usage of SYMPO results in the generation of power viruses that consume 14–41% more power compared to MPrime on SPARC ISA. The genetic algorithm achieved this result in about 70 to 90 generations in 11 to 15 hours when using a full system simulator. We also show that the power viruses generated in the Alpha ISA consume 9–24% more power compared to the previous approach of stressmark generation. We measure and provide the power consumption of these benchmarks on hardware by instrumenting a quad-core AMD Phenom II X4 system. The SYMPO power virus consumes more power compared to various industry grade power viruses on x86 hardware. We also provide a microarchitecture independent characterization of various industry standard power viruses.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125863323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tiled-MapReduce: Optimizing resource usages of data-parallel applications on multicore with tiling","authors":"Rong-Xin Chen, Haibo Chen, B. Zang","doi":"10.1145/1854273.1854337","DOIUrl":"https://doi.org/10.1145/1854273.1854337","url":null,"abstract":"The prevalence of chip multiprocessor opens opportunities of running data-parallel applications originally in clusters on a single machine with many cores. MapReduce, a simple and elegant programming model to program large scale clusters, has recently been shown to be a promising alternative to harness the multicore platform.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127236447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yangchun Luo, Venkatesan Packirisamy, W. Hsu, Antonia Zhai
{"title":"Energy efficient speculative threads: Dynamic thread allocation in same-ISA heterogeneous multicore systems","authors":"Yangchun Luo, Venkatesan Packirisamy, W. Hsu, Antonia Zhai","doi":"10.1145/1854273.1854329","DOIUrl":"https://doi.org/10.1145/1854273.1854329","url":null,"abstract":"Thread-level parallelism at the chip level is critical in overcoming some of the challenges that have been ushered in through the advent of modern multicore processors (CMP). Extracting speculatively parallel threads from sequential applications and executing these threads on multicore processors is a promising technique to speed up these applications on multicore systems. However, the potential degradation in energy efficiency associated is an important factor that hinders the deployment of this technique. For multicore systems that integrate same-ISA heterogeneous cores, it is possible to judiciously allocate speculative threads to achieve energy-efficient performance improvement.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131382451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Hari, John B. P. McCabe, Jon Banafato, Marcus Henry, Kevin Ko, Emmanouil Koukoumidis, U. Kremer, M. Martonosi, L. Peh
{"title":"Adaptive spatiotemporal node selection in dynamic networks","authors":"P. Hari, John B. P. McCabe, Jon Banafato, Marcus Henry, Kevin Ko, Emmanouil Koukoumidis, U. Kremer, M. Martonosi, L. Peh","doi":"10.1145/1854273.1854304","DOIUrl":"https://doi.org/10.1145/1854273.1854304","url":null,"abstract":"Dynamic networks—spontaneous, self-organizing groups of devices—are a promising new computing platform. Writing applications for such networks is a daunting task, however, due to their extreme variability and unpredictability, with many devices having significant resource limitations. Intelligent, automated distribution of work across network nodes is needed to get the most out of limited resource budgets.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116946558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic vector instruction selection for dynamic compilation","authors":"R. Barik, Jisheng Zhao, Vivek Sarkar","doi":"10.1145/1854273.1854358","DOIUrl":"https://doi.org/10.1145/1854273.1854358","url":null,"abstract":"Accelerating program performance via short SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, and AltiVec SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers still fall short of using the full potential of vector units, and only generate vector code for simple loop nests. In this poster, we present the design and implementation of an auto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. Additionally, we describe a vector instruction selection algorithm based on dynamic programming. Our results obtained in JikesRVM dynamic compilation environment show performance improvement of up to 57.71% on an Intel Xeon processor, compared to non-vectorized execution.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116503568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. A. Suleman, Moinuddin K. Qureshi, Khubaib, Y. Patt
{"title":"Feedback-directed pipeline parallelism","authors":"M. A. Suleman, Moinuddin K. Qureshi, Khubaib, Y. Patt","doi":"10.1145/1854273.1854296","DOIUrl":"https://doi.org/10.1145/1854273.1854296","url":null,"abstract":"Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127070306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}