{"title":"High performance two-dimensional phase unwrapping on GPUs","authors":"Zhenhua Wu, Wenjing Ma, Guoping Long, Yucheng Li, Qiuyan Tang, Zhongjie Wang","doi":"10.1145/2597917.2597931","DOIUrl":"https://doi.org/10.1145/2597917.2597931","url":null,"abstract":"Phase unwrapping is an important procedure in digital image and signal processing, and has been widely used in many fields, such as optical and microwave interferometry, magnetic resonance imaging, synthetic aperture radar, adaptive optics. Phase unwrapping is a time consuming process with large amount of calculations and complicated data dependency. A number of algorithms with different features have been developed to solve this problem. Among all of them, Goldstein's algorithm is one of the most widely used algorithms, and has been included in some standard libraries (such as MATLAB). In this paper we propose an innovative implementation of Goldstein's algorithm on GPU. Several important approaches and optimizations are proposed for the GPU algorithm. For example, by introducing a localmatching step, we were able to parallelize the branchcut step efficiently, getting much better performance than existing work. With a cascaded propagation model, another important operation in the algorithm, floodfill, is able to make good use of the computing power of GPU. We tested our GPU algorithm on NVIDIA C2050 and K20 GPUs, and achieved speedup of up to 781 and 896 over the CPU implementation respectively. To the best of our knowledge, this is the best performance of unwrap ever achieved on GPUs.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130669175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Moving computations from run-time to compile-time: hyper-metaprogramming in practice","authors":"Lucian Radu Teodorescu, Vlad Dumitrel, R. Potolea","doi":"10.1145/2597917.2597933","DOIUrl":"https://doi.org/10.1145/2597917.2597933","url":null,"abstract":"Computer programs often contain computations that can be executed at compile-time. The possibility arises when these operations do not depend on runtime information. When it comes to isolating and executing code at compile-time, static metaprogramming is the method of choice. However, it often suffers in terms of convenience, applicability, or performance. We explore the feasibility of applying hyper-metaprogramming -- an extension of traditional metaprogramming -- to alleviate such issues. We show that a language that supports hyper-metaprogramming enables the movement of computations from run-time to compile-time for arbitrarily complex applications. The paper features two case studies -- minimal perfect hashing and regular expressions -- that exemplify and discuss the implications of our approach. The experimental results show that our method produces significant speedups without compromising convenience.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133253637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling FPGAs in the cloud","authors":"Fei Chen, Yi Shan, Yu Zhang, Yu Wang, H. Franke, Xiaotao Chang, Kun Wang","doi":"10.1145/2597917.2597929","DOIUrl":"https://doi.org/10.1145/2597917.2597929","url":null,"abstract":"Cloud computing is becoming a major trend for delivering and accessing infrastructure on demand via the network. Meanwhile, the usage of FPGAs (Field Programmable Gate Arrays) for computation acceleration has made significant inroads into multiple application domains due to their ability to achieve high throughput and predictable latency, while providing programmability, low power consumption and time-to-value. Many types of workloads, e.g. databases, big data analytics, and high performance computing, can be and have been accelerated by FPGAs. As more and more workloads are being deployed in the cloud, it is appropriate to consider how to make FPGAs and their capabilities available in the cloud. However, such integration is non-trivial due to issues related to FPGA resource abstraction and sharing, compatibility with applications and accelerator logics, and security, among others. In this paper, a general framework for integrating FPGAs into the cloud is proposed and a prototype of the framework is implemented based on OpenStack, Linux-KVM and Xilinx FPGAs. The prototype enables isolation between multiple processes in multiple VMs, precise quantitative acceleration resource allocation, and priority-based workload scheduling. Experimental results demonstrate the effectiveness of this prototype, an acceptable overhead, and good scalability when hosting multiple VMs and processes.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132652378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a new tuple-based programming paradigm for expressing and optimizing irregular parallel computations","authors":"K. Rietveld, H. Wijshoff","doi":"10.1145/2597917.2597923","DOIUrl":"https://doi.org/10.1145/2597917.2597923","url":null,"abstract":"Irregular computations have the inherent property of being hard to automatically optimize and parallelize. In this paper, a new tuple-based programming paradigm is described for expressing irregular computations. At the basis, this programming paradigm allows irregular computations to be specified on an elementary data entry (tuple) level rather than on (complicated) data structures. As a consequence the actual data structures are being constructed during the code generation phase. Using this framework not only current implementations of irregular computations in for instance the C programming language can be automatically mapped into the tuple-based programming model, but also the code generated from this specification is competitive with hand-optimized codes. The potential of this approach is demonstrated on two representative applications: sparse triangular solve to represent sparse linear algebra and an implementation of the Bellman-Ford algorithm to represent graph algorithms. We demonstrate that from an ordinary triangular solve code, parallelized implementations can be automatically generated that up till now could only be derived by hand. We show that the performance of these automatically generated implementations is comparable with the performance of hand-optimized triangular solvers. For the Bellman-Ford algorithm initial experiments have been conducted which show that the derived GPU implementations of this algorithm achieve speedups in execution time of two to four orders of magnitude compared to the initial implementation.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"29 20","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132707462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chameleon: a data organization transformation scheme for big data systems","authors":"Fengfeng Pan, Yinliang Yue, Jin Xiong","doi":"10.1145/2597917.2597921","DOIUrl":"https://doi.org/10.1145/2597917.2597921","url":null,"abstract":"Big data system requires multiple types of data organizations to efficiently support various operations. It is well known that in-place update index, unordered log structured index and ordered log structured index are three typical data organizations which are designed to meet different workload requirements respectively. Differentiated workload requirements in different phase of the data life-cycle lead to data organization transformation. However, typical sequential data organization transformation not only incurs extremely long time, but also significant energy consumption. In this paper, we propose Chameleon, a novel data organization transformation scheme for replication based big data system. The goal of Chameleon is to significantly shorten the data organization transformation process and improve the write performance and the subsequent read performance through data organization transformation, meanwhile eliminate the additional hardware and energy costs by reusing the mirrored disks. For each put request, Chameleon keeps two copies of the key-value pair. One in its normal place and organized in ordered log structured index, and the other in relatively high performance log disk and organized in unordered log structured index. By spreading destaging I/O activities among short idle time slots, key-value pairs are transformed from write-optimized index to read-optimized index. Extensive experimental evaluation based on our prototype shows that Chameleon can shorten the time of data organization transformation and enhance energy efficiency and performance.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130953201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating synchronization communications for high-density blade enclosure","authors":"Zheng Cao, Fei Chen, Xuejun An, Qiang Li","doi":"10.1145/2597917.2597946","DOIUrl":"https://doi.org/10.1145/2597917.2597946","url":null,"abstract":"The high-density blade server provides an attractive solution for the rapidly increasing demand on computing. The degree of parallelism inside a blade enclosure today has reached up to hundreds of cores. In such parallelism, it is necessary to accelerate synchronization operations. In order to accelerate intra-enclosure synchronization operations, this paper proposes a single chip design SyncRouter on the midplane of the blade enclosure. The architecture of SyncRouter is somewhat like a microprocessor whose memory system is of tag-value structure. We call such a memory system Shared Synchronization Memory (SSM). In this paper, both the architecture and usage of the proposed SyncRouter are introduced in detail. We also build a blade enclosure with the SyncRouter implemented in Xilinx XC6LX365T FPGA. Evaluations using both micro-benchmarks and benchmarks are performed on the blade enclosure. The latency of one pair of ssm_put and ssm_get and the minimum latency of ssm_barrier are 0.62 μ s, and the minimum latency of ssm_reduce is 0.81 μ s. Regarding the benchmarks 2D Wave-front and LU, the speedup of using the fine-grained synchronization primitive ssm_put/ssm_get outperforms the one of using the ssm_barrier by 20% on average.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123653206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An argument for thread-level speculation and just-in-time compilation in the Google's V8 JavaScript engine","authors":"Jan Kasper Martinsen, Håkan Grahn, Anders Isberg","doi":"10.1145/2597917.2597950","DOIUrl":"https://doi.org/10.1145/2597917.2597950","url":null,"abstract":"Thread-Level Speculation can be used to take advantage of multicore architectures for web applications. We have implemented Thread-Level Speculation in the state-of-the-art JavaScript engine V8 instead of using an interpreted JavaScript engine. We evaluate the implementation with the Chromium web browser on 15 popular web applications for 2, 4, and 8 cores. The results show that it is beneficial to combine Thread-Level Speculation and Just-in-time compilation and that it is possible to take advantage of multicore architectures while hiding the details of parallel programming from the programmer of web applications.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128346934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Micro-checkpointing in fault tolerant runtimes","authors":"Pavlos Katsogridakis, Polyvios Pratikakis","doi":"10.1145/2597917.2597926","DOIUrl":"https://doi.org/10.1145/2597917.2597926","url":null,"abstract":"Multicore processors are increasingly used in safety-critical applications. On one hand, their increasing chip density causes these processors to be more susceptible to transient faults; on the other hand the existence of many cores offers a straightforward compartmentalization against permanent hardware faults. To tackle the first issue and take advantage of the second, we present FT-BDDT, a fault-tolerant task-parallel runtime system. FT-BDDT extends the BDDT runtime system that implements the OMP-Ss dataflow programming model for spawning and scheduling parallel tasks, in which, similarly to OpenMP 4.0, a dynamic dependence analysis detects conicting tasks and automatically synchronizes them to avoid data races and non-determinism. FT-BDDT recovers from both transient and permanent faults. Transient faults during task execution result in simply re-running the task. To handle transient faults in the runtime system, FT-BDDT uses fine-grain micro-checkpointing of the runtime state, so that a recovery is always possible at the level of rerunning a basic block of code on error. Permanent faults are treated in a similar fashion, by having the master core \"steal\" the task checkpoint or the runtime micro-checkpoint and reschedule the task or recover the runtime state, respectively. We evaluate FT-BDDT on several benchmarks under various error conditions, while guiding errors to attain maximum coverage of the runtime code. We find a 9.5% average runtime overhead for checkpointing, a constant small space overhead, and a negligible recovery time per error.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"28 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120925733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple stream tracker: a new hardware stride prefetcher","authors":"Taesu Kim, Dali Zhao, A. Veidenbaum","doi":"10.1145/2597917.2597941","DOIUrl":"https://doi.org/10.1145/2597917.2597941","url":null,"abstract":"Data prefetching is a very important technique for hiding memory latency and improving performance in modern computer processors. Existing techniques are not able to find all or best data streams to prefetch. This paper proposes a new prefetching technique, Multiple Stream Tracker (MST), that improves over state-of-the-art by identifying strided accesses in a cache miss stream. Targeting the lower levels of cache it searches for the best among all possible strided streams to prefetch. A technique to efficiently search and rank multiple strided streams is proposed. The proposed technique can identify streams that subsume streams generated by both delta correlated and standard stride prefetchers. The MST pefetcher can also significantly improve performance in parallel programs. The Multiple Stream Tracker applied at the L3 cache improves the IPC by up to 173% (14% on average) over stride prefetching for SPEC CPU2006 benchmarks. The improvement is up to 92% over delta correlation (5% on average). The speedup for SPEComp programs is up to 300% over delta correlation (22% on average). MST also has lower average memory bandwidth requirements compared to prior techniques.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"30 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126084905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DTT: program structure-aware indirect branch optimization via direct-TPC-table in DBT system","authors":"Ning Jia, Chun Yang, Yu He, Xu Cheng","doi":"10.1145/2597917.2597944","DOIUrl":"https://doi.org/10.1145/2597917.2597944","url":null,"abstract":"Indirect branch handling is a major source of performance overhead in Dynamic Binary Translation (DBT) systems. Most existing solutions for indirect branches involve a run-time address translation from Source Program Counter (SPC) of the branch target to Translated Program Counter (TPC) at every execution of the indirect branches. This paper analyzes the program structures that cause indirect branches, and finds out that most of the branch targets are prestored in the program's memory as some kind of address tables. In other words, the branch target of an indirect branch is not obtained by \"calculating\", but by \"selecting\" from the memory. Based on this observation, we propose a program structure-aware indirect branch handling mechanism called Direct TPC Table (DTT). Our DTT approach probes the target address table of an indirect branch by a passive exception-based scheme, and generates a TPC table from the probed SPC address table at the translation time. Thus, the translated program can load the TPC of a branch target from the TPC table directly, which avoids performing an expensive SPC-to-TPC translation at every execution. In many cases, only 2 instructions are need to handle an indirect branch execution. We implemented the DTT mechanism on a public x86 DBT system. The experiment shows that, DTT improves the system performance by 19.0% compared with hash lookup on a set of indirect intensive benchmarks. Furthermore, DTT does not depend on the underlying architecture or special hardware, so that it can be deployed on various platforms. Meanwhile, DTT can cooperate with other optimization technique of different DBT systems to enhance the performance.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128586430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}