{"title":"EchoLoc: Accurate Device-Free Hand Localization Using COTS Devices","authors":"Huijie Chen, Fan Li, Yu Wang","doi":"10.1109/ICPP.2016.45","DOIUrl":"https://doi.org/10.1109/ICPP.2016.45","url":null,"abstract":"Hand tracking systems are becoming increasingly popular as a fundamental HCI approach. The trajectory of a moving hand can be estimated by smoothing the position coordinates collected from continuous localization. Therefore, hand localization is a key component of any hand tracking system. This paper presents EchoLoc, which locates the human hand by leveraging the speaker array in Commercial Off-The-Shelf (COTS) devices (i.e., a smart phone plugged into a stereo speaker). EchoLoc measures the distance from the hand to the speaker array via the Time Of Flight (TOF) of a chirp. The speaker array and the hand form a unique triangle; therefore, the hand can be localized with triangular geometry. We prototype EchoLoc as an iOS application, and find that it localizes the hand within five centimeters in 73% of cases and within three centimeters in 48% of cases.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123267071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Think Global, Act Local: A Buffer Cache Design for Global Ordering and Parallel Processing in the WAFL File System","authors":"P. Denz, Matthew Curtis-Maury, V. Devadas","doi":"10.1109/ICPP.2016.51","DOIUrl":"https://doi.org/10.1109/ICPP.2016.51","url":null,"abstract":"Given the enormous disparity in access speeds between main memory and storage media, modern storage servers must leverage highly effective buffer cache policies to meet demanding performance requirements. At the same time, these page replacement policies need to scale efficiently with ever-increasing core counts and memory sizes, which necessitates parallel buffer cache management. However, these requirements of effectiveness and scalability are at odds, because centralized processing does not scale with more processors, and parallel policies are a challenge to implement with maximum effectiveness. We have overcome this difficulty in the NetApp Data ONTAP WAFL file system by using a sophisticated technique that simultaneously allows global buffer prioritization while providing parallel management operations. In addition, we have extended the buffer cache to provide soft isolation of different workloads' buffer cache usage, which is akin to buffer cache quality of service (QoS). This paper presents the design and implementation of these significant extensions in the buffer cache of a high-performance commercial file system.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116750350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROP: Alleviating Refresh Overheads via Reviving the Memory System in Frozen Cycles","authors":"Ping-Hsiu Huang, Wenjie Liu, Kun Tang, Xubin He, Ke Zhou","doi":"10.1109/ICPP.2016.26","DOIUrl":"https://doi.org/10.1109/ICPP.2016.26","url":null,"abstract":"DRAM memory performs periodic refreshes to prevent data loss due to charge leakage, but memory refreshes cause performance degradation and energy consumption, referred to as refresh overheads. In this paper, we propose Refresh-Oriented Prefetching (ROP) to alleviate memory refresh overheads. Before a refresh starts, ROP prefetches cache lines from the to-be-refreshed rank into an added SRAM buffer. In doing so, when a rank is undergoing refresh, memory requests can still be serviced rather than being blocked. At the core of ROP is a probabilistic prefetch model that determines which cache lines are prefetched for a refresh, based on the access patterns appearing in an observational window ahead of the refresh. A Pattern Profiler collects statistics about memory traffic occurring before and after the starting time of each refresh operation during a training period, and outputs two conditional probabilities which are used to control subsequent prefetch decisions. A Prefetcher maintains a prediction table that helps to ascertain access patterns appearing around refresh operations. The prediction table is updated every time an access occurs to the to-be-next-refreshed rank during the observational window, and is consulted to decide which cache lines are prefetched. Extensive evaluation results with benchmarks from SPEC CPU2006 on a DDR4 memory demonstrate that with ROP, memory performance can be improved by up to 9.2% (3.3% on average) for single-core simulations, while reducing overall memory energy by up to 6.7% (3.6% on average), relative to an auto-refresh baseline memory. Moreover, it increases the Weighted Speedup by up to 2.22X (1.32X on average) for 4-core multi-programmed simulations, while reducing energy by up to 48.8% (24.4% on average).","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128408657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Hierarchical Polyhedral Compilation","authors":"B. Pradelle, Benoît Meister, M. Baskaran, A. Konstantinidis, Thomas Henretty, R. Lethin","doi":"10.1109/ICPP.2016.56","DOIUrl":"https://doi.org/10.1109/ICPP.2016.56","url":null,"abstract":"Computers across the board, from embedded to future exascale machines, are consistently designed with deeper memory hierarchies. While this opens up exciting opportunities for improving software performance and energy efficiency, it also makes it increasingly difficult to exploit the hardware efficiently. Advanced compilation techniques are a possible solution to this difficult problem and, among them, polyhedral compilation technology provides a pathway for performing advanced automatic parallelization and code transformations. However, the polyhedral model is also known for its poor scalability with respect to the number of dimensions in the polyhedra used to represent programs. Although current compilers can cope with this limitation when targeting shallow hierarchies, polyhedral optimizations often become intractable as soon as deeper hardware hierarchies are considered. We address this problem by introducing two new operators for polyhedral compilers: focalisation and defocalisation. When applied in the compilation flow, the new operators reduce the dimensionality of polyhedra, which drastically simplifies the mathematical problems solved during compilation. We prove that the presented operators preserve the original program semantics, allowing them to be safely used in compilers. We implemented the operators in a production compiler, which drastically improved its performance and scalability when targeting deep hierarchies.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131069850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Criticality-Aware Partitioning for Multicore Mixed-Criticality Systems","authors":"Jianjun Han, Xin Tao, Dakai Zhu, Hakan Aydin","doi":"10.1109/ICPP.2016.33","DOIUrl":"https://doi.org/10.1109/ICPP.2016.33","url":null,"abstract":"Scheduling for mixed-criticality (MC) systems, where multiple activities have different certification requirements and thus different criticality on a shared hardware platform, has recently become an important research focus. In this work, considering that multicore processors have emerged as the de-facto platform for modern embedded systems, we propose a novel and efficient criticality-aware task partitioning algorithm (CA-TPA) for a set of periodic MC tasks running on multicore systems. We employ the state-of-the-art EDF-VD scheduler on each core. Our work is based on the observation that the utilizations of MC tasks at different criticality levels can have quite large variations; hence, when a task is allocated, its utilization contribution on different processors may vary by large margins, and this can significantly affect the schedulability of tasks. During partitioning, CA-TPA sorts the tasks according to their utilization contributions on individual processors. Several heuristics are investigated to balance the workload on processors with the objective of improving the schedulability of tasks under CA-TPA. The simulation results show that our proposed CA-TPA scheme is effective, giving much higher schedulability ratios when compared to classical partitioning schemes.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123542310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPI Overlap: Benchmark and Analysis","authors":"Alexandre Denis, François Trahay","doi":"10.1109/ICPP.2016.37","DOIUrl":"https://doi.org/10.1109/ICPP.2016.37","url":null,"abstract":"In HPC applications, one of the major overheads relative to sequential code is communication cost. Application programmers often amortize this cost by overlapping communication with computation. To do so, they post a non-blocking MPI request, perform computation, and wait for communication completion, assuming the MPI communication will progress in the background. In this paper, we propose to measure what really happens when trying to overlap non-blocking point-to-point communication with computation. We explain how background progression works, describe relevant test cases, identify challenges for a benchmark, and then propose a benchmark suite to measure how much overlap happens in various cases. We present overlap benchmark results on a wide panel of MPI libraries and hardware platforms. Finally, we classify, analyze, and explain the results using low-level traces to reveal the internal behavior of the MPI library.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122794222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Wireless Power Transfer System to Balance the State of Charge of Electric Vehicles","authors":"Ankur Sarker, Chenxi Qiu, Haiying Shen, A. Gil, J. Taiber, M. Chowdhury, Jim Martin, Mac Devine, A. J. Rindos","doi":"10.1109/ICPP.2016.44","DOIUrl":"https://doi.org/10.1109/ICPP.2016.44","url":null,"abstract":"As an alternative form of road transportation, electric vehicles (EVs) can help reduce fossil-fuel consumption. However, the usage of EVs is constrained by limited battery capacity. Wireless Power Transfer (WPT) can increase the driving range of EVs by charging EVs in motion as they drive through a wireless charging lane embedded in a road. The amount of power that a charging lane can supply at a time is limited. The problem here is: when a large number of EVs pass a charging lane, how can the power be efficiently distributed among different penetration levels of EVs? No previous research has been devoted to tackling this challenge. To handle it, we propose a system to Balance the State of Charge (called BSoC) among the EVs. It consists of three components: i) a fog-based power distribution architecture, ii) a power scheduling model, and iii) an efficient vehicle-to-fog communication protocol. The fog computing center collects information from EVs and schedules the power distribution. We use fog, which is closer to vehicles than the cloud, in order to reduce communication latency. The power scheduling model schedules the power allocated to each EV. In order to avoid network congestion between EVs and the fog, we let vehicles choose their own communication channel to communicate with local controllers. Finally, we evaluate our system using extensive simulation studies in Network Simulator-3, MATLAB, and Simulation of Urban MObility tools, and the experimental results confirm the efficiency of our system.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133248891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Two-Dimensional Unstructured Anisotropic Delaunay Mesh Generation of Complex Domains for Aerospace Applications","authors":"Juliette Pardue, Andrey N. Chernikov","doi":"10.1109/ICPP.2016.76","DOIUrl":"https://doi.org/10.1109/ICPP.2016.76","url":null,"abstract":"In this paper, we present a bottom-up approach to parallel anisotropic mesh generation by building a mesh generator from first principles. Applications focusing on high-lift design or dynamic stall, as well as numerical methods and modeling test cases, still focus on two dimensions. Our push-button parallel mesh generation approach can generate high-fidelity unstructured meshes with anisotropic boundary layers for use in the computational fluid dynamics field. The anisotropy requirement adds a level of complexity to a parallel meshing algorithm by making computation depend on the local alignment of elements, which in turn is dictated by the geometric boundaries and the density functions. Our experimental results show 70% parallel efficiency over the fastest sequential isotropic mesh generator on 256 distributed memory nodes.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131406071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing I/O Interference in a Shared Burst Buffer System","authors":"Sagar Thapaliya, P. Bangalore, J. Lofstead, K. Mohror, A. Moody","doi":"10.1109/ICPP.2016.54","DOIUrl":"https://doi.org/10.1109/ICPP.2016.54","url":null,"abstract":"In this work, we investigate the problem of inter-application interference in a shared Burst Buffer (BB) system. A BB is a new storage technology for HPC architectures that acts as an intermediate layer between performance-hungry HPC applications and the slow parallel file system. While the BB is meant to alleviate the problem of slow I/O in HPC systems, it is itself prone to performance degradation under interference. We observe that the magnitude of interference effects can reach a level that matters to the HPC system and the jobs that run on it. We investigate I/O scheduling techniques as a mechanism to mitigate BB I/O interference. With our results, we show that scheduling techniques tuned to BBs can control interference and significant performance benefits can be achieved.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123194990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Parallel Algorithms for the Tucker Decomposition of Sparse Tensors","authors":"O. Kaya, B. Uçar","doi":"10.1109/ICPP.2016.19","DOIUrl":"https://doi.org/10.1109/ICPP.2016.19","url":null,"abstract":"We investigate an efficient parallelization of a class of algorithms for the well-known Tucker decomposition of general N-dimensional sparse tensors. The targeted algorithms are iterative and use the alternating least squares method. At each iteration, for each dimension of an N-dimensional input tensor, the following operations are performed: (i) the tensor is multiplied with (N - 1) matrices (TTMc step), (ii) the product is then converted to a matrix, and (iii) a few leading left singular vectors of the resulting matrix are computed (TRSVD step) to update one of the matrices for the next TTMc step. We propose an efficient parallelization of these algorithms for current parallel platforms with multicore nodes. We discuss a set of preprocessing steps that takes all computational decisions out of the main iteration of the algorithm and provides an intuitive shared-memory parallelism for the TTMc and TRSVD steps. We propose a coarse-grain and a fine-grain parallel algorithm in a distributed memory environment, investigate data dependencies, and identify efficient communication schemes. We demonstrate how the computation of singular vectors in the TRSVD step can be carried out efficiently following the TTMc step. Finally, we develop a hybrid MPI-OpenMP implementation of the overall algorithm and report scalability results on up to 4096 cores on 256 nodes of an IBM BlueGene/Q supercomputer.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123619515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}