{"title":"Computing structural controllability of linearly-coupled complex networks","authors":"R. Rajaei, A. Ramezani, B. Shafai","doi":"10.1109/HPEC.2017.8091064","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091064","url":null,"abstract":"Structural controllability, a generic structure-based property that determines the ability of a complex network to reach a desired configuration, is addressed in this work. Using a robust measure derived from robust control theory, this paper deals with the structural controllability of a type of weighted network of networks (NetoNets) involving linear couplings between its constituent networks and clusters. Unlike structural controllability degrees rooted in graph theory, this paper takes advantage of uncertain-system theory to define the notion of structural controllability in a straightforward and less computationally complex way. Moreover, the spectrum of required energy is discussed. Finally, results for the proposed measure of structural controllability of scale-free networks are given to show that the measure is an efficient and effective guarantee of full controllability of the NetoNet in the presence of cluster and network-dependency connections. The proposed measure is an optimal solution for structural energy-related control of the NetoNet, where the upper bound of the required energy is shown to be an efficient measure for structural controllability of this class of NetoNet. Arbitrary connectivity of low-degree vertices to their higher-degree counterparts in clusters results in effective controllability. In the same direction, consistent with seminal works on the structural controllability of complex networks that avoid highly connected nodes, the larger the cluster/network connectivity degree, the less full controllability of the NetoNet is guaranteed.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115887573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lossy compression on IoT big data by exploiting spatiotemporal correlation","authors":"Aekyeung Moon, Jaeyoung Kim, Jialing Zhang, S. Son","doi":"10.1109/HPEC.2017.8091030","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091030","url":null,"abstract":"As the volume of data generated by various deployed IoT devices increases, storing and processing IoT big data becomes a huge challenge. While compression, especially lossy compression, can drastically reduce data volume, finding an optimal balance between volume reduction and information loss is not an easy task, given that data collected by diverse sensors exhibit different characteristics. Motivated by this, we present a feasibility analysis of lossy compression on agricultural sensor data by comparing the fidelity of data reconstructed from various signal processing algorithms and from temporal difference encoding. Specifically, we evaluated five real-world sensor datasets from weather stations, one of the major IoT applications. Our experimental results indicate that the Discrete Cosine Transform (DCT) and the Fast Walsh-Hadamard Transform (FWHT) generate higher compression ratios than the others. In terms of information loss, however, Lossy Delta Encoding (LDE) significantly outperforms the others. We also observe that, as the compression factor increases, the error rates of all compression algorithms also increase. However, the impact of the introduced error is much more severe in DCT and FWHT, while LDE was able to maintain a relatively lower error rate than the other methods.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115938974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static graph challenge on GPU","authors":"M. Bisson, M. Fatica","doi":"10.1109/HPEC.2017.8091034","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091034","url":null,"abstract":"This paper presents the details of a CUDA implementation of the Subgraph Isomorphism Graph Challenge, a new effort aimed at driving progress in the graph analytics field. The challenge consists of two graph analytics: triangle counting and k-truss. We present our CUDA implementations of the graph triangle counting operation and of the k-truss subgraph decomposition. Both implementations share the same codebase, taking advantage of a set intersection operation implemented via bitmaps. The analytics are implemented in four kernels optimized for different types of graphs. At runtime, lightweight heuristics are used to select the kernel to run based on the specific graph taken as input.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128878235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quickly finding a truss in a haystack","authors":"Oded Green, James Fox, Euna Kim, F. Busato, N. Bombieri, Kartik Lakhotia, Shijie Zhou, Shreyas G. Singapura, Hanqing Zeng, R. Kannan, V. Prasanna, David A. Bader","doi":"10.1109/HPEC.2017.8091038","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091038","url":null,"abstract":"The k-truss of a graph is a subgraph such that each edge is tightly connected to the remaining elements in the k-truss. The k-truss of a graph can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and use static graph triangle counting in each iteration. In this work we present a novel algorithm for finding both the k-truss of a graph (for a given k) and the maximal k-truss, using a dynamic graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun static graph triangle counting after the removal of non-conforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by the removal. As our updates are local, we do only a fraction of the work compared to the other algorithms. 2) Our algorithm is extremely scalable and is able to detect deleted triangles concurrently, in contrast to past sequential approaches. While our algorithm is architecture independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X to 10000X faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"77 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121915968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and accurate Word2Vec implementations in GPU and shared-memory multicore architectures","authors":"Trevor M. Simonton, G. Alaghband","doi":"10.1109/HPEC.2017.8091076","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091076","url":null,"abstract":"Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven to be useful in a variety of machine learning tasks. In this work, we propose new methods to increase the speed of the Word2Vec skip-gram with hierarchical softmax architecture on multi-core shared-memory CPU systems and on modern NVIDIA GPUs with CUDA. We accomplish this on multi-core CPUs by batching training operations to increase thread locality and to reduce accesses to shared memory. We then propose new heterogeneous NVIDIA GPU CUDA implementations of both the skip-gram hierarchical softmax and negative sampling techniques that utilize shared memory, registers, and in-warp shuffle operations for maximum performance. Our GPU skip-gram with negative sampling approach produces higher-quality word vectors than previous GPU implementations, and our flexible skip-gram with hierarchical softmax implementation achieves a factor of 10 speedup over existing methods.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132188992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL for HPC with FPGAs: Case study in molecular electrostatics","authors":"Chen Yang, Jiayi Sheng, Rushi Patel, A. Sanaullah, Vipin Sachdeva, M. Herbordt","doi":"10.1109/HPEC.2017.8091078","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091078","url":null,"abstract":"FPGAs have emerged as a cost-effective accelerator alternative in clouds and clusters. Programmability remains a challenge, however, with OpenCL being generally recognized as a likely part of the solution. In this work we seek to advance the use of OpenCL for HPC on FPGAs in two ways. The first is by examining a core HPC application, Molecular Dynamics. The second is by examining a fundamental design pattern that we believe has not yet been described for OpenCL: passing data from a set of producer datapaths to a set of consumer datapaths, in particular where the producers generate data non-uniformly. We evaluate several designs: single-level versions in Verilog and in OpenCL, a two-level Verilog version with an optimized arbiter, and several two-level OpenCL versions with different arbitration and hand-shaking mechanisms, including one with an embedded Verilog module. For the Verilog designs, we find that FPGAs retain their high efficiency, with a 50× to 80× performance benefit over a single core. We also find that OpenCL may be competitive with HDLs for the straight-line versions of the code, but that for designs with more complex arbitration and hand-shaking, relative performance is substantially diminished.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132319988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating critical bits in arithmetic operations due to timing violations","authors":"Sungseob Whang, Tymani Rachford, Dimitra Papagiannopoulou, T. Moreshet, R. I. Bahar","doi":"10.1109/HPEC.2017.8091090","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091090","url":null,"abstract":"Various error models are being used in simulations of voltage-scaled arithmetic units to examine application-level tolerance of timing violations. The selection of an error model needs further consideration, as differences in error models drastically affect the performance of the application. In particular, floating point arithmetic units (FPUs) have architectural characteristics that shape their error behavior. We examine the architecture of FPUs and design a new error model, which we call Critical Bit. We run selected benchmark applications with the Critical Bit model and other widely used error injection models to demonstrate the differences.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124625485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid flash arrays for HPC storage systems: An alternative to burst buffers","authors":"T. Petersen, John Bent","doi":"10.1109/HPEC.2017.8091092","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091092","url":null,"abstract":"Cloud and high-performance computing storage systems comprise thousands of physical storage devices and use software that organizes them into multiple data tiers based on access frequency. The characteristics of these devices lend themselves well to such tiers, as devices have differing ratios of performance to capacity. For this reason, these systems have, for the past several years, incorporated a mix of flash devices and mechanical spinning hard disk drives. Although a single media type would be ideal, the economic reality is that a hybrid system must use flash for performance and disk for capacity. Within the high-performance computing community, flash has been used to create a new tier called burst buffers, which are typically software managed, user visible, wed to a particular file system, and buffer all IO traffic before subsequent migration to disk. In this paper, we propose an alternative architecture that is hardware managed, user transparent, and file system agnostic, and that buffers only small IO while allowing large sequential IO to access the disks directly. Our evaluation of this alternative architecture finds that it achieves results comparable to reported burst buffer numbers and improves on systems comprised solely of disks by several orders of magnitude, for a fraction of the cost.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116026109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU accelerated gigabit level BCH and LDPC concatenated coding system","authors":"Selcuk Keskin, T. Koçak","doi":"10.1109/HPEC.2017.8091021","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091021","url":null,"abstract":"Increasing data traffic and multimedia services in recent years have paved the way for the development of optical transmission methods for high-bandwidth communications systems. To meet very high throughput requirements, dedicated application-specific integrated circuit and field-programmable gate array solutions for low-density parity-check (LDPC) decoding have been proposed in recent years. Conversely, software solutions are less expensive, scalable, and flexible, and have a shorter development cycle. A natural solution to lower the error floor is to concatenate the LDPC code with an algebraic outer code to clean up the residual errors. In this paper, we present the design and parallel software implementation, on general-purpose graphics processing units, of a major computation algorithm for LDPC decoding as the inner code and a BCH decoding algorithm as the outer code, achieving excellent error-correcting performance. The experimental results show that the proposed GPU-based concatenated decoder achieves a maximum decoding throughput of 1.82 Gbps at 10 iterations with a low bit-error rate (BER).","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129302962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance low-energy implementation of cryptographic algorithms on a programmable SoC for IoT devices","authors":"Boyou Zhou, Manuel Egele, A. Joshi","doi":"10.1109/HPEC.2017.8091062","DOIUrl":"https://doi.org/10.1109/HPEC.2017.8091062","url":null,"abstract":"Due to severe power and timing constraints of the \"things\" in the Internet of Things (IoT), cryptography is expensive for these devices. Custom hardware provides a viable solution. However, implementations of cryptographic algorithms in these devices need to be upgraded frequently compared to the longevity of the \"things\" themselves. Therefore, there is a critical need for reconfigurable, low-power, and high-performance cryptography implementations for IoT devices. In this paper, we propose to use an FPGA as the reconfigurable substrate for cryptographic operations. We demonstrate our proposed approach on a Zedboard, which has two ARM cores and a Zynq FPGA. The implemented cryptographic algorithms include symmetric cryptography, asymmetric cryptography, and secure hash functions. We also integrate our cryptographic engines with the OpenSSL library to inherit the library's support for block cipher modes. Our approach shows that the FPGA-based reconfigurable cryptographic components consume between 1.8× and 4033× less energy and run between 1.6× and 2983× faster than the software implementation. At the same time, the FPGA implementation of cryptographic operations is more flexible than custom hardware implementations of cryptographic components.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132554262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}