{"title":"A Novel Design of Adaptive and Hierarchical Convolutional Neural Networks using Partial Reconfiguration on FPGA","authors":"Mohammad Farhadi, Mehdi Ghasemi, Yezhou Yang","doi":"10.1109/HPEC.2019.8916237","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916237","url":null,"abstract":"Nowadays most research in visual recognition using Convolutional Neural Networks (CNNs) follows the “deeper model with deeper confidence” belief to gain a higher recognition accuracy. At the same time, deeper model brings heavier computation. On the other hand, for a large chunk of recognition challenges, a system can classify images correctly using simple models or so-called shallow networks. Moreover, the implementation of CNNs faces with the size, weight, and energy constraints on the embedded devices. In this paper, we implement the adaptive switching between shallow and deep networks to reach the highest throughput on a resource-constrained MPSoC with CPU and FPGA. To this end, we develop and present a novel architecture for the CNNs where a gate makes the decision whether using the deeper model is beneficial or not. Due to resource limitation on FPGA, the idea of partial reconfiguration has been used to accommodate deep CNNs on the FPGA resources. We report experimental results on CIFAR-10, CIFAR-100, and SVHN datasets to validate our approach. Using confidence metric as the decision making factor, only 69.8%, 71.8%, and 43.8% of the computation in the deepest network is done for CIFAR10, CIFAR-100, and SVHN while it can maintain the desired accuracy with the throughput of around 400 images per second for SVHN dataset. https://github.com/mfarhadi/AHCNN.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121533225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IdPrism: Rapid Analysis of Forensic DNA Samples Using MPS SNP Profiles","authors":"D. Ricke, James Watkins, Philip Fremont-Smith, Adam Michaleas","doi":"10.1109/HPEC.2019.8916521","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916521","url":null,"abstract":"Massively parallel sequencing (MPS) of large single nucleotide polymorphism (SNP) panels enables identification, analysis of complex DNA mixture samples, and extended kinship predictions. Computational challenges related to SNP allele calling, probability of random man not excluded calculations, and both reference and complex mixture sample comparisons to tens of millions of reference profiles were encountered and resolved when scaling up from thousands to tens of thousands of SNP loci. A MPS SNP analysis pipeline is described for rapid analysis of forensic deoxyribonucleic acid (DNA) samples for thousands to tens of thousands of SNP loci against tens of millions of reference profiles. This pipeline is part of the MIT Lincoln Laboratory (MITLL) IdPrism advanced DNA forensic system.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127927347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Singularity for Machine Learning Applications - Analysis of Performance Impact","authors":"B. R. Jordan, David Barrett, David Burke, Patrick Jardin, Amelia Littrell, P. Monticciolo, Michael Newey, J. Piou, Kara Warner","doi":"10.1109/HPEC.2019.8916443","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916443","url":null,"abstract":"Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducible results. The use of containers to mitigate these issues is becoming a common practice. Singularity is a container technology which targets the unique issues present in High Performance Computing (HPC) Centers. This paper characterizes the impact of using Singularity for both Training and Inference on deep learning applications.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115552403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems","authors":"Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader","doi":"10.1109/HPEC.2019.8916307","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916307","url":null,"abstract":"Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph, and increment the common neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all pairs common neighbor counts.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126588492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ECG Feature Processing Performance Acceleration on SLURM Compute Systems","authors":"Michael Nolan, Mark Hernandez, Philip Fremont-Smith, A. Swiston, K. Claypool","doi":"10.1109/HPEC.2019.8916397","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916397","url":null,"abstract":"Electrocardiogram (ECG) signal features (e.g. Heart rate, intrapeak interval times) are data commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but are often developed for serialized data processing which scale poorly for large datasets. To address this issue, we’ve developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship to the number of processors used, while total computation times accounting for deployment and data aggregation impose diminishing returns of time against processor count. A maximum mean reduction in overall file processing time of 99% is shown.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124857800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things","authors":"M. Ilić, Rupamathi Jaddivada","doi":"10.1109/HPEC.2019.8916560","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916560","url":null,"abstract":"With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems by employing high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have adaptive platform design facilitating zooming-in and out of the models to emulate time-evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions, that can embrace physical dynamics and computations in a unified manner, by taking advantage of the inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming-in and out of the models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new service model of cloud computing called DyMonDS-as-a-Service (DyMaas), for use by operators at various spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131278892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Stochastic Block Partitioning via Sampling","authors":"Frank Wanye, Vitaliy Gleyzer, Wu-chun Feng","doi":"10.1109/HPEC.2019.8916542","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916542","url":null,"abstract":"Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between $2.18 times$ and $7.26 times$ for graph sizes between 5,000 and 50,000 vertices without a significant loss in the accuracy of community detection.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130442985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System","authors":"Wenjia Zheng, Yun Song, Zihao Guo, Yongcheng Cui, Suwen Gu, Ying Mao, Long Cheng","doi":"10.1109/HPEC.2019.8916403","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916403","url":null,"abstract":"The neural-network based deep learning is the key technology that enables many powerful applications, which include self-driving vehicles, computer vision, and natural language processing. Although various algorithms focus on different directions, generally, they mainly employ an iteration by iteration training and evaluating the process. Each iteration aims to find a parameter set, which minimizes a loss function defined by the learning model. When completing the training process, the global minimum is achieved with a set of optimized parameters. At this stage, deep learning applications can be shipped with a trained model to provide services. While deep learning applications are reshaping our daily life, obtaining a good learning model is an expensive task. Training deep learning models is, usually, time-consuming and requires lots of resources, e.g. CPU and GPU. In a multi-tenancy system, however, limited resources are shared by multiple clients that lead to severe resource contention. Therefore, a carefully designed resource management scheme is required to improve the overall performance. In this project, we propose a target based scheduling scheme named TRADL. In TRADL, developers have options to specify a two-tier target. If the accuracy of the model reaches a target, it can be delivered to clients while the training is still going on to continue improving the quality. The experiments show that TRADL is able to significantly reduce the time cost, as much as 48.2%, for reaching the target.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115043470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets","authors":"Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, S. Son","doi":"10.1109/HPEC.2019.8916286","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916286","url":null,"abstract":"As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, an effective data reduction technique is becoming critical to mitigating time and space burden. Lossy compression techniques, which have been widely used in image and video compression, hold promise to fulfill such data reduction need. However, they are seldom adopted in HPC datasets because of their difficulty in quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize remaining non-dominant coefficients using a binning mechanism to minimize information loss expressed in a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on our evaluated datasets. Moreover, our knee detection algorithm improves the distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133765375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}