IEEE Transactions on Multi-Scale Computing Systems: Latest Articles

DPFEE: A High Performance Scalable Pre-Processor for Network Security Systems
Vinayaka Jyothi; Sateesh K. Addepalli; Ramesh Karri
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 1, pp. 55-68. Published 2017-10-23. DOI: 10.1109/TMSCS.2017.2765324
Abstract: Network Intrusion Detection Systems (NIDS) and anti-Denial-of-Service (DoS) systems employ Deep Packet Inspection (DPI), which provides visibility into packet payloads to detect network attacks. All DPI engines assume a pre-processing step that extracts the various protocol-specific fields. However, application-layer (L7) field extraction is computationally expensive. We propose the Deep Packet Field Extraction Engine (DPFEE), a content-aware, grammar-based, Layer 7 programmable field extraction engine for text-based protocols that offloads application-layer field extraction to hardware. Our prototype DPFEE implementation for the Session Initiation Protocol (SIP) and HTTP on a single FPGA achieves a bandwidth of 408.5 Gbps, which can be scaled beyond 500 Gbps. A single DPFEE exhibits a speedup of 24x-89x over widely used preprocessors; even against 12 parallel instances of a preprocessor, it demonstrates a speedup of 4.7x-7.4x. Compared with 200 parallel streams of a GPU-accelerated preprocessor, a single DPFEE achieves 3.14x higher bandwidth, 1020x lower latency, and 106x lower power consumption.
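DPFEE itself is a hardware grammar engine, but the pre-processing step it accelerates can be illustrated in software. The sketch below is a minimal Python extractor for a few SIP/HTTP-style header fields; the field list, message, and parsing rules are simplified assumptions for illustration, not DPFEE's actual grammar.

```python
# Minimal illustration of L7 field extraction for text-based protocols.
# The field names and the sample message are simplified examples.

def extract_fields(message: str, fields=("Via", "From", "To", "Call-ID")) -> dict:
    """Extract selected header fields from a SIP/HTTP-style text message."""
    extracted = {}
    for line in message.split("\r\n")[1:]:   # skip the request line
        if not line:                          # blank line ends the headers
            break
        name, _, value = line.partition(":")  # split at the first colon only
        if name.strip() in fields:
            extracted[name.strip()] = value.strip()
    return extracted

sip_msg = ("INVITE sip:bob@example.com SIP/2.0\r\n"
           "Via: SIP/2.0/UDP host.example.com\r\n"
           "From: <sip:alice@example.com>\r\n"
           "To: <sip:bob@example.com>\r\n"
           "Call-ID: 843817637684230@998sdasdh09\r\n"
           "\r\n")

print(extract_fields(sip_msg)["Call-ID"])   # 843817637684230@998sdasdh09
```

In software this parse touches every byte per field lookup; DPFEE's point is that doing it in a hardware grammar engine removes this cost from the DPI critical path.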
Citations: 9
NanoStreams: A Microserver Architecture for Real-Time Analytics on Fast Data Streams
U. I. Minhas; M. Russell; S. Kaloutsakis; P. Barber; R. Woods; G. Georgakoudis; C. Gillan; D. S. Nikolopoulos; A. Bilas
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 396-409. Published 2017-10-18. DOI: 10.1109/TMSCS.2017.2764087
Abstract: Ever-increasing power consumption has created great interest in energy-efficient microserver architectures, but they lack the computational, networking, and storage power necessary to cope with real-time data analytics. We propose NanoStreams, an integrated architecture comprising an ARM-based microserver coupled, via a novel low-latency network interface called Nanowire, to an Analytics-on-Chip architecture implemented on Field Programmable Gate Array (FPGA) technology. The architecture combines ARM cores for low-latency transactional processing with programmable, energy-efficient Nanocore processors for high-throughput streaming analytics. The paper describes the complete system architecture, hardware-level detail, compiler, network protocol, and programming environment. We present experiments from the financial services sector, comparing a state-of-the-art server based on Intel Sandy Bridge processors, an ARM-based Calxeda ECS-1000 microserver, and an ODROID XU3 node against the NanoStreams microserver architecture on an industrial workload. For the end-to-end workload, the NanoStreams microserver achieves energy savings of up to 10.7x, 5.87x, and 5x compared to the Intel server, the Calxeda microserver, and the ODROID node, respectively.
Citations: 6
Parameter Exploration to Improve Performance of Memristor-Based Neuromorphic Architectures
Mahyar Shahsavari; Pierre Boulet
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 833-846. Published 2017-10-09. DOI: 10.1109/TMSCS.2017.2761231
Abstract: The brain-inspired spiking neural network neuromorphic architecture offers a promising solution for a wide set of cognitive computation tasks at very low power consumption. Motivated by the practical feasibility of hardware implementation, we present a memristor-based model of hardware spiking neural networks, which we simulate with the Neural Network Scalable Spiking Simulator (N2S3), our open-source neuromorphic architecture simulator. Although spiking neural networks are widely used in the computational neuroscience and neuromorphic computing communities, there is still a need for research on methods to choose optimal parameters for better recognition efficiency. With the help of our simulator, we analyze and evaluate the impact of parameters such as the number of neurons, the STDP window, the neuron threshold, the distribution of input spikes, and the memristor model parameters on the MNIST handwritten-digit recognition problem. We show that a careful choice of a few parameters (number of neurons, kind of synapse, STDP window, and neuron threshold) can significantly improve the recognition rate on this benchmark (around 15 points of improvement for the number of neurons, a few points for the others), with a variability of four to five points of recognition rate due to the random initialization of the synaptic weights.
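The STDP window is one of the swept parameters: it determines how much a synaptic weight changes as a function of the time difference between pre- and post-synaptic spikes. A minimal sketch of the standard exponential STDP rule follows; the amplitudes and time constants are illustrative placeholders, not the values explored in the paper.

```python
import math

# Exponential STDP window: potentiation when the pre-synaptic spike precedes
# the post-synaptic one (dt > 0), depression otherwise. Amplitudes and time
# constants below are illustrative placeholders.
A_PLUS, A_MINUS = 0.05, 0.04        # learning-rate amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0    # window time constants (ms)

def stdp_delta_w(dt_ms: float) -> float:
    """Weight change for a spike pair separated by dt_ms = t_post - t_pre."""
    if dt_ms > 0:    # pre before post: strengthen the synapse
        return A_PLUS * math.exp(-dt_ms / TAU_PLUS)
    else:            # post before pre: weaken it
        return -A_MINUS * math.exp(dt_ms / TAU_MINUS)

# A narrower window (smaller tau) makes learning selective to tighter
# spike-time correlations, which is the kind of trade-off a parameter
# sweep over the STDP window explores.
print(stdp_delta_w(10.0) > 0, stdp_delta_w(-10.0) < 0)
```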
Citations: 7
Dynamic Trade-off among Fault Tolerance, Energy Consumption, and Performance on a Multiple-Issue VLIW Processor
Anderson L. Sartor; Pedro H. E. Becker; Joost Hoozemans; Stephan Wong; Antonio C. S. Beck
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 327-339. Published 2017-10-06. DOI: 10.1109/TMSCS.2017.2760299
Abstract: In the design of modern-day processors, energy consumption and fault tolerance have gained significant importance next to performance, driven by battery constraints, thermal design limits, and a higher susceptibility to errors as transistor feature sizes decrease. Achieving the ideal balance among these axes is challenging because of their conflicting nature (e.g., fault-tolerance techniques usually increase execution time or energy consumption), which is why current processor designs target at most two of them. We propose a new VLIW-based processor design capable of adapting the execution of the application at run-time, in a totally transparent fashion, considering performance, fault tolerance, and energy consumption together, where the weight (priority) of each axis can be defined a priori. This is achieved by a novel decision module that dynamically controls the application's ILP to increase the opportunities for replicating instructions or applying power gating. For an energy-oriented configuration, it is possible, on average, to reduce energy consumption by 37.2 percent with an overhead of only 8.2 percent in performance, while maintaining a low failure rate, compared to a fault-tolerant design.
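The a-priori weighting of the three axes can be pictured as a weighted cost minimization over candidate execution configurations. The toy sketch below shows only that selection step; the configuration names and their normalized cost figures are invented for illustration, and the actual decision module works on ILP control, instruction replication, and power gating rather than a static table.

```python
# Toy weighted selection among execution configurations. Each configuration
# is scored on normalized lower-is-better cost axes; the weights encode the
# a-priori priority of performance, fault tolerance, and energy. All numbers
# here are invented for illustration.
CONFIGS = {
    # name: (perf_cost, fault_cost, energy_cost), each in [0, 1]
    "full_replication":    (0.30, 0.05, 0.80),
    "partial_replication": (0.15, 0.25, 0.55),
    "power_gating_only":   (0.10, 0.60, 0.20),
}

def pick_config(w_perf: float, w_fault: float, w_energy: float) -> str:
    """Return the configuration with the lowest weighted cost."""
    def cost(axes):
        p, f, e = axes
        return w_perf * p + w_fault * f + w_energy * e
    return min(CONFIGS, key=lambda name: cost(CONFIGS[name]))

# An energy-oriented priority favors power gating; a reliability-oriented
# priority favors full replication.
print(pick_config(0.2, 0.1, 0.7))   # power_gating_only
print(pick_config(0.1, 0.8, 0.1))   # full_replication
```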
Citations: 11
Inter-Cluster Thread-to-Core Mapping and DVFS on Heterogeneous Multi-Cores
Basireddy Karunakar Reddy; Amit Kumar Singh; Dwaipayan Biswas; Geoff V. Merrett; Bashir M. Al-Hashimi
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 369-382. Published 2017-09-26. DOI: 10.1109/TMSCS.2017.2755619
Abstract: Heterogeneous multi-core platforms that contain different types of cores, organized as clusters, are emerging, e.g., ARM's big.LITTLE architecture. These platforms often need to run multiple concurrent applications with different performance requirements, which generates varying and mixed workloads (e.g., compute- and memory-intensive) due to resource sharing. Run-time management is required to adapt to such performance requirements and workload variability and to achieve energy efficiency, and it becomes more challenging when the applications are multi-threaded and the heterogeneity must be exploited. Existing run-time management approaches do not efficiently exploit cores situated in different clusters simultaneously (referred to as inter-cluster exploitation) together with the DVFS potential of the cores, which is the aim of this paper; such exploitation can satisfy performance requirements while achieving energy savings at the same time. We therefore propose a run-time management approach that first selects a thread-to-core mapping based on the performance requirements and resource availability, then applies online adaptation by adjusting the voltage-frequency (V-f) levels to optimize energy without trading off application performance. Thread-to-core mapping uses offline profiled results, which contain performance and energy characteristics of applications executed on the heterogeneous platform using different types of cores in various combinations. For an application, the mapping defines the number and type of cores used, situated in different clusters. The online adaptation process classifies the inherent workload characteristics of concurrently executing applications, incurring a lower overhead than existing learning-based approaches, as demonstrated in this paper. Workload classification is performed using the metric Memory Reads Per Instruction (MRPI). The adaptation process proactively selects an appropriate V-f pair for a predicted workload, then monitors the workload prediction error and performance loss, quantified by instructions per second (IPS), and adjusts the chosen V-f level to compensate. We validate the proposed approach on a hardware platform, the Odroid-XU3, with various combinations of multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an average improvement in energy efficiency of up to 33 percent compared to existing approaches while meeting the performance requirements.
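The core of the online adaptation is the mapping from a predicted MRPI value to a V-f pair: memory-bound phases (high MRPI) tolerate a lower frequency with little IPS loss, while compute-bound phases need a higher one. A minimal sketch under assumed MRPI thresholds and frequency levels; the real approach additionally monitors prediction error and IPS loss and compensates.

```python
# Workload classification by Memory Reads Per Instruction (MRPI) and a
# threshold table mapping each class to a frequency level. The thresholds
# and frequency levels are illustrative assumptions, not the paper's values.
FREQ_LEVELS_MHZ = [600, 1000, 1400, 2000]

def mrpi(memory_reads: int, instructions: int) -> float:
    """MRPI = memory reads retired per instruction over a monitoring window."""
    return memory_reads / instructions

def select_frequency(mrpi_value: float) -> int:
    """Higher MRPI (memory-bound) -> lower frequency saves energy cheaply."""
    if mrpi_value >= 0.10:        # heavily memory-bound
        return FREQ_LEVELS_MHZ[0]
    elif mrpi_value >= 0.05:
        return FREQ_LEVELS_MHZ[1]
    elif mrpi_value >= 0.01:      # mildly memory-bound
        return FREQ_LEVELS_MHZ[2]
    return FREQ_LEVELS_MHZ[3]     # compute-bound: run at full speed

# A window with 2 memory reads per 100 instructions is mildly memory-bound.
print(select_frequency(mrpi(2, 100)))   # 1400
```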
Citations: 54
Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors
Zhe Lin; Sharad Sinha; Hao Liang; Liang Feng; Wei Zhang
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 2, pp. 152-162. Published 2017-09-21. DOI: 10.1109/TMSCS.2017.2754378
Abstract: Modern multicore systems are migrating from homogeneous to heterogeneous designs with accelerator-based computing in order to overcome the performance and power walls. In this trend, FPGA-based accelerators are becoming increasingly attractive due to their excellent flexibility and low design cost. In this paper, we propose architectural support for efficient interfacing between FPGA-based multi-accelerators and chip multi-processors (CMPs) connected through a network-on-chip (NoC). Distributed packet receivers and hierarchical packet senders are designed to maintain scalability and reduce the critical-path delay under heavy task loads. A dedicated accelerator-chaining mechanism is also proposed to facilitate intra-FPGA data reuse among accelerators, circumventing the prohibitive communication overhead between the FPGA and the processors. To evaluate the proposed architecture, a complete system emulation with programmability support is performed using FPGA prototyping. Experimental results demonstrate that the proposed architecture is high-performance, lightweight, and scalable.
Citations: 4
Modelling Program's Performance with Gaussian Mixtures for Parametric Statistics
Julien Worms; Sid Touati
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 383-395. Published 2017-09-19. DOI: 10.1109/TMSCS.2017.2754251
Abstract: This article continues our previous research effort on statistical analysis and comparison of program performance [1] in the presence of performance variability. In the previous study, we proposed a formal statistical methodology to analyze program speedups based on mean and median performance metrics (execution time, energy consumption, etc.). However, mean and median observed performances do not always reflect the user's perception of those performances, especially when they are particularly unstable. In the current study, we propose additional precise performance metrics based on performance modelling using Gaussian mixtures. These additional statistical metrics for analyzing and comparing program performance give the user more precise decision tools for selecting the best code versions, not necessarily based on mean or median numbers. We also provide a new metric to estimate performance variability based on the Gaussian mixture model. Our statistical methods are implemented in R and distributed as open-source code.
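The authors' implementation is in R; purely to illustrate the modelling idea, the sketch below fits a two-component Gaussian mixture to a set of execution times with a few EM iterations in plain Python. The synthetic bimodal timings and the fixed component count are assumptions for the example: this is exactly the kind of distribution where a single mean or median hides the two behaviours.

```python
import math, random

# Fit a two-component 1D Gaussian mixture to execution times with EM.
def em_gmm_2(xs, iters=50):
    """Return (weights, means, variances) of a 2-component mixture."""
    mu = [min(xs), max(xs)]     # crude initialization at the extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var

random.seed(0)
# Synthetic bimodal timings: ~70% fast runs near 10 s, ~30% slow near 14 s.
times = [random.gauss(10.0, 0.5) for _ in range(70)] + \
        [random.gauss(14.0, 0.5) for _ in range(30)]
w, mu, var = em_gmm_2(times)
print(sorted(round(m, 1) for m in mu))   # close to [10.0, 14.0]
```

The fitted component means and weights summarize "how fast, and how often" far better than one central number when runs cluster into distinct modes.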
Citations: 1
PinMe: Tracking a Smartphone User around the World
Arsalan Mosenia; Xiaoliang Dai; Prateek Mittal; Niraj K. Jha
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 420-435. Published 2017-09-15. DOI: 10.1109/TMSCS.2017.2751462
Abstract: With the pervasive use of smartphones that sense, collect, and process valuable information about the environment, ensuring location privacy has become one of the most important concerns of the modern age. Recent research has discussed the feasibility of processing sensory data gathered by a smartphone to locate the phone's owner, even when the user does not intend to share location information, e.g., when the Global Positioning System (GPS) on the device is turned off. Previous efforts rely on at least one of two fundamental requirements that impose significant limitations on the adversary: (i) the attacker must accurately know either the user's initial location or the set of routes through which the user travels, and/or (ii) the attacker must measure a set of features, e.g., device acceleration, for different potential routes in advance and construct a training dataset. In this paper, we demonstrate that neither requirement is essential for compromising the user's location privacy. We describe PinMe, a novel user-location mechanism that exploits non-sensory and sensory data stored on the smartphone, e.g., the environment's air pressure and the device's timezone, along with publicly available auxiliary information, e.g., elevation maps, to estimate the user's location when all location services, e.g., GPS, are turned off. Unlike previously proposed attacks, PinMe requires neither prior knowledge about the user nor a training dataset for specific routes. We demonstrate that PinMe can accurately estimate the user's location during four activities (walking, traveling on a train, driving, and traveling on a plane). We also suggest several defenses against the proposed attack.
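One of the signals exploited here is the barometer: air pressure maps to altitude, and altitude can be matched against public elevation maps to narrow down the location. The sketch below applies the standard international barometric formula; the ISA constants are standard, while the pressure reading and the city comparison are made-up examples, not data from the paper.

```python
# Convert a smartphone barometer reading to an altitude estimate using the
# international barometric formula. Matched against elevation maps, the
# altitude constrains the set of plausible locations.
P0_HPA = 1013.25   # standard sea-level pressure (ISA)

def pressure_to_altitude_m(p_hpa: float) -> float:
    """Altitude (m) from pressure (hPa), valid in the troposphere."""
    return 44330.0 * (1.0 - (p_hpa / P0_HPA) ** (1.0 / 5.255))

# Example: a reading near 899 hPa corresponds to roughly 1,000 m elevation,
# consistent with a high-altitude city rather than a coastal one.
print(round(pressure_to_altitude_m(899.0)))
```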
Citations: 34
Hadoop Workloads Characterization for Performance and Energy Efficiency Optimizations on Microservers
Maria Malik; Katayoun Neshatpour; Setareh Rafatirad; Houman Homayoun
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 3, pp. 355-368. Published 2017-09-05. DOI: 10.1109/TMSCS.2017.2749228
Abstract: Traditional low-power embedded processors such as Atom and ARM are entering the high-performance server market. At the same time, big-data analytics applications are emerging and dramatically changing the landscape of data-center workloads. Emerging big-data applications require a significant amount of server computational power, but the rapid growth of data makes it challenging to process it efficiently on current high-performance server architectures. Furthermore, physical design constraints such as power and density have become the dominant limiting factors for scaling out servers. Numerous big-data applications rely on the Hadoop MapReduce framework to analyze large-scale datasets. Since Hadoop configuration parameters as well as system parameters directly affect MapReduce job performance and energy efficiency, joint tuning of application-, system-, and architecture-level parameters is vital to maximize energy efficiency for Hadoop-based applications. In this work, through methodical performance and power measurements, we demonstrate how the interplay among various Hadoop configuration parameters, as well as system- and architecture-level parameters, affects not only performance but also energy efficiency across various big-data applications. Our results identify trends to guide scheduling decisions and key insights to help improve the performance, power, and energy efficiency of Hadoop MapReduce applications on microservers.
Citations: 10
Performance Analysis and Optimization of Automatic Speech Recognition
Hamid Tabani; Jose-Maria Arnau; Jordi Tubella; Antonio González
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 847-860. Published 2017-08-14. DOI: 10.1109/TMSCS.2017.2739158
Abstract: Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main-memory accesses are the main performance-limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we refactor the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available in most modern CPUs to accelerate GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once into the on-chip caches and reused across multiple frames, significantly reducing memory-bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide a 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor, our techniques improve performance by 1.85x while providing 59 percent energy savings, without any loss in the accuracy of the ASR system.
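The third optimization, reusing each Gaussian's means and variances across a batch of frames, comes down to loop order: fetch one Gaussian's parameters once and score every frame against it before moving on. The tiny diagonal-covariance model below is only an illustration of that loop structure with invented parameters; it is not Pocketsphinx code.

```python
import math

# Score a batch of feature frames against diagonal-covariance Gaussians.
# Looping over Gaussians outermost means each mean/variance vector is
# fetched once and reused for the whole batch of frames, mirroring the
# bandwidth optimization described above. All parameters are invented.

def score_batch(frames, means, variances):
    """Return scores[g][f] = log N(frames[f]; means[g], diag(variances[g]))."""
    scores = [[0.0] * len(frames) for _ in means]
    for g, (mu, var) in enumerate(zip(means, variances)):
        # Frame-independent part of the log-density: computed once per Gaussian.
        log_norm = -0.5 * sum(math.log(2 * math.pi * v) for v in var)
        for f, x in enumerate(frames):
            maha = sum((xi - mi) ** 2 / vi
                       for xi, mi, vi in zip(x, mu, var))
            scores[g][f] = log_norm - 0.5 * maha
    return scores

frames = [[0.0, 0.1], [1.0, -0.2], [0.5, 0.0]]   # 3 frames, 2 features each
means = [[0.0, 0.0], [1.0, 0.0]]                  # 2 Gaussians
variances = [[1.0, 1.0], [1.0, 1.0]]
s = score_batch(frames, means, variances)
# Frame 0 is closest to Gaussian 0, frame 1 to Gaussian 1.
print(s[0][0] > s[1][0], s[1][1] > s[0][1])   # True True
```

With frames innermost, each (mu, var) pair stays hot in cache for the whole batch; the frame-major order would re-fetch every Gaussian's parameters once per frame.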
Citations: 11