{"title":"DPFEE: A High Performance Scalable Pre-Processor for Network Security Systems","authors":"Vinayaka Jyothi;Sateesh K. Addepalli;Ramesh Karri","doi":"10.1109/TMSCS.2017.2765324","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2765324","url":null,"abstract":"Network Intrusion Detection Systems (NIDS) and Anti-Denial-of-Service (DoS) systems employ Deep Packet Inspection (DPI), which provides visibility into packet payloads to detect network attacks. All DPI engines assume a pre-processing step that extracts the various protocol-specific fields. However, application layer (L7) field extraction is computationally expensive. We propose a novel Deep Packet Field Extraction Engine (DPFEE) that offloads application layer field extraction to hardware. DPFEE is a content-aware, grammar-based, programmable Layer 7 field extraction engine for text-based protocols. Our prototype DPFEE implementation for the Session Initiation Protocol (SIP) and HTTP on a single FPGA achieves a bandwidth of 408.5 Gbps, and this can be scaled beyond 500 Gbps. A single DPFEE exhibits a speedup of 24X-89X over widely used preprocessors. Even against 12 parallel instances of a preprocessor, a single DPFEE demonstrated a speedup of 4.7X-7.4X.
A single DPFEE achieved 3.14X higher bandwidth, 1020X lower latency, and 106X lower power consumption compared with 200 parallel streams of a GPU-accelerated preprocessor.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 1","pages":"55-68"},"PeriodicalIF":0.0,"publicationDate":"2017-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2765324","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68003399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NanoStreams: A Microserver Architecture for Real-Time Analytics on Fast Data Streams","authors":"U. I. Minhas;M. Russell;S. Kaloutsakis;P. Barber;R. Woods;G. Georgakoudis;C. Gillan;D. S. Nikolopoulos;A. Bilas","doi":"10.1109/TMSCS.2017.2764087","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2764087","url":null,"abstract":"Ever-increasing power consumption has created great interest in energy-efficient microserver architectures, but they lack the computational, networking, and storage power necessary to cope with real-time data analytics. We propose NanoStreams, an integrated architecture comprising an ARM-based microserver coupled, via a novel low-latency network interface, Nanowire, to an Analytics-on-Chip architecture implemented on Field Programmable Gate Array (FPGA) technology; the architecture comprises ARM cores for low-latency transactional processing, integrated with programmable, energy-efficient Nanocore processors for high-throughput streaming analytics. The paper outlines the complete system architecture, hardware-level detail, compiler, network protocol, and programming environment. We present experiments from the financial services sector, comparing a state-of-the-art server based on Intel Sandy Bridge processors, an ARM-based Calxeda ECS-1000 microserver, and an ODROID XU3 node with the NanoStreams microserver architecture using an industrial workload.
For the end-to-end workload, the NanoStreams microserver achieves energy savings of up to 10.7×, 5.87×, and 5× compared to the Intel server, Calxeda microserver, and ODROID node, respectively.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"396-409"},"PeriodicalIF":0.0,"publicationDate":"2017-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2764087","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameter Exploration to Improve Performance of Memristor-Based Neuromorphic Architectures","authors":"Mahyar Shahsavari;Pierre Boulet","doi":"10.1109/TMSCS.2017.2761231","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2761231","url":null,"abstract":"The brain-inspired spiking neural network neuromorphic architecture offers a promising solution for a wide set of cognitive computation tasks at very low power consumption. Motivated by the practical feasibility of hardware implementation, we present a memristor-based model of hardware spiking neural networks, which we simulate with the Neural Network Scalable Spiking Simulator (N2S3), our open source neuromorphic architecture simulator. Although spiking neural networks are widely used in the computational neuroscience and neuromorphic computing communities, there is still a need for research on methods to choose the optimum parameters for better recognition efficiency. With the help of our simulator, we analyze and evaluate the impact of different parameters such as the number of neurons, STDP window, neuron threshold, distribution of input spikes, and memristor model parameters on the MNIST hand-written digit recognition problem.
We show that a careful choice of a few parameters (number of neurons, kind of synapse, STDP window, and neuron threshold) can significantly improve the recognition rate on this benchmark (around 15 points of improvement for the number of neurons, a few points for the others) with a variability of four to five points of recognition rate due to the random initialization of the synaptic weights.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"833-846"},"PeriodicalIF":0.0,"publicationDate":"2017-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2761231","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68024068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Trade-off among Fault Tolerance, Energy Consumption, and Performance on a Multiple-Issue VLIW Processor","authors":"Anderson L. Sartor;Pedro H. E. Becker;Joost Hoozemans;Stephan Wong;Antonio C. S. Beck","doi":"10.1109/TMSCS.2017.2760299","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2760299","url":null,"abstract":"In the design of modern-day processors, energy consumption and fault tolerance have gained significant importance next to performance. This is caused by battery constraints, thermal design limits, and higher susceptibility to errors as transistor feature sizes decrease. However, achieving the ideal balance among them is challenging due to their conflicting nature (e.g., fault-tolerance techniques usually influence execution time or increase energy consumption), which is why current processor designs target at most two of these axes. We therefore propose a new VLIW-based processor design capable of adapting the execution of the application at run-time in a fully transparent fashion, considering performance, fault tolerance, and energy consumption altogether, where the weight (priority) of each can be defined a priori. This is achieved by a novel decision module that dynamically controls the application's ILP to increase the possibility of replicating instructions or applying power gating.
For an energy-oriented configuration, it is possible, on average, to reduce energy consumption by 37.2 percent with an overhead of only 8.2 percent in performance, while maintaining low levels of failure rate, when compared to a fault-tolerant design.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"327-339"},"PeriodicalIF":0.0,"publicationDate":"2017-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2760299","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inter-Cluster Thread-to-Core Mapping and DVFS on Heterogeneous Multi-Cores","authors":"Basireddy Karunakar Reddy;Amit Kumar Singh;Dwaipayan Biswas;Geoff V. Merrett;Bashir M. Al-Hashimi","doi":"10.1109/TMSCS.2017.2755619","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2755619","url":null,"abstract":"Heterogeneous multi-core platforms that contain different types of cores, organized as clusters, are emerging, e.g., ARM's big.LITTLE architecture. These platforms often need to deal with multiple concurrently executing applications that have different performance requirements. This leads to varying and mixed workloads (e.g., compute- and memory-intensive) due to resource sharing. Run-time management is required to adapt to such performance requirements and workload variability and to achieve energy efficiency. Moreover, management becomes challenging when the applications are multi-threaded and the heterogeneity needs to be exploited. Existing run-time management approaches do not efficiently exploit cores situated in different clusters simultaneously (referred to as inter-cluster exploitation) and the DVFS potential of cores, which is the aim of this paper. Such exploitation can help satisfy the performance requirement while achieving energy savings at the same time. Therefore, in this paper, we propose a run-time management approach that first selects a thread-to-core mapping based on the performance requirements and resource availability. Then, it applies online adaptation by adjusting the voltage-frequency (V-f) levels to achieve energy optimization without trading off application performance. For thread-to-core mapping, offline profiled results are used, which contain performance and energy characteristics of applications when executed on the heterogeneous platform using different types of cores in various possible combinations.
For an application, the thread-to-core mapping process defines the number of used cores and their type, which are situated in different clusters. The online adaptation process classifies the inherent workload characteristics of concurrently executing applications, incurring a lower overhead than existing learning-based approaches, as demonstrated in this paper. Workload classification is performed using the metric Memory Reads Per Instruction (MRPI). The adaptation process proactively selects an appropriate V-f pair for a predicted workload. Subsequently, it monitors the workload prediction error and performance loss, quantified by instructions per second (IPS), and adjusts the chosen V-f pair to compensate. We validate the proposed run-time management approach on a hardware platform, the Odroid-XU3, with various combinations of multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an average improvement in energy efficiency of up to 33 percent compared to existing approaches while meeting the performance requirements.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"369-382"},"PeriodicalIF":0.0,"publicationDate":"2017-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2755619","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors","authors":"Zhe Lin;Sharad Sinha;Hao Liang;Liang Feng;Wei Zhang","doi":"10.1109/TMSCS.2017.2754378","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2754378","url":null,"abstract":"Modern multicore systems are migrating from homogeneous systems to heterogeneous systems with accelerator-based computing in order to overcome the barriers of the performance and power walls. In this trend, FPGA-based accelerators are becoming increasingly attractive due to their excellent flexibility and low design cost. In this paper, we propose architectural support for efficient interfacing between FPGA-based multi-accelerators and chip-multiprocessors (CMPs) connected through a network-on-chip (NoC). Distributed packet receivers and hierarchical packet senders are designed to maintain scalability and reduce the critical path delay under a heavy task load. A dedicated accelerator chaining mechanism is also proposed to facilitate intra-FPGA data reuse among accelerators, circumventing the prohibitive communication overhead between the FPGA and processors. To evaluate the proposed architecture, a complete system emulation with programmability support is performed using FPGA prototyping.
Experimental results demonstrate that the proposed architecture achieves high performance and is lightweight and scalable.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 2","pages":"152-162"},"PeriodicalIF":0.0,"publicationDate":"2017-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2754378","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68021416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling Program's Performance with Gaussian Mixtures for Parametric Statistics","authors":"Julien Worms;Sid Touati","doi":"10.1109/TMSCS.2017.2754251","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2754251","url":null,"abstract":"This article is a continuation of our previous research effort on the statistical analysis and comparison of program performance [1] in the presence of program performance variability. In the previous study, we proposed a formal statistical methodology to analyze program speedups based on mean and median performance metrics: execution time, energy consumption, etc. However, mean and median observed performances do not always reflect the user's perception of such performances, especially when they are particularly unstable. In the current study, we propose additional precise performance metrics, based on performance modelling using Gaussian mixtures. Our additional statistical metrics for analyzing and comparing program performances give the user more precise decision tools to select the best code versions, not necessarily based on mean or median numbers. Also, we provide a new metric to estimate performance variability based on the Gaussian mixture model. Our statistical methods are implemented with R and distributed as open source code.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"383-395"},"PeriodicalIF":0.0,"publicationDate":"2017-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2754251","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PinMe: Tracking a Smartphone User around the World","authors":"Arsalan Mosenia;Xiaoliang Dai;Prateek Mittal;Niraj K. Jha","doi":"10.1109/TMSCS.2017.2751462","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2751462","url":null,"abstract":"With the pervasive use of smartphones that sense, collect, and process valuable information about the environment, ensuring location privacy has become one of the most important concerns of the modern age. A few recent research studies discuss the feasibility of processing sensory data gathered by a smartphone to locate the phone's owner, even when the user does not intend to share location information, e.g., when the user has turned off the Global Positioning System (GPS) on the device. Previous research efforts rely on at least one of the following two fundamental requirements, which impose significant limitations on the adversary: (i) the attacker must accurately know either the user's initial location or the set of routes through which the user travels, and/or (ii) the attacker must measure a set of features, e.g., device acceleration, for different potential routes in advance and construct a training dataset. In this paper, we demonstrate that neither of the above-mentioned requirements is essential for compromising the user's location privacy. We describe PinMe, a novel user-location mechanism that exploits non-sensory/sensory data stored on the smartphone, e.g., the environment's air pressure and the device's timezone, along with publicly available auxiliary information, e.g., elevation maps, to estimate the user's location when all location services, e.g., GPS, are turned off. Unlike previously proposed attacks, PinMe requires neither prior knowledge about the user nor a training dataset for specific routes. We demonstrate that PinMe can accurately estimate the user's location during four activities (walking, traveling on a train, driving, and traveling on a plane).
We also suggest several defenses against the proposed attack.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"420-435"},"PeriodicalIF":0.0,"publicationDate":"2017-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2751462","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68026457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hadoop Workloads Characterization for Performance and Energy Efficiency Optimizations on Microservers","authors":"Maria Malik;Katayoun Neshatpour;Setareh Rafatirad;Houman Homayoun","doi":"10.1109/TMSCS.2017.2749228","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2749228","url":null,"abstract":"Traditional low-power embedded processors such as Atom and ARM are entering the high-performance server market. At the same time, big data analytics applications are emerging and dramatically changing the landscape of data center workloads. Emerging big data applications require a significant amount of server computational power. However, the rapid growth of data makes it challenging to process the data efficiently using current high-performance server architectures. Furthermore, physical design constraints, such as power and density, have become the dominant limiting factor for scaling out servers. Numerous big data applications rely on the Hadoop MapReduce framework to perform their analysis on large-scale datasets. Since Hadoop configuration parameters as well as system parameters directly affect MapReduce job performance and energy efficiency, joint tuning of application-, system-, and architecture-level parameters is vital to maximize the energy efficiency of Hadoop-based applications. In this work, through methodical investigation of performance and power measurements, we demonstrate how the interplay among various Hadoop configuration parameters, as well as system- and architecture-level parameters, affects not only the performance but also the energy efficiency across various big data applications.
Our results identify trends to guide scheduling decisions and key insights to help improve the performance, power, and energy efficiency of Hadoop MapReduce applications on microservers.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 3","pages":"355-368"},"PeriodicalIF":0.0,"publicationDate":"2017-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2749228","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis and Optimization of Automatic Speech Recognition","authors":"Hamid Tabani;Jose-Maria Arnau;Jordi Tubella;Antonio González","doi":"10.1109/TMSCS.2017.2739158","DOIUrl":"https://doi.org/10.1109/TMSCS.2017.2739158","url":null,"abstract":"Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once in the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. 
Our experimental results show that the proposed optimizations provide 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"847-860"},"PeriodicalIF":0.0,"publicationDate":"2017-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2017.2739158","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"68023990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}