{"title":"A TCAM generator for packet classification","authors":"Infall Syafalni, Tsutomu Sasao","doi":"10.1109/ICCD.2013.6657060","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657060","url":null,"abstract":"In the internet, packets are classified by source and destination addresses and ports, as well as protocol type. Ternary content addressable memories (TCAMs) are often used to perform this operation. This paper shows a method to reduce the number of words in TCAM for multi-field classification functions. We use head-tail expressions to represent a multi-field classification rule. Furthermore, we present an O(r^2) algorithm, called MFHT, to generate simplified TCAMs for two-field classification functions, where r is the number of rules. Experimental results show that MFHT achieves a 58% reduction of words for random rules and a 52% reduction of words for ACL and FW rules. Moreover, MFHT is fast and useful for simplifying TCAM for packet classification.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114461591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting dynamic phase distance mapping for phase-based tuning of embedded systems","authors":"Tosiron Adegbija, A. Gordon-Ross","doi":"10.1109/ICCD.2013.6657066","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657066","url":null,"abstract":"Phase-based tuning increases optimization potential by configuring system parameters for application execution phases. Previous work proposed phase distance mapping (PDM), which relied on extensive a priori analysis of executing applications to dynamically estimate the best configuration using the correlation between phases. We propose DynaPDM, a new dynamic phase distance mapping methodology that eliminates a priori designer effort, dynamically analyzes phases, and determines the best configurations, yielding average energy delay product savings of 28% (an 8% improvement on PDM) and configurations within 1% of the optimal.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134635929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards efficient dynamic data placement in NoC-based multicores","authors":"Qingchuan Shi, Farrukh Hijaz, O. Khan","doi":"10.1109/ICCD.2013.6657067","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657067","url":null,"abstract":"Next generation multicores will process massive data with significant sharing. Since future processors will also be inherently limited by the off-chip bandwidth, the on-chip data management is emerging as a first-order design constraint. On-chip memory latency increases as more cores are added since the diameter of most on-chip networks increases with the number of cores. We observe that a large fraction of on-chip traffic originates from communication between the cores to maintain cache coherence. Motivated by these observations, we propose a novel on-chip data placement mechanism that optimizes shared data placement by minimizing the distance of data from the requesting cores (improving locality) while paying attention to load balancing, network contention, and the utilization of per-core cache capacity. Using simulations of a 64-core multicore, we show that our proposal outperforms state-of-the-art static and dynamic data placement mechanisms by an average of 5.5% and 8.5% respectively.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130578939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource allocation algorithms for guaranteed service in application-specific NoCs","authors":"Gongmin Yang, Hao He, Jiang Hu","doi":"10.1109/ICCD.2013.6657088","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657088","url":null,"abstract":"Networks-on-chip (NoC) has been recognized as a scalable approach to cope with the increasingly large demand for on-chip communication. This work focuses on how to achieve guaranteed service for application-specific NoCs through resource reservation. A graph model is adopted to describe physical and temporal resources of an NoC in a unified manner. Based on the graph model, two resource allocation heuristics are proposed and investigated. One heuristic leverages the idea of chip layout routing and the other utilizes Boolean satisfiability. Simulation results on various test cases indicate that the proposed methods significantly outperform a state-of-the-art previous work.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128889995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lazy Precharge: An overhead-free method to reduce precharge overhead for memory parallelism improvement of DRAM system","authors":"Zhang Tao, Cong Xu, Yuan Xie, Guangyu Sun","doi":"10.1109/ICCD.2013.6657036","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657036","url":null,"abstract":"As we enter the multi-core era, the main memory becomes the bottleneck due to the explosion of memory requests. In this work, we propose a novel memory architecture, Lazy Precharge (LaPRE), that enables aggressive activation schemes so that multiple rows in a bank can be activated successively without interruption from precharges. Therefore, LaPRE effectively reduces the precharge overhead and thus improves memory parallelism. In addition, three memory scheduling schemes are proposed correspondingly to make full use of the improved memory parallelism. The experimental results show that LaPRE can achieve 14% performance improvement on average without hardware overhead.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117043371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Equivalence checking for compiler transformations in behavioral synthesis","authors":"Zhenkun Yang, K. Hao, Kai Cong, S. Ray, Fei Xie","doi":"10.1109/ICCD.2013.6657090","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657090","url":null,"abstract":"Behavioral synthesis entails application of a sequence of transformations to compile a high-level description of a hardware design (e.g., in C/C++/SystemC) into a Register-Transfer Level (RTL) implementation. We present a scalable equivalence checking framework to validate the correctness of compiler transformations employed by behavioral synthesis. Our approach is based on dual-rail symbolic simulation of the input and output design representations of a transformation. We have evaluated our framework on transformations applied to several designs by an open source behavioral synthesis tool, and we present initial results demonstrating the approach.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124355178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing post-silicon conformance checking","authors":"Li Lei, Kai Cong, Fei Xie","doi":"10.1109/ICCD.2013.6657092","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657092","url":null,"abstract":"Virtual prototypes of hardware devices, a.k.a. virtual devices, are increasingly used to enable early software development before silicon prototypes/devices are available. In previous work, we presented a post-silicon conformance checking approach to detecting interface state inconsistencies between a silicon device and its virtual device. In this paper, we present an optimization, adaptive concretization, to reduce the overhead incurred by symbolic execution, a key technique used in our conformance checking approach. We have evaluated our optimized approach on three Ethernet adapters and their virtual devices. The results demonstrate that it is effective and efficient: 21 inconsistencies are discovered and time usage is reduced by an order of magnitude compared to the previous approach.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132704777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CG-Resync: Conversion-guided resynchronization for a SSD-based RAID array","authors":"Letian Yi, J. Shu, Jiaxin Ou, Weimin Zheng","doi":"10.1109/ICCD.2013.6657081","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657081","url":null,"abstract":"SSD-based RAID arrays have been widely adopted in large-scale systems. One requirement on a RAID is to provide data consistency, which can be an issue while serving write requests. While using NVRAM or on-storage logging can ensure the consistency, the approaches can either be very expensive or substantially compromise performance. For SSD-based RAID, scanning the entire storage space during rebooting after a crash can recover the consistency. However, it takes a long resynchronization time. To address the issue efficiently and cost-effectively, we propose CG-Resync, a scheme providing consistency assurance for SSD-based RAIDs by leveraging the logging mechanism readily available in almost all SSDs for accommodating flash's out-of-place-write requirement. To identify uncompleted writes resulting in inconsistent stripes, we use guided conversion in managing SSD's internal logs. In particular, only when a stripe becomes consistent does CG-Resync allow the updated data on the stripe to be removed from the log. We evaluate CG-Resync and experiments show that it provides improved RAID reliability and availability upon a crash with little performance loss during regular I/O operations.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129280478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastLanes: An FPGA accelerated GPU microarchitecture simulator","authors":"Kuan Fang, Yufei Ni, Jiayuan He, Zonghui Li, Shuai Mu, Yangdong Deng","doi":"10.1109/ICCD.2013.6657049","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657049","url":null,"abstract":"Graphics Processing Units (GPUs) have emerged as a new general purpose computing platform that attracts significant research efforts. Currently, GPU architecture research resorts to time-consuming software simulations to evaluate microarchitecture innovations. In this paper, we propose FastLanes, an FPGA-based simulator for a generic GPU microarchitecture, to enable hardware-accelerated simulation. FastLanes consists of a functional model and a timing model, both implemented on FPGA. The functional model implements the full functionality of a GPU multiprocessor and emulates multiple multiprocessors via time-division multiplexing. We develop a hybrid implementation strategy in which certain GPU logic is directly mapped to FPGA while the other logic is simulated by reusing the same FPGA logic. A corresponding context shifting mechanism is proposed to store execution states of threads from FPGA to external on-board memory, and vice versa. Such a mechanism makes it possible to simulate hundreds of GPU cores on a single FPGA evaluation board. Driven by the functional simulation results, the timing model considers the detailed configuration of GPU microarchitecture to derive the performance evaluation. A compiler tool-chain is also developed to allow the execution of NVIDIA GPU binary on FastLanes. Experimental results show that FastLanes outperforms its software equivalent by up to 2 orders of magnitude.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114665791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}