Assessment of cloud-based health monitoring using Homomorphic Encryption
Övünç Kocabas, T. Soyata, J. Couderc, M. Aktas, J. Xia, Michael C. Huang
2013 IEEE 31st International Conference on Computer Design (ICCD). DOI: https://doi.org/10.1109/ICCD.2013.6657078

Abstract: Current financial and regulatory pressure has provided strong incentives to institute better disease prevention and improved patient monitoring, and to push U.S. healthcare into the digital era. This transition requires that data privacy be ensured for digital health data in three distinct phases: I. acquisition, II. storage, and III. computation. Each phase comes with unique implementation and privacy challenges. While the privacy of the data can be ensured with existing AES encryption techniques in phases I (acquisition) and II (storage), enabling healthcare organizations to take advantage of cloud computing resources such as Amazon Web Services also requires that data privacy be preserved in phase III (computation). Currently, no system exists that enables direct computation in the cloud while assuring data privacy. Fully Homomorphic Encryption (FHE) is an emerging cryptographic technique that permits computation on encrypted data directly in the cloud, without bringing the data back to the computational node. However, this promising technique comes with significant performance- and storage-related challenges. While it will take years before true FHE is mainstream, we provide a feasibility study of its application to a simple long-term patient ECG-data monitoring system.
Bayesian theory oriented Optimal Data-Provider Selection for CMP
Guohong Li, Zhenyu Liu, Sanchuan Guo, Chongmin Li, Dongsheng Wang
2013 IEEE 31st International Conference on Computer Design (ICCD). DOI: https://doi.org/10.1109/ICCD.2013.6657050

Abstract: With the number of cores and the working sets of parallel workloads soaring, shared L2 caches exhibit fewer misses than private L2 caches by making better use of all the available cache capacity. However, shared L2 caches incur higher overall L1 miss latencies because of the longer average distance between the requestor and the home node, and potential congestion at some nodes. We observe that there is a high probability that the data requested by an L1 miss resides in a neighboring node's L1 cache. In such cases, the long-distance access to the home node can potentially be avoided. To leverage this property, we propose Bayesian theory oriented Optimal Data-Provider Selection (ODPS). ODPS partitions the multi-core into clusters of 2×2 nodes and introduces the Proximity Data Prober (PDP) to detect whether an L1 miss can be served by an L1 cache within the same cluster. Furthermore, we devise the Bayesian Decision Classifier (BDC) to intelligently and adaptively select a remote L2 cache or a neighboring L1 node as the data provider, according to the minimal miss cost derived from Bayesian decision theory.
Accelerator-rich CMPs: From concept to real hardware
Yu-Ting Chen, J. Cong, M. Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
2013 IEEE 31st International Conference on Computer Design (ICCD). DOI: https://doi.org/10.1109/ICCD.2013.6657039

Abstract: Application-specific accelerators provide a 10-100× improvement in power efficiency over general-purpose processors, and accelerator-rich architectures are especially promising. This work discusses a prototype of accelerator-rich CMPs (PARC). During our development of PARC in real hardware, we encountered a set of technical challenges and proposed corresponding solutions. First, we provided system IPs that serve a sea of accelerators by transferring data between user space and accelerator memories without cache overhead. Second, we designed a dedicated interconnect between accelerators and memories to enable memory sharing. Third, we implemented an accelerator manager to virtualize accelerator resources for users. Finally, we developed an automated flow with a number of IP templates and customizable interfaces to a C-based synthesis flow to enable rapid design and update of PARC. We implemented PARC on a Virtex-6 FPGA chip, integrating platform-specific peripherals and booting unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators with little system overhead.
{"title":"Performance-controllable shared cache architecture for multi-core soft real-time systems","authors":"Myoungjun Lee, Soontae Kim","doi":"10.1109/ICCD.2013.6657097","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657097","url":null,"abstract":"Multi-core processors with shared L2 caches can improve performance and integrate several functions of real-time systems on a single chip. However, tasks running on different cores increase interferences in the shared L2 cache, resulting in more deadline misses and, consequently, worse quality of real-time tasks. This is mainly because of the blind sharing of the L2 cache by multiple tasks running on different cores.We propose a novel performance-controllable shared L2 cache architecture that can alleviate these problems. First, our proposed L2 cache architecture is made to be aware of instructions/data belonging to real-time tasks by adding a real-time indication bit to each L2 cache block. Second, it can control the performance of real-time tasks and non-real-time tasks. Our experimental results show that our proposed L2 cache architecture reduces more deadline misses of real-time tasks than the conventional L2 cache architecture and partitioning schemes.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128318348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting critical implications with set-covering formulation for SAT-based Bounded Model Checking","authors":"Mahmoud Elbayoumi, M. Hsiao, Mustafa ElNainay","doi":"10.1109/ICCD.2013.6657070","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657070","url":null,"abstract":"The effectiveness of SAT-based Bounded Model Checking (BMC) critically relies on the deductive power of the BMC instance. Although implication relationships have been used to help SAT solver to make more deductions, frequently an excessive number of implications has been used. Too many such implications can result in a large number of clauses that could potentially degrade the underlying SAT solver performance. In this paper, we first propose a framework for a parallel deduction engine to reduce implication learning time. Secondly, we propose a novel set-cover technique for optimal selection of constraint clauses. This technique depends on maximizing the number of literals that can be deduced by the SAT solver during the BCP (Boolean Constraint Propagation) operation. Our parallel deduction engine can achieve a 5.7× speedup on a 36-core machine. In addition, by selecting only those critical implications, our strategy improves BMC by another 1.74× against the case where all extended implications were added to the BMC instance. Compared with the original BMC without any implication clauses, up to 55.32× speedup can be achieved.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121628197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power capping of CPU-GPU heterogeneous systems through coordinating DVFS and task mapping
T. Komoda, S. Hayashi, Takashi Nakada, Shinobu Miwa, Hiroshi Nakamura
2013 IEEE 31st International Conference on Computer Design (ICCD). DOI: https://doi.org/10.1109/ICCD.2013.6657064

Abstract: Future computer systems will be built under much more stringent power budgets due to the limitations of power delivery and cooling systems, so sophisticated power management techniques are required. Power capping limits the power consumption of a system to a predetermined level and has been studied extensively for homogeneous systems; however, few studies have addressed power capping of CPU-GPU heterogeneous systems. In this paper, we propose an efficient power capping technique that coordinates DVFS and task mapping on a single computing node equipped with GPUs. In CPU-GPU heterogeneous systems, device frequency settings must be considered together with task mapping between the CPUs and the GPUs, because frequency scaling can incur load imbalance between them. To guide the DVFS and task-mapping settings so as to avoid power violations and load imbalance, we develop new empirical models of the performance and the maximum power consumption of a CPU-GPU heterogeneous system. The models enable us to choose near-optimal device frequencies and task mappings in advance of the application execution. We evaluate the proposed technique with five data-parallel applications on a machine equipped with a single CPU and a single GPU. The experimental results show that the performance achieved by the proposed power capping technique is comparable to the ideal one.
{"title":"A global router on GPU architecture","authors":"Yiding Han, Koushik Chakraborty, Sanghamitra Roy","doi":"10.1109/ICCD.2013.6657028","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657028","url":null,"abstract":"In the modern VLSI design flow, global router is often utilized to provide fast and accurate congestion analysis for upstream processes to improve the design routability. Global routing parallelization is a good candidate to speedup its runtime performance while delivering very competitive solution quality. In this paper, we first study the cause of insufficient exploitable concurrency of the existing net level concurrency model, which has become a major bottleneck for parallelizing the emerging design problems. Then, we mitigate this limitation with a novel fine grain parallel model, with which a GPU based multi-agent global router is designed. Our experimental results indicate that the parallel model can effectively support the GPU based global router, and deliver stable solutions. The runtime comparison with NCTUgr2 has shown that upto 3.9× speedup is achieved by the GPU based router.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132229646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Noise-based algorithms for functional equivalence and tautology checking","authors":"P. Lin, S. Khatri","doi":"10.1109/ICCD.2013.6657048","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657048","url":null,"abstract":"In this paper, we present noise-based algorithms for functional equivalence and tautology checking using noise-based logic (NBL). A key property of NBL is that literals are represented by independent noise sources, from which we can construct noise-based cubes, and superpositions of such noise-based cubes, to create a noise-based Boolean function on a single wire. In our algorithms, the Boolean sum-of-products (SOP) formula is expressed in NBL as a superposition of its minterms. This noise-based representation of the SOP can then be compared with that of another SOP formula for equivalence checking (or with the noise-based formula representing tautology, for tautology checking) using a single operation. We validate our approach using software simulation.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133593852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quipu: High-performance simulation of quantum circuits using stabilizer frames","authors":"Héctor J. García, I. Markov","doi":"10.1109/ICCD.2013.6657072","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657072","url":null,"abstract":"As quantum information processing gains traction, its simulation becomes increasingly significant for engineering purposes - evaluation, testing and optimization - as well as for theoretical research. Generic quantum-circuit simulation appears intractable for conventional computers. However, Gottesman and Knill identified an important subclass, called stabilizer circuits, which can be simulated efficiently using group-theory techniques. Practical circuits enriched with quantum error-correcting codes and fault-tolerant procedures are dominated by stabilizer subcircuits and contain a relatively small number of non-stabilizer components. Therefore, we develop new group-theory data structures and algorithms to simulate such circuits. Stabilizer frames offer more compact storage than previous approaches but requires more sophisticated bookkeeping. Our implementation, called Quipu, simulates certain quantum arithmetic circuits (e.g., ripple-carry adders) in polynomial time and space for equal superpositions of n-qubits. On such instances, known linear-algebraic simulation techniques, such as the (state-of-the-art) BDD-based simulator QuIDDPro, take exponential time. We simulate various quantum Fourier transform and quantum fault-tolerant circuits with Quipu, and the results demonstrate that our stabilizer-based technique outperforms QuIDDPro in all cases.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115801177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing GPU throughput using kernel interleaved thread block scheduling","authors":"Mihir Awatramani, Joseph Zambreno, D. Rover","doi":"10.1109/ICCD.2013.6657093","DOIUrl":"https://doi.org/10.1109/ICCD.2013.6657093","url":null,"abstract":"The number of active threads required to achieve peak application throughput on graphics processing units (GPUs) depends largely on the ratio of time spent on computation to the time spent accessing data from memory. While compute-intensive applications can achieve peak throughput with a low number of threads, memory-intensive applications might not achieve good throughput even at the maximum supported thread count. In this paper, we study the effects of scheduling work from multiple applications on the same GPU core. We claim that interleaving workload from different applications on a GPU core can improve the utilization of computational units and reduce the load on memory subsystem. Experiments on 17 application pairs from the Rodinia benchmark suite show that overall throughput increases by 7% on average.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114664338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}