{"title":"Load-aware Elastic Data Reduction and Re-computation for Adaptive Mesh Refinement","authors":"Mengxiao Wang, Huizhang Luo, Qing Liu, Hong Jiang","doi":"10.1109/NAS.2019.8834727","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834727","url":null,"abstract":"The increasing performance gap between computation and I/O creates huge data management challenges for simulation-based scientific discovery. Data reduction, among others, is deemed to be a promising technique to bridge the gap through reducing the amount of data migrated to persistent storage. However, the reduction performance is still far from what is being demanded from production applications. To this end, we propose a new methodology that aggressively reduces data despite the substantial loss of information, and re-computes the original accuracy on-demand. As a result, our scheme creates an illusion of a fast and large storage medium with the availability of high-accuracy data. We further design a load-aware data reduction strategy that monitors the I/O overhead at runtime, and dynamically adjusts the reduction ratio. We verify the efficacy of our methodology through adaptive mesh refinement, a popular numerical technique for solving partial differential equations. We evaluate data reduction and selective data re-computation on Titan, using a real application in FLASH and mini-applications in Chombo. To clearly demonstrate the benefits of re-computation, we compare it with other state-of-the-art data reduction methods including SZ, ZFP, FPC and deduplication, and it is shown to be superior in both write and read speeds, particularly when a small amount of data (e.g., 1%) need to be retrieved, as well as reduction ratio. 
Our results confirm that data reduction and selective data re-computation can 1) reduce the performance gap between I/O and compute via aggressively reducing AMR levels, and more importantly 2) can recover the target accuracy efficiently for AMR through re-computation.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124694911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Tail Latency of LDPC based Flash Memory Storage Systems Via Smart Refresh","authors":"Yina Lv, Liang Shi, Qiao Li, Congming Gao, C. Xue, E. Sha","doi":"10.1109/NAS.2019.8834728","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834728","url":null,"abstract":"Flash memory has been developed with bit density improvement, technology scaling, and 3D stacking. With this trend, its reliability has been degraded significantly. Error correction code, low density parity code (LDPC), which has strong error correction capability, has been employed to solve this issue. However, one of the critical issues of LDPC is that it would introduce a long decoding latency on devices with low reliability. In this case, tail latency would happen, which will significantly impact the quality of service (QoS). In this work, a set of smart refresh schemes is proposed to optimize the tail latency. The basic idea of the work is to refresh data when the accessed data has a long decoding latency. Two smart refresh schemes are proposed for this work: The first refresh scheme is designed to refresh long access latency data when it is accessed several times for access performance optimization; The second refresh scheme is designed to periodical detecting data with extremely long access latency and refreshing them for tail latency optimization. 
Experiment results show that the proposed schemes are able to significantly improve the tail latency and access performance with little overhead.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"339 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113982818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NAS 2019 Messages","authors":"","doi":"10.1109/nas.2019.8834712","DOIUrl":"https://doi.org/10.1109/nas.2019.8834712","url":null,"abstract":"","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133614356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thermo-GC: Reducing Write Amplification by Tagging Migrated Pages during Garbage Collection","authors":"Jing Yang, Shuyi Pei","doi":"10.1109/NAS.2019.8834722","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834722","url":null,"abstract":"Flash memory based solid-state drive (SSD) has been deployed in various systems because of its significant advantages over hard disk drive in terms of throughput and IOPS. One inherent operation that is necessary in SSD is garbage collection (GC), a procedure that selects an erasure candidate block and moves valid data on the selected candidate to another block. The performance of SSD is greatly influenced by GC. While existing studies have made advances in minimizing GC cost, few took advantages of the procedure of GC itself. As GC goes on, valid pages in an erasure candidate block tend to have similar lifetimes that can be exploited to minimize page’s movements. In this paper, we introduce Thermo-GC. The idea is to identify data’s hotness during GC operations and group data that have similar lifetimes to the same block. By clustering valid pages based on their hotness, Thermo-GC can minimize valid page movements and reduce GC cost. Experiment results show that Thermo-GC reduces data movements during GC by 78% and write amplification factor by 29.7% on average, implying extended lifetimes of SSDs.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130021923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HCMA: Supporting High Concurrency of Memory Accesses with Scratchpad Memory in FPGAs","authors":"Yangyang Zhao, Yuhang Liu, Wei Li, Mingyu Chen","doi":"10.1109/NAS.2019.8834726","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834726","url":null,"abstract":"Currently many researches focus on new methods of accelerating memory accesses between memory controller and memory modules. However, the absence of an accelerator for memory accesses between CPU and memory controller wastes the performance benefits of new methods. Therefore, we propose a coordinated batch method to support high concurrency of memory accesses (HCMA). Compared to the conventional method of holding outstanding memory access requests in miss status handling registers (MSHRs), HCMA method takes advantage of scratchpad memory in FPGAs or SoCs to circumvent the limitation of MSHR entries. The concurrency of requests is only limited by the capacity of scratchpad memory. Moreover, to avoid the higher latency when searching more entries, we design an efficient coordinating mechanism based on circular queues.We evaluate the performance of HCMA method on an MP-SoC FPGA platform. Compared to conventional methods based on MSHRs, HCMA method supports ten times of concurrent memory accesses (from 10 to 128 entries on our evaluation platform). HCMA method achieves up to 2.72× memory bandwidth utilization for applications that access memory with massive fine-grained random requests, and to 3.46× memory bandwidth utilization for stream-based memory accesses. 
For real applications like CG, our method improves speedup performance by 29.87%.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116700796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ares: A Scalable High-Performance Passive Measurement Tool Using a Multicore System","authors":"Xiaoban Wu, Yan Luo, Jeronimo Bezerra, Liang-Min Wang","doi":"10.1109/NAS.2019.8834734","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834734","url":null,"abstract":"Network measurement tools must support the collection of fine-grain flow statistics and scale well to the increasing line rates. However, conventional network measurement software tools are inadequate in high-speed network at the current scale. In this paper, we present Ares, a scalable high-performance passive network measurement tool to collect accurate per-flow metrics. Ares is built on a multicore platform, consisting of an effective hierarchical core assignment strategy, an efficient hash table for keeping flow statistics, a novel lockless flow statistics management scheme, as well as cache friendly prefetching. Our extensive performance evaluation shows that Ares brings about 19x speedup for 64-byte packets over existing approaches and can sustain up to a line rate of 100Gbps, while delivering the same level of fine-grained flow metrics.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"453 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133808256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Workflow Scheduling on Multi-Resource Clusters","authors":"Yang Hu, C. D. Laat, Zhiming Zhao","doi":"10.1109/NAS.2019.8834720","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834720","url":null,"abstract":"Workflow scheduling is one of the key issues in the management of workflow execution. Typically, a workflow application can be modeled as a Directed-Acyclic Graph (DAG). In this paper, we present GoDAG, an approach that can learn to well schedule workflows on multi-resource clusters. GoDAG directly learns the scheduling policy from experience through deep reinforcement learning. In order to adapt deep reinforcement learning methods, we propose a novel state representation, a practical action space and a corresponding reward definition for workflow scheduling problem. We implement a GoDAG prototype and a simulator to simulate task running on multi-resource clusters. In the evaluation, we compare the GoDAG with three state-of-the-art heuristics. The results show that GoDAG outperforms the baseline heuristics, leading to less average makespan to different workflow structures.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114829842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contention Aware Workload and Resource Co-Scheduling on Power-Bounded Systems","authors":"Pengfei Zou, Xizhou Feng, Rong Ge","doi":"10.1109/NAS.2019.8834721","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834721","url":null,"abstract":"As power becomes a top challenge in HPC systems and data centers, how to sustain the system performance growth under limited available or permissible power becomes an important research topic. Traditionally, researchers have explored collocating non-interfering jobs on the same nodes to improve system performance. Nevertheless, power limits reduce the capacity of components, nodes, and systems, and induce or aggravate contention between jobs. Using prior power-oblivious job collocation strategies on power limited systems can adversely degrade system throughput. In this paper, we quantitatively estimate contention induced by power limits, and propose a Contention-Aware Power-bounded Scheduling (CAPS) for systems with finite power budgets. CAPS chooses to collocate jobs that are complementary when power is limited, and distributes the available power to nodes and components to minimize their interference. 
Experimental results show that CAPS improves system throughput and power efficiency by 10% or greater than power-oblivious job collocation strategies, depending on the available power, for hybrid MPI/OpenMP benchmarks on a 192-core 8-node cluster.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124082411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Array Mapped Tries in KSM for Lightweight Memory Deduplication","authors":"Lingjing You, Yongkun Li, Fan Guo, Yinlong Xu, Jinzhong Chen, Liu Yuan","doi":"10.1109/NAS.2019.8834730","DOIUrl":"https://doi.org/10.1109/NAS.2019.8834730","url":null,"abstract":"In cloud computing, how to use limited hardware resources to meet the increasing demands has become a major issue. KSM (Kernel Same-page Merging) is a content-based page sharing mechanism used in Linux that merges equal memory pages, thereby significantly reducing memory usage and increasing the density of virtual machines or containers. However, KSM introduces a large overhead in CPU and memory bandwidth usage due to the use of red-black trees and content-based page comparison. To reduce the deduplication overhead, in this paper, we propose a new design called AMT-KSM, which leverages array mapped tries to realize lightweight memory deduplication. The basic idea is to divide each memory page into multiple segments and use the concatenated strings of the hash values of segments as indexed keys in the tries. By doing this, we can significantly reduce the time required for searching duplicate pages as well as the number of page comparisons. 
We conduct experiments to evaluate the performance of our design, and results show that compared with the conventional KSM, AMT-KSM can reduce up to 44.9% CPU usage and 31.6% memory bandwidth usage.","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128101178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NAS 2019 Keynotes","authors":"","doi":"10.1109/nas.2019.8834717","DOIUrl":"https://doi.org/10.1109/nas.2019.8834717","url":null,"abstract":"","PeriodicalId":230796,"journal":{"name":"2019 IEEE International Conference on Networking, Architecture and Storage (NAS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133031125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}