Ian T Foster, D. Kohr, R. Krishnaiyer, Jace Mogill
{"title":"Remote I/O: fast access to distant storage","authors":"Ian T Foster, D. Kohr, R. Krishnaiyer, Jace Mogill","doi":"10.1145/266220.266222","DOIUrl":"https://doi.org/10.1145/266220.266222","url":null,"abstract":"As high-speed networks make it easier to use distributed resources, it becomes increasingly common that applications and their data are not colocated. Users have traditionally addressed this problem by manually staging data to and from remote computers. We argue instead for a new remote I/O paradigm in which programs use familiar parallel I/O interfaces to access remote file systems. In addition to simplifying remote execution, remote I/O can improve performance relative to staging by overlapping computation and data transfer or by reducing communication requirements. However, remote I/O also introduces new technical challenges in the areas of portability, performance, and integration with distributed computing systems. We propose techniques designed to address these challenges and describe a remote I/O library called RIO that we have developed to evaluate the effectiveness of these techniques. RIO addresses issues of portability by adopting the quasi-standard MPI-IO interface and by defining a RIO device and RIO server within the ADIO abstract I/O device architecture. It addresses performance issues by providing traditional I/O optimizations such as asynchronous operations and through implementation techniques such as buffering and message forwarding to off load communication overheads. RIO uses the Nexus communication library to obtain access to configuration and security mechanisms provided by the Globus wide area computing tool kit. Microbenchmarks and application experiments demonstrate that our techniques achieve acceptable performance in most situations and can improve turnaround time relative to staging.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114523202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Input/output access pattern classification using hidden Markov models","authors":"T. Madhyastha, D. Reed","doi":"10.1145/266220.266226","DOIUrl":"https://doi.org/10.1145/266220.266226","url":null,"abstract":"Input/output performance on current parallel file systems is sensitive to a good match of application access pattern to file system capabilities. Automaticiuput/output access classification can determine application access patterns at execution time, guiding adaptive file system policies. In this paper we examine a new method for access pattern classification that uses hidden Markov models, trained on access patterns from previous executions, to create a probabilistic model of input/output accesses. We compare this approach to a neural network classification &n-rework, presenting performance results from parallel and sequential benchmarks and applications.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127299655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Kandemir, A. Choudhary, J. Ramanujam, M. Kandaswamy
{"title":"A unified compiler algorithm for optimizing locality, parallelism and communication in out-of-core computations","authors":"M. Kandemir, A. Choudhary, J. Ramanujam, M. Kandaswamy","doi":"10.1145/266220.266228","DOIUrl":"https://doi.org/10.1145/266220.266228","url":null,"abstract":"This paper presents compiler algorithms to optimize outof-core programs. These algorithms consider loop and data layout transformations in a tied framework. The performance of an out-of-core loop nest containing many references can be improved by a combination of restructuring the loops and file layouts. This approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost tiling loop. Preliminary re.suIts from handcompiles on IBM SP-2 and Intel Paragon show that this approach reduces the execution time, improves the bandwidth speedup and overall speedup. In addition, we extend the base algorithm to work with file layout constraints and show how it can be used for optimizing programs consisting of multiple loop nests.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127554503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The scalability of spatial reuse based serial storage interfaces","authors":"Tai-Sheng Chang, Sangyup Shim, D. Du","doi":"10.1145/266220.266229","DOIUrl":"https://doi.org/10.1145/266220.266229","url":null,"abstract":"Due to the growing popularity of emerging applications such as digital libraries, Video-On Demand, distance learning, and Internet World-Wide Web, multimedia servers with a large capacity and high performance storage subsystem are in high demand. Serial storage interfaces are emerging technologies designed to improve the performance of such storage subsystems. They provide high bandwidth, fault tolerance, fair bandwidth sharing and long distance connection capability. All of these issues are critical in designing a scalable and high performance storage subsystem. Some of the serial storage interfaces provide the spatial reuse feature which allows multiple concurrent transmissions. That is, multiple hosts can access disks concurrently with full link bandwidth if their access paths are disjoint. Spatial reuse provides a way to build a storage subsystem whose aggregate bandwidth may be scaled up with the number of hosts. However, it is not clear how much the performance of a storage subsystem could be improved by the spatial reuse with different configurations and traffic scenarios. Both limitation and capability of this scalability need to be investigated. To understand their fundamental performance characteristics, we derive an analytic model for the serial storage interfaces with the spatial reuse feature. Based on this model, we investigate the maximum aggregate throughput from different system configurations and load distributions. We show how the number of disks needed to saturate a loop varies with different number of hosts and different load scenarios. We also show how the load balancing by uniformly distributing the load to all the disks on a loop may incur high overhead. This is because the accesses to far away disks need to go through many links and consume the bandwidth of each link it goes through. The results show the achievable throughput may be reduced by more than half in some cases.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125721187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sunil Prabhakar, D. Agrawal, A. E. Abbadi, Ambuj K. Singh, Terence R. Smith
{"title":"Browsing and placement of multiresolution images on parallel disks","authors":"Sunil Prabhakar, D. Agrawal, A. E. Abbadi, Ambuj K. Singh, Terence R. Smith","doi":"10.1145/266220.266230","DOIUrl":"https://doi.org/10.1145/266220.266230","url":null,"abstract":"With rapid advances in computer and communication technologies, there is an increasing demand to build and maintain large image repositories. In order to reduce the demands on I/O and network resources, multiresolution representations are being proposed for the storage organization of images. Image decomposition techniques such as wavelets can be used to provide these multiresolution images. The original image is represented by several coefficients, one of them with visual similarity to the original image, but at a lower resolution. These visually similar coefficients can be thought of as thumbnails or icons of the original image. This paper addresses the problem of storing these multiresolution coefficients on disks so that thumbnail browsing as well as image reconstruction can be performed efficiently. Several strategies are evaluated to store the image coefficients on parallel disks. These strategies can be classified into two broad classes depending on whether the access pattern of the images is used in the placement. Disk simulation is used to evaluate the performance of these strategies. Simulation results are validated wivith results from experiments wivith real disks and are found to be in good agreement. The results indicate that significant performance improvements can be achieved with as few as four disks by placing image coefficients based upon browsing access patterns.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122086178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yong Cho, M. Winslett, M. Subramaniam, Ying Chen, S. Kuo, K. Seamons
{"title":"Exploiting local data in parallel array I/O on a practical network of workstations","authors":"Yong Cho, M. Winslett, M. Subramaniam, Ying Chen, S. Kuo, K. Seamons","doi":"10.1145/266220.266221","DOIUrl":"https://doi.org/10.1145/266220.266221","url":null,"abstract":"A cost-effective way to run a parallel application is to use existing workstations connected by a local area network such as Ethernet or FDDI. In this paper, we present an approach for parallel I/O of multidimensional arrays on small networks of workstations with a shared-media interconnect, using the Panda I/O library. In such an environment, the message passing throughput per node is lower than the throughput obtainable from a fast disk and it is not easy for users to determine the configuration which will yield the best I/O performance. We introduce an I/O strategy that exploits local data to reduce the amount of data that must be shipped across the network, present experimental results, and analyze the results using an analytical performance model and predict the best choice of I/O parameters. Our experiments show that the new strategy results in a factor of 1.2-2.1 speedup in response time compared to the Panda version originally developed for the IBM SP2, depending on the array sizes, distributions and compute and I/O node meshes. Further, the performance model predicts the results within a 13% margin of error.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115204726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient distributed backup with delta compression","authors":"R. Burns, D. Long","doi":"10.1145/266220.266223","DOIUrl":"https://doi.org/10.1145/266220.266223","url":null,"abstract":"Inexpensive storage and more powerful processors have resulted in a proliferation of data that needs to be reliably backed up. Network resource limitations make it increasingly difficult to backup a distributed file system on a nightly or even weekly basis. By using delta compression algorithms, which minimally encode a version of a file using only the bytes that have changed, a backup system can compress the data sent to a server. With the delta backup technique, we can achieve significant savings in network transmission time over previous techniques. Our measurements indicate that file system data may, on average, be compressed to within 10% of its original size with this method and that approximately 45% of all changed files have also been backed up in the previous week. Based on our measurements, we conclude that a small file store on the client that contains copies of previously backed up files can be used to retain versions in order to generate delta files. To reduce the load on the backup server, we implement a modified version storage architecture, version jumping, that allows us to restore delta encoded file versions with at most two accesses to tertiary storage. This minimizes server workload and network transmission time on file restore.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133287836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prefetching in segmented disk cache for multi-disk systems","authors":"V. Soloviev","doi":"10.1145/236017.236037","DOIUrl":"https://doi.org/10.1145/236017.236037","url":null,"abstract":"This paper investigates the performance of a multi-disk storage system equipped with a segmented disk cache processing a workload of multiple relational scans. Prefetching is a popular method of improving the performance of scans. Many modern disks have a multisegment cache which can be used for prefetching. We observe that, exploiting declustering as a data placement method, prefetching in a segmented cache causes a load imbalance among several disks. A single disk becomes a bottleneck, degrading performance of the entire system. A variation in disk queue length is a primary factor of the imbalance. Using a precise simulation model, we investigate several approaches to achieving better balancing. Our metrics are a scan response time for the closed-end system and an ability to sustain a workload without saturating for the open-end system. We arrive at two main conclusions: (1) Prefetching in main memory is inexpensive and effective for balancing and can supplement or substitute prefetching in disk cache. (2) Disk-level prefetching provides about the same performance as main memory prefetching if request queues are managed in the disk controllers rather than in the host. Checking the disk cache before queuing requests provides not only better request response time but also drastically improves balancing. A single cache performs better than a segmented cache for this method.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129706338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structured permuting in place on parallel disk systems","authors":"L. Wisniewski","doi":"10.1145/236017.236047","DOIUrl":"https://doi.org/10.1145/236017.236047","url":null,"abstract":"The ability to perform permutations of large data sets in place reduces the amount of necessary available disk storage. The simplest way to perform a permutation often is to read the records of a data set from a source portion of data storage, permute them in memory, and write them to a separate target portion of the same size. It can be quite expensive, however, to provide disk storage that is twice the size of very large data sets. Permuting in place reduces the expense by using only a small amount of extra disk storage beyond the size of the data set.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132399863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HFS: a performance-oriented flexible file system based on building-block compositions","authors":"O. Krieger, M. Stumm","doi":"10.1145/236017.236041","DOIUrl":"https://doi.org/10.1145/236017.236041","url":null,"abstract":"The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file’s structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application’s access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.","PeriodicalId":442608,"journal":{"name":"Workshop on I/O in Parallel and Distributed Systems","volume":"19 31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123577454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}