Andrej Andrejev, S. Toor, A. Hellander, S. Holmgren, T. Risch
{"title":"Scientific Analysis by Queries in Extended SPARQL over a Scalable e-Science Data Store","authors":"Andrej Andrejev, S. Toor, A. Hellander, S. Holmgren, T. Risch","doi":"10.1109/ESCIENCE.2013.19","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2013.19","url":null,"abstract":"Data-intensive applications in e-Science require scalable solutions for storage as well as interactive tools for analysis of scientific data. It is important to be able to query the data in a storage-independent way, and to be able to obtain the results of the data-analysis incrementally (in contrast to traditional batch solutions). We use the RDF data model extended with multidimensional numeric arrays to represent the results, parameters, and other metadata describing scientific experiments, and SciSPARQL, an extension of the SPARQL language, to combine massive numeric array data and metadata in queries. To address the scalability problem we present an architecture that enables the same SciSPARQL queries to be executed on the RDF dataset whether it is stored in a relational DBMS or mapped over a specialized geographically distributed e-Science data store. In order to minimize access and communication costs, we represent the arrays with proxy objects, and retrieve their content lazily. We formulate typical analysis tasks from a computational biology application in terms of SciSPARQL queries, and compare the query processing performance with manually written scripts in MATLAB.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115643321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identity Management for Virtual Organizations: An Experience-Based Model","authors":"Robert Cowles, Craig Jackson, Von Welch","doi":"10.1109/eScience.2013.47","DOIUrl":"https://doi.org/10.1109/eScience.2013.47","url":null,"abstract":"In this paper we present our Virtual Organization (VO) Identity Management (IdM) Model, an overview of 14 interviews that informed it, and preliminary analysis of the factors that guide VOs and Resource Providers (RPs) to choose a particular IdM implementation. This model will serve both existing and future VOs and RPs to more effectively understand and implement their IdM relationships. The Virtual Organization has emerged as a fundamental way of structuring modern scientific collaborations and has shaped the computing infrastructure that supports those collaborations. One key aspect of this infrastructure is identity management, and the emergence of VOs introduces challenges regarding how much of the IdM process should be delegated from the RP to the VO. Many different implementation choices have been made; we conducted semi-structured interviews with 14 different VOs or RPs regarding their IdM choices and the bases behind those decisions. We analyzed the interview results to extract common parameters and values, which we used to inform our VO IdM Model.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123332920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wilfred W. Li, R. Moore, Matthew Kullberg, B. Battistuz, S. Meier, Ronald Joyce, R. Wagner, T. Reynales, Qian Liu
{"title":"Developing Sustainable Data Services in Cyberinfrastructure for Higher Education: Requirements and Lessons Learned","authors":"Wilfred W. Li, R. Moore, Matthew Kullberg, B. Battistuz, S. Meier, Ronald Joyce, R. Wagner, T. Reynales, Qian Liu","doi":"10.1109/eScience.2013.46","DOIUrl":"https://doi.org/10.1109/eScience.2013.46","url":null,"abstract":"The University of California, San Diego (UC San Diego) Research Cyberinfrastructure (RCI) program provides long-term quality services in centralized storage, colocation, computing, data curation, networking and technical expertise. To help define the data storage needs and set priorities, the RCI data services (RCIDS) team conducted a series of interviews with faculty and senior staff members between September 2012 and February 2013. A total of 50 groups from 29 separate departments and organized research units (ORUs) participated in the interviews, representing more than 600 UC San Diego researchers. From human genomic sequences, marine natural products, to cosmological simulations, their diverse datasets are shared with hundreds of thousands of users worldwide. The top 10 requirements on data services and the top 5 existing challenges and risks as reported by UC San Diego researchers have been identified. Based upon these requirements, the RCIDS team recommends a Network Attached Storage (NAS) data service to be first deployed with a sustainable business model. Additional services will be developed through further discussion with the research community and in view of emerging cloud computing technologies. An extensive discussion is provided on the implementation plan, cloud-based data services, and the lessons learned in building sustainable e-science infrastructure for higher education research.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122369190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yaonan Zhang, Yingpin Long, Guohui Zhao, Yufang Min, Jianfang Kang, L. Luo, Zhenfang He, Yang Wang
{"title":"An e-Science Environment for Ecological and Hydrological Simulation Research","authors":"Yaonan Zhang, Yingpin Long, Guohui Zhao, Yufang Min, Jianfang Kang, L. Luo, Zhenfang He, Yang Wang","doi":"10.1109/eScience.2013.37","DOIUrl":"https://doi.org/10.1109/eScience.2013.37","url":null,"abstract":"Comprehensive integrated research on ecological and hydrological processes and the simulation of river basin environments are critical foundations for decision making by governments and river-basin managers. The demand for a holistic understanding of environmental systems such as river basins is increasing. Eco-hydrological research needs two types of monitoring platforms to access and collect data from basins: a modeling platform to support access, select, and run models online, and build new models with the collected data, and a manipulation platform to generate forcing data, run models, and visualize the results. Consequently, we developed an e-science environment framework comprising three platforms - a monitoring platform, a model platform, and a manipulation platform. The framework allows automatic data transmission, storage, management, analysis, model management, simulation, computing, and result visualization. The e-science environment integrates land surface models such as the Simplified Simple Biosphere model, the Revised Simple Biosphere model, and WRF, hydrological models such as SWAT and TOPMODEL, data assimilation filters such as the Kalman filter algorithm, and several tools and methods for dealing with data, principally artificial neural networks and Markov chains. We demonstrate the application of the framework that uses an SSIB land surface model ensemble Kalman filter to improve evapotranspiration, soil moisture, and ground temperature simulation in the Heihe inland river basin. The approach proves suitable for environmental simulation for inland river research.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124113391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Humphrey, Jacob Steele, I. Kim, M. Kahn, J. Bondy, Michael Ames
{"title":"CloudDRN: A Lightweight, End-to-End System for Sharing Distributed Research Data in the Cloud","authors":"M. Humphrey, Jacob Steele, I. Kim, M. Kahn, J. Bondy, Michael Ames","doi":"10.1109/eScience.2013.53","DOIUrl":"https://doi.org/10.1109/eScience.2013.53","url":null,"abstract":"The cloud has proven itself as a scalable platform for Web-based applications. However, scientists and medical researchers are still searching for a simple cloud-based architecture that enables secure collaboration and sharing of distributed datasets. To date, attempts at using the cloud for this purpose generally view the cloud as simply a pool of servers upon which to run their legacy software. This approach fails to leverage the unique platform capabilities of the cloud. In this paper, we describe our Cloud Distributed Research Network (CloudDRN). We leverage the cloud for availability, reliability, scalability, and improved security as compared to legacy distributed systems while still supporting site autonomy. Our philosophy is to adapt commercial software tooling that was originally designed for business use-cases, thereby benefiting from the large built-in user community. We describe our general architecture and show an example of our system created to share distributed clinical research data. We evaluate our system in Amazon Web Services (AWS) and in Microsoft Windows Azure and find that while each cloud achieves similar financial cost, representative queries are 3.5x slower on average in Windows Azure.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129374580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Kugler, K. Chard, Simon Caton, O. Rana, D. Katz
{"title":"Constructing a Social Content Delivery Network for eScience","authors":"Kai Kugler, K. Chard, Simon Caton, O. Rana, D. Katz","doi":"10.1109/ESCIENCE.2013.52","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2013.52","url":null,"abstract":"Increases in the size of research data and the move towards citizen science, in which everyday users contribute data and analyses, have resulted in a research data deluge. Researchers must now carefully determine how to store, transfer and analyze \"Big Data\" in collaborative environments. This task is even more complicated when considering budget and locality constraints on data storage and access. In this paper we investigate the potential to construct a Social Content Delivery Network (S-CDN) based upon the social networks that exist between researchers. The S-CDN model builds upon the incentives of collaborative researchers within a given scientific community to address their data challenges collaboratively and in proven trusted settings. In this paper we present a prototype implementation of an S-CDN and investigate the performance of the data transfer mechanisms (using Globus Online) and the potential cost advantages of this approach.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"826 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120879191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operation Properties: A Representation and their Role in the Propagation of Meta-Data","authors":"Juan Amiguet-Vercher, P. Apers, A. Wombacher","doi":"10.1109/eScience.2013.13","DOIUrl":"https://doi.org/10.1109/eScience.2013.13","url":null,"abstract":"To facilitate the sharing and re-use of data in scientific studies we propose an automated technique for annotating operation results. The annotated output has to preserve, as much as possible, the properties of the input annotations. The preservation of properties is achieved by taking into account operation properties. Property preservation is evaluated with information theory metrics.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"1995 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128185210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"e-Enabling International Cancer Research: Lessons Being Learnt in the ENS@T-CANCER Project","authors":"A. Stell, R. Sinnott","doi":"10.1109/eScience.2013.33","DOIUrl":"https://doi.org/10.1109/eScience.2013.33","url":null,"abstract":"Breakthroughs in biomedicine are driven by research. More often than not, research takes place outside of a healthcare setting. However, access to and use of clinical data for research purposes has many challenges that must be overcome, not least of which are the lack of standardized nomenclature and the heterogeneity of healthcare IT systems. For rare conditions, this challenge is particularly acute since the scarcity of data makes scientific breakthroughs increasingly difficult. Adrenal tumours represent one rare disease area where consolidation of clinical and biological information is urgently required. This paper describes the lessons being learnt in the development and rollout of an advanced security-oriented, virtual research environment (VRE) as part of the EU funded ENS@T-CANCER project. This system is currently used by 39 major cancer research centres across Europe and provides a unique resource for adrenal cancer research, underpinning an expanding portfolio of major international clinical trials and studies.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126331942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Hunter, C. Brooking, Wilfred Brimblecombe, R. G. Dwyer, H. Campbell, Matthew E. Watts, C. Franklin
{"title":"OzTrack -- E-Infrastructure to Support the Management, Analysis and Sharing of Animal Tracking Data","authors":"J. Hunter, C. Brooking, Wilfred Brimblecombe, R. G. Dwyer, H. Campbell, Matthew E. Watts, C. Franklin","doi":"10.1109/ESCIENCE.2013.38","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2013.38","url":null,"abstract":"The aim of the OzTrack project is to provide common e-Science infrastructure to support the management, pre-processing, analysis and visualization of animal tracking data generated by researchers who are using telemetry devices to study animal behavior and ecology in Australia. This paper describes the technical challenges and design decisions associated with the development of the OzTrack system. It also describes the pre-processing, analysis and visualization services that we have developed to help researchers understand how their study species move across space and time. Finally, this paper outlines the system's current limitations and preliminary results and feedback from its evaluation to date.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125194230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Pipeline in MapReduce","authors":"Jiaan Zeng, Beth Plale","doi":"10.1109/eScience.2013.21","DOIUrl":"https://doi.org/10.1109/eScience.2013.21","url":null,"abstract":"MapReduce is an effective programming model for large scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, have the restriction that before any analysis can take place, the entire input dataset must be loaded into the cluster. This can introduce sizable latency when the data set is large, and when it is not possible to load the data once and process it many times - a situation that exists for log files, health records and protected texts, for instance. We propose a data pipeline approach to hide data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses two challenges: scheduling with a fixed number of maps and scheduling with a dynamic number of maps, which allows for better handling of input data sets of unknown size. We also employ a delay scheduler to achieve data locality for the data pipeline. The evaluation of the solution on different applications on real world data sets shows that our approach yields performance gains.","PeriodicalId":325272,"journal":{"name":"2013 IEEE 9th International Conference on e-Science","volume":"89 25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126291144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}