{"title":"High-performance data management for genome sequencing centers using Globus Online: A case study","authors":"Dinanath Sulakhe, R. Kettimuthu, Utpal J. Davé","doi":"10.1109/eScience.2012.6404443","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404443","url":null,"abstract":"In the past few years in the biomedical field, availability of low-cost sequencing methods in the form of next-generation sequencing has revolutionized the approaches life science researchers are undertaking in order to gain a better understanding of the causative factors of diseases. With biomedical researchers getting many of their patients' DNA and RNA sequenced, sequencing centers are working with hundreds of researchers with terabytes to petabytes of data for each researcher. The unprecedented scale at which genomic sequence data is generated today by high-throughput technologies requires sophisticated and high-performance methods of data handling and management. For the most part, however, the state of the art is to use hard disks to ship the data. As data volumes reach tens or even hundreds of terabytes, such approaches become increasingly impractical. Data stored on portable media can be easily lost, and typically is not readily accessible to all members of the collaboration. In this paper, we discuss the application of Globus Online within a sequencing facility to address the data movement and management challenges that arise as a result of the exponentially increasing amount of data being generated by a rapidly growing number of research groups. 
We also present the unique challenges in applying a Globus Online solution in sequencing center environments and how we overcome those challenges.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"127 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90263422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From scripts towards provenance inference","authors":"M. R. Huq, P. Apers, A. Wombacher, Y. Wada, L. V. Beek","doi":"10.1109/ESCIENCE.2012.6404467","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2012.6404467","url":null,"abstract":"Scientists require provenance information either to validate their model or to investigate the origin of an unexpected value. However, they do not maintain any provenance information and even designing the processing workflow is rare in practice. Therefore, in this paper, we propose a solution that can build the workflow provenance graph by interpreting the scripts used for actual processing. Further, scientists can request fine-grained provenance information facilitating the inferred workflow provenance. We also provide a guideline to customize the workflow provenance graph based on user preferences. Our evaluation shows that the proposed approach is relevant and suitable for scientists to manage provenance.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"13 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83650847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Remote phenology: Applying machine learning to detect phenological patterns in a cerrado savanna","authors":"J. Almeida, J. A. D. Santos, Bruna Alberton, R. Torres, L. Morellato","doi":"10.1109/ESCIENCE.2012.6404438","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2012.6404438","url":null,"abstract":"Plant phenology has gained importance in the context of global change research, stimulating the development of new technologies for phenological observation. Digital cameras have been successfully used as multi-channel imaging sensors, providing measures of leaf color change information (RGB channels), or leafing phenological changes in plants. We monitored leaf-changing patterns of a cerrado-savanna vegetation by taking daily digital images. We extracted the RGB channels from the digital images and correlated them with phenological changes. Our first goals were: (1) to test if the color change information is able to characterize the phenological pattern of a group of species; and (2) to test if individuals from the same functional group may be automatically identified using digital images. In this paper, we present a machine learning approach to detect phenological patterns in the digital images. Our preliminary results indicate that: (1) extreme hours (morning and afternoon) are the best for identifying plant species; and (2) different plant species present different behaviors with respect to the color change information. 
Based on those results, we suggest that individuals from the same functional group might be identified using digital images, and introduce a new tool to help phenology experts identify and locate species on the ground.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"226 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83612923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast confidential search for bio-medical data using Bloom filters and Homomorphic Cryptography","authors":"H. Perl, Yassene Mohammed, Michael Brenner, Matthew Smith","doi":"10.1109/eScience.2012.6404484","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404484","url":null,"abstract":"Data protection is a challenge when outsourcing medical analysis, especially if one is dealing with patient-related data. While securing transfer channels is possible using encryption mechanisms, protecting the data during analyses is difficult as it usually involves processing steps on the plain data. A common use case in bioinformatics is when a scientist searches for a biological sequence of amino acids or DNA nucleotides in a library or database of sequences to identify similarities. Most such search algorithms are optimized for speed with little or no consideration for data protection. Fast algorithms are especially necessary because of the immense search space represented for instance by the genome or proteome of complex organisms. We propose a new secure exact term search algorithm based on Bloom filters. Our algorithm retains data privacy by using Obfuscated Bloom filters while maintaining the performance needed for real-life applications. The results can then be further aggregated using Homomorphic Cryptography to allow exact-match searching. 
The proposed system facilitates outsourcing exact term search of sensitive data to on-demand resources in a way which conforms to best practice of data protection.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"20 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73072583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FRED Navigator: An interactive system for visualizing results from large-scale epidemic simulations","authors":"Jack Paparian, Shawn T. Brown, D. Burke, J. Grefenstette","doi":"10.1109/ESCIENCE.2012.6404444","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2012.6404444","url":null,"abstract":"Large-scale simulations are increasingly used to evaluate potential public health interventions in epidemics such as the H1N1 pandemic of 2009. Due to variations in both disease scenarios and in interventions, it is typical to run thousands of simulations as part of a given study. This paper addresses the challenge of visualizing the results from a large number of simulation runs. We describe a new tool called FRED Navigator that allows a user to interactively visualize results from the FRED agent-based modeling system.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"352 1","pages":"1-5"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75494213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GridFTP based real-time data movement architecture for x-ray photon correlation spectroscopy at the Advanced Photon Source","authors":"S. Narayanan, T. Madden, A. Sandy, R. Kettimuthu, M. Link","doi":"10.1109/eScience.2012.6404466","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404466","url":null,"abstract":"X-ray photon correlation spectroscopy (XPCS) is a unique tool to study the dynamical properties in a wide range of materials over a wide spatial and temporal range. XPCS measures the correlated changes in the speckle pattern, produced when a coherent x-ray beam is scattered from a disordered sample, over a time series of area detector images. The technique rides on “Big Data” and relies heavily on high performance computing (HPC) techniques. In this paper, we propose a high-speed data movement architecture for moving data within the Advanced Photon Source (APS) as well as between APS and the users' institutions. We describe the challenges involved in the internal data movement and a GridFTP-based solution that enables more efficient usage of the APS beam time. The implementation of a GridFTP plugin as part of the data acquisition system at the Advanced Photon Source for real-time data transfer to the HPC system for data analysis is discussed.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"19 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84807518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overview of the TriBITS lifecycle model: A Lean/Agile software lifecycle model for research-based computational science and engineering software","authors":"R. Bartlett, M. Heroux, J. Willenbring","doi":"10.1109/eScience.2012.6404448","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404448","url":null,"abstract":"Software lifecycles are becoming an increasingly important issue for computational science & engineering (CSE) software. The process by which a piece of CSE software begins life as a set of research requirements and then matures into a trusted high-quality capability is both commonplace and extremely challenging. Although an implicit lifecycle is obviously being used in any effort, the challenges of this process, respecting the competing needs of research vs. production, cannot be overstated. Here we describe a proposal for a well-defined software lifecycle process based on modern Lean/Agile software engineering principles. What we propose is appropriate for many CSE software projects that are initially heavily focused on research but also are expected to eventually produce usable high-quality capabilities. The model is related to TriBITS, a build, integration and testing system, which serves as a strong foundation for this lifecycle model, and aspects of this lifecycle model are ingrained in the TriBITS system. 
Indeed this lifecycle process, if followed, will enable large-scale sustainable integration of many complex CSE software efforts across several institutions.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"43 5 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88850335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"X-ray imaging software tools for HPC clusters and the Cloud","authors":"D. Thompson, A. Khassapov, Y. Nesterets, T. Gureyev, John A. Taylor","doi":"10.1109/ESCIENCE.2012.6404464","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2012.6404464","url":null,"abstract":"Computed Tomography (CT) is a non-destructive imaging technique widely used across many scientific, industrial and medical fields. It is both computationally and data intensive, and therefore can benefit from infrastructure in the “supercomputing” domain for research purposes, such as Synchrotron science. Our group within CSIRO has been actively developing X-ray tomography and image processing software and systems for HPC clusters. We have also leveraged GPUs (Graphics Processing Units) for several codes, enabling speedups by an order of magnitude or more over CPU-only implementations. A key goal of our systems is to give our targeted “end users”, researchers, easy access to the tools, computational resources and data via familiar interfaces and client applications, such that specialized HPC expertise and support is generally not required in order to initiate and control data processing, analysis and visualization workflows. We have strived to enable the use of HPC facilities in an interactive fashion, similar to the familiar Windows desktop environment, in contrast to the traditional batch-job oriented environment that is still the norm at most HPC installations. Several collaborations have been formed, and we currently have our systems deployed on two clusters within CSIRO, Australia. A major installation is at the Australian Synchrotron (MASSIVE GPU cluster), where the system has been integrated with the Imaging and Medical Beamline (IMBL) detector to provide rapid on-demand CT-reconstruction and visualization capabilities to researchers whilst on-site and remotely. 
A smaller-scale installation has also been deployed on a mini-cluster at the Shanghai Synchrotron Radiation Facility (SSRF) in China. All clusters run the Windows HPC Server 2008 R2 operating system. The two large clusters running our software, MASSIVE and CSIRO Bragg, are currently configured as “hybrid clusters” in which individual nodes can be dual-booted between Linux and Windows as demand requires. We have also recently explored the adaptation of our CT-reconstruction code to Cloud infrastructure, and have constructed a working “proof-of-concept” system for the Microsoft Azure Cloud. However, at this stage several challenges remain to be met in order to make it a truly viable alternative to our HPC cluster solution. Recently, CSIRO was successful in its proposal to develop eResearch tools for the Australian Government funded NeCTAR Research Cloud. As part of this project our group will be contributing CT and image processing components.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"7 1","pages":"1-7"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80449474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On realizing the concept study ScienceSoft of the European Middleware Initiative: Open Software for Open Science","authors":"A. D. Meglio, F. Estrella, M. Riedel","doi":"10.1109/eScience.2012.6404450","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404450","url":null,"abstract":"In September 2011 the European Middleware Initiative (EMI) started discussing the feasibility of creating an open source community for science with other projects like EGI, StratusLab, OpenAIRE, iMarine, and IGE, SMEs like DCore, Maat, SixSq, and SharedObjects, and communities like WLCG and LSGC. The general idea of establishing an open source community dedicated to software for scientific applications was understood and appreciated by most people. However, the lack of a precise definition of goals and scope is a limiting factor that has also made many people sceptical of the initiative. In order to understand more precisely what such an open source initiative should do and how, EMI has started a more formal feasibility study around a concept called ScienceSoft - Open Software for Open Science. A group of people from interested parties was created in December 2011 to be the ScienceSoft Steering Committee with the short-term mandate to formalize the discussions about the initiative and produce a document with an initial high-level description of the motivations, issues and possible solutions and a general plan to make it happen. The conclusions of the initial investigation were presented at CERN in February 2012 at a ScienceSoft Workshop organized by EMI. Since then, presentations of ScienceSoft have been made on various occasions: in Amsterdam in January 2012 at the EGI Workshop on Sustainability, in Taipei in February at the ISGC 2012 conference, in Munich in March at the EGI/EMI Conference, and at OGF 34 in March. 
This paper presents the ScienceSoft concept study as an overview for the broader scientific community to critique.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"30 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74068591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Provenance analysis: Towards quality provenance","authors":"Y. Cheah, Beth Plale","doi":"10.1109/eScience.2012.6404480","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404480","url":null,"abstract":"Data provenance, a key piece of metadata that describes the lifecycle of a data product, is crucial in aiding scientists to better understand and facilitate reproducibility and reuse of scientific results. Provenance collection systems often capture provenance on the fly and the protocol between application and provenance tool may not be reliable. As a result, data provenance can become ambiguous or simply inaccurate. In this paper, we identify likely quality issues in data provenance. We also establish crucial quality dimensions that are especially critical for the evaluation of provenance quality. We analyze synthetic and real-world provenance based on these quality dimensions and summarize our contributions to provenance quality.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"81 5","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72607827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}