Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry
{"title":"Digitization and search: A non-traditional use of HPC","authors":"Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry","doi":"10.1109/eScience.2012.6404445","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404445","url":null,"abstract":"Automated search of handwritten content is a highly interesting and widely applicable subject, especially important today due to the public availability of large digitized document collections. We describe our efforts with the National Archives (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the suggested framework. Instead of trying to recognize the handwritten text, a still very difficult task, we use a content-based image retrieval technique known as Word Spotting. Through this paradigm, the system is queried by the use of handwritten text images instead of ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed to accomplish the preprocessing of the data so as to make this search capability available on an archive. The required preprocessing steps and the open source framework developed are discussed, focusing specifically on HPC considerations that are relevant when preparing to provide searchable access to sizeable collections, such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 SUs, we estimate the processing of the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can be used to provide an alternative to costly manual transcriptions for a variety of digitized paper archives.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"113 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80601556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"eResearch environment for remote instrumentation: VBL, RLI, VisLabl & 2","authors":"C. Myers, Michael D'Silva","doi":"10.1109/eScience.2012.6404465","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404465","url":null,"abstract":"This talk demonstrates the current remote experimentation capabilities deployed at the Australian Synchrotron and La Trobe University, as well as remote data transfer services deployed at the above locations and at Bragg, ANSTO; a metadata extraction tool; MyTardis nodes; remote analysis and visualisation environments for medical imaging and IR spectroscopy; and the use of high-resolution multi-screen displays.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"71 1","pages":"1-2"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90424526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partial replica selection for spatial datasets","authors":"Yun Tian, P. J. Rhodes","doi":"10.1109/eScience.2012.6404473","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404473","url":null,"abstract":"The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial Spatial Replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. Accessing only a subset of a spatial replica usually results in a large number of relatively small read requests made to the underlying storage device. For this reason, an accurate model of disk access is important when working with spatial subsets. We make two primary contributions in this paper. First, we describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Second, making a few simplifying assumptions, we propose a fast replica selection algorithm for partial spatial replicas. The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is on average always at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests respectively.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"59 1 1","pages":"1-10"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89349493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peter Sempolinski, D. Thain, Daniel Wei, A. Kareem
{"title":"A system for management of Computational Fluid Dynamics simulations for civil engineering","authors":"Peter Sempolinski, D. Thain, Daniel Wei, A. Kareem","doi":"10.1109/eScience.2012.6404433","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404433","url":null,"abstract":"We introduce a web-based system for management of Computational Fluid Dynamics (CFD) simulations. This system provides users with an intuitive, user-friendly web browser interface for dispatching and controlling long-running simulations. CFD presents a challenge to its users due to the complexity of its internal mathematics, the high computational demands of its simulations and the complexity of inputs to its simulations and related tasks. We designed this system to be as extensible as possible in order to be suitable for many different civil engineering applications. The front-end of this system is a webserver, which provides the user interface. The back-end is responsible for starting and stopping jobs as requested. There are also numerous components specifically for facilitating CFD computation. We discuss our experience with presenting this system to real users and the future ambitions for this project.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"29 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89687975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal representation for scientific data provenance","authors":"Peng Chen, Beth Plale, M. Aktaş","doi":"10.1109/eScience.2012.6404477","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404477","url":null,"abstract":"Provenance of digital scientific data is an important piece of the metadata of a data object. It can however grow voluminous quickly because the granularity level of capture can be high. It can also be quite feature rich. We propose a representation of the provenance data based on logical time that reduces the feature space. Creating time and frequency domain representations of the provenance, we apply clustering, classification and association rule mining to the abstract representations to determine the usefulness of the temporal representation. We evaluate the temporal representation using an existing 10 GB database of provenance captured from a range of scientific workflows.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"13 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87441105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Sinnott, Christopher Bayliss, G. Galang, Phillip Greenwood, George Koetsier, D. Mannix, L. Morandini, Marcos Nino-Ruiz, C. Pettit, Martin Tomko, M. Sarwar, R. Stimson, W. Voorsluys, I. Widjaja
{"title":"A data-driven urban research environment for Australia","authors":"R. Sinnott, Christopher Bayliss, G. Galang, Phillip Greenwood, George Koetsier, D. Mannix, L. Morandini, Marcos Nino-Ruiz, C. Pettit, Martin Tomko, M. Sarwar, R. Stimson, W. Voorsluys, I. Widjaja","doi":"10.1109/eScience.2012.6404481","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404481","url":null,"abstract":"The Australian Urban Research Infrastructure Network (AURIN) project (www.aurin.org.au) is tasked with developing an e-Infrastructure to support urban and built environment research across Australia. As identified in [1], this e-Infrastructure must provide seamless access to highly distributed and heterogeneous data sets from multiple organisations with accompanying analytical and visualization capabilities. The project is tasked with delivering a secure, web-based unifying environment offering a one-stop-shop for Australia-wide urban and built environment research. This paper describes the architectural design and implementation of the AURIN data-driven e-Infrastructure, where data is not just a passive entity that is accessed and used as a consequence of research demand, but instead directly shapes the computational access, processing and intelligent utilization possibilities. This is demonstrated in a situational context.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"13 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82101246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance computing without commitment: SC2IT: A cloud computing interface that makes computational science available to non-specialists","authors":"K. Jorissen, W. Johnson, F. Vila, J. Rehr","doi":"10.1109/eScience.2012.6404441","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404441","url":null,"abstract":"Computational work is a vital part of many scientific studies. In materials science research in particular, theoretical models are often needed to understand measurements. There is currently a double barrier that keeps a broad class of researchers from using state-of-the-art materials science codes: the software typically lacks user-friendliness, and the hardware requirements can demand a significant investment, e.g. the purchase of a Beowulf cluster. Scientific Cloud Computing has the potential to remove this barrier and make computational science accessible to a wider class of scientists who are not computational specialists. We present a set of interface tools, SC2IT, that enables seamless control of virtual compute clusters in the Amazon EC2 cloud and is designed to be embedded in user-friendly Java GUIs. We present applications of our Scientific Cloud Computing method to the materials science codes FEFF9, WIEN2k, and MEEP-mpi. SC2IT and the paradigm described here are applicable to other fields of research outside materials science within current Cloud Computing capability.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"22 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80195839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kary A. C. S. Ocaña, Daniel de Oliveira, Jonas Dias, Eduardo S. Ogasawara, M. Mattoso
{"title":"Discovering drug targets for neglected diseases using a pharmacophylogenomic cloud workflow","authors":"Kary A. C. S. Ocaña, Daniel de Oliveira, Jonas Dias, Eduardo S. Ogasawara, M. Mattoso","doi":"10.1109/eScience.2012.6404431","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404431","url":null,"abstract":"Illnesses caused by parasitic protozoa are a research priority. A representative group of these illnesses is commonly known as Neglected Tropical Diseases (NTD). NTDs especially affect populations of low socioeconomic status around the world; new anti-protozoan inhibitors are needed, and several drug discovery projects focus on researching new drug targets. Pharmacophylogenomics is a novel bioinformatics field that aims at reducing the time and the financial cost of the drug discovery process. Pharmacophylogenomic analyses are applied mainly in the early stages of the research phase in drug discovery. A pharmacophylogenomic analysis executes several bioinformatics programs in a coherent flow to identify homologous sequences, construct phylogenetic trees and execute evolutionary and structural experiments. It can therefore be modeled as a scientific workflow. Pharmacophylogenomic analysis workflows are complex, compute- and data-intensive, and may run for weeks, so they benefit from parallel execution. We propose SciPPGx, a scientific workflow that aims at providing thorough inference support for pharmacophylogenomic hypotheses. SciPPGx is executed in parallel in a cloud using the SciCumulus workflow engine. Experiments show that SciPPGx reduces the total execution time by up to 97.1% when compared to a sequential execution. We also present representative biological results taking advantage of the inference covering several related bioinformatics overviews.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"3 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74944990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Ramos-Pollán, F. González, Juan C. Caicedo, Angel Cruz-Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez-Rubiano, J. Bermeo, Juan Sebastian Otálora Montenegro, Paola K. Rozo, John Arevalo
{"title":"BIGS: A framework for large-scale image processing and analysis over distributed and heterogeneous computing resources","authors":"R. Ramos-Pollán, F. González, Juan C. Caicedo, Angel Cruz-Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez-Rubiano, J. Bermeo, Juan Sebastian Otálora Montenegro, Paola K. Rozo, John Arevalo","doi":"10.1109/eScience.2012.6404424","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404424","url":null,"abstract":"This paper presents BIGS, the Big Image Data Analysis Toolkit, a software framework for large-scale image processing and analysis over heterogeneous computing resources, such as those available in clouds, grids, computer clusters or throughout scattered computer resources (desktops, labs) in an opportunistic manner. Through BIGS, eScience for image processing and analysis is conceived to exploit coarse-grained parallelism based on data partitioning and parameter sweeps, avoiding the need for inter-process communication and, therefore, enabling loosely coupled computing nodes (BIGS workers). It adopts an uncommitted resource allocation model where (1) experimenters define their image processing pipelines in a simple configuration file, (2) a schedule of jobs is generated and (3) workers, as they become available, take over pending jobs as long as their dependency on other jobs is fulfilled. BIGS workers act autonomously, querying the job schedule to determine which one to take over. This removes the need for a central scheduling node, requiring only access by all workers to a shared information source. Furthermore, BIGS workers are encapsulated within different technologies to enable their agile deployment over the available computing resources. Currently they can be launched through the Amazon EC2 service over their cloud resources, through Java Web Start from any desktop computer and through regular scripting or SSH commands. This suits different kinds of research environments well, both when accessing dedicated computing clusters or clouds with committed computing capacity and when using opportunistic computing resources whose access is seldom or cannot be provisioned in advance. We also adopt a NoSQL storage model to ensure the scalability of the shared information sources required by all workers, including within BIGS support for HBase and Amazon's DynamoDB service. Overall, BIGS now enables researchers to run large-scale image processing pipelines in an easy, affordable and unplanned manner with the capability to take over computing resources as they become available at run time. This is shown in this paper by using BIGS in different experimental setups in the Amazon cloud and in an opportunistic manner, demonstrating its configurability, adaptability and scalability capabilities.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"69 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75651632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRMIS: The care and feeding of a generalized relatively relational database for accelerator components with a connection to the real time EPICS Input output controllers","authors":"R. Farnsworth, S. Benes","doi":"10.1109/eScience.2012.6404469","DOIUrl":"https://doi.org/10.1109/eScience.2012.6404469","url":null,"abstract":"IRMIS: The care and feeding of a generalized relatively relational database for accelerator components with a connection to the real time EPICS Input output controllers. This paper describes a relational database approach to documenting and maintaining; the feeding. It describes the automated process used to generate accelerator or synchrotron component data for the relational tables and the role of devices and components. The data thus obtained may in turn be used or presented in a variety of ways to the end user in order either to optimize maintenance or to provide machine metadata for experimental performance purposes.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"32 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2012-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83187892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}