{"title":"Provenance-Based Scientific Workflow Search","authors":"A. A. Jabal, E. Bertino, Geeth de Mel","doi":"10.1109/eScience.2017.24","DOIUrl":"https://doi.org/10.1109/eScience.2017.24","url":null,"abstract":"Due to data intensive and sophisticated tasks in scientific experiments, workflows have been widely used to enable repetitive task automation and data reproducibility. This yields to the need for effective and efficient search mechanisms for scientific workflows discovery as workflow retrieval systems require a model which fulfills several requirements: unification, accuracy, and rich representations. Motivated by the recent uptake in provenance based models for scientific workflow discovery, in this paper, we propose a provenance-based architecture for retrieving workflows. Specifically, the paper presents an architecture which transforms data provenance into workflows and then organizes data into a set of indexes to support efficient querying mechanisms. The architecture enables composite queries supporting three types of search criteria: keywords of workflow tasks, workflow structure patterns, and metadata about workflows–e.g., how often a workflow was used.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"688 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133171360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Efficient Dynamic Scheduling of Deadline-Constrained MapReduce Workflows","authors":"Tong Shu, C. Wu","doi":"10.1109/eScience.2017.18","DOIUrl":"https://doi.org/10.1109/eScience.2017.18","url":null,"abstract":"Big data workflows comprised of moldable parallel MapReduce programs running on a large number of processors have become a main consumer of energy at data centers. The degree of parallelism of each moldable job in such workflows has a significant impact on the energy efficiency of parallel computing systems, which remains largely unexplored. In this paper, we validate with experimental results the moldable parallel computing model where the dynamic energy consumption of a moldable job increases with the number of parallel tasks. Based on our validation, we construct rigorous cost models and formulate a dynamic scheduling problem of deadline-constrained MapReduce workflows to minimize energy consumption in Hadoop systems. We propose a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamic energy consumption while meeting performance requirements from a global perspective, and also design the corresponding system modules for algorithm implementation in Hadoop architecture. 
The performance superiority of the proposed algorithm in terms of dynamic energy saving and deadline violation is illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms, and the core module of adaptive task partitioning is further validated through real-life workflow implementation and experimental results using the Oozie workflow engine in Hadoop/YARN systems.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"975 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134057232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hunting Data Rogues at Scale: Data Quality Control for Observational Data in Research Infrastructures","authors":"G. Pastorello, D. Gunter, H. Chu, D. Christianson, C. Trotta, E. Canfora, B. Faybishenko, Y. Cheah, N. Beekwilder, S. Chan, S. Dengel, T. Keenan, F. O'Brien, Abdelrahman Elbashandy, C. Poindexter, M. Humphrey, D. Papale, D. Agarwal","doi":"10.1109/ESCIENCE.2017.64","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2017.64","url":null,"abstract":"Data quality control is one of the most time consuming activities within Research Infrastructures (RIs), especially when involving observational data and multiple data providers. In this work we report on our ongoing development of data rogues, a scalable approach to manage data quality issues for observational data within RIs. The motivation for this work started with the creation of the FLUXNET2015 dataset, which includes carbon, water, and energy fluxes plus micrometeorological and ancillary data measured in over 200 sites around the world. To create an uniform dataset, including derived data products, extensive work on data quality control was needed. The unpredictable nature of observational data quality issues makes the automation of data quality control inherently difficult. Developed based on this experience, the data rogues methodology allows for increased automation of quality control activities by systematically identifying, cataloging, and documenting implementations of solutions to data issues. 
We believe this methodology can be extended and applied to others domains and types of data, making the automation of data quality control a more tractable problem.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"218 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113998795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic Technologies and Text Analysis in Support of Scientific Knowledge Reuse","authors":"Andres Garcia-Silva, Raúl Palma, José Manuél Gómez-Pérez","doi":"10.1109/eScience.2017.68","DOIUrl":"https://doi.org/10.1109/eScience.2017.68","url":null,"abstract":"Research objects act as a semantically rich container of all the information that lead to a scientific result, and have the potential to change how data intensive science shares and reuses methods, datasets and results. Despite a comprehensive set of vocabularies to describe semantically the aggregated resources, users often limit themselves to provide metadata at the container level, ignoring the valuable information that the aggregated resources contain. In this poster we explore how the combination of semantic technologies and natural language processing can be used to enrich research objects with structured metadata aiming at enhancing their findability as a crucial aspect towards their reuse.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122937276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Genome Sequence Alignment on Hadoop on Lustre Environment","authors":"Eun-Kyu Byun, Junehawk Lee, S. Yu, J. Kwak, Soonwook Hwang","doi":"10.1109/eScience.2017.59","DOIUrl":"https://doi.org/10.1109/eScience.2017.59","url":null,"abstract":"Genome sequence alignment is one of the basic procedure of genome sequencing analysis pipeline and also one of the most time-consuming parts. Including BigBWA, a number of tools were proposed to accelerate genome sequence alignment by parallelizing computation with Hadoop technologies. However, HDFS incurs considerable I/O overhead. In this research, we propose a new sequence alignment tool adopting Hadoop on Lustre. Based on BigBWA, we removed data transfer overhead caused by HDFS and parallelized whole I/O steps. Experimental result shows that our solution is five times faster than original BigBWA in a ten-node Lustre based Hadoop cluster.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126415730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Investigation into Acoustic Analysis Methods for Endangered Species Monitoring: A Case of Monitoring the Critically Endangered White-Bellied Heron in Bhutan","authors":"Tshering Dema, L. Zhang, M. Towsey, A. Truskinger, S. Sherub, Kinley, Jinglan Zhang, M. Brereton, P. Roe","doi":"10.1109/eScience.2017.30","DOIUrl":"https://doi.org/10.1109/eScience.2017.30","url":null,"abstract":"Passive acoustic recording has great potential for monitoring soniferous endangered and cryptic species. However, this approach requires analysis of long duration environmental acoustic recordings that span months or years. There is a variety of approaches to analysing acoustic data. However, it is unclear which approaches are best suited for monitoring of endangered species in the wild. Specifically, this study is undertaking acoustic monitoring of the critically endangered White-bellied Heron (Ardea insignis) in Bhutan. Four different acoustic analysis methods are investigated in terms of their detection accuracy, involvement of human experts, and overall utility to ecologists for target species monitoring work. Our experimental results show that human pattern detection using a visualization technique has detection performance on par with a cluster-based recogniser, while a machine learning classifier implemented using the same acoustic features suffers from very low precision. Further, specific cases of false positives and false negatives by the different methods are investigated and discussed in terms of their overall utility for ecological monitoring. 
Based on our experimental results, we demonstrate how an integrated semi-automated approach of human visual pattern analysis with a recogniser is a robust system for acoustic monitoring of target species.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130460670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Lossy Compression of Complex Environmental Indices Using Seasonal Auto-Regressive Integrated Moving Average Models","authors":"Ugur Çayoglu, P. Braesicke, T. Kerzenmacher, Jörg Meyer, A. Streit","doi":"10.1109/eScience.2017.45","DOIUrl":"https://doi.org/10.1109/eScience.2017.45","url":null,"abstract":"Significant increases in computational resources have enabled the development of more complex and spatially better resolved weather and climate models. As a result the amount of output generated by data assimilation systems and by weather and climate simulations is rapidly increasing e.g. due to higher spatial resolution, more realisations and higher frequency data. However, while compute performance has increased significantly because of better scaling program code and increasing number of cores the storage capacity is only increasing slowly. One way to tackle the data storage problem is data compression. Here, we build the groundwork for an environmental data compressor by improving compression for established weather and climate indices like El Ni~no Southern Oscillation (ENSO), North Atlantic Oscillation (NAO) and Quasi-Biennial Oscillation (QBO). We investigate options for compressing these indices by using a statistical method based on the Auto Regressive Integrated Moving Average (ARIMA) model. The introduced adaptive approach shows that it is possible to improve accuracy of lossily compressed data by applying an adaptive compression method which preserves selected data with higher precision. Our analysis reveals no potential for lossless compression of these indices. However, as the ARIMA model is able to capture all relevant temporal variability, lossless compression is not necessary and lossy compression is acceptable. The reconstruction based on the lossily compressed data can reproduce the chosen indices to such a high degree that statistically relevant information needed for describing climate dynamics is preserved. 
The performance of the (seasonal) ARIMA model was tested with daily and monthly indices.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116565079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ScienceDB: A Public Multidisciplinary Research Data Repository for eScience","authors":"Chengzan Li, Yanfei Hou, Jianhui Li, Z. Lili","doi":"10.1109/ESCIENCE.2017.38","DOIUrl":"https://doi.org/10.1109/ESCIENCE.2017.38","url":null,"abstract":"Research data repositories are necessary infrastructures that ensure the data generated for research are accessible, stable, reliable, and reusable. Based on years of accumulated data work experience, the Computer Network Information Center of the Chinese Academy of Sciences has built a multi-disciplinary data repository ScienceDB for research users and teams using its big data storage, analysis and computing environments. This paper firstly introduces the motivation to develop ScienceDB and gives a profile to it. Then the overall technical framework of ScienceDB is introduced, and the key technologies such as the support for multidiscipline extensibility, data collaboration and data recommendation are analyzed deeply. And then this paper presents the functions and features of ScienceDB's current version and discusses some issues such as its data policy, data quality assurance measures, and current application status. 
Finally, it summarizes and puts forward that it needs to carry out more in-depth research and practice of ScienceDB in order to meet the higher requirements of eScience in terms of thorough data association and fusion, data analysis and mining, data evaluation, and so on.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116987225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Massive OceanColor Data Processing and Analysis System: TuPiX-OC","authors":"Jung-Ho Um, Sunggeun Han, Hyunwoo Kim, Kyongseok Park","doi":"10.1109/eScience.2017.66","DOIUrl":"https://doi.org/10.1109/eScience.2017.66","url":null,"abstract":"Satellite image data generated from remote sensors around the world have different resolutions and are processed at varying levels from Level 0 to Level 3, with each level containing vast amounts of information. Due to the problem of data size, many ocean science researchers use L3 images, which have a spatial resolution of 4 km or 9 km. However, in order to overcome problems such as red tides or to analyze the marine ecosystem based on ocean color satellite research, researchers must generate data by changing the parameters of images at various levels. There is also a need for immediate access to satellite image information using analytical and visualization tools. Considering those requirements, TuPiX-OC (Turning PiXels into knowledge and science-OceanColor) provides an environment to design and execute algorithms for data processing and analysis of satellite image data by data type. TuPix-OC, which has a distributed architecture, is an analytical platform that supports data import, level conversion, DB integration, analysis and processing, and visualization. TuPiX-OC stores satellite data in a massive storage device, and provides an online platform for satellite data conversion/analysis/ visualization. For satellite data processing, TuPiX-OC converts NASA-provided binary files into files that can be analyzed by users. Moreover, TuPiX-OC includes various algorithms for satellite data selection and utilization of satellite images. 
Preliminary Experiments of TuPiX-OC's satellite image data processing capability showed that it was able to process 35 times as many images as the open source software SeaDAS.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133150808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Science Gateways Incubator: Software Sustainability Meets Community Needs","authors":"S. Gesing, M. Zentner, Juliana Casavan, Betsy Hillery, Mihaela Vorvoreanu, R. Heiland, S. Marru, M. Pierce, Nayiri Mullinix, Nancy Maron","doi":"10.1109/eScience.2017.77","DOIUrl":"https://doi.org/10.1109/eScience.2017.77","url":null,"abstract":"The main goal of the US Science Gateways Community Institute (SGCI) is to serve science gateways to achieve sustainability and growth. Science gateways allow science and engineering communities to access shared data, software, computing services, instruments, educational materials, and other resources specific to their disciplines. Thus, science gateways are a subgroup of scientific software and the means for addressing software sustainability are also suitable for science gateways and vice versa, e.g., best practices for software engineering. Since science gateways are tailored to specific communities, understanding users' requirements is critical for sustainability. SGCI consists of five service areas that closely interact with each other. The Incubator acknowledges the value of business strategy to inform well-designed science gateways and offers two main types of services: individualized consultancy, tailored to specific challenges a gateway faces, and the Science Gateways Bootcamp. The cornerstone of the Bootcamp is a one-week onsite intensive workshop where participants create their own roadmap for a sustainable science gateway via sessions with experts, hands-on exercises, and group work. 
This paper offers an overview of the work of the Incubator and shares lessons learned from the inaugural session of the Bootcamp in April 2017.","PeriodicalId":137652,"journal":{"name":"2017 IEEE 13th International Conference on e-Science (e-Science)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132116197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}