2018 IEEE International Congress on Big Data (BigData Congress): Latest Papers, Page 2

Lambda-Blocks: Data Processing with Topologies of Blocks

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00009

Matthieu Caneill, N. D. Palma

Abstract: We present and evaluate λ-blocks, a novel framework to write data processing programs in a descriptive manner. The main idea behind this framework is to separate the semantics of a program from its implementation. For that purpose, we define a data schema, able to describe, parameterize, compose, and link together blocks of code, storing a directed graph which represents the data transformations. Alongside this data schema lies an execution engine, able to read such a program, give feedback on potential errors, and finally execute it. In our reference implementation, a computation graph is described in YAML, linking together vertices of Python code blocks defined in separate libraries. The advantages of this approach are manifold: faster, less error-prone programming; reuse of code blocks; computation graph manipulations; mixing of different specialized libraries; and finally middleware for potential front-ends (such as graphical interfaces) and back-ends (other execution engines). The main goal of λ-blocks is to bring complex data processing computations to non-specialists, by providing a simple abstraction over large-scale data processing systems. Our contributions lie within a description of the schema, and an analysis of the reference execution engine. For that purpose we describe λ-blocks' internals and its main abstractions (blocks and topologies), and evaluate the framework's performance. We measured the framework overhead to have a maximum value of 50 ms, a negligible amount compared to the average duration of data processing jobs.

Citations: 1
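
The idea in the abstract above, a computation graph kept as data and handed to an engine that checks it and then runs it, can be illustrated with a minimal sketch. This is not the λ-blocks API: the names `BLOCKS`, `TOPOLOGY`, and `run_topology` are hypothetical, and a Python list of dicts stands in for the YAML document used by the reference implementation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical block registry: block name -> Python callable.
BLOCKS = {
    "read_numbers": lambda: [3, 1, 2],
    "sort": lambda xs: sorted(xs),
    "total": lambda xs: sum(xs),
}

# Hypothetical topology: each vertex names a block and the vertices it reads from.
TOPOLOGY = [
    {"name": "source", "block": "read_numbers", "inputs": []},
    {"name": "sorted", "block": "sort", "inputs": ["source"]},
    {"name": "sum", "block": "total", "inputs": ["sorted"]},
]


def run_topology(topology, blocks):
    """Validate the graph, then execute its vertices in dependency order."""
    by_name = {vertex["name"]: vertex for vertex in topology}

    # Report errors before anything is executed.
    for vertex in topology:
        if vertex["block"] not in blocks:
            raise ValueError(f"unknown block: {vertex['block']}")
        for dep in vertex["inputs"]:
            if dep not in by_name:
                raise ValueError(f"unknown input: {dep}")

    # TopologicalSorter expects node -> predecessors and raises CycleError on cycles.
    graph = {vertex["name"]: vertex["inputs"] for vertex in topology}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        vertex = by_name[name]
        args = [results[dep] for dep in vertex["inputs"]]
        results[name] = blocks[vertex["block"]](*args)
    return results
```

Calling `run_topology(TOPOLOGY, BLOCKS)` returns one result per vertex, so a front-end could inspect intermediate values as well as the final one.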

Budget-Transfer: A Low Cost Inter-Service Data Storage and Transfer Scheme

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00022

Galen Deal, Yang Peng, Hua Qin

Abstract: With the offerings of compelling cloud storage services from various cloud service providers, numerous web and mobile applications are leveraging the cloud to store data for long-term usage. In this paper, we propose Budget-Transfer, a unique scheme to reduce the long-term cost of storing large data sets using cloud storage services. In contrast to most existing works, we study the storage cost-minimization problem by leveraging various available storage services that can provide different levels of performance at different prices, under the constraint of data-access performance requirements. The key idea of Budget-Transfer is to continually transfer large data sets between different cloud storage services so as to satisfy performance requirements while avoiding overpaying for unnecessarily high performance guarantees. Budget-Transfer selects which service to use for each request with the goal of minimizing the overall storage cost, rather than selecting whichever would be the locally cheapest service. Thus, the cumulative data-transfer and data-storage cost over a long period of time to satisfy a sequence of data-access requests can be reduced for the system. Simulation results show that Budget-Transfer performs well under various system parameters and request patterns, and can significantly reduce costs compared to other schemes.

Citations: 1
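
The contrast drawn in the abstract above, minimizing overall cost rather than picking the locally cheapest service for each request, can be illustrated with a small dynamic program. This is only a sketch of that general idea and not the Budget-Transfer algorithm itself: the cost model (one storage cost per period, a flat transfer cost, integer performance levels) and the function name `cheapest_plan` are assumptions made for illustration.

```python
def cheapest_plan(requests, services, transfer_cost):
    """Pick one service per period so that total storage + transfer cost is minimal.

    requests: non-empty list of required performance levels, one per period.
    services: name -> (performance level, storage cost per period).
    transfer_cost: assumed flat cost of moving the data set between services.
    Returns (total cost, list of chosen service names).
    """
    best = {}  # service used in the previous period -> (cost so far, plan)
    for need in requests:
        feasible = {name: cost for name, (perf, cost) in services.items() if perf >= need}
        if not feasible:
            raise ValueError(f"no service meets performance level {need}")
        current = {}
        for name, storage in feasible.items():
            if not best:
                # First period: no transfer is needed yet.
                current[name] = (storage, [name])
                continue
            # Cheapest way to arrive at `name`, paying a transfer if the service changes.
            cost, plan = min(
                (
                    (prev_cost + (0 if prev == name else transfer_cost), prev_plan)
                    for prev, (prev_cost, prev_plan) in best.items()
                ),
                key=lambda candidate: candidate[0],
            )
            current[name] = (cost + storage, plan + [name])
        best = current
    return min(best.values(), key=lambda candidate: candidate[0])
```

With a cheap slow service, an expensive fast one, and a high transfer cost, staying on the fast service through a brief low-demand period can beat moving the data out and back, which is the situation a per-request greedy choice misses.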

A Survey of Current End-User Data Analytics Tool Support

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00013

Hourieh Khalajzadeh, Mohamed Abdelrazek, J. Grundy, J. Hosking, Qiang He

Abstract: There is a large growth in interest in big data analytics to discover unknown patterns and insights. A major challenge in this domain is the need to combine domain knowledge, that is, what the data means (semantics) and what it is used for, with data analytics and visualization techniques to mine and communicate important information from huge volumes of raw data. Many data analytics tools have been developed for both research and practice to assist in specifying, integrating and deploying data analytics and visualization applications. However, delivering such big data analytics applications requires a capable team with different skillsets including data scientists, software engineers and domain experts. Such teams and skillsets usually take a long time to build and have high running costs. An alternative is to provide domain experts and data scientists with tools they can use to do the exploration and analysis directly with fewer technical skills required. We present an overview and analysis of several current approaches to supporting data analytics for end users, identifying key strengths, weaknesses and opportunities for future research.

Citations: 13

[Publisher's information]

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/bigdatacongress.2018.00051

Citations: 0

Title Page i

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/bigdatacongress.2018.00001

Citations: 0

A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00024

Mohammed Aledhari, Marianne Di Pierro, F. Saeed

Abstract: DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially to humans, and from multiple perspectives. These include understanding the correlation of specific mutations that play a significant role in increasing or decreasing the risks of developing a disease or condition, or finding the implications and connections between the genotype and the phenotype. Advancements in high-throughput sequencing techniques, tools, and equipment have helped to generate big genomic datasets due to the tremendous decrease in DNA sequencing costs. However, the advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges. Accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community. Therefore, it is critical to develop new ways to minimize the data size. This paper presents a new real-time data minimization mechanism for big genomic datasets to shorten the transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method applies the random sampling of Fourier transform theory to real-time generated big genomic datasets in both FASTA and FASTQ formats, and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm reduces the size of FASTA datasets by up to 79% and is 98-fold faster and more secure than the standard data-encoding method. The results also show a size reduction of up to 45% for FASTQ datasets, 57-fold faster than the standard data-encoding approach. Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.

Citations: 2

Dynamic Model Evaluation to Accelerate Distributed Machine Learning

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00027

Simon Caton, S. Venugopal, TN ShashiBhushan, Vidya Sankar Velamuri, K. Katrinis

Abstract: The increase in the volume and variety of data has increased the reliance of data scientists on shared computational resources, either in-house or obtained via cloud providers, to execute machine learning and artificial intelligence programs. This, in turn, has created challenges of exploiting available resources to execute such "cognitive workloads" quickly and effectively to gather the needed knowledge and data insight. A common challenge in machine learning is knowing when to stop model building. This is often exacerbated in the presence of big data as a trade-off between the cost of producing the model (time, volume of training data, resources utilised) and its general performance. Whilst there are many tools and application stacks available to train models over distributed resources, the challenge of knowing when a model is "good enough" or no longer worth pursuing persists. In this paper, we propose a framework for evaluating the models produced by distributed machine learning algorithms during the training process. This framework integrates with the cluster job scheduler so as to finalise model training under constraints of resource availability or time, or simply because model performance is asymptotic with further training. We present a prototype implementation of this framework using Apache Spark and YARN, and demonstrate the benefits of this approach using sample applications with both supervised and unsupervised learning algorithms.

Citations: 2
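
One of the stopping conditions named in the abstract above, model performance that is asymptotic with further training, can be approximated by a simple plateau check on the evaluation history. The sketch below is a generic heuristic, not the criterion used in the paper: the function name `has_plateaued`, the `window` and `min_gain` parameters, and the assumption that higher scores are better are all illustrative choices, and the integration with Spark and YARN is not shown.

```python
def has_plateaued(scores, window=3, min_gain=1e-3):
    """Return True when recent evaluations no longer improve meaningfully.

    scores: evaluation scores in training order, higher is better (assumption).
    window: number of most recent evaluations to compare against earlier ones.
    min_gain: smallest improvement that still counts as progress.
    """
    if len(scores) <= window:
        return False  # not enough history to judge
    recent_best = max(scores[-window:])
    earlier_best = max(scores[:-window])
    return recent_best - earlier_best < min_gain
```

A scheduler-side loop could call such a check after each evaluation and release the job's resources once it returns `True`.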

BigDataStack: A Holistic Data-Driven Stack for Big Data Applications and Operations

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00041

D. Kyriazis, C. Doulkeridis, P. Gouvas, R. Jiménez-Peris, A. J. Ferrer, L. Kallipolitis, Pavlos Kranas, George Kousiouris, C. Macdonald, R. McCreadie, Y. Moatti, Apostolos Papageorgiou, M. Patiño-Martínez, Stathis Plitsos, Dimitrios Poulopoulos, Antonio Paradell, A. Raouzaiou, Paula Ta-Shma, V. Vianello

Abstract: The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. In this context, emerging innovative solutions exploit several underlying infrastructure and cluster management systems. However, these systems have not been designed and implemented in a "big data context", and they rather emphasize and address the computational needs and aspects of applications and services to be deployed. In this paper we present the architecture of a complete stack (namely BigDataStack), based on a frontrunner infrastructure management system that drives decisions according to data aspects, thus being fully scalable, runtime adaptable and highly performant to address the needs of big data operations and data-intensive applications. Furthermore, the stack goes beyond purely infrastructure elements by introducing techniques for dimensioning big data applications, modelling and analysis of processes, as well as provisioning data-as-a-service by exploiting a seamless analytics framework.

Citations: 6

On the Usage of the Probability Integral Transform to Reduce the Complexity of Multi-Way Fuzzy Decision Trees in Big Data Classification Problems

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00011

M. Elkano, Mikel Uriz, H. Bustince, M. Galar

Abstract: We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of training data. A two-step process is applied: 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform. Since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite the aforementioned transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves.

Citations: 4
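
The two-step process described in the abstract above can be sketched for a single numeric attribute as follows. This is a single-machine illustration under stated assumptions, not the authors' distributed implementation: the function names, the choice of `q`, the number of fuzzy sets, and the piecewise-linear interpolation between quantiles are all illustrative.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_quantile_cdf(train, q=100):
    """Step 1a: approximate the CDF by the q-quantiles of the training values."""
    cuts = quantiles(train, n=q, method="inclusive")  # q - 1 cut points
    xs = [min(train)] + cuts + [max(train)]           # attribute values
    ps = [i / q for i in range(q + 1)]                # cumulative probabilities
    return xs, ps


def interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1  # xs[i] <= x < xs[i + 1]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def memberships(u, n_sets=5):
    """Step 2: equally spaced triangular fuzzy sets on [0, 1].

    For any u in [0, 1] the degrees sum to 1 (a Ruspini strong partition).
    """
    width = 1.0 / (n_sets - 1)
    return [max(0.0, 1.0 - abs(u - k * width) / width) for k in range(n_sets)]
```

Here `interp(x, xs, ps)` plays the role of the probability integral transform (step 1b), and `interp(u, ps, xs)` is the approximate quantile function that maps a fuzzy set's position back to the original attribute space.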

Treepedia 2.0: Applying Deep Learning for Large-Scale Quantification of Urban Tree Cover

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/bigdatacongress.2018.00014

B. Cai, Xiaojiang Li, Ian Seiferling, C. Ratti

Abstract: Recent advances in deep learning have made it possible to quantify urban metrics at fine resolution and over large extents using street-level images. Here, we focus on measuring urban tree cover using Google Street View (GSV) images. First, we provide a small-scale labelled validation dataset and propose standard metrics to compare the performance of automated estimations of street tree cover using GSV. We apply state-of-the-art deep learning models, and compare their performance to a previously established benchmark of an unsupervised method. Our training procedure for deep learning models is novel; we utilize the abundance of openly available and similarly labelled street-level image datasets to pre-train our model. We then perform additional training on a small training dataset consisting of GSV images. We find that deep learning models significantly outperform the unsupervised benchmark method. Our semantic segmentation model increased mean intersection-over-union (IoU) from 44.10% to 60.42% relative to the unsupervised method and our end-to-end model decreased Mean Absolute Error from 10.04% to 4.67%. We also employ a recently developed method called gradient-weighted class activation map (Grad-CAM) to interpret the features learned by the end-to-end model. This technique confirms that the end-to-end model has accurately learned to identify tree cover area as key features for predicting percentage tree cover. Our paper provides an example of applying advanced deep learning techniques on a large-scale, geo-tagged and image-based dataset to efficiently estimate important urban metrics. The results demonstrate that deep learning models are highly accurate, can be interpretable, and can also be efficient in terms of data-labelling effort and computational resources.

Citations: 34
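
The two metrics quoted in the abstract above, intersection-over-union for segmentation masks and Mean Absolute Error for predicted percentage tree cover, have standard textbook definitions that can be written in a few lines. The sketch below uses those generic definitions on flat binary masks; how the paper averages IoU across images or classes is not stated in the abstract, so the function names and input layout here are assumptions.

```python
def iou(pred_mask, true_mask):
    """Intersection-over-union of two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


def mean_absolute_error(predicted, actual):
    """Average absolute difference, e.g. between predicted and labelled tree cover (%)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

For example, a predicted mask `[1, 1, 0, 0]` against a labelled mask `[1, 0, 1, 0]` shares one tree pixel out of three marked by either mask, giving an IoU of one third.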