2018 IEEE International Congress on Big Data (BigData Congress)最新文献_第3页

GDedup: Distributed File System Level Deduplication for Genomic Big Data GDedup:用于基因组大数据的分布式文件系统级重复数据删除

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00023

Paul Bartus, Emmanuel Arzuaga

{"title":"GDedup: Distributed File System Level Deduplication for Genomic Big Data","authors":"Paul Bartus, Emmanuel Arzuaga","doi":"10.1109/BigDataCongress.2018.00023","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00023","url":null,"abstract":"During the last years, the cost of sequencing has dropped, and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate. The storage needs for genomic sequence data are also following this trend. In order to solve these new storage needs, different compression algorithms have been used. Nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose the use of GDedup, a deduplication storage system for genomics data, in order to improve data storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files in the file systems, thus allowing for more capacity per volume. We present a study on the relation between the amount of different types of mutations in genomic data such as point mutations, substitutions, inversions, and the effect of such in the deduplication ratio for a data set of vertebrate genomes in FASTA format. The experimental results show that the deduplication ratio values are superior to the actual compression ratio values for both (file read-decompress or write-compress) I/O patterns, highlighting the potential for this technology to be effectively adapted to improve storage management of genomics data.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121670608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Large Scale Predictive Analytics for Hard Disk Remaining Useful Life Estimation 硬盘剩余使用寿命估计的大规模预测分析

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00044

P. Anantharaman, Mu Qiao, D. Jadav

引用次数: 34

DynMDL: A Parallel Trajectory Segmentation Algorithm 动态mdl:一种并行轨迹分割算法

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00036

Eleazar Leal, L. Gruenwald

引用次数: 6

Big Data Quality: A Survey 大数据质量:调查

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00029

Ikbal Taleb, M. Serhani, R. Dssouli

{"title":"Big Data Quality: A Survey","authors":"Ikbal Taleb, M. Serhani, R. Dssouli","doi":"10.1109/BigDataCongress.2018.00029","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00029","url":null,"abstract":"With the advances in communication technologies and the high amount of data generated, collected, and stored, it becomes crucial to manage the quality of this data deluge in an efficient and cost-effective way. The storage, processing, privacy and analytics are the main keys challenging aspects of Big Data that require quality evaluation and monitoring. Quality has been recognized by the Big Data community as an essential facet of its maturity. Yet, it is a crucial practice that should be implemented at the earlier stages of its lifecycle and progressively applied across the other key processes. The earlier we incorporate quality the full benefit we can get from insights. In this paper, we first identify the key challenges that necessitates quality evaluation. We then survey, classify and discuss the most recent work on Big Data management. Consequently, we propose an across-the-board quality management framework describing the key quality evaluation practices to be conducted through the different Big Data stages. The framework can be used to leverage the quality management and to provide a roadmap for Data scientists to better understand quality practices and highlight the importance of managing the quality. We finally, conclude the paper and point to some future research directions on quality of Big Data.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132170066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 58

Adaptive Trip Recommendation System: Balancing Travelers among POIs with MapReduce 基于MapReduce的自适应旅行推荐系统

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00045

S. Migliorini, D. Carra, A. Belussi

{"title":"Adaptive Trip Recommendation System: Balancing Travelers among POIs with MapReduce","authors":"S. Migliorini, D. Carra, A. Belussi","doi":"10.1109/BigDataCongress.2018.00045","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00045","url":null,"abstract":"Travel recommendation systems provide suggestions to the users based on different information, such as user preferences, needs, or constraints. The recommendation may also take into account some characteristics of the points of interest (POIs) to be visited, such as the opening hours, or the peak hours. Although a number of studies have been proposed on the topic, most of them tailor the recommendation considering the user viewpoint, without evaluating the impact of the suggestions on the system as a whole. This may lead to oscillatory dynamics, where the choices made by the system generate new peak hours. This paper considers the trip planning problem that takes into account the balancing of users among the different POIs. To this aim, we consider the estimate of the level of crowding at POIs, including both the historical data and the effects of the recommendation. We formulate the problem as a multi-objective optimization problem, and we design a recommendation engine that explores the solution space in near real-time, through a distributed version of the Simulated Annealing approach. Through an experimental evaluation on a real dataset, we show that our solution is able to provide high quality recommendations, yet maintaining that the attractions are not overcrowded.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132203517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

XRT: Programming-Language Independent MapReduce on Shared-Memory Systems XRT:独立于编程语言的共享内存系统MapReduce

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00031

Erik G. Selin, H. Viktor

引用次数: 0

Towards a Better Replica Management for Hadoop Distributed File System 面向Hadoop分布式文件系统的副本管理

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00021

Hilmi Egemen Ciritoglu, Takfarinas Saber, Teodora Sandra Buda, John Murphy, Christina Thorpe

{"title":"Towards a Better Replica Management for Hadoop Distributed File System","authors":"Hilmi Egemen Ciritoglu, Takfarinas Saber, Teodora Sandra Buda, John Murphy, Christina Thorpe","doi":"10.1109/BigDataCongress.2018.00021","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00021","url":null,"abstract":"The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126978163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 13

Biparti Majority Learning with Tensors 张量的双多数学习

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00038

Chia-Lun Lee, Shun-Wen Hsiao, Fang Yu

引用次数: 0

A Fast and Incremental Development Life Cycle for Data Analytics as a Service 数据分析即服务的快速增量开发生命周期

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00030

C. Ardagna, V. Bellandi, P. Ceravolo, E. Damiani, B. D. Martino, Salvatore D'Angelo, A. Esposito

{"title":"A Fast and Incremental Development Life Cycle for Data Analytics as a Service","authors":"C. Ardagna, V. Bellandi, P. Ceravolo, E. Damiani, B. D. Martino, Salvatore D'Angelo, A. Esposito","doi":"10.1109/BigDataCongress.2018.00030","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00030","url":null,"abstract":"Big Data does not only refer to a huge amount of diverse and heterogeneous data. It also points to the management of procedures, technologies, and competencies associated with the analysis of such data, with the aim of supporting high-quality decision making. There are, however, several obstacles to the effective management of a Big Data computation, such as data velocity, variety, and veracity, and technological complexity, which represent the main barriers towards the full adoption of the Big Data paradigm. The goal of this work is to define a new software Development Life Cycle for the design and implementation of a Big Data computation. Our proposal integrates two model-driven methods: a first method based on pre-configured services that reduces the cost of deployment and a second method based on custom component development that provides an incremental process of refinement and customization. The proposal is experimentally evaluated by clustering a data set of the distribution of the population in the United States based on contextual criteria.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"77 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121111123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Time Series Sanitization with Metric-Based Privacy 使用基于度量的隐私的时间序列清理

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI: 10.1109/BigDataCongress.2018.00047

Liyue Fan, Luca Bonomi

引用次数: 7