{"title":"GDedup: Distributed File System Level Deduplication for Genomic Big Data","authors":"Paul Bartus, Emmanuel Arzuaga","doi":"10.1109/BigDataCongress.2018.00023","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00023","url":null,"abstract":"In recent years, the cost of sequencing has dropped and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate, and storage needs are growing accordingly. To meet these storage needs, different compression algorithms have been used; nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose GDedup, a deduplication storage system for genomic data, to improve storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be built by modifying existing storage systems such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files, thus allowing for more capacity per volume. We present a study of the relation between different types of mutations in genomic data, such as point mutations, substitutions, and inversions, and their effect on the deduplication ratio for a data set of vertebrate genomes in FASTA format. 
The experimental results show that deduplication ratios exceed the corresponding compression ratios for both I/O patterns (file read-decompress and write-compress), highlighting the potential for this technology to be effectively adapted to improve storage management of genomic data.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121670608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
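As an illustrative aside on the mechanism behind the abstract above: a deduplication ratio can be estimated by hashing chunks of a file and counting only the unique chunks. A minimal sketch (fixed-size chunking and the `chunk_size` parameter are illustrative choices, not details from the paper, which works at the distributed file system level):

```python
import hashlib

def dedup_ratio(data: bytes, chunk_size: int = 64) -> float:
    """Estimate a deduplication ratio: logical size / size of unique chunks.

    Fixed-size chunking for simplicity; production deduplication systems
    often use content-defined chunking instead."""
    seen = set()
    unique_bytes = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            unique_bytes += len(chunk)
    return len(data) / unique_bytes if unique_bytes else 1.0
```

Highly repetitive input, like near-identical genome sequences, yields ratios well above typical compression ratios, which is the effect the paper exploits.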
{"title":"Large Scale Predictive Analytics for Hard Disk Remaining Useful Life Estimation","authors":"P. Anantharaman, Mu Qiao, D. Jadav","doi":"10.1109/BigDataCongress.2018.00044","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00044","url":null,"abstract":"Hard disk failure prediction plays an important role in reducing data center downtime and improving service reliability. In contrast to existing work of modeling the prediction problem as classification tasks, we aim to directly predict the remaining useful life (RUL) of hard disk drives. We experiment with two different types of machine learning methods: random forest and long short-term memory (LSTM) recurrent neural networks. The developed machine learning models are applied to predict RUL for a large number of hard disk drives. Preliminary experimental results indicate that random forest method using only the current snapshot of SMART attributes is comparable to or outperforms LSTM, which models historical temporal patterns of SMART sequences using a more sophisticated architecture.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130478900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
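The snapshot-based prediction idea above can be sketched with a deliberately simple stand-in model: a nearest-neighbour lookup rather than the paper's random forest or LSTM. The `history` list of `(features, rul)` pairs is assumed purely for illustration:

```python
import math

def predict_rul(snapshot, history):
    """Predict remaining useful life (RUL) from a single SMART snapshot by
    returning the RUL label of the nearest historical snapshot.

    A toy stand-in for a learned regressor: real models generalise instead
    of memorising, but the input/output contract is the same."""
    best_rul, best_dist = None, math.inf
    for features, rul in history:
        dist = math.dist(snapshot, features)
        if dist < best_dist:
            best_rul, best_dist = rul, dist
    return best_rul
```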
{"title":"DynMDL: A Parallel Trajectory Segmentation Algorithm","authors":"Eleazar Leal, L. Gruenwald","doi":"10.1109/BigDataCongress.2018.00036","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00036","url":null,"abstract":"The purpose of trajectory segmentation algorithms is to replace an input trajectory by a sub-trajectory with fewer points than the input, but that is also a good approximation to the original trajectory. As such, trajectory segmentation is an essential pre-processing step for trajectory mining algorithms, such as clustering. Among the segmentation strategies that are commonly used for trajectory clustering is Minimum Description Length (MDL)-based segmentation, which consists in finding a sub-trajectory such that the sum of its distance to the input trajectory and its overall length is minimum. However, there are no efficient algorithms for optimal MDL-based segmentation; there are only approximate algorithms. In this work we fill this gap by proposing a parallel multicore algorithm for MDL-based trajectory segmentation. We use three real-life datasets to show that our algorithm achieves optimal MDL, and compare its performance against Traclus, the state-of-the-art approximate Description Length (DL) segmentation algorithm.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133155299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
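The MDL trade-off described above, hypothesis length L(H) versus fit L(D|H), can be sketched as a per-segment cost (a simplified, Traclus-style formulation; the angular distance term used by the full algorithms is omitted here):

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / length

def mdl_cost(points, i, j):
    """MDL cost of replacing points[i..j] by the single segment (i, j):
    L(H) encodes the segment's length, L(D|H) the perpendicular error.
    Segmentation then seeks the split points minimising the total cost."""
    lh = math.log2(1 + math.dist(points[i], points[j]))
    ld = math.log2(1 + sum(perp_dist(points[k], points[i], points[j])
                           for k in range(i + 1, j)))
    return lh + ld
```

Collinear interior points incur no L(D|H) penalty, so a straight run collapses to one segment; a sharp detour raises the cost and forces a split.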
{"title":"Big Data Quality: A Survey","authors":"Ikbal Taleb, M. Serhani, R. Dssouli","doi":"10.1109/BigDataCongress.2018.00029","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00029","url":null,"abstract":"With the advances in communication technologies and the vast amount of data generated, collected, and stored, it becomes crucial to manage the quality of this data deluge in an efficient and cost-effective way. Storage, processing, privacy, and analytics are the key challenging aspects of Big Data that require quality evaluation and monitoring. Quality has been recognized by the Big Data community as an essential facet of its maturity. It is a crucial practice that should be implemented at the earliest stages of the Big Data lifecycle and progressively applied across its other key processes: the earlier quality is incorporated, the greater the benefit gained from insights. In this paper, we first identify the key challenges that necessitate quality evaluation. We then survey, classify, and discuss the most recent work on Big Data quality management. Consequently, we propose an across-the-board quality management framework describing the key quality evaluation practices to be conducted through the different Big Data stages. The framework can be used to support quality management and to provide a roadmap for data scientists to better understand quality practices and the importance of managing quality. 
Finally, we conclude the paper and point to some future research directions on the quality of Big Data.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132170066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Trip Recommendation System: Balancing Travelers among POIs with MapReduce","authors":"S. Migliorini, D. Carra, A. Belussi","doi":"10.1109/BigDataCongress.2018.00045","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00045","url":null,"abstract":"Travel recommendation systems provide suggestions to users based on different information, such as user preferences, needs, or constraints. The recommendation may also take into account some characteristics of the points of interest (POIs) to be visited, such as opening hours or peak hours. Although a number of studies have been proposed on the topic, most of them tailor the recommendation from the user's viewpoint, without evaluating the impact of the suggestions on the system as a whole. This may lead to oscillatory dynamics, where the choices made by the system generate new peak hours. This paper considers the trip planning problem that takes into account the balancing of users among the different POIs. To this aim, we estimate the level of crowding at POIs, including both historical data and the effects of the recommendations themselves. We formulate the problem as a multi-objective optimization problem, and we design a recommendation engine that explores the solution space in near real-time, through a distributed version of the Simulated Annealing approach. 
Through an experimental evaluation on a real dataset, we show that our solution provides high-quality recommendations while ensuring that the attractions are not overcrowded.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132203517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
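The annealing idea above can be sketched as a toy, single-threaded balancer. Only the crowding term of the multi-objective problem is modelled here, and the cooling schedule and cost function are illustrative choices, not the paper's:

```python
import math
import random

def anneal_assignment(n_users, n_pois, steps=2000, seed=7):
    """Toy simulated-annealing balancer: assign users to POIs so that the
    sum of squared POI loads (a stand-in for the crowding objective) is
    minimised.  The paper's engine is distributed and multi-objective."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_pois) for _ in range(n_users)]

    def cost():
        loads = [0] * n_pois
        for p in assign:
            loads[p] += 1
        return sum(load * load for load in loads)

    current, temp = cost(), 1.0
    for _ in range(steps):
        user, poi = rng.randrange(n_users), rng.randrange(n_pois)
        old = assign[user]
        assign[user] = poi          # propose moving one user to another POI
        candidate = cost()
        # Accept improvements always, worse moves with Boltzmann probability.
        if candidate <= current or rng.random() < math.exp((current - candidate) / temp):
            current = candidate
        else:
            assign[user] = old      # reject: undo the move
        temp = max(1e-3, temp * 0.995)
    return assign, current
```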
{"title":"XRT: Programming-Language Independent MapReduce on Shared-Memory Systems","authors":"Erik G. Selin, H. Viktor","doi":"10.1109/BigDataCongress.2018.00031","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00031","url":null,"abstract":"Increasing processor core-counts have created an opportunity for efficient parallel processing of large datasets on shared-memory systems. When compared to clusters of networked commodity hardware, shared-memory systems have the potential to provide better per-core performance, a more straightforward development environment and reduced operational overhead. This paper presents XRT, a high-performance and programming-language independent MapReduce runtime for shared-memory systems. XRT is built to be simple to use, pedantic about resource usage and capable of utilizing disk-based data structures for processing datasets too large to fit in memory. To our knowledge, XRT is the first MapReduce runtime explicitly designed for programming-language independent MapReduce. Moreover, XRT is the first MapReduce runtime for shared-memory systems taking advantage of disk-based data structures for processing datasets which cannot fit in memory. Benchmarks of three common data processing problems demonstrate the disk-based capabilities as well as the excellent speedup profile of XRT as system core-counts increase.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"44 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131858977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
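The MapReduce contract that such a runtime implements can be sketched in memory (XRT itself is programming-language independent and uses disk-backed structures for oversized datasets; this toy keeps everything in RAM and is not XRT's API):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-memory MapReduce: map each record to (key, value) pairs,
    shuffle by key, then reduce each key's list of values."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}
```

For example, word count is `map_reduce(lines, lambda l: [(w, 1) for w in l.split()], lambda k, vs: sum(vs))`.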
{"title":"Towards a Better Replica Management for Hadoop Distributed File System","authors":"Hilmi Egemen Ciritoglu, Takfarinas Saber, Teodora Sandra Buda, John Murphy, Christina Thorpe","doi":"10.1109/BigDataCongress.2018.00021","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00021","url":null,"abstract":"The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. 
We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126978163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
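The balanced-deletion idea can be sketched as a greedy, load-only heuristic (the paper's algorithm is additionally workload-aware, and all names and data shapes here are illustrative):

```python
def decrease_replication(replicas, node_load, target):
    """Balance-aware replica deletion sketch: when lowering a block's
    replication factor, delete replicas from the most-loaded nodes first,
    so the remaining data stays spread evenly across the cluster."""
    replicas = list(replicas)
    while len(replicas) > target:
        victim = max(replicas, key=lambda node: node_load[node])
        replicas.remove(victim)
        node_load[victim] -= 1
    return replicas
```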
{"title":"Biparti Majority Learning with Tensors","authors":"Chia-Lun Lee, Shun-Wen Hsiao, Fang Yu","doi":"10.1109/BigDataCongress.2018.00038","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00038","url":null,"abstract":"Beyond mislabeled training data, which can interfere with the effectiveness of learning, training is also difficult in a dynamic environment where the majority pattern changes. We propose an efficient bipartite majority learning algorithm (BML) for categorical data classification with tensors on a single-hidden-layer feedforward neural network (SLFN). We adopt the resistant learning approach to avoid significant impact from data anomalies and then iteratively conduct bipartite classification for the majorities. The bipartite algorithm reduces training time significantly while keeping accuracy competitive with previous resistant learning algorithms.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123937015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast and Incremental Development Life Cycle for Data Analytics as a Service","authors":"C. Ardagna, V. Bellandi, P. Ceravolo, E. Damiani, B. D. Martino, Salvatore D'Angelo, A. Esposito","doi":"10.1109/BigDataCongress.2018.00030","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00030","url":null,"abstract":"Big Data does not only refer to a huge amount of diverse and heterogeneous data. It also points to the management of procedures, technologies, and competencies associated with the analysis of such data, with the aim of supporting high-quality decision making. There are, however, several obstacles to the effective management of a Big Data computation, such as data velocity, variety, and veracity, and technological complexity, which represent the main barriers towards the full adoption of the Big Data paradigm. The goal of this work is to define a new software Development Life Cycle for the design and implementation of a Big Data computation. Our proposal integrates two model-driven methods: a first method based on pre-configured services that reduces the cost of deployment and a second method based on custom component development that provides an incremental process of refinement and customization. 
The proposal is experimentally evaluated by clustering a data set of the distribution of the population in the United States based on contextual criteria.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"77 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121111123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time Series Sanitization with Metric-Based Privacy","authors":"Liyue Fan, Luca Bonomi","doi":"10.1109/BigDataCongress.2018.00047","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00047","url":null,"abstract":"The increasing popularity of connected devices has given rise to the vast generation of time series data. Due to consumer privacy concerns, the data collected from individual devices must be sanitized before sharing with untrusted third parties. However, existing time series privacy solutions do not provide provable guarantees for individual time series and may not extend to data from a wide range of application domains. In this paper, we adopt a generalized privacy notion based on differential privacy for individual time series sanitization and the Discrete Cosine Transform to model the characteristics of time series data. We extend previously reported 2-dimensional results to arbitrary k-dimensional space. Empirical evaluation with various datasets demonstrates the applicability of our proposed method with the standard mean squared error (MSE) and in classification tasks.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117110152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
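The DCT-based pipeline above can be sketched end to end: transform the series, perturb the leading coefficients with Laplace noise, zero the rest, and reconstruct. Calibrating the noise scale to the metric-based privacy guarantee is the paper's contribution and is not reproduced here; `k`, `scale`, and the naive O(n²) transforms are illustrative:

```python
import math
import random

def dct(x):
    """Orthonormal DCT-II of a real sequence (naive O(n^2) version)."""
    n = len(x)
    return [math.sqrt((1 if k == 0 else 2) / n) *
            sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def idct(c):
    """Orthonormal DCT-III, the inverse of dct above."""
    n = len(c)
    return [sum(math.sqrt((1 if k == 0 else 2) / n) * c[k] *
                math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
            for i in range(n)]

def sanitize(series, k, scale, seed=0):
    """Keep the first k DCT coefficients, add Laplace noise of the given
    scale (sampled as the difference of two exponentials), zero the rest,
    and reconstruct the sanitized series."""
    rng = random.Random(seed)
    coeffs = dct(series)
    noisy = [c + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
             if j < k else 0.0
             for j, c in enumerate(coeffs)]
    return idct(noisy)
```

Truncating to the first k coefficients keeps the low-frequency shape of the series while bounding the sensitivity the noise must cover.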