{"title":"An Architecture for Cost Optimization in the Processing of Big Geospatial Data in Public Cloud Providers","authors":"João Bachiega, M. Reis, M. Holanda, Aleteia P. F. Araujo","doi":"10.1109/BigDataCongress.2018.00032","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00032","url":null,"abstract":"Cloud computing is a suitable platform for running applications to process big data. Currently, with the increase in the volume of geographic and spatial data, conceptualized as Big Geospatial Data, a variety of tools have been developed to efficiently process this data. The index applied to the dataset is an important aspect. This paper presents an architecture, supported by a Knowledge Base and an Inference Engine, to process big geospatial data in public cloud providers with the ultimate goal of optimizing costs. The tests executed demonstrated that the rules created are capable of reducing the total costs for processing large geospatial data by up to 71%.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114528773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning a Joint Low-Rank and Gaussian Model in Matrix Completion with Spectral Regularization and Expectation Maximization Algorithm","authors":"Gang Wu, Ratnesh Kumar","doi":"10.1109/BigDataCongress.2018.00035","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00035","url":null,"abstract":"Completing a partially-known matrix is an important problem in the field of data science and is useful for many related applications, e.g., collaborative filtering for recommendation systems and global positioning in large-scale sensor networks. Low-rank and Gaussian models are two popular classes of models used in matrix completion, both of which have proven successful. In this paper, we introduce a single model that leverages the features of both low-rank and Gaussian models. We develop a novel method based on Expectation Maximization (EM) that involves spectral regularization (for the low-rank part) as well as maximum likelihood estimation (for learning the Gaussian parameters). We also test our framework on real-world movie rating data, and provide comparison results with some of the common methods used for matrix completion.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117191870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Useful ToPIC: Self-Tuning Strategies to Enhance Latent Dirichlet Allocation","authors":"Stefano Proto, Evelina Di Corso, F. Ventura, T. Cerquitelli","doi":"10.1109/BigDataCongress.2018.00012","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00012","url":null,"abstract":"ToPIC (Tuning of Parameters for Inference of Concepts) is a distributed self-tuning engine whose aim is to cluster collections of textual data into correlated groups of documents through a topic modeling methodology (i.e., LDA). ToPIC includes automatic strategies to relieve the end-user of the burden of selecting proper values for the overall analytics process. ToPIC's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. As a case study, ToPIC has been validated on three real collections of textual documents characterized by different distributions. The experimental results show the effectiveness and efficiency of the proposed solution in analyzing collections of documents without tuning algorithm parameters and in discovering cohesive and well-separated groups of documents with a similar topic.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"60 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127582345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-Based Data Relevance Estimation for Large Storage Systems","authors":"V. Venkatesan, Taras Lehinevych, G. Cherubini, A. Glybovets, M. Lantz","doi":"10.1109/BigDataCongress.2018.00040","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00040","url":null,"abstract":"In storage systems, the relevance of files to users can be taken into account to determine storage control policies to reduce cost, while retaining high reliability and performance. The relevance of a file can be estimated by applying supervised learning and using the metadata as features. However, supervised learning requires many training samples to achieve an acceptable estimation accuracy. In this paper, we propose a novel graph-based learning system for the relevance estimation of files using a small training set. First, files are grouped into different file-sets based on the available metadata. Then a parameterized similarity metric among files is introduced for each file-set using the knowledge of the metadata. Finally, message passing over a bipartite graph is applied for relevance estimation. The proposed system is tested on various datasets and compared with logistic regression.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127500497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE BigData Congress 2018 Organizing Committee","authors":"","doi":"10.1109/bigdatacongress.2018.00006","DOIUrl":"https://doi.org/10.1109/bigdatacongress.2018.00006","url":null,"abstract":"","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134123299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of Bi-Level PageRank Algorithm for Power Flow Analysis Using Graph Database","authors":"Chen Yuan, Yi Lu, Kewen Liu, Guangyi Liu, Renchang Dai, Zhiwei Wang","doi":"10.1109/BigDataCongress.2018.00026","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00026","url":null,"abstract":"Compared with a traditional relational database, a graph database (GDB) is a natural expression of most real-world systems. Each node in the GDB is not only a storage unit, but also a logic operation unit that implements local computation in parallel. This paper first explores the feasibility of power system modeling using GDB. Then a brief introduction to the PageRank algorithm and a feasibility analysis of its application in GDB are presented. Then the proposed GDB-based bi-level PageRank algorithm is developed from the PageRank algorithm and the Gauss-Seidel methodology to realize high-performance parallel computation. The MP 10790 case and its extensions, MP 10790*10 and MP 10790*100, are tested to verify the proposed method and investigate its parallelism in GDB. In addition, a provincial system, the FJ case, which includes 1425 buses and 1922 branches, is included in the case study to further demonstrate the proposed algorithm's effectiveness in the real world.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116814199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Personalized Travel Recommendation System Using Social Media Analysis","authors":"Joseph Coelho, Paromita Nitu, P. Madiraju","doi":"10.1109/BigDataCongress.2018.00046","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00046","url":null,"abstract":"Personalization of recommender systems enables customized services to users. Social media is one resource that aids personalization. This study explores the use of Twitter data to personalize travel recommendations. A machine learning classification model is used to identify travel-related tweets. The travel tweets are then used to personalize recommendations regarding places of interest for the user. Places of interest are categorized as: historical buildings, museums, parks, and restaurants. To better personalize the model, travel tweets of the user’s friends and followers are also mined. Volunteer Twitter users were asked to provide their Twitter handle as well as rank their travel category preferences in a survey. We evaluated our model by comparing the predictions made by our model with the users' choices in the survey. The evaluations show 68% prediction accuracy. The accuracy can be improved with a better travel-tweet training dataset as well as a better travel category identification technique using machine learning. The travel categories can be increased to include items like sports venues, musical events, entertainment, etc. and thereby fine-tune the recommendations. The proposed model lists 'n' places of interest from each category in proportion to the travel category score generated by the model.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116574592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autoencoder Evaluation and Hyper-Parameter Tuning in an Unsupervised Setting","authors":"Ellie Ordway-West, P. Parveen, Austin Henslee","doi":"10.1109/BigDataCongress.2018.00034","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00034","url":null,"abstract":"This paper aims to introduce a new methodology for evaluating autoencoder performance and to shorten time spent on heuristic analysis during hyper-parameter tuning. Existing methodologies for evaluating hyper-parameter tuning focus on finding known anomalies in a labeled set or minimizing the average per row reconstruction error as a method of model selection. This paper focuses on anomaly detection in a completely unsupervised setting, where labels are not known during model training or evaluation. This approach uses the approximate Full Width Half Max (FWHM) of the histogram of the per row reconstruction error in conjunction with the average per row reconstruction error and the number of anomalies found to define a new method of model selection that aims to maximize the FWHM while minimizing the average per row reconstruction error. This methodology simplifies and speeds up model evaluation by presenting model results in an intuitive manner and simplifies the heuristic analysis needed to determine the \"best\" model.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117115418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the IEEE BigData Congress 2018 Chairs","authors":"","doi":"10.1109/bigdatacongress.2018.00005","DOIUrl":"https://doi.org/10.1109/bigdatacongress.2018.00005","url":null,"abstract":"","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116903128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latency Measurement of Fine-Grained Operations in Benchmarking Distributed Stream Processing Frameworks","authors":"Giselle van Dongen, Bram Steurtewagen, D. V. Poel","doi":"10.1109/BigDataCongress.2018.00043","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2018.00043","url":null,"abstract":"This paper describes a benchmark for stream processing frameworks allowing accurate latency benchmarking of fine-grained individual stages of a processing pipeline. By determining the latency of distinct common operations in the processing flow instead of the end-to-end latency, we can form guidelines for efficient processing pipeline design. Additionally, we address the issue of defining time in distributed systems by capturing time on one machine and defining the baseline latency. We validate our benchmark for Apache Flink using a processing pipeline comprising common stream processing operations. Our results show that joins are the most time consuming operation in our processing pipeline. The latency incurred by adding a join operation is 4.5 times higher than for a parsing operation, and the latency gradually becomes more dispersed after adding additional stages.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115125591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}