{"title":"Modeling the Location Selection of Mirror Servers in Content Delivery Networks","authors":"Peter Hillmann, Tobias Uhlig, G. Rodosek, O. Rose","doi":"10.1109/BigDataCongress.2016.68","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.68","url":null,"abstract":"For a provider of a Content Delivery Network (CDN), the location selection of mirror servers is a complex optimization problem. Generally, the objective is to place the nodes centralized such that all customers have convenient access to the service according to their demands. It is an instance of the k-center problem, which is proven to be NP-hard. Determining reasonable server locations directly influences run time effects and future service costs. We model, simulate, and optimize the properties of a content delivery network. Specifically, considering the server locations in a network infrastructure with prioritized customers and weighted connections. A simulation model for the servers is necessary to analyze the caching behavior in accordance to the targeted customer requests. We analyze the problem and compare different optimization strategies. For our simulation, we employ various realistic scenarios and evaluate several performance indicators. Our new optimization approach shows a significant improvement. The presented results are generally applicable to other domains with k-center problems, e.g., the placement of military bases, the planning and placement of facility locations, or data mining.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":" 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113947138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of a Multidimensional Data Retrieval Sorting Optimization Model","authors":"Danfeng Yan, Liying Zhang, Xuan Zhao","doi":"10.1109/BigDataCongress.2016.38","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.38","url":null,"abstract":"Currently, how to accurately and quickly locate required information from the massive network data, especially from the current popular social network data, is the focus of data retrieval services. Based on the traditional data retrieval sorting technology, this paper proposes a multi-dimensional data retrieval sorting optimization model, considering the characteristics of data, users and applications. Meanwhile, this paper implements this model in the system of financial microblog data retrieval. It enables the retrieval system to sort the results according to the characteristics of the microblog data, users' real query intentions and financial tendency of the system. Finally, this paper shows the basic test results, and future researches are discussed.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132496049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open Source Big Data Analytics Frameworks Written in Scala","authors":"J. Miller, Casey N. Bowman, V. Harish, Shannon P. Quinn","doi":"10.1109/BigDataCongress.2016.61","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.61","url":null,"abstract":"Frameworks for big data arguably began with Google's use of MapReduce. Since then, a huge amount of progress has been made in the development of big data frameworks, many of which have been released as open source. Further to increase portability and ease of set-up, many are coded in a Java Virtual Machine (JVM) based language, e.g., Java or Scala. In addition, processing of big data involves the flow of data, and of course, the processing of data as it flows. This computational paradigm is a natural for functional programming. Furthermore, the map, reduce and combiner have analogs in functional programming. There has been a trend in the last few years toward developing open source big data frameworks written in Scala to support big data analytics. Scala is a modern JVM language that supports both object-oriented and functional programming paradigms.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"22 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124090328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complex Quality of Service Lifecycle Assessment Methodology","authors":"R. Maule","doi":"10.1109/BigDataCongress.2016.71","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.71","url":null,"abstract":"Large-scale systems engineering projects involving hundreds of independent systems with complex systems integration requirements and high levels of security necessitate specialized analytics methodology to ensure systems readiness across their operational lifecycle. This includes assessment of systems, components, processes and services over time, and in the range of technical, operational and environmental contexts in which the service will operate. This paper presents a quality of service audit method for assessment of complex integrated services.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"38 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131963004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-Aware Big Data Warehouse Architecture","authors":"Karthik Navuluri, R. Mukkamala, Aftab Ahmad","doi":"10.1109/BigDataCongress.2016.53","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.53","url":null,"abstract":"Along with the ever increasing growth in data collection and its mining, there is an increasing fear of compromising individual and population privacy. Several techniques have been proposed in literature to preserve privacy of collected data while storing and processing. In this paper, we propose a privacy-aware architecture for storing and processing data in a Big Data warehouse. In particular, we propose a flexible, extendable, and adaptable architecture that enforces user specified privacy requirements in the form of Embedded Privacy Agreements. The paper discusses the details of the architecture with some implementation details.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115245185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification as a Service: Large-Scale Cloud Service Discovery over the World Wide Web","authors":"Abdullah Alfazi, Quan Z. Sheng, W. Zhang, Lina Yao, Talal H. Noor","doi":"10.1109/BigDataCongress.2016.74","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.74","url":null,"abstract":"Cloud computing is provisioned with high flexibility with regard to on demand infrastructures, platforms and software as services through the Internet. The unique characteristics of cloud services such as dynamic and diverse services offering at different levels, as well as the lack of standardized description, are becoming important challenges in efficiently discovering cloud services for customers. In this paper, we propose a cloud service search engine that has the capability to automatically identify cloud services aiming at improving the accuracy when searching cloud services in real environments. Our search engine can detect cloud services effectively from the Web sources. Furthermore, we focus on learning the cloud service features, such as similarity function, semantic ontology and cloud service components to identify the cloud services. We use a real cloud service dataset to build an identifier. Our cloud service identifier can be used to automatically determine whether a given Web source is a cloud service with high accuracy.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133438414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards an Efficient Top-K Trajectory Similarity Query Processing Algorithm for Big Trajectory Data on GPGPUs","authors":"Eleazar Leal, L. Gruenwald, Jianting Zhang, Simin You","doi":"10.1109/BigDataCongress.2016.33","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.33","url":null,"abstract":"Through the use of location-sensing devices, it has been possible to collect very large datasets of trajectories. These datasets make it possible to issue spatio-temporal queries with which users can gather information about the characteristics of the movements of objects, derive patterns from that information, and understand the objects themselves. Among such spatio-temporal queries that can be issued is the top-K trajectory similarity query. This query finds many applications, such as bird migration analysis in ecology and trajectory sharing in social networks. However, the large size of the trajectory query sets and databases poses significant computational challenges. In this work, we propose a parallel GPGPU algorithm Top-KaBT that is specifically designed to reduce the size of the candidate set generated while processing these queries, and in doing so strives to address these computational challenges. The experiments show that the state of the art top-K trajectory similarity query processing algorithm on GPGPUs, TKSimGPU, achieves a 6.44X speedup in query processing time when combined with our algorithm and a 13X speedup over a GPGPU algorithm that uses exhaustive search.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125581834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Don't Fire Me, a Kernel Autoregressive Hybrid Model for Optimal Layoff Plan","authors":"Zhiling Luo, Ying Li, Ruisheng Fu, Jianwei Yin","doi":"10.1109/BigDataCongress.2016.72","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.72","url":null,"abstract":"Job cutting occurs when a modern service enterprise reduces the employing labour cost by firing some staffs. Making an appropriate layoff plan is always quite difficult since a bad job cutting has a serious impact on not only the organization but also the business process executing efficiency. Therefore, in this paper, we address the problem of making an optimal layoff plan with the least influence on the executing of the business process. The key challenge is estimating the process throughput under a layoff plan. We overcome this challenge by two steps: regressing the activity throughput by the stuff number and inferring process throughput by the maximum flow or minimum cut algorithm on the Directed Acyclic Graph of process. In the regressing step, a kernel autoregressive hybrid model is proposed, whose MSE is 30% lower than SVM. After that, an augmenting path based algorithm is introduced to make an optimal layoff plan. To evaluate the accuracy of our model, we conduct an external experiment on a real dataset from the workflow system employed in the government of Hangzhou City in China, which results in 9750969 logs from 2050 activities and 16295 employees in two years.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134328921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geelytics: Enabling On-Demand Edge Analytics over Scoped Data Sources","authors":"Bin Cheng, Apostolos Papageorgiou, M. Bauer","doi":"10.1109/BigDataCongress.2016.21","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.21","url":null,"abstract":"Large-scale Internet of Things (IoT) systems typically consist of a large number of sensors and actuators distributed geographically in a physical environment. To react fast on real time situations, it is often required to bridge sensors and actuators via real-time stream processing close to IoT devices. Existing stream processing platforms like Apache Storm and S4 are designed for intensive stream processing in a cluster or in the Cloud, but they are unsuitable for large scale IoT systems in which processing tasks are expected to be triggered by actuators on-demand and then be allocated and performed in a Cloud-Edge environment. To fill this gap, we designed and implemented a new system called Geelytics, which can enable on-demand edge analytics over scoped data sources via IoT-friendly interfaces to sensors and actuators. This paper presents its design, implementation, interfaces, and core algorithms. Three example applications have been built to showcase the potential of Geelytics in enabling advanced IoT edge analytics. Our preliminary evaluation results demonstrate that we can reduce the bandwidth cost by 99% in a face detection example, achieve less than 10 milliseconds reacting latency and about 1.5 seconds startup latency in an outlier detection example, and also save 65% duplicated computation cost via sharing intermediate results in a data aggregation example.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121374805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Infra: SLO Aware Elastic Auto-scaling in the Cloud for Cost Reduction","authors":"Subhajit Sidhanta, S. Mukhopadhyay","doi":"10.1109/BigDataCongress.2016.25","DOIUrl":"https://doi.org/10.1109/BigDataCongress.2016.25","url":null,"abstract":"Enterprises often host applications and services on clusters of virtual machine instances provided by cloud service providers, like Amazon, Rackspace, Microsoft, etc. Users pay a cloud usage cost on the basis of the hourly usage [1] of virtual machine instances composing the cluster. A cluster composition refers to the number of virtual machine instances of each type (from a predefined list of types) comprising a cluster. We present Infra, a cloud provisioning framework that can predict an (ϵ, δ)-minimum cluster composition required to run a given application workload on a cloud under an SLO (i.e., Service Level Objective) deadline. This paper does not present a new approximation algorithm, instead we provide a tool that applies existing machine learning techniques to predict an (ϵ, δ)-minimum cluster composition. An (ϵ, δ)-minimum cluster composition specifies a cluster composition whose cost approximates that of the minimum cluster composition (i.e., the cluster composition that incurs the minimum cloud usage cost that must be incurred in executing a given application under an SLO deadline); the approximation bounds the error to a predefined threshold ϵ with a degree of confidence 100 * (1 - δ)%. The degree of confidence 100 * (1 - δ)% specifies that the probability of failure in achieving the error threshold ϵ for the above approximation is at most δ. For ϵ = 0.1 and δ = 0.02, we experimentally demonstrate that an (ϵ, δ)-minimum cluster composition predicted by Infra successfully approximates the minimum cluster composition, i.e., the accuracy of prediction of minimum cluster composition ranges from 93.1% to 97.99% (the error is bound by the error threshold of 0.1) with a 98% degree of confidence, since 100* (1 - δ) = 98%. Auto scaling refers to the process of automatically adding cloud instances to a cluster to adapt to an increase in application workload (increased request rate), and deleting instances from a cluster when there is a decrease in workload (reduced request rate). However, state-of-the-art auto scaling techniques have the following disadvantages: A) they require explicit policy definition for changing the cluster configuration and therefore lack the ability to automatically adapt a cluster with respect to changing workload, B) they do not compute the appropriate size of resources required, and therefore do not result in an “optimal” cluster composition. Infra provides an auto scaler that automatically adapts a cloud infrastructure to changing application workload, scaling the cluster up/down based on predictions from the Infra provisioning tool.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128435628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}