M. K. Behera, S. Kalyan, Prasanna Venkatesh, A. Wolski
{"title":"SINCA: Scalable in-memory event aggregation using clustered operators","authors":"M. K. Behera, S. Kalyan, Prasanna Venkatesh, A. Wolski","doi":"10.1109/ICDEW.2015.7129578","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129578","url":null,"abstract":"Analytical processing of various information created in the operation of social media requires queries involving grouping and aggregating of large volumes of detail data. Any advanced query processing method should take into account two dominating hardware trends: increasing main memory capacities and increasing parallel processing capacity exposed as growing number of cores per processor chip. We introduce a scalable in-memory method for data aggregation (SINCA), using clustered operators, which profits from the hardware trends. The method uses a concept of a microengine being a set of resources that can be utilized in parallel, with great efficiency. The resulting parallelized aggregation algorithm is characterized by a low overhead and high volume, and is suitable to both real-time and extract-transform-load scenarios. The core idea of the method is to use real-time histograms to partition the data for grouping. As the data is already grouped during the partitioning phase, the group aggregation can be done very efficiently. Additionally, some of the grouped data can be cached for re-use in subsequent queries.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122867961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"POP: A Passenger-Oriented Partners matching system","authors":"Xiaoyi Duan, Cheqing Jin, Xiaoling Wang","doi":"10.1109/ICDEW.2015.7129560","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129560","url":null,"abstract":"Sharing one taxi among several passengers is considered promising, since it makes taking a taxi during rush hour more convenient. Hence, we develop POP, a prototype system that finds appropriate partners to share a taxi with a given passenger. The framework of POP includes two phases, namely offline preprocessing and online matching. During the offline preprocessing phase, it constructs an R-tree index for the road network to speed up data access and computes the average travel time for each road segment based on historical trajectory data, while during the online matching phase, it tries to find appropriate partners for a given passenger, aiming to save as much time as possible. We also propose a simple pricing method to allocate the fee among passengers.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114169134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fan Xia, Qunyan Zhang, Chengyu Wang, Weining Qian, Aoying Zhou
{"title":"On the rise and fall of Sina Weibo: Analysis based on a fixed user group","authors":"Fan Xia, Qunyan Zhang, Chengyu Wang, Weining Qian, Aoying Zhou","doi":"10.1109/ICDEW.2015.7129580","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129580","url":null,"abstract":"Micro-blogging service Sina Weibo in China became the country's most free-flowing and important source of news and opinions just a few years ago. Following its launch in the summer of 2009, Sina Weibo grew quickly, attracting hundreds of millions of users, and saw its biggest boom around 2011. However, several reports indicate a decrease in activity on Sina Weibo. In our study, we reveal the prosperity and decline of Sina Weibo by analyzing how a fixed user group's collective behaviors change throughout the whole development process. A huge dataset based on Sina Weibo along with search engine data is used in this study. In this paper we model the popularity of a single tweet and of multiple tweets. Then we define the statistic representing the capability of information propagation of Sina Weibo. The well-known time series prediction model, ARMA, is used to model and predict its trend. In addition, we extract both internal features, i.e. features of Sina Weibo, and external features, i.e. the public's attention. Their trends are presented and analyzed. Then detailed experiments are conducted to measure the correlation and causality between them and our proposed statistic. The approaches we present in this paper clearly show the prosperity and decline of this microblogging community.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130349198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kai Zhang, Keqiang Wang, Xiaoling Wang, Cheqing Jin, Aoying Zhou
{"title":"Hotel recommendation based on user preference analysis","authors":"Kai Zhang, Keqiang Wang, Xiaoling Wang, Cheqing Jin, Aoying Zhou","doi":"10.1109/ICDEW.2015.7129564","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129564","url":null,"abstract":"Recommender systems offer personalized suggestions by analyzing user preferences. However, their performance falls sharply when they encounter sparse data, especially when they meet a cold-start user. Hotels are a kind of goods that suffers a lot from the sparsity issue due to extremely low rating frequency. In order to handle these issues, this paper proposes a novel hotel recommendation framework. The main contributions include: 1) We combine collaborative filtering (CF) with a content-based (CBF) method to overcome the sparsity issue while ensuring high accuracy. 2) Travel intents are introduced to provide additional information for user preference analysis. 3) To provide recommendations that are as broad as possible, diversity techniques are employed. 4) Several experiments are conducted on the real Ctrip dataset; the results show that the proposed hybrid framework is competitive against classical approaches.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128018935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On crowdsensed data acquisition using multi-dimensional point processes","authors":"Saket K. Sathe, T. Sellis, K. Aberer","doi":"10.1109/ICDEW.2015.7129562","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129562","url":null,"abstract":"Crowdsensing applications are increasing at a tremendous rate. In crowdsensing, mobile sensors (humans, vehicle-mounted sensors, etc.) generate streams of information that are used for inferring high-level phenomena of interest (e.g., traffic jams, air pollution). Unlike traditional sensor network data, crowdsensed data has a highly skewed spatio-temporal distribution caused largely by the mobility of sensors [1]. Thus, systems that can mitigate this effect by acquiring crowdsensed data at a fixed spatio-temporal rate are needed. In this paper we propose using multi-dimensional point processes (MDPPs), a mathematical modeling tool that can be effectively used for performing this data acquisition task.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134451658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-scale spatial join query processing in Cloud","authors":"Simin You, Jianting Zhang, L. Gruenwald","doi":"10.1109/ICDEW.2015.7129541","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129541","url":null,"abstract":"The rapidly increasing amount of location data available in many applications has made it desirable to process their large-scale spatial queries in the Cloud for performance and scalability. We report our designs and implementations of two prototype systems that are ready for Cloud deployments: SpatialSpark based on Apache Spark and ISP-MC based on Cloudera Impala. Both systems support indexed spatial joins based on point-in-polygon tests and point-to-polyline distance computation. Experiments on the pickup locations of ~170 million taxi trips in New York City and ~10 million global species occurrence records have demonstrated both efficiency and scalability using Amazon EC2 clusters.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124233587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yun Ma, Qing Li, Zhenguo Yang, Zheng Lu, Haiwei Pan, Antoni B. Chan
{"title":"An SVD-based Multimodal Clustering method for Social Event Detection","authors":"Yun Ma, Qing Li, Zhenguo Yang, Zheng Lu, Haiwei Pan, Antoni B. Chan","doi":"10.1109/ICDEW.2015.7129577","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129577","url":null,"abstract":"With the rapid development of social media sites such as Flickr, user-generated multimedia content on the Web has shown explosive growth in recent years. Social event detection from these large multimedia collections has become one of the hottest topics in the analysis of Web content. In this paper, an SVD-based Multimodal Clustering (SVDMC) algorithm is proposed to detect social events from multimodal data. SVDMC is a completely unsupervised approach aiming to take full advantage of the data at hand. By using the binary adjacency matrix and Singular Value Decomposition (SVD), SVDMC is robust to data incompleteness in real-world datasets. Experiments conducted on the MediaEval Social Event Detection (SED) 2012 dataset demonstrate the effectiveness of the proposed method as well as the discriminative power of different features.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126116097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing online news dissemination via structure learning: An experimental view","authors":"Ruiqi Li, Yanli Hu, Jiuyang Tang, W. Xiao","doi":"10.1109/ICDEW.2015.7129572","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129572","url":null,"abstract":"Online information dissemination has attracted unprecedented attention with the proliferation of the Internet. This paper investigates how news is disseminated through key online media. Key media are defined to include two categories: a) leader media, whose reports will be reproduced by numerous other media; b) source media, serving as the information counselor for leader ones. By analyzing the appearance of the same report on various online media, we are able to locate key media in news dissemination and predict the path of dissemination. We provide initial experimental results on real-life datasets, and the results, presented in the form of a Bayesian network, indicate the unique influence of online media in three different categories during the process of reporting.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129529314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AIR: Adaptive Index Replacement in Hadoop","authors":"Stefan Schuh, J. Dittrich","doi":"10.1109/ICDEW.2015.7129539","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129539","url":null,"abstract":"The Hadoop Distributed Filesystem has become the de-facto standard for storing large datasets in data management systems such as Hadoop MapReduce, Hive, and Stratosphere. Though HDFS was originally designed to support scan-oriented operations, recently several techniques for HDFS have been developed to allow for efficient indexing. One of these indexing techniques is aggressive indexing, i.e. HDFS replicas are immediately indexed at upload time before touching any disk - creating multiple clustered indexes almost for free on the way. A second technique is adaptive indexing, i.e. HDFS blocks are only indexed on demand as a side effect of query processing. Though these techniques provide impressive speed-ups in terms of query processing, they totally ignored the costs involved with storing a large number of replicas of a particular dataset. The HDFS-variants of adaptive indexing were already designed to leverage the natural redundancy that comes with HDFS, typically storing a dataset three times anyway. However, it is questionable whether storing an unlimited number of replicas for a dataset is a practical solution. Therefore, this paper is the first to analyze adaptive indexing under a space constraint, i.e. we assume that indexes are adaptively created and deleted. We coin this problem the Adaptive Index Replacement problem. We present a new algorithm to solve the online AIR problem called LeastExpectedBenefit-K and compare it with several existing state-of-the-art online Index Selection algorithms. We present a comprehensive study evaluating ten different algorithms. Our results show that our algorithm LEB-2 is efficient and robust and a good choice in practice.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132082021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating query processing with parallel languages","authors":"Brandon Myers","doi":"10.1109/ICDEW.2015.7129583","DOIUrl":"https://doi.org/10.1109/ICDEW.2015.7129583","url":null,"abstract":"In this thesis we propose new techniques for using parallel languages to improve query processing. Optimizing a query plan and its particular implementation is important for efficient processing on modern systems. First, we present our work on a parallel representation of queries using partitioned global address space languages that enables new optimizations. Next, we propose future work on cooperative optimization of query plans and imperative programs in the context of parallel applications that include queries.","PeriodicalId":333151,"journal":{"name":"2015 31st IEEE International Conference on Data Engineering Workshops","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126673292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}