2014 IEEE 30th International Conference on Data Engineering: Latest Papers (Page 6)

Incremental cluster evolution tracking from highly dynamic network data

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816635

Pei Lee, L. Lakshmanan, E. Milios

Dynamic networks are commonly found in the current web age. In scenarios like social networks and social media, dynamic networks are noisy, are of large-scale and evolve quickly. In this paper, we focus on the cluster evolution tracking problem on highly dynamic networks, with clear application to event evolution tracking. There are several previous works on data stream clustering using a node-by-node approach for maintaining clusters. However, handling of bulk updates, i.e., a subgraph at a time, is critical for achieving acceptable performance over very large highly dynamic networks. We propose a subgraph-by-subgraph incremental tracking framework for cluster evolution in this paper. To effectively illustrate the techniques in our framework, we consider the event evolution tracking task in social streams as an application, where a social stream and an event are modeled as a dynamic post network and a dynamic cluster respectively. By monitoring through a fading time window, we introduce a skeletal graph to summarize the information in the dynamic network, and formalize cluster evolution patterns using a group of primitive evolution operations and their algebra. Two incremental computation algorithms are developed to maintain clusters and track evolution patterns as time rolls on and the network evolves. Our detailed experimental evaluation on large Twitter datasets demonstrates that our framework can effectively track the complete set of cluster evolution patterns from highly dynamic networks on the fly.

Citations: 77

An efficient sampling method for characterizing points of interests on maps

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816719

P. Wang, Wenbo He, Xue Liu

Citations: 13

GLog: A high level graph analysis system using MapReduce

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816680

Jun Gao, Jiashuai Zhou, Chang Zhou, J. Yu

With the rapid growth of graphs in different applications, it is inevitable to leverage existing distributed data processing frameworks in managing large graphs. Although these frameworks ease the developing cost, it is still cumbersome and error-prone for developers to implement complex graph analysis tasks in distributed environments. Additionally, developers have to learn the details of these frameworks quite well, which is a key to improve the performance of distributed jobs. This paper introduces a high level query language called GLog and proposes its evaluation method to overcome these limitations. Specifically, we first design a RG (Relational-Graph) data model to mix relational data and graph data, and extend Datalog to GLog on RG tables to support various graph analysis tasks. Second, we define operations on RG tables, and show translation templates to convert a GLog query into a sequence of MapReduce jobs. Third, we propose two strategies, namely rule merging and iteration rewriting, to optimize the translated jobs. The final experiments show that GLog can not only express various graph analysis tasks in a more succinct way, but also achieve a better performance for most of the graph analysis tasks than Pig, another high level dataflow system.

Citations: 17

We can learn your #hashtags: Connecting tweets to explicit topics

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816706

W. Feng, Jianyong Wang

In Twitter, users can annotate tweets with hashtags to indicate the ongoing topics. Hashtags provide users a convenient way to categorize tweets. From the system's perspective, hashtags play an important role in tweet retrieval, event detection, topic tracking, and advertising, etc. Annotating tweets with the right hashtags can lead to a better user experience. However, two problems remain unsolved during an annotation: (1) Before the user decides to create a new hashtag, is there any way to help her/him find out whether some related hashtags have already been created and widely used? (2) Different users may have different preferences for categorizing tweets. However, few work has been done to study the personalization issue in hashtag recommendation. To address the above problems, we propose a statistical model for personalized hashtag recommendation in this paper. With millions of <tweet, hashtag> pairs being published everyday, we are able to learn the complex mappings from tweets to hashtags with the wisdom of the crowd. Two questions are answered in the model: (1) Different from traditional item recommendation data, users and tweets in Twitter have rich auxiliary information like URLs, mentions, locations, social relations, etc. How can we incorporate these features for hashtag recommendation? (2) Different hashtags have different temporal characteristics. Hashtags related to breaking events in the physical world have strong rise-and-fall temporal pattern while some other hashtags remain stable in the system. How can we incorporate hashtag related features to serve for hashtag recommendation? With all the above factors considered, we show that our model successfully outperforms existing methods on real datasets crawled from Twitter.

Citations: 29

On masking topical intent in keyword search

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816656

Peng Wang, C. Ravishankar

Citations: 12

Data quality: The other face of Big Data

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816764

B. Saha, D. Srivastava

In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the “data to speak for itself” in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading-off accuracy vs efficiency, and identifies a range of open problems for the community.

Citations: 217

The Vertica Query Optimizer: The case for specialized query optimizers

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816727

Nga Tran, Andrew Lamb, Lakshmikant Shrinivas, Sreenath Bodagala, J. Dave

Citations: 7

Distributed and interactive cube exploration

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816674

N. Kamat, Prasanth Jayachandran, Karthik Tunga, Arnab Nandi

Citations: 139

Locality-sensitive operators for parallel main-memory database clusters

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816684

Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, A. Kemper, Thomas Neumann

The growth in compute speed has outpaced the growth in network bandwidth over the last decades. This has led to an increasing performance gap between local and distributed processing. A parallel database cluster thus has to maximize the locality of query processing. A common technique to this end is to co-partition relations to avoid expensive data shuffling across the network. However, this is limited to one attribute per relation and is expensive to maintain in the face of updates. Other attributes often exhibit a fuzzy co-location due to correlations with the distribution key but current approaches do not leverage this. In this paper, we introduce locality-sensitive data shuffling, which can dramatically reduce the amount of network communication for distributed operators such as join and aggregation. We present four novel techniques: (i) optimal partition assignment exploits locality to reduce the network phase duration; (ii) communication scheduling avoids bandwidth underutilization due to cross traffic; (iii) adaptive radix partitioning retains locality during data repartitioning and handles value skew gracefully; and (iv) selective broadcast reduces network communication in the presence of extreme value skew or large numbers of duplicates. We present comprehensive experimental results, which show that our techniques can improve performance by up to factor of 5 for fuzzy co-location and a factor of 3 for inputs with value skew.

Citations: 61

RuleMiner: Data quality rules discovery

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI: 10.1109/ICDE.2014.6816746

Xu Chu, I. Ilyas, Paolo Papotti, Yin Ye

Citations: 30