{"title":"MassJoin: A mapreduce-based method for scalable string similarity joins","authors":"Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng","doi":"10.1109/ICDE.2014.6816663","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816663","url":null,"abstract":"String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based similarity functions and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions. We utilize the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs to significantly reduce the number of key-value pairs, from cubic to linear complexity, while not sacrificing the pruning power. To improve the performance, we incorporate “light-weight” filter units into the key-value pairs which can be utilized to prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperformed state-of-the-art approaches.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133010211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LinkSCAN*: Overlapping community detection using the link-space transformation","authors":"Sungsu Lim, Seungwoo Ryu, Sejeong Kwon, Kyomin Jung, Jae-Gil Lee","doi":"10.1109/ICDE.2014.6816659","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816659","url":null,"abstract":"In this paper, for overlapping community detection, we propose a novel framework of the link-space transformation that transforms a given original graph into a link-space graph. Its unique idea is to consider topological structure and link similarity separately using two distinct types of graphs: the line graph and the original graph. For topological structure, each link of the original graph is mapped to a node of the link-space graph, which enables us to discover overlapping communities using non-overlapping community detection algorithms as in the line graph. For link similarity, it is calculated on the original graph and carried over into the link-space graph, which enables us to keep the original structure on the transformed graph. Thus, our transformation, by combining these two advantages, facilitates overlapping community detection as well as improves the resulting quality. Based on this framework, we develop the algorithm LinkSCAN that performs structural clustering on the link-space graph. Moreover, we propose the algorithm LinkSCAN* that enhances the efficiency of LinkSCAN by sampling. Extensive experiments were conducted using the LFR benchmark networks as well as some real-world networks. The results show that our algorithms achieve higher accuracy, quality, and coverage than the state-of-the-art algorithms.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132632266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"R-Store: A scalable distributed system for supporting real-time analytics","authors":"Feng Li, M. Tamer Özsu, Gang Chen, B. Ooi","doi":"10.1109/ICDE.2014.6816638","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816638","url":null,"abstract":"It is widely recognized that OLTP and OLAP queries have different data access patterns, processing needs and requirements. Hence, the OLTP queries and OLAP queries are typically handled by two different systems, and the data are periodically extracted from the OLTP system, transformed and loaded into the OLAP system for data analysis. With the awareness of the ability of big data in providing enterprises useful insights from vast amounts of data, effective and timely decisions derived from real-time analytics are important. It is therefore desirable to provide real-time OLAP querying support, where OLAP queries read the latest data while OLTP queries create the new versions. In this paper, we propose R-Store, a scalable distributed system for supporting real-time OLAP by extending the MapReduce framework. We extend an open source distributed key/value system, HBase, as the underlying storage system that stores the data cube and real-time data. When real-time data are updated, they are streamed to a streaming MapReduce, namely Hstreaming, for updating the cube on an incremental basis. Based on the metadata stored in the storage system, either the data cube or OLTP database or both are used by the MapReduce jobs for OLAP queries. We propose techniques to efficiently scan the real-time data in the storage system, and design an adaptive algorithm to process the real-time query based on our proposed cost model. The main objectives are to ensure the freshness of answers and low processing latency. The experiments conducted on the TPC-H data set demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133094344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Outsourcing multi-version key-value stores with verifiable data freshness","authors":"Y. Tang, Ling Liu, Ting Wang, Xin Hu, R. Sailer, P. Pietzuch","doi":"10.1109/ICDE.2014.6816744","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816744","url":null,"abstract":"In the age of big data, key-value data updated by intensive write streams is increasingly common, e.g., in social event streams. To serve such data in a cost-effective manner, a popular new paradigm is to outsource it to the cloud and store it in a scalable key-value store while serving a large user base. Due to the limited trust in third-party cloud infrastructures, data owners have to sign the data stream so that the data users can verify the authenticity of query results from the cloud. In this paper, we address the problem of verifiable freshness for multi-version key-value data. We propose a memory-resident digest structure that utilizes limited memory effectively and can have efficient verification performance. The proposed structure is named IncBM-Tree because it can INCrementally build a Bloom filter-embedded Merkle Tree. We have demonstrated the superior performance of verification under small memory footprints for signing, which is typical in an outsourcing scenario where data owners and users have limited resources.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124867777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient instant-fuzzy search with proximity ranking","authors":"Inci Cetindil, Jamshid Esmaelnezhad, Taewoo Kim, Chen Li","doi":"10.1109/ICDE.2014.6816662","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816662","url":null,"abstract":"Instant search is an emerging information-retrieval paradigm in which a system finds answers to a query instantly while a user types in keywords character-by-character. Fuzzy search further improves user search experiences by finding relevant answers with keywords similar to query keywords. A main computational challenge in this paradigm is the high-speed requirement, i.e., each query needs to be answered within milliseconds to achieve an instant response and a high query throughput. At the same time, we also need good ranking functions that consider the proximity of keywords to compute relevance scores. In this paper, we study how to integrate proximity information into ranking in instant-fuzzy search while achieving efficient time and space complexities. We adapt existing solutions on proximity ranking to instant-fuzzy search. A naïve solution is computing all answers then ranking them, but it cannot meet this high-speed requirement on large data sets when there are too many answers, so there are studies of early-termination techniques to efficiently compute relevant answers. To overcome the space and time limitations of these solutions, we propose an approach that focuses on common phrases in the data and queries, assuming records with these phrases are ranked higher. We study how to index these phrases and develop an incremental-computation algorithm for efficiently segmenting a query into phrases and computing relevant answers. We conducted a thorough experimental study on real data sets to show the tradeoffs between time, space, and quality of these solutions.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129508014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kondenzer: Exploration and visualization of archived social media","authors":"Omar Alonso, Kartikay Khandelwal","doi":"10.1109/ICDE.2014.6816741","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816741","url":null,"abstract":"Modern social networks such as Twitter provide a platform for people to express their opinions on a variety of topics ranging from personal to global. While the factual part of this information and the opinions of various experts are archived by sources such as Wikipedia and reputable news articles, the opinion of the general public is drowned out in a sea of noise and “un-interesting” information. In this demo we present Kondenzer - an offline system for condensing, archiving and visualizing social data. Specifically, we create digests of social data using a combination of filtering, duplicate removal and efficient clustering. This gives a condensed set of high quality data which is used to generate facets and create a collection that can be visualized using the PivotViewer control.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128213786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of the effect of Category Match Score in search advertising","authors":"Youngchul Cha, Junghoo Cho, Jian Yuan, Tak W. Yan","doi":"10.1109/ICDE.2014.6816731","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816731","url":null,"abstract":"Categorical (topic) similarity between a web page and an advertisement (ad) text has long been used for contextual advertising. In this paper, we explore the use of the categorical similarity score, referred to as Category Match Score (CMS), in the context of search advertising. In particular, we explore the effect of CMS on various ad-effectiveness prediction tasks, including user-judgment prediction, ad click-through-rate prediction (CTR), and revenue-per-impression prediction. Our extensive experiments on two editorial datasets and one live traffic dataset demonstrate that CMS is one of the strongest features in the judgment prediction task and that CMS-based filtering is very effective in improving revenue per impression as well as CTR. We believe that our analyses can be extremely effective in helping web service providers serve more relevant and profitable ads to users.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129611493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"C-DMr: Crowd-powered Decision Maker for real world Knapsack Problems","authors":"Leihao Xia, Caleb Chen Cao, Lei Chen, Zhao Chen","doi":"10.1109/ICDE.2014.6816734","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816734","url":null,"abstract":"Knapsack problems range over a large sphere of real world challenges. For example, every year a professor has to decide her new “squad” of students/staff from possibly hundreds of candidates, while having a restricted budget of funding in consideration. Moreover, in many cases, she has to resort to her colleagues and senior students to make comparisons among the candidates. The difficulties of such tasks are mainly three-fold: 1) the knowledge about the candidates is distributed among a crowd; 2) the underlying factors are human-intrinsic and hard to be formatted; 3) the size of candidates exceeds the capacity of human for a one-shot decision. Other examples in this category include gear set preparation for a venture trip, syllabus design for a popular course and inventory design for goods shelf, where the two difficulties are commonly observed. Consequently, a person may be heavily entangled to work out a final decision, which may even be inaccurate. Driven by this demand, in this demo, we present C-DMr - a Crowd-powered Decision Maker that incorporates the wisdom of the informed crowds to solve such real world Knapsack Problems. The core module of this web-based system is a set of algorithms along with a novel interactive interface. The interface incrementally presents comparison jobs and motivates the crowd to participate with a rewarding mechanism, and the set of algorithms solves the Knapsack Problem given only pairwise preferences among candidates. We demonstrate the novelty and usefulness of C-DMr by forming the aforementioned “squad” for a recruiting professor. Specifically, four functionalities are shown: 1) a Candidates Entrance that collects the information about all candidates; 2) a Jury Trial that facilitates informed crowds to contribute preferences; 3) a Knapsack Analyzer that measures the on-going “squad”; and 4) a Consultant that recommends a final set of candidates to the professor.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132145523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical k nearest neighbor queries with location privacy","authors":"X. Yi, Russell Paulet, E. Bertino, V. Varadharajan","doi":"10.1109/ICDE.2014.6816688","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816688","url":null,"abstract":"In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. In this paper, we study k nearest neighbor (kNN) queries where the mobile user queries the location-based service (LBS) provider about k nearest points of interest (POIs) on the basis of his current location. We propose a solution for the mobile user to preserve his location privacy in kNN queries. The proposed solution is built on the Paillier public-key cryptosystem and can provide both location privacy and data privacy. In particular, our solution allows the mobile user to retrieve one type of POIs, for example, k nearest car parks, without revealing to the LBS provider what type of points is retrieved. For a cloaking region with n×n cells and m types of points, the total communication complexity for the mobile user to retrieve a type of k nearest POIs is O(n + m) while the computation complexities of the mobile user and the LBS provider are O(n + m) and O(n²m), respectively. Compared with existing solutions for kNN queries with location privacy, our solutions are more efficient. Experiments have shown that our solutions are practical for kNN queries.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"13 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128846995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting group recommendation functions for flexible preferences","authors":"Senjuti Basu Roy, Saravanan Thirumuruganathan, S. Amer-Yahia, Gautam Das, Cong Yu","doi":"10.1109/ICDE.2014.6816669","DOIUrl":"https://doi.org/10.1109/ICDE.2014.6816669","url":null,"abstract":"We examine the problem of enabling the flexibility of updating one's preferences in group recommendation. In our setting, any group member can provide a vector of preferences that, in addition to past preferences and other group members' preferences, will be accounted for in computing group recommendation. This functionality is essential in many group recommendation applications, such as travel planning, online games, book clubs, or strategic voting, as it has been previously shown that user preferences may vary depending on mood, context, and company (i.e., other people in the group). Preferences are enforced in a feedback box that replaces preferences provided by the users by a potentially different feedback vector that is better suited for maximizing the individual satisfaction when computing the group recommendation. The feedback box interacts with a traditional recommendation box that implements a group consensus semantics in the form of Aggregated Voting or Least Misery, two popular aggregation functions for group recommendation. We develop efficient algorithms to compute robust group recommendations that are appropriate in situations where users have changing preferences. Our extensive empirical study on real world data-sets validates our findings.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116622068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}