{"title":"A Unified Framework for Multi-view Spectral Clustering","authors":"Guo Zhong, Chi-Man Pun","doi":"10.1109/ICDE48307.2020.00187","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00187","url":null,"abstract":"In the era of big data, multi-view clustering has drawn considerable attention in machine learning and data mining communities due to the existence of a large number of unlabeled multi-view data in reality. Traditional spectral graph theoretic methods have recently been extended to multi-view clustering and shown outstanding performance. However, most of them still consist of two separate stages: learning a fixed common real matrix (i.e., continuous labels) of all the views from original data, and then applying K-means to the resulting common label matrix to obtain the final clustering results. To address these, we design a unified multi-view spectral clustering scheme to learn the discrete cluster indicator matrix in one stage. Specifically, the proposed framework directly obtain clustering results without performing K-means clustering. Experimental results on several famous benchmark datasets verify the effectiveness and superiority of the proposed method compared to the state-of-the-arts.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"22 1","pages":"1854-1857"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82083095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"StructSim: Querying Structural Node Similarity at Billion Scale","authors":"Xiaoshuang Chen, Longbin Lai, Lu Qin, Xuemin Lin","doi":"10.1109/ICDE48307.2020.00211","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00211","url":null,"abstract":"Structural node similarity is widely used in analyzing complex networks. As one of the structural node similarity metrics, role similarity has the good merit of indicating automorphism (isomorphism). Existing algorithms to compute role similarity (e.g., RoleSim and NED) suffer from severe performance bottlenecks, and thus cannot handle large real-world graphs. In this paper, we propose a new framework StructSim to compute nodes’ role similarity. Under this framework, we prove that StructSim is guaranteed to be an admissible role similarity metric based on the maximum matching. While maximum matching is too costly to scale, we then devise the BinCount matching to speed up the computation. BinCount-based StructSim admits a precomputed index to query one single pair in O(k log D) time, where k is a small user-defined parameter and D is the maximum node degree. Extensive empirical studies show that StructSim is significantly faster than existing works for computing structural node similarities on the real-world graphs, with comparable effectiveness.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"30 1","pages":"1950-1953"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81154043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Being Happy with the Least: Achieving α-happiness with Minimum Number of Tuples","authors":"Min Xie, R. C. Wong, Peng Peng, V. Tsotras","doi":"10.1109/ICDE48307.2020.00092","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00092","url":null,"abstract":"When faced with a database containing millions of products, a user may be only interested in a (typically much) smaller representative subset. Various approaches were proposed to create a good representative subset that fits the user’s needs which are expressed in the form of a utility function (e.g., the top-k and diversification query). Recently, a regret minimization query was proposed: it does not require users to provide their utility functions and returns a small set of tuples such that any user’s favorite tuple in this subset is guaranteed to be not much worse than his/her favorite tuple in the whole database. In a sense, this query finds a small set of tuples that makes the user happy (i.e., not regretful) even if s/he gets the best tuple in the selected set but not the best tuple among all tuples in the database.In this paper, we study the min-size version of the regret minimization query; that is, we want to determine the least tuples needed to keep users happy at a given level. We term this problem as the α-happiness query where we quantify the user’s happiness level by a criterion, called the happiness ratio, and guarantee that each user is at least α happy with the set returned (i.e., the happiness ratio is at least α) where α is a real number from 0 to 1. As this is an NP-hard problem, we derive an approximate solution with theoretical guarantee by considering the problem from a geometric perspective. Since in practical scenarios, users are interested in achieving higher happiness levels (i.e., α is closer to 1), we performed extensive experiments for these scenarios, using both real and synthetic datasets. Our evaluations show that our algorithm outperforms the best-known previous approaches in two ways: (i) it answers the α-happiness query by returning fewer tuples to users and, (ii) it answers much faster (up to two orders of magnitude times improvement for large α).","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"95 1","pages":"1009-1020"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82782539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlashSchema: Achieving High Quality XML Schemas with Powerful Inference Algorithms and Large-scale Schema Data","authors":"Yeting Li, Jialun Cao, H. Chen, Tingjian Ge, Zhiwu Xu, Qiancheng Peng","doi":"10.1109/ICDE48307.2020.00214","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00214","url":null,"abstract":"Getting high quality XML schemas to avoid or reduce application risks is an important problem in practice, for which some important aspects have yet to be addressed satisfactorily in existing work. In this paper, we propose a tool FlashSchema for high quality XML schema design, which supports both one-pass and interactive schema design and schema recommendation. To the best of our knowledge, no other existing tools support interactive schema design and schema recommendation. One salient feature of our work is the design of algorithms to infer k-occurrence interleaving regular expressions, which are not only more powerful in model capacity, but also more efficient. Additionally, such algorithms form the basis of our interactive schema design. The other feature is that, starting from large-scale schema data that we have harvested from the Web, we devise a new solution for type inference, as well as propose schema recommendation for schema design. Finally, we conduct a series of experiments on two XML datasets, comparing with 9 state-of-the-art algorithms and open-source tools in terms of running time, preciseness, and conciseness. Experimental results show that our work achieves the highest level of preciseness and conciseness within only a few seconds. Experimental results and examples also demonstrate the effectiveness of our type inference and schema recommendation methods.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"820 1","pages":"1962-1965"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80995124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Indoor Mobility Semantics Annotation Using Coupled Conditional Markov Networks","authors":"Huan Li, Hua Lu, M. A. Cheema, L. Shou, Gang Chen","doi":"10.1109/ICDE48307.2020.00128","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00128","url":null,"abstract":"Indoor mobility semantics analytics can greatly benefit many pertinent applications. Existing semantic annotation methods mainly focus on outdoor space and require extra knowledge such as POI category or human activity regularity. However, these conditions are difficult to meet in indoor venues with relatively small extents but complex topology. This work studies the annotation of indoor mobility semantics that describe an object’s mobility event (what ) at a semantic indoor region (where ) during a time period (when ). A coupled conditional Markov network (C2MN) is proposed with a set of feature functions carefully designed by incorporating indoor topology and mobility behaviors. C2MN is able to capture probabilistic dependencies among positioning records, semantic regions, and mobility events jointly. Nevertheless, the correlation of regions and events hinders the parameters learning. Therefore, we devise an alternate learning algorithm to enable the parameter learning over correlated variables. The extensive experiments demonstrate that our C2MN-based semantic annotation is efficient and effective on both real and synthetic indoor mobility data.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"19 1","pages":"1441-1452"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81988838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contribution Maximization in Probabilistic Datalog","authors":"T. Milo, Y. Moskovitch, Brit Youngmann","doi":"10.1109/ICDE48307.2020.00076","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00076","url":null,"abstract":"The use of probabilistic datalog programs has been recently advocated for applications that involve recursive computation and uncertainty. While using such programs allows for a flexible knowledge derivation, it makes the analysis of query results a challenging task. Particularly, given a set O of output tuples and a number k, one would like to understand which k-size subset of the input tuples have contributed the most to the derivation of O. This is useful for multiple tasks, such as identifying the critical sources of errors and understanding surprising results. Previous works have mainly focused on the quantification of tuples contribution to a query result in non-recursive SQL queries, very often disregarding probabilistic inference. To quantify the contribution in probabilistic datalog programs, one must account for the recursive relations between input and output data, and the uncertainty. To this end, we formalize the Contribution Maximization (CM) problem. We then reduce CM to the well-studied Influence Maximization (IM) problem, showing that we can harness techniques developed for IM to our setting. However, we show that such naïve adoption results in poor performance. To overcome this, we propose an optimized algorithm which injects a refined variant of the classic Magic Sets technique, integrated with a sampling method, into IM algorithms, achieving a significant saving of space and execution time. Our experiments demonstrate the effectiveness of our algorithm, even where the naïve approach is infeasible.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"72 1","pages":"817-828"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78747207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of Database Systems with DRAM-only Heterogeneous Memory Architecture","authors":"Yifan Qiao","doi":"10.1109/ICDE48307.2020.00243","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00243","url":null,"abstract":"This thesis advocates a novel DRAM-only strategy to reduce the computing system memory cost for the first time, and investigates its applications to database systems. This thesis envisions a low-cost DRAM module called block-protected DRAM, which reduces bit cost by significantly relaxing the DRAM raw reliability and meanwhile employs long error correction code (ECC) to ensure data integrity at small coding redundancy. Built upon the exactly same DRAM technology, today’s byte-accessible DRAM and envisioned block-protected DRAM strike at different trade-offs between memory bit cost and native data access granularity, and naturally form a heterogeneous DRAM-only memory system. The practical feasibility of such heterogeneous memory systems is further strengthened by the new media-agnostic and latency-oblivious CPU-memory interfaces such as IBM’s OpenCAPI/OMI and Intel’s CXL. This DRAM-only design approach perfectly leverages the existing DRAM manufacturing infrastructure and is not subject to any fundamental technology risk and uncertainty. Hence, before NVM technologies could eventually fulfill their long-awaited promises (i.e., DRAM-grade speed at flash-grade cost), this DRAM-only design framework can fill the gap to empower continuous progress and advances of computing systems. This thesis aims to develop techniques that enable relational and NoSQL databases to take full advantage of the envisioned low-cost heterogeneous DRAM system. As the first step, we studied how one could employ heterogeneous DRAM to implement a low-cost tiered caching solution for relational database, and obtained encouraging results using MySQL as a test vehicle.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"60 1","pages":"2054-2058"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75053755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Trichromatic Pickup and Delivery Scheduling in Spatial Crowdsourcing","authors":"Bolong Zheng, Chenze Huang, Christian S. Jensen, Lu Chen, Nguyen Quoc Viet Hung, Guanfeng Liu, Guohui Li, Kai Zheng","doi":"10.1109/ICDE48307.2020.00089","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00089","url":null,"abstract":"In Pickup-and-Delivery problems (PDP), mobile workers are employed to pick up and deliver items with the goal of reducing travel and fuel consumption. Unlike most existing efforts that focus on finding a schedule that enables the delivery of as many items as possible at the lowest cost, we consider trichromatic (worker-item-task) utility that encompasses worker reliability, item quality, and task profitability. Moreover, we allow customers to specify keywords for desired items when they submit tasks, which may result in multiple pickup options, thus further increasing the difficulty of the problem. Specifically, we formulate the problem of Online Trichromatic Pickup and Delivery Scheduling (OTPD) that aims to find optimal delivery schedules with highest overall utility. In order to quickly respond to submitted tasks, we propose a greedy solution that finds the schedule with the highest utility-cost ratio. Next, we introduce a skyline kinetic tree-based solution that materializes intermediate results to improve the result quality. Finally, we propose a density-based grouping solution that partitions streaming tasks and efficiently assigns them to the workers with high overall utility. Extensive experiments with real and synthetic data offer evidence that the proposed solutions excel over baselines with respect to both effectiveness and efficiency.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"10 1","pages":"973-984"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74223517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BFT-Store: Storage Partition for Permissioned Blockchain via Erasure Coding","authors":"Xiaodong Qi, Zhao Zhang, Cheqing Jin, Aoying Zhou","doi":"10.1109/ICDE48307.2020.00205","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00205","url":null,"abstract":"The full-replication data storage mechanism, as commonly utilized in existing blockchain systems, is lack of sufficient storage scalability, since it reserves a copy of the whole block data in each node so that the overall storage consumption per block is O(n) with n nodes. Moreover, due to the existence of Byzantine nodes, existing partitioning methods, though widely adopted in distributed systems for decades, cannot suit for blockchain systems directly, thereby it is critical to devise a new storage mechanism. This paper proposes a novel storage engine, called BFT-Store, to enhance storage scalability by integrating erasure coding with Byzantine Fault Tolerance (BFT) consensus protocol. First, the storage consumption per block can be reduced to O(1), which enlarges overall storage capability when more nodes join blockchain. Second, an efficient online re-encoding protocol is designed for storage scale-out and a hybrid replication scheme is employed to improve reading performance. Last, extensive experimental results illustrate the scalability, availability and efficiency of BFT-Store, which is implemented on an open-source permissioned blockchain Tendermint.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"89 1","pages":"1926-1929"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73216128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vCBIR: A Verifiable Search Engine for Content-Based Image Retrieval","authors":"Shangwei Guo, Yang Ji, Ce Zhang, Cheng Xu, Jianliang Xu","doi":"10.1109/ICDE48307.2020.00156","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00156","url":null,"abstract":"We demonstrate vCBIR, a verifiable search engine for Content-Based Image Retrieval. vCBIR allows a small or medium-sized enterprise to outsource its image database to a cloud-based service provider and ensures the integrity of query processing. Like other common data-as-a-service (DaaS) systems, vCBIR consists of three parties: (i) the image owner who outsources its database, (ii) the service provider who executes the authenticated query processing, and (iii) the client who issues search queries. By employing a novel query authentication scheme proposed in our prior work [4], the system not only supports cloud-based image retrieval, but also generates a cryptographic proof for each query, by which the client could verify the integrity of query results. During the demonstration, we will showcase the usage of vCBIR and also provide attendees interactive experience of verifying query results against an untrustworthy service provider through graphical user interface (GUI).","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"5 1","pages":"1730-1733"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75899262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}