Abdulrahman Salama, Mahmoud Elkamhawy, Mohamed Ali, Ehab Al-Masri, Adel Sabour, Abdeltawab M. Hendawi, Ming Tan, Vashutosh Agrawal, Ravi Prakash
{"title":"A Computer Vision Approach for Detecting Discrepancies in Map Textual Labels","authors":"Abdulrahman Salama, Mahmoud Elkamhawy, Mohamed Ali, Ehab Al-Masri, Adel Sabour, Abdeltawab M. Hendawi, Ming Tan, Vashutosh Agrawal, Ravi Prakash","doi":"10.1145/3603719.3603722","DOIUrl":"https://doi.org/10.1145/3603719.3603722","url":null,"abstract":"Maps provide various sources of information. An important example of such information is textual labels such as cities, neighborhoods, and street names. Although we treat this information as facts, and despite the massive effort made by providers to continuously improve their accuracy, this data is far from perfect. Discrepancies in textual labels rendered on the map are one of the major sources of inconsistencies across map providers. These discrepancies can have significant impacts on the reliability of the derived information and decision-making processes. Thus, it is important to validate the accuracy and consistency of such data. Most providers treat this data as proprietary and do not make it available to the public, so we cannot compare the data directly. To address these challenges, we introduce a novel computer vision-based approach for automatically extracting and classifying labels based on the visual characteristics of the label, which indicates its category based on the format convention used by the specific map provider. Based on the extracted data, we detect the degree of discrepancies across map providers. We consider three map providers: Bing Maps, Google Maps, and OpenStreetMaps. The neural network we develop classifies the text labels with an accuracy of up to 93% across all providers. We leverage our system to analyze randomly selected regions in different markets. The studied markets are USA, Germany, France, and Brazil. Experimental results and statistical analysis reveal the amount of discrepancies across map providers per region. 
We calculate the Jaccard distance between the extracted text sets for each pair of map providers, which represents the discrepancy percentage. Discrepancy percentages as high as 90% were found in some markets.","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121443821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESM2-Tree: An maintenance efficient authentication data structure in blockchain","authors":"Yuzhou Fang, Liang Cai, Weiwei Qiu, Fanglei Huang, Huaihai Hui","doi":"10.1145/3603719.3603721","DOIUrl":"https://doi.org/10.1145/3603719.3603721","url":null,"abstract":"Blockchain technology is gaining broader attention. Owing to its immutability property and byzantine fault-tolerance consensus protocol, blockchain offers a brand new trusted data-sharing solution. Some researchers use blockchain to drive autonomous collaboration among smart devices, which face massive spatial data updates and usage. The key challenge lies in designing an authenticated data structure (ADS) that can efficiently process spatial data and queries. However, the previous schemes could not handle spatial data efficiently or did not consider the efficiency of frequent data updates. In this paper, we take a step toward implementing a maintenance-efficient ADS on the blockchain, called ESM2-Tree, which is not only good at processing spatial data but also effective in supporting authenticated spatial queries by partitioning and merging data at different granularities. 
Theoretical analysis and empirical evaluation validate the performance of our ADS, which reduces the overall data structure maintenance overhead by about 50% in a uniform data distribution scenario.","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126622780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Machine Learning Queries with Linear Algebra Query Processing","authors":"Wenbo Sun, Asterios Katsifodimos, Rihan Hai","doi":"10.1145/3603719.3603726","DOIUrl":"https://doi.org/10.1145/3603719.3603726","url":null,"abstract":"The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing and model predictions often operate in separate execution environments, leading to redundant engineering and computations. Additionally, the diverging mathematical foundations of data processing and machine learning hinder cross-optimizations by combining these two components, thereby overlooking potential opportunities to expedite predictive pipelines. In this paper, we propose an operator fusing method based on GPU-accelerated linear algebraic evaluation of relational queries. Our method leverages linear algebra computation properties to merge operators in machine learning predictions and data processing, significantly accelerating predictive pipelines by up to 317x. We perform a complexity analysis to deliver quantitative insights into the advantages of operator fusion, considering various data and model dimensions. Furthermore, we extensively evaluate matrix multiplication query processing utilizing the widely used Star Schema Benchmark. 
Through comprehensive evaluations, we demonstrate the effectiveness and potential of our approach in improving the efficiency of data processing and machine learning workloads on modern hardware.","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134464156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonathan Will, L. Thamsen, Dominik Scheinert, O. Kao
{"title":"Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?","authors":"Jonathan Will, L. Thamsen, Dominik Scheinert, O. Kao","doi":"10.1145/3603719.3603733","DOIUrl":"https://doi.org/10.1145/3603719.3603733","url":null,"abstract":"Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with in-memory data processing frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements towards an automated solution for efficient cluster resource allocation.","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117190771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matteo Ceccarello, Anton Dignös, J. Gamper, Christina Khnaisser
{"title":"Indexing Temporal Relations for Range-Duration Queries","authors":"Matteo Ceccarello, Anton Dignös, J. Gamper, Christina Khnaisser","doi":"10.1145/3603719.3603732","DOIUrl":"https://doi.org/10.1145/3603719.3603732","url":null,"abstract":"Temporal information plays a crucial role in many database applications; however, support for queries on such data is limited. We present an index structure, termed RD-index, to support range-duration queries over interval timestamped relations, which constrain both the range of the tuples’ positions on the timeline and their duration. RD-index is a grid structure in the two-dimensional space, representing the position on the timeline and the duration of timestamps, respectively. Instead of using a regular grid, we consider the data distribution for the construction of the grid in order to ensure that each grid cell contains approximately the same number of intervals. RD-index features provable bounds on the running time of all the operations, allows for a simple implementation, and supports very predictable query performance. We benchmark our solution on a variety of datasets and query workloads, investigating both the query rate and the behavior of the individual queries. The results show that RD-index performs better than the baselines on range-duration queries, for which it is explicitly designed. Furthermore, it outperforms state-of-the-art indexes also on mixed workloads containing queries that constrain either only the duration or the range along with range-duration queries. 
Finally, the size of the RD-index is in all settings smaller than that of the competitors.","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115646788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","authors":"","doi":"10.1145/3603719","DOIUrl":"https://doi.org/10.1145/3603719","url":null,"abstract":"","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125538792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}