Spatiotemporal data mining in the era of big spatial data: algorithms and applications
Ranga Raju Vatsavai, A. Ganguly, V. Chandola, A. Stefanidis, S. Klasky, S. Shekhar
International Workshop on Analytics for Big Geospatial Data, 2012. DOI: 10.1145/2447481.2447482

Abstract: Spatial data mining is the process of discovering interesting, previously unknown, but potentially useful patterns from spatial and spatiotemporal data. However, the explosive growth of spatial and spatiotemporal data, together with the emergence of social media and location-sensing technologies, emphasizes the need for new, computationally efficient methods tailored to analyzing big data. In this paper, we review major spatial data mining algorithms, closely examining their computational and I/O requirements, and highlight a few applications dealing with big spatial data.
Computing the drainage network on huge grid terrains
Thiago L. Gomes, S. V. G. Magalhães, M. Andrade, W. Randolph Franklin, Guilherme C. Pena
International Workshop on Analytics for Big Geospatial Data, 2012. DOI: 10.1145/2447481.2447488

Abstract: We present EMFlow, a very efficient algorithm and its implementation for computing the drainage network, that is, the flow direction and flow accumulation, on huge terrains stored in external memory. It is about 20 times faster than the two most recent and most efficient published methods, TerraFlow and r.watershed.seg. Since processing large datasets can take hours, this improvement is very significant.

EMFlow builds on our previous method, RWFlood, which uses a flooding process to compute the drainage network. To reduce the total number of I/O operations, EMFlow groups the terrain cells into blocks that are stored in a special data structure managed as a cache memory. It also adopts a new strategy of subdividing the terrain into islands that are processed separately.

Because of the recent increase in the volume of high-resolution terrain data, internal-memory algorithms do not run well on most computers; optimizing massive data processing algorithms simultaneously for data movement and computation has thus become a challenge for GIS.
Accelerating satellite image based large-scale settlement detection with GPU
D. Patlolla, E. Bright, Jeanette E. Weaver, A. Cheriyadat
International Workshop on Analytics for Big Geospatial Data, 2012. DOI: 10.1145/2447481.2447487

Abstract: Computer vision algorithms for image analysis are often computationally demanding. Applying such algorithms to large image databases, such as high-resolution satellite imagery covering the entire land surface, can easily saturate the computational capabilities of conventional CPUs. There is great demand for vision algorithms running on high-performance computing (HPC) architectures capable of processing petascale image data. We exploit the parallel processing capability of GPUs to present a GPU-friendly algorithm for robust and efficient detection of settlements from large-scale high-resolution satellite imagery. Feature descriptor generation is an expensive but key step in automated scene analysis. To address this challenge, we present GPU implementations of three different feature descriptors: multiscale Histogram of Oriented Gradients (HOG), Gray Level Co-occurrence Matrix (GLCM) contrast, and local pixel intensity statistics. We perform extensive experimental evaluations of our implementation using diverse and large image datasets. Our GPU implementation of the feature descriptor algorithms achieves speedups of 220 times over the CPU version. We present a highly efficient settlement detection system running on a multi-GPU architecture, capable of extracting human settlement regions from city-scale, sub-meter-resolution aerial imagery spanning roughly 1,200 square kilometers in just 56 seconds, with detection accuracy close to 90%. This speedup, achieved while maintaining high detection accuracy, demonstrates that such computational advances hold the key to petascale image analysis challenges.
{"title":"TMC-pattern: holistic trajectory extraction, modeling and mining","authors":"Roland Assam, T. Seidl","doi":"10.1145/2447481.2447490","DOIUrl":"https://doi.org/10.1145/2447481.2447490","url":null,"abstract":"Mobility data is Big Data. Modeling such raw big location data is quite challenging in terms of quality and runtime efficiency. Mobility data emanating from smart phones and other pervasive devices consists of a combination of spatio-temporal dimensions, as well as some additional contextual dimensions that may range from social network activities, diseases to telephone calls. However, most existing trajectory models focus only on the spatio-temporal dimensions of mobility data and their regions of interest depict only the popularity of a place. In this paper, we propose a novel trajectory model called Time Mobility Context Correlation Pattern (TMC-Pattern), which considers a wide variety of dimensions and utilizes subspace clustering to find contextual regions of interest. In addition, our proposed TMC-Pattern rigorously captures and embeds infrastructural, human, social and behavioral patterns into the trajectory model. We show theoretically and experimentally, how TMC-Pattern can be used for Frequent Location Sequence Mining and Location Prediction with real datasets.","PeriodicalId":416086,"journal":{"name":"International Workshop on Analytics for Big Geospatial Data","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121183433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EarthDB: scalable analysis of MODIS data using SciDB","authors":"Gary Planthaber, M. Stonebraker, J. Frew","doi":"10.1145/2447481.2447483","DOIUrl":"https://doi.org/10.1145/2447481.2447483","url":null,"abstract":"Earth scientists are increasingly experiencing difficulties with analyzing rapidly growing volumes of complex data. Those who must perform analysis directly on low-level National Aeronautics and Space Administration (NASA) Moderate Resolution Imaging Spectroradiometer (MODIS) Level 1B calibrated and geolocated data, for example, encounter an arcane, high-volume data set that is burdensome to make use of. Instead, Earth scientists typically opt to use higher-level \"canned\" products provided by NASA. However, when these higher-level products fail to meet the requirements of a particular project, a cruel dilemma arises: cope with data products that don't exactly meet the project's needs or spend an enormous amount of resources extracting what is needed from the unadulterated low-level data. In this paper, we present EarthDB, a system that eliminates this dilemma by offering the following contributions:\u0000 1. Enabling painless importing of MODIS Level 1B data into SciDB, a highly scalable science-oriented database platform that abstracts away the complexity of distributed storage and analysis of complex multi-dimensional data,\u0000 2. Defining a schema that unifies storage and representation of MODIS Level 1B data, regardless of its source file,\u0000 3. Supporting fast filtering and analysis of MODIS data through the use of an intuitive, high-level query language rather than complex procedural programming and,\u0000 4. Providing the ability to easily define and reconfigure entire analysis pipelines within the SciDB database, allowing for rapid ad-hoc analysis. To demonstrate this ability, we provide sample benchmarks for the construction of true-color (RGB) and Normalized Difference Vegetative Index (NDVI) images from raw MODIS Level 1B data using relatively simple queries with scalable performance.","PeriodicalId":416086,"journal":{"name":"International Workshop on Analytics for Big Geospatial Data","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114204754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big 3D spatial data processing using cloud computing environment","authors":"R. Sugumaran, Jeff Burnett, Andrew Blinkmann","doi":"10.1145/2447481.2447484","DOIUrl":"https://doi.org/10.1145/2447481.2447484","url":null,"abstract":"Lately, acquiring a large quantity of three-dimensional (3-D) spatial data particularly topographic information has become commonplace with the advent of new technology such as laser scanner or light detection and ranging (LiDAR) and techniques. Though both in the USA and around the globe, the pace of massive 3-D spatial data collection is accelerating, the provision of affordable technology for dealing with issues such as processing, management, archival, dissemination, and analysis of the huge data volumes has lagged behind. Single computers and generic high-end computing are not sufficient to process this massive data and researches started to explore other computing environments. Recently cloud computing environment showed very promising solutions due to availability and affordability. The main goal of this paper is to develop a web-based LiDAR data processing framework called \"Cloud Computing-based LiDAR Processing System (CLiPS)\" to process massive LiDAR data using cloud computing environment. The CLiPS framework implementation was done using ESRI's ArcGIS server, Amazon Elastic Compute Cloud (Amazon EC2), and several open source spatial tools. Some of the applications developed in this project include: 1) preprocessing tools for LiDAR data, 2) generation of large area Digital Elevation Model (DEMs) on the cloud environment, and 3) user-driven DEM derived products. We have used three different terrain types, LiDAR tile sizes, and EC2 instant types (large, Xlarge, and double Xlarge) to test for time and cost comparisons. Undulating terrain data took more time than other two terrain types used in this study and overall cost for the entire project was less than $100.","PeriodicalId":416086,"journal":{"name":"International Workshop on Analytics for Big Geospatial Data","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131563509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sort-based parallel loading of R-trees
Daniar Achakeev, M. Seidemann, Markus Schmidt, B. Seeger
International Workshop on Analytics for Big Geospatial Data, 2012. DOI: 10.1145/2447481.2447489

Abstract: Due to the increasing amount of spatial data, parallel algorithms for processing big spatial data are becoming more and more important. In particular, the shared-nothing architecture is attractive as it offers low-cost data processing. Moreover, popular MapReduce frameworks such as Hadoop allow developing conceptually simple and scalable algorithms for processing big data on this architecture. In this work, we address the problem of parallel loading of R-trees on a shared-nothing platform. The R-tree is a key element of efficient query processing in large spatial databases, but its creation is expensive. We propose a novel, scalable parallel loading algorithm for MapReduce. The core of our parallel loader is the state-of-the-art sequential sort-based query-adaptive R-tree loading algorithm, which builds R-trees optimized according to a commonly used cost model. In contrast to previous methods for loading R-trees with MapReduce, we construct the R-tree level-wise. Our experimental results show an almost linear speedup with the number of machines. Moreover, the resulting R-trees provide better query performance than R-trees built by other competitive bulk-loading algorithms.
Extracting storm-centric characteristics from raw rainfall data for storm analysis and mining
Kulsawasd Jitkajornwanich, R. Elmasri, J. McEnery, Chengkai Li
International Workshop on Analytics for Big Geospatial Data, 2012. DOI: 10.1145/2447481.2447492

Abstract: Most rainfall data is stored in formats that are not easy to analyze and mine, and in these formats the amount of data is enormous. In this paper, we propose techniques to summarize raw rainfall data into a model that facilitates storm analysis and mining while reducing the data size. The result converts raw rainfall data into meaningful storm-centric data, which is then stored in a relational database for easy analysis and mining. The size of the storm data is less than 1% of the size of the raw data. We can determine the spatio-temporal characteristics of a storm, such as how big it is, how many sites it covers, and what its overall depth (precipitation) and duration are. We present formal definitions of the storm-related concepts needed for our data conversion, then describe storm identification algorithms based on these concepts. Our storm identification algorithms analyze the precipitation values of adjacent sites within the period of time that covers the whole storm and combine them to identify the overall storm characteristics.
{"title":"Speeding up large-scale point-in-polygon test based spatial join on GPUs","authors":"Jianting Zhang, Simin You","doi":"10.1145/2447481.2447485","DOIUrl":"https://doi.org/10.1145/2447481.2447485","url":null,"abstract":"Point-in-Polygon (PIP) test is fundamental to spatial databases and GIS. Motivated by the slow response times in joining large-scale point locations with polygons using traditional spatial databases and GIS, we have designed and developed an end-to-end system completely on Graphics Processing Units (GPUs) to associate points with the polygons that they fall within by utilizing massively data parallel computing power of GPUs. The system includes an efficient module to generate point quadrants that have at most K points from large-scale unordered points, a simple grid-file based spatial filtering approach to associate point quadrants and polygons, and, a PIP test module to assign polygons to points in a GPU computing block using both the block and thread level parallelisms. Experiments on joining 170 million points with more than 40 thousand polygons have resulted in a runtime of 11.165 seconds on an Nvidia Quadro 6000 GPU device. In contrast, a baseline serial CPU implementation using state-of-the-art open source GIS packages required 15+ hours to complete. We further discuss several factors and parameters that may affect the system performance.","PeriodicalId":416086,"journal":{"name":"International Workshop on Analytics for Big Geospatial Data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115410178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards scalable ad-hoc climate anomalies search","authors":"P. Baumann, D. Misev","doi":"10.1145/2447481.2447493","DOIUrl":"https://doi.org/10.1145/2447481.2447493","url":null,"abstract":"Meteorological data contribute significantly to \"Big Data\"; however, not only is their volume ranging into Petabyte sizes for single objects a challenge, but also the number of dimensions -- such general 4-D spatio-temporal data cannot be handled through traditional GIS methods and tools. Actually, climate data tend to transcend these dimensions and add an extra time dimension for the simulation run time, ending up with 5-D data cubes.\u0000 Traditional databases, known for their flexibility and scalability, have proven inadequate due to their lack of support for multi-dimensional rasters. Consequently, file-based implementations are being used for serving such data to the community, rather than databases. This is recently overcome by Array Databases which provide storage and query support for this information category of multi-dimensional rasters, thereby unleashing the scalability and flexibility advantages for climate data management.\u0000 In this contribution, we present a case study where non-trivial analytics functionality on n-D climate data cubes has been established. Storage optimization techniques novel to standard databases allow to tune the system for interactive response in many cases. We briefly introduce the rasdaman database system used, present the database schema and practically important queries use case, and report preliminary performance observations. To the best of our knowledge, this is the first non-academic, real-life deployment of an array database for up to 5-D data sets.","PeriodicalId":416086,"journal":{"name":"International Workshop on Analytics for Big Geospatial Data","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132172974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}