K. Ferreira, L. Ferla, G. R. Queiroz, N. Vijaykumar, Carlos A. Noronha, R. Mariano, Denis Taveira, Gabriel Sansigolo, Orlando Guarnieri, Thomas Rogers, J. Lesser, M. Page, Fernando Atique, D. Musa, Janaina Y. Santos, Diego S. Morais, Cristiane R. Miyasaka, C. Almeida, L. Nascimento, Jaine A. Diniz, M. Santos
{"title":"A Platform for Collaborative Historical Research based on Volunteered Geographical Information","authors":"K. Ferreira, L. Ferla, G. R. Queiroz, N. Vijaykumar, Carlos A. Noronha, R. Mariano, Denis Taveira, Gabriel Sansigolo, Orlando Guarnieri, Thomas Rogers, J. Lesser, M. Page, Fernando Atique, D. Musa, Janaina Y. Santos, Diego S. Morais, Cristiane R. Miyasaka, C. Almeida, L. Nascimento, Jaine A. Diniz, M. Santos","doi":"10.5753/jidm.2018.2046","DOIUrl":"https://doi.org/10.5753/jidm.2018.2046","url":null,"abstract":"Digital humanities research promotes the intersection between digital technologies and humanities, emphasizing free knowledge sharing and collaborative work. Based on digital humanities features, this paper describes the architecture of a computational platform for collaborative historical research designed and developed in an ongoing project called Pauliceia 2.0. This project aims to produce historical data of São Paulo city from 1870 to 1940 and to develop a computational platform that allows researchers to explore, integrate and share urban historical data sets. The Pauliceia 2.0 platform main goal is to use volunteered geographical information (VGI) and crowdsourcing concepts to produce past geographical data and to allow historians to share historical data sets resulting from their researches. In this work, we present the Pauliceia 2.0 platform architecture and its underlying VGI protocol.","PeriodicalId":301338,"journal":{"name":"J. Inf. 
Data Manag.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115332100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. P. Nogueira, C. Celes, H. Martin, A. Loureiro, Rossana M. C. Andrade
{"title":"A Statistical Method for Detecting Move, Stop, and Noise: A Case Study with Bus Trajectories","authors":"T. P. Nogueira, C. Celes, H. Martin, A. Loureiro, Rossana M. C. Andrade","doi":"10.5753/jidm.2018.2041","DOIUrl":"https://doi.org/10.5753/jidm.2018.2041","url":null,"abstract":"The proliferation of devices with positioning capability has allowed new possibilities for studies and applications in the context of urban mobility. However, the process of analyzing raw trajectories poses several challenges. In this work, we investigate one of the main tasks in this process of trajectory analysis: detecting stops from GPS trajectories. Stops can reveal interesting behavior aspects of a moving object such as its daily routine, bottlenecks in traffic jams, or visiting times of touristic places. Although there are some efforts in this direction, most current methods ignore the presence of noise segments, which typically occur many times in trajectories. In this sense, we present a method that exploits gaps in time and space to identify episodes of movement, stop, and periods where some classification is inconclusive, which we define as noise. In addition, our method does not rely on contextual information as opposed to some current solutions, which make our proposal also suitable for trajectories recorded in free space. We compare our method to the state of the art highlighting its advantages in terms of manipulating noise, supporting spatial filtering and being independent of external resources. Moreover, we conduct an experimental evaluation using a large-scale bus dataset to show the effectiveness of our method in a real application scenario.","PeriodicalId":301338,"journal":{"name":"J. Inf. 
Data Manag.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123922570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond Hit-or-Miss: A Comparative Study of Synopses for Similarity Searching","authors":"M. Bedo, Daniel de Oliveira, A. Traina, C. Traina","doi":"10.5753/jidm.2018.1635","DOIUrl":"https://doi.org/10.5753/jidm.2018.1635","url":null,"abstract":"A DBMS optimizer module takes its decisions by modeling the query costs upon the distribution of the data space. Cost modeling of similarity queries, however, requires the representation of distances’ rather than data distributions. Therefore, the finding of a suitable representation (or synopsis) for the distance distribution has a major impact in the optimization of similarity searches. In this study, we evaluate the quality of estimates drawn from five synopses of distinct paradigms regarding two common query criteria. Moreover, we embed the synopses into a new parametric cost model, called Stockpile, for the cost estimation of similarity queries on metric trees. The model uses the synopses estimation for calculating the probability of traversing a metric tree node, which defines the expected number of both disk accesses (I/O costs) and distance calculations (CPU costs). We performed an extensive set of experiments on real-world data sources regarding the estimates of each synopsis (and its parametric variations) by using paired ranking tests. In global terms, three synopses have outperformed their competitors regarding selectivity estimation, whereas two of them have also surpassed the others in the prediction of both I/O and CPU costs with respect to Stockpile model predictions. Additionally, results also indicate the choice of the most suitable synopsis may depend on characteristics of the distance distribution.","PeriodicalId":301338,"journal":{"name":"J. Inf. 
Data Manag.","volume":"42 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113986197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eduardo H. M. Pena, Erik Falk, J. Meira, E. Almeida
{"title":"Mind Your Dependencies for Semantic Query Optimization","authors":"Eduardo H. M. Pena, Erik Falk, J. Meira, E. Almeida","doi":"10.5753/jidm.2018.1633","DOIUrl":"https://doi.org/10.5753/jidm.2018.1633","url":null,"abstract":"Semantic query optimization uses dependencies between attributes to formulate query transformations and revise the number of processed rows, with direct impact on performance. Commercial databases present facilities to define dependencies as not enforced constraints. The goal is to help the query optimizer in cases where the database is denormalized or simply lost dependencies in the design. However, feeding these facilities is a manual task which is tedious and error-prone. An attractive alternative is the automatic discovery of dependencies, but the cost of finding dependencies increases with the number of rows and attributes in the dataset. In this paper, we stick to the automatic discovery approach, but to reduce the cost we focus on dependencies matching the current queries in the pipe (i.e., workload). Initially, we rely on a large set of functional dependencies computed in batch with state-of-the-art algorithms in the literature. Over time our focused dependency selector (FDSel) chooses exemplars to feed the query optimizer. Therewith we eliminate further manual interactions. The automatically selected exemplars exhibit statistical properties that resemble those of the initial dependency set. This demonstrates the effectiveness of our proposed approach. In the best case scenario, by applying the FDSel for join elimination on a real-world database, we reduce query response time by more than one order of magnitude.","PeriodicalId":301338,"journal":{"name":"J. Inf. 
Data Manag.","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132918173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. C. Scabora, Paulo H. Oliveira, Gabriel Spadon, D. S. Kaster, José F. Rodrigues, A. Traina, C. Traina
{"title":"Cutting-edge Relational Graph Data Management with Edge-k: From One to Multiple Edges in the Same Row","authors":"L. C. Scabora, Paulo H. Oliveira, Gabriel Spadon, D. S. Kaster, José F. Rodrigues, A. Traina, C. Traina","doi":"10.5753/jidm.2018.1634","DOIUrl":"https://doi.org/10.5753/jidm.2018.1634","url":null,"abstract":"Relational Database Management Systems (RDBMSs) are widely employed in several applications, including those that deal with data modeled as graphs. Existing solutions store every edge in a distinct row in the edge table; however, in most cases, such modeling does not provide adequate performance. In this work, we propose Edge-k, a technique to group the vertex neighborhood into a reduced number of rows in a table through additional columns that store up to k edges per row. The technique provides a better table organization and reduces both table size and query processing time. We evaluate Edge-k table management for insert, update, delete and bulkload operations, and compare the query processing performance both with the conventional edge table (adopted by the existing frameworks) and with the Neo4j graph database. Experiments using Single-Source Shortest Path (SSSP) queries reveal that our proposed approach always outperforms the conventional edge table and is faster than Neo4j for the first iterations, being slightly slower than Neo4j only for iterations after the whole graph has been loaded from disk to memory. It was able to reach a speedup of 66% over a representative real dataset, with an average reduction of up to 58% in our tests. The average speedup over synthetic datasets was up to 54%. Edge-k was also the best one when performing graph degree distribution queries. Moreover, the Edge-k table obtained a processing time reduction of 70% for bulkload operations, despite having an overhead of 50% for individual insert, update and delete operations. 
Finally, Edge-k advances the state of the art for graph data management within relational database systems.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115954972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carlos S. S. Marinho, L. O. Moreira, E. Coutinho, José S. Costa Filho, F. R. C. Sousa, Javam C. Machado
{"title":"LABAREDA: A Predictive and Elastic Load Balancing Service for Cloud-Replicated Databases","authors":"Carlos S. S. Marinho, L. O. Moreira, E. Coutinho, José S. Costa Filho, F. R. C. Sousa, Javam C. Machado","doi":"10.5753/jidm.2018.1639","DOIUrl":"https://doi.org/10.5753/jidm.2018.1639","url":null,"abstract":"Cloud computing emerges as an alternative to promote quality of service for data-driven applications. Database management systems must be available to support the deployment of cloud applications resorting to databases. Many solutions use database replication as a strategy to increase availability and decentralize the workload of database transactions among replicas. Due to the distribution of database transactions among replicas, load balancing techniques improve the computational resources utilization. However, several solutions use the current state of the database service to make decisions for the distribution of transactions. This article proposes a predictive and elastic load balancing service for replicated cloud databases. Experiments carried out showed that the use of prediction models can help to predict possible SLA violations in time series that represent workloads of cloud-replicated databases.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115945306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michele A. Brandão, Pedro O. S. Vaz de Melo, Mirella M. Moro
{"title":"STACY: Strength of Ties Automatic-Classifier over the Years","authors":"Michele A. Brandão, Pedro O. S. Vaz de Melo, Mirella M. Moro","doi":"10.5753/jidm.2018.1636","DOIUrl":"https://doi.org/10.5753/jidm.2018.1636","url":null,"abstract":"With the evolution of Web technology and its worldwide use by regular people, there is now data about not only such people but also their relations. Database research has evolved as well to tackle the myriad of problems that arrive with such volumes of data. Here, we contribute to such a trend by proposing a new algorithm (STACY) to automatically classify tie strength (an intrinsic property of relationships) considering time. We show that each class has singular and different behavior, and analyze them over co-authorship networks. Also, STACY identifies strong relationships that persist more than the ones classified by a state of the art algorithm. Finally, we derive a computational model from STACY that is able to automatically identify relationships classes with low computational cost.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133216573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natércia A. Batista, Guilherme A. de Sousa, Michele A. Brandão, Ana Paula Couto da Silva, Mirella M. Moro
{"title":"Tie Strength Metrics to Rank Pairs of Developers from GitHub","authors":"Natércia A. Batista, Guilherme A. de Sousa, Michele A. Brandão, Ana Paula Couto da Silva, Mirella M. Moro","doi":"10.5753/jidm.2018.1637","DOIUrl":"https://doi.org/10.5753/jidm.2018.1637","url":null,"abstract":"The Web provides huge volumes of data, which makes efficient data collecting and processing not easy tasks. An example of such volumes is in software repositories, a type of Web storage platform for software and projects, their developers and companies. In this work, we first present a systematic literature review over topics related to such repositories. Then, we extract their data and enrich it by building a development network. Based on such a network, we investigate tie strength metrics on their capability of defining new information through a correlation analysis. We also use the metrics to rank pairs of developers by considering three different aggregate methods. Our experimental analysis shows different results for each ranking method when considering all pairs of developers, which reveals the difficulty of choosing the best way to rank pairs of developers. However, when considering the top 10 best ranked pairs, two methods present similar results. Also, the combination of tie strength metrics with ranking aggregated methods allows to identify important developers in the network and their collaboration strength.","PeriodicalId":301338,"journal":{"name":"J. Inf. 
Data Manag.","volume":"73 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128035828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira
{"title":"Towards an Empirical Evaluation of Scientific Data Indexing and Querying","authors":"Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira","doi":"10.5753/jidm.2018.1638","DOIUrl":"https://doi.org/10.5753/jidm.2018.1638","url":null,"abstract":"Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent on setting all data into a common framework, issuing a high-level query statement, and obtaining the result set. Loading simulation data into DataBase Management Systems (DBMS) is either impractical, as it demands a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to 372 different queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. 
In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116788432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProQua: a system for evaluating logic-based scoring functions on uncertain relational data","authors":"S. Lehrack, S. Saretz","doi":"10.1145/2452376.2452474","DOIUrl":"https://doi.org/10.1145/2452376.2452474","url":null,"abstract":"ProQua is an innovative probabilistic database system which enables the application of logic-based and weighted similarity conditions on uncertain relational data. In this demonstration paper we describe the interrelations among the main concepts, present an archaeological example scenario and sketch the software architecture of ProQua.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129844310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}