J. Inf. Data Manag.: Latest Articles

Assessing Data Quality Inconsistencies in Brazilian Governmental Data
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3220
Gabriel P. Oliveira, Bárbara M. A. Mendes, Clara A. Bacha, Lucas L. Costa, Larissa D. Gomide, Mariana O. Silva, Michele A. Brandão, A. Lacerda, Gisele L. Pappa
Abstract: In recent years, vast volumes of data are constantly being made available on the Web, and they have been increasingly used as decision support in different contexts. However, for these decisions to be more assertive and reliable, it is necessary to ensure data quality. Although there are several definitions for this area, it is a consensus that data quality is always associated with a specific context. This work aims to analyze data quality in a data warehouse with governmental information of the Brazilian state of Minas Gerais. We first present a brief comparison of eight open-source data quality tools and then choose the Great Expectations tool for analyzing such data in two real applications: public bids and public expenditure. Our analyses show that the chosen tool has relevant characteristics to generate good data quality indicators to reveal data quality issues that may directly impact the construction of final applications using such data.
Citations: 0
Class Schema Discovery from Semi-Structured Data
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3117
Everaldo Costa Neto, Johny Moreira, Luciano Barbosa, Ana Carolina Salgado
Abstract: A wide range of applications has used semi-structured data. A characteristic of this type of data is its flexible structure, i.e., it does not rely on schema-based constraints to define its entities. Usually, entities of the same kind (i.e., class) do not present the same attribute set. However, some data processing and management applications rely on a data schema to perform their tasks. In this context, the lack of structure is a challenge for these applications to use this data. In this paper, we propose CoFFee, an approach to class schema discovery. Given a set of heterogeneous entity schemata, found within a class, CoFFee provides a summarized set with core attributes. To this end, CoFFee applies a strategy combining attribute co-occurrence and frequency. It models a set of entity schemata as a graph and uses centrality metrics to capture the co-occurrence between attributes. We evaluated CoFFee using data from 12 classes extracted from DBpedia and e-Commerce datasets. We benchmarked it against two other state-of-the-art approaches. The results show that: i) CoFFee effectively provides a summarized schema, minimizing non-relevant attributes without compromising the data retrieval rate; and ii) CoFFee produces a summarized schema of good quality, outperforming the baselines by an average of 19% of F1 score.
Citations: 0
Improved generalization of cyclist detection on security cameras with the OpenImages Cyclists dataset
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3179
Ednilza Evangelista da Silva Nardi, Bruno Padilha, L. T. Kamaura, João Eduardo Ferreira
Abstract: Most large public datasets containing cyclists for training detectors based on Deep Learning have annotations for bicycles and people, but not for cyclists. Even when it is not the case, the quality and quantity of the images are limited. To overcome these limitations, we propose the new OpenImages Cyclists dataset, built through the pre-selection of images from the OpenImages set and a new algorithm for semiautomatic generation of cyclist annotation aided by people and bicycle detectors. A cyclist detector trained with this dataset achieved identification rates up to 78% and 89% in two different sets of images obtained from security cameras at USP, Campus São Paulo - Capital.
Citations: 0
Adaptive Fast XGBoost for Multiclass Classification
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3150
Fabiano Baldo, J. Grando, Yuji Yamada Correa, Deividy Amorim Policarpo
Abstract: The popularization of sensing and connectivity technologies like 5G and IoT is boosting the generation of data streams. Such kinds of data are one of the last frontiers of data mining applications. However, data streams are massive and unbounded sequences of non-stationary data objects that are continuously generated at rapid rates. To deal with these challenges, the learning algorithms should analyze the data just once and update their classifiers to handle the concept drifts. The literature presents some algorithms to deal with the classification of multiclass data streams. However, most of them have high processing time. Therefore, this work proposes an XGBoost-based classifier called AFXGB-MC to quickly classify non-stationary data streams with multiple classes. We compared it with the six state-of-the-art algorithms for multiclass classification found in the literature. The results indicate that AFXGB-MC presents similar accuracy performance, but with faster processing time, being twice as fast as the second-fastest algorithm from the literature, and having fast drift recovery time.
Citations: 0
Using Active Learning for Segmentation and Semantic Classification of Legal Acts Extracted from Official Diaries
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3181
Kattiana Constantino, Thiago H. P. Silva, João Vítor B. Silva, Victor Augusto L. Cruz, Otávio M. M. Zucheratto, Marcos Carvalho, Welton Santos, Celso França, Cláudio M. V. de Andrade, Alberto H. F. Laender, Marcos André Gonçalves
Abstract: Based on openness and transparency for good governance, unimpeded and verifiable access to legal and regulatory information is essential. With such access, we can monitor government actions to ensure that public financial resources are not improperly or inconsistently used. This facilitates, for example, the detection of unlawful behavior in public actions, such as bidding processes and auctions. However, different public agencies have their own criteria for standardizing the models and formats used to make information available, as exemplified in the varying styles observed in municipal, state, and union (federal) documents. In this context, we aim to minimize the effort to deal with public documents, notably official gazettes. For this, we propose a structure-oriented heuristic for extracting relevant excerpts from their texts. We then characterize these excerpts through morphosyntactic analysis and entity recognition. Subsequently, we semantically classify the extracted fragments into "sections of interest" (e.g., bids, laws, personnel, budget) using an active learning strategy to reduce the manual labeling effort. We also improve the classification process by incorporating transformers, stacking, and by combining different types of representations (e.g., frequentist, static, and contextual semantic embeddings). Furthermore, we exploit oversampling based on semi-supervised learning to deal with (labeled) data scarceness and skewness. Finally, we combine all these contributions in a real-time annotation tool with active learning support that achieves 100% accuracy in extraction and an overall accuracy of 85% in classification with very little labeling effort.
Citations: 0
Contextual Reinforcement, Entity Delimitation and Generative Data Augmentation for Entity Recognition and Relation Extraction in Official Documents
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3180
F. Belém, Cláudio M. V. de Andrade, Celso França, Marcos Carvalho, M. Ganem, Gabriel Teixeira, Gabriel Jallais, Alberto H. F. Laender, Marcos André Gonçalves
Abstract: Transformer architectures have become the main component of various state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). As these architectures rely on semantic (contextual) aspects of word sequences, they may fail to accurately identify and delimit entity spans when there is little semantic context surrounding the named entities. This is the case of entities composed only of digits and punctuation, such as IDs and phone numbers, as well as long composed names. In this article, we propose new techniques for contextual reinforcement and entity delimitation based on pre- and post-processing techniques to provide a richer semantic context, improving SpERT, a state-of-the-art Span-based Entity and Relation Transformer. To provide further context to the training process of NER+RE, we propose a data augmentation technique based on Generative Pretrained Transformers (GPT). We evaluate our strategies using real data from public administration documents (official gazettes and biddings) and court lawsuits. Our results show that our pre- and post-processing strategies, when used jointly, allow significant improvements in NER+RE effectiveness, while we also show the benefits of using GPT for training data augmentation.
Citations: 0
Hurricane: a Dataflow-oriented Data Service for Smart Cities Applications
J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3189
Maicon Banni, Maria Luiza Falci, Isabel Rosseti, Daniel de Oliveira
Abstract: The concept of Smart Cities has gained relevance, especially in the last decade, due to the availability of data associated with cities, e.g., car traffic, public transportation, crime data, etc. The purpose of using these data is to improve the services offered to the citizens. Most of these applications manipulate spatiotemporal data. These data are processed in a dataflow that starts with the collection, integration, and aggregation and ends with visualization. This way, specialized data services for smart city applications are most welcome. However, many of the existing data services in this context are either specific to a particular application/domain or do not consider the entire data life cycle. In this article, we present Hurricane, a dataflow-oriented data service for smart city applications. Hurricane executes multiple dataflows to gather, pre-process, integrate, and publish data. Hurricane was evaluated with an application in the area of public security and results reinforced the importance of this type of data service.
Citations: 0
Topic Coherence Metrics: How Sensitive Are They?
J. Inf. Data Manag. Pub Date : 2022-10-03 DOI: 10.5753/jidm.2022.2181
João Marcos Campagnolo, Denio Duarte, Guillherme Dal Bianco
Abstract: Topic modeling approaches extract the most relevant sets of words (grouped into so-called topics) from a document collection. The extracted topics can be used for analyzing the latent semantic structure hiding in the collection. This task is intrinsically unsupervised (without information about the labels), so evaluating the quality of the discovered topics is challenging. To address that, different unsupervised metrics have been proposed, and some of them are close to human perception, e.g., coherence metrics. Moreover, metrics behave differently when facing noise (i.e., unrelated words) in the topics. This article presents an exploratory analysis to evaluate how state-of-the-art metrics are affected by perturbations in the topics. By perturbation, we mean that intruder words are synthetically inserted into the topics to measure the metrics' ability to deal with noise. Our findings highlight the importance of overlooked choices in the context of metric sensitivity. We show that some topic modeling metrics are highly sensitive to disturbance; others can handle noisy topics with minimal perturbation. As a result, we rank the chosen metrics by sensitivity, and as a contribution, we believe that the results might be helpful for developers to evaluate the discovered topics better.
Citations: 1
P+RProv: Prospective+Retrospective Provenance Graphs of Python Scripts
J. Inf. Data Manag. Pub Date : 2022-10-03 DOI: 10.5753/jidm.2022.2059
Vitor Gama Lemos, J. F. Pimentel, Bruno Erbisti, V. Braganholo
Abstract: The evolution of technology has enabled scientists to advance the automation of scientific experiments. Many programming languages have become popular in the scientific environment, especially scripting languages, due to their high abstraction level and simplicity, allowing the specification of complex tasks in fewer steps than traditional programming languages. Due to these features, lots of scientists model their scientific experiments in scripting languages to ensure data management and results control. However, this type of experiment usually generates large volumes of data, making data analysis and threat mitigation difficult. To fill in this gap, we propose P+RProv, an approach to aid scientists in understanding the structure of Python scripts and their results.
Citations: 0
Exploring the Intersection between Databases and Digital Forensics
J. Inf. Data Manag. Pub Date : 2022-09-21 DOI: 10.5753/jidm.2022.2524
Danilo B. Seufitelli, Michele A. Brandão, Mirella M. Moro
Abstract: Digital forensics has attracted attention from assorted researchers, who primarily work on predicting and solving digital hacks and crimes. In turn, the number and types of digital crimes have increased considerably, mainly due to the growing use of digital media to perform daily personal and professional tasks. Like most computer-related activities, data is at the center of such hacks and crimes. Hence, this work presents a systematic literature review of publications at the intersection between Digital Forensics and Databases. We discuss problems and trends of two main categories: Data Building and Database Management Systems. Overall, this research opens the doors for the communication between databases and an area with several exciting and concrete challenges, with great potential for social, economic, and technical-scientific contributions.
Citations: 1