J. Inf. Data Manag.: Latest Articles

Assessing Data Quality Inconsistencies in Brazilian Governmental Data

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3220

Gabriel P. Oliveira, Bárbara M. A. Mendes, Clara A. Bacha, Lucas L. Costa, Larissa D. Gomide, Mariana O. Silva, Michele A. Brandão, A. Lacerda, Gisele L. Pappa

Citations: 0

Class Schema Discovery from Semi-Structured Data

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3117

Everaldo Costa Neto, Johny Moreira, Luciano Barbosa, Ana Carolina Salgado

Abstract: A wide range of applications use semi-structured data. A characteristic of this type of data is its flexible structure, i.e., it does not rely on schema-based constraints to define its entities. Entities of the same kind (i.e., class) usually do not present the same attribute set. However, some data processing and management applications rely on a data schema to perform their tasks. In this context, the lack of structure makes it challenging for these applications to use this data. In this paper, we propose CoFFee, an approach to class schema discovery. Given a set of heterogeneous entity schemata found within a class, CoFFee provides a summarized set of core attributes. To this end, CoFFee applies a strategy combining attribute co-occurrence and frequency. It models a set of entity schemata as a graph and uses centrality metrics to capture the co-occurrence between attributes. We evaluated CoFFee using data from 12 classes extracted from DBpedia and e-Commerce datasets and benchmarked it against two other state-of-the-art approaches. The results show that: i) CoFFee effectively provides a summarized schema, minimizing non-relevant attributes without compromising the data retrieval rate; and ii) CoFFee produces a summarized schema of good quality, outperforming the baselines by an average of 19% in F1 score.

Citations: 0

Improved generalization of cyclist detection on security cameras with the OpenImages Cyclists dataset

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3179

Ednilza Evangelista da Silva Nardi, Bruno Padilha, L. T. Kamaura, João Eduardo Ferreira

Citations: 0

Adaptive Fast XGBoost for Multiclass Classification

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3150

Fabiano Baldo, J. Grando, Yuji Yamada Correa, Deividy Amorim Policarpo

Citations: 0

Using Active Learning for Segmentation and Semantic Classification of Legal Acts Extracted from Official Diaries

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3181

Kattiana Constantino, Thiago H. P. Silva, João Vítor B. Silva, Victor Augusto L. Cruz, Otávio M. M. Zucheratto, Marcos Carvalho, Welton Santos, Celso França, Cláudio M. V. de Andrade, Alberto H. F. Laender, Marcos André Gonçalves

Abstract: Good governance rests on openness and transparency, so unimpeded and verifiable access to legal and regulatory information is essential. With such access, we can monitor government actions to ensure that public financial resources are not improperly or inconsistently used. This facilitates, for example, the detection of unlawful behavior in public actions, such as bidding processes and auctions. However, different public agencies have their own criteria for standardizing the models and formats used to make information available, as exemplified by the varying styles observed in municipal, state, and union (federal) documents. In this context, we aim to minimize the effort required to deal with public documents, notably official gazettes. For this, we propose a structure-oriented heuristic for extracting relevant excerpts from their texts. We then characterize these excerpts through morphosyntactic analysis and entity recognition. Subsequently, we semantically classify the extracted fragments into "sections of interest" (e.g., bids, laws, personnel, budget) using an active learning strategy to reduce the manual labeling effort. We also improve the classification process by incorporating transformers and stacking, and by combining different types of representations (e.g., frequentist, static, and contextual semantic embeddings). Furthermore, we exploit oversampling based on semi-supervised learning to deal with the scarcity and skewness of labeled data. Finally, we combine all these contributions in a real-time annotation tool with active learning support that achieves 100% accuracy in extraction and an overall accuracy of 85% in classification with very little labeling effort.

Citations: 0

Contextual Reinforcement, Entity Delimitation and Generative Data Augmentation for Entity Recognition and Relation Extraction in Official Documents

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3180

F. Belém, Cláudio M. V. de Andrade, Celso França, Marcos Carvalho, M. Ganem, Gabriel Teixeira, Gabriel Jallais, Alberto H. F. Laender, Marcos André Gonçalves

Abstract: Transformer architectures have become the main component of various state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). As these architectures rely on semantic (contextual) aspects of word sequences, they may fail to accurately identify and delimit entity spans when there is little semantic context surrounding the named entities. This is the case for entities composed only of digits and punctuation, such as IDs and phone numbers, as well as long compound names. In this article, we propose new techniques for contextual reinforcement and entity delimitation, based on pre- and post-processing, that provide a richer semantic context and improve SpERT, a state-of-the-art Span-based Entity and Relation Transformer. To provide further context to the training process of NER+RE, we propose a data augmentation technique based on Generative Pretrained Transformers (GPT). We evaluate our strategies using real data from public administration documents (official gazettes and biddings) and court lawsuits. Our results show that our pre- and post-processing strategies, when used jointly, allow significant improvements in NER+RE effectiveness; we also show the benefits of using GPT for training data augmentation.

Citations: 0

Hurricane: a Dataflow-oriented Data Service for Smart Cities Applications

J. Inf. Data Manag. Pub Date : 2023-10-31 DOI: 10.5753/jidm.2023.3189

Maicon Banni, Maria Luiza Falci, Isabel Rosseti, Daniel de Oliveira

Citations: 0

Topic Coherence Metrics: How Sensitive Are They?

J. Inf. Data Manag. Pub Date : 2022-10-03 DOI: 10.5753/jidm.2022.2181

João Marcos Campagnolo, Denio Duarte, Guilherme Dal Bianco

Abstract: Topic modeling approaches extract the most relevant sets of words (grouped into so-called topics) from a document collection. The extracted topics can be used to analyze the latent semantic structure hidden in the collection. This task is intrinsically unsupervised (no label information is available), so evaluating the quality of the discovered topics is challenging. To address that, different unsupervised metrics have been proposed, and some of them, e.g., coherence metrics, are close to human perception. Moreover, metrics behave differently when facing noise (i.e., unrelated words) in the topics. This article presents an exploratory analysis to evaluate how state-of-the-art metrics are affected by perturbations in the topics. By perturbation, we mean that intruder words are synthetically inserted into the topics to measure the metrics’ ability to deal with noise. Our findings highlight the importance of overlooked choices in the context of metric sensitivity. We show that some topic modeling metrics are highly sensitive to perturbation, while others can handle noisy topics with minimal change. As a result, we rank the chosen metrics by sensitivity, and we believe the results can help developers better evaluate the discovered topics.

Citations: 1

P+RProv: Prospective+Retrospective Provenance Graphs of Python Scripts

J. Inf. Data Manag. Pub Date : 2022-10-03 DOI: 10.5753/jidm.2022.2059

Vitor Gama Lemos, J. F. Pimentel, Bruno Erbisti, V. Braganholo

Citations: 0

Exploring the Intersection between Databases and Digital Forensics

J. Inf. Data Manag. Pub Date : 2022-09-21 DOI: 10.5753/jidm.2022.2524

Danilo B. Seufitelli, Michele A. Brandão, Mirella M. Moro

Citations: 1