Using Active Learning for Segmentation and Semantic Classification of Legal Acts Extracted from Official Diaries

J. Inf. Data Manag. Pub Date: 2023-10-31 DOI: 10.5753/jidm.2023.3181

Kattiana Constantino, Thiago H. P. Silva, João Vítor B. Silva, Victor Augusto L. Cruz, Otávio M. M. Zucheratto, Marcos Carvalho, Welton Santos, Celso França, Cláudio M. V. de Andrade, Alberto H. F. Laender, Marcos André Gonçalves


Citations: 0

Abstract



Open, transparent governance depends on unimpeded and verifiable access to legal and regulatory information. With such access, we can monitor government actions to ensure that public financial resources are not used improperly or inconsistently. This facilitates, for example, the detection of unlawful behavior in public actions such as bidding processes and auctions. However, public agencies each apply their own criteria for the models and formats in which they publish information, as the varying styles of municipal, state, and union (federal) documents show. In this context, we aim to minimize the effort required to process public documents, notably official gazettes. To this end, we propose a structure-oriented heuristic for extracting relevant excerpts from their texts. We then characterize these excerpts through morphosyntactic analysis and entity recognition. Subsequently, we semantically classify the extracted fragments into "sections of interest" (e.g., bids, laws, personnel, budget), using an active learning strategy to reduce the manual labeling effort. We also improve the classification process by incorporating transformers and stacking, and by combining different types of representations (e.g., frequentist, static, and contextual semantic embeddings). Furthermore, we exploit oversampling based on semi-supervised learning to deal with the scarcity and skewness of labeled data. Finally, we combine all these contributions in a real-time annotation tool with active learning support that achieves 100% accuracy in extraction and an overall accuracy of 85% in classification with very little labeling effort.
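The abstract names active learning as the way to cut labeling effort but does not say which query strategy or classifier is used. As an illustration only, the sketch below shows one common variant, pool-based active learning with least-confidence sampling, using a small bag-of-words naive Bayes classifier as a stand-in for the paper's transformer and stacking models. All names (`NaiveBayes`, `least_confident`, `active_learning`) and the toy gazette fragments are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, text):
        total = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.class_counts[c] / total)
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[c][w] + 1) / denom)
            log_scores[c] = score
        peak = max(log_scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def least_confident(model, pool):
    """Index of the pool item whose most probable class has the lowest probability."""
    return min(range(len(pool)),
               key=lambda i: max(model.predict_proba(pool[i]).values()))


def active_learning(seed, pool, oracle, budget):
    """Query the oracle (a human annotator) for the `budget` most uncertain items."""
    labeled, pool = list(seed), list(pool)
    model = NaiveBayes().fit(*zip(*labeled))
    for _ in range(min(budget, len(pool))):
        text = pool.pop(least_confident(model, pool))
        labeled.append((text, oracle(text)))
        model = NaiveBayes().fit(*zip(*labeled))  # retrain after each new label
    return model, labeled


# Toy fragments; the dictionary plays the role of the human annotator.
seed = [("public bid for road works", "bids"),
        ("appoints civil servant to position", "personnel")]
truth = {
    "bid opening for school supplies": "bids",
    "dismisses servant from position": "personnel",
    "annual budget allocation approved": "budget",
}
model, labeled = active_learning(seed, list(truth), truth.get, budget=2)
```

In this toy run the first query goes to the budget fragment, because none of its words appear in the seed set and the model is therefore maximally uncertain about it. A real annotation tool would replace the dictionary oracle with an annotator's input and the naive Bayes model with a stronger classifier.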
