F. A. D. G. Pinto, J. D. B. Santos, Sérgio Lifschitz, E. Haeusler
A benchmarking for public information by Machine Learning and Regular Language
Anais do XI Workshop de Computação Aplicada em Governo Eletrônico (WCGE 2023), published 2023-08-06
DOI: 10.5753/wcge.2023.229975 (https://doi.org/10.5753/wcge.2023.229975)
Citations: 1
Abstract
Technologies such as Big Data and Transfer Learning have attracted the interest of industry and academia over the last 15 years. As a result, there is an almost unanimous preference for solutions based on statistical models, and these models are revolutionizing the information extraction process. In this research, we question whether this approach is the best solution for extracting information from documents. We compare machine learning (ML) and rule-based approaches on the task of recognizing legal entities in the official gazette. We built an annotated dataset of 100 legal documents and evaluated it in IBM Watson Knowledge Studio (WKS). We show that, when documents follow a formal structure, rule-based information extraction systems remain low-cost, simpler, and more efficient solutions.
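To make the rule-based side of the comparison concrete, the following is a minimal sketch of how a regular expression might pull legal entities out of formally structured gazette text. It is not the authors' rule set: the pattern, the `extract_entities` helper, the assumption that an entity appears as an upper-case name with a corporate suffix followed by a CNPJ-style registration number, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rule (an assumption, not taken from the paper): an upper-case
# company name ending in a corporate suffix, followed by "CNPJ" and a
# registration number in the form NN.NNN.NNN/NNNN-NN.
ENTITY_RULE = re.compile(
    r"(?P<name>[A-Z][A-Z&. ]*?(?:LTDA|S\.A\.|S/A))"  # name plus suffix
    r",?\s+CNPJ\s+"                                  # fixed keyword
    r"(?P<cnpj>\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2})"     # registration number
)


def extract_entities(text: str) -> list[dict[str, str]]:
    """Return every (name, cnpj) pair that the rule matches in `text`."""
    return [
        {"name": m.group("name"), "cnpj": m.group("cnpj")}
        for m in ENTITY_RULE.finditer(text)
    ]


# Made-up sample text; the companies and numbers are not real data.
SAMPLE = (
    "Contratada: ACME SERVICOS LTDA, CNPJ 12.345.678/0001-90. "
    "Contratante: BETA ENGENHARIA S.A., CNPJ 98.765.432/0001-10."
)

if __name__ == "__main__":
    for entity in extract_entities(SAMPLE):
        print(entity["name"], "->", entity["cnpj"])
```

A rule like this needs no annotated training data, which is one plausible reading of the low-cost argument, but it only works as long as the documents keep the fixed layout the pattern assumes.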