Using $\text{CRF}+\text{LG}$ for automated classification of named entities in newspaper texts

J. Lima, Cristiano da Silveira Colombo, Flávio Izo, Juliana Pinheiro Campos Pirovani, E. Oliveira
{"title":"Using $\\text{CRF}+\\text{LG}$ for automated classification of named entities in newspaper texts","authors":"J. Lima, Cristiano da Silveira Colombo, Flávio Izo, Juliana Pinheiro Campos Pirovani, E. Oliveira","doi":"10.1109/CLEI52000.2020.00010","DOIUrl":null,"url":null,"abstract":"Information production has been growing at an accelerated rate. There is a large amount of information to be processed, which makes tasks related to the collection and explotation of text challenging, requiring effort from the most diverse areas, especially computational linguistics. One of the goals of computational linguistics is to enable the collection and explotation of linguistic datasets, through empirical evidence, extracted with the application of computational resources. This work presents the creation of dataset extracted from a newspaper. Techniques were used for extracting sentences, tokenization, named entities recognition (NER), as well as statistical methods for describing the dataset. Other researchers can directly benefit from the available corpus. This work presents a corpus with 1029 annotated news articles in Portuguese for entities according to HAREM categories. In a sample of 108 pages, our experiments show a 97.0% similarity compared to gold standard texts from the same newspaper. For the NER task and automatic annotation of the extracted dataset, proportions of the datasets of Second Harem and aTribuna100 were used to train a hybrid model $\\text{CRF}+\\text{LG}$. With the trained model, the 1029 articles extracted were automatically annotated. In general, the values of the metrics demonstrate that optimal metrics achievements for the classification model for the 70/30 Proportion, especially for the Person (PER) category, reaching 91.11% and 95.82% for precision and recall, respectively. Overall, the model showed 95.86% accuracy.","PeriodicalId":413655,"journal":{"name":"2020 XLVI Latin American Computing Conference (CLEI)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 XLVI Latin American Computing Conference (CLEI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLEI52000.2020.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Information production has been growing at an accelerated rate. There is a large amount of information to be processed, which makes tasks related to the collection and exploitation of text challenging, requiring effort from many areas, especially computational linguistics. One of the goals of computational linguistics is to enable the collection and exploitation of linguistic datasets, through empirical evidence, extracted with the application of computational resources. This work presents the creation of a dataset extracted from a newspaper. Techniques for sentence extraction, tokenization, and named entity recognition (NER) were applied, together with statistical methods for describing the dataset. Other researchers can directly benefit from the resulting corpus. This work presents a corpus of 1029 Portuguese news articles annotated with named entities according to the HAREM categories. On a sample of 108 pages, our experiments show 97.0% similarity to gold-standard texts from the same newspaper. For the NER task and the automatic annotation of the extracted dataset, portions of the Second HAREM and aTribuna100 datasets were used to train a hybrid $\text{CRF}+\text{LG}$ model. With the trained model, the 1029 extracted articles were automatically annotated. Overall, the metrics show that the classification model performs best with the 70/30 split, especially for the Person (PER) category, reaching 91.11% precision and 95.82% recall. The model achieved an overall accuracy of 95.86%.
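To make the described pipeline concrete, below is a minimal sketch of training and evaluating a CRF-based NER tagger with precision/recall reporting on a train/test split. This is not the authors' $\text{CRF}+\text{LG}$ implementation: it assumes the sklearn-crfsuite library, a CoNLL-style corpus of (token, tag) sentences with HAREM-like tags (B-PER, B-LOC, O, ...), and toy data standing in for the Second HAREM / aTribuna100 material. The local-grammar (LG) cues of the hybrid model would enter as additional token features.

```python
# Minimal CRF-based NER sketch (assumption: sklearn-crfsuite is installed).
# Toy sentences below are hypothetical placeholders, not corpus data.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(sentence, i):
    """Simple per-token features; LG-derived cues could be added here."""
    word = sentence[i][0]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "BOS": i == 0,
        "EOS": i == len(sentence) - 1,
    }
    if i > 0:
        feats["prev.lower"] = sentence[i - 1][0].lower()
    if i < len(sentence) - 1:
        feats["next.lower"] = sentence[i + 1][0].lower()
    return feats

def sent2features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

def sent2labels(sentence):
    return [tag for _, tag in sentence]

# Hypothetical toy split standing in for the 70/30 proportion used in the paper.
train_sents = [[("Maria", "B-PER"), ("mora", "O"), ("em", "O"), ("Vitória", "B-LOC")]]
test_sents = [[("João", "B-PER"), ("visitou", "O"), ("Vitória", "B-LOC")]]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

# Per-category precision/recall, excluding the "O" (non-entity) tag.
y_pred = crf.predict(X_test)
labels = [l for l in crf.classes_ if l != "O"]
print(metrics.flat_classification_report(y_test, y_pred, labels=labels))
```

In a real setting the toy sentences would be replaced by the annotated training corpus, and the per-category report would yield figures analogous to the PER precision and recall quoted in the abstract.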