J. Lima, Cristiano da Silveira Colombo, Flávio Izo, Juliana Pinheiro Campos Pirovani, E. Oliveira
2020 XLVI Latin American Computing Conference (CLEI), published 2020-10-01. DOI: 10.1109/CLEI52000.2020.00010 (https://doi.org/10.1109/CLEI52000.2020.00010)
Using $\text{CRF}+\text{LG}$ for automated classification of named entities in newspaper texts
Information production has been growing at an accelerated rate. The large amount of information to be processed makes tasks related to the collection and exploitation of text challenging and requires effort from diverse areas, especially computational linguistics. One of the goals of computational linguistics is to enable the collection and exploitation of linguistic datasets through empirical evidence extracted with computational resources. This work presents the creation of a dataset extracted from a newspaper. Techniques were used for sentence extraction, tokenization, and named entity recognition (NER), along with statistical methods for describing the dataset. The resulting corpus contains 1029 news articles in Portuguese, annotated for entities according to the HAREM categories, and is available for other researchers to use directly. In a sample of 108 pages, our experiments show 97.0% similarity to gold-standard texts from the same newspaper. For the NER task and the automatic annotation of the extracted dataset, proportions of the Second HAREM and aTribuna100 datasets were used to train a hybrid $\text{CRF}+\text{LG}$ model. With the trained model, the 1029 extracted articles were automatically annotated. In general, the metrics show that the classification model performed best with the 70/30 proportion, especially for the Person (PER) category, which reached 91.11% precision and 95.82% recall. Overall, the model showed 95.86% accuracy.
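The per-category precision and recall reported above can be illustrated with a minimal Python sketch. It assumes exact-match scoring, where a predicted entity counts as correct only if its document, span, and category all agree with the gold annotation. The abstract does not state which scoring scheme the authors used, so this is an assumption. The function name `per_category_scores`, the tuple layout, the `LOC` and `ORG` labels, and the toy annotations are all hypothetical and are not taken from the corpus.

```python
from collections import Counter


def per_category_scores(gold, predicted):
    """Compute precision, recall and F1 per entity category.

    gold and predicted are sets of (doc_id, start, end, category) tuples.
    Assumption: an entity is correct only on an exact match of all four fields.
    """
    true_pos = Counter(e[3] for e in gold & predicted)
    n_pred = Counter(e[3] for e in predicted)
    n_gold = Counter(e[3] for e in gold)

    scores = {}
    for cat in set(n_gold) | set(n_pred):
        tp = true_pos[cat]
        precision = tp / n_pred[cat] if n_pred[cat] else 0.0
        recall = tp / n_gold[cat] if n_gold[cat] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cat] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical toy annotations, not data from the paper's corpus.
gold = {
    (1, 0, 2, "PER"),
    (1, 5, 6, "LOC"),
    (2, 3, 5, "PER"),
    (2, 8, 9, "ORG"),
}
predicted = {
    (1, 0, 2, "PER"),
    (1, 5, 6, "ORG"),    # right span, wrong category
    (2, 3, 5, "PER"),
    (2, 10, 11, "PER"),  # spurious entity
}

scores = per_category_scores(gold, predicted)
for cat in sorted(scores):
    print(cat, scores[cat])
```

In this toy case the spurious `PER` prediction lowers precision for that category while recall stays at its maximum, which is the same kind of gap between precision and recall that the abstract reports for PER.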