Text Classification of News Using Transformer-based Models for Portuguese

Journal of Systemics Cybernetics and Informatics, Pub Date: 2022-10-01, DOI: 10.54808/jsci.20.05.33

Isabel N. Santana, R. S. Oliveira, E. G. S. Nascimento


Citations: 2

Abstract


This work proposes using BERTimbau, a fine-tuned Transformer-based Natural Language Processing (NLP) model, to generate word embeddings from texts published in a Brazilian newspaper, with the goal of building a robust model for classifying news in Portuguese, a task that is costly for humans to perform on large amounts of data. To assess this approach, the embeddings generated by the fine-tuned BERTimbau were compared with those produced by the Word2Vec technique. The first step of the work was to regroup the news from nineteen categories into ten, using the K-means and TF-IDF techniques, to reduce class imbalance in the corpus. In the Word2Vec step, both the CBOW and Skip-gram architectures were applied. In both the BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news article as a single document embedding. Results were evaluated with accuracy, weighted accuracy, precision, recall, F1-score, AUC ROC, and AUC PRC. The fine-tuned BERTimbau captured distinctions between the texts of the different categories, and the classification model built on it outperformed the other techniques explored.
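As a rough illustration of how several of the evaluation metrics named in the abstract can be computed for a multi-class news classifier, the sketch below implements accuracy, per-class precision, recall and F1-score, and a support-weighted F1 in plain Python. It is not the authors' code. The function names, the toy category labels, and the reading of "weighted accuracy" as the mean of per-class recall (often called balanced accuracy) are all assumptions made for this example; the abstract does not define that metric, and AUC ROC and AUC PRC are omitted because they require class probabilities rather than hard labels.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Compute basic multi-class metrics from true and predicted labels.

    Hypothetical helper for illustration only; 'balanced_accuracy' is an
    assumed interpretation of the abstract's 'weighted accuracy'.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    total = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1 = safe_div(2 * precision * recall, precision + recall)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    # Only classes that actually occur in y_true contribute to the averages.
    seen = [label for label in labels if support[label] > 0]
    accuracy = sum(1 for t, p in pairs if t == p) / total
    balanced_accuracy = sum(per_class[l]["recall"] for l in seen) / len(seen)
    weighted_f1 = sum(per_class[l]["f1"] * support[l] for l in seen) / total

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "weighted_f1": weighted_f1,
        "per_class": per_class,
    }


if __name__ == "__main__":
    # Toy labels, invented for this example; not taken from the paper's corpus.
    y_true = ["sports", "sports", "politics", "politics", "economy", "economy"]
    y_pred = ["sports", "politics", "politics", "politics", "economy", "sports"]
    report = evaluate(y_true, y_pred)
    for name in ("accuracy", "balanced_accuracy", "weighted_f1"):
        print(f"{name}: {report[name]:.3f}")
```

Reporting a class-averaged score next to plain accuracy matters when categories are imbalanced, which is the concern the abstract raises when it regroups nineteen categories into ten: plain accuracy can look high even if small categories are mostly misclassified.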


Source journal: Journal of Systemics Cybernetics and Informatics

Self-citation rate: 0.00%

Review time: 12 weeks