Developing a news classifier for Greek using BERT
Authors: George Gkolfopoulos, Iraklis Varlamis
DOI: 10.1109/SEEDA-CECNSM57760.2022.9932996 (https://doi.org/10.1109/SEEDA-CECNSM57760.2022.9932996)
Publication date: 2022-09-23
Journal (as listed): 计算机工程与设计 (Computer Engineering and Design), volume 64 1, pages 1-6
Citations: 0
Abstract
Text categorization is a significant task in the research field of text mining, which has recently benefited from deep neural network algorithms and advanced learning techniques that extract language models from large textual corpora. These Pre-Trained Language Models are the main components of state-of-the-art solutions in many natural language processing and text-mining tasks. They can be very generic, trained on generic text corpora, or domain-specific, when they employ large corpora from specific application domains (e.g. social media, news, sciences, etc.). When only generic language models are available, the overall performance in the task can be improved by adapting or fine-tuning the model used for the task, e.g. the classifier. Although multilingual language models are reported in the literature, most such models are language-specific. This work presents a news article classifier, which has been trained on a small corpus and employs a Greek version of the BERT language model. Comparison with existing machine learning-based classifiers shows that the proposed method outperforms well-known methods in text classification. In addition, the proposed approach allows the continuous training of the classifier through user-provided feedback on misclassified articles.
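The feedback-driven continuous training mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not describe how the loop is implemented, so everything below is an assumption: the class names `IncrementalNaiveBayes` and `FeedbackLoop`, the `batch_size` parameter, and the sample sentences are hypothetical, and a small naive Bayes model stands in for the fine-tuned Greek BERT classifier so that the example runs without any model download.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Hypothetical stand-in for the paper's fine-tuned Greek BERT classifier."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def update(self, text, label):
        """Learn from one labelled article without retraining from scratch."""
        tokens = text.lower().split()
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the most probable label (None if nothing was learned yet)."""
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Laplace smoothing keeps unseen tokens from zeroing the score.
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class FeedbackLoop:
    """Buffers user corrections and applies them to the model in batches."""

    def __init__(self, model, batch_size=2):
        self.model = model
        self.batch_size = batch_size  # assumed threshold, not from the paper
        self.pending = []

    def report_error(self, text, correct_label):
        """Record a misclassified article together with the user's label."""
        self.pending.append((text, correct_label))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Feed all buffered corrections to the model and clear the buffer."""
        for text, label in self.pending:
            self.model.update(text, label)
        self.pending.clear()


# Tiny illustrative corpus (invented sentences, not the paper's data).
model = IncrementalNaiveBayes()
model.update("η ομάδα κέρδισε τον αγώνα", "sports")
model.update("η κυβέρνηση ψήφισε τον νόμο", "politics")

loop = FeedbackLoop(model, batch_size=2)
article = "η ομοσπονδία ψήφισε τον κανονισμό"
before = model.predict(article)
loop.report_error(article, "sports")
loop.report_error("η ομοσπονδία ανακοίνωσε τον κανονισμό", "sports")
after = model.predict(article)
print(before, "->", after)
```

With a BERT-based classifier, `update` would correspond to a further fine-tuning step on the corrected examples; batching the corrections is one plausible design choice, since fine-tuning on single examples tends to be noisy.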
About the journal:
Computer Engineering and Design is supervised by China Aerospace Science and Industry Corporation and sponsored by the 706th Institute of the Second Academy of China Aerospace Science and Industry Corporation. It was founded in 1980. The purpose of the journal is to disseminate new technologies and promote academic exchanges. Since its inception, it has adhered to the principle of combining depth and breadth, theory and application, and focused on reporting cutting-edge and hot computer technologies. The journal accepts academic papers with innovative and independent academic insights, including papers on fund projects, award-winning research papers, outstanding papers at academic conferences, doctoral and master's theses, etc.