Named Entity Recognition and News Article Classification: A Lightweight Approach

IF 3.6 · CAS Zone 3 (Computer Science) · JCR Q2 (Computer Science, Information Systems)
Ioannis Katranis;Christos Troussas;Akrivi Krouska;Phivos Mylonas;Cleo Sgouropoulou
DOI: 10.1109/ACCESS.2025.3605709
Journal: IEEE Access, vol. 13, pp. 155031-155046
Published: 2025-09-03 (Journal Article)
Article page: https://ieeexplore.ieee.org/document/11148234/
Full text PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11148234
Citations: 0

Abstract

This paper introduces TinyGreekNewsBERT, a 14.1M-parameter distilled Transformer that performs both Named Entity Recognition (NER) and multiclass news-topic classification in Greek. We first compile and annotate a 20,000-article corpus with 32 IOB2 entity labels and 19 thematic categories, accompanied by a transparent, reproducible preprocessing pipeline. On this benchmark, TinyGreekNewsBERT reaches 81% micro-F1 for NER and 78% classification accuracy, coming within five percentage points of GreekBERT (86% / 83%) while delivering comparable performance to mBERT (82% / 77%) and approaching XLM-RoBERTa (85% / 82%). Crucially, compared with GreekBERT, our model is $8\times$ smaller, requires $15\times$ fewer FLOPs (1.3 BFLOPs at 128 tokens), and yields a median CPU latency of 14.7 ms per article, a $10\times$ speed-up that makes it the first genuinely edge-deployable solution for Greek NER and news classification. Because the distillation and training pipeline is language-agnostic, the approach can be ported to other mid-resource languages and domains, offering a cost-effective path to multilingual, real-time NLP systems.
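The corpus is annotated with 32 entity labels in the IOB2 scheme, where the first token of an entity span is tagged `B-TYPE` and any following tokens of the same span `I-TYPE`, with `O` for non-entity tokens. A minimal sketch of converting annotated spans to IOB2 token labels (the function name and example spans are illustrative, not taken from the paper):

```python
# Hypothetical helper: map character-free token-index spans to IOB2 labels.
# spans is a list of (start_idx, end_idx_exclusive, entity_type) tuples.

def spans_to_iob2(tokens, spans):
    labels = ["O"] * len(tokens)           # default: outside any entity
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"       # first token of the span
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"       # continuation tokens
    return labels

# Example: "Αθήνα" (Athens) tagged as a location entity.
tokens = ["Η", "Αθήνα", "είναι", "πρωτεύουσα"]
labels = spans_to_iob2(tokens, [(1, 2, "LOC")])
```

The IOB2 convention (as opposed to plain IOB) always marks span starts with `B-`, which makes adjacent same-type entities unambiguous.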
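TinyGreekNewsBERT is obtained by distilling a larger teacher model into a small student. As a hedged illustration of the standard soft-label distillation objective (the paper's exact loss, temperature, and weighting are not given in the abstract), the student is trained to match the teacher's temperature-softened output distribution:

```python
# Sketch of a temperature-scaled knowledge-distillation loss,
# KL(teacher || student), assuming plain logit vectors per example.
import math

def softmax(logits, T=1.0):
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened distributions.

    The T*T factor restores gradient magnitude after softening,
    as in the classic Hinton et al. distillation recipe.
    """
    p = softmax(teacher_logits, T)   # soft teacher targets
    q = softmax(student_logits, T)   # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When student and teacher logits agree, the loss is zero; any mismatch yields a positive penalty, so minimizing it pulls the student's distribution toward the teacher's.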
Source journal: IEEE Access (Computer Science, Information Systems; Engineering, Electrical & Electronic)
CiteScore: 9.80
Self-citation rate: 7.70%
Articles per year: 6673
Review time: 6 weeks
Journal description: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals; practical articles discussing new experiments or measurement techniques, or interesting solutions to engineering problems; development of new or improved fabrication or manufacturing techniques; reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.