Features extraction based on Naive Bayes algorithm and TF-IDF for news classification.

IF 2.6 · Zone 3 · Multidisciplinary Journals · JCR Q1 MULTIDISCIPLINARY SCIENCES
PLoS ONE Pub Date : 2025-07-30 eCollection Date: 2025-01-01 DOI:10.1371/journal.pone.0327347
Li Zhang
PLoS ONE, volume 20, issue 7, e0327347. Publication type: Journal Article. Link: https://doi.org/10.1371/journal.pone.0327347 PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12310027/pdf/
Citations: 0

Abstract

Features extraction based on Naive Bayes algorithm and TF-IDF for news classification.

The rapid proliferation of online news demands robust automated classification systems to enhance information organization and personalized recommendation. Although traditional methods like TF-IDF with Naive Bayes provide foundational solutions, their limitations in capturing semantic nuances and handling real-time demands hinder practical applications. This study proposes a hybrid news classification framework that integrates classical machine learning with modern advances in NLP to address these challenges. Our methodology introduces three key innovations: (1) Domain-Specific Feature Engineering, combining tailored n-grams and entity-aware TF-IDF weighting to amplify discriminative terms; (2) BERT-Guided Feature Selection, leveraging distilled BERT to identify contextually important words and resolve rare-term ambiguities; and (3) Computationally Efficient Deployment, achieving 95.2% of the accuracy of BERT at 1/52.4th of the inference cost. Evaluated on a balanced corpus of Sina News articles in 11 categories, the system demonstrates a test precision of 95.12% (vs. 84.43% for the SVM+TF-IDF baseline), with statistically significant improvements confirmed by 5-fold cross-validation (p < 0.01). The critical findings reveal strong performance in distinguishing semantically distinct categories, while exposing challenges in fine-grained differentiation. The efficiency of the framework (2.1 inference latency) and scalability (linear utilization of CPU resources) validate its practicality for real-world deployment. This work bridges the gap between traditional feature engineering and transformer-based models, offering a cost-effective solution for news platforms. Future research will explore hierarchical classification and the adaptation of dynamic topics to further refine semantic boundaries.
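For readers unfamiliar with the traditional approach the abstract starts from, the following is a minimal sketch of TF-IDF features fed into a multinomial Naive Bayes classifier. It is not the paper's hybrid framework: the n-gram tailoring, entity-aware weighting, and BERT-guided selection are not reproduced. The class name `TfidfNaiveBayes`, the helper functions, the smoothed IDF variant, and the toy English documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in tokenized docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed variant: ln((1 + N) / (1 + df)) + 1, which keeps every weight positive.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """Map a tokenized document to {term: tf * idf}, skipping unseen terms."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


class TfidfNaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.idf = fit_idf(docs)
        vocab_size = len(self.idf)
        class_docs = Counter(labels)
        weights = defaultdict(Counter)  # label -> term -> summed TF-IDF weight
        for doc, label in zip(docs, labels):
            weights[label].update(tfidf(doc, self.idf))
        # Class prior: fraction of training documents carrying each label.
        self.log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
        # Smoothed per-class term likelihoods, stored as logs.
        self.log_likelihood = {}
        for c in class_docs:
            total = sum(weights[c].values()) + alpha * vocab_size
            self.log_likelihood[c] = {
                t: math.log((weights[c][t] + alpha) / total) for t in self.idf
            }
        return self

    def predict(self, doc):
        """Return the label with the highest posterior log score."""
        vec = tfidf(doc, self.idf)
        scores = {
            c: self.log_prior[c]
            + sum(w * self.log_likelihood[c][t] for t, w in vec.items())
            for c in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy, hypothetical training data (already tokenized); not the Sina News corpus.
train_docs = [
    ["team", "wins", "final", "match"],
    ["coach", "praises", "team", "after", "match"],
    ["stocks", "fall", "as", "market", "slides"],
    ["central", "bank", "raises", "rates", "market", "reacts"],
]
train_labels = ["sports", "sports", "finance", "finance"]

model = TfidfNaiveBayes().fit(train_docs, train_labels)
prediction_sports = model.predict(["team", "match", "tonight"])
prediction_finance = model.predict(["market", "rates"])
print(prediction_sports, prediction_finance)
```

Terms never seen in training (here "tonight") are simply ignored at prediction time. A bag-of-words model like this has no notion of context, which is the kind of limitation the abstract says the BERT-guided selection step is meant to address.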

Source journal: PLoS ONE (Biology)
CiteScore: 6.20
Self-citation rate: 5.40%
Articles published: 14242
Review time: 3.7 months
About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage