Features extraction based on Naive Bayes algorithm and TF-IDF for news classification.

IF 2.6 · Zone 3, multidisciplinary journals · JCR Q1, MULTIDISCIPLINARY SCIENCES

PLoS ONE · Pub Date: 2025-07-30 · eCollection Date: 2025-01-01 · DOI: 10.1371/journal.pone.0327347

Li Zhang


Citations: 0

Abstract



The rapid proliferation of online news demands robust automated classification systems to enhance information organization and personalized recommendation. Although traditional methods like TF-IDF with Naive Bayes provide foundational solutions, their limitations in capturing semantic nuances and handling real-time demands hinder practical applications. This study proposes a hybrid news classification framework that integrates classical machine learning with modern advances in NLP to address these challenges. Our methodology introduces three key innovations: (1) Domain-Specific Feature Engineering, combining tailored n-grams and entity-aware TF-IDF weighting to amplify discriminative terms; (2) BERT-Guided Feature Selection, leveraging distilled BERT to identify contextually important words and resolve rare-term ambiguities; and (3) Computationally Efficient Deployment, achieving 95.2% of the accuracy of BERT at 1/52.4th of the inference cost. Evaluated on a balanced corpus of Sina News articles in 11 categories, the system demonstrates a test precision of 95.12% (vs. 84.43% for the SVM+TF-IDF baseline), with statistically significant improvements confirmed by 5-fold cross-validation (p < 0.01). The critical findings reveal strong performance in distinguishing semantically distinct categories, while exposing challenges in fine-grained differentiation. The efficiency of the framework (2.1 inference latency) and scalability (linear utilization of CPU resources) validate its practicality for real-world deployment. This work bridges the gap between traditional feature engineering and transformer-based models, offering a cost-effective solution for news platforms. Future research will explore hierarchical classification and the adaptation of dynamic topics to further refine semantic boundaries.
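To make the classical baseline mentioned in the abstract concrete, the following is a minimal sketch of n-gram TF-IDF features feeding a multinomial Naive Bayes classifier, written with the Python standard library only. It is not the paper's pipeline: the entity-aware weighting and BERT-guided feature selection are not reproduced, and the names `ngrams` and `TfidfNaiveBayes`, the smoothed IDF formula, the whitespace tokenizer, and the toy English corpus are all assumptions made for illustration (Chinese news text would first need word segmentation).

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n_max=2):
    """Return the unigrams plus all joined n-grams up to length n_max."""
    feats = list(tokens)
    for n in range(2, n_max + 1):
        feats += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


class TfidfNaiveBayes:
    """Hypothetical helper: multinomial Naive Bayes over TF-IDF weighted n-grams."""

    def fit(self, docs, labels, alpha=1.0):
        n_docs = len(docs)
        feats = [ngrams(d.split()) for d in docs]  # assumption: whitespace tokens
        # Document frequency: number of documents containing each feature.
        df = Counter(t for f in feats for t in set(f))
        # Smoothed inverse document frequency (one common variant).
        self.idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / n_docs) for c in self.classes}
        # Accumulate TF-IDF mass per class instead of raw counts.
        weight = {c: defaultdict(float) for c in self.classes}
        for f, y in zip(feats, labels):
            for t, w in self._tfidf(f).items():
                weight[y][t] += w
        vocab_size = len(self.idf)
        self.log_lik = {}
        for c in self.classes:
            total = sum(weight[c].values()) + alpha * vocab_size  # Laplace smoothing
            self.log_lik[c] = {
                t: math.log((weight[c][t] + alpha) / total) for t in self.idf
            }
        return self

    def _tfidf(self, feats):
        """TF-IDF weights for one document; features unseen in training are dropped."""
        tf = Counter(t for t in feats if t in self.idf)
        return {t: n * self.idf[t] for t, n in tf.items()}

    def predict(self, doc):
        x = self._tfidf(ngrams(doc.split()))
        scores = {
            c: self.log_prior[c] + sum(w * self.log_lik[c][t] for t, w in x.items())
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Toy corpus, invented for illustration only (not data from the paper).
train_docs = [
    "team wins championship final match",
    "striker scores goal in league match",
    "stock market rallies as shares climb",
    "central bank raises interest rates",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = TfidfNaiveBayes().fit(train_docs, train_labels)
print(model.predict("league match goal"))   # sports
print(model.predict("bank raises rates"))   # finance
```

Feeding fractional TF-IDF weights into a multinomial model is a pragmatic choice rather than a strictly probabilistic one. In practice a library such as scikit-learn provides equivalent building blocks (`TfidfVectorizer` with `ngram_range`, and `MultinomialNB`), which would be the more usual starting point for a baseline of this kind.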


Source journal

PLoS ONE (Biology)

CiteScore: 6.20
Self-citation rate: 5.40%
Articles published: 14242
Review time: 3.7 months

About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage