A similarity-based semi-supervised algorithm for labeling unlabeled text data
Authors: Kirankumar Singh Potshangbam, Kshetrimayum Nareshkumar Singh
Journal: Expert Systems with Applications, volume 296, Article 128941 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 7.5)
Published: 2025-07-10
DOI: 10.1016/j.eswa.2025.128941
URL: https://www.sciencedirect.com/science/article/pii/S0957417425025588
This paper presents a novel, non-iterative semi-supervised learning algorithm that uses cosine similarity between document vectors and class mean vectors to label unlabeled text data automatically. The proposed method supports multiple vectorization techniques, including CountVectorizer, TF-IDF, and Doc2Vec, and is classifier-agnostic: it is compatible with both traditional and deep learning models such as KNN, Multinomial Naïve Bayes, SGDClassifier, Logistic Regression, Feedforward Neural Networks (FNN), and Convolutional Neural Networks (CNN). Extensive experiments on benchmark datasets (BBC, Inshorts, 20-newsgroups) show that: (1) the method achieves 96.88% accuracy on BBC, 93.59% on Inshorts, and 92.49% on 20-newsgroups with only 30% labeled data, thereby reducing manual labeling effort by over 99%; (2) TF-IDF consistently outperforms CountVectorizer and Doc2Vec by 3–12 percentage points in accuracy across most experimental settings; and (3) Logistic Regression and FNN achieve the best performance among the classifiers. The method offers a practical, resource-efficient solution for real-world text classification by bridging the gap between labeled and unlabeled data without iterative retraining.
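The core idea in the abstract, assigning each unlabeled document to the class whose mean vector it is most similar to, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact procedure, so the function names (`vectorize`, `class_means`, `pseudo_label`), the plain term-count vectors (a rough stand-in for CountVectorizer), and the toy documents are all assumptions made for illustration. Any thresholds or extra steps the paper may use are not modeled here.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (hypothetical stand-in for CountVectorizer)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def class_means(labeled):
    """Average the document vectors of each class into one mean vector."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def pseudo_label(unlabeled, means):
    """Single pass: give each document the label of the most similar class mean."""
    result = []
    for text in unlabeled:
        vec = vectorize(text)
        best = max(means, key=lambda label: cosine(vec, means[label]))
        result.append((text, best))
    return result


# Toy data, invented for this example only.
labeled_docs = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the central bank raised interest rates", "business"),
    ("shares fell as the market closed", "business"),
]
unlabeled_docs = [
    "a late goal decided the match",
    "interest rates moved the market",
]

means = class_means(labeled_docs)
pseudo_labeled = pseudo_label(unlabeled_docs, means)
for text, label in pseudo_labeled:
    print(label, "<-", text)
```

Because the labels are assigned in one pass over the unlabeled documents, no model is retrained along the way, which matches the non-iterative framing in the abstract. The labeled and pseudo-labeled documents together could then be used to train any downstream classifier.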
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.