Semi-supervised learning models for document classification: A systematic review and meta-analysis

IF 3.7 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Inteligencia Artificial-Iberoamerical Journal of Artificial Intelligence Pub Date : 2023-06-09 DOI:10.4114/intartif.vol26iss72pp30-60

Alex Cevallos-Culqui, Claudia Pons, Gustavo Rodriguez


Citations: 0

Abstract

The continuous growth of digital documents on the web creates a need to find information patterns that allow organizational documents to be categorized and turned into institutional knowledge. One Artificial Intelligence technique for this purpose is text classification, which trains models either on labels (previously categorized documents) in the supervised setting or without labels in the unsupervised setting. These two traditional approaches, each with its own advantages and disadvantages, have been combined into semi-supervised models that draw on the best qualities of both. The labeling process nevertheless demands resources and time, which must be optimized to improve classification accuracy. An analysis of the different semi-supervised models would show the advantages of their training and how the structure of each one affects its classification accuracy. This study proposes a classification structure for semi-supervised models in document classification in order to analyze their qualities and categorization process. It does so through a systematic literature review (SLR) that extracts performance metrics from the identified studies and uses them in a meta-analysis presented as forest plots. The search strategy was defined with the PICOC (Population, Intervention, Comparison, Outputs, Context) method, which, guided by the research question, yields a search string. That string retrieved 228 studies, which were filtered using the PRISMA statement and a set of exclusion criteria, leaving 35 studies for the present review. Analysis of the selected studies identifies a structure for the different semi-supervised learning models and yields a scheme of their working process, which was used to extract advantages, disadvantages, and performance metrics. The forest-plot meta-analysis evaluates the classification accuracy reported by the studies for each learning model and finds that, regardless of the characteristics of their processes, active learning (0.89) and assembled learning (0.83) achieve the best performance.
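The abstract does not say which pooling model underlies the forest plots. As a rough illustration only, the sketch below shows one common way a pooled accuracy and its 95% interval (the summary row of a forest plot) can be computed: fixed-effect inverse-variance weighting of per-study accuracies. The function name `pooled_accuracy`, the choice of a fixed-effect model, and the input numbers in the usage note are all assumptions for illustration, not values or methods taken from the paper.

```python
import math


def pooled_accuracy(studies):
    """Fixed-effect inverse-variance pooling of accuracy proportions.

    studies: list of (accuracy, n_test_documents) pairs, with 0 < accuracy < 1.
    Returns (pooled_estimate, ci_low, ci_high) for an approximate 95% interval.
    This is an assumed model, not necessarily the one used in the review.
    """
    weights = []
    for acc, n in studies:
        # Binomial variance of a proportion; undefined weight if acc is 0 or 1.
        var = acc * (1.0 - acc) / n
        weights.append(1.0 / var)
    total = sum(weights)
    # Weighted mean: more precise studies (smaller variance) count for more.
    pooled = sum(w * acc for w, (acc, _) in zip(weights, studies)) / total
    se = math.sqrt(1.0 / total)
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se


if __name__ == "__main__":
    # Hypothetical inputs, not data extracted from the reviewed studies.
    example = [(0.9, 100), (0.8, 100)]
    estimate, low, high = pooled_accuracy(example)
    print(f"pooled accuracy {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Calling `pooled_accuracy` once per learning model (for example, once for the active learning studies and once for the assembled learning studies) would give one summary value per model. A random-effects model would be the usual alternative when the studies differ substantially in datasets and protocols.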


Source journal

Inteligencia Artificial-Iberoamerical Journal of Artificial Intelligence (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 2.00

Self-citation rate: 0.00%

Articles published: not listed

Review time: 8 weeks

About the journal: Inteligencia Artificial is a quarterly journal promoted and sponsored by the Spanish Association for Artificial Intelligence. The journal publishes high-quality original research papers reporting theoretical or applied advances in all branches of Artificial Intelligence. In particular, the journal welcomes: new approaches, techniques, or methods to solve AI problems, which should include demonstrations of effectiveness or improvement over existing methods (these demonstrations must be reproducible); integration of different technologies or approaches to solve broad problems or problems spanning different areas; and AI applications, which should describe in detail the problem or scenario and the proposed solution, emphasize its novelty, and present an evaluation of the AI techniques applied. In addition to rapid publication and dissemination of unsolicited contributions, the journal is also committed to producing monographs, surveys, or special issues on topics, methods, or techniques of special relevance to the AI community. Inteligencia Artificial welcomes submissions written in English, Spanish, or Portuguese. At a minimum, each contribution should include a title, summary, and keywords in English.