A Brief Survey of Text Document Classification Algorithms and Processes

IF 0.5 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Data Mining Modelling and Management, Pub Date: 2023-01-24, DOI: 10.46610/jodmm.2023.v08i01.002

N. Ranjan, R. Prasad


Citations: 0

Abstract



The exponential growth of unstructured data is one of the most critical challenges in data mining and text analytics. Around 80% of the world's data is in unstructured form, and most of it is left unanalyzed because of the complexity of analyzing it. Guaranteeing the quality of a text document classifier that classifies documents according to user preferences is difficult because of the large number of terms and data patterns involved. The World Wide Web is growing rapidly, and so is the availability of electronic documents. Automatic categorization of documents is therefore a key factor in the systematic organization of information and in knowledge discovery. Most widely used text mining and classification strategies adopt term-based approaches, but these suffer from polysemy (one term with several meanings) and synonymy (several terms with the same meaning). Classifying documents by their context requires a context-based approach. Semantic analysis of the text overcomes the limitations of the term-based approach and also improves classifier accuracy. This paper highlights the important algorithms, techniques, and methodologies that can be used for text document classification, and reviews the different stages of text document classification.
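
The polysemy and synonymy problems mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the bag-of-words representation, the cosine similarity measure, the helper names `term_vector` and `cosine`, and the toy sentences are all assumptions chosen for illustration. It shows how pure term matching can score two documents on the same topic as unrelated, and two documents on different topics as fairly similar.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to raw counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents, invented for illustration only.
car_doc = term_vector("the car needs a new engine")
auto_doc = term_vector("this automobile requires another motor")
river_bank = term_vector("the river bank flooded")
money_bank = term_vector("the bank approved the loan")

# Synonymy: same topic, but no shared terms, so the similarity is zero.
synonymy_score = cosine(car_doc, auto_doc)

# Polysemy: different senses of "bank", yet shared surface terms
# ("the", "bank") produce a clearly non-zero similarity.
polysemy_score = cosine(river_bank, money_bank)

print(f"synonymy pair: {synonymy_score:.3f}")
print(f"polysemy pair: {polysemy_score:.3f}")
```

In this toy setup the synonym pair has no term overlap at all, while the two "bank" sentences overlap on surface terms despite their different meanings. A context-based or semantic representation would be needed to tell these cases apart.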


Source journal

International Journal of Data Mining Modelling and Management (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore

1.10

Self-citation rate

0.00%

Articles published

Journal description: Facilitating the transformation from data to information to knowledge is paramount for organisations. Companies are flooded with data and conflicting information, but have limited real, usable knowledge. A process should rarely be looked at from limited angles or in parts, so isolated islands of data mining, modelling and management (DMMM) should be connected. IJDMMM highlights the integration of DMMM, statistics/machine learning/databases, each element of data chain management, types of information, and algorithms in software; from data pre-processing to post-processing; and between theory and applications. Topics covered include: artificial intelligence; biomedical science; business analytics/intelligence, process modelling; computer science, database management systems; data management, mining, modelling, warehousing; engineering; environmental science, environment (ecoinformatics); information systems/technology, telecommunications/networking; management science, operations research, mathematics/statistics; social sciences; business/economics, (computational) finance; healthcare, medicine, pharmaceuticals; (computational) chemistry, biology (bioinformatics); sustainable mobility systems, intelligent transportation systems; national security.