From Foundations to GPT in Text Classification: A Comprehensive Survey on Current Approaches and Future Trends
Authors: Marco Siino, Ilenia Tinnirello, Marco La Cascia
Journal: Foundations and Trends in Information Retrieval
DOI: 10.1561/1500000107 (https://doi.org/10.1561/1500000107)
Publication date: 2025-04-16
Text classification is a cornerstone of Natural Language
Processing (NLP), particularly when viewed through the lens
of computer science and engineering. The past
decade has seen deep learning revolutionize text classification,
propelling advancements in text retrieval, categorization,
information extraction, and summarization. The
scholarly literature spans datasets, models, and evaluation
criteria; English remains the predominant language of
focus, although studies also cover Arabic, Chinese, Hindi, and
others. The efficacy of text classification models relies heavily
on their ability to capture intricate textual relationships
and non-linear correlations, necessitating a comprehensive
examination of the entire text classification pipeline.
In the NLP domain, a plethora of text representation techniques
and model architectures have emerged, with Large
Language Models (LLMs) and Generative Pre-trained Transformers
(GPTs) at the forefront. These models are adept at
transforming extensive textual data into meaningful vector
representations encapsulating semantic information. The
multidisciplinary nature of text classification, encompassing
data mining, linguistics, and information retrieval, highlights
the importance of collaborative research to advance the field.
This work integrates traditional and contemporary text mining
methodologies, fostering a holistic understanding of text
classification.
This monograph provides an in-depth exploration of the
text classification pipeline, with a particular emphasis on
evaluating the impact of each component on the overall performance
of text classification models. The pipeline includes
state-of-the-art datasets, text preprocessing techniques, text
representation methods, classification models, evaluation
metrics, and future trends. Each section examines these
stages, presenting technical innovations and recent findings.
The work assesses various classification strategies, offering
comparative analyses, examples, and case studies. These
contributions extend beyond a typical survey, providing a
detailed and insightful exploration of the field.
About the journal:
The surge in research across all domains in the past decade has produced a plethora of new publications and an exponential growth in published research. Navigating this extensive literature and staying current has become a time-consuming challenge. While electronic publishing provides instant access to more articles than ever, discerning the ones essential for a comprehensive understanding of a topic remains difficult. To address this, Foundations and Trends® in Information Retrieval (FnTIR) publishes high-quality survey and tutorial monographs in the field.
Each issue of Foundations and Trends® in Information Retrieval (FnTIR) features a 50-100 page monograph authored by research leaders, covering tutorial subjects, research retrospectives, and survey papers that provide state-of-the-art reviews within the scope of the journal.