Bag of Words and Embedding Text Representation Methods for Medical Article Classification

IF 1.2, Zone 4 (Computer Science), Q3 AUTOMATION & CONTROL SYSTEMS

International Journal of Applied Mathematics and Computer Science Pub Date : 2023-12-01 DOI:10.34768/amcs-2023-0043

Paweł Cichosz

Pages: 603-621

Citations: 0

Abstract

Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly and with low computational demands from data that are not necessarily large and are usually imbalanced, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefits of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
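To make the contrast between the two families of representations concrete, the following is a minimal sketch in Python, not the method used in the article. It builds a bag of words vector, whose length equals the vocabulary size and which ignores unseen words, and a fixed-length text vector obtained by averaging word vectors assembled from hashed character n-grams, loosely in the spirit of fastText subword embeddings. The names `DIM`, `BUCKETS`, `NGRAM_TABLE`, and all helper functions are hypothetical, and the n-gram table is filled with random numbers standing in for trained vectors.

```python
import hashlib
import random
from collections import Counter

DIM = 8         # embedding size (assumption; trained models are typically larger)
BUCKETS = 1000  # number of hashed n-gram buckets (assumption)

# Random table standing in for trained subword vectors (illustration only).
_rng = random.Random(0)
NGRAM_TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]


def tokenize(text):
    """Lowercase whitespace tokenization; real pipelines do more cleaning."""
    return text.lower().split()


def build_vocabulary(docs):
    """Sorted list of all distinct tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bag_of_words(text, vocabulary):
    """Term-count vector with one dimension per vocabulary term."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def char_ngrams(word, n=3):
    """Character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bucket(ngram):
    """Deterministic hash of an n-gram into a table row index."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def word_vector(word):
    """Average of the n-gram vectors, so any word gets a vector, seen or not."""
    rows = [NGRAM_TABLE[bucket(g)] for g in char_ngrams(word)]
    return [sum(col) / len(rows) for col in zip(*rows)]


def text_vector(text):
    """Fixed-length text representation: the mean of its word vectors."""
    vectors = [word_vector(tok) for tok in tokenize(text)]
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


train_docs = [
    "statin therapy reduces cardiovascular risk",
    "deep learning for image segmentation",
]
vocabulary = build_vocabulary(train_docs)

# "statins" and "reduce" do not occur in the training vocabulary.
unseen = "statins reduce cardiovascular risk"
bow = bag_of_words(unseen, vocabulary)
emb = text_vector(unseen)

print(len(bow), sum(bow))  # vocabulary-sized vector; only known words are counted
print(len(emb))            # fixed-size vector, independent of vocabulary size
```

In this toy setting the bag of words vector grows with the vocabulary and silently drops the two unseen words, while the averaged subword vector keeps a fixed length and still assigns every word a representation. Either kind of vector could then be passed to a conventional classifier such as logistic regression.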


Source journal

International Journal of Applied Mathematics and Computer Science (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 4.10

Self-citation rate: 21.10%

Review time: 4.2 months

About the journal: The International Journal of Applied Mathematics and Computer Science is a quarterly that has been published in Poland since 1991 by the University of Zielona Góra in partnership with De Gruyter Poland (Sciendo) and Lubuskie Scientific Society, under the auspices of the Committee on Automatic Control and Robotics of the Polish Academy of Sciences. The journal strives to meet the demand for the presentation of interdisciplinary research in various fields related to control theory, applied mathematics, scientific computing and computer science. In particular, it publishes high-quality original research results in the following areas:
- modern control theory and practice
- artificial intelligence methods and their applications
- applied mathematics and mathematical optimisation techniques
- mathematical methods in engineering, computer science, and biology