Performance Analysis of Embedding Methods for Deep Learning-Based Turkish Sentiment Analysis Models

IF 2.6 | Zone 4, multidisciplinary journals | Q2 MULTIDISCIPLINARY SCIENCES

Arabian Journal for Science and Engineering, Pub Date: 2024-08-01, DOI: 10.1007/s13369-024-09360-4

Abdulfattah Ba Alawi, Ferhat Bozkurt

Arabian Journal for Science and Engineering, vol. 50, no. 10, pp. 7299-7321. Journal Article. https://link.springer.com/article/10.1007/s13369-024-09360-4 (PDF: https://link.springer.com/content/pdf/10.1007/s13369-024-09360-4.pdf)

Citations: 0

Abstract

The complex syntactic structure of Turkish makes sentiment analysis a challenging natural language processing (NLP) task. Conventional sentiment analysis methods often fail to identify attitudes in Turkish texts effectively, creating a need for more effective approaches. To address this need, our study investigates the effectiveness of several embedding techniques for sentiment analysis of Turkish short texts: pre-trained Turkish word embedding models (Word2Vec, GloVe, and FastText) and two character-level embedding methods, namely character-integer embedding (CIE) and character one-hot encoding embedding (COE). These embeddings are combined with deep learning (DL) models, specifically long short-term memory (LSTM), convolutional neural networks (CNNs), bidirectional LSTM (Bi-LSTM), and hybrid models. The DL-based models were evaluated on two datasets: an original Twitter (X) dataset and a publicly accessible hotel reviews dataset. In addition to providing an intensive performance analysis of the embedding strategies and assessing how well they handle the linguistic intricacies of Turkish, this study proposes a previously unexplored method for Turkish text representation that relies on character-level one-hot encoding. The findings indicate positive progress from a novel dual-pathway architecture that combines character-level and word-level representations, which constitutes a substantial contribution to NLP, specifically for morphologically complex languages. With this hybrid character-and-word strategy on the Twitter (X) data, the LSTM model obtained an F1 score of \(0.835 \pm 0.005\) under cross-validation, while CNN-BiLSTM attained the highest F1 score (0.8392) under holdout validation. On the second, public dataset (hotel reviews), the strategy produced modest improvements and ranked as the second most effective embedding technique, surpassed only by FastText. By providing an extensive performance analysis of embedding techniques and deep learning models for Turkish sentiment analysis, the findings offer practitioners practical recommendations on how to use sentiment analysis effectively to make informed decisions.
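To make the difference between the two character-level representations concrete, the following is a minimal Python sketch of how a short text could be turned into CIE-style and COE-style inputs. The abstract does not specify how the vocabulary is built, how sequences are padded, or what maximum length is used, so those choices (index 0 reserved for padding and unknown characters, a fixed `max_len`) and all function names here are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: index 0 is reserved for padding and unknown characters


def build_char_vocab(texts: List[str]) -> Dict[str, int]:
    """Map each distinct character to a positive integer (0 stays reserved)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def char_integer_encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """CIE-style input: one integer per character, truncated or padded to max_len."""
    ids = [vocab.get(ch, PAD_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


def char_one_hot_encode(text: str, vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """COE-style input: one row per character position.

    Each row is a one-hot vector of width len(vocab) + 1; padding and
    unknown characters set column 0 (an assumption of this sketch).
    """
    width = len(vocab) + 1
    rows = []
    for idx in char_integer_encode(text, vocab, max_len):
        row = [0] * width
        row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical example texts, used only to build a small vocabulary.
tweets = ["çok güzel", "hiç iyi değil"]
vocab = build_char_vocab(tweets)
ids = char_integer_encode("güzel", vocab, max_len=8)
matrix = char_one_hot_encode("güzel", vocab, max_len=8)
```

In a typical deep learning setup, the integer sequence would feed a trainable embedding layer, whereas the one-hot matrix can be passed to a CNN or LSTM directly. Because both operate on characters, Turkish suffixes and letters such as ç, ğ, and ü are represented without requiring the word to appear in a pre-trained vocabulary.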




Source journal

Arabian Journal for Science and Engineering (MULTIDISCIPLINARY SCIENCES)

CiteScore

5.70

Self-citation rate

3.40%

Articles published

993

About the journal: King Fahd University of Petroleum & Minerals (KFUPM) partnered with Springer to publish the Arabian Journal for Science and Engineering (AJSE). AJSE, which has been published by KFUPM since 1975, is a recognized national, regional, and international journal that provides an opportunity for disseminating research advances from the Kingdom of Saudi Arabia, the MENA region, and the world.