The advantages of lexicon-based sentiment analysis in an age of machine learning.

IF 2.6, Zone 3 (multidisciplinary journals), JCR Q1 MULTIDISCIPLINARY SCIENCES

PLoS ONE Pub Date : 2025-01-10 eCollection Date: 2025-01-01 DOI:10.1371/journal.pone.0313092

A Maurits van der Veen, Erik Bleich

PLoS ONE 20(1): e0313092. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11723603/pdf/

Abstract

Assessing whether texts are positive or negative (sentiment analysis) has wide-ranging applications across many disciplines. Automated approaches make it possible to code nearly unlimited quantities of text rapidly, replicably, and with high accuracy. Compared to machine learning and large language model (LLM) approaches, lexicon-based methods may sacrifice some performance, but in exchange they provide generalizability and domain independence, while crucially offering the possibility of identifying gradations in sentiment. We demonstrate the strong performance of lexica using MultiLexScaled, an approach that averages valences across a number of widely used general-purpose lexica. We validate it against benchmark datasets from a range of different domains, comparing its performance against machine learning and LLM alternatives. In addition, we illustrate the value of identifying fine-grained sentiment levels by showing, in an analysis of pre- and post-9/11 British press coverage of Muslims, that binarized valence metrics give rise to different (and erroneous) conclusions about the nature of the post-9/11 shock as well as about differences between broadsheet and tabloid coverage. The code to apply MultiLexScaled is available online.
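The idea of averaging valences across several lexica, and of what is lost when the result is binarized, can be illustrated with a minimal Python sketch. This is not the authors' MultiLexScaled code: the two toy lexica, their words and valences, the function names, and the scaling rule (dividing by each lexicon's largest absolute valence) are all assumptions made for illustration, since the abstract does not describe the actual scaling procedure.

```python
from statistics import mean

# Toy lexica with made-up words and valences (hypothetical, not real lexica).
# They deliberately use different valence ranges to show why scaling matters.
LEXICA = {
    "lex_a": {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0},
    "lex_b": {"good": 0.5, "great": 0.9, "bad": -0.6, "awful": -0.8},
}


def lexicon_score(tokens, lexicon):
    """Sum the valences of matched tokens and normalize by text length."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)


def scale_factor(lexicon):
    """Assumed scaling rule: the lexicon's largest absolute valence."""
    return max(abs(v) for v in lexicon.values())


def averaged_valence(text):
    """Average the scaled per-lexicon scores into one graded valence."""
    tokens = text.lower().split()
    return mean(
        lexicon_score(tokens, lex) / scale_factor(lex)
        for lex in LEXICA.values()
    )


def binarize(valence):
    """Collapse a graded valence to -1, 0, or 1."""
    if valence > 0:
        return 1
    if valence < 0:
        return -1
    return 0


if __name__ == "__main__":
    for text in ["a good day", "a great day", "a terrible day"]:
        v = averaged_valence(text)
        print(f"{text!r}: graded={v:.3f} binarized={binarize(v)}")
```

In this toy setup, "a good day" and "a great day" receive different graded valences but the same binarized label, which is the kind of gradation the abstract argues binarized metrics discard.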

Source journal: PLoS ONE (Biology)

CiteScore: 6.20

Self-citation rate: 5.40%

Articles published: 14242

Review time: 3.7 months

Journal description: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides: open access (freely accessible online, authors retain copyright); fast publication times; peer review by expert, practicing researchers; post-publication tools to indicate quality and impact; community-based dialogue on articles; and worldwide media coverage.