Sentimental analysis from imbalanced code-mixed data using machine learning approaches.

IF 0.9 | Zone 4, Computer Science | JCR Q3: COMPUTER SCIENCE, INFORMATION SYSTEMS

Distributed and Parallel Databases Pub Date : 2023-01-01 DOI:10.1007/s10619-021-07331-4

R Srinivasan, C N Subalalitha

Volume 41 (1-2), pages 37-52 | Publication type: Journal Article | https://doi.org/10.1007/s10619-021-07331-4

Citations: 17

Abstract


Knowledge discovery from various perspectives has become a crucial asset in almost all fields. Sentiment analysis is a classification task that labels a sentence according to the meaning it carries in context. This paper addresses the class imbalance problem, one of the important issues in sentiment analysis; few works have focused on sentiment analysis with an imbalanced class label distribution. The paper also focuses on another aspect of the problem, a concept called "code mixing". Code-mixed data consists of text that alternates between two or more languages, and class imbalance is commonly observed in such data. Existing works have focused more on analyzing sentiment in monolingual data than in code-mixed data. This paper addresses these issues and proposes a solution for analyzing sentiment in class-imbalanced code-mixed data using a sampling technique combined with Levenshtein distance metrics. Furthermore, it compares the performance of several machine learning approaches, namely Random Forest, Logistic Regression, XGBoost, Support Vector Machine, and Naïve Bayes classifiers, using the F1-score.
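The abstract names two building blocks without defining them: Levenshtein distance and the F1-score. The sketch below illustrates both in plain Python. It is not the authors' pipeline: the abstract does not say how the distance is combined with sampling, nor which F1 averaging is used, so the function names `levenshtein` and `macro_f1` and the choice of macro averaging are assumptions made only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an
    assumption here; the abstract only says "F1-score")."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

On imbalanced data, macro averaging weights every class equally, so a classifier that ignores the minority class is penalized; this is one plausible reason to report F1 rather than accuracy.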


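Source journal

Distributed and Parallel Databases | Engineering & Technology - Computer Science: Theory & Methods

CiteScore: 3.50

Self-citation rate: 0.00%

Articles published:

Review time: >12 weeks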

Journal description: Distributed and Parallel Databases publishes papers in all the traditional as well as most emerging areas of database research, including: Availability and reliability; Benchmarking, performance evaluation, and tuning; Big Data Storage and Processing; Cloud Computing and Database-as-a-Service; Crowdsourcing; Data curation, annotation and provenance; Data integration, metadata management, and interoperability; Data models, semantics, query languages; Data mining and knowledge discovery; Data privacy, security, trust; Data provenance, workflows, Scientific Data Management; Data visualization and interactive data exploration; Data warehousing, OLAP, Analytics; Graph data management, RDF, social networks; Information Extraction and Data Cleaning; Middleware and Workflow Management; Modern Hardware and In-Memory Database Systems; Query Processing and Optimization; Semantic Web and open data; Social Networks; Storage, indexing, and physical database design; Streams, sensor networks, and complex event processing; Strings, Texts, and Keyword Search; Spatial, temporal, and spatio-temporal databases; Transaction processing; Uncertain, probabilistic, and approximate databases.