Exploring Word Embedding Techniques to Improve Sentiment Analysis of Software Engineering Texts

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) Pub Date : 2019-05-26 DOI:10.1109/MSR.2019.00020

Eeshita Biswas, K. Vijay-Shanker, L. Pollock

Pages: 68-78

Citations: 23

Abstract

Sentiment analysis (SA) of text-based software artifacts is increasingly used to extract information for various tasks, including providing code suggestions, improving development team productivity, recommending software packages and libraries, and recommending comments on defects in source code, code quality, and possibilities for improving applications. Studies of state-of-the-art sentiment analysis tools applied to software-related texts have shown results that vary with the techniques and training approaches used. In this paper, we investigate the impact of two potential opportunities to improve the training for sentiment analysis of software engineering (SE) artifacts, in the context of neural networks customized using the Stack Overflow data developed by Lin et al. First, we customize the sentiment analysis process to the software domain by using software domain-specific word embeddings learned from Stack Overflow (SO) posts, and we study their impact on the performance of the sentiment analysis tool compared to generic word embeddings learned from Google News. We find that the word embeddings learned from the Google News data perform mostly similarly to, and in some cases better than, the word embeddings learned from SO posts. Second, we study the impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution. We find that oversampling alone, as well as the combination of oversampling and undersampling, helps improve the performance of a sentiment classifier.
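
To make the oversampling and undersampling idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which resampling algorithm was used, so the sketch assumes plain random resampling: classes with too few examples are padded with randomly duplicated examples, and classes with too many are randomly thinned. The function name `rebalance`, the `target` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def rebalance(samples, labels, target, seed=0):
    """Randomly resample every class to exactly `target` examples.

    Assumption: simple random resampling, chosen only for illustration.
    Classes smaller than `target` are oversampled (random duplicates are
    added); classes larger than `target` are undersampled (a random
    subset is kept).
    """
    rng = random.Random(seed)

    # Group the examples by their sentiment label.
    by_label = defaultdict(list)
    for text, label in zip(samples, labels):
        by_label[label].append(text)

    resampled = []
    for label, texts in by_label.items():
        if len(texts) >= target:
            # Undersampling: keep `target` examples, drawn without replacement.
            chosen = rng.sample(texts, target)
        else:
            # Oversampling: keep every original, then add random duplicates.
            chosen = texts + rng.choices(texts, k=target - len(texts))
        resampled.extend((text, label) for text in chosen)

    rng.shuffle(resampled)
    return resampled


# Hypothetical skewed dataset: 7 neutral, 2 positive, 1 negative.
texts = [f"post {i}" for i in range(10)]
labels = ["neutral"] * 7 + ["positive"] * 2 + ["negative"] * 1

balanced = rebalance(texts, labels, target=4)
print(Counter(label for _, label in balanced))
```

In this sketch the `neutral` class is undersampled from 7 to 4 examples, while `positive` and `negative` are oversampled up to 4 each. Resampling of this kind would normally be applied to the training split only, so that duplicated examples do not leak into the evaluation data.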
