Analyzing and Detecting Information Types of Developer Live Chat Threads

IF 6.2 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Software Engineering and Methodology Pub Date : 2024-01-29 DOI:10.1145/3643677

Xiuwei Shang, Shuai Zhang, Yitong Zhang, Shikai Guo, Yulong Li, Rong Chen, Hui Li, Xiaochen Li, He Jiang


Citations: 0

Abstract

Online chatrooms serve as vital platforms for information exchange among software developers. With multiple developers engaged in rapid communication on diverse topics, the resulting chat messages are often complex and unstructured. To make extracting information from chat threads more efficient, automatic mining techniques have been introduced for thread classification. However, previous approaches still suffer from unsatisfactory classification accuracy, owing to two primary challenges: they struggle to capture long-distance dependencies within chat threads, and they do not adequately address category imbalance in labeled datasets. To overcome these challenges, we present EAEChat, a topic classification approach for chat information types. EAEChat comprises three core components. The text feature encoding component captures contextual text features using a text feature encoder based on multi-head self-attention, and employs a siamese network to mitigate overfitting caused by limited data. The data augmentation component expands the categories with few samples in the training dataset using a technique tailored to developer chat messages, tackling the imbalanced category distribution. The non-text feature encoding component employs a feature fusion model to integrate deep text features with manually extracted non-text features. Evaluation across three real-world projects shows that EAEChat achieves an average precision, recall, and F1-score of 0.653, 0.651, and 0.644, respectively, a significant 7.60% improvement over state-of-the-art approaches. These findings confirm the effectiveness of our method in classifying developer chat messages in online chatrooms.
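To make the reported metrics concrete, the following is a minimal sketch of per-category precision, recall, and F1 with an unweighted (macro) average over categories. The abstract does not state which averaging scheme was used, so macro-averaging here is an assumption, and the category names (`bug`, `api`, `setup`) and the toy labels are purely hypothetical. One point the sketch illustrates: an averaged F1 can fall below the harmonic mean of the averaged precision and recall, which is consistent with the reported 0.644 sitting below both 0.653 and 0.651 (their harmonic mean is roughly 0.652).

```python
from collections import Counter


def harmonic_mean(p, r):
    """F1 is the harmonic mean of precision and recall (0.0 if both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall, harmonic_mean(precision, recall))
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 over all categories."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical, imbalanced toy data: one frequent and two rare categories.
y_true = ["bug"] * 6 + ["api"] * 2 + ["setup"] * 2
y_pred = ["bug"] * 5 + ["api"] + ["api", "bug"] + ["setup", "bug"]

scores = per_class_prf(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(scores)

print(f"macro precision={macro_p:.3f} recall={macro_r:.3f} f1={macro_f1:.3f}")
print(f"harmonic mean of macro P and R={harmonic_mean(macro_p, macro_r):.3f}")
```

Because macro-averaging gives every category equal weight, weak performance on rare categories pulls the averages down, which is one reason category imbalance matters for this kind of evaluation.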




Source journal

ACM Transactions on Software Engineering and Methodology (Engineering Technology - Computer Science: Software Engineering)

CiteScore

6.30

Self-citation rate

4.50%

Articles published

164

Review time

>12 weeks

About the journal: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.