Phrase2Set: Phrase-to-Set Machine Translation and Its Software Engineering Applications

2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Pub Date: 2022-03-01. DOI: 10.1109/saner53432.2022.00068

Thanh Van Nguyen, Aashish Yadavally, T. Nguyen


Abstract

Machine translation has been applied to software engineering (SE) problems such as software tagging, language migration, bug localization, and automated program repair. However, machine translation primarily supports sequence-to-sequence transformations and falls short when translating a phrase or sequence in the input into a set in the output. An example of such a task is tagging the text of a software library tutorial or a forum entry with the set of API elements relevant to it. In this work, we propose Phrase2Set, a context-sensitive statistical machine translation model that learns to transform a phrase mixing code and text into a set of code or text tokens. We first design a token-to-token algorithm that computes the probabilities of mapping individual tokens from phrases to sets. We then propose a Bayesian network-based statistical machine translation model that uses these probabilities to choose the translation process that maximizes the joint translation probability. To achieve that, we consider the context of the tokens on the source side and on the target side via their relative co-occurrence frequencies. We evaluate Phrase2Set in three SE applications: 1) tagging fragments of text in a tutorial with the relevant API elements, 2) tagging StackOverflow entries with relevant API elements, and 3) text-to-API translation. Our empirical results show that Phrase2Set achieves high accuracy and outperforms the state-of-the-art models in all three applications. We also provide the lessons learned and other potential applications.
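To make the phrase-to-set idea concrete, here is a minimal sketch of how token-to-token mapping probabilities could be estimated from relative co-occurrence frequencies and then used to pick an output set. It is not the Bayesian network model described in the abstract, which is not specified here. It is a simplified baseline under stated assumptions: the function names `train_mapping` and `translate`, the averaging rule, and the toy training pairs are all hypothetical.

```python
from collections import Counter, defaultdict

def train_mapping(pairs):
    """Estimate P(target | source) as a relative co-occurrence frequency.

    `pairs` is a list of (phrase_tokens, target_set) examples.
    Assumption: a plain frequency ratio, not the paper's algorithm.
    """
    cooc = defaultdict(Counter)   # cooc[s][t]: examples containing both s and t
    src_count = Counter()         # src_count[s]: examples containing s
    for phrase, targets in pairs:
        for s in set(phrase):
            src_count[s] += 1
            for t in targets:
                cooc[s][t] += 1
    return {s: {t: c / src_count[s] for t, c in ts.items()}
            for s, ts in cooc.items()}

def translate(phrase, mapping, k=2):
    """Score each target token by its mean mapping probability over the
    distinct phrase tokens and return the k best as an unordered set."""
    tokens = set(phrase)
    scores = Counter()
    for s in tokens:
        for t, p in mapping.get(s, {}).items():
            scores[t] += p / len(tokens)
    return {t for t, _ in scores.most_common(k)}

# Toy training data (hypothetical): text phrases paired with API element sets.
pairs = [
    (["read", "a", "file"], {"FileReader", "BufferedReader"}),
    (["write", "a", "file"], {"FileWriter"}),
    (["read", "a", "line"], {"BufferedReader"}),
]
mapping = train_mapping(pairs)
result = translate(["read", "line"], mapping, k=1)
print(result)
```

Because the output is a set, the sketch ranks candidate tokens independently and ignores their order, which is the main difference from a sequence-to-sequence decoder. A context-sensitive model such as the one the abstract describes would additionally account for co-occurrence among the target tokens themselves.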
