Shikai Guo, Bowen Ping, Zixuan Song, Hui Li, Rong Chen
2022 IEEE 13th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP). Published 2022-11-25. DOI: 10.1109/PAAP56126.2022.10010656 (https://doi.org/10.1109/PAAP56126.2022.10010656)
Generating High Quality Titles in StackOverflow via Data Denoising Method
StackOverflow is one of the most popular question-and-answer platforms on the internet, and whether a post is answered depends largely on the quality of its title. Previous studies have used real StackOverflow posts to train recurrent neural network (RNN) or transformer models that generate better titles. However, these studies ignore the noise in existing data, which prevents the models from generating higher-quality titles. To address this issue, we propose the K-clusters confidence learning for code titles (KCL-CT) model, which contains a code clustering component and a confident learning (CL) denoising component. Specifically, the code clustering component captures the word order and semantic information in code and classifies code into different functional categories. The CL denoising component receives the output of the code clustering component and applies a heuristic method based on a confidence threshold to prune the raw datasets. We conducted experiments on Java, Python, JavaScript, SQL and C# datasets. The results indicate that the proposed KCL-CT model outperforms previous state-of-the-art models by 2.0%–11.1% in BLEU score and by 2.5%–14.0% in ROUGE score.
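
The abstract does not give the exact pruning rule, so the following is only a minimal sketch of one common confident-learning heuristic, not the KCL-CT implementation. It assumes that each sample already has a cluster label (for example, from the code clustering component) and a vector of predicted class probabilities from some classifier. The function names `class_thresholds`, `find_noisy_indices` and `prune` are hypothetical. The threshold for each class is taken as the mean self-confidence of the samples carrying that label, and a sample is flagged as noisy when its most probable confidently predicted class differs from its given label.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def class_thresholds(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_classes: int
) -> List[float]:
    """Per-class threshold: mean predicted probability of class j
    over the samples whose given label is j (assumed heuristic)."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for p, y in zip(probs, labels):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled samples gets threshold 1.0, so a sample is
    # assigned to it only if the classifier is fully certain.
    return [sums[j] / counts[j] if counts[j] else 1.0 for j in range(n_classes)]


def find_noisy_indices(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_classes: int
) -> List[int]:
    """Return indices whose confidently predicted class disagrees with the label."""
    thresholds = class_thresholds(probs, labels, n_classes)
    noisy = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes whose predicted probability reaches that class's threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # no confident prediction: keep the sample
        best = max(confident, key=lambda j: p[j])
        if best != y:
            noisy.append(i)
    return noisy


def prune(
    samples: Sequence[T],
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    n_classes: int,
) -> List[T]:
    """Drop the samples flagged as noisy and return the rest in order."""
    noisy = set(find_noisy_indices(probs, labels, n_classes))
    return [s for i, s in enumerate(samples) if i not in noisy]
```

In this sketch, a sample whose label is weakly supported but for which no other class is confidently predicted is kept rather than pruned. That is a conservative choice; a stricter rule would also drop low-confidence samples.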