在非结构化数据中发现技术构件的轻量级方法

Nicolas Bettenburg, Bram Adams, A. Hassan, Michel Smidt
{"title":"在非结构化数据中发现技术构件的轻量级方法","authors":"Nicolas Bettenburg, Bram Adams, A. Hassan, Michel Smidt","doi":"10.1109/ICPC.2011.36","DOIUrl":null,"url":null,"abstract":"Developer communication through email, chat, or issue report comments consists mostly of largely unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content make the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique to extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well-understood, fast, readily available across platforms and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.","PeriodicalId":345601,"journal":{"name":"2011 IEEE 19th International Conference on Program Comprehension","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":"{\"title\":\"A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data\",\"authors\":\"Nicolas Bettenburg, Bram Adams, A. Hassan, Michel Smidt\",\"doi\":\"10.1109/ICPC.2011.36\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Developer communication through email, chat, or issue report comments consists mostly of largely unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content make the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique to extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well-understood, fast, readily available across platforms and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.\",\"PeriodicalId\":345601,\"journal\":{\"name\":\"2011 IEEE 19th International Conference on Program Comprehension\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE 19th International Conference on Program Comprehension\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPC.2011.36\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 19th International Conference on Program Comprehension","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPC.2011.36","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

摘要

开发人员通过电子邮件、聊天或问题报告注释进行的交流主要由大量非结构化数据组成,即自然语言文本,混合着诸如特定于项目的术语、缩写、源代码补丁、堆栈跟踪和标识符等技术工件。这些技术工件代表了系统技术部分的有价值的知识来源,具有广泛的应用程序,从建立可追溯性链接到创建特定于项目的词汇表。然而,自然语言和技术内容之间缺乏明确定义的边界使得技术工件的自动挖掘具有挑战性。作为从非结构化数据中提取技术工件的通用技术的第一步,我们提出了一种轻量级的方法来解开技术工件和自然语言文本的纠缠。我们的方法基于现有的拼写检查工具,这些工具易于理解、快速、跨平台可用,并且对不同类型的文本数据一视同仁。通过手工制作的基准测试,我们证明了我们的方法能够成功地在非结构化数据中发现广泛的技术工件。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data
Developer communication through email, chat, or issue report comments consists mostly of largely unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content make the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique to extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well-understood, fast, readily available across platforms and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信