Recovering software architecture from the names of source files

J. Softw. Maintenance Res. Pract., Pub Date: 1999-05-01, DOI: 10.1002/(SICI)1096-908X(199905/06)11:3%3C201::AID-SMR192%3E3.0.CO;2-1

N. Anquetil, T. Lethbridge


Citations: 123

Abstract




We discuss how to extract a useful set of subsystems from a set of existing source-code file names. This problem is challenging because many legacy systems contain thousands of file names, including some that are very short and cryptic. At the same time, the problem is important because software maintainers often find it difficult to understand such systems. We propose a general algorithm to cluster files based on their names, and a set of alternative methods for implementing the algorithm. One of the key tasks is picking candidate words to look for in file names. We do this by (a) iteratively decomposing file names, (b) finding common substrings, and (c) choosing words in routine names, in an English dictionary or in source-code comments. In addition, we investigate generating abbreviations from the candidate words in order to find matches in file names, as well as how to split file names into components when they contain no word markers. To compare and evaluate our five approaches, we present two experiments. The first compares the ‘concepts’ found in each file name by each method with the results of manually decomposing file names. The second experiment compares automatically generated subsystems with subsystem examples proposed by experts. We conclude that two methods are most effective: extracting concepts using common substrings and extracting those concepts that relate to the names of routines in the files. Copyright © 1999 John Wiley & Sons, Ltd.
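To make the common-substring idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes that any substring of at least `min_len` characters shared by two or more file names is a candidate concept, and that a shorter substring is redundant when a longer one covers exactly the same files. The function names, the `min_len` threshold, and the sample file names are all hypothetical.

```python
from collections import defaultdict


def common_substrings(names, min_len=3, min_files=2):
    """Map each substring of length >= min_len to the set of names containing it,
    keeping only substrings that occur in at least min_files distinct names."""
    occurrences = defaultdict(set)
    for name in names:
        for start in range(len(name)):
            for end in range(start + min_len, len(name) + 1):
                occurrences[name[start:end]].add(name)
    return {s: found for s, found in occurrences.items() if len(found) >= min_files}


def maximal_concepts(shared):
    """Drop a substring when a longer substring contains it and covers the same names."""
    kept = {}
    for sub, found in shared.items():
        redundant = any(
            sub != other and sub in other and found == other_found
            for other, other_found in shared.items()
        )
        if not redundant:
            kept[sub] = found
    return kept


def cluster_by_name(file_names, min_len=3):
    """Group file names by shared name fragments (assumed rule, see lead-in).

    Extensions are stripped and names lower-cased before comparison.
    Returns a dict mapping each candidate concept to a sorted list of files.
    """
    stems = defaultdict(list)
    for name in file_names:
        stems[name.rsplit(".", 1)[0].lower()].append(name)
    concepts = maximal_concepts(common_substrings(list(stems), min_len=min_len))
    return {
        concept: sorted(f for stem in found for f in stems[stem])
        for concept, found in concepts.items()
    }


if __name__ == "__main__":
    # Hypothetical legacy-style file names, for illustration only.
    sample = ["listaloc.c", "listfree.c", "actmgr.c", "actutil.c", "q.c"]
    for concept, files in cluster_by_name(sample).items():
        print(concept, files)
```

On the hypothetical sample, the sketch groups `listaloc.c` with `listfree.c` under `list` and `actmgr.c` with `actutil.c` under `act`, while `q.c` is too short to match anything. A real system would also need the abbreviation handling and dictionary or routine-name filtering mentioned in the abstract, which this sketch omits.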
