MIC check: A correlation tactic for ESE data

Daryl Posnett, Premkumar T. Devanbu, V. Filkov
Published in: 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 2012-06-01
DOI: 10.1109/MSR.2012.6224295 (https://doi.org/10.1109/MSR.2012.6224295)
Citations: 10

Abstract

Empirical software engineering researchers are concerned with understanding the relationships between outcomes of interest, e.g. defects, and process and product measures. The use of correlations to uncover strong relationships is a natural precursor to multivariate modeling. Unfortunately, correlation coefficients can be difficult to interpret and can mislead. For example, a strong correlation occurs between variables that stand in a polynomial relationship; this may lead one, mistakenly, to model a polynomially related variable in a linear regression. Likewise, a non-monotonic functional relationship, or even a non-functional relationship, might be entirely missed by a correlation coefficient. Outliers can influence standard correlation measures, tied values can unduly influence even robust non-parametric rank correlation measures, and smaller sample sizes can cause instability in correlation measures. A new bivariate measure of association, the Maximal Information Coefficient (MIC) [1], promises to simultaneously discover whether two variables have: a) any association, b) a functional relationship, and c) a nonlinear relationship. The MIC is a very useful complement to standard and rank correlation measures. It separately characterizes the existence of a relationship and its precise nature; thus, it enables more informed choices in modeling non-functional and nonlinear relationships, and provides a more nuanced indicator of potential problems with the values reported by standard and rank correlation measures. We illustrate the use of MIC using a variety of software engineering metrics. We study and explain the distributional properties of MIC and related measures in software engineering data, and illustrate the value of these measures for the empirical software engineering researcher.
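The abstract's point that a correlation coefficient can entirely miss a non-monotonic functional relationship can be illustrated with a small sketch. The code below is not the MIC algorithm of [1] and is not taken from the paper: it is a simplified stand-in that computes mutual information on equal-width k-by-k grids, normalizes by log2(k), and takes the maximum over a few grid sizes. The function names (`pearson`, `normalized_grid_mi`, `crude_grid_score`), the `max_bins` parameter, and the synthetic data are all assumptions made for illustration.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (indices 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against constant input
    return [min(int((v - lo) / width), k - 1) for v in values]


def normalized_grid_mi(xs, ys, k):
    """Mutual information on a k-by-k equal-width grid, divided by log2(k)."""
    bx, by = equal_width_bins(xs, k), equal_width_bins(ys, k)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )
    return mi / math.log2(k)


def crude_grid_score(xs, ys, max_bins=8):
    """Hypothetical MIC-like score: best normalized grid MI for k = 2..max_bins.

    Assumption: the real MIC searches over unequal grid partitions; this
    sketch only tries square, equal-width grids.
    """
    return max(normalized_grid_mi(xs, ys, k) for k in range(2, max_bins + 1))


# Synthetic data: x symmetric around zero, y a non-monotonic function of x.
xs = [i / 50 for i in range(-50, 51)]
ys_quadratic = [x * x for x in xs]
ys_linear = [2 * x + 1 for x in xs]

print("quadratic: pearson =", round(pearson(xs, ys_quadratic), 6),
      " grid score =", round(crude_grid_score(xs, ys_quadratic), 3))
print("linear:    pearson =", round(pearson(xs, ys_linear), 6),
      " grid score =", round(crude_grid_score(xs, ys_linear), 3))
```

For the quadratic case the Pearson coefficient is essentially zero because the data are symmetric around x = 0, while the grid-based score is clearly positive; for the linear case both measures are high. A researcher wanting actual MIC values would need an implementation of the method in [1] rather than this sketch.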