The effect of noise level and distribution on classification of easy gene microarray data

Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014). Pub Date: 2014-08-01. DOI: 10.1109/IRI.2014.7051903

Randall Wald, T. Khoshgoftaar, A. A. Shanab


Citations: 4

Abstract


Many bioinformatics datasets suffer from noise, which makes it difficult to build reliable models. These datasets can also exhibit class imbalance (many more examples of the negative class than the positive class), which further affects classification performance. It is not known how these two problems interact: no previous study has considered to what extent the noise level (total quantity of noise) and the noise distribution (amount of noise in each class) affect performance when considered together. To explore this question, we injected artificial class noise into twelve clean bioinformatics datasets with varying levels of class imbalance (all of which were relatively easy to learn from), varying both the level and the distribution of the noise. We found that when the number of noisy instances is at most 40% of the total number of minority-class instances, the resulting noisy datasets (regardless of which classes received the injected noise) are nearly as easy to build models from as the original, clean data. With greater levels of noise injection, however, the distribution does matter, and it matters in proportion to the imbalance of the original (clean) dataset. If the original dataset was mostly balanced, injecting noise into the minority class has little more effect than injecting it into the majority class; for highly imbalanced datasets, injecting noise into the minority class gives much worse results than injecting it into the majority class.
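The two quantities varied in the study, noise level and noise distribution, can be illustrated with a small sketch. The abstract does not describe the exact injection procedure, so the code below makes several assumptions: class noise is simulated by flipping binary labels, the noise level is expressed as a fraction of the minority-class size (matching the way the 40% threshold is stated), and the distribution is a single fraction giving the share of noisy instances taken from the minority class. The function name `inject_class_noise` and its parameters are hypothetical and do not come from the paper.

```python
import random


def inject_class_noise(labels, noise_level, minority_fraction, seed=0):
    """Flip binary labels to simulate class noise (illustrative assumption).

    labels: list of 0/1 labels; 1 is assumed to be the minority (positive) class.
    noise_level: number of noisy instances as a fraction of the minority-class size.
    minority_fraction: share of the noisy instances drawn from the minority class.
    Returns the noisy label list and the sorted indices that were flipped.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]

    # Noise level: total number of instances to corrupt.
    n_noisy = round(noise_level * len(minority_idx))
    # Noise distribution: how that total is split between the two classes.
    n_from_minority = round(minority_fraction * n_noisy)
    n_from_majority = n_noisy - n_from_minority
    if n_from_minority > len(minority_idx) or n_from_majority > len(majority_idx):
        raise ValueError("not enough instances to corrupt at this setting")

    flipped = (rng.sample(minority_idx, n_from_minority)
               + rng.sample(majority_idx, n_from_majority))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = 1 - noisy[i]  # flip 0 <-> 1
    return noisy, sorted(flipped)


# Toy imbalanced dataset: 20 minority and 80 majority instances.
labels = [1] * 20 + [0] * 80
# Noise level of 40% of the minority class, split evenly between classes.
noisy, flipped = inject_class_noise(labels, noise_level=0.4, minority_fraction=0.5)
print(len(flipped))  # 8
```

In this parameterization, the abstract's threshold corresponds to `noise_level <= 0.4`, and sweeping `minority_fraction` from 0 to 1 moves the injected noise from entirely in the majority class to entirely in the minority class.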

