{"title":"同时特征选择和元组选择的高效分类","authors":"M. Dash, V. Gopalkrishnan","doi":"10.4018/978-1-60566-748-5.CH012","DOIUrl":null,"url":null,"abstract":"It is no longer news that data are increasing very rapidly day-by-day. Particularly with Internet becoming so prevalent everywhere, the sources of data have become numerous. Data are increasing in both ways: dimensions or features and instances or examples or tuples, not all the data are relevant though. While gathering the data on any particular aspect, usually one tends to gather as much information as will be required for various tasks. One may not explicitly have any particular task, for example classification, in mind. So, it behooves for a data mining expert to remove the noisy, irrelevant and redundant data before proceeding with classification because many traditional algorithms fail in the presence of such noisy and irrelevant data (Blum and Langley 1997). As an example, consider microarray gene expression data where there are thousands of features (or genes) and only 10s of tuples (or sample tests). For example, Leukemia cancer data (Alon, Barkai et al. 1999) has 7129 genes and 72 sample tests. It has been shown that even with very few genes one can achieve the same or even better prediction acABStrAct","PeriodicalId":255230,"journal":{"name":"Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Simultaneous Feature Selection and Tuple Selection for Efficient Classification\",\"authors\":\"M. Dash, V. Gopalkrishnan\",\"doi\":\"10.4018/978-1-60566-748-5.CH012\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"It is no longer news that data are increasing very rapidly day-by-day. Particularly with Internet becoming so prevalent everywhere, the sources of data have become numerous. Data are increasing in both ways: dimensions or features and instances or examples or tuples, not all the data are relevant though. While gathering the data on any particular aspect, usually one tends to gather as much information as will be required for various tasks. One may not explicitly have any particular task, for example classification, in mind. So, it behooves for a data mining expert to remove the noisy, irrelevant and redundant data before proceeding with classification because many traditional algorithms fail in the presence of such noisy and irrelevant data (Blum and Langley 1997). As an example, consider microarray gene expression data where there are thousands of features (or genes) and only 10s of tuples (or sample tests). For example, Leukemia cancer data (Alon, Barkai et al. 1999) has 7129 genes and 72 sample tests. 
It has been shown that even with very few genes one can achieve the same or even better prediction acABStrAct\",\"PeriodicalId\":255230,\"journal\":{\"name\":\"Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4018/978-1-60566-748-5.CH012\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4018/978-1-60566-748-5.CH012","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
It is no longer news that data are increasing very rapidly day by day. Particularly with the Internet becoming so prevalent, the sources of data have become numerous. Data are growing in both directions: in dimensions (features) and in instances (examples or tuples), though not all of these data are relevant. When gathering data on any particular aspect, one usually tends to collect as much information as might be required for various tasks, without necessarily having any particular task, such as classification, in mind. It therefore behooves a data mining expert to remove noisy, irrelevant, and redundant data before proceeding with classification, because many traditional algorithms fail in the presence of such data (Blum and Langley 1997). As an example, consider microarray gene expression data, where there are thousands of features (or genes) but only tens of tuples (or sample tests). For example, the leukemia cancer data (Alon, Barkai et al. 1999) has 7129 genes and 72 sample tests. It has been shown that even with very few genes one can achieve the same or even better prediction accuracy.
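To make this last point concrete, the following is a minimal, illustrative sketch and not the chapter's own method: it uses scikit-learn's univariate SelectKBest filter and a k-nearest-neighbour classifier on a synthetic dataset that merely mimics the 7129-gene / 72-sample shape described above, and it compares cross-validated accuracy with all features against accuracy with only a small selected subset. All library choices, parameter values, and the synthetic data are assumptions made for illustration.

    # Illustrative sketch only (not the chapter's algorithm): selecting a small
    # subset of features from a very wide, very short dataset can match or even
    # improve classification accuracy. The data are synthetic and only mimic the
    # 7129-feature / 72-tuple shape mentioned in the abstract.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    # Synthetic stand-in: 72 tuples, 7129 features, only a handful informative.
    X, y = make_classification(n_samples=72, n_features=7129, n_informative=20,
                               n_redundant=50, random_state=0)

    # Baseline: classify using all 7129 features.
    baseline = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()

    # Filter approach: keep the top-20 features by a univariate F-test, then classify.
    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=20)),
        ("clf", KNeighborsClassifier()),
    ])
    selected = cross_val_score(pipe, X, y, cv=5).mean()

    print(f"accuracy with all 7129 features: {baseline:.2f}")
    print(f"accuracy with top 20 features:   {selected:.2f}")

On such wide-and-short data the filtered pipeline typically performs at least as well as the full-feature baseline, which is the behaviour the abstract alludes to; the actual chapter addresses selecting features and tuples simultaneously, which this simple univariate filter does not attempt.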