Rohan Mitra, D. Varam, Eyad Ali, Hana Sulieman, Firuz Kamalov
2022 2nd International Seminar on Machine Learning, Optimization, and Data Science (ISMODE), published 2022-12-22
DOI: 10.1109/ISMODE56940.2022.10180928 (https://doi.org/10.1109/ISMODE56940.2022.10180928)
Development of Synthetic Data Benchmarks for Evaluating Feature Selection Algorithms
This paper presents a set of synthetically generated datasets as a benchmark for evaluating feature selection algorithms (FSAs). Synthetic datasets are useful because they give full control over data parameters, including the exact number of relevant, redundant, and irrelevant features. The paper proposes four numeric datasets based on geometric objects, trigonometric equations, and multi-cut linear combinations. Each dataset has a fixed number of relevant, redundant, and irrelevant features and is evaluated with feature selection algorithms currently popular in industry and academia, which demonstrates how the datasets can serve as benchmarks for future research in feature selection. The datasets will be made available through GitHub for use as evaluation benchmarks, and the code is provided so that researchers can modify it for their own applications, such as studying the performance of FSAs or developing new synthetic data.
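To make the benchmark idea concrete, here is a minimal sketch of how a synthetic dataset with known feature roles might be built and used to check a feature selector. It is not one of the paper's four datasets: the circle-membership target, the function names `make_circle_dataset` and `mutual_information`, the feature counts, and the binned mutual-information scorer are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def make_circle_dataset(n_samples=2000, n_irrelevant=4, radius=0.8, seed=0):
    """Hypothetical toy dataset: label is 1 when (x1, x2) lies inside a circle."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_samples):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)  # relevant features
        redundant = [2.0 * x1 + 1.0, -3.0 * x2]  # linear copies of the relevant features
        noise = [rng.uniform(-1, 1) for _ in range(n_irrelevant)]  # irrelevant features
        rows.append([x1, x2] + redundant + noise)
        labels.append(1 if x1 * x1 + x2 * x2 <= radius * radius else 0)
    roles = ["relevant"] * 2 + ["redundant"] * 2 + ["irrelevant"] * n_irrelevant
    return rows, labels, roles


def mutual_information(values, labels, n_bins=8):
    """Estimate I(feature; label) in nats using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    label_counts = Counter(labels)
    mi = 0.0
    for (b, y), count in joint.items():
        mi += (count / n) * math.log(count * n / (bin_counts[b] * label_counts[y]))
    return mi


rows, labels, roles = make_circle_dataset()
scores = [
    mutual_information([row[j] for row in rows], labels)
    for j in range(len(roles))
]

# Rank features by score and compare the top picks with the known ground truth.
ranking = sorted(range(len(roles)), key=lambda j: scores[j], reverse=True)
top_k = ranking[:4]
recovered = sum(roles[j] != "irrelevant" for j in top_k)

for j in ranking:
    print(f"feature {j} ({roles[j]}): MI = {scores[j]:.3f}")
print(f"{recovered} of the top {len(top_k)} features are relevant or redundant")
```

Because the role of every column is known in advance, the ranking can be checked directly against ground truth. A univariate scorer like this one cannot tell a redundant feature from the relevant feature it copies, which is the kind of behaviour a benchmark with labelled redundant features is meant to expose.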