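Investigating the Impact of Synthetic Data Distribution on the Performance of Regression Models to Overcome Small Dataset Problems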

2020 International Seminar on Application for Technology of Information and Communication (iSemantic). Pub Date: 2020-09-19. DOI: 10.1109/iSemantic50169.2020.9234265

T. Sutojo, A. Syukur, Supriadi Rustad, Guruh Fajar Shidik, Heru Agus Santoso, Purwanto Purwanto, Muljono Muljono


Citations: 1

Abstract


Investigating the Impact of Synthetic Data Distribution on the Performance of Regression Models to Overcome Small Dataset Problems

Machine learning is widely used in various fields because of its ability to learn from data without having to determine the functional relationships that govern a system. However, small datasets often make it difficult for learning algorithms to make accurate predictions. To overcome this, an oversampling technique is needed. For regression models this is not easy to do, because placing synthetic data in a given region of the feature space must be accompanied by an appropriate target value, usually represented by an estimation function. In this paper, oversampling is therefore done by distributing synthetic data according to the Bus, Star, and Mesh topologies, using the SMOTE (Synthetic Minority Over-sampling Technique) method. In the experiments, one of the ISE (Istanbul Stock Exchange) public datasets and one of the CF (Color Filter) real datasets were tested to measure the performance of the proposed oversampling technique. In addition, results of experiments conducted on the same datasets using the MPV, FCM, and MMPV methods were used for comparison. The results show that oversampling with the Bus, Star, or Mesh distribution gives better performance than no oversampling. On the ISE dataset, the proposed method has a smaller average RMSE than the MPV, FCM, and MMPV methods. On the CF dataset, the proposed method has a smaller average RMSE than the MPV, FCM, and MMPV methods when the amount of training data is smaller than the amount of testing data.
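The abstract does not define how the Bus, Star, and Mesh topologies connect samples or how the target value of a synthetic point is estimated. The sketch below is therefore only one possible reading, not the authors' implementation. It assumes that Bus links each sample to the next in a chain, Star links one hub sample to all others, and Mesh links every pair. It also assumes that the target is interpolated linearly with the same factor as the features, as in SMOTE-style interpolation. The function names `topology_pairs`, `oversample`, and `rmse` are hypothetical.

```python
import math
import random
from itertools import combinations


def topology_pairs(n, topology):
    """Index pairs to interpolate between (assumed meaning of each topology)."""
    if topology == "bus":    # chain: each sample is linked to the next one
        return [(i, i + 1) for i in range(n - 1)]
    if topology == "star":   # hub: sample 0 is linked to every other sample
        return [(0, i) for i in range(1, n)]
    if topology == "mesh":   # every sample is linked to every other sample
        return list(combinations(range(n), 2))
    raise ValueError(f"unknown topology: {topology}")


def oversample(X, y, topology, rng):
    """SMOTE-style interpolation applied to features and target together.

    Assumption: the target of a synthetic point is the linear interpolation
    of the two parent targets, using the same factor t as the features.
    """
    X_new, y_new = [], []
    for i, j in topology_pairs(len(X), topology):
        t = rng.random()  # position along the segment from sample i to sample j
        X_new.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
        y_new.append(y[i] + t * (y[j] - y[i]))
    return X_new, y_new


def rmse(y_true, y_pred):
    """Root mean squared error, the metric reported in the abstract."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data, invented for illustration only (not from the ISE or CF datasets).
X = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [4.0, 0.0]]
y = [1.0, 2.0, 4.0, 3.0]

rng = random.Random(0)
for name in ("bus", "star", "mesh"):
    X_syn, y_syn = oversample(X, y, name, rng)
    print(name, len(X_syn))  # number of synthetic samples per topology
```

With four original samples, this reading yields 3 synthetic points for Bus, 3 for Star, and 6 for Mesh. The synthetic points would be appended to the training set before fitting a regression model, and `rmse` would then be computed on held-out test data.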
