Authors: Ling Lin, Qian Yu, Wen Ji, Yang Gao
Venue: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI)
Published: 2019-11-01
DOI: 10.1109/ICTAI.2019.00198
Feature-Selected and -Preserved Sampling for High-Dimensional Stream Data Summary
With the growth of the mobile Internet, large volumes of stream data have emerged. Stream data cannot be stored in memory in full because of its massive volume and continuous arrival. Moreover, each item should be accessed only once and processed promptly, because multiple accesses are costly. The nature of stream data therefore calls for a summary kept in main memory that enables fast incremental learning under limited time and memory. Sampling is one of the commonly used methods for constructing data stream summaries. Because traditional random sampling can deviate from the real data distribution and does not consider the true distribution of the stream data attributes, we propose a novel sampling algorithm that combines feature selection with feature-preserved sampling. We first use matrix approximation to select important features in the stream data. The feature-preserved sampling algorithm then generates high-quality representative samples over a sliding window. The sampling quality of our algorithm can guarantee a high degree of consistency between the distribution of attribute values in the population (the entire data) and that in the sample. Experiments on real datasets show that the proposed algorithm selects a representative sample with high efficiency.
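The two-stage idea in the abstract (select features, then sample a sliding window so that the sample keeps the distribution of those features) can be illustrated with a rough sketch. This is not the authors' algorithm, which the abstract does not specify. As assumptions, the sketch ranks columns by squared norm as a simple stand-in for the matrix-approximation step, and uses proportional stratified sampling as a stand-in for feature-preserved sampling. The names `select_features`, `stratified_sample`, and `summarize` are hypothetical.

```python
from collections import defaultdict, deque
import random


def select_features(rows, k):
    """Keep the k columns with the largest squared Euclidean norm.

    Assumption: this norm-based heuristic only stands in for the
    matrix-approximation step, which the abstract does not detail.
    """
    n_cols = len(rows[0])
    norms = [sum(row[j] ** 2 for row in rows) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: norms[j], reverse=True)
    return sorted(ranked[:k])


def stratified_sample(window, features, size, rng):
    """Draw `size` rows so that every combination of selected-feature
    values keeps roughly its share of the window."""
    strata = defaultdict(list)
    for row in window:
        strata[tuple(row[j] for j in features)].append(row)

    # Proportional quotas, rounded with the largest-remainder rule so
    # the allocations sum to exactly `size`.
    total = len(window)
    quotas = {key: len(rows) * size / total for key, rows in strata.items()}
    alloc = {key: int(q) for key, q in quotas.items()}
    leftover = size - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda key: quotas[key] - alloc[key],
                          reverse=True)
    for key in by_remainder[:leftover]:
        alloc[key] += 1

    sample = []
    for key, rows in strata.items():
        sample.extend(rng.sample(rows, alloc[key]))
    return sample


def summarize(stream, window_size, k, sample_size, seed=0):
    """Keep only the most recent `window_size` rows, then select
    features and sample from that window."""
    rng = random.Random(seed)
    window = deque(maxlen=window_size)  # old rows fall out automatically
    for row in stream:
        window.append(row)
    rows = list(window)
    features = select_features(rows, k)
    size = min(sample_size, len(rows))
    return features, stratified_sample(rows, features, size, rng)
```

For example, on a stream whose second column is 1 in 70% of rows and 2 in the remaining 30%, `summarize(stream, window_size=100, k=1, sample_size=10)` selects that column and returns seven rows with value 1 and three with value 2, matching the window's proportions. Stratifying on the joint values of many features would produce very sparse strata, which is one reason to reduce the feature set before sampling.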