{"title":"Frequent Set Mining for Streaming Mixed and Large Data","authors":"R. Khade, Jessica Lin, Nital S. Patel","doi":"10.1109/ICMLA.2015.218","journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)"},"publicationDate":"2015-12-01","citationCount":"8","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.218"}
Frequent Set Mining for Streaming Mixed and Large Data
Frequent set mining is a well-researched problem because of its applications in many areas of data mining, such as clustering, classification, and association rule mining. Most existing work focuses on categorical and batch data and does not scale well to large datasets. In this work, we focus on frequent set mining for mixed data. We introduce a discretization methodology to find meaningful bin boundaries when itemsets contain at least one continuous attribute, an update strategy to keep the frequent items relevant in the event of concept drift, and a parallel algorithm to find these frequent items. Our approach identifies local bins per itemset, since a global discretization may not identify the most meaningful bins. Because the relationships between attributes may change over time, the rules are updated using a weighted average method. Our algorithm fits well in the Hadoop framework, so it can be scaled up for large datasets.
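The abstract does not spell out the weighted average update, so the following Python sketch is only one plausible reading, not the authors' algorithm. It assumes purely categorical items (no discretization step), computes itemset supports per batch, and blends them into a running estimate with a hypothetical weight `alpha`. The function names `batch_supports`, `update_supports`, and `frequent` are illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def batch_supports(transactions, max_size=2):
    """Relative support of every itemset of up to max_size items in one batch."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(transactions)
    if total == 0:
        return {}
    return {itemset: n / total for itemset, n in counts.items()}


def update_supports(running, batch, alpha=0.5):
    """Weighted average of old and new supports (alpha is an assumed weight).

    An itemset missing from one side is treated as having support 0 there,
    so itemsets that stop appearing decay over successive batches.
    """
    keys = set(running) | set(batch)
    return {
        k: (1 - alpha) * running.get(k, 0.0) + alpha * batch.get(k, 0.0)
        for k in keys
    }


def frequent(supports, min_support):
    """Itemsets whose blended support reaches the threshold."""
    return {k for k, s in supports.items() if s >= min_support}


# Two batches in which the dominant pair drifts from (a, b) to (a, c).
old_batch = [["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"]]
new_batch = [["a", "c"], ["a", "c"], ["a", "c"], ["b", "c"]]

running = batch_supports(old_batch)
running = update_supports(running, batch_supports(new_batch), alpha=0.5)
frequent_sets = frequent(running, min_support=0.5)
```

With these toy batches, the pair `("a", "b")` falls below the 0.5 threshold after one update while `("a", "c")` reaches it, which is the kind of adaptation to concept drift the abstract describes. A larger `alpha` would make the running estimate react faster to new batches at the cost of stability. Because each batch's supports are computed independently, that step could in principle be distributed across workers, although this sketch runs on a single machine.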