{"title":"Character String Analysis and Customer Path in Stream Data","authors":"K. Yada","doi":"10.1109/ICDMW.2008.41","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.41","url":null,"abstract":"The purpose of this study is to propose a knowledge-discovery system that can abstract helpful information from character strings representing shopper visits to product sections associated with positive and negative purchasing events by applying character string parsing technologies to stream data describing customer purchasing behavior inside a store. Taking data that traced customers' movements, we focus on the number of times customers stop by particular product sections, and by representing those visits in the form of character strings, we propose a way to efficiently handle large stream data. During our experiment, we abstract store-section visiting patterns that characterize customers who purchase a relatively larger volume of items, and are able to show the usefulness of these visiting patterns. In addition, we examine index functions, calculation time, and prediction accuracy, and clarify technological issues warranting further research. In the present study, we demonstrate the feasibility of employing stream data in the marketing field and the usefulness of employing character parsing techniques.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134122758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Sparse Bayesian Network Learning for Spatial Applications","authors":"T. Liebig, Christine Kopp, M. May","doi":"10.1109/ICDMW.2008.124","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.124","url":null,"abstract":"Traffic routes through a street network contain patterns and are no random walks. Such patterns exist for instance along streets or between neighbouring street segments. The extraction of these patterns is a challenging task due to the enormous size of city street networks, the large number of required training data and the unknown distribution of the latter. We apply Bayesian Networks to model the correlations between the locations in space-time trajectories and address the following tasks. We introduce and examine a Bayesian Network Learning algorithm enabling us to handle the complexity and performance requirements of the spatial context. Furthermore, we apply our method to German cities, evaluate the accuracy and analyse the runtime behaviour for different parameter settings.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133601559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised Collaborative Clustering with Partial Background Knowledge","authors":"G. Forestier, Cédric Wemmert, P. Gançarski","doi":"10.1109/ICDMW.2008.116","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.116","url":null,"abstract":"In this paper we present a new algorithm for semisupervised clustering. We assume to have a small set of labeled samples and we use it in a clustering algorithm to discover relevant patterns. We study how our algorithm works against two other semisupervised algorithms when the data are multimodal. Then, we study the case where the user is able to produce few samples for some classes but not for each class of the dataset. Indeed, in complex problems, the user is not always able to produce samples for each class present in the dataset. The challenging task is consequently to use the set of labeled samples to discover other members of these classes, but also to keep a degree of freedom to discover unknown clusters, for which samples are not available. We address this problem through a series of experimentations on synthetic datasets, to show the relevance of the proposed method.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Wavelet-Based Data Perturbation for Simultaneous Privacy-Preserving and Statistics-Preserving","authors":"Lian Liu, Jie Wang, Jun Zhang","doi":"10.1109/ICDMW.2008.77","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.77","url":null,"abstract":"With the rapid development of data mining technologies, preserving privacy in certain data becomes a challenge to data mining applications in many fields, especially in medical, financial and homeland security fields. We present a privacy-preserving strategy based on wavelet perturbation to keep the data privacy and data statistical properties and data mining utilities at the same time. Our mathematical analyses and experimental results show that this method can keep the distance before and after perturbation and it can preserve the basic statistical properties of the original data while maximizing the data utilities. Through experiments on real-life datasets, we conclude that this method is a promising privacy-preserving and statistics-preserving technique.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133463207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bounding and Estimating Association Rule Support from Clusters on Binary Data","authors":"C. Ordonez, Kai Zhao, Zhibo Chen","doi":"10.1109/ICDMW.2008.47","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.47","url":null,"abstract":"The theoretical relationship between association rules and machine learning techniques needs to be studied in more depth. This article studies the use of clustering as a model for association rule mining. The clustering model is exploited to bound and estimate association rule support and confidence. We first study the efficient computation of the clustering model with K-means; we show the sufficient statistics for clustering on binary data sets is the linear sum of points. We then prove item set support can be bounded and estimated from the model. Finally, we show support bounds fulfill the set downward closure property. Experiments study model accuracy and algorithm speed, paying particular attention to error behavior in support estimation. Given a sufficiently large number of clusters, the model becomes fairly accurate to approximate support. However, as the minimum support threshold decreases accuracy also decreases. The model is fairly accurate to discover a large fraction of frequent itemsets at different support levels. The model is compared against a traditional association rule algorithm to mine frequent itemsets, exhibiting better performance at low support levels. Time complexity to compute the binary cluster model is linear on data set size, whereas the dimensionality of transaction data sets has marginal impact on time.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123256331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Risk Assessment of Atmospheric Hazard Releases Using K-Means Clustering","authors":"G. Cervone, P. Franzese, Y. Ezber, Z. Boybeyi","doi":"10.1109/ICDMW.2008.89","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.89","url":null,"abstract":"Unsupervised machine learning algorithms are used to perform statistical analysis of several transport and dispersion model runs which simulate emissions from a fixed source under different atmospheric conditions. A clustering algorithm is used to automatically group the results of the transport and dispersion simulations according to their respective cloud characteristics. Each cluster of clouds describes a distinct area at risk from potentially hazardous atmospheric contamination. Overimposing the resulting risk areas with ground maps, it is possible to assess the impact of the population exposure to the contaminants. The releases were simulated in the Bosphorus channel. Simulations were performed for one year at weekly interval, both day and night, to sample all different potential atmospheric conditions.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123045150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic Features for Multi-view Semi-supervised and Active Learning of Text Classification","authors":"Shiliang Sun","doi":"10.1109/ICDMW.2008.13","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.13","url":null,"abstract":"For multi-view learning, existing methods usually exploit originally provided features for classifier training, which ignore the latent correlation between different views. In this paper, semantic features integrating information from multiple views are extracted for pattern representation. Canonical correlation analysis is used to learn the representation of semantic spaces where semantic features are projections of original features on the basis vectors of the spaces. We investigate the feasibility of semantic features on two learning paradigms: semi-supervised learning and active learning. Experiments on text classification with two state-of-the-art multi-view learning algorithms co-training and co-testing indicate that this use of semantic features can lead to a significant improvement of performance.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"88 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120890599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Unstructured Text at Gigabyte per Second Speeds","authors":"A. Ratner","doi":"10.1109/ICDMW.2008.9","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.9","url":null,"abstract":"Humans communicate with text in thousands of languages, in dozens of scripts, in a variety of binary codes, on millions of topics. There is a need, for both government and commercial applications, to identify these text characteristics to enable follow-on processing such as transcoding, translation, transliteration, routing and prioritization. This paper deals with the implementation of real-time mining of unstructured text on high-speed hardware capable of processing network data streams at gigabyte per second speeds.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"84 12","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132477657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Adaptive Pre-filtering Technique for Error-Reduction Sampling in Active Learning","authors":"Michael Davy, S. Luz","doi":"10.1109/ICDMW.2008.52","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.52","url":null,"abstract":"Error-reduction sampling (ERS) is a high performing (but computationally expensive) query selection strategy for active learning. Subset optimisation has been proposed to reduce computational expense by applying ERS to only a subset of examples from the pool. This paper compares techniques used to construct the subset, namely random sub-sampling and pre-filtering. We focus on pre-filtering which populates the subset with more informative examples by filtering from the unlabelled pool using a query selection strategy. In this paper we establish whether pre-filtering outperforms sub-sampling optimisation, examine the effect of subset size, and propose a novel adaptive pre-filtering technique which dynamically switches between several alternative pre-filtering techniques using a multi-armed bandit algorithm. Empirical evaluations conducted on two benchmark text categorisation datasets demonstrate that pre-filtered ERS achieve higher levels of accuracy when compared to sub-sampled ERS. The proposed adaptive pre-filtering technique is also shown to be competitive with the optimal pre-filtering technique on the majority of problems and is never the worst technique.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131466529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Granularity Remote Sensing and Crop Production over Space and Time: NDVI over the Growing Season and Prediction of Cotton Yields at the Farm Field Level in Texas","authors":"B. Little, M. Schucking, B. Gartrell, Bing Chen, K. Ross, R. McKellip","doi":"10.1109/ICDMW.2008.91","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.91","url":null,"abstract":"Remote sensing has been applied to agriculture at very coarse levels of granularity (i.e., national levels) but few investigations have focused on yield prediction at the farm unit level. Specific aims of the present investigation are to analyze the ability of Moderate Resolution Imaging Spectroradiometer (MODIS) data to predict cotton yields in two highly homogeneous counties in west Texas. In one study county > 90% of cotton grown is irrigated, while the other study county 40 miles south has >85% non-irrigated cotton. Regression analysis by day from April to November at the county and farm levels reveals a highly significant ability for MODIS to predict cotton yields. R values ranged from 0.90 to 0.98 for irrigated cotton and 0.80 to 0.90 for non-irrigated cotton practices. The objective in future studies is to algorithmically extend these analyses to the ~300 million acres of arable land under cultivation in the United States.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121643817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}