J. Besson, C. Rigotti, I. Mitasiunaite, Jean-François Boulicaut
{"title":"Parameter Tuning for Differential Mining of String Patterns","authors":"J. Besson, C. Rigotti, I. Mitasiunaite, Jean-François Boulicaut","doi":"10.1109/ICDMW.2008.118","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.118","url":null,"abstract":"Constraint-based mining has been proven to be extremely useful for supporting actionable pattern discovery. However, useful conjunctions of constraints that support domain driven mining tasks generally need to set several parameter values and how to tune these parameters remains fairly open. We study this problem for substring pattern discovery, when using a conjunction of maximal frequency, minimal frequency and size constraints. We propose a method, based on pattern space sampling, to estimate the number of patterns that satisfy such conjunctions. This permits the user to probe the parameter space in many points, and then to choose some initial promising parameter settings. Our empirical validation confirms that we efficiently obtain good approximations of the number of patterns that will be extracted.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131477702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ning Liu, Jun Yan, Shuicheng Yan, Weiguo Fan, Zheng Chen
{"title":"Web Query Prediction by Unifying Model","authors":"Ning Liu, Jun Yan, Shuicheng Yan, Weiguo Fan, Zheng Chen","doi":"10.1109/ICDMW.2008.53","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.53","url":null,"abstract":"Recently, many commercial products, such as Google Trends and Yahoo! Buzz, have been released to monitor the past search engine query frequency trend. However, little research has been devoted to predicting the upcoming query trend, which is of great importance in providing guidelines for future business planning. In this paper, a unified solution is presented for such a purpose. Besides the classical time series model, we propose to integrate the cosine signal hidden periodicities model to capture periodic information of query time series. Motivated by the fact that these models cannot capture the external accidental event factors which could significantly influence the query frequency, the query correlation model is also modified and integrated for predicting the upcoming query trend. Finally, linear regression is utilized for model unification. Experiments based on 15,511,531 queries from a commercial search engine query log spanning 283 days validate the effectiveness of our proposed unified algorithm.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122946548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Airel Pérez Suárez, José Francisco Martínez Trinidad, J. A. Carrasco-Ochoa, J. Medina-Pagola
{"title":"A New Graph-Based Algorithm for Clustering Documents","authors":"Airel Pérez Suárez, José Francisco Martínez Trinidad, J. A. Carrasco-Ochoa, J. Medina-Pagola","doi":"10.1109/ICDMW.2008.69","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.69","url":null,"abstract":"In this paper a new algorithm, called CStar, for document clustering is presented. This algorithm improves recently developed algorithms like generalized star (GStar) and ACONS algorithms, originally proposed for reducing some drawbacks presented in previous Star-like algorithms. The CStar algorithm uses the condensed star-shaped sub-graph concept defined by ACONS, but defines a new heuristic that allows constructing a new cover of the thresholded similarity graph and reducing the drawbacks presented in GStar and ACONS algorithms. The experimentation over standard document collections shows that our proposal outperforms previously defined algorithms and other related algorithms used for document clustering.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124549868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple-Instance Regression with Structured Data","authors":"K. Wagstaff, T. Lane, A. Roper","doi":"10.1109/ICDMW.2008.31","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.31","url":null,"abstract":"We present a multiple-instance regression algorithm that models internal bag structure to identify the items most relevant to the bag labels. Multiple-instance regression (MIR) operates on a set of bags with real-valued labels, each containing a set of unlabeled items, in which the relevance of each item to its bag label is unknown. The goal is to predict the labels of new bags from their contents. Unlike previous MIR methods, MI-ClusterRegress can operate on bags that are structured in that they contain items drawn from a number of distinct (but unknown) distributions. MI-ClusterRegress simultaneously learns a model of the bag's internal structure, the relevance of each item, and a regression model that accurately predicts labels for new bags. We evaluated this approach on the challenging MIR problem of crop yield prediction from remote sensing data. MI-ClusterRegress provided predictions that were more accurate than those obtained with non-multiple-instance approaches or MIR methods that do not model the bag structure.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125471924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speeding up Array Query Processing by Just-In-Time Compilation","authors":"C. Jucovschi, P. Baumann, Sorin Stancu-Mara","doi":"10.1109/ICDMW.2008.73","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.73","url":null,"abstract":"Interpreted languages frequently suffer from higher processing times as compared to compiled approaches. Typically this happens when complex computations are performed. Array DBMSs, which extend database functionality with multidimensional array modeling and query support, find themselves in exactly this situation: queries often involve a large number of operations, and each such operation is applied to a large number of array elements. In this paper, we propose just-in-time compilation as an optimization method for an interpreted array query language. This is achieved by grouping suitable query nodes into complex operation nodes, for which C code is generated, compiled, and loaded during runtime. We present our approach based on the array DBMS rasdaman, discuss its benefits and its embedding into the rasdaman query evaluation, and show initial, rather promising benchmark results.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115608655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Actionable Knowledge Discovery for Threats Intelligence Support Using a Multi-dimensional Data Mining Methodology","authors":"Olivier Thonnard, M. Dacier","doi":"10.1109/ICDMW.2008.78","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.78","url":null,"abstract":"This paper describes a multi-dimensional knowledge discovery and data mining (KDD) methodology that aims at discovering actionable knowledge related to Internet threats, taking into account domain expert guidance and the integration of domain-specific intelligence during the data mining process. The objectives are twofold: i) to develop global indicators for assessing the prevalence of certain malicious activities on the Internet, and ii) to get insights into the modus operandi of new emerging attack phenomena, so as to improve our understanding of threats. In this paper, we first present the generic aspects of a domain-driven graph-based KDD methodology, which is based on two main components: a clique-based clustering technique and a concepts synthesis process using cliques' intersections. Then, to evaluate the applicability of this approach to our application domain, we use a large dataset of real-world attack traces collected since 2003. Our experimental results show that significant insights can be obtained into the domain of threat intelligence by using this multi-dimensional knowledge discovery method.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115071369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meghana Deodhar, Joydeep Ghosh, Gunjan Gupta, Hyuk Cho, I. Dhillon
{"title":"Hunting for Coherent Co-clusters in High Dimensional and Noisy Datasets","authors":"Meghana Deodhar, Joydeep Ghosh, Gunjan Gupta, Hyuk Cho, I. Dhillon","doi":"10.1109/ICDMW.2008.20","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.20","url":null,"abstract":"Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional \"one-sided\" clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it extremely well suited to a number of real life applications. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data as compared to several other prominent approaches that have been applied to this task. We also point out other interesting applications of the proposed framework in solving difficult clustering problems.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129651785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research on Methodology of Classification Mining for Tumor Markers","authors":"Wei Jiang, Min Yao, Jiekai Yu","doi":"10.1109/ICDMW.2008.74","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.74","url":null,"abstract":"Reliability is one of the key issues in data mining. In the case of massive protein mass spectrum data from SELDI-TOF-MS, this paper proposes an effective and reliable method to extract tumor markers. First of all, an adaptive threshold approach based on wavelet transformation is put forward to eliminate the noise in raw data so as to furnish reliable foundation for tumor markers extraction. Then a kind of genetic algorithm based on SVM is designed to construct discriminating model in order to find the optimal combination of distinct protein peaks and obtain tumor markers. Finally, the method proposed in this paper is applied to extract tumor markers from the protein mass spectrum data that come from normal mouse serums and induced pancreatic cancer mouse serums to verify the feasibility and reliability of our method.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126038010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Co-training by Committee: A New Semi-supervised Learning Framework","authors":"Mohamed Farouk Abdel Hady, F. Schwenker","doi":"10.1109/ICDMW.2008.27","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.27","url":null,"abstract":"For many data mining applications, it is necessary to develop algorithms that use unlabeled data to improve the accuracy of the supervised learning. Co-Training is a popular semi-supervised learning algorithm. It assumes that each example is represented by two or more redundantly sufficient sets of features (views) and these views are independent given the class. However, these assumptions are not satisfied in many real-world application domains. Therefore, we present a framework called co-training by committee (CoBC), in which a set of diverse classifiers are used to learn each other. The framework is a simple, general single-view semi-supervised learner that can use any ensemble learner to build diverse committees. Experimental studies on CoBC using bagging, AdaBoost and the random subspace method (RSM) as ensemble learners demonstrate that error diversity among classifiers leads to an effective co-training that requires neither redundant and independent views nor different learning algorithms.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115049251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Combining Structured Pattern Mining and Graph Kernels","authors":"Fabrizio Costa, Björn Bringmann","doi":"10.1109/ICDMW.2008.125","DOIUrl":"https://doi.org/10.1109/ICDMW.2008.125","url":null,"abstract":"This paper presents a novel approach to feature construction for structured data in order to enhance graph prediction classification performance. To this end we combine graph mining techniques with graph kernel based classifiers. The main idea is to employ efficient mining techniques to extract a set of patterns correlated with the target concept and use these, or a selected subset of these, to annotate the original graph structures. A decomposition kernel is then defined on the enriched structured data instances. Experimental results on carcinogenic and toxicological activity prediction tasks for small molecules show that the proposed technique significantly increases classification performance.","PeriodicalId":175955,"journal":{"name":"2008 IEEE International Conference on Data Mining Workshops","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124117574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}