{"title":"Clustering Categorical Data Based on Maximal Frequent Itemsets","authors":"Dadong Yu, Dongbo Liu, Rui Luo, Jianxin Wang","doi":"10.1109/ICMLA.2007.11","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.11","url":null,"abstract":"Clustering categorical data has received increasing attention in recent years, but several aspects of existing algorithms, such as the interpretability of the clusters found and the impact of data selection order, are not well addressed. A novel categorical data clustering algorithm called CLUBMIS is proposed in this paper, which can effectively find the interesting clusters. In addition, the clusters can be easily interpreted by the maximal frequent itemsets used in the clustering process. Unlike most hierarchical clustering algorithms, CLUBMIS clusters datasets based on summarized information, i.e. maximal frequent itemsets, and thus eliminates the effect of different data selection orders.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115989283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using evolutionary sampling to mine imbalanced data","authors":"D. J. Drown, T. Khoshgoftaar, R. Narayanan","doi":"10.1109/ICMLA.2007.73","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.73","url":null,"abstract":"Class imbalance tends to cause inferior performance in data mining learners. Evolutionary sampling is a technique which seeks to counter this problem by using genetic algorithms to evolve a reduced sample of a complete dataset to train a classification model. Evolutionary sampling works to remove noisy and duplicate instances so that the sampled training data will produce a superior classifier. We propose this novel technique as a method to handle severe class imbalance in data mining. This paper presents our research into the use of evolutionary sampling with C4.5 decision trees and compares the technique's performance with random undersampling.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129680582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modifying kernels using label information improves SVM classification performance","authors":"Martin Renqiang Min, A. Bonner, Zhaolei Zhang","doi":"10.1109/ICMLA.2007.84","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.84","url":null,"abstract":"Kernel learning methods based on kernel alignment with semidefinite programming (SDP) are often memory intensive and computationally expensive, and thus often impractical for problems with large datasets. We propose a method using label information to modify kernels based on SVD and a linear mapping. As a result, the new kernel matrix reflects the label-dependent separability of the data in a better way than the original kernel matrix. In addition, our experimental results on USPS handwritten digits and the SCOP dataset show that the SVM classifier based on the improved kernels has better performance than the SVM classifier based on the original kernels; moreover, SVM based on the improved profile kernel with pull-in homologs (see experiment section for explanations) produced the best results for remote homology detection on the SCOP dataset compared to the published results.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128766721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text Mining and Ontology Applications in Bioinformatics and GIS","authors":"S. Navathe","doi":"10.1109/ICMLA.2007.122","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.122","url":null,"abstract":"Informatics and computers have not yet become as pervasive in chemistry as they have in physics and biology. Drawing analogies from bioinformatics, key ingredients for progress in chemoinformatics are the availability of large, annotated databases of compounds and reactions, data structures and algorithms to efficiently search these databases, and computational methods to predict the physical, chemical, and biological properties of new compounds and reactions. We will describe the development of: (1) a large public database of compounds and reactions (ChemDB); (2) machine learning kernel methods to predict molecular properties; and (3) the applications of these methods to drug screening/design problems and the identification of new drug leads against a major disease. More broadly, we will discuss some of the challenges and opportunities for computer science, AI, and machine learning in chemistry. Abstract: This talk will present some general problem areas and solutions in two fields of applications of machine learning: bioinformatics and Geographic Information Systems (GIS). The bioinformatics arena is very broad and encompasses many problems such as gene finding in sequences, molecular pathway construction, protein structure prediction etc. We will outline our research on finding important keywords from the biomedical literature by statistical analysis and some natural language analysis. We have also incorporated ontologies such as UMLS (Unified Medical Language System) to determine relationships among biological and medical concepts. The primary goal of this work has been to interpret the long lists of genes that are derived in microarray experiments used to understand and treat diseases. We are able to cluster genes based on their functional similarity. We have also used lists of keywords as feature vectors to drive SVM models for a classification of literature. In particular, we have dealt with the classification of relevant literature for public health at the CDC (Centers for Disease Control). We will briefly explain the discovery of biomarkers for cancer using a technique that combines SVM and gene ontology.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127788365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparsity regularization path for semi-supervised SVM","authors":"G. Gasso, Karina Zapien Arreola, S. Canu","doi":"10.1109/ICMLA.2007.81","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.81","url":null,"abstract":"Using unlabeled data to unravel the structure of the data to leverage the learning process is the goal of semi-supervised learning. A common way to represent this underlying structure is to use graphs. Flexibility of the maximum margin kernel framework allows one to model graph smoothness and to build kernel machines for semi-supervised learning such as Laplacian SVM [1]. But a common complaint of the practitioner is the long running time of these kernel algorithms for classification of new points. We provide an efficient way of alleviating this problem by using an L1 penalization term and a regularization path algorithm to efficiently compute the solution. Empirical evidence shows the benefit of the algorithm.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126423083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Biomarker Identification by Knowledge-Driven Multi-Level ICA and Motif Analysis","authors":"Li Chen, J. Xuan, Chen Wang, Y. Wang, I. Shih, Tian-Li Wang, Zhen Zhang, R. Clarke, E. Hoffman","doi":"10.1109/ICMLA.2007.58","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.58","url":null,"abstract":"Many statistical methods often fail to identify biologically meaningful biomarkers related to a specific disease under study from expression data alone. In this paper, we develop a novel strategy, namely knowledge-driven multi-level independent component analysis (ICA), to infer regulatory signals and identify biologically relevant biomarkers from microarray data. Specifically, based on multi-level clustering results and partial prior knowledge, we apply ICA to find stable disease specific linear regulatory modes and then extract associated biomarker genes. A statistical test is designed to evaluate the significance of transcription factor enrichment for extracted gene set based on motif information. The experimental results on an Rsf-1 induced microarray data set show that our knowledge-driven method can extract more biologically meaningful biomarkers with significant enrichment of transcription factors related to ovarian cancer compared to other gene selection methods with/without prior knowledge.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127142413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An optimization method for selecting parameters in support vector machines","authors":"Yulin Dong, Manghui Tu, Zhonghang Xia, Guangming Xing","doi":"10.1109/ICMLA.2007.38","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.38","url":null,"abstract":"It has been shown that the cost parameters and kernel parameters are critical in the performance of support vector machines (SVMs). A standard parameter selection method compares parameters among a discrete set of values, called the candidate set, and picks the one which has the best classification accuracy. As a result, the choice of parameters strongly depends on the pre-defined candidate set. In this paper, we formulate the selection of the cost parameter and kernel parameter as a two-level optimization problem, in which the values of parameters vary continuously and thus optimization techniques can be applied to select ideal parameters. Due to the non-smoothness of the objective function in our model, a genetic algorithm has been presented. Numerical results show that the two-level approach can significantly improve the performance of SVM classifier in terms of classification accuracy.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127273312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An incremental viterbi algorithm","authors":"J. Bobbin","doi":"10.1109/ICMLA.2007.49","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.49","url":null,"abstract":"This paper describes an incremental version of the Viterbi dynamic programming algorithm. The incremental algorithm is shown to dramatically reduce memory usage in long state sequence problems compared with the standard Viterbi algorithm while having no measurable impact on the algorithm's runtime. In addition, the set of problems to which the Viterbi algorithm can be applied is extended by the incremental algorithm to include problems of finding optimal paths in realtime domains. The Viterbi algorithm is widely used to find optimal paths in hidden Markov models (HMM), and HMMs are frequently applied to both streaming data problems where realtime solutions can be of interest, and to large state sequence problems in areas like bioinformatics. In this paper we apply the incremental algorithm to finding optimal paths in a variant of the burst detection HMM applied to the novel problem of detecting user activity levels in digital evidence data derived from hard drives.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134631436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-based context-sensitive spelling correction at web scale","authors":"Andrew Carlson, Ian Fette","doi":"10.1109/ICMLA.2007.50","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.50","url":null,"abstract":"We study the problem of correcting spelling mistakes in text using memory-based learning techniques and a very large database of token n-gram occurrences in web text as training data. Our approach uses the context in which an error appears to select the most likely candidate from words which might have been intended in its place. Using a novel correction algorithm and a massive database of training data, we demonstrate higher accuracy on correcting real-word errors than previous work, and very high accuracy at a new task of ranking corrections to non-word errors given by a standard spelling correction package.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115246701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning with limited minority class data","authors":"T. Khoshgoftaar, Chris Seiffert, J. V. Hulse, Amri Napolitano, A. Folleco","doi":"10.1109/ICMLA.2007.76","DOIUrl":"https://doi.org/10.1109/ICMLA.2007.76","url":null,"abstract":"A practical problem in data mining and machine learning is the limited availability of data. For example, in a binary classification problem it is often the case that examples of one class are abundant, while examples of the other class are in short supply. Examples from one class, typically the positive class, can be limited due to the financial cost or time required to collect these examples. This work presents a comprehensive empirical study of learning when examples from one class are extremely rare, but examples of the other class(es) are plentiful. Specifically, we address the issue of how many examples from the abundant class should be used when training a classifier on data where one class is very rare. Nearly one million classifiers were built and evaluated to generate the results presented in this work. Our results demonstrate that the often used 'even distribution' is not optimal when dealing with such rare events.","PeriodicalId":448863,"journal":{"name":"Sixth International Conference on Machine Learning and Applications (ICMLA 2007)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123897272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}