{"title":"Robust Projective Dictionary Learning by Joint Label Embedding and Classification","authors":"Weiming Jiang, Zhao Zhang, Jie Qin, Mingbo Zhao, Fanzhang Li, Shuicheng Yan","doi":"10.1109/ICDMW.2017.72","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.72","url":null,"abstract":"In this paper, we propose a new discriminative dictionary learning framework, called robust Label Embedding Projective Dictionary Learning (LE-PDL), for data classification. LE-PDL can learn a discriminative dictionary and block-diagonal representations without using l0-norm or l1-norm sparsity regularization, since the l0- or l1-norm constraint on the coding coefficients used in existing DL methods makes the training phase time-consuming. To enhance performance, we also consider the label information of the dictionary atoms in the learning process of LE-PDL to encourage intra-class atoms to deliver similar profiles and to enforce a block-diagonal coefficient matrix. In addition, LE-PDL involves an underlying projection that bridges data with their coefficients by extracting special features from the given data. We can then train a classifier on the extracted features so that the classification and representation powers are jointly considered. As a result, classification with our model is efficient, since it avoids the extra time-consuming sparse reconstruction with the trained dictionary for each new test sample that most existing DL methods require. Furthermore, a robust l2,1-norm regularizer is imposed on the classifier and a non-negative constraint is used for the coding coefficients to enhance performance. Experimental results show the effectiveness of our formulation.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123615099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Method for Stock Price Prediction Based on MRFs and SSVM","authors":"Lin Lai, Chang Li, Wen Long","doi":"10.1109/ICDMW.2017.113","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.113","url":null,"abstract":"Trading strategies based on both financial analysis and machine learning techniques are becoming increasingly popular due to their ability to capture micro market price movements and leverage big data. An important class of work focuses on exploiting the structural relationships between companies for accurate stock price prediction. In this paper, we develop an algorithm for learning the parameters of unary and binary potentials in binary Markov random fields (MRFs) under the max-margin framework. We first show how to train unary potentials using market price data and Gaussian Mixture Models (GMMs). Then, we develop a graph-cut based algorithm to solve the inference problem exactly. We demonstrate the learning of the potentials' parameters using a max-margin learning framework. Experiments compare the performance of our formulation with a conventional SVM method. Results show that our method outperforms SVM by 27.9% on the training set and 40.5% on the test set.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123767482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RESTRAC: REference Sequence Based Space TRAnsformation for Clustering","authors":"A. T. Islam, S. Pramanik, Vahid Mirjalili, S. Sural","doi":"10.1109/ICDMW.2017.66","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.66","url":null,"abstract":"Effective mining of the large amounts of DNA and RNA fragments obtained from next generation sequencing technologies depends on the availability of efficient analytical tools to process them. One important aspect of this analysis, given the huge number of fragments, is partitioning them based on their level of similarity. In this paper we propose a space transformation based clustering approach to achieve this partitioning. In this approach, we use a set of reference sequences to transform each sequence into a point in a multidimensional vector space and perform the clustering in this vector space. We show through extensive analysis that the proposed transformation very closely preserves the clustering properties of the sequences under edit distance. The time for this transformation is linear in the number of sequences. The time saving for this clustering is significant because edit distance calculations between two sequences are replaced by vector distance calculations between the two corresponding feature vectors. We used agglomerative hierarchical clustering with single and average linkage because they are frequently used by the bioinformatics community. Agglomerative hierarchical clustering runs in quadratic time in the number of sequences, and the clustering time in the edit space can be prohibitive for a large number of sequences. There exist greedy heuristic methods that perform clustering much faster but at the cost of significantly reduced cluster quality. We have applied our method to 16S rRNA fragment datasets obtained from different environmental samples. In these experiments, RESTRAC achieves up to five hundred times speed-up for single linkage and up to five times speed-up for average linkage while preserving good cluster quality.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130508143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature Selection in Learning Using Privileged Information","authors":"R. Izmailov, Blerta Lindqvist, Peter Lin","doi":"10.1109/ICDMW.2017.131","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.131","url":null,"abstract":"The paper considers the problem of feature selection in learning using privileged information (LUPI), where some of the features (referred to as privileged ones) are only available for training, while being absent for test data. In the latest implementation of LUPI, these privileged features are approximated using regressions constructed on standard data features, but this approach could lead to polluting the data with poorly constructed and/or noisy features. This paper proposes a privileged feature selection method that addresses some of these issues. Since not many LUPI datasets are currently available in open access, while calibration of parameters of the proposed method requires testing it on a wide variety of datasets, a modified version of the method for the traditional machine learning paradigm (i.e., without privileged features) was also studied. This led to a novel mechanism of error rate reduction by constructing and selecting additional regression-based features capturing mutual relationships among standard features. The results on calibration datasets demonstrate the efficacy of the proposed feature selection method both for standard classification problems (tested on multiple calibration datasets) and for LUPI (for several datasets described in the literature).","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116558740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controversy Detection Using Reactions on Social Media","authors":"Allaparthi Sriteja, Prakhar Pandey, Vikram Pudi","doi":"10.1109/ICDMW.2017.121","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.121","url":null,"abstract":"In this work we demonstrate a method to detect controversy on news issues. This is done by analyzing people's reactions on social media to news articles reporting these issues. Detecting controversial news topics on the web is a relevant problem today. It helps to identify the issues on which people have divided opinions and is especially useful for topics such as presidential elections, government reforms, and climate change. We use sentiment analysis and word matching to accomplish this task. We show the application of our method to detecting controversial topics during the 2016 US presidential election.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131231335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global Distribution of Watches: A Network Analysis of Trade Relations","authors":"P. Donzé, Ken Ishibashi, Bo Wu, Yuta Kaneko, Kei Miyazaki, Keiji Takai","doi":"10.1109/ICDMW.2017.86","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.86","url":null,"abstract":"The worldwide market for luxury and fashion goods is dominated today by a handful of multinational corporations (MNCs). The way MNCs access foreign markets and organize distribution, however, remains unclear. In this paper, based on an analysis of foreign trade statistics, we take the example of watches and provide a model to highlight the most important flows as well as regional hubs in this global distribution system. By treating the matrix data on watch distribution as a network consisting of countries (nodes) and trades (links), we apply network analysis to extract the hub nodes that play an important role. The network is visualized to represent the distribution system while focusing on heavily weighted links. The analysis demonstrates that the flow of watches does not always run directly from the country of production to end consumers. Intermediaries play a key role, especially in regional markets like Asia, Europe and North America.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133713358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Dimensional Density Estimation for Data Mining Tasks","authors":"Alexander P. Kuleshov, A. Bernstein, Y. Yanovich","doi":"10.1109/ICDMW.2017.74","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.74","url":null,"abstract":"Consider the problem of estimating an unknown high-dimensional density whose support lies on an unknown low-dimensional data manifold. This problem arises in many data mining tasks, and the paper proposes a new geometrically motivated solution to the problem in the manifold learning framework, including an estimation of the unknown support of the density. First, a tangent bundle manifold learning problem is solved, which transforms the high-dimensional data into low-dimensional features and estimates the Riemannian tensor on the data manifold. After that, the unknown density of the constructed features is estimated using an appropriate kernel approach. Finally, using the estimated Riemannian tensor, the final estimator of the initial density is constructed.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133258226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Employer Industry Classification Using Job Postings","authors":"Mahak Goindani, Qiaoling Liu, Josh Chao, V. Jijkoun","doi":"10.1109/ICDMW.2017.30","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.30","url":null,"abstract":"In the recruitment domain, knowing the employer industry of jobs is important for gaining insight into the demand in each industry. The existing system at CareerBuilder uses an employer name normalization system and an employer knowledge base to infer the employer industry of a job. However, errors may occur during the computation of the job employer and in the construction of the employer knowledge base with the industry attributes. Since the knowledge base is huge, it is not possible to detect the errors manually. Therefore, in this paper we use machine learning techniques to detect the errors automatically. With the observation that the main jobs posted by an employer often relate to the employer industry, e.g., truck driver jobs often correspond to employers belonging to the transportation industry, we develop a system that classifies the industry of an employer using job posting data. We aggregate job postings from an employer and use job titles and employer names as features for predicting the industry of the employer. We used two models for classification: (1) Support Vector Machine (SVM), and (2) Gradient Boosted Decision Trees (GBDT), and observed that while both models perform similarly in classifying job employers that were correctly computed, GBDT is more effective than SVM in identifying job employers that were wrongly computed. We also show the utility of our system in detecting normalization errors and knowledge base errors.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114634336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Non-linearity in MIR Spectroscopic Data for Compressed Learning","authors":"Dixon Vimalajeewa, D. Berry, Eric Robson, C. Kulatunga","doi":"10.1109/ICDMW.2017.77","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.77","url":null,"abstract":"Mid-Infrared (MIR) spectroscopy has emerged as the most economically viable technology to determine milk values as well as to identify a set of animal phenotypes related to health, feeding, well-being and environment. However, Fourier transform-MIR spectra contain a significant amount of redundant data. This creates critical issues such as increased learning complexity when performing Fog and Cloud based data analytics in smart farming. These issues can be resolved by compressing the data with unsupervised techniques like PCA and performing analytics in the compressed domain, i.e., without decompressing. Compression algorithms should preserve the non-linearity of MIRS data (if it exists), since emerging advanced learning algorithms can exploit it to improve their prediction accuracy. This study investigates the non-linearity between the feature variables in the measurement domain as well as in two compressed domains obtained with standard Linear PCA and Kernel PCA. The non-linearity between the feature variables and the commonly used target milk quality parameters (Protein, Lactose, Fat) is also analyzed. The study evaluates the prediction accuracy using PLS and LS-SVM as linear and nonlinear predictive models, respectively.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130605623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring Network Structure Metrics as a Proxy for Socio-Political Activity in Social Media","authors":"Selvas Mwanza, H. Suleman","doi":"10.1109/ICDMW.2017.120","DOIUrl":"https://doi.org/10.1109/ICDMW.2017.120","url":null,"abstract":"Social surveys have been used by researchers and policy makers as an essential tool for understanding social and political activities in society. Social media has introduced a new way of capturing data from large numbers of people. Unlike surveys, social media delivers data more rapidly and cheaply. In this paper, we aim to rapidly identify socio-political activity in South Africa using proxy data from social media. We measure and analyse scalar properties of a network created by user interactions on Twitter. Our experimental results show that network diameter and reciprocity are statistically significant in determining socio-political activity.","PeriodicalId":389183,"journal":{"name":"2017 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125746923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}