{"title":"Two approaches of using heavy tails in high dimensional EDA","authors":"Momodou L. Sanyang, Hanno Muehlbrandt, A. Kabán","doi":"10.1109/ICDMW.2014.184","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.184","url":null,"abstract":"We consider the problem of high dimensional black-box optimisation via Estimation of Distribution Algorithms (EDA). The Gaussian distribution is commonly used as a search operator in most EDA methods. However, there are indications in the literature that heavy tailed distributions may perform better due to their higher exploration capabilities. Univariate heavy tailed distributions have already been proposed for high dimensional problems. In 2D problems it has been reported that a multivariate heavy tailed (such as Cauchy) search distribution is able to blend together the strengths of multivariate modelling with a high exploration power. In this paper, we study whether a similar scheme would work well in high dimensional search problems. To get around the difficulty of multivariate model building in high dimensions we employ a recently proposed random projections (RP) ensemble based approach which we modify to get samples from a multivariate Cauchy using the scale-mixture representation of the Cauchy distribution. Our experiments show that the resulting RP-based multivariate Cauchy EDA consistently improves on the performance of the univariate Cauchy search distribution. However, intriguingly, the RP-based multivariate Gaussian EDA has the best performance among these methods. It appears that the highly explorative nature of the multivariate Cauchy sampling is exacerbated in high dimensional search spaces and the population based search loses its focus and effectiveness as a result. Finally, we present an idea to increase exploration while maintaining exploitation and focus by using the RP-based multivariate Gaussian EDA in which the RP matrices are drawn with i.i.d. heavy tailed entries. This achieves improved performance and is competitive with the state of the art.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115594695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling of Writing and Thinking Process in Handwriting by Digital Pen Analysis","authors":"Kenshin Ikegami, Y. Ohsawa","doi":"10.1109/ICDMW.2014.85","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.85","url":null,"abstract":"In order to acquire infrequent events as new ideas and evaluate the ideas quantitatively, it is necessary to know how people create and refine ideas and to model the creating and refining process. In this paper, we focus on the relation between thinking time and writing time in handwriting, and propose to model the relation by externalization, classification, relation, transportation and systematization, which are the elements of making sentences. The relation depended on the questions and formats of the sheets. When sheets give participants a question answered by sentences, writing time becomes longer as thinking time becomes longer. On the other hand, if sheets give a question which can be answered only by words, writing time becomes shorter as thinking time becomes longer. We hypothesized that participants spent more time classifying, relating and transporting words when answering only by words than when answering by sentences. We could also confirm that when the same questions were given twice, writing time became longer and thinking time became shorter the second time than the first time. This was because enough externalization was performed the first time and participants spent less time externalizing the second time.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128784801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Algorithm for Discovering Topologies in Social Networks","authors":"Jyoti Rani Yadav, D. Somayajulu, P. Krishna","doi":"10.1109/ICDMW.2014.75","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.75","url":null,"abstract":"Discovering topologies in a social network targets various business applications such as finding key influencers in a network, recommending music and movies in virtual communities, finding active groups in a network and promoting a new product. Since social networks are large in size, discovering topologies from such networks is challenging. In this paper, we present a scalable topology discovery approach using the Giraph platform and perform (i) graph structural analysis and (ii) graph mining. For graph structural analysis, we consider various centrality measures. First, we find top-K centrality vertices for a specific topology (e.g. star, ring and mesh). Next, we find other vertices which are in the neighborhood of top centrality vertices and then create the cluster based on structural density. We compare our clustering approach with the DBSCAN algorithm on the basis of the modularity parameter. The results show that clusters generated through the structural density parameter are better in quality than those generated through the neighborhood density parameter.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126671885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expander Graph Quality Optimisation in Randomised Communication","authors":"P. Poonpakdee, G. D. Fatta","doi":"10.1109/ICDMW.2014.150","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.150","url":null,"abstract":"Epidemic protocols provide a randomised communication and computation paradigm for large and extreme-scale networked systems and can be adopted to build decentralised and fault-tolerant services. They have recently been proposed for the formulation of knowledge discovery algorithms in extreme scale environments. In distributed systems they rely on membership protocols to provide a peer sampling service. Epidemic membership protocols induce a network overlay topology that continuously evolves over time, quickly converging to random graphs. This work investigates the expansion property of the series of network overlay topologies induced by epidemic membership protocols. A search heuristic is adopted for the design of a novel epidemic membership protocol. The proposed Expander Membership Protocol explicitly aims at improving the expansion quality of the overlay topologies and incorporates a connectivity recovery mechanism to overcome the known issue of multiple connected components. In the comparative analysis the proposed protocol shows a faster convergence to random graphs and greater topology connectivity robustness than the state-of-the-art protocols, resulting in an overall better performance of global aggregation tasks.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"79 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130575239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Checkins: Exploring User Motivation on Location Based Social Networks","authors":"Fengjiao Wang, Guan Wang, Philip S. Yu","doi":"10.1109/ICDMW.2014.175","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.175","url":null,"abstract":"Checkins, the niche service provided by location based social networks (LBSN), bridge users' online activities and offline social lives in a seamless way. Therefore, knowledge discovery on check in data has become an important research direction [1], [2], [3], [4]. However, a fundamental and interesting question about checkins remains unanswered. What are people's motivations behind those checkins? We make the first attempt to answer this question. Motivation studies first appear in social psychology in a less quantitative way. For example, the goal-directed behavior (MGB) model [5] uncovers the association between behaviors and motivations. Following a similar rationale, we design a computational model for the mining of user check in motivations from large scale real world data. We assume that the check in motivation has two types: social motivation and individual motivation. Social motivation is the type of check in incentive that stimulates interactions or influences among friends. Individual motivation is another type of check in incentive that aims to explore and share attractive places. Following the structure of the MGB model, we construct a user check in motivation prediction model (UCMP) and then formalize the motivation prediction problem as an optimization problem. The idea is to minimize the difference between the estimated behavior and the true behavior to get the predicted motivations. The experiment on the GOWALLA dataset shows not only prediction results, but also very interesting phenomena about social users and social locations.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130611468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual Exploration of a Series of Academic Conferences","authors":"Kazuo Misue","doi":"10.1109/ICDMW.2014.73","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.73","url":null,"abstract":"The trends in a research field, especially changes in the features over the years, are subjects of interest for many researchers. This paper reports an exploratory analysis of the changes of research topics in an academic field. The target data of the analysis are the author-keywords included in papers presented at a series of academic conferences, IEEE International Conference on Data Mining (ICDM). The analysis process consists of three phases: (1) frequency of keywords, (2) appearance of keywords in papers, and (3) relationships among keywords. In phase 1, bar charts were used to observe the ranking of frequencies. In phases 2 and 3, anchored maps were adopted. The anchored maps are based on the spring-embedder model, but they provide viewpoints by using fixed \"anchors.\" The analysis process revealed the major topics in the field of data mining and some changes in the relationships among topics.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123883515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Approaches to Chinese Word Segmentation in Hadoop","authors":"Zhangang Wang, Bangjie Meng","doi":"10.1109/ICDMW.2014.43","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.43","url":null,"abstract":"Today, we're surrounded by data, especially Chinese information. The exponential growth of data first presented challenges to cutting-edge businesses such as Alibaba, Jingdong, Amazon, and Microsoft. They need to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Chinese word segmentation is a core problem in Chinese information processing: because the morphemes of Chinese behave differently from those of English, Chinese text must be segmented into words before it can be processed further. Chinese lexical analysis is the foundation and key of Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) are very popular Chinese word segmentation algorithms. At present, these two algorithms play an important role in processing Chinese text data. If the two algorithms are well applied to a Hadoop distributed environment, they will have better performance. In this paper we compare the performance of the IK and IC algorithms by theory and experiments. This paper reports the experimental work on the mass Chinese text segmentation problem and its optimal solution using a Hadoop cluster, the Hadoop Distributed File System (HDFS) for storage, and parallel processing of large data sets with the Map Reduce programming framework. We have done a prototype implementation of a Hadoop cluster, HDFS storage and the Map Reduce framework for processing large text data sets by considering a prototype of big data application scenarios. The results obtained from various experiments indicate favorable results of the above IC and IK algorithms in addressing the mass Chinese text segmentation problem. (Addressing Big Data Problem Using Hadoop and Map Reduce). Furthermore, we evaluate both kinds of segmentation in terms of performance. Although the process to load data into and tune the execution of the parallel distributed system took much longer than the centralized system, the observed performance of these word segmentation algorithms was strikingly better.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123718729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Learning Common Definitional Patterns from Multi-domain Wikipedia Pages","authors":"Jingsong Zhang, Yinglin Wang, Dingyu Yang","doi":"10.1109/ICDMW.2014.107","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.107","url":null,"abstract":"Automatic definition extraction has attracted wide interest in the NLP domain and knowledge-based applications. One primary task of definition extraction is mining patterns from definitional sentences. Existing methods for extracting definitional patterns either focus on manual extraction by intuition or observation, or aim to mine intricate definitional patterns by automatic extraction methods. The manual method requires large human resources to identify the definitional patterns because of diverse lexico-syntactic structures. It inevitably performs poorly, especially when extracting from cross-domain corpora. The latter method mainly considers the precision in definition extraction, which is at the cost of decreasing the recall of definitions. Both of them are unsuitable for cross-domain definition extraction. To address those issues, this paper proposes a solution to perform the automatic extraction of definitional patterns from multi-domain definitional sentences of Wikipedia. Our method FIND-SS is a modification of the FIND-S algorithm and solves the definition extraction problems of cross-domain corpora. FIND-SS adopts a \"the more similar the higher priority\" scheme to improve the learning performance. It can accommodate some noisy information and does not require any pattern seeds for pattern learning. The experimental results indicate that our approach is significantly superior to previous methods.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"71 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114111624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid Approach for Emotion Detection in Support of Affective Interaction","authors":"S. Gievska, Kiril Koroveshovski, Tatjana Chavdarova","doi":"10.1109/ICDMW.2014.130","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.130","url":null,"abstract":"Affective interaction is an emerging area of interest for interaction designers. This research explores the potential of our hybrid approach that relies on both lexical and machine learning techniques for detection of Ekman's six emotional categories in user's text. The initial results of the performance evaluation of the proposed hybrid approach are encouraging and comparable to related research. A demonstrative mobile application that employs the proposed approach was developed to engage the users in a dialogue that solicits their reflections on various daily events and provides appropriate affective responses.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"43 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116311323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain-Independent Unsupervised Text Segmentation for Data Management","authors":"Makoto Sakahara, S. Okada, K. Nitta","doi":"10.1109/ICDMW.2014.118","DOIUrl":"https://doi.org/10.1109/ICDMW.2014.118","url":null,"abstract":"In this study, we propose a domain-independent unsupervised text segmentation method, which is applicable even to a single unseen document. The proposed method segments text documents by evaluating similarity between sentences. It is generally difficult to calculate semantic similarity between words that comprise sentences when the domain knowledge is insufficient. This problem influences segmentation accuracy. To address this problem, we use word2vec to calculate semantic similarity between words. Using word2vec, we embed semantic relationships between words in a vector space by training with large domain-independent corpora. Furthermore, we combine semantic and collocation similarities, i.e., the features between words within a document. The proposed method applies this combined similarity to affinity propagation clustering. Similarity between sentences is defined based on the earth mover's distance between the frequencies of the obtained topical clusters. After calculating similarity between sentences, segmentation boundaries are automatically optimized using dynamic programming. The experimental results obtained using two datasets show that the proposed method clearly outperforms state-of-the-art domain-independent approaches and obtains equal performance with state-of-the-art domain-dependent approaches such as those that use topic modeling.","PeriodicalId":289269,"journal":{"name":"2014 IEEE International Conference on Data Mining Workshop","volume":"35 22","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114028325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}