2014 IEEE International Conference on Data Mining Workshop: Latest Papers, Page 2

Two approaches of using heavy tails in high dimensional EDA

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.184

Momodou L. Sanyang, Hanno Muehlbrandt, A. Kabán

Abstract: We consider the problem of high-dimensional black-box optimisation via Estimation of Distribution Algorithms (EDA). The Gaussian distribution is commonly used as the search operator in most EDA methods. However, there are indications in the literature that heavy-tailed distributions may perform better because of their higher exploration capabilities. Univariate heavy-tailed distributions have already been proposed for high-dimensional problems. In 2D problems, it has been reported that a multivariate heavy-tailed (such as Cauchy) search distribution is able to blend the strengths of multivariate modelling with high exploration power. In this paper, we study whether a similar scheme would work well in high-dimensional search problems. To get around the difficulty of multivariate model building in high dimensions, we employ a recently proposed random projections (RP) ensemble-based approach, which we modify to draw samples from a multivariate Cauchy using the scale-mixture representation of the Cauchy distribution. Our experiments show that the resulting RP-based multivariate Cauchy EDA consistently improves on the performance of the univariate Cauchy search distribution. However, intriguingly, the RP-based multivariate Gaussian EDA has the best performance among these methods. It appears that the highly explorative nature of multivariate Cauchy sampling is exacerbated in high-dimensional search spaces, and the population-based search loses its focus and effectiveness as a result. Finally, we present an idea to increase exploration while maintaining exploitation and focus: using the RP-based multivariate Gaussian EDA in which the RP matrices are drawn with i.i.d. heavy-tailed entries. This achieves improved performance and is competitive with the state of the art.
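The scale-mixture representation of the Cauchy distribution mentioned in the abstract can be illustrated with a minimal sketch. It relies on the standard fact that a multivariate Cauchy is a multivariate Student-t with one degree of freedom, so a sample is a Gaussian draw divided by the square root of a chi-square(1) draw. The function name `sample_multivariate_cauchy` and its parameters are hypothetical, chosen for illustration only; this is not the authors' RP-ensemble EDA, just the sampling step under those assumptions.

```python
import math
import random


def sample_multivariate_cauchy(mean, chol, rng):
    """Draw one sample from a multivariate Cauchy with location `mean`.

    `chol` is a lower-triangular Cholesky factor of the scale matrix
    (an assumption of this sketch; any valid factor would do).
    Scale-mixture view: x = mean + (L z) / sqrt(u), where z ~ N(0, I)
    and u ~ chi-square with 1 degree of freedom.
    """
    dim = len(mean)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # chi-square(1) is the square of a standard normal draw;
    # the floor guards against division by zero.
    u = max(rng.gauss(0.0, 1.0) ** 2, 1e-300)
    scale = 1.0 / math.sqrt(u)
    # Multiply the lower-triangular factor by z to correlate the coordinates.
    correlated = [sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(dim)]
    return [m + scale * c for m, c in zip(mean, correlated)]


if __name__ == "__main__":
    rng = random.Random(0)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(sample_multivariate_cauchy([0.0, 0.0], identity, rng))
```

Because the single divisor `u` is shared by all coordinates, a small `u` produces a large jump in every dimension at once, which is one way to read the abstract's remark that exploration is exacerbated in high dimensions.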

Citations: 2

Modeling of Writing and Thinking Process in Handwriting by Digital Pen Analysis

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.85

Kenshin Ikegami, Y. Ohsawa

Abstract: To acquire infrequent events as new ideas and evaluate those ideas quantitatively, it is necessary to know how people create and refine ideas and to model the creating and refining process. In this paper, we focus on the relation between thinking time and writing time in handwriting, and propose to model this relation by externalization, classification, relation, transportation and systematization, which are the elements of making sentences. The relation depended on the questions and the formats of the sheets. When a sheet poses a question to be answered in sentences, writing time becomes longer as thinking time becomes longer. On the other hand, when a sheet poses a question that can be answered with words alone, writing time becomes shorter as thinking time becomes longer. We hypothesized that participants spent more time classifying, relating and transporting words when answering with words alone than when answering in sentences. We also confirmed that when the same questions were given twice, writing time was longer and thinking time was shorter the second time than the first. This was because sufficient externalization had been performed the first time, so participants spent less time externalizing the second time.

Citations: 1

A Scalable Algorithm for Discovering Topologies in Social Networks

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.75

Jyoti Rani Yadav, D. Somayajulu, P. Krishna

Citations: 3

Expander Graph Quality Optimisation in Randomised Communication

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.150

P. Poonpakdee, G. D. Fatta

Abstract: Epidemic protocols provide a randomised communication and computation paradigm for large and extreme-scale networked systems and can be adopted to build decentralised and fault-tolerant services. They have recently been proposed for the formulation of knowledge discovery algorithms in extreme-scale environments. In distributed systems they rely on membership protocols to provide a peer sampling service. Epidemic membership protocols induce a network overlay topology that continuously evolves over time, quickly converging to random graphs. This work investigates the expansion property of the series of network overlay topologies induced by epidemic membership protocols. A search heuristic is adopted for the design of a novel epidemic membership protocol. The proposed Expander Membership Protocol explicitly aims at improving the expansion quality of the overlay topologies and incorporates a connectivity recovery mechanism to overcome the known issue of multiple connected components. In the comparative analysis, the proposed protocol shows faster convergence to random graphs and greater topology connectivity robustness than the state-of-the-art protocols, resulting in overall better performance of global aggregation tasks.
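To make the notion of an epidemic global aggregation task concrete, the following is a minimal sketch of push-pull averaging, a commonly used epidemic aggregation scheme. It assumes an idealised peer sampling service that returns a uniformly random peer; it is not the Expander Membership Protocol from the paper, and the function name `gossip_average` is hypothetical.

```python
import random


def gossip_average(values, rounds, rng):
    """Run push-pull averaging: each node in turn averages with a random peer."""
    state = list(values)
    n = len(state)
    for _ in range(rounds):
        for i in range(n):
            # Idealised peer sampling (assumption): any other node, uniformly.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both peers adopt the pairwise mean, so the global sum is preserved.
            mean = (state[i] + state[j]) / 2.0
            state[i] = state[j] = mean
    return state


if __name__ == "__main__":
    rng = random.Random(1)
    start = [float(v) for v in range(100)]
    final = gossip_average(start, rounds=30, rng=rng)
    print(min(final), max(final))
```

Every exchange keeps the sum of the two values unchanged, so all nodes drift towards the global mean. How quickly they do so depends on how well the peer sampling mixes, which is where the quality of the overlay topology matters.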

Citations: 4

Why Checkins: Exploring User Motivation on Location Based Social Networks

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.175

Fengjiao Wang, Guan Wang, Philip S. Yu

Abstract: Checkins, the niche service provided by location based social networks (LBSN), bridge users' online activities and offline social lives in a seamless way. Therefore, knowledge discovery on check-in data has become an important research direction [1], [2], [3], [4]. However, a fundamental and interesting question about checkins remains unanswered: what are people's motivations behind those checkins? We make the first attempt to answer this question. Motivation studies first appeared in social psychology in a less quantitative form. For example, the model of goal-directed behavior (MGB) [5] uncovers the association between behaviors and motivations. Following a similar rationale, we design a computational model for mining user check-in motivations from large-scale real-world data. We assume that check-in motivation has two types: social motivation and individual motivation. Social motivation is the type of check-in incentive that stimulates interactions or influences among friends. Individual motivation is the type of check-in incentive that aims to explore and share attractive places. Following the structure of the MGB model, we construct a user check-in motivation prediction model (UCMP) and then formalize the motivation prediction problem as an optimization problem. The idea is to minimize the difference between the estimated behavior and the true behavior to obtain the predicted motivations. The experiment on the Gowalla dataset shows not only prediction results but also very interesting phenomena about social users and social locations.

Citations: 4

Visual Exploration of a Series of Academic Conferences

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.73

Kazuo Misue

Citations: 3

A Comparison of Approaches to Chinese Word Segmentation in Hadoop

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.43

Zhangang Wang, Bangjie Meng

Abstract: Today, we are surrounded by data, especially Chinese-language information. The exponential growth of data first presented challenges to cutting-edge businesses such as Alibaba, Jingdong, Amazon, and Microsoft. They need to go through terabytes and petabytes of data to figure out which websites are popular, which books are in demand, and what kinds of ads appeal to people. Chinese word segmentation is a core problem in Chinese information processing: because Chinese morphemes behave differently from those in English, Chinese text must first be segmented into words. Chinese lexical analysis is the foundation of, and the key to, Chinese information processing. IKAnalyzer (IK) and ICTCLAS (IC) are two very popular Chinese word segmentation algorithms, and both currently play an important role in processing text data. If the two algorithms are applied well in a Hadoop distributed environment, they should perform better. In this paper, we compare the performance of the IK and IC algorithms through theory and experiments. This paper reports experimental work on the mass Chinese text segmentation problem and its optimal solution using a Hadoop cluster, with the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming framework for processing large data sets in parallel. We have built a prototype implementation of a Hadoop cluster, HDFS storage and the MapReduce framework for processing large text data sets, modelled on prototypical big data application scenarios. The results of various experiments are favorable for applying the IC and IK algorithms to the mass Chinese text segmentation problem. (Addressing Big Data Problem Using Hadoop and Map Reduce). Furthermore, we evaluate both kinds of segmentation in terms of performance. Although loading data into, and tuning the execution of, the parallel distributed system took much longer than for the centralized system, the observed performance of these word segmentation algorithms was strikingly better.
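The idea of running word segmentation inside a map phase and aggregating in a reduce phase can be sketched in a few lines. This is an in-process illustration only: it does not use the Hadoop API, and the segmenter is a simple forward maximum matching routine with a toy dictionary, swapped in for illustration rather than IKAnalyzer or ICTCLAS. Counting words in the reducer is also an assumption of this sketch. All function names are hypothetical.

```python
from collections import defaultdict


def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` greedily, preferring the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def map_phase(line, dictionary):
    """Mapper: emit (word, 1) for every segmented word in one input line."""
    return [(word, 1) for word in forward_max_match(line, dictionary)]


def reduce_phase(pairs):
    """Reducer: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


if __name__ == "__main__":
    toy_dictionary = {"数据", "挖掘", "数据挖掘", "研究"}  # toy entries, not a real lexicon
    lines = ["数据挖掘研究", "研究数据"]
    emitted = [pair for line in lines for pair in map_phase(line, toy_dictionary)]
    print(reduce_phase(emitted))
```

Each input line is segmented independently, so the map phase parallelises naturally across a cluster; only the per-word aggregation needs the shuffle and reduce step.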

Citations: 7

Automatic Learning Common Definitional Patterns from Multi-domain Wikipedia Pages

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.107

Jingsong Zhang, Yinglin Wang, Dingyu Yang

Abstract: Automatic definition extraction has attracted wide interest in the NLP domain and in knowledge-based applications. One primary task of definition extraction is mining patterns from definitional sentences. Existing methods for extracting definitional patterns either rely on manual extraction by intuition or observation, or aim to mine intricate definitional patterns automatically. The manual method requires substantial human resources to identify definitional patterns because of their diverse lexico-syntactic structures, and it inevitably performs poorly, especially when extracting from cross-domain corpora. The automatic method mainly considers precision in definition extraction, at the cost of lower recall. Both are unsuitable for cross-domain definition extraction. To address these issues, this paper proposes a solution for automatically extracting definitional patterns from multi-domain definitional sentences in Wikipedia. Our method, FIND-SS, is a modification of the FIND-S algorithm and solves the definition extraction problems of cross-domain corpora. FIND-SS adopts a "the more similar, the higher priority" scheme to improve learning performance. It can accommodate some noisy information and does not require any pattern seeds for pattern learning. The experimental results indicate that our approach is significantly superior to the previous method.
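For readers unfamiliar with the FIND-S algorithm that FIND-SS builds on, the sketch below shows the classic textbook version: start from the first positive example and generalise only the attributes that later positive examples contradict. It is not FIND-SS itself (the similarity-based priority scheme is not reproduced), and representing a definitional sentence as a fixed-length tuple of tokens is an assumption made here for illustration.

```python
def find_s(positive_examples):
    """Return the most specific hypothesis covering all positive examples.

    Each example is a tuple of attribute values; '?' matches any value.
    Returns None when no examples are given.
    """
    hypothesis = None
    for example in positive_examples:
        if hypothesis is None:
            # The first positive example is the most specific starting point.
            hypothesis = list(example)
            continue
        for k, value in enumerate(example):
            # Generalise only the positions that disagree.
            if hypothesis[k] != value:
                hypothesis[k] = "?"
    return hypothesis


if __name__ == "__main__":
    sentences = [
        ("TERM", "is", "a", "CLASS"),
        ("TERM", "is", "the", "CLASS"),
    ]
    print(find_s(sentences))
```

In this toy run the two sentences differ only in the article, so the learned pattern keeps "TERM is" and "CLASS" fixed and generalises the third slot.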

Citations: 6

A Hybrid Approach for Emotion Detection in Support of Affective Interaction

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.130

S. Gievska, Kiril Koroveshovski, Tatjana Chavdarova

Citations: 19

Domain-Independent Unsupervised Text Segmentation for Data Management

2014 IEEE International Conference on Data Mining Workshop Pub Date : 2014-12-01 DOI: 10.1109/ICDMW.2014.118

Makoto Sakahara, S. Okada, K. Nitta

Abstract: In this study, we propose a domain-independent unsupervised text segmentation method that is applicable even to a single unseen document. The proposed method segments text documents by evaluating the similarity between sentences. It is generally difficult to calculate the semantic similarity between the words that make up sentences when domain knowledge is insufficient, and this problem affects segmentation accuracy. To address this problem, we use word2vec to calculate the semantic similarity between words. Using word2vec, we embed semantic relationships between words in a vector space by training on large domain-independent corpora. Furthermore, we combine semantic and collocation similarities, i.e., the features between words within a document. The proposed method applies this combined similarity to affinity propagation clustering. Similarity between sentences is defined based on the earth mover's distance between the frequencies of the obtained topical clusters. After the similarity between sentences is calculated, segmentation boundaries are automatically optimized using dynamic programming. The experimental results obtained using two datasets show that the proposed method clearly outperforms state-of-the-art domain-independent approaches and performs on par with state-of-the-art domain-dependent approaches such as those that use topic modeling.
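The final step in the abstract, choosing segmentation boundaries by dynamic programming over sentence similarities, can be sketched as follows. The abstract does not state the objective, so this sketch assumes one: split the sentences into a fixed number `k` of contiguous segments so that the total within-segment dissimilarity is minimal. Both the objective and the fixed `k` are assumptions, and `segment_boundaries` is a hypothetical name; the similarity matrix would come from the earlier word2vec, clustering and earth mover's distance steps, which are not reproduced here.

```python
def segment_boundaries(sim, k):
    """Split sentences 0..n-1 into k contiguous segments (requires k <= n).

    `sim[i][j]` is a sentence similarity in [0, 1]. Returns the sorted list
    of start indices of the k segments (the first is always 0).
    """
    n = len(sim)

    def cost(a, b):
        # Total dissimilarity between all sentence pairs inside [a, b).
        return sum(1.0 - sim[i][j] for i in range(a, b) for j in range(i + 1, b))

    inf = float("inf")
    # best[m][b]: minimal cost of covering sentences [0, b) with m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for b in range(m, n + 1):
            for a in range(m - 1, b):
                score = best[m - 1][a] + cost(a, b)
                if score < best[m][b]:
                    best[m][b] = score
                    back[m][b] = a
    # Walk the back-pointers to recover where each segment starts.
    starts, b = [], n
    for m in range(k, 0, -1):
        b = back[m][b]
        starts.append(b)
    return sorted(starts)


if __name__ == "__main__":
    # Toy similarity: sentences 0-1 share one topic, sentences 2-3 another.
    toy = [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    print(segment_boundaries(toy, k=2))
```

The segment cost is recomputed on every call for readability; precomputing it with prefix sums would be the natural optimisation for long documents.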

Citations: 14