{"title":"KADetector: Automatic Identification of Key Actors in Online Hack Forums Based on Structured Heterogeneous Information Network","authors":"Yiming Zhang, Yujie Fan, Yanfang Ye, Liang Zhao, Jiabin Wang, Qi Xiong, Fudong Shao","doi":"10.1109/ICBK.2018.00028","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00028","url":null,"abstract":"Underground forums have been widely used by cybercriminals to exchange knowledge and trade in illicit products or services, and they play a central role in the cybercriminal ecosystem. To facilitate the deployment of effective countermeasures, in this paper we propose and develop an intelligent system named KADetector to automate the analysis of Hack Forums and identify the key actors who play a vital role in its value chain. In KADetector, to identify whether given users are key actors, we not only analyze their posted threads, but also utilize various kinds of relations among users, threads, replies, comments, sections and topics. To model the rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use a meta-path based approach to incorporate higher-level semantics to build up relatedness over users in Hack Forums. To reduce the high computation and space cost, given different meta-paths built from the HIN, we propose a new HIN embedding model named ActorHin2Vec to learn low-dimensional representations for the nodes in the HIN. After that, a classifier is built for key actor identification. To the best of our knowledge, this is the first work to use a structured HIN for underground participant analysis. 
Comprehensive experiments on the data collections from Hack Forums are conducted to validate the effectiveness of our developed system KADetector in key actor identification by comparisons with alternative methods.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127137738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Short-Text Lexical Normalisation on Industrial Log Data","authors":"Michael Stewart, Wei Liu, R. Cardell-Oliver, Rui Wang","doi":"10.1109/ICBK.2018.00023","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00023","url":null,"abstract":"Lexical normalisation aims to computationally correct errors in text so that the data may be more successfully analysed. Noisy, unstructured short-text data presents unique challenges as it contains multiple types of Out Of Vocabulary (OOV) words. Some are spelling mistakes, which should be normalised to in-dictionary words; some are acronyms or abbreviations, which should be expanded to the corresponding phrases; and some are domain specific terms which should remain in their original form not to be mis-corrected to conform with the dictionary used. Despite its critical significance in assuring data quality, text normalisation is an area with a less cohesive and focused research effort, evidenced by the diverse set of keywords used and scattered publication venues. Integrated approaches that address all three types of OOV terms are scarce. Here we introduce a two-stage, modular classification-based framework that specifically targets the various types of Out Of Vocabulary terms prevalent in short-text data. To avoid laborious feature engineering, our system utilises a Bi-Directional Long Short-Term Memory + CRF model to classify each erroneous token into a particular class. The system then selects an appropriate normalisation technique based on the predicted class of each token. For spell-checking, we introduce two learning models that predict the correct spelling of a word given its context: one that utilises word embeddings, and another that uses a quasi-recurrent neural network. 
We compare our system to two existing state of the art lexical normalisation systems and find that our system achieves greater performance on the log data domain.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125174699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"[Publisher's information]","authors":"","doi":"10.1109/icbk.2018.00069","DOIUrl":"https://doi.org/10.1109/icbk.2018.00069","url":null,"abstract":"","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129517473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Fusion of Multiple Spatio-Temporal Data Sources for Improved Localisation in Cellular Network","authors":"S. Luo, Y. Li","doi":"10.1109/ICBK.2018.00067","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00067","url":null,"abstract":"An accurate and reliable estimation of subscribers' locations in a cellular network is becoming increasingly important not only for telco-related services but also for commercial domains. The data collected in a cellular network for locating subscribers can come from multiple sources with different characteristics such as accuracy, noise variance, and spatial and temporal resolutions. Given the various localisation techniques, it would be advantageous to utilize multiple data sources to obtain an accurate location rather than relying on a single type of measurement. Data fusion, which integrates multiple types of measurement, is a promising solution to provide location estimation with better accuracy, reliability and coverage. In this work, we propose a data fusion framework using multiple spatio-temporal data sources. Existing solutions in the literature generally rely on generative models based on attributes like Received Signal Strength (RSS), Angle of Arrival (AOA), and/or Round Trip Delay Time (RTT) that may not be available in practice due to various reasons. We address the problem from a purely data-driven perspective. The challenges of practical implementation such as oscillation removal and noise estimation are discussed in depth. 
Moreover, the proposed framework is deployed into production and fully evaluated with data sources from a telco in Singapore.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115751633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Imbalanced Networked Multi-label Classification with Active Learning","authors":"Ruilong Zhang, Lei Li, Yuhong Zhang, Chenyang Bu","doi":"10.1109/ICBK.2018.00046","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00046","url":null,"abstract":"With the rapid development of social networks, networked multi-label classification algorithms have gained wide attention. Existing networked multi-label classification algorithms mostly consider only the homogeneity or heterogeneity of the network without taking its imbalance into account, even though imbalance is common in real network environments and deserves more attention. Moreover, the selection strategy for the training set is critical for a multi-label classification algorithm, because it directly affects both the parameter updating inside the classifier and the precision of the classifier. Applying active learning to the selection of the training set can effectively improve the precision of the classifier. Similarly, applying imbalanced data processing strategies to the selection of the training set makes classifiers more suitable for imbalanced networks. Therefore, we propose BSHD (Block Sampling with selecting the Highest Degree nodes), an active learning based imbalanced networked multi-label classification algorithm. In this algorithm, we divide the network into blocks according to edge density and use oversampling and undersampling to process each block. Then we select the nodes with the highest degree from each block to form the training set. 
Experimental results show that our proposed BSHD outperforms other state-of-the-art approaches.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117198293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DBkWik: A Consolidated Knowledge Graph from Thousands of Wikis","authors":"S. Hertling, Heiko Paulheim","doi":"10.1109/ICBK.2018.00011","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00011","url":null,"abstract":"Popular knowledge graphs such as DBpedia and YAGO are built from Wikipedia, and therefore similar in coverage. In contrast, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"17 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120832235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-target Core Network-Based Networked Multi-label Classification","authors":"Lei Li, Fang Zhang, Di Ma, Chuan Zhou, Xuegang Hu","doi":"10.1109/ICBK.2018.00044","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00044","url":null,"abstract":"With the increasing popularity of label classification, networked multi-label classification is becoming a hot topic in the field of data mining, where networked multi-label means that each entity has more than one label during classification in network environments. In existing work on networked multi-label classification, although only the labels of certain nodes need to be determined, the labels of all nodes in the network have to be inferred. This works well for small networks, but not for large networks, especially large-scale networks with big data, because much of the time is spent computing labels that are not required. In this paper, we introduce a core network composed of the shortest paths that link some sources (i.e., nodes with known labels) and some targets (i.e., nodes with unknown labels that need to be determined), as these paths have the most significant direct influence on label classification. Then we propose a novel heuristic MultI-TargeT corE Network discovery algorithm, MITTEN, to discover a core network, which aims to achieve relatively high accuracy of predicted labels in a relatively short time. 
Compared with existing networked multi-label classification approaches, the experimental results executed on real networks show that our proposed MITTEN can predict labels in network environments more precisely and more efficiently.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124731264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word Embedding Representation with Synthetic Position and Context Information for Relation Extraction","authors":"Yunzhou Shi, Yujiu Yang, Yi Liu","doi":"10.1109/ICBK.2018.00022","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00022","url":null,"abstract":"In recent years, various knowledge bases have been built and widely used in different natural language processing tasks, and relation extraction is an effective way to enrich knowledge bases. However, most existing relation extraction methods obtain word embeddings from pre-trained Word2vec or GloVe, which do not consider how a word differs across sentences. This cannot be ignored: the same word in different contexts or in different positions in a sentence has different meanings. We therefore propose an approach to obtain a word embedding representation with synthetic context and position information, which we call semantic word embedding. After obtaining the semantic word embedding, we can get a sentence-level representation by simple average-pooling rather than a complex convolutional neural network architecture. Furthermore, we apply the semantic word embedding representation to the relation extraction task of Natural Language Processing. 
The experimental results show that the performance of the proposed method on the popular benchmark dataset is better than the state-of-the-art CNN-based approach.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128591419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"State Transition Pattern with Periodic Wildcard Gaps","authors":"Zhi-Heng Zhang, Wen-jie Zhai, Rong-Ping Shen, Sheng-Chao Zeng, Fan Min","doi":"10.1109/ICBK.2018.00065","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00065","url":null,"abstract":"Sequence pattern discovery is a key issue in multivariate time series analysis. Popular methods consist of three stages: feature extraction, feature clustering, and block sequence discovery. Both cross and temporal associations are obtained during the third stage. In this paper, we propose a new type of pattern called a state transition pattern with periodic wildcard gaps (STAP) to enrich cross associations. We design an approach that consists of three stages, namely feature extraction, frequent state discovery, and pattern synthesis, to obtain frequent STAPs. Compared with previous approaches, STAP demonstrates stronger cross associations by considering different variables simultaneously. We propose two pre-pruning techniques and an Apriori-pruning technique to speed up pattern discovery. We also propose a type of graph to visualize STAPs. Experimental results on three real-world datasets and one artificial dataset demonstrate that STAPs capture richer cross and temporal associations.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129030085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generic Embedded Semantic Dictionary for Robust Multi-Label Classification","authors":"Zhengming Ding, Ming Shao, Sheng Li, Y. Fu","doi":"10.1109/ICBK.2018.00045","DOIUrl":"https://doi.org/10.1109/ICBK.2018.00045","url":null,"abstract":"Multi-label classification has attracted great attention in various applications and generated significant interest in the data mining and learning fields. Because multi-label data are often incomplete, numerous approaches have been developed to address partially missing labels, and traditional multi-label algorithms mainly adopt low-rank embedding and a graph regularizer to recover the missing labels. However, how to simultaneously handle missing labels and learn a discriminant multi-label embedding within the low-rank regime is still unclear. In this work, we propose a Generic Embedded Semantic Dictionary (GESD) learning framework for robust multi-label classification, where we consider both partially and totally missing labels for the visual data. Specifically, we explore a low-rank coding strategy to encode visual features with the recovered label matrix by constructing an effective semantic dictionary. In this way, the low-rankness will be appropriately propagated to recover multi-labels and improve label correlation, given missing labels in the training stage. 
Extensive experiments on six real-world benchmarks verify that our method can correctly capture label correlation and achieve better label recovery & prediction results than the state-of-the-art algorithms.","PeriodicalId":144958,"journal":{"name":"2018 IEEE International Conference on Big Knowledge (ICBK)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132058662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}