Discriminative Additive Scale Loss for Deep Imbalanced Classification and Embedding
Zhao Zhang, Weiming Jiang, Yang Wang, Qiaolin Ye, Mingbo Zhao, Mingliang Xu, Meng Wang
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00105

Abstract: Real-world data in emerging applications may follow highly skewed, class-imbalanced distributions, and how to handle such data appropriately with deep learning needs further investigation. In this paper, we propose a novel cross-entropy-based loss function, referred to as Additive Scale Loss (ASL), for deep representation learning and imbalanced classification. To deal with class imbalance, ASL increases the loss incurred by misclassification, which prevents the accumulated loss of the many easily classified examples in an imbalanced dataset from dominating the loss of the misclassified ones. Moreover, in real-world applications one data source may serve multiple scenarios, such as classification and embedding learning, yet training two separate models for these problems is costly, especially in deep learning. To tackle this issue, we integrate a discriminative inter-class separation term into ASL and propose a discriminative ASL (D-ASL), which not only improves classification performance but also yields discriminative representations. The inter-class separation term is general and can easily be integrated into other loss functions, such as cross-entropy (CE) and focal loss (FL). Finally, we propose a new deep convolutional neural network equipped with D-ASL and a fully-connected (FC) layer, which classifies imbalanced image data and obtains discriminative representations at the same time. Extensive experimental results verify the superior performance of our method.
Differentially Private String Sanitization for Frequency-Based Mining Tasks
Huiping Chen, Changyu Dong, Liyue Fan, G. Loukides, S. Pissis, L. Stougie
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00014

Abstract: Strings are used to model genomic, natural language, and web activity data, and are thus often shared broadly. However, string data sharing has raised privacy concerns, stemming from the fact that knowledge of the length-k substrings of a string and their frequencies (multiplicities) may suffice to uniquely reconstruct the string, and that the inference of such substrings may leak confidential information. We thus introduce the problem of protecting the length-k substrings of a single string S by applying Differential Privacy (DP) while maximizing data utility for frequency-based mining tasks. Our theoretical and empirical evidence suggests that classic DP mechanisms are not suitable for this problem. In response, we employ the order-k de Bruijn graph G of S and propose a sampling-based mechanism for enforcing DP on G. We consider the task of enforcing DP on G using our mechanism while preserving the normalized edge multiplicities in G. We define an optimization problem on integer edge weights that is central to this task and develop a dynamic-programming algorithm to solve it exactly. We also consider two variants of this problem with real edge weights. By relaxing the constraint of integer edge weights, we are able to develop linear-time exact algorithms for these variants, which we use as stepping stones towards effective heuristics. An extensive experimental evaluation using real-world large-scale strings (on the order of billions of letters) shows that our heuristics are efficient and produce near-optimal solutions that preserve data utility for frequency-based mining tasks.
BioHanBERT: A Hanzi-aware Pre-trained Language Model for Chinese Biomedical Text Mining
Xiaosu Wang, Yun Xiong, Hao Niu, Jingwen Yue, Yangyong Zhu, Philip S. Yu
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00181

Abstract: Unsupervised pre-trained language models (PLMs) have boosted the development of effective biomedical text mining models. However, biomedical texts contain a huge number of long-tail concepts and terminologies, which makes further pre-training on biomedical corpora relatively expensive (more biomedical corpora and more pre-training steps are needed). Nonetheless, this problem has received little attention in recent studies. In Chinese biomedical text, concepts and terminologies consist of Chinese characters, and Chinese characters are often composed of sub-character components that are themselves semantically informative; thus, using a Chinese character's component-level internal semantic information to enhance the semantics of biomedical concepts and terminologies appears reasonable. In this paper, we propose a novel hanzi-aware pre-trained language model for Chinese biomedical text mining, referred to as BioHanBERT (hanzi-aware BERT), which utilizes the component-level internal semantic information of Chinese characters to enhance the semantics of Chinese biomedical concepts and terminologies, and thereby to reduce further pre-training costs. BioHanBERT first employs a Chinese character encoder to extract the component-level internal semantic feature of each Chinese character, and then fuses the character's internal semantic feature with its contextual embedding extracted by BERT to enrich the representations of the concepts or terminologies containing the character. The results of extensive experiments show that our model consistently outperforms current state-of-the-art (SOTA) models on a wide range of Chinese biomedical natural language processing (NLP) tasks.
Ultra fast warping window optimization for Dynamic Time Warping
Chang Wei Tan, Matthieu Herrmann, Geoffrey I. Webb
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00070

Abstract: The Dynamic Time Warping (DTW) similarity measure is widely used in many time series data mining applications. It computes the cost of aligning two series, smaller costs indicating more similar series. Most applications require tuning of DTW's Warping Window (WW) parameter in order to achieve good performance. This parameter controls the amount of warping allowed, reducing pathological alignments, with the added benefit of speeding up computation. However, since DTW is itself very costly, learning the WW is a burdensome process, requiring days even for datasets containing only a few thousand series. In this paper, we propose ULTRAFASTWWSEARCH, a new algorithm able to learn the WW significantly faster than the state-of-the-art FASTWWSEARCH method. ULTRAFASTWWSEARCH builds upon the latter, exploiting the properties of a new efficient exact DTW algorithm which supports early abandoning and pruning (EAP). We show on 128 datasets from the UCR archive that ULTRAFASTWWSEARCH is up to an order of magnitude faster than the previous state of the art.
Operation-level Progressive Differentiable Architecture Search
Xunyu Zhu, Jian Li, Yong Liu, Junshuo Liao, Weiping Wang
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00205

Abstract: Differentiable Architecture Search (DARTS) is becoming increasingly popular among Neural Architecture Search (NAS) methods because of its high search efficiency and low compute cost. However, DARTS is notoriously unstable: in particular, the aggregation of skip connections leads to performance collapse. Although existing methods leverage Hessian eigenvalues to alleviate skip-connection aggregation, they prevent DARTS from exploring architectures with better performance. In this paper, we propose operation-level progressive differentiable architecture search (OPP-DARTS) to avoid skip-connection aggregation and explore better architectures simultaneously. We divide the search process into several stages and add candidate operations to the search space progressively at the beginning of each stage. This effectively alleviates the unfair competition between operations during the search phase of DARTS by offsetting the inherent unfair advantage of the skip connection over other operations. Besides, to keep the competition between operations relatively fair, we select from the candidate operations the one that makes the training loss of the supernet largest. Experimental results indicate that our method is effective and efficient. Its performance on CIFAR-10 is superior to the architecture found by standard DARTS, and its transferability also surpasses standard DARTS. We further demonstrate the robustness of our method on three simple search spaces, i.e., S2, S3, and S4, and the results show that it is more robust than standard DARTS. Our code is available at https://github.com/zxunyu/OPP-DARTS.
{"title":"Semi-Supervised Graph Attention Networks for Event Representation Learning","authors":"João Pedro Rodrigues Mattos, R. Marcacini","doi":"10.1109/ICDM51629.2021.00150","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00150","url":null,"abstract":"Event analysis from news and social networks is very useful for a wide range of social studies and real-world applications. Recently, event graphs have been explored to model event datasets and their complex relationships, where events are vertices connected to other vertices representing locations, people’s names, dates, and various other event metadata. Graph representation learning methods are promising for extracting latent features from event graphs to enable the use of different classification algorithms. However, existing methods fail to meet essential requirements for event graphs, such as (i) dealing with semi-supervised graph embedding to take advantage of some labeled events, (ii) automatically determining the importance of the relationships between event vertices and their metadata vertices, as well as (iii) dealing with the graph heterogeneity. This paper presents GNEE (GAT Neural Event Embeddings), a method that combines Graph Attention Networks and Graph Regularization. First, an event graph regularization is proposed to ensure that all graph vertices receive event features, thereby mitigating the graph heterogeneity drawback. Second, semi-supervised graph embedding with self-attention mechanism considers existing labeled events, as well as learns the importance of relationships in the event graph during the representation learning process. A statistical analysis of experimental results with five real-world event graphs and six graph embedding methods shows that our GNEE outperforms state-of-the-art semi-supervised graph embedding methods.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115803856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"USTEP: Unfixed Search Tree for Efficient Log Parsing","authors":"Arthur Vervaet, Raja Chiky, Mar Callau-Zori","doi":"10.1109/ICDM51629.2021.00077","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00077","url":null,"abstract":"Logs record valuable system information at runtime. They are widely used by data-driven approaches for development and monitoring purposes. Parsing log messages to structure their format is a classic preliminary step for log-mining tasks. As they appear upstream, parsing operations can become a processing time bottleneck for downstream applications. The quality of parsing also has a direct influence on their efficiency. Previous approaches toward online log parsing focused on stateful methods. But an increasing number of tasks ask for real time monitoring. Regarding this problem, we propose USTEP, an online log parsing method based on an evolving tree structure. Evaluation results on a panel of 13 datasets coming from different real-world systems demonstrate USTEP superiority in terms of both effectiveness and robustness when compared to other online methods. We also introduce USTEP-UP, a way of running multiple decentralized instances of USTEP in parallel.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"17 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131095419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections","authors":"Rob Churchill, Lisa Singh","doi":"10.1109/ICDM51629.2021.00017","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00017","url":null,"abstract":"Most topic models define a document as a mixture of topics and each topic as a mixture of words. Generally, the difference in generative topic models is how these mixtures of topics are generated. We propose looking at topic models in a new way, as topic-noise models. Our topic-noise model defines a document as a mixture of topics and noise. Topic Noise Discriminator (TND) estimates both the topic and noise distributions using not only the relationships between words in documents, but also the linguistic relationships found using word embeddings. This type of model is important for short, sparse social media posts that contain both random and non-random noise. We also understand that topic quality is subjective and that researchers may have preferences. Therefore, we propose a variant of our model that combines the pre-trained noise distribution from TND in an ensemble with any generative topic model to filter noise words and produce more coherent and diverse topic sets. We present this approach using Latent Dirichlet Allocation (LDA) and show that it is effective for maintaining high quality LDA topics while removing noise within them. Finally, we show the value of using a context-specific noise list generated from TND to remove noise statically, after topics have been generated by any topic model, including non-generative ones. We demonstrate the effectiveness of all three of these approaches that explicitly model context-specific noise in document collections.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123704870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Effective and Robust Framework by Modeling Correlations of Multiplex Network Embedding
Pengfei Jiao, Ruili Lu, Di Jin, Yinghui Wang, Huamin Wu
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00136

Abstract: The dependencies across different layers are an important property of multiplex networks, and a few methods have been proposed to learn these dependencies in various ways. Some assume connectivity consistency across layers, forcing two nodes linked in one layer to tend to be linked in the other layers; others introduce a common vector to model the information shared across all layers. However, the correlations among layers in multiplex networks are diverse and go beyond connectivity consistency. In this paper, we propose a novel Modeling Correlations for Multiplex network Embedding (MCME) framework to learn robust node representations for each layer. It handles complex correlations, covering common structure, layer similarity, and node heterogeneity, through a unified framework. To evaluate the proposed model, we conduct extensive experiments on several real-world datasets, and the results demonstrate that it consistently outperforms state-of-the-art methods.
Crowdsourcing with Self-paced Workers
Xiangping Kang, Guoxian Yu, C. Domeniconi, Jun Wang, Weicong Guo, Yazhou Ren, Lili Cui
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00038

Abstract: Crowdsourcing is a popular and relatively economical way to harness human intelligence for computer-hard tasks. Due to diverse factors (i.e., task difficulty, worker capability, and incentives), the answers collected from different crowd workers vary in quality. Many approaches have been proposed to obtain high-quality answers and reduce the budget by modelling tasks, workers, or both. However, most existing approaches implicitly assume that worker capability is fixed during the crowdsourcing process. In practice, this capability can improve as a worker completes tasks from easy to hard, akin to human beings' intrinsic self-paced learning ability. In this paper, we investigate crowdsourcing with self-paced workers, whose capability can be gradually boosted as they scrutinise and complete tasks from easy to hard. Our proposed SPCrowd (Self-Paced Crowd worker) first asks workers to complete a set of golden tasks with known annotations; it provides feedback to help workers grasp the basic patterns of the tasks and to spark self-paced learning, which in turn facilitates the estimation of worker quality and task difficulty. It then introduces a task difficulty model to quantify the difficulty of tasks and rank them from easy to hard, along with a benefit-maximization criterion for task assignment, which dynamically monitors the quality of self-paced workers and assigns the sorted tasks to capable workers. In this way, a worker can successfully complete hard tasks after completing easier, related ones. Experimental results on semi-simulated and real crowdsourcing projects show that SPCrowd better controls quality and saves budget compared to competitive baselines.