Discriminative Additive Scale Loss for Deep Imbalanced Classification and Embedding
Zhao Zhang, Weiming Jiang, Yang Wang, Qiaolin Ye, Mingbo Zhao, Mingliang Xu, Meng Wang
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00105

Abstract: Real-world data in emerging applications may follow highly skewed, class-imbalanced distributions, and how to handle such data appropriately with deep learning needs further investigation. In this paper, we propose a novel cross-entropy-based loss function, referred to as Additive Scale Loss (ASL), for deep representation learning and imbalanced classification. To deal with class imbalance, ASL increases the loss incurred by misclassification, which prevents the accumulated loss of the many easily classified examples in an imbalanced dataset from dominating the loss of the misclassified ones. Moreover, in real-world applications one data source may serve multiple scenarios, such as classification and embedding learning, yet training two separate models for these problems is costly, especially in deep learning. To tackle this issue, we integrate a discriminative inter-class separation term into ASL and propose a discriminative ASL (D-ASL), which not only improves classification performance but also yields discriminative representations. The inter-class separation term is general and can easily be integrated into other loss functions, such as cross-entropy (CE) and focal loss (FL). Finally, we propose a new deep convolutional neural network equipped with D-ASL and a fully-connected (FC) layer, which classifies imbalanced image data and obtains discriminative representations at the same time. Extensive experimental results verify the superior performance of our method.
Differentially Private String Sanitization for Frequency-Based Mining Tasks
Huiping Chen, Changyu Dong, Liyue Fan, G. Loukides, S. Pissis, L. Stougie
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00014

Abstract: Strings are used to model genomic, natural language, and web activity data, and are thus often shared broadly. However, string data sharing has raised privacy concerns, stemming from the fact that knowledge of the length-k substrings of a string and their frequencies (multiplicities) may suffice to uniquely reconstruct the string, and that the inference of such substrings may leak confidential information. We thus introduce the problem of protecting the length-k substrings of a single string S by applying Differential Privacy (DP) while maximizing data utility for frequency-based mining tasks. Our theoretical and empirical evidence suggests that classic DP mechanisms are not suitable for this problem. In response, we employ the order-k de Bruijn graph G of S and propose a sampling-based mechanism for enforcing DP on G. We consider the task of enforcing DP on G using our mechanism while preserving the normalized edge multiplicities in G. We define an optimization problem on integer edge weights that is central to this task and develop a dynamic-programming algorithm to solve it exactly. We also consider two variants of this problem with real edge weights. By relaxing the constraint of integer edge weights, we are able to develop linear-time exact algorithms for these variants, which we use as stepping stones towards effective heuristics. An extensive experimental evaluation using real-world large-scale strings (on the order of billions of letters) shows that our heuristics are efficient and produce near-optimal solutions that preserve data utility for frequency-based mining tasks.
BioHanBERT: A Hanzi-aware Pre-trained Language Model for Chinese Biomedical Text Mining
Xiaosu Wang, Yun Xiong, Hao Niu, Jingwen Yue, Yangyong Zhu, Philip S. Yu
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00181

Abstract: Unsupervised pre-trained language models (PLMs) have boosted the development of effective biomedical text mining models. However, biomedical texts contain a huge number of long-tail concepts and terminologies, which makes further pre-training on biomedical corpora relatively expensive (more biomedical corpora and more pre-training steps are needed). Nonetheless, this problem has received little attention in recent studies. In Chinese biomedical text, concepts and terminologies consist of Chinese characters, and Chinese characters are often composed of sub-character components that are themselves semantically informative; thus, using a Chinese character's component-level internal semantic information to enhance the semantics of biomedical concepts and terminologies appears reasonable. In this paper, we propose a novel hanzi-aware pre-trained language model for Chinese biomedical text mining, referred to as BioHanBERT (hanzi-aware BERT), which utilizes the component-level internal semantic information of Chinese characters to enhance the semantics of Chinese biomedical concepts and terminologies, and thereby to reduce further pre-training costs. BioHanBERT first employs a Chinese character encoder to extract the component-level internal semantic feature of each Chinese character, and then fuses the character's internal semantic feature with its contextual embedding extracted by BERT to enrich the representations of the concepts or terminologies containing the character. The results of extensive experiments show that our model consistently outperforms current state-of-the-art (SOTA) models on a wide range of Chinese biomedical natural language processing (NLP) tasks.
Ultra fast warping window optimization for Dynamic Time Warping
Chang Wei Tan, Matthieu Herrmann, Geoffrey I. Webb
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00070

Abstract: The Dynamic Time Warping (DTW) similarity measure is widely used in many time series data mining applications. It computes the cost of aligning two series, smaller costs indicating more similar series. Most applications require tuning of DTW's Warping Window (WW) parameter in order to achieve good performance. This parameter controls the amount of warping allowed, reducing pathological alignments, with the added benefit of speeding up computation. However, since DTW is itself very costly, learning the WW is a burdensome process, requiring days even for datasets containing only a few thousand series. In this paper, we propose ULTRAFASTWWSEARCH, a new algorithm able to learn the WW significantly faster than the state-of-the-art FASTWWSEARCH method. ULTRAFASTWWSEARCH builds upon the latter, exploiting the properties of a new efficient exact DTW algorithm which supports early abandoning and pruning (EAP). We show on 128 datasets from the UCR archive that ULTRAFASTWWSEARCH is up to an order of magnitude faster than the previous state of the art.
Operation-level Progressive Differentiable Architecture Search
Xunyu Zhu, Jian Li, Yong Liu, Junshuo Liao, Weiping Wang
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00205

Abstract: Differentiable Architecture Search (DARTS) is becoming increasingly popular among Neural Architecture Search (NAS) methods because of its high search efficiency and low compute cost. However, DARTS is notoriously unstable: in particular, the aggregation of skip connections leads to performance collapse. Although existing methods leverage Hessian eigenvalues to alleviate skip-connection aggregation, they prevent DARTS from exploring architectures with better performance. In this paper, we propose operation-level progressive differentiable architecture search (OPP-DARTS) to avoid skip-connection aggregation and explore better architectures simultaneously. We divide the search process into several stages and add candidate operations to the search space progressively at the beginning of each stage. This effectively alleviates the unfair competition between operations during the search phase of DARTS by offsetting the inherent unfair advantage of the skip connection over other operations. Besides, to keep the competition between operations relatively fair, we select from the candidate operations the one that makes the training loss of the supernet largest. Experimental results indicate that our method is effective and efficient. Its performance on CIFAR-10 is superior to the architecture found by standard DARTS, and its transferability also surpasses standard DARTS. We further demonstrate the robustness of our method on three simple search spaces, i.e., S2, S3, and S4, and the results show that it is more robust than standard DARTS. Our code is available at https://github.com/zxunyu/OPP-DARTS.
{"title":"Semi-Supervised Graph Attention Networks for Event Representation Learning","authors":"João Pedro Rodrigues Mattos, R. Marcacini","doi":"10.1109/ICDM51629.2021.00150","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00150","url":null,"abstract":"Event analysis from news and social networks is very useful for a wide range of social studies and real-world applications. Recently, event graphs have been explored to model event datasets and their complex relationships, where events are vertices connected to other vertices representing locations, people’s names, dates, and various other event metadata. Graph representation learning methods are promising for extracting latent features from event graphs to enable the use of different classification algorithms. However, existing methods fail to meet essential requirements for event graphs, such as (i) dealing with semi-supervised graph embedding to take advantage of some labeled events, (ii) automatically determining the importance of the relationships between event vertices and their metadata vertices, as well as (iii) dealing with the graph heterogeneity. This paper presents GNEE (GAT Neural Event Embeddings), a method that combines Graph Attention Networks and Graph Regularization. First, an event graph regularization is proposed to ensure that all graph vertices receive event features, thereby mitigating the graph heterogeneity drawback. Second, semi-supervised graph embedding with self-attention mechanism considers existing labeled events, as well as learns the importance of relationships in the event graph during the representation learning process. A statistical analysis of experimental results with five real-world event graphs and six graph embedding methods shows that our GNEE outperforms state-of-the-art semi-supervised graph embedding methods.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115803856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"USTEP: Unfixed Search Tree for Efficient Log Parsing","authors":"Arthur Vervaet, Raja Chiky, Mar Callau-Zori","doi":"10.1109/ICDM51629.2021.00077","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00077","url":null,"abstract":"Logs record valuable system information at runtime. They are widely used by data-driven approaches for development and monitoring purposes. Parsing log messages to structure their format is a classic preliminary step for log-mining tasks. As they appear upstream, parsing operations can become a processing time bottleneck for downstream applications. The quality of parsing also has a direct influence on their efficiency. Previous approaches toward online log parsing focused on stateful methods. But an increasing number of tasks ask for real time monitoring. Regarding this problem, we propose USTEP, an online log parsing method based on an evolving tree structure. Evaluation results on a panel of 13 datasets coming from different real-world systems demonstrate USTEP superiority in terms of both effectiveness and robustness when compared to other online methods. We also introduce USTEP-UP, a way of running multiple decentralized instances of USTEP in parallel.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"17 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131095419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections","authors":"Rob Churchill, Lisa Singh","doi":"10.1109/ICDM51629.2021.00017","DOIUrl":"https://doi.org/10.1109/ICDM51629.2021.00017","url":null,"abstract":"Most topic models define a document as a mixture of topics and each topic as a mixture of words. Generally, the difference in generative topic models is how these mixtures of topics are generated. We propose looking at topic models in a new way, as topic-noise models. Our topic-noise model defines a document as a mixture of topics and noise. Topic Noise Discriminator (TND) estimates both the topic and noise distributions using not only the relationships between words in documents, but also the linguistic relationships found using word embeddings. This type of model is important for short, sparse social media posts that contain both random and non-random noise. We also understand that topic quality is subjective and that researchers may have preferences. Therefore, we propose a variant of our model that combines the pre-trained noise distribution from TND in an ensemble with any generative topic model to filter noise words and produce more coherent and diverse topic sets. We present this approach using Latent Dirichlet Allocation (LDA) and show that it is effective for maintaining high quality LDA topics while removing noise within them. Finally, we show the value of using a context-specific noise list generated from TND to remove noise statically, after topics have been generated by any topic model, including non-generative ones. We demonstrate the effectiveness of all three of these approaches that explicitly model context-specific noise in document collections.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123704870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Effective and Robust Framework by Modeling Correlations of Multiplex Network Embedding
Pengfei Jiao, Ruili Lu, Di Jin, Yinghui Wang, Huamin Wu
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00136

Abstract: The dependencies across different layers are an important property of multiplex networks, and a few methods have been proposed to learn these dependencies in various ways. Some assume connectivity consistency across layers, forcing two nodes linked in one layer to tend to be linked in the other layers; others introduce a common vector to model the information shared across all layers. However, the correlations among layers in multiplex networks are diverse and go beyond connectivity consistency. In this paper, we propose a novel Modeling Correlations for Multiplex network Embedding (MCME) framework to learn robust node representations for each layer. It handles complex correlations, covering common structure, layer similarity, and node heterogeneity, through a unified framework. To evaluate the proposed model, we conduct extensive experiments on several real-world datasets, and the results demonstrate that it consistently outperforms state-of-the-art methods.
Crowdsourcing with Self-paced Workers
Xiangping Kang, Guoxian Yu, C. Domeniconi, Jun Wang, Weicong Guo, Yazhou Ren, Lili Cui
2021 IEEE International Conference on Data Mining (ICDM), December 2021. DOI: 10.1109/ICDM51629.2021.00038

Abstract: Crowdsourcing is a popular and relatively economical way to harness human intelligence for computer-hard tasks. Due to diverse factors (i.e., task difficulty, worker capability, and incentives), the answers collected from different crowd workers vary in quality. Many approaches have been proposed to obtain high-quality answers and reduce the budget by modelling tasks, workers, or both. However, most existing approaches implicitly assume that worker capability is fixed during the crowdsourcing process. In practice, this capability can improve as a worker completes tasks from easy to hard, akin to human beings' intrinsic self-paced learning ability. In this paper, we investigate crowdsourcing with self-paced workers, whose capability can be gradually boosted as they scrutinise and complete tasks from easy to hard. Our proposed SPCrowd (Self-Paced Crowd worker) first asks workers to complete a set of golden tasks with known annotations; it provides feedback to help workers grasp the basic patterns of the tasks and to spark self-paced learning, which in turn facilitates the estimation of worker quality and task difficulty. It then introduces a task difficulty model to quantify the difficulty of tasks and rank them from easy to hard, along with a benefit-maximization criterion for task assignment, which dynamically monitors the quality of self-paced workers and assigns the sorted tasks to capable workers. In this way, a worker can successfully complete hard tasks after completing easier, related ones. Experimental results on semi-simulated and real crowdsourcing projects show that SPCrowd better controls quality and saves budget compared to competitive baselines.