Domain-oriented Language Modeling with Adaptive Hybrid Masking and Optimal Transport Alignment

Denghui Zhang, Zixuan Yuan, Yanchi Liu, Hao Liu, Fuzhen Zhuang, Hui Xiong, Haifeng Chen
{"title":"基于自适应混合掩蔽和最优传输对齐的面向领域语言建模","authors":"Denghui Zhang, Zixuan Yuan, Yanchi Liu, Hao Liu, Fuzhen Zhuang, Hui Xiong, Haifeng Chen","doi":"10.1145/3447548.3467215","DOIUrl":null,"url":null,"abstract":"Motivated by the success of pre-trained language models such as BERT in a broad range of natural language processing (NLP) tasks, recent research efforts have been made for adapting these models for different application domains. Along this line, existing domain-oriented models have primarily followed the vanilla BERT architecture and have a straightforward use of the domain corpus. However, domain-oriented tasks usually require accurate understanding of domain phrases, and such fine-grained phrase-level knowledge is hard to be captured by existing pre-training scheme. Also, the word co-occurrences guided semantic learning of pre-training models can be largely augmented by entity-level association knowledge. But meanwhile, there is a risk of introducing noise due to the lack of groundtruth word-level alignment. To address the issues, we provide a generalized domain-oriented approach, which leverages auxiliary domain knowledge to improve the existing pre-training framework from two aspects. First, to preserve phrase knowledge effectively, we build a domain phrase pool as auxiliary knowledge, meanwhile we introduce Adaptive Hybrid Masked Model to incorporate such knowledge. It integrates two learning modes, word learning and phrase learning, and allows them to switch between each other. Second, we introduce Cross Entity Alignment to leverage entity association as weak supervision to augment the semantic learning of pre-trained models. To alleviate the potential noise in this process, we introduce an interpretableOptimal Transport based approach to guide alignment learning. Experiments on four domain-oriented tasks demonstrate the superiority of our framework.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Domain-oriented Language Modeling with Adaptive Hybrid Masking and Optimal Transport Alignment\",\"authors\":\"Denghui Zhang, Zixuan Yuan, Yanchi Liu, Hao Liu, Fuzhen Zhuang, Hui Xiong, Haifeng Chen\",\"doi\":\"10.1145/3447548.3467215\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Motivated by the success of pre-trained language models such as BERT in a broad range of natural language processing (NLP) tasks, recent research efforts have been made for adapting these models for different application domains. Along this line, existing domain-oriented models have primarily followed the vanilla BERT architecture and have a straightforward use of the domain corpus. However, domain-oriented tasks usually require accurate understanding of domain phrases, and such fine-grained phrase-level knowledge is hard to be captured by existing pre-training scheme. Also, the word co-occurrences guided semantic learning of pre-training models can be largely augmented by entity-level association knowledge. But meanwhile, there is a risk of introducing noise due to the lack of groundtruth word-level alignment. To address the issues, we provide a generalized domain-oriented approach, which leverages auxiliary domain knowledge to improve the existing pre-training framework from two aspects. 
First, to preserve phrase knowledge effectively, we build a domain phrase pool as auxiliary knowledge, meanwhile we introduce Adaptive Hybrid Masked Model to incorporate such knowledge. It integrates two learning modes, word learning and phrase learning, and allows them to switch between each other. Second, we introduce Cross Entity Alignment to leverage entity association as weak supervision to augment the semantic learning of pre-trained models. To alleviate the potential noise in this process, we introduce an interpretableOptimal Transport based approach to guide alignment learning. Experiments on four domain-oriented tasks demonstrate the superiority of our framework.\",\"PeriodicalId\":421090,\"journal\":{\"name\":\"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3447548.3467215\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447548.3467215","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Motivated by the success of pre-trained language models such as BERT in a broad range of natural language processing (NLP) tasks, recent research efforts have been devoted to adapting these models to different application domains. Along this line, existing domain-oriented models have primarily followed the vanilla BERT architecture and make straightforward use of the domain corpus. However, domain-oriented tasks usually require an accurate understanding of domain phrases, and such fine-grained phrase-level knowledge is hard to capture with existing pre-training schemes. Also, the word-co-occurrence-guided semantic learning of pre-training models can be largely augmented by entity-level association knowledge, but meanwhile there is a risk of introducing noise due to the lack of ground-truth word-level alignment. To address these issues, we provide a generalized domain-oriented approach that leverages auxiliary domain knowledge to improve the existing pre-training framework in two respects. First, to preserve phrase knowledge effectively, we build a domain phrase pool as auxiliary knowledge and introduce the Adaptive Hybrid Masked Model to incorporate such knowledge. It integrates two learning modes, word learning and phrase learning, and allows them to switch between each other. Second, we introduce Cross Entity Alignment, which leverages entity association as weak supervision to augment the semantic learning of pre-trained models. To alleviate the potential noise in this process, we introduce an interpretable Optimal Transport based approach to guide alignment learning. Experiments on four domain-oriented tasks demonstrate the superiority of our framework.
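
To make the hybrid masking idea concrete, below is a minimal Python sketch of how a pre-built domain phrase pool could drive switching between phrase-level and word-level masking. The function name, masking ratio, and switching probability are illustrative assumptions; the sketch does not reproduce the paper's Adaptive Hybrid Masked Model or its adaptive switching policy.

```python
# Hypothetical sketch: mask whole domain phrases or single words, assuming a
# pre-built phrase pool. Ratios and switching are illustrative, not the paper's.
import random

MASK = "[MASK]"

def hybrid_mask(tokens, phrase_pool, phrase_mode_prob=0.5, mask_ratio=0.15):
    """Mask either whole domain phrases or single words in a token list."""
    tokens = list(tokens)
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    masked = 0
    # Phrase mode: try to mask spans that match entries in the phrase pool.
    if random.random() < phrase_mode_prob:
        for length in (3, 2):  # longest match first
            for i in range(len(tokens) - length + 1):
                if masked >= n_to_mask:
                    break
                span = " ".join(tokens[i:i + length])
                if span in phrase_pool and MASK not in tokens[i:i + length]:
                    tokens[i:i + length] = [MASK] * length
                    masked += length
    # Word mode (or fallback when no phrase matched): mask random single tokens.
    while masked < n_to_mask:
        j = random.randrange(len(tokens))
        if tokens[j] != MASK:
            tokens[j] = MASK
            masked += 1
    return tokens

# Example: the phrase "optimal transport" is masked as a single unit.
pool = {"optimal transport", "language model"}
print(hybrid_mask("we align entities with optimal transport".split(), pool))
```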
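
Similarly, the following sketch shows a standard entropy-regularized Sinkhorn computation of a soft transport plan between two sets of token embeddings, the kind of interpretable Optimal Transport alignment the abstract refers to. The cosine cost, uniform marginals, and hyperparameters are assumptions, and the paper's Cross Entity Alignment loss is not reproduced here.

```python
# Hypothetical sketch: entropy-regularized optimal transport (Sinkhorn) between
# two sets of token embeddings; cost, marginals, and eps are assumptions.
import numpy as np

def sinkhorn_plan(x, y, eps=0.1, n_iters=100):
    """Return a soft alignment (transport plan) between rows of x and y."""
    # Cost: cosine distance between every pair of embeddings.
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_n = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - x_n @ y_n.T                   # shape (n, m)
    K = np.exp(-cost / eps)                    # Gibbs kernel
    a = np.full(x.shape[0], 1.0 / x.shape[0])  # uniform source marginal
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # uniform target marginal
    u = np.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return np.diag(u) @ K @ np.diag(v)         # plan with the given marginals

# Example: soft-align 5 source tokens with 7 target tokens (4-dim embeddings).
rng = np.random.default_rng(0)
plan = sinkhorn_plan(rng.normal(size=(5, 4)), rng.normal(size=(7, 4)))
print(plan.shape, plan.sum())                  # (5, 7), total mass ~1
```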