Machine Reading Comprehension of High-Tech Industry Policies: A New Dataset and Chinese Pre-Trained Language Model
Changchang Zeng, Shaobo Li, B. Chen
2021 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS), 2021-12-10
DOI: https://doi.org/10.1109/TOCS53301.2021.9688582
Citations: 0
Abstract
Machine reading comprehension (MRC) is a challenging and active research area in artificial intelligence (AI), with applications such as intelligent question answering and intelligent document retrieval. In this article, we focus on machine reading comprehension of high-tech industry policy texts in China. First, we create a cloze-style machine reading comprehension dataset of Chinese high-tech industrial policies. Next, we propose a new pre-training objective, the multi-segment ordering discriminator, and use a domain-specific dictionary to improve the masked language model (MLM) pre-training process. Finally, we train a new pre-trained language model on our dataset for machine reading comprehension of Chinese industrial policies. Experimental results show that our pre-trained language model outperforms existing models such as BERT and RoBERTa on the new dataset.
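The abstract does not say how the two pre-training components are implemented. The Python sketch below shows one plausible reading and should not be taken as the authors' method. It assumes character-level tokens, as is common for Chinese BERT-style models. It also assumes that dictionary-guided masking means masking every character of a matched domain term together, and that the ordering objective is a binary label saying whether a list of segments is in its original order. The function names `mask_domain_terms` and `make_ordering_example`, the rotation-based reordering, and the sample sentence and dictionary are all hypothetical.

```python
import random


def mask_domain_terms(chars, domain_terms, mask_token="[MASK]"):
    """Mask every character of each dictionary term found in `chars`.

    Hypothetical helper: uses greedy longest-match against `domain_terms`.
    Returns (masked_chars, targets), where targets[i] is the original
    character at a masked position and None elsewhere.
    """
    max_len = max((len(term) for term in domain_terms), default=0)
    masked = list(chars)
    targets = [None] * len(chars)
    i = 0
    while i < len(chars):
        match_len = 0
        # Try the longest candidate span first so longer terms win.
        for n in range(min(max_len, len(chars) - i), 0, -1):
            if "".join(chars[i:i + n]) in domain_terms:
                match_len = n
                break
        if match_len:
            for j in range(i, i + match_len):
                targets[j] = chars[j]
                masked[j] = mask_token
            i += match_len
        else:
            i += 1
    return masked, targets


def make_ordering_example(segments, rng):
    """Build one (segments, label) pair for an ordering discriminator.

    Hypothetical helper: label 1 means the segments are in their original
    order, label 0 means they were reordered (here, by a random rotation).
    """
    segments = list(segments)
    if len(segments) < 2 or rng.random() < 0.5:
        return segments, 1
    k = rng.randrange(1, len(segments))
    rotated = segments[k:] + segments[:k]
    # If duplicate segments make the rotation identical, keep label 1.
    return rotated, int(rotated == segments)


if __name__ == "__main__":
    sentence = list("支持集成电路产业发展")
    terms = {"集成电路", "产业"}  # illustrative dictionary entries
    print(mask_domain_terms(sentence, terms))
    print(make_ordering_example(["A", "B", "C"], random.Random(0)))
```

In practice, a pre-training pipeline would usually mask only a sampled fraction of matched terms rather than all of them, and might use permutations other than rotations. Both choices are simplified here to keep the sketch short.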