Research on Chinese Short Text Segmentation for New Media Comments
Pei-jun Gao, Yana Zhang, Suya Zhang, Zeyu Chen
2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall), 2021-10-13
DOI: https://doi.org/10.1109/icisfall51598.2021.9627361
Citations: 1
Abstract
With the development of the new media industry, comment-based user interaction has become routine in live broadcasting. User comments usually appear as short, free-form texts that contain cyber new words. General-purpose word segmentation methods do not adapt well to such Chinese short texts in new media comments. This paper proposes a novel method of Chinese short text segmentation that addresses the problem of self-adaptive word segmentation granularity. A New Media Comment Short Text Dataset (NMCD) is built for this research, together with a word vector text that contains cyber new words and entity words. Our optimized bidirectional Long Short-Term Memory (LSTM) model, based on an attention mechanism and transfer learning, keeps a number and its unit together after word segmentation. The experimental results show that the F1-score is improved by 21.43%. The word segmentation method in this paper could later be applied efficiently to a new media comment analysis system.
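
To make the granularity point concrete, the sketch below computes a span-based F1-score for word segmentation, a common way to score segmenters. It is only an illustration under stated assumptions: the abstract does not say how its F1-score is computed, the function names `to_spans` and `segmentation_f1` are hypothetical, and the example sentence is invented rather than taken from NMCD.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold, pred):
    """Span-based F1: a predicted word counts only if both boundaries match gold."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the gold annotation keeps the number and its unit together.
gold = ["我", "买", "了", "3斤", "苹果"]
# A finer-grained segmenter splits "3斤" into "3" and "斤".
fine = ["我", "买", "了", "3", "斤", "苹果"]

print(segmentation_f1(gold, gold))
print(segmentation_f1(gold, fine))
```

In this toy case, splitting the number from its unit replaces one correct word with two incorrect ones, which lowers both precision and recall. That is why segmentation granularity has a direct effect on the reported score.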