Distinctive Features of Association Measures Applied to Chinese Character Bigram Extraction Tasks

D. S. Korshunov
NSU Vestnik. Series: Linguistics and Intercultural Communication. Published 2022-06-11. DOI: 10.25205/1818-7935-2022-20-2-64-80 (https://doi.org/10.25205/1818-7935-2022-20-2-64-80)

Abstract

Researchers studying professional discourse can now build their own text collections and analyze them with linguistic software tools. For Chinese discourse, however, the reliability of automatic word segmentation remains a problem. One way to extract lexical units from Chinese texts is to apply statistical association measures for collocations to Chinese character bigrams. This work presents a comparative analysis of seven such measures as a means of extracting two-syllable lexical units (binomes) from unsegmented Chinese character text. The analysis covers the lexical, grammatical and frequency characteristics of the bigrams that receive the highest values of each measure. Comparing these characteristics makes it possible to describe the features of each measure and, in particular, to determine which measures best suit which linguistic tasks. The material of the study is a collection of 560 military-related news texts in Chinese totaling more than 720 thousand characters. The results show that the measures fall into three groups according to the characteristics of the bigrams they rank highest. The first group comprises MI, MS and logDice, which give priority to rare bigrams whose components have limited combinability, such as Chinese two-syllable single-morpheme words (“lianmianzi”). These measures do not extract terms well, but they can be used to search for phraseologically bound components. The second group, t-score and log-likelihood, is frequency-oriented: its rankings resemble a plain frequency list but handle non-lexical bigrams better, and log-likelihood additionally lowers the rank of numerals and pronouns somewhat, so it picks out the typical vocabulary of professional discourse best. The third group comprises MI3 and MI.log-f, which strike a balance between the opposite approaches of the first two groups. MI3 is considered the most universal measure and could be used to compare different corpora or text collections. The study concludes that applying statistical association measures to Chinese character bigrams is feasible and appropriate, provided that the measure is chosen to fit the research task.
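
The abstract names the seven measures but does not give their formulas. The following is a minimal sketch of how such scores could be computed for character bigrams, assuming the definitions commonly used in corpus tools (MI, MI3, MI.log-f, t-score, minimum sensitivity for MS, logDice, and the log-likelihood ratio over a 2x2 contingency table). The function names `score_bigrams` and `log_likelihood`, the restriction to the basic CJK block, and the choice not to let bigrams span punctuation are assumptions made for illustration, not details taken from the article.

```python
import math
import re
from collections import Counter

# Assumption: only the basic CJK Unified Ideographs block counts as a character;
# punctuation, digits and Latin letters break the text into separate runs.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def log_likelihood(f_ab, f_a, f_b, n):
    """Log-likelihood ratio over the 2x2 contingency table of one bigram."""
    observed = (f_ab, f_a - f_ab, f_b - f_ab, n - f_a - f_b + f_ab)
    expected = (
        f_a * f_b / n,
        f_a * (n - f_b) / n,
        (n - f_a) * f_b / n,
        (n - f_a) * (n - f_b) / n,
    )
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def score_bigrams(text, min_freq=1):
    """Return {bigram: {measure name: score}} for adjacent Han character pairs."""
    pairs = Counter()
    for run in HAN_RUN.findall(text):
        pairs.update(run[i:i + 2] for i in range(len(run) - 1))
    n = sum(pairs.values())  # total number of bigram tokens

    # Positional marginals: how often a character opens or closes a bigram.
    first, second = Counter(), Counter()
    for pair, freq in pairs.items():
        first[pair[0]] += freq
        second[pair[1]] += freq

    scores = {}
    for pair, f_ab in pairs.items():
        if f_ab < min_freq:
            continue
        f_a, f_b = first[pair[0]], second[pair[1]]
        expected = f_a * f_b / n  # co-occurrence count expected under independence
        mi = math.log2(f_ab / expected)
        scores[pair] = {
            "MI": mi,
            "MI3": math.log2(f_ab ** 3 / expected),
            "MI.log-f": mi * math.log(f_ab + 1),
            "t-score": (f_ab - expected) / math.sqrt(f_ab),
            "MS": min(f_ab / f_a, f_ab / f_b),
            "logDice": 14 + math.log2(2 * f_ab / (f_a + f_b)),
            "log-likelihood": log_likelihood(f_ab, f_a, f_b, n),
        }
    return scores
```

Sorting the result by one measure at a time, for example `sorted(scores, key=lambda p: scores[p]["MI3"], reverse=True)[:20]`, gives ranked bigram lists that can then be compared across measures. A `min_freq` threshold above 1 is a common way to keep very rare bigrams from dominating the MI-type rankings.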