Extracting Thai Compounds Using Collocations and POS Bigram Probabilities without a POS Tagger

Wirote Aroonmanakun
2009 International Conference on Asian Language Processing, published 2009-12-07. DOI: 10.1109/IALP.2009.33 (https://doi.org/10.1109/IALP.2009.33)
Citations: 5

Abstract

This paper presents a simple method for extracting compounds using statistical collocations and POS bigram probabilities, without a POS tagger. Statistical collocation was used to determine the strength of word co-occurrences. Probabilities of POS sequences were used to adjust the strength of collocation within a possible compound. These probabilities were estimated from compounds found in the dictionary. Word bigrams and trigrams extracted from a corpus of 28 million words were ranked in two ways: by collocation scores alone, and by collocation scores weighted by POS pattern probabilities. For both methods, precision was calculated at a cutoff every 200 points. The results showed that probabilities of POS sequences could increase the precision of compound extraction to a certain extent. The system can extract 2-word and 3-word compounds with precision of up to 63% and 35%, respectively. When bigram extractions that could be parts of trigram extractions are eliminated, precision increases to as much as 71%.
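
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the collocation measure, the rule for combining it with POS probabilities, or the exact cutoff procedure, so everything below is an assumption made for illustration: pointwise mutual information stands in for the collocation score, the weight is the most probable POS bigram among the tags a dictionary allows for each word (so no tagger is run), and precision is taken cumulatively over the top k ranked items. The names `pmi`, `pos_weight`, `rank_bigrams`, `precision_at_cutoffs`, and `drop_subsumed_bigrams` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (an assumed stand-in for the collocation score)."""
    return math.log2((pair_count * total) / (w1_count * w2_count))


def pos_weight(w1, w2, lexicon, pos_bigram_prob):
    """Highest POS-bigram probability over the tags the lexicon lists for each word.

    No tagger is involved: every tag pair the dictionary permits is tried.
    Taking the maximum is an assumption, not the paper's stated rule.
    """
    return max(
        (pos_bigram_prob.get((t1, t2), 0.0)
         for t1 in lexicon.get(w1, ())
         for t2 in lexicon.get(w2, ())),
        default=0.0,
    )


def rank_bigrams(tokens, lexicon, pos_bigram_prob, min_count=2):
    """Rank word bigrams by collocation score weighted by POS pattern probability."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)  # used for both unigram and bigram estimates (approximation)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        score = pmi(count, unigrams[w1], unigrams[w2], total)
        if score <= 0:
            continue  # keep only pairs that co-occur more often than chance
        weight = pos_weight(w1, w2, lexicon, pos_bigram_prob)
        scored.append(((w1, w2), score * weight))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def precision_at_cutoffs(ranked_items, is_compound, step=200):
    """Cumulative precision over the top k items for k = step, 2*step, ..."""
    results = []
    hits = 0
    for k, item in enumerate(ranked_items, start=1):
        if is_compound(item):
            hits += 1
        if k % step == 0:
            results.append((k, hits / k))
    return results


def drop_subsumed_bigrams(bigrams, trigrams):
    """Remove bigrams that occur as the first or last two words of an extracted trigram."""
    covered = set()
    for a, b, c in trigrams:
        covered.add((a, b))
        covered.add((b, c))
    return [bg for bg in bigrams if bg not in covered]
```

In this sketch `lexicon` maps each word to the set of POS tags a dictionary lists for it, and `pos_bigram_prob` maps tag pairs to probabilities estimated from dictionary compounds. The same weighting extends to trigrams by multiplying the probabilities of the two adjacent tag pairs, which is again an assumption rather than the paper's documented procedure.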