Morphological cluster induction of Bantu words using a weighted similarity measure
Authors: Catherine Chavula, H. Suleman
Venue: Research Conference of the South African Institute of Computer Scientists and Information Technologists
Published: 2017-09-26
DOI: https://doi.org/10.1145/3129416.3129453
Unsupervised morphological segmentation is attractive for low-density languages that have little linguistic description, such as many of the Bantu languages. However, techniques that cluster morphologically related words rely on string similarity metrics that are better suited to languages with simple morphological systems. This paper proposes a weighted similarity measure that uses a normal distribution to calculate Ordered Weighted Aggregator (OWA) operator weights. The weighting favours shared character sequences that are likely to be part of stems in highly agglutinative languages. The approach is evaluated on text in Chichewa and Citumbuka, both of which belong to group N of the Guthrie classification of Bantu languages. Cluster analysis results show that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated with the OWA-weighted metric are comparable to those of state-of-the-art morphological analysis tools.
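To make the contrast between the two metrics concrete, the following Python sketch places a plain Dice coefficient over character bigrams next to a position-weighted variant. It is an illustration under stated assumptions, not the authors' formula: the abstract does not give the exact weighting scheme, so `normal_owa_weights` uses a commonly cited normal-distribution construction for OWA weights (mean at the middle position, standard deviation taken over the positions), and `weighted_dice` is a hypothetical way of applying those weights by bigram position so that shared material near the middle of a word counts more than shared material at the edges. All function names are invented for this example.

```python
import math


def bigrams(word):
    """Return the character bigrams of a word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_coefficient(a, b):
    """Baseline: Dice coefficient over sets of character bigrams."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def normal_owa_weights(n):
    """Weights from a discretised normal curve centred on position (n + 1) / 2.

    Assumption: a standard normal-distribution scheme for OWA weights;
    the paper's exact parameters are not stated in the abstract.
    """
    if n <= 0:
        return []
    if n == 1:
        return [1.0]
    mu = (n + 1) / 2
    sigma_sq = sum((i - mu) ** 2 for i in range(1, n + 1)) / n
    raw = [math.exp(-((i - mu) ** 2) / (2 * sigma_sq)) for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_dice(a, b):
    """Hypothetical position-weighted Dice: mid-word shared bigrams count more."""
    def weight_table(word):
        grams = bigrams(word)
        table = {}
        for gram, weight in zip(grams, normal_owa_weights(len(grams))):
            # A bigram that occurs more than once accumulates its weights.
            table[gram] = table.get(gram, 0.0) + weight
        return table

    wa, wb = weight_table(a), weight_table(b)
    if not wa or not wb:
        return 0.0
    shared = set(wa) & set(wb)
    # Each word's weights sum to 1, so the score stays in [0, 1].
    return sum(wa[g] + wb[g] for g in shared) / 2


if __name__ == "__main__":
    # Illustrative word pair only; no claim is made about the paper's data.
    pair = ("kulemba", "amalemba")
    print("Dice:         ", dice_coefficient(*pair))
    print("Weighted Dice:", weighted_dice(*pair))
```

With weights of this shape, two words that differ only in their outer affixes but share a long middle segment are pulled closer together than under the unweighted Dice coefficient, which treats every bigram equally. Whether the middle of the word is the right place to centre the weights is itself an assumption of this sketch.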