Relationship between spoken Indian languages by clustering of long distance bigram features of speech
K. V. V. Girish, Veena Vijai, A. Ramakrishnan
2016 IEEE Annual India Conference (INDICON), 2016-12-01
DOI: 10.1109/INDICON.2016.7839074 (https://doi.org/10.1109/INDICON.2016.7839074)
Citations: 2
Abstract
This paper proposes a novel method for identifying relationships between languages. The analysis covers four major Indian languages, together with Sanskrit and English. Long-distance bigram Mel Frequency Cepstral Coefficient (MFCC) features are clustered, and different linkage measures are used to test the similarity between the resulting clusters. Phylogenetic trees are constructed to visualize these similarities. The results agree with existing knowledge about language families: for every linkage measure, the language closest to Hindi is Marathi, and the language closest to Tamil is Telugu. Since K-medoids clustering yields the expected language relationships, it is also used to learn dictionaries, to see whether these are useful for language identification. We report one-vs-one classification results and find that accuracy improves for English when the recovered weights are multiplied by the joint probability of the cluster associated with the corresponding medoid.
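
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: a long-distance bigram is taken to be a frame concatenated with the frame `gap` steps later, the linkage measure shown is average linkage over Euclidean distances, the K-medoids step is skipped so that raw feature sets are compared directly, and the input is synthetic Gaussian data standing in for 13-dimensional MFCC frames. The names `long_distance_bigrams`, `language_distances`, and `lang_a` to `lang_c` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cdist, squareform


def long_distance_bigrams(frames, gap):
    """Concatenate each frame with the frame `gap` steps later.

    Assumed definition: the source does not spell out the construction.
    frames has shape (n_frames, n_coeffs); the result has shape
    (n_frames - gap, 2 * n_coeffs).
    """
    if not 0 < gap < len(frames):
        raise ValueError("gap must be between 1 and n_frames - 1")
    return np.hstack([frames[:-gap], frames[gap:]])


def language_distances(features):
    """Average-linkage distance between every pair of feature sets."""
    names = list(features)
    dist = np.zeros((len(names), len(names)))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # Mean Euclidean distance over all cross-language pairs.
            d = cdist(features[names[i]], features[names[j]]).mean()
            dist[i, j] = dist[j, i] = d
    return names, dist


# Synthetic stand-ins for 13-dimensional MFCC frames (not real speech).
rng = np.random.default_rng(0)
offsets = {"lang_a": 0.0, "lang_b": 0.5, "lang_c": 5.0}
features = {
    name: long_distance_bigrams(rng.normal(offset, 1.0, size=(200, 13)), gap=3)
    for name, offset in offsets.items()
}

names, dist = language_distances(features)

# Build a tree from the pairwise distances; `method` is the linkage measure.
tree = linkage(squareform(dist), method="average")
leaf_order = dendrogram(tree, labels=names, no_plot=True)["ivl"]

# Nearest neighbour of each language, ignoring the zero diagonal.
masked = dist + np.diag(np.full(len(names), np.inf))
nearest = {name: names[int(np.argmin(row))] for name, row in zip(names, masked)}
print(leaf_order, nearest)
```

Changing `method` to `"single"` or `"complete"` in the `linkage` call is one way to compare how the tree varies across linkage measures; with real data, each entry of `features` would hold the bigram features extracted from one language's recordings.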