{"title":"Bayesian Noisy Word Clustering through Sampling Prototypical Words","authors":"T. Taniguchi, Yuta Fukusako, Toshiaki Takano","doi":"10.1109/DEVLRN.2018.8760503","journal":{"name":"2018 Joint IEEE 8th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)"},"publicationDate":"2018-09-01","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/DEVLRN.2018.8760503"}
Bayesian Noisy Word Clustering through Sampling Prototypical Words
This paper describes a new algorithm for sampling prototypical words from a set of noisy words and proposes a noisy word clustering method. In a lexical acquisition task, phoneme sequences recognized by a developmental robot using a phoneme recognizer contain many errors. A letter or phoneme sequence containing such errors is called a noisy word. Building a mixture model for noisy words, and a clustering method based on it, requires a procedure for sampling a prototypical word, i.e., a “mean” string, from a cluster of noisy words. Despite the long history of methods for treating noisy words, e.g., stochastic deformation models, the edit distance, and their variants, an efficient sampling procedure for prototypical words has not been developed. This paper proposes a mixture of stochastic deformation models, namely a generative model for noisy words, together with efficient blocked Gibbs samplers for the model. The samplers rely on a forward filtering backward sampling procedure that jointly decodes the noisy words and samples their “mean” string. We applied the proposed clustering method to a set of noisy synthetic words and obtained better results than a baseline method. In particular, a sampling procedure using tied backward sampling demonstrated the best performance in reconstructing the original words from noisy words through the clustering process.
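To make the terms "noisy word", "edit distance", and "mean string" concrete, the following is a minimal Python sketch. It is not the paper's model or sampler. The function names (`deform`, `edit_distance`, `medoid`), the alphabet, and the error probabilities are all hypothetical choices made for illustration. The medoid under edit distance is used only as a crude stand-in for a prototypical word, whereas the paper samples the prototype with a blocked Gibbs sampler.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed symbol set, not from the paper


def deform(word, rng, p_sub=0.1, p_del=0.05, p_ins=0.05, alphabet=ALPHABET):
    """Return a noisy copy of `word` using random insertions, deletions,
    and substitutions (a toy deformation process with assumed rates)."""
    out = []
    for ch in word:
        if rng.random() < p_ins:          # spurious symbol before ch
            out.append(rng.choice(alphabet))
        r = rng.random()
        if r < p_del:                     # ch is dropped
            continue
        if r < p_del + p_sub:             # ch is replaced (may redraw ch itself)
            out.append(rng.choice(alphabet))
        else:                             # ch is kept unchanged
            out.append(ch)
    if rng.random() < p_ins:              # spurious symbol after the last one
        out.append(rng.choice(alphabet))
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def medoid(words):
    """Pick the observed word with the smallest total edit distance to the
    others; a simple proxy for a 'mean' string, not the paper's sampler."""
    return min(words, key=lambda w: sum(edit_distance(w, v) for v in words))


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = [deform("robot", rng) for _ in range(10)]
    print(noisy)
    print(medoid(noisy))
```

A medoid can only return a string that was actually observed, so it fails whenever every observation is corrupted. That limitation is one reason to sample the prototype from a generative model instead.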