{"title":"Unsupervised Learning of Morphology Using a Novel Directed Search Algorithm: Taking the First Step","authors":"Matthew G. Snover, G. Jarosz, M. Brent","doi":"10.3115/1118647.1118649","DOIUrl":"https://doi.org/10.3115/1118647.1118649","url":null,"abstract":"This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and a novel search algorithm. By examining morphologically rich subsets of an input lexicon, the search identifies highly productive paradigms. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with other recent unsupervised morphology learning algorithms, demonstrate the effectiveness of this technique.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"318 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131724556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic Context-Free Grammars for Phonology","authors":"K. Müller","doi":"10.3115/1118647.1118655","DOIUrl":"https://doi.org/10.3115/1118647.1118655","url":null,"abstract":"We present a phonological probabilistic context-free grammar, which describes the word and syllable structure of German words. The grammar is trained on a large corpus by a simple supervised method, and evaluated on a syllabification task achieving 96.88% word accuracy on word tokens, and 90.33% on word types. We added rules for English phonemes to the grammar, and trained the enriched grammar on an English corpus. Both grammars are evaluated qualitatively showing that probabilistic context-free grammars can contribute linguistic knowledge to phonology. Our formal approach is multilingual, while the training data is language-dependent.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124740464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling English Past Tense Intuitions with Minimal Generalization","authors":"Adam Albright, B. Hayes","doi":"10.3115/1118647.1118654","DOIUrl":"https://doi.org/10.3115/1118647.1118654","url":null,"abstract":"We describe here a supervised learning model that, given paradigms of related words, learns the morphological and phonological rules needed to derive the paradigm. The model can use its rules to make guesses about how novel forms would be inflected, and has been tested experimentally against the intuitions of human speakers.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127332238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using eigenvectors of the bigram graph to infer morpheme identity","authors":"M. Belkin, J. Goldsmith","doi":"10.3115/1118647.1118652","DOIUrl":"https://doi.org/10.3115/1118647.1118652","url":null,"abstract":"This paper describes the results of some experiments exploring statistical methods to infer syntactic categories from a raw corpus in an unsupervised fashion. It shares certain points in common with Brown et al. (1992) and work that has grown out of that: it employs statistical techniques to derive categories based on what words occur adjacent to a given word. However, we use an eigenvector decomposition of a nearest-neighbor graph to produce a two-dimensional rendering of the words of a corpus in which words of the same syntactic category tend to form clusters and neighborhoods. We exploit this technique for extending the value of automatic learning of morphology. In particular, we look at the suffixes derived from a corpus by unsupervised learning of morphology, and we ask which of these suffixes have a consistent syntactic function (e.g., in English, -ed is primarily a mark of verbal past tense, but -s marks both noun plurals and 3rd person present on verbs).","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122416007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Discovery of Morphemes","authors":"Mathias Creutz, K. Lagus","doi":"10.3115/1118647.1118650","DOIUrl":"https://doi.org/10.3115/1118647.1118650","url":null,"abstract":"We present two methods for unsupervised segmentation of words into morpheme-like units. The model utilized is especially suited for languages with a rich morphology, such as Finnish. The first method is based on the Minimum Description Length (MDL) principle and works online. In the second method, Maximum Likelihood (ML) optimization is used. The quality of the segmentations is measured using an evaluation method that compares the segmentations produced to an existing morphological analysis. Experiments on both Finnish and English corpora show that the presented methods perform well compared to a current state-of-the-art system.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127729752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Morphological reinflection with weighted finite-state transducers","authors":"Alice Kwak, Michael Hammond, Cheyenne Wing","doi":"10.18653/v1/2023.sigmorphon-1.15","DOIUrl":"https://doi.org/10.18653/v1/2023.sigmorphon-1.15","url":null,"abstract":"This paper describes the submission by the University of Arizona to the SIGMORPHON 2023 Shared Task on typologically diverse morphological (re-)inflection. In our submission, we investigate the role of frequency, length, and weighted transducers in addressing the challenge of morphological reinflection. We start with the non-neural baseline provided for the task and show how some improvement can be gained by integrating length and frequency in prefix selection. We also investigate using weighted finite-state transducers, jump-started from edit distance and directly augmented with frequency. Our specific technique is promising and quite simple, but we see only modest improvements for some languages here.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116060157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Automated Prediction of English Lexical Blends Through the Use of Observable Linguistic Features","authors":"Jarem Saunders","doi":"10.18653/v1/2023.sigmorphon-1.10","DOIUrl":"https://doi.org/10.18653/v1/2023.sigmorphon-1.10","url":null,"abstract":"The process of lexical blending is difficult to reliably predict. This difficulty has been shown by machine learning approaches in blend modeling, including attempts using then state-of-the-art LSTM deep neural networks trained on character embeddings, which were able to predict lexical blends given the ordered constituent words in less than half of cases, at maximum. This project introduces a novel model architecture which dramatically increases the correct prediction rates for lexical blends, using only Polynomial regression and Random Forest models. This is achieved by generating multiple possible blend candidates for each input word pairing and evaluating them based on observable linguistic features. The success of this model architecture illustrates the potential usefulness of observable linguistic features for problems that elude more advanced models which utilize only features discovered in the latent space.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121601881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-tuning mSLAM for the SIGMORPHON 2022 Shared Task on Grapheme-to-Phoneme Conversion","authors":"Dan Garrette","doi":"10.18653/v1/2023.sigmorphon-1.31","DOIUrl":"https://doi.org/10.18653/v1/2023.sigmorphon-1.31","url":null,"abstract":"Grapheme-to-phoneme (G2P) conversion is a task that is inherently related to both written and spoken language. Therefore, our submission to the G2P shared task builds off of mSLAM (Bapna et al., 2022), a 600M parameter encoder model pretrained simultaneously on text from 101 languages and speech from 51 languages. For fine-tuning a G2P model, we combined mSLAM’s text encoder, which uses characters as its input tokens, with an uninitialized single-layer RNN-T decoder (Graves, 2012) whose vocabulary is the set of all 381 phonemes appearing in the shared task data. We took an explicitly multilingual approach to modeling the G2P tasks, fine-tuning and evaluating a single model that covered all the languages in each task, and adding language codes as prefixes to the input strings as a means of specifying the language of each example. Our models perform well in the shared task’s “high” setting (in which they were trained on 1,000 words from each language), though they do poorly in the “low” task setting (training on only 100 words from each language). Our models also perform reasonably in the “mixed” setting (training on 100 words in the target language and 1000 words in a related language), hinting that mSLAM’s multilingual pretraining may be enabling useful cross-lingual sharing.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"255 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115940513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LISN @ SIGMORPHON 2023 Shared Task on Interlinear Glossing","authors":"Shu Okabe, François Yvon","doi":"10.18653/v1/2023.sigmorphon-1.21","DOIUrl":"https://doi.org/10.18653/v1/2023.sigmorphon-1.21","url":null,"abstract":"This paper describes LISN’s submission to the second track (open track) of the shared task on Interlinear Glossing for SIGMORPHON 2023. Our systems are based on Lost, a variation of linear Conditional Random Fields initially developed as a probabilistic translation model and then adapted to the glossing task. This model allows us to handle one of the main challenges posed by glossing, i.e. the fact that the list of potential labels for lexical morphemes is not fixed in advance and needs to be extended dynamically when labelling units are not seen in training. In such situations, we show how to make use of candidate lexical glosses found in the translation and discuss how such extension affects the training and inference procedures. The resulting automatic glossing systems prove to yield very competitive results, especially in low-resource settings.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132555931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIGMORPHON 2022 Shared Task on Grapheme-to-Phoneme Conversion Submission Description: Sequence Labelling for G2P","authors":"Leander Girrbach","doi":"10.18653/v1/2023.sigmorphon-1.28","DOIUrl":"https://doi.org/10.18653/v1/2023.sigmorphon-1.28","url":null,"abstract":"This paper describes our participation in the Third SIGMORPHON Shared Task on Grapheme-to-Phoneme Conversion (Low-Resource and Cross-Lingual) (McCarthy et al., 2022). Our models rely on different sequence labelling methods. The main model predicts multiple phonemes from each grapheme and is trained using CTC loss (Graves et al., 2006). We find that sequence labelling methods yield worse performance than the baseline when enough data is available, but can still be used when very little data is available. Furthermore, we demonstrate that alignments learned by the sequence labelling models can be easily inspected.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126489351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}