Robust Morphologic Analyzer for Highly Inflected Languages
Andrés Tomás Hohendahl, J. Zelasco, J. Donayo
International Workshop on Natural Language Processing and Cognitive Science
DOI: https://doi.org/10.5220/0003015301120118
Listed publication date (Semantic Scholar record): 2016-11-21
We present a robust multilingual morphologic tagger and tokenizer for highly inflected languages such as Spanish. It offers efficient spelling correction and ‘sound-like’ word inference, and it extracts some semantic information even from parasynthetic and unknown words. The algorithm combines rules and a statistical best-affix fit with a language estimator. A rich set of flags controls its internal behaviour. The system is designed for efficiency and a low memory footprint, using data structures built from simple, readily available affixing rules. Packed with a Spanish dictionary of 83k lemmas and 5k rules, it recognizes 2.2M exact word forms; the word space it can reach by guessing is many times larger.
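To make the idea of affix rules with a best-affix fallback concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Rule` type, the toy `LEMMAS` and `RULES` data, the tag names, and the "longest matching suffix wins" heuristic are all assumptions chosen for illustration, standing in for the statistical best-affix fit that the abstract mentions but does not specify.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    suffix: str       # surface ending to strip from the word
    replacement: str  # ending to restore in order to form the lemma
    tag: str          # morphological tag attached to the analysis


# Hypothetical toy data; the real system ships 83k lemmas and 5k rules.
LEMMAS = {"cantar", "comer", "gato"}
RULES = [
    Rule("amos", "ar", "V.1pl.pres"),
    Rule("emos", "er", "V.1pl.pres"),
    Rule("os", "o", "N.masc.pl"),
    Rule("s", "", "N.pl"),
]


def analyze(word: str) -> List[Tuple[str, str, bool]]:
    """Return (lemma, tag, known) analyses for a word.

    Analyses whose lemma is in the dictionary are returned in full.
    If there are none, the analysis from the longest matching suffix
    is returned as a guess (known=False).
    """
    word = word.lower()
    exact, guessed = [], []
    # Try longer suffixes first so the most specific rule is preferred.
    for rule in sorted(RULES, key=lambda r: len(r.suffix), reverse=True):
        if len(word) <= len(rule.suffix) or not word.endswith(rule.suffix):
            continue
        lemma = word[: len(word) - len(rule.suffix)] + rule.replacement
        known = lemma in LEMMAS
        (exact if known else guessed).append((lemma, rule.tag, known))
    return exact if exact else guessed[:1]


if __name__ == "__main__":
    for w in ("cantamos", "gatos", "blogueamos"):
        print(w, analyze(w))
```

In this sketch a dictionary word such as "cantamos" resolves to its stored lemma, while an unknown form such as "blogueamos" still receives a guessed lemma and tag from the best-fitting suffix. Storing only lemmas and rules, rather than every inflected form, is one plausible way to keep the memory footprint low.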