Corpus-Based Vocabulary List for Thai Language
Authors: H. Ketmaneechairat, Maleerat Maliyaem
DOI: 10.12720/jait.14.2.319-327 (https://doi.org/10.12720/jait.14.2.319-327)
Publication date: 2023-01-01
Publication type: Journal Article
Citation count: 1
Abstract
In natural language processing, a corpus is essential both for training models and for the algorithms that build machine learning models. This paper describes the design and process of creating a corpus-based vocabulary for the Thai language that can serve as a main corpus for natural language processing research. The corpus is built in accordance with the rules of the language, using the actual Word Usage Frequency (WUF) analyzed from a text corpus covering several types of content. The results present usage frequencies for several characteristics, namely word usage frequency, character usage frequency, and character bigram frequency. These frequencies are used in this research and provide important information for further NLP research. Based on the findings, it was concluded that the average word length increases as the number of words in the corpus increases; that is, word length and word frequency are positively correlated.
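
The three frequency measures named in the abstract (word usage, character usage, and character bigrams) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes the text has already been segmented into words, which for Thai requires a separate tokenizer because words are not delimited by spaces. It also assumes that characters are counted as Unicode code points and that bigrams are taken inside each word. The function names and the toy token list are hypothetical.

```python
from collections import Counter


def frequency_tables(tokens):
    """Return word, character, and character-bigram counts for a token list.

    Assumption: `tokens` is already word-segmented Thai text.
    """
    word_freq = Counter(tokens)
    # Characters are counted per Unicode code point, so combining vowels
    # and tone marks are counted as separate characters.
    char_freq = Counter(ch for tok in tokens for ch in tok)
    # Assumption: bigrams are taken inside each token, so they never
    # span a word boundary.
    bigram_freq = Counter(
        tok[i:i + 2] for tok in tokens for i in range(len(tok) - 1)
    )
    return word_freq, char_freq, bigram_freq


def average_word_length(tokens):
    """Mean token length in code points; 0.0 for an empty list."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


# Toy, pre-segmented input (hypothetical; not data from the paper).
tokens = ["แมว", "กิน", "ปลา", "แมว", "นอน"]
words, chars, bigrams = frequency_tables(tokens)
print(words.most_common(3))
print(chars.most_common(3))
print(bigrams.most_common(3))
print(average_word_length(tokens))
```

Tracking `average_word_length` over growing prefixes of a real token stream would be one way to examine the reported relationship between corpus size and average word length.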