V. Dhanalakshmi, P. Padmavathy, M. A. Kumar, K. Soman, S. Rajendran
Chunker for Tamil
Advances in Recent Technologies in Communication and Computing, 2009-10-27
DOI: 10.1109/ARTCom.2009.191
This paper presents a chunker for Tamil built with machine learning techniques. Chunking is the task of identifying and segmenting text into syntactically correlated word groups. Here the chunking is learned rather than rule-based: the linguistic knowledge is extracted automatically from an annotated corpus. We have developed our own tagset for annotating the corpus, which is used for training and testing the POS tagger generator and the chunker. The present tagset consists of thirty tags for POS and nine tags for chunking. A corpus of two hundred and twenty-five thousand words was used for training the chunker and testing its accuracy. We found that CRF++ gives the most encouraging results for the Tamil chunker.
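To make the task concrete, the following is a minimal sketch of what chunk-annotated data and chunk extraction could look like. It is not the authors' implementation. It assumes a B-/I-/O encoding of chunk boundaries, which the abstract does not specify, and the English sentence and the tags `DT`, `NN`, `VB`, `PUNC`, `NP`, and `VP` are illustrative placeholders rather than entries from the paper's tagset. The helper names `to_crfpp_lines` and `extract_chunks` are hypothetical. The only detail taken from CRF++ itself is its input layout: one token per line, whitespace-separated columns with the label last, and a blank line between sentences.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)

# Illustrative sentence only; tags are placeholders, not the paper's tagset.
SENTENCE: List[Token] = [
    ("the", "DT", "B-NP"),
    ("boy", "NN", "I-NP"),
    ("ate", "VB", "B-VP"),
    ("rice", "NN", "B-NP"),
    (".", "PUNC", "O"),
]


def to_crfpp_lines(sentences: List[List[Token]]) -> str:
    """Serialize sentences in the column layout CRF++ reads:
    one token per line, tab-separated columns, label in the last column,
    and an empty line between sentences."""
    blocks = ["\n".join("\t".join(tok) for tok in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def extract_chunks(sentence: List[Token]) -> List[Tuple[str, List[str]]]:
    """Group words into chunks from B-/I-/O tags (an assumed encoding)."""
    chunks: List[Tuple[str, List[str]]] = []
    current = None  # the chunk currently being extended, if any
    for word, _pos, tag in sentence:
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != label)
        )
        if starts_new:
            # A stray I- tag with no matching open chunk is treated as a start.
            current = (label, [word])
            chunks.append(current)
        elif prefix == "I":
            current[1].append(word)
        else:
            # "O" (outside any chunk) closes the open chunk.
            current = None
    return chunks


if __name__ == "__main__":
    print(to_crfpp_lines([SENTENCE]), end="")
    for label, words in extract_chunks(SENTENCE):
        print(label, " ".join(words))
```

In this layout the POS column acts as an input feature and the last column is the label a sequence model such as CRF++ would be trained to predict; the extraction step then turns predicted labels back into word groups.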