ThaiTC: Thai Transformer-based Image Captioning
Teetouch Jaknamon, S. Marukatat
2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2022-11-05
DOI: 10.1109/iSAI-NLP56921.2022.9960246

Abstract: Image captioning is a long-studied task. Earlier approaches used a convolutional neural network (CNN) for feature extraction and a recurrent neural network (RNN) for text generation; with the widespread adoption of transformers, this line of work needs further development, especially for the Thai language. This paper proposes ThaiTC, an end-to-end Thai image captioning model that pairs a pretrained vision transformer (ViT) with a Thai-language text transformer, leveraging the transformer architecture on both sides. We experiment with pretrained vision and Thai text transformers to find the combination best suited to Thai image captioning, and evaluate on three Thai image captioning datasets with different challenges: 1) Travel, 2) Food, and 3) Flickr30k (translated). We also test freezing the vision transformer's weights when training on captioning datasets with fewer images. The experiments show that ThaiTC performs much better on the Food and Flickr30k datasets than on the Travel dataset, enabling automatic caption generation for food and travel images.
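The abstract describes pairing a pretrained vision-transformer encoder with a text-transformer decoder, optionally freezing the encoder's weights when the captioning dataset is small. The following is a minimal PyTorch sketch of that setup, not the authors' code: the layer sizes, vocabulary, and tiny stand-in encoder/decoder are illustrative assumptions, and a real system would load pretrained ViT and Thai language-model weights instead.

```python
# Minimal sketch (illustrative, not the ThaiTC implementation) of a
# vision-encoder / text-decoder captioner with an optionally frozen encoder.
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    def __init__(self, d_model=256, vocab_size=8000, freeze_encoder=False):
        super().__init__()
        # Stand-in for a pretrained ViT: a small transformer encoder over
        # precomputed patch embeddings.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Stand-in for a Thai text transformer: embeddings plus a decoder
        # that cross-attends to the image features.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)
        if freeze_encoder:
            # Freeze the (pretrained) vision weights so that training on a
            # small captioning dataset only updates the text side.
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, patches, tokens):
        memory = self.encoder(patches)                 # image features
        x = self.decoder(self.embed(tokens), memory)   # cross-attention
        return self.lm_head(x)                         # next-token logits

model = CaptioningModel(freeze_encoder=True)
patches = torch.randn(1, 196, 256)        # e.g. 14x14 grid of patch embeddings
tokens = torch.randint(0, 8000, (1, 5))   # partial caption token ids
logits = model(patches, tokens)           # shape: (1, 5, 8000)
```

Freezing the encoder this way shrinks the set of trainable parameters to the decoder and output head, which is the usual remedy when the captioning dataset has too few images to fine-tune the vision backbone safely.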