A Multi-scale Approach for Vietnamese Image Captioning in Healthcare Domain
Bao G. Do, Doanh C. Bui, Nguyen D. Vo, Khang Nguyen
2022 9th NAFOSTED Conference on Information and Computer Science (NICS), 2022-10-31
DOI: 10.1109/NICS56915.2022.10013398
Citations: 0
Abstract
Image captioning is the task of automatically generating syntactically and semantically meaningful natural-language sentences that describe the visual content of a given image. The problem is attractive because it combines two fields, Computer Vision and Natural Language Processing. Although the task has received some research attention, most existing work focuses on generating English captions. In this paper, we present a Transformer-based model for Vietnamese image captioning built on the VieCap4H dataset, the first large-scale dataset for the healthcare domain in Vietnamese. Specifically, we propose the TG2F module to enhance visual representations and use a BERT-based language model to obtain language representations. In experiments on the VieCap4H dataset, our approach achieves competitive results on both the public and private test sets without using any data augmentation.
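To make the overall pipeline concrete, below is a minimal sketch of a multi-scale, Transformer-based captioning model in PyTorch. The `MultiScaleFusion` class is only a hypothetical stand-in for the paper's TG2F module (its actual design is not described in the abstract), and in the paper the text side would use a BERT-based Vietnamese encoder (e.g., a PhoBERT-style model) rather than the plain embedding used here; all dimensions, names, and fusion logic are illustrative assumptions.

```python
# Illustrative sketch only: multi-scale visual fusion + Transformer caption decoder.
# The fusion module is a hypothetical stand-in for TG2F, not the paper's actual design.
import torch
import torch.nn as nn


class MultiScaleFusion(nn.Module):
    """Projects grid features from several backbone scales to a shared
    dimension and fuses them with self-attention (assumed design)."""
    def __init__(self, in_dims, d_model=512, n_heads=8):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in in_dims)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats):  # feats: list of (B, N_i, C_i) grid features
        tokens = torch.cat([p(f) for p, f in zip(self.projs, feats)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)  # (B, sum N_i, d_model)


class CaptionDecoder(nn.Module):
    """Standard Transformer decoder that attends over fused visual tokens.
    A BERT-based Vietnamese encoder would replace the plain embedding here."""
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, visual_tokens):
        tgt = self.embed(token_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        hidden = self.decoder(tgt, visual_tokens, tgt_mask=mask)
        return self.lm_head(hidden)  # (B, T, vocab_size) logits


if __name__ == "__main__":
    # Fake multi-scale grid features, e.g. from two backbone stages.
    feats = [torch.randn(2, 49, 1024), torch.randn(2, 196, 512)]
    fusion = MultiScaleFusion(in_dims=[1024, 512])
    decoder = CaptionDecoder(vocab_size=32000)
    visual = fusion(feats)
    logits = decoder(torch.randint(0, 32000, (2, 12)), visual)
    print(logits.shape)  # torch.Size([2, 12, 32000])
```

In this sketch, captions are generated autoregressively from the logits at inference time; the key point it illustrates is that features from multiple spatial scales are fused into a single token sequence that the language decoder attends over.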