RGFGM-LXMERT-An Improve Architecture Based On LXMERT
Author: Renjie Yu
Venue: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition
Publication date: 2022-11-17
DOI: 10.1145/3581807.3581879 (https://doi.org/10.1145/3581807.3581879)
Citation count: 0
Abstract
LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a two-stream cross-modality pre-trained model that performs well on several downstream tasks: two visual question answering datasets and a challenging visual-reasoning task (i.e., VQA, GQA, and NLVR). However, this large-scale model still leaves considerable room for improvement: its accuracy is low, its generalization ability is weak, and it is vulnerable to adversarial attacks. Furthermore, training LXMERT costs a great deal of time and money, so improvements are urgently needed. I therefore try to improve the training speed, generalization ability, and accuracy of the model by enhancing both the training method and the model structure. For the training method, FGM (Fast Gradient Method) adversarial training is introduced in the fine-tuning phase by adding perturbations to the weights of both the language embedding layer and the visual feature linear layer, which effectively improves model accuracy and generalization ability. For the model structure, a weighted residual block improves training speed by 1.6% in the pre-training phase without degrading model performance. Next, the most important structure, the Encoder, is redesigned to help the model converge better: the Encoder's FFN (Feed-Forward Neural Network) is replaced by a GLU (Gated Linear Unit), which also improves the model's fitting ability and performance. The improved model outperforms the baseline (i.e., LXMERT) on the VQA task. Finally, detailed ablation studies show that my enhancement strategies are effective for LXMERT and reveal the contribution of each measure to the model.
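The three components named in the abstract can be illustrated with a minimal pure-Python sketch. The abstract gives no formulas, so everything here is an assumption based on the commonly used formulations rather than the paper's actual implementation: the FGM perturbation is taken as epsilon * g / ||g||_2, the GLU as (xW + b) * sigmoid(xV + c), and the weighted residual as x + alpha * f(x). All function and parameter names (`fgm_perturbation`, `glu`, `weighted_residual`, `epsilon`, `alpha`) are hypothetical.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """FGM step (assumed form): epsilon * grad / ||grad||_2.

    Returns a zero vector when the gradient is zero, to avoid dividing by zero.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def apply_fgm(weights, grad, epsilon=1.0):
    """Return (perturbed_weights, backup) so the caller can restore the
    original weights after the adversarial forward/backward pass."""
    backup = list(weights)
    delta = fgm_perturbation(grad, epsilon)
    perturbed = [w + d for w, d in zip(weights, delta)]
    return perturbed, backup


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, weight, bias):
    """Compute x @ weight + bias; weight has shape (len(x), len(bias))."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def glu(x, w_value, b_value, w_gate, b_gate):
    """Gated Linear Unit (assumed form): value branch times sigmoid gate."""
    value = linear(x, w_value, b_value)
    gate = linear(x, w_gate, b_gate)
    return [v * sigmoid(g) for v, g in zip(value, gate)]


def weighted_residual(x, sublayer_out, alpha):
    """Weighted residual (assumed form): x + alpha * f(x), alpha a scalar."""
    return [xi + alpha * si for xi, si in zip(x, sublayer_out)]
```

In an adversarial fine-tuning loop of this kind, one would typically compute the gradient on the clean batch, call `apply_fgm` on the targeted layer's weights, run a second forward/backward pass on the perturbed weights, restore the backup, and only then take the optimizer step. Whether the paper follows exactly this order is not stated in the abstract.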