Revalidating the Encoder-Decoder Depths and Activation Function to Find Optimum Vanilla Transformer Model

Y. Heryadi, B. Wijanarko, Dina Fitria Murad, C. Tho, Kiyota Hashimoto

2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), 16 February 2023. DOI: 10.1109/ICCoSITE57641.2023.10127790
The transformer model has become a state-of-the-art model in Natural Language Processing. The initial transformer model, known as the vanilla transformer, was designed to improve on prominent models for sequence modeling and transduction problems such as language modeling and machine translation. It stacks six identical encoder and decoder layers with an attention mechanism, aiming to push past the limitations of common recurrent language models and encoder-decoder architectures. Its outstanding performance has inspired many researchers to extend the architecture to improve its performance and computational efficiency. Despite the many extensions to the vanilla transformer, there is no clear explanation for the choice of encoder-decoder depth in the vanilla transformer model. This paper presents exploration results on how the combination of encoder-decoder layer depth and the activation function in the feed-forward layer affects the performance of the vanilla transformer model. The model is tested on a downstream task: text translation from Bahasa Indonesia to the Sundanese language. Although the differences in values are not large, the empirical results show that the combinations of depth = 2 with the Sigmoid, Tanh, and ReLU activation functions, and depth = 6 with ReLU, achieve the highest average training accuracy. Interestingly, depth = 6 with ReLU also yields the lowest average training and validation loss. However, statistically, there is no significant difference across the depth and activation function combinations.
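To illustrate the kind of configuration sweep the abstract describes, the sketch below builds vanilla transformers with a shared encoder-decoder depth and a configurable feed-forward activation. This is not the authors' code: the use of PyTorch's nn.Transformer, the depth range, and the model dimensions (d_model = 512, 8 heads, feed-forward size 2048) are assumptions for illustration only; recent PyTorch versions accept a callable activation, which allows Sigmoid and Tanh in addition to the built-in ReLU.

```python
# Minimal sketch (assumptions noted above) of sweeping encoder-decoder depth
# and feed-forward activation for a vanilla transformer.
import itertools
import torch
import torch.nn as nn

depths = [2, 4, 6]                      # illustrative depth grid; the paper highlights depth = 2 and depth = 6
activations = {
    "sigmoid": torch.sigmoid,
    "tanh": torch.tanh,
    "relu": torch.relu,
}

for depth, (name, act) in itertools.product(depths, activations.items()):
    model = nn.Transformer(
        d_model=512,
        nhead=8,
        num_encoder_layers=depth,       # identical depth for encoder and decoder stacks
        num_decoder_layers=depth,
        dim_feedforward=2048,
        activation=act,                 # activation inside the position-wise feed-forward sublayer
        batch_first=True,
    )
    n_params = sum(p.numel() for p in model.parameters())
    print(f"depth={depth}, activation={name}: {n_params / 1e6:.1f}M parameters")
    # ... train each configuration on an Indonesian-Sundanese parallel corpus
    # and record average training/validation accuracy and loss for comparison.
```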