An Exploration into the Benefits of the CLIP model for Lifelog Retrieval
Ly-Duyen Tran, Naushad Alam, Yvette Graham, L. K. Vo, N. T. Diep, Binh T. Nguyen, Liting Zhou, C. Gurrin
Proceedings of the 19th International Conference on Content-based Multimedia Indexing, 14 September 2022. DOI: 10.1145/3549555.3549593
In this paper, we fine-tune the CLIP (Contrastive Language-Image Pre-Training) model on the Lifelog Question Answering dataset (LLQA) and investigate the retrieval performance of the fine-tuned model relative to the zero-shot baseline. We train the model using a weight-space ensembling approach with a modified loss function that accounts for the differences between our dataset (LLQA) and the dataset on which CLIP was originally pre-trained. We further evaluate the fine-tuned model with visual as well as multimodal queries on multiple retrieval tasks, demonstrating improved performance over the zero-shot baseline model.
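For readers unfamiliar with weight-space ensembling, the sketch below illustrates the general idea of interpolating between the zero-shot and fine-tuned CLIP weights. It is a minimal illustration, not the authors' exact implementation: the `alpha` mixing coefficient, the checkpoint path, and the helper function name are assumptions introduced here for clarity.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)


def weight_space_ensemble(zero_shot_model, fine_tuned_model, alpha=0.5):
    """Interpolate between zero-shot and fine-tuned CLIP weights.

    alpha = 0.0 recovers the zero-shot model; alpha = 1.0 the fine-tuned one.
    The value 0.5 is an illustrative default, not the paper's reported setting.
    """
    zs_state = zero_shot_model.state_dict()
    ft_state = fine_tuned_model.state_dict()
    ensembled = {
        key: (1 - alpha) * zs_state[key] + alpha * ft_state[key]
        for key in zs_state
    }
    zero_shot_model.load_state_dict(ensembled)  # same architecture, mixed weights
    return zero_shot_model


# Hypothetical usage with a fine-tuned checkpoint saved as a state dict:
# device = "cuda" if torch.cuda.is_available() else "cpu"
# zs_model, preprocess = clip.load("ViT-B/32", device=device)
# ft_model, _ = clip.load("ViT-B/32", device=device)
# ft_model.load_state_dict(torch.load("llqa_finetuned.pt", map_location=device))
# model = weight_space_ensemble(zs_model, ft_model, alpha=0.5)
```

The intuition behind this design choice is that the interpolated weights can retain the robustness of the zero-shot model while gaining the in-domain accuracy of the fine-tuned one, rather than committing fully to either endpoint.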