{"title":"Beyond Concept Detection: The Potential of User Intent for Image Retrieval","authors":"Bo Wang, M. Larson","doi":"10.1145/3132515.3132521","DOIUrl":"https://doi.org/10.1145/3132515.3132521","url":null,"abstract":"Behind each photographic act is a rationale that impacts the visual appearance of the resulting photo. Better understanding of this rationale has great potential to support image retrieval systems in serving user needs. However, at present, surprisingly little is known about the connection between what a picture shows (the literally depicted conceptual content) and why that picture was taken (the photographer intent). In this paper, we investigate photographer intent in a large Flickr data set. First, an expert annotator carries out a large number of iterative intent judgments to create a taxonomy of intent classes. Next, analysis of the distribution of concepts and intent classes reveals patterns of independence both at a global and user level. Finally, we report the results of experiments showing that a deep neural network classifier is capable of learning to differentiate between these intent classes, and that these classes support the diversification of image search results.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114530455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixed Methods and the Future of Multi-Modal Media","authors":"Saeideh Bakhshi, David A. Shamma","doi":"10.1145/3132515.3132524","DOIUrl":"https://doi.org/10.1145/3132515.3132524","url":null,"abstract":"Humans are complex and their behaviors follow complex multi-modal patterns, however to solve many social computing problems one often looks at complexity in large-scale yet single point data sources or methodologies. While single data/single method techniques, fueled by large scale data, enjoyed some success, it is not without fault. Often with one type of data and method, all the other aspects of human behavior are overlooked, discarded, or, worse, misrepresented. We identify this as two succinct problems. First, social computing problems that cannot be solved using a single data source and need intelligence from multiple modals and, second, social behavior that cannot be fully understood using only one form of methodology. Throughout this talk, we discuss these problems and their implications, illustrate examples, and propose new directives to properly approach in the social computing research in today's age.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132031631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User Group based Viewpoint Recommendation using User Attributes for Multiview Videos","authors":"Xueting Wang, Yu Enokibori, Takatsugu Hirayama, Kensho Hara, K. Mase","doi":"10.1145/3132515.3132523","DOIUrl":"https://doi.org/10.1145/3132515.3132523","url":null,"abstract":"Multiview videos can provide diverse information and high flexibility in enhancing the viewing experience. User-dependent automatic viewpoint recommendation is important for reducing user stress while selecting continually suitable and favorable viewpoints. Existing personal viewpoint recommendation methods have been developed by learning the user»s viewing records. These methods have difficulty in acquiring sufficient personal viewing records in practice. Moreover, they neglect the importance of the user»s attribute information, such as user personality, interest, and experience level in the viewing content. Thus, we propose a group-based recommendation framework consisting of a user grouping approach based on the similarity in existing user viewing records, and a member group estimation approach based on the classification by user attributes. We validate the effectiveness of the proposed group-based recommendation and analyze the relationship between user attributes and multiview viewing patterns.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115180206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Multi-Modal Cues for Dyadic Human Interaction Recognition","authors":"Rim Trabelsi, Jagannadan Varadarajan, Yong Pei, Le Zhang, I. Jabri, A. Bouallègue, P. Moulin","doi":"10.1145/3132515.3132517","DOIUrl":"https://doi.org/10.1145/3132515.3132517","url":null,"abstract":"Activity analysis methods usually tend to focus on elementary human actions but ignore to analyze complex scenarios. In this paper, we focus particularly on classifying interactions between two persons in a supervised fashion. We propose a robust multi-modal proxemic descriptor based on 3D joint locations, depth and color videos. The proposed descriptor incorporates inter-person and intra-person joint distances calculated from 3D skeleton data and multi-frame dense optical flow features obtained from the application of temporal Convolutional neural networks (CNN) on depth and color images. The descriptors from the three modalities are derived from sparse key-frames surrounding high activity content and fused using a linear SVM classifier. Through experiments on two publicly available RGB-D interaction datasets, we show that our method can efficiently classify complex interactions using only short video snippet, outperforming existing state-of-the-art results.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115270424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Captioning in the Wild: How People Caption Images on Flickr","authors":"Philipp Blandfort, Tushar Karayil, Damian Borth, A. Dengel","doi":"10.1145/3132515.3132522","DOIUrl":"https://doi.org/10.1145/3132515.3132522","url":null,"abstract":"Automatic image captioning is a well-known problem in the field of artificial intelligence. To solve this problem efficiently, it is also required to understand how people caption images naturally (when not instructed by a set of rules, which tell them to do so in a certain way). This dimension of the problem is rarely discussed. To understand this aspect, we performed a crowdsourcing study on specific subsets of the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) where annotators evaluate captions with respect to subjectivity, visibility, appeal and intent. We use the resulting data to systematically characterize the variations in image captions that appear \"in the wild\". We publish our findings here along with the annotated dataset.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121023197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Spotlights","authors":"Shih-Fu Chang","doi":"10.1145/3258576","DOIUrl":"https://doi.org/10.1145/3258576","url":null,"abstract":"","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123894537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Mining","authors":"Shih-Fu Chang","doi":"10.1145/3258575","DOIUrl":"https://doi.org/10.1145/3258575","url":null,"abstract":"","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122314051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Movie Genre Classification based on Poster Images with Deep Neural Networks","authors":"W. Chu, Hung-Jui Guo","doi":"10.1145/3132515.3132516","DOIUrl":"https://doi.org/10.1145/3132515.3132516","url":null,"abstract":"We propose to achieve movie genre classification based only on movie poster images. A deep neural network is constructed to jointly describe visual appearance and object information, and classify a given movie poster image into genres. Because a movie may belong to multiple genres, this is a multi-label image classification problem. To facilitate related studies, we collect a large-scale movie poster dataset, associated with various metadata. Based on this dataset, we fine-tune a pretrained convolutional neural network to extract visual representation, and adopt a state-of-the-art framework to detect objects in posters. Two types of information is then integrated by the proposed neural network. In the evaluation, we show that the proposed method yields encouraging performance, which is much better than previous works.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114479575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Deep Multi-Modal Fusion Approach for Semantic Place Prediction in Social Media","authors":"Kaidi Meng, Haojie Li, Zhihui Wang, Xin Fan, Fuming Sun, Zhongxuan Luo","doi":"10.1145/3132515.3132519","DOIUrl":"https://doi.org/10.1145/3132515.3132519","url":null,"abstract":"Semantic places such as \"home,\" \"work,\" and \"school\" are much easier to understand compared to GPS coordinates or street addresses and contribute to the automatic inference of related activities, which could further help in the study of personal lifestyle patterns and the provision of more customized services for human beings. In this work, we present a feature-level fusion method for semantic place prediction that utilizes user-generated text-image pairs from online social media as input. To take full advantage of each specific modality, we concatenate features from two state-of-the-art Convolutional Neural Networks (CNNs) and train them together. To the best of our knowledge, the present study is the first attempt to conduct semantic place prediction based only on microblogging multimedia content. The experimental results demonstrate that our deep multi-modal architecture outperforms single-modal methods and the traditional fusion method.","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117044853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Modeling","authors":"M. Soleymani","doi":"10.1145/3258577","DOIUrl":"https://doi.org/10.1145/3258577","url":null,"abstract":"","PeriodicalId":395519,"journal":{"name":"Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes","volume":"647 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115826069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}