{"title":"Making a Cultural Visit with a Smart Mate","authors":"A. Bimbo","doi":"10.1145/3078971.3079005","DOIUrl":null,"url":null,"abstract":"Digital and mobile technologies have become increasingly popular to support and improve the quality of experience during cultural visits. The portability of the device, the daily adaptation of most people to its usage, the easy access to information and the opportunity of interactive augmented reality have been key factors of this popularity. We believe that computer vision may help to improve such quality of experience, by making the mobile device smarter and capable of inferring the visitor interests directly from his/her behavior, so triggering the delivery of the appropriate information at the right time without any specific user actions. At MICC University of Florence, we have developed two prototypes of smart audio guides, respectively for indoor and outdoor cultural visits, that exploit the availability of multi-core CPUs and GPUs on mobile devices and computer vision to feed information according to the interests of the visitor, in a non intrusive and natural way. In the first one [Seidenari et al. 2017], the YOLO network [Redmon et al. 2016] is used to distinguish between artworks and people in the camera view. If an artwork is detected, it predicts a specific artwork label. The artwork's description is hence given in audio in the visitor's language. In the second one, the GPS coordinates are used to search Google Places and obtain the interest points closeby. To determine what landmark the visitor is actually looking at, the actual view of the camera is matched against the Google Street Map database using SIFT features. Matched views are classified as either artwork or background and for artworks, descriptions are obtained from Wikipedia. Both prototypes were conceived as a smart mate for visits in museums and outdoor sites or cities of art, respectively. In both prototypes, voice activity detection provides hints about what is happening in the surrounding context of the visitor and triggers the audio description only when the visitor is not talking with the accompanying persons. They were developed on NVIDIA Jetson TK1 and deployed on a NVIDIA Shield K1 Tablet, run in real time and were tested in real contexts in a musum and the city of Florence.","PeriodicalId":93291,"journal":{"name":"ICMR'17 : proceedings of the 2017 ACM International Conference on Multimedia Retrieval : June 6-9, 2017, Bucharest, Romania. ACM International Conference on Multimedia Retrieval (2017 : Bucharest, Romania)","volume":"66 1","pages":"2"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICMR'17 : proceedings of the 2017 ACM International Conference on Multimedia Retrieval : June 6-9, 2017, Bucharest, Romania. ACM International Conference on Multimedia Retrieval (2017 : Bucharest, Romania)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3078971.3079005","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Digital and mobile technologies have become increasingly popular for supporting and improving the quality of experience during cultural visits. The portability of the device, most people's daily familiarity with its use, the easy access to information, and the opportunity for interactive augmented reality have been key factors in this popularity. We believe that computer vision can help improve this quality of experience by making the mobile device smarter and capable of inferring a visitor's interests directly from their behavior, thus triggering the delivery of the appropriate information at the right time without any specific user action. At MICC, University of Florence, we have developed two prototypes of smart audio guides, for indoor and outdoor cultural visits respectively, that exploit the availability of multi-core CPUs and GPUs on mobile devices, together with computer vision, to feed information according to the visitor's interests in a non-intrusive and natural way. In the first [Seidenari et al. 2017], the YOLO network [Redmon et al. 2016] is used to distinguish between artworks and people in the camera view; if an artwork is detected, the system predicts a specific artwork label, and the artwork's description is then delivered as audio in the visitor's language. In the second, the GPS coordinates are used to query Google Places and obtain nearby points of interest. To determine which landmark the visitor is actually looking at, the current camera view is matched against the Google Street Map database using SIFT features; matched views are classified as either artwork or background, and for artworks, descriptions are obtained from Wikipedia. The two prototypes were conceived as a smart mate for visits in museums and in outdoor sites or cities of art, respectively. In both prototypes, voice activity detection provides hints about what is happening in the visitor's surroundings and triggers the audio description only when the visitor is not talking with accompanying persons. The prototypes were developed on the NVIDIA Jetson TK1 and deployed on an NVIDIA Shield K1 tablet; they run in real time and were tested in real contexts, in a museum and in the city of Florence.
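To make the indoor pipeline concrete, the following Python sketch mirrors the decision flow described in the abstract. It is only an illustration: detect_objects, classify_artwork, is_visitor_talking, and speak_description are hypothetical placeholders for the detector, the artwork classifier, the voice-activity gate, and the text-to-speech output, not the authors' actual API.

```python
# Illustrative sketch of the indoor smart-guide loop (not the authors' code).
# detect_objects, classify_artwork, is_visitor_talking and speak_description
# are hypothetical placeholders for the components named in the abstract.

def process_frame(frame, language="en"):
    """Inspect one camera frame and trigger an audio description if appropriate."""
    detections = detect_objects(frame)            # YOLO-style detector: artworks vs. people
    for det in detections:
        if det.label != "artwork":
            continue                              # ignore people and background
        artwork_id = classify_artwork(det.crop)   # predict the specific artwork label
        if artwork_id is None:
            continue
        if is_visitor_talking():                  # voice-activity gate: stay silent
            return                                # while the visitor is conversing
        speak_description(artwork_id, language)   # audio description in the visitor's language
        return
```

In the actual prototypes this loop runs in real time by executing the detector on the device's GPU.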
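For the outdoor prototype, SIFT-based matching of the camera view against reference images of nearby landmarks can be sketched with OpenCV as below. This is a minimal sketch, assuming OpenCV 4.4+ (where SIFT is available in the main module) and a pre-fetched list of reference images for the points of interest returned by the Places query; the threshold values are illustrative, not those used in the prototype.

```python
import cv2

def match_view(query_img, reference_imgs, min_good_matches=30):
    """Match the current camera view against reference landmark images with SIFT.

    Returns the index of the best-matching reference image, or None when the
    view should be treated as background.
    """
    sift = cv2.SIFT_create()
    kp_q, des_q = sift.detectAndCompute(query_img, None)
    if des_q is None:
        return None

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    best_idx, best_count = None, 0
    for idx, ref in enumerate(reference_imgs):
        kp_r, des_r = sift.detectAndCompute(ref, None)
        if des_r is None:
            continue
        # Lowe's ratio test keeps only distinctive correspondences.
        pairs = matcher.knnMatch(des_q, des_r, k=2)
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
        if len(good) > best_count:
            best_idx, best_count = idx, len(good)

    return best_idx if best_count >= min_good_matches else None
```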
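The voice-activity gating used in both prototypes can be approximated as follows. The abstract does not say which VAD implementation was used; the py-webrtcvad package is assumed here purely for illustration.

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

def visitor_is_talking(pcm_frames, sample_rate=16000):
    """Return True if speech dominates the recent audio.

    pcm_frames: list of 30 ms frames of 16-bit mono PCM bytes, as required
    by webrtcvad.
    """
    if not pcm_frames:
        return False
    speech = sum(1 for frame in pcm_frames if vad.is_speech(frame, sample_rate))
    # Suppress the audio description only when most recent frames contain speech.
    return speech > len(pcm_frames) // 2
```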