Exploring Video Hyperlinking in Broadcast Media
Maria Eskevich, Quoc-Minh Bui, Hoang-An Le, B. Huet
DOI: 10.1145/2802558.2814647

Abstract: Multimedia content is produced by professionals and individual users daily and in constantly growing quantity, which calls for navigation systems that give access to this data at different levels of granularity, supporting both further discovery of a topic of interest and browsing in each user's individual way. In this paper we describe our approach to enabling users to browse through a multimedia collection. We implement a hyperlinking approach that uses fine-grained segmentation of the visual content based on scene segmentation, together with available metadata, transcripts, and information about extracted visual concepts. The approach was tested in the MediaEval Search and Hyperlinking 2014 evaluation task, where it demonstrated its effectiveness at accurately locating relevant content in a large media archive.
{"title":"SAIVT-BNEWS: An Australian Broadcast News Video Dataset for Entity Extraction, and More","authors":"David Dean","doi":"10.1145/2802558.2814653","DOIUrl":"https://doi.org/10.1145/2802558.2814653","url":null,"abstract":"Recently QUT have released a set of annotated broadcast news videos (SAIVT-BNEWS) that we have made available at our website (https://www.qut.edu.au/research/saivt). This presentation will outline the dataset itself, covering 50 or so short news clips surrounding a single political event with many entities appearing in multuple records, and cover interesting research that QUT has, is currently, and is interested in performing on this dataset in the future. This presentation will cover existing published research, including image processing tasks like face detection and clustering; and speech processing tasks (including the use of visual speech) like speech detection, speaker recognition, and speaker diarisation. We have also started very interesting research on fusing multiple sources of information, including metadata, OCR, faces, speech, and scene detection to improve the performance of many techniques, but with a focus on improving the automatic extraction of entities (people, places, companies and organisations) from large volumes of audio-visual data, and this will also be addressed in this talk. As this dataset is publicly available for free to all researchers, QUT hopes that other researchers will make use of, and improve upon this dataset as well.","PeriodicalId":115369,"journal":{"name":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127864010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation Data and Benchmarks for Cascaded Speech Recognition and Entity Extraction","authors":"Liyuan Zhou, H. Suominen, L. Hanlen","doi":"10.1145/2802558.2814646","DOIUrl":"https://doi.org/10.1145/2802558.2814646","url":null,"abstract":"During clinical handover, clinicians exchange information about the patients and the state of clinical management. To improve care safety and quality, both handover and its documentation have been standardized. Speech recognition and entity extraction provide a way to help health service providers to follow these standards by implementing the handover process as a structured form, whose headings guide the handover narrative, and the documentation process as proofing and sign-off of the automatically filled-out form. In this paper, we evaluate such systems. The form considers the sections of Handover nurse, Patient introduction, My shift, Medication, Appointments, and Future care, divided in 49 mutually exclusive headings to fill out with speech recognized and extracted entities. Our system correctly recognizes 10,244 out of 14,095 spoken words and regardless of 6,692 erroneous words, its error percentage is significantly smaller than for systems submitted to the CLEF eHealth Evaluation Lab 2015. In the extraction of 35 entities with training data (i.e., 14 headings were not present in the 101 expert-annotated training documents with 8,487 words in total), the system correctly extracts 2,375 out of 3,793 words in 50 test documents after calibration on 3,937 words in 50 validation documents. This translates to over 90% F1 in extracting information for the patient's age, current bed, current room, and given name and over 70% F1 for patient's admission reason/diagnosis and last name. F1 for filtering out irrelevant information is 78%. We have made the data publicly available for 201 handover cases together with processing results and code and proposed the extraction task for CLEF eHealth 2016.","PeriodicalId":115369,"journal":{"name":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","volume":"127 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123574156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","authors":"G. Gravier, M. Larson, G. Jones, R. Ordelman","doi":"10.1145/2802558","DOIUrl":"https://doi.org/10.1145/2802558","url":null,"abstract":"Welcome to SLAM 2015 in Brisbane, Australia! \u0000 \u0000SLAM 2015 is the third edition of the series of SLAM workshops, with worldwide leading protagonists in the field of speech, language and audio processing applied to multimedia material or in a multimedia context. From the very beginning, the workshop is steered and patronized by the Special Interest Group of the International Speech Communication Association on Speech and Language in Multimedia. This year's edition follows this tradition. \u0000 \u0000SLAM is by nature interdisciplinary, existing at the intersection of multiple scientific communities: music and audio processing, speech processing, natural language processing and, of course, multimedia. After collocating the first two editions of SLAM with Interspeech, the premier international conference in the field of speech communication, we're very proud to hold SLAM 2015 with ACM Multimedia. This is in logical continuation from the preceding editions and reflects the fact that the focus of SLAM goes far beyond speech processing to genuinely account for the multiple facets of multimedia. Our long-term goal is to establish SLAM as a regular workshop, alternating between major speech and language conferences and major multimedia conferences, as a bridge between these domains. This year's edition is a first step in this direction and we are very grateful to ACM Multimedia General and Workshop chairs for their support in the development of SLAM in spite of possible interferences with the main conference. \u0000 \u0000The program in 2015 covers a wide range of problems related to SLAM topics, with contributions related to music, speech, language but also computer vision. To emphasize the links between audio, speech, language and multimedia, the workshop features a special session on video hyperlinking, as recently introduced in international benchmark initiatives such as MediaEval or TRECVid. The multimodal nature of the video hyperlinking task makes it an emblematic case study where the speech and language modalities are perfectly complemented by audio and vision. The session gathers contributions where audio and natural language processing are used for video hyperlinking, possibly in conjunction with image processing and computer vision. A panel discussion focused on discussing the past, present and future of hyperlinking will conclude the workshop. This panel will aim at an understanding of which approaches are most promising and how they can be evaluated. 
The goal is to shape research directions at the crossroad of the scientific communities involved in SLAM and to nurture future implementations of video hyperlinking benchmarks.","PeriodicalId":115369,"journal":{"name":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133174847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic Adaptation in Cross Database Audio Visual SHMM Training for Phonetic Spoken Term Detection
Shahram Kalantari, David Dean, S. Sridharan, H. Ghaemmaghami, C. Fookes
DOI: 10.1145/2802558.2814648

Abstract: Visual information in the form of the speaker's lip movements has been shown to improve the performance of speech recognition and search applications. In our previous work, we proposed cross-database training of synchronous hidden Markov models (SHMMs) to make use of large, publicly available external audio databases in addition to the relatively small given audio-visual database. In this work, the cross-database training approach is improved by an additional audio adaptation step, which enables the audio-visual SHMMs to benefit from the audio observations of the external audio models before the visual modality is added to them. The proposed approach outperforms the baseline cross-database training approach in clean and noisy environments in terms of phone recognition accuracy as well as spoken term detection (STD) accuracy.
{"title":"Audio Information for Hyperlinking of TV Content","authors":"P. Galuscáková, Pavel Pecina","doi":"10.1145/2802558.2814643","DOIUrl":"https://doi.org/10.1145/2802558.2814643","url":null,"abstract":"In this paper, we explore the use of audio information in the retrieval of multimedia content. Specifically, we focus on linking similar segments in a collection consisting of 4,000 hours of BBC TV programmes. We provide a description of our system submitted to the Hyperlinking Sub-task of the Search and Hyperlinking Task in the MediaEval 2014 Benchmark, in which it scored best. We explore three automatic transcripts and compare them to available subtitles. We confirm the relationship between retrieval performance and transcript quality. The performance of the retrieval is further improved by extending transcripts by metadata and context, by combining different transcripts, using the highest confident words of the transcripts, and by utilizing acoustic similarity.","PeriodicalId":115369,"journal":{"name":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131261269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convenient Discovery of Archived Video Using Audiovisual Hyperlinking
R. Ordelman, Robin Aly, Maria Eskevich, B. Huet, G. Jones
DOI: 10.1145/2802558.2814652

Abstract: This paper overviews ongoing work that aims to support end-users in conveniently exploring and exploiting large audiovisual archives by deploying multiple multimodal linking approaches. We present ongoing work on multimodal video hyperlinking, from the perspective of unconstrained link anchor identification based on the identification of named entities, and recent attempts to implement and validate the concept of outside-in linking, which relates current events to archive content. Although these concepts are not new, current work is yielding novel insights, more mature technology, benchmark evaluations, and dedicated workshops, which together open many interesting research questions at various levels that require closer collaboration between research communities.
{"title":"Predicting Music Popularity Patterns based on Musical Complexity and Early Stage Popularity","authors":"Junghyuk Lee, Jong-Seok Lee","doi":"10.1145/2802558.2814645","DOIUrl":"https://doi.org/10.1145/2802558.2814645","url":null,"abstract":"This paper investigates the problem of predicting popularity of music. In particular, we consider musical complexity as a cue that can be extracted from the audio signal and used for popularity prediction. In addition, we examine the effectiveness of the early stage popularity for long-term popularity prediction. We formulate the popularity prediction problem as a classification problem predicting popularity evolution patterns in a music ranking chart, such as the highest rank of a song over the whole time period, the growth/declination rate in the chart, the duration for which the song appears in the chart, etc. We conduct an experiment with the data collected from the Billboard Rock Songs Chart for about five years. It is found that the two types of features are effective for predicting popularity patterns when used together.","PeriodicalId":115369,"journal":{"name":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134064836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Topic Models for Language-based Video Hyperlinking
A. Simon, R. Bois, G. Gravier, P. Sébillot, E. Morin, Marie-Francine Moens
DOI: 10.1145/2802558.2814642

Abstract: We investigate video hyperlinking based on speech transcripts, leveraging a hierarchical topical structure to address two essential aspects of hyperlinking, namely serendipity control and link justification. We propose and compare different approaches exploiting a hierarchy of topic models as an intermediate representation for comparing the transcripts of video segments. These hierarchical representations offer a basis for characterizing the hyperlinks, thanks to the knowledge of the topics that contributed to the creation of the links, and for controlling serendipity by giving more weight to either general or specific topics. Experiments are performed on BBC videos from the Search and Hyperlinking task at MediaEval. Link precision similar to that of direct text comparison is achieved, while exhibiting different targets and offering potential control of serendipity.
{"title":"Score Propagation Based on Similarity Shot Graph for Improving Visual Object Retrieval","authors":"J. M. Barrios, J. M. Saavedra","doi":"10.1145/2802558.2814644","DOIUrl":"https://doi.org/10.1145/2802558.2814644","url":null,"abstract":"The Visual Object Retrieval problem consists in locating the occurrences of a specific entity in an image/video dataset. In this work, we focus on discovering new occurrences of an entity by propagating the detection scores of already computed candidates to other video segments. The score propagation follows the edges of a pre-computed Similarity Shot Graph (SSG). The SSG connects video segments that are similar according to some criterion. Four methods for creating the SSG are presented: two based on computing and comparing low-level visual features, one based on comparing text transcriptions, and other based on computing and comparing high-level concepts. The score propagation is evaluated on the INS 2014 dataset. The results show that the detection performance can be slightly improved by the proposed algorithm. However, the performance is variable and depends on the properties of the SSG and the target entity. It is part of the future work to automatically decide the kind of SSG that will be used to propagate scores given a set of detection candidates.","PeriodicalId":115369,"journal":{"name":"Proceedings of the Third Edition Workshop on Speech, Language & Audio in Multimedia","volume":"162 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121701409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}