{"title":"A rule-based approach to generating large phonetic databases for Romanian — results of the AFLR project","authors":"S. Diaconescu, Monica-Mihaela Rizea, M. Ionescu, A. Minca, Monica Radulescu","doi":"10.1109/SPED.2017.7990439","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990439","url":null,"abstract":"This paper presents a rule-based approach for generating a large phonetic database for Romanian. The knowledge base is developed by means of the GRAALAN (Grammar Abstract Language) system. By inspecting dictionaries and corpora, we generate a phonetic database of over 100,000 lemmas. Our database has a high degree of accuracy, ensured by the rule-based method applied for generating phonetic transcriptions.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122186675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
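The rule-based transcription approach summarized in this abstract can be illustrated with a minimal sketch of ordered, context-sensitive letter-to-phoneme rules. The two Romanian rules below and the ASCII phone labels ("tS", "dZ" for the affricates) are heavily simplified illustrations, not the paper's actual GRAALAN rule base.

```python
import re

# Hypothetical, simplified rules: each is (pattern, phone); earlier rules
# take priority, so the context-sensitive cases come before the defaults.
RULES = [
    (re.compile(r"c(?=[ei])"), "tS"),   # 'c' before e/i -> affricate
    (re.compile(r"g(?=[ei])"), "dZ"),   # 'g' before e/i -> affricate
    (re.compile(r"c"), "k"),            # default 'c'
    (re.compile(r"g"), "g"),            # default 'g'
]

def transcribe(word):
    """Apply the first matching rule at each position, else copy the letter."""
    phones = []
    i = 0
    while i < len(word):
        for pattern, phone in RULES:
            m = pattern.match(word, i)
            if m:
                phones.append(phone)
                i = m.end()
                break
        else:
            phones.append(word[i])
            i += 1
    return phones

print(transcribe("cine"))  # ['tS', 'i', 'n', 'e']
print(transcribe("casa"))  # ['k', 'a', 's', 'a']
```

A real system of this kind orders many such rules by specificity and derives them from dictionary evidence, which is what makes the generated transcriptions accurate at scale.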
{"title":"Natural language processing model compiling natural language into byte code","authors":"A. Trifan, Marilena Anghelus, R. Constantinescu","doi":"10.1109/SPED.2017.7990434","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990434","url":null,"abstract":"The need for progress implies the need for time. Daily tasks have been automated to save time, but they still require input from a user. The need to interact with different applications may endanger the user's life. The simplest way for these automations to be “life-saving” is to fully support speech recognition. Although this is currently done in an acceptable manner, the main problem resides in the language processing model itself. Without a good language processing model, there is no “learning” and no “progress”. This document is a technical proposal of a different approach to processing human language and compiling it into a computer-understandable form — byte code. The paper treats the requirements for this to happen in the Java programming language, but the principles should be the same for any programming language.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116816770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word associations in media posts related to disasters — A statistical analysis","authors":"M. Pirnau","doi":"10.1109/SPED.2017.7990427","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990427","url":null,"abstract":"The paper analyzes the frequency of posts in the case of earthquakes and of the word associations included in such Social Media (SM) posts. Since important posts are shared by users in SM, the purpose was to identify the variation, over a period of time, in the number of posts with unique content on a particular topic. The present study uses messages generated by the Twitter platform, posted before and after the occurrence of earthquakes in areas with significant seismic activity, such as Vrancea (24th September 2016), Ussita (30th October 2016), New Zealand (13th November 2016) and Papua (23rd January 2017). For the analysis of the contents of the tweets, the Apriori algorithm was used to extract word associations from these posts, i.e., keywords that draw attention to the analyzed earthquake situation.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121017619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
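The Apriori-style extraction of word associations described in this abstract can be sketched as follows; the toy tweets, the support threshold, and the restriction to word pairs are illustrative assumptions, not the study's actual data or parameters.

```python
from itertools import combinations

def apriori_pairs(transactions, min_support):
    """Find word pairs that co-occur in at least min_support transactions."""
    # Count single-word support first (the Apriori pruning step).
    word_counts = {}
    for t in transactions:
        for w in set(t):
            word_counts[w] = word_counts.get(w, 0) + 1
    frequent_words = {w for w, c in word_counts.items() if c >= min_support}

    # Only pairs of individually frequent words can themselves be frequent.
    pair_counts = {}
    for t in transactions:
        words = sorted(set(t) & frequent_words)
        for pair in combinations(words, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

# Hypothetical tokenized tweets about a seismic event.
tweets = [
    ["earthquake", "vrancea", "magnitude"],
    ["earthquake", "vrancea", "felt"],
    ["earthquake", "magnitude", "vrancea"],
    ["weather", "rain"],
]
print(apriori_pairs(tweets, min_support=3))
# {('earthquake', 'vrancea'): 3}
```

The surviving pairs are exactly the "word associations" of the abstract: co-occurring keywords that flag posts about the analyzed earthquake.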
{"title":"Automatic speaker analysis 2.0: Hearing the bigger picture","authors":"Björn Schuller","doi":"10.1109/SPED.2017.7990449","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990449","url":null,"abstract":"Automatic Speaker Analysis has largely focused on single aspects of a speaker such as her ID, gender, emotion, personality, or health state. This broadly ignores the interdependency of all the different states and traits impacting on the one single voice production mechanism available to a human speaker. In other words, sometimes we may sound depressed, but we simply have the flu and hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this gap gave rise to an increasingly holistic speaker analysis — assessing the ‘larger picture’ in one pass, such as by multi-target learning. However, for a robust assessment, this requires large amounts of speech and language resources labelled in rich ways to train such interdependency, and architectures able to cope with multi-target learning of massive amounts of speech data. In this light, this contribution discusses efficient mechanisms such as large-scale social media pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task labelling of these data in a wider range of attributes to reach ‘big & rich’ speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks in efficient ways. The ultimate goal is to enable machines to hear the ‘entire’ person and her condition and whereabouts behind the voice and words, rather than aiming at a single aspect blind to the overall individual and their state, thus leading to the next level of Automatic Speaker Analysis.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122696069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MaRePhoR — An open access machine-readable phonetic dictionary for Romanian","authors":"Stefan-Adrian Toma, Adriana Stan, Mihai-Lica Pura, Traian Barsan","doi":"10.1109/SPED.2017.7990435","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990435","url":null,"abstract":"This paper introduces a novel open access resource, the machine-readable phonetic dictionary for Romanian — MaRePhoR. It contains over 70,000 word entries and their manually performed phonetic transcriptions. The paper describes the dictionary format and statistics, as well as an initial use of the phonetic transcription entries by building a grapheme-to-phoneme converter based on decision trees. Various training strategies were tested, enabling the correct selection of a final setup for our predictor. The best results showed that, using the dictionary as training data, an accuracy of over 99% can be achieved.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"420 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116687462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
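A decision-tree grapheme-to-phoneme converter of the kind described in this abstract can be sketched as below; the toy training pairs, the one-letter context window, and the ASCII phone labels ("tS" for the affricate) are illustrative assumptions, not the MaRePhoR setup.

```python
from sklearn.tree import DecisionTreeClassifier

def letter_contexts(word):
    """One sample per letter: (left neighbour, letter, right neighbour)."""
    padded = "#" + word + "#"
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Toy orthography -> per-letter phone labels ('tS' = the affricate in 'cer').
train = [("casa", ["k", "a", "s", "a"]),
         ("cer",  ["tS", "e", "r"]),
         ("cine", ["tS", "i", "n", "e"]),
         ("cub",  ["k", "u", "b"])]

# Ordinal encoding of context characters, so the tree can split on them.
idx = {c: i for i, c in enumerate(sorted({ch for w, _ in train
                                          for ch in "#" + w + "#"}))}
X = [[idx[a], idx[b], idx[c]]
     for w, _ in train for a, b, c in letter_contexts(w)]
y = [p for _, phones in train for p in phones]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

def predict_phones(word):
    feats = [[idx[a], idx[b], idx[c]] for a, b, c in letter_contexts(word)]
    return tree.predict(feats).tolist()
```

With enough dictionary entries as training data, the tree learns context-dependent letter pronunciations, which is how accuracies above 99% become reachable on a language with largely regular orthography such as Romanian.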
{"title":"SpeeD's DNN approach to Romanian speech recognition","authors":"Alexandru-Lucian Georgescu, H. Cucu, C. Burileanu","doi":"10.1109/SPED.2017.7990443","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990443","url":null,"abstract":"This paper presents the main improvements brought recently to the large-vocabulary, continuous speech recognition (LVCSR) system for Romanian language developed by the Speech and Dialogue (SpeeD) research laboratory. While the most important improvement consists in the use of DNN-based acoustic models, instead of the classic HMM-GMM approach, several other aspects are discussed in the paper: a significant increase of the speech training corpus, the use of additional algorithms for feature processing, speaker adaptive training, and discriminative training and, finally, the use of lattice rescoring with significantly expanded language models (n-gram models up to order 5, based on vocabularies of up to 200k words). The ASR experiments were performed with several types of acoustic and language models in different configurations on the standard read and conversational speech corpora created by SpeeD in 2014. The results show that the extension of the training speech corpus leads to a relative word error rate (WER) improvement between 15% and 17%, while the use of DNN-based acoustic models instead of HMM-GMM-based acoustic models leads to a relative WER improvement between 18% and 23%, depending on the nature of the evaluation speech corpus (read or conversational, clean or noisy). The best configuration of the LVCSR system was integrated as a live transcription web application available online on SpeeD laboratory's website at https://speed.pub.ro/live-transcriber-2017.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133081669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
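The relative WER improvements quoted in this abstract follow the standard convention of measuring error reduction against the baseline's error, not in absolute points. A small worked example with hypothetical WER figures:

```python
def relative_wer_improvement(wer_baseline, wer_new):
    """Relative WER improvement: the fraction of baseline errors removed."""
    return (wer_baseline - wer_new) / wer_baseline

# Hypothetical figures: an HMM-GMM baseline at 20% WER improved to 16% WER
# by a DNN acoustic model is a 4-point absolute but 20% *relative* gain,
# which is how ranges such as 18-23% above are computed.
gain = relative_wer_improvement(0.20, 0.16)
print(f"{gain:.0%}")  # 20%
```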
{"title":"Audio signal classification using Linear Predictive Coding and Random Forests","authors":"L. Grama, C. Rusu","doi":"10.1109/SPED.2017.7990431","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990431","url":null,"abstract":"The goal of this work is to present an audio signal classification system based on Linear Predictive Coding and Random Forests. We consider the problem of multiclass classification with imbalanced datasets. The signals under classification belong to the class of sounds from wildlife intruder detection applications: birds, gunshots, chainsaws, human voice and tractors. The proposed system achieves an overall correct classification rate of 99.25%. There is no probability of false alarms in the case of birds or human voices. For the other three classes the probability is low, around 0.3%. The false omission rate is also low: around 0.2% for birds and tractors, a little bit higher for chainsaws (0.4%), lower for gunshots (0.14%) and zero for human voices.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116424996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
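The LPC front-end of such a classification system can be sketched with the autocorrelation method and the Levinson-Durbin recursion; in the full pipeline the resulting coefficients (computed per frame in practice, per whole signal here) would feed a Random Forest classifier. The demo signal and model order are illustrative choices, not the paper's settings.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients [1, a1, ..., ap] via autocorrelation + Levinson-Durbin."""
    n = len(x)
    # Autocorrelation for lags 0..order.
    r = [float(np.dot(x[:n - k], x[k:])) for k in range(order + 1)]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)           # prediction error shrinks each step
    return a

# Demo: noise-free AR(1) signal x[t] = 0.9 * x[t-1]; order-1 LPC
# recovers the generating model, a ≈ [1, -0.9].
x = 0.9 ** np.arange(200)
coeffs = lpc(x, order=1)
```

Stacking such coefficient vectors per audio frame gives the fixed-length feature matrix a Random Forest (e.g. scikit-learn's RandomForestClassifier) can be trained on for the bird/gunshot/chainsaw/voice/tractor classes.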
{"title":"Investigation on the performances of APA in forensic noise reduction","authors":"R. Dobre, C. Paleologu, S. Ciochină, C. Negrescu, D. Stanomir","doi":"10.1109/SPED.2017.7990442","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990442","url":null,"abstract":"Multimedia files, either video or audio, could greatly influence the final verdict of a trial when accepted as evidence. The abundance of free editing software available nowadays makes forgery a very easy operation. Audio messages, even if authentic, can in some cases be heavily masked by other signals and declared unusable. This paper presents an investigation of the performance of the affine projection algorithm (APA) in recovering a speech signal drowned in a loud musical signal, and it represents a contribution to the multimedia forensics domain. The APA was tested in multiple situations, showing its top performance limits and how the working parameters influence those limits.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122136949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
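The affine projection algorithm investigated here can be sketched as an adaptive system-identification loop; the filter length, projection order, step size, and test signals below are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

def apa_identify(x, d, filt_len, proj_order=2, mu=0.5, delta=1e-6):
    """Affine projection algorithm: adapt a length-filt_len FIR filter w
    so that w applied to input x tracks the desired signal d."""
    w = np.zeros(filt_len)
    for n in range(filt_len + proj_order - 1, len(x)):
        # Columns are the proj_order most recent input vectors u(n-p).
        X = np.column_stack([x[n - p - filt_len + 1:n - p + 1][::-1]
                             for p in range(proj_order)])
        e = d[n - proj_order + 1:n + 1][::-1] - X.T @ w   # a-priori errors
        # Regularized projection update (delta avoids ill-conditioning).
        w = w + mu * X @ np.linalg.solve(
            X.T @ X + delta * np.eye(proj_order), e)
    return w

# Demo: recover a known 3-tap "unknown system" from white-noise input.
rng = np.random.default_rng(0)
x = rng.standard_normal(3000)
h_true = np.array([0.5, -0.3, 0.2])
d = np.convolve(x, h_true)[:len(x)]   # noise-free desired signal
w_hat = apa_identify(x, d, filt_len=3)
```

In a forensic noise-reduction configuration the adaptive filter would presumably be driven by a reference of the masking music, with the enhanced speech taken from the error signal; the toy above only checks that the APA update converges.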
{"title":"Voice-related symptom and knowledge-bases using internet mining","authors":"H. Teodorescu, D. Gogalniceanu","doi":"10.1109/SPED.2017.7990426","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990426","url":null,"abstract":"We report the first development of a set of symptoms for a medical condition where the set of symptoms is based exclusively on information collected on the Internet. Second, we lay down a general method for doing so. Third, we introduce the first systematic set of symptoms for temporo-mandibular disorder (TMD) exclusively related to speech, and suggest a set of known quantitative parameters for the analysis of these symptoms.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129559370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building a representative audio base of syllables for Romanian language","authors":"S. Diaconescu, Monica-Mihaela Rizea, M. Ionescu, A. Minca, Liviu Dorobantu, Stefan Fulea, Monica Radulescu, H. Cucu, D. Burileanu","doi":"10.1109/SPED.2017.7990444","DOIUrl":"https://doi.org/10.1109/SPED.2017.7990444","url":null,"abstract":"The aim of this work is to provide some insights regarding the effort of building a representative and wide coverage audio base of syllables for Romanian. The audio base comprises audio recordings of syllables extracted from the following types of syllable embedding: isolated-syllable, isolated-word and continuous speech. The list of syllables has been computed over the syllabified form of single-word inflected forms. The inflected forms were generated using a general rule-based system for normal and phonetic inflection having at its core the GRAALAN (GRAmmar Abstract LANguage) metalanguage (designed for linguistic knowledge description). In addition, the word-position of a syllable was accounted for when planning the audio recordings.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117123321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}