{"title":"Generating dynamic lip-syncing using target audio in a multimedia environment","authors":"Diksha Pawar, Prashant Borde, Pravin Yannawar","doi":"10.1016/j.nlp.2024.100084","DOIUrl":null,"url":null,"abstract":"<div><p>The presented research focuses on the challenging task of creating lip-sync facial videos that align with a specified target speech segment. A novel deep-learning model has been developed to produce precise synthetic lip movements corresponding to the speech extracted from an audio source. Consequently, there are instances where portions of the visual data may fall out of sync with the updated audio and this challenge is handled through, a novel strategy, leveraging insights from a robust lip-sync discriminator. Additionally, this study introduces fresh criteria and evaluation benchmarks for assessing lip synchronization in unconstrained videos. LipChanger demonstrates improved PSNR values, indicative of enhanced image quality. Furthermore, it exhibits highly accurate lip synthesis, as evidenced by lower LMD values and higher SSIM values. These outcomes suggest that the LipChanger approach holds significant potential for enhancing lip synchronization in talking face videos, resulting in more realistic lip movements. The proposed LipChanger model and its associated evaluation benchmarks show promise and could potentially contribute to advancements in lip-sync technology for unconstrained talking face videos.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"8 ","pages":"Article 100084"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000323/pdfft?md5=84516d2e22e4420f113635a3914da66f&pid=1-s2.0-S2949719124000323-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000323","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The presented research focuses on the challenging task of creating lip-sync facial videos that align with a specified target speech segment. A novel deep-learning model has been developed to produce precise synthetic lip movements corresponding to the speech extracted from an audio source. Consequently, there are instances where portions of the visual data may fall out of sync with the updated audio and this challenge is handled through, a novel strategy, leveraging insights from a robust lip-sync discriminator. Additionally, this study introduces fresh criteria and evaluation benchmarks for assessing lip synchronization in unconstrained videos. LipChanger demonstrates improved PSNR values, indicative of enhanced image quality. Furthermore, it exhibits highly accurate lip synthesis, as evidenced by lower LMD values and higher SSIM values. These outcomes suggest that the LipChanger approach holds significant potential for enhancing lip synchronization in talking face videos, resulting in more realistic lip movements. The proposed LipChanger model and its associated evaluation benchmarks show promise and could potentially contribute to advancements in lip-sync technology for unconstrained talking face videos.