I am an Earphone and I can Hear my Users Face: Facial Landmark Tracking using Smart Earphones

IF 3.7 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Internet of Things. Pub Date: 2023-08-09. DOI: 10.1145/3614438

Shijia Zhang, Taiting Lu, Hao Zhou, Yilin Liu, Runze Liu, Mahanth K. Gowda


Citations: 0

Abstract

This paper presents EARFace, a system that demonstrates the feasibility of tracking facial landmarks for 3D facial reconstruction using in-ear acoustic sensors embedded in smart earphones. This enables applications in facial expression tracking, user interfaces, AR/VR, affective computing, and accessibility. Whereas conventional vision-based solutions break down under poor lighting and occlusion and raise privacy concerns, earphone platforms are robust to ambient conditions and privacy-preserving. In contrast to prior work on earable platforms that performs outer-ear sensing for facial motion tracking, EARFace shows that sensing can be done entirely in-ear with a natural earphone form factor, which improves wearing comfort. The core intuition is that facial muscle movement during facial motion changes the shape of the ear canal. EARFace tracks these shape changes by measuring the ultrasonic channel frequency response (CFR) of the inner ear, and from them tracks the facial motion. A transformer-based machine learning (ML) model exploits spectral and temporal relationships in the ultrasonic CFR data to predict the user's facial landmarks with an accuracy of 1.83 mm. These predicted landmarks are then used to reconstruct a 3D graphical model of the face that replicates the user's precise facial motion. Domain adaptation is performed by adapting layer weights with group-wise, differential learning rates, which reduces the training overhead in EARFace. The transformer-based ML model runs on smartphones with a processing latency of 13 ms and low overall power consumption. Finally, usability studies indicate that EARFace's earphone platform is more comfortable to wear than alternative form factors.
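The abstract does not say how the CFR is computed, so the following is only a minimal sketch of the general idea: a channel frequency response can be estimated by dividing the spectrum of the received signal by the spectrum of the known transmitted probe. The probe sequence, the impulse-response taps, the frame length, and the names `estimate_cfr`, `apply_channel`, `taps_neutral`, and `taps_smile` are all illustrative assumptions and are not taken from the paper.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform; adequate for a short frame."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def estimate_cfr(transmitted, received, eps=1e-9):
    """Estimate H[k] = Y[k] / X[k]; bins where the probe has no energy give None."""
    X = dft(transmitted)
    Y = dft(received)
    return [y / x if abs(x) > eps else None for x, y in zip(X, Y)]


def apply_channel(signal, taps):
    """Simulate a channel as circular convolution with a short impulse response."""
    n = len(signal)
    return [
        sum(taps[j] * signal[(t - j) % n] for j in range(len(taps)))
        for t in range(n)
    ]


# Hypothetical probe frame (assumption): a short sequence whose spectrum
# is non-zero in every bin, so each bin of the CFR can be measured.
N = 16
probe = [1.0, -1.0, 0.5] + [0.0] * (N - 3)

# Hypothetical impulse responses standing in for two ear-canal shapes.
taps_neutral = [1.0, 0.5]
taps_smile = [1.0, 0.3, 0.2]

cfr_neutral = estimate_cfr(probe, apply_channel(probe, taps_neutral))
cfr_smile = estimate_cfr(probe, apply_channel(probe, taps_smile))

# Per-bin change in CFR magnitude between the two simulated shapes.
magnitude_change = [
    abs(abs(a) - abs(b))
    for a, b in zip(cfr_neutral, cfr_smile)
    if a is not None and b is not None
]
```

In this toy setup the two simulated canal shapes produce visibly different CFR magnitudes in some bins. Variation of that kind, observed over frequency and over time, is the sort of signal a learned model could map to landmark positions. A real system would use an ultrasonic probe, windowed frames, and noise handling, none of which are modeled here.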


Source journal

ACM Transactions on Internet of Things

CiteScore

5.20

Self-citation rate

3.70%

Number of articles published