EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses.

Impact factor: 6.5

IEEE Transactions on Visualization and Computer Graphics. Pub Date: 2025-10-07. DOI: 10.1109/TVCG.2025.3616866

Akshay Paruchuri, Sinan Hersek, Lavisha Aggarwal, Qiao Yang, Xin Liu, Achin Kulshrestha, Andrea Colaco, Henry Fuchs, Ishan Chatterjee


Citation count: 0

Abstract

All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened. In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos that were curated to ensure that they contained audio, focusing on HOI moments critical for contextual understanding and memory. Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use - supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).
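The triggering idea described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical upstream audio classifier (for example, a small head on top of audio embeddings) that already yields one HOI probability per audio window. The function names, the `threshold` value, and the `hold_windows` hold-off are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def select_frames(hoi_probs: Sequence[float],
                  threshold: float = 0.5,
                  hold_windows: int = 2) -> List[bool]:
    """Return one capture decision per audio window.

    hoi_probs: hypothetical per-window probabilities that the audio
    contains a hand-object interaction cue.
    threshold, hold_windows: illustrative assumptions, not paper values.
    """
    decisions: List[bool] = []
    remaining_hold = 0
    for p in hoi_probs:
        if p >= threshold:
            # Audio cue detected: capture now and keep the camera on
            # for a few more windows.
            remaining_hold = hold_windows
            decisions.append(True)
        elif remaining_hold > 0:
            # No cue in this window, but still inside the hold-off period.
            remaining_hold -= 1
            decisions.append(True)
        else:
            # No cue and no active hold-off: leave the camera off.
            decisions.append(False)
    return decisions


def frame_reduction(decisions: Sequence[bool]) -> float:
    """Fraction of frames skipped relative to always-on capture."""
    if not decisions:
        return 0.0
    return 1.0 - sum(decisions) / len(decisions)


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    probs = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.05, 0.7, 0.2, 0.1]
    decisions = select_frames(probs, threshold=0.5, hold_windows=1)
    print(decisions)
    print(f"frames skipped: {frame_reduction(decisions):.0%}")
```

In this sketch, the fraction of skipped frames plays the role of the frame savings the abstract reports. The actual trade-off between savings and question-answering accuracy depends on the classifier and on settings that the abstract does not specify.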


Source journal

IEEE Transactions on Visualization and Computer Graphics

Self-citation rate

0.00%

Article count