Fine-Grained Temporal Site Monitoring in EGD Streams via Visual Time-Aware Embedding and Vision-Text Asymmetric Coworking

Fang Peng, Hongkuan Shi, Shiquan He, Qiang Hu, Ting Li, Fan Huang, Xinxia Feng, Mei Liu, Jiazhi Liao, Qiang Li, Zhiwei Wang

IEEE Journal of Biomedical and Health Informatics (Q1, Computer Science, Information Systems; IF 6.7). Published 2024-10-30. DOI: 10.1109/JBHI.2024.3488514
Abstract
Esophagogastroduodenoscopy (EGD) requires complete inspection of numerous upper gastrointestinal (UGI) sites for precise cancer screening. Automated temporal site monitoring for EGD assistance is thus in high demand, yet it often fails when existing online action detection methods are applied directly. The key challenges are twofold: 1) global camera motion dominates, invalidating the temporal patterns derived from object optical flows, and 2) the UGI sites are fine-grained, yielding highly homogenized appearances. In this paper, we propose an EGD-customized model, powered by two novel designs, i.e., Visual Time-aware Embedding plus Vision-text Asymmetric Coworking (VTE+VAC), for real-time, accurate, fine-grained UGI site monitoring. Concretely, VTE learns visual embeddings by differentiating frames via classification losses and, meanwhile, by reordering sampled time-agnostic frames to be temporally coherent via a ranking loss. This joint objective encourages VTE to capture sequential relations without resorting to inapplicable object optical flows, and thus to provide time-aware frame-wise embeddings. In the subsequent analysis, VAC uses a temporal sliding window and extracts vision-text multimodal knowledge from each frame and its corresponding textualized prediction via the learned VTE and a frozen BERT. The text embeddings provide more representative cues, but may also cause misdirection due to prediction errors. Thus, VAC randomly drops or replaces historical predictions to increase error tolerance and avoid collapsing onto the last few predictions. Qualitative and quantitative experiments demonstrate that the proposed method achieves superior performance compared with other state-of-the-art methods, with an average F1-score improvement of at least 7.66%.
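To make the two designs concrete, here is a minimal PyTorch sketch of a VTE-style joint objective as described above: a cross-entropy loss that separates fine-grained UGI sites, plus a pairwise margin-ranking loss that restores the temporal order of time-shuffled frames. All names (`VisualEncoder`, `time_head`, `margin`, the site count) are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a VTE-style joint objective (classification + ranking);
# module names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEncoder(nn.Module):
    """Hypothetical frame encoder with a site-classification head and a
    scalar time-score head used only by the ranking loss."""
    def __init__(self, feat_dim=512, num_sites=26):
        super().__init__()
        # Stand-in for a real CNN/ViT backbone.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.cls_head = nn.Linear(feat_dim, num_sites)
        self.time_head = nn.Linear(feat_dim, 1)

    def forward(self, frames):  # frames: (B, C, H, W)
        feats = self.backbone(frames)
        return self.cls_head(feats), self.time_head(feats).squeeze(-1)

def vte_joint_loss(encoder, frames, site_labels, timestamps, margin=0.1):
    """Classification + ranking loss over a batch of time-shuffled frames.

    For every pair (i, j) with timestamps[i] < timestamps[j], the ranking
    term pushes frame j's time score above frame i's by at least `margin`,
    encouraging temporally coherent embeddings without optical flow.
    """
    logits, scores = encoder(frames)
    loss_cls = F.cross_entropy(logits, site_labels)
    earlier = timestamps.unsqueeze(1) < timestamps.unsqueeze(0)  # (B, B): i before j
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)             # diff[i, j] = s_j - s_i
    loss_rank = F.relu(margin - diff)[earlier].mean()
    return loss_cls + loss_rank
```

Likewise, VAC's error-tolerance trick, randomly dropping or replacing past predictions in the sliding window, might look like the following; `drop_p` and `replace_p` are assumed hyperparameters.

```python
import random

def perturb_history(history, num_sites=26, drop_p=0.1, replace_p=0.1):
    """VAC-style history augmentation (as described in the abstract):
    randomly drop or corrupt past site predictions so the model tolerates
    earlier errors instead of collapsing onto the last few predictions."""
    out = []
    for pred in history:
        r = random.random()
        if r < drop_p:
            continue                                 # drop this prediction
        if r < drop_p + replace_p:
            out.append(random.randrange(num_sites))  # replace with a random site
        else:
            out.append(pred)
    return out
```

During training, the perturbed history would then be textualized (e.g., as site names) and encoded by the frozen BERT alongside the VTE frame embedding, matching the vision-text coworking the abstract describes.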
About the Journal:
IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics such as interoperability, evidence-based medicine, and patient data security are also addressed.