A method for evaluating deep generative models of images for hallucinations in high-order spatial context

IF 3.9 | Zone 3, Computer Science | JCR Q2, Computer Science, Artificial Intelligence
Rucha Deshpande, Mark A. Anastasio, Frank J. Brooks
Pattern Recognition Letters, vol. 186, pp. 23-29. Published 2024-09-02. DOI: 10.1016/j.patrec.2024.08.023
https://www.sciencedirect.com/science/article/pii/S0167865524002551
Citations: 0

Abstract

Deep generative models (DGMs) have the potential to revolutionize diagnostic imaging. Generative adversarial networks (GANs) are one widely employed kind of DGM. The overarching problem with deploying any sort of DGM in mission-critical applications is the lack of adequate and/or automatic means of assessing the domain-specific quality of generated images. In this work, we demonstrate several objective and human-interpretable tests of images output by two popular DGMs. These tests serve two goals: (i) ruling out DGMs for downstream, domain-specific applications, and (ii) quantifying hallucinations in the expected spatial context in DGM-generated images. The designed datasets are made public, and the proposed tests could also serve as benchmarks and aid the prototyping of emerging DGMs. Although these tests are demonstrated on GANs, they can be employed as a benchmark for evaluating any DGM. Specifically, we designed several stochastic context models (SCMs) of distinct image features that can be recovered after generation by a trained DGM. Together, these SCMs encode features as per-image constraints in prevalence, position, intensity, and/or texture. Several of these features are high-order, algorithmic pixel-arrangement rules that are not readily expressed in covariance matrices. We designed and validated statistical classifiers to detect specific effects of the known arrangement rules. We then tested the rates at which two different DGMs correctly reproduced the feature context under a variety of training scenarios and degrees of feature-class similarity. We found that ensembles of generated images can appear largely accurate visually, and show high accuracy in ensemble measures, while not exhibiting the known spatial arrangements. The main conclusion is that SCMs can be engineered, and serve as benchmarks, to quantify numerous per-image errors, i.e., hallucinations, that may not be captured in ensemble statistics but plausibly can affect subsequent use of the DGM-generated images.
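
The gap between ensemble measures and per-image errors described in the abstract can be illustrated with a toy sketch. This is not the authors' SCMs, datasets, or classifiers. The rule (exactly two bright pixels per image), the image size, and every function name below are hypothetical, chosen only to show how an ensemble statistic can match while a per-image constraint is violated.

```python
import random

SIZE = 8        # hypothetical image side length
RULE_COUNT = 2  # hypothetical per-image rule: exactly two bright pixels


def make_image(n_bright, rng):
    """Return a SIZE x SIZE binary image with n_bright pixels set to 1."""
    flat = [0] * (SIZE * SIZE)
    for idx in rng.sample(range(SIZE * SIZE), n_bright):
        flat[idx] = 1
    return [flat[r * SIZE:(r + 1) * SIZE] for r in range(SIZE)]


def obeys_rule(image):
    """Per-image test: does the image contain exactly RULE_COUNT bright pixels?"""
    return sum(map(sum, image)) == RULE_COUNT


def mean_intensity(images):
    """Ensemble statistic: mean pixel value over all images."""
    total = sum(sum(map(sum, img)) for img in images)
    return total / (len(images) * SIZE * SIZE)


def hallucination_rate(images):
    """Fraction of images that violate the per-image rule."""
    return sum(not obeys_rule(img) for img in images) / len(images)


rng = random.Random(0)

# Reference ensemble: every image obeys the rule.
reference = [make_image(RULE_COUNT, rng) for _ in range(300)]

# Stand-in for a flawed generator (an assumption, not a trained DGM):
# bright-pixel counts of 1, 2, and 3 occur in equal shares, so the
# ensemble mean is unchanged while two thirds of the images break the rule.
generated = [make_image(1 + i % 3, rng) for i in range(300)]

print("mean intensity:", mean_intensity(reference), mean_intensity(generated))
print("violation rate:", hallucination_rate(reference), hallucination_rate(generated))
```

In this toy setting the two ensembles have identical mean intensity, yet only the per-image check reveals that most of the "generated" images violate the rule. The paper's SCMs encode richer constraints on position, intensity, and texture, which this sketch does not attempt to reproduce.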

Source journal
Pattern Recognition Letters (Engineering and Technology: Computer Science, Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles published: 287
Review time: 9.1 months
About the journal: Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, and other developing themes involving learning and recognition.