Representing Outliers for Improved Multi-Spectral Data Reduction

Computer Graphics, Imaging and Visualization. Pub Date: 2012-01-01. DOI: 10.2352/cgiv.2012.6.1.art00064

Farnaz Agahian, B. Funt, S. H. Amirshahi


Citations: 1

Abstract

Large multi-spectral datasets, such as those created by multi-spectral images, require a lot of data storage. Compression of these data is therefore an important problem. A common approach is to use principal component analysis (PCA) to reduce the data requirements as part of a lossy compression strategy. In this paper, we employ the fast MCD (Minimum Covariance Determinant) algorithm, a highly robust estimator of multivariate mean and covariance, to detect outlier spectra in a multi-spectral image. We then show that removing the outliers from the main dataset significantly improves the performance of PCA in spectral compression. However, since outlier spectra are part of the image, they cannot simply be ignored. Our strategy is to cluster the outliers into a small number of groups and then compress each group separately using its own cluster-specific PCA-derived bases. Overall, we show that significantly better compression can be achieved with this approach.

Introduction

Conventional 3-channel color imaging devices capture limited spectral information about each scene location. RGB images are device-dependent in that they depend on the spectral sensitivity functions, which may differ from one device to another. In addition, the RGB color information depends on the scene illuminant, and a change in illuminant leads to the problems of metamerism. The limitations of 3-channel color imagery, especially when high-fidelity color reproduction is required, as in the reproduction and conservation of fine art paintings, are frequently overcome by moving to multi-spectral image capture [1-4]. The spectral reflectance defines an excellent "fingerprint" of a surface and provides the most useful information for color specification under any illuminant and for any observer.
In the last decade, multi-spectral imaging has attracted growing interest in several applications such as color reproduction [4-5], medical imaging [6], art conservation science and digital image archives with high color accuracy [1-4]. Unlike typical digital photography, multi-spectral imaging systems that acquire the spectral reflectance at each pixel of an image provide a device-independent representation that can be rendered as a correct color under any viewing condition. Although the extra information provided by a multi-spectral imaging device can be very useful, the large amount of data can be a problem in terms of storage and communication requirements. Digital image compression is an important task in image processing and provides efficient solutions for storing large volumes of image data [7-9].

It is well documented that the spectral reflectance of non-fluorescent objects is generally a smooth function of wavelength and can therefore be modeled via dimensionality reduction techniques. In other words, smooth spectral reflectances are usually highly correlated and can be represented as a linear combination of a few basis vectors. Principal component analysis (PCA) is a well-known technique [10] in multivariate data analysis that has been used extensively in spectral imaging as an efficient technique for spectral decorrelation as well as spectral dimensionality reduction [11]. PCA determines a linear transformation from the high-dimensional spectral space to a low-dimensional spectral subspace spanned by a small number of basis vectors. Among all linear transformations, it guarantees the best possible representation, in the least-squares sense, of the high-dimensional spectral vector in that subspace. This feature has made PCA a powerful tool for spectral compression.
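The PCA compression described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fit_pca_basis`, `compress`, `reconstruct`, `rms_error`) and the synthetic Gaussian-shaped "reflectances" are assumptions made only for illustration, with 31 bands chosen to mimic sampling from 400 nm to 700 nm at 10 nm intervals.

```python
import numpy as np

def fit_pca_basis(spectra, k):
    """Return the mean spectrum and the top-k principal directions (k x p)."""
    mean = spectra.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:k]

def compress(spectra, mean, basis):
    """Project p-dimensional spectra onto k coefficients each."""
    return (spectra - mean) @ basis.T

def reconstruct(coeffs, mean, basis):
    """Map k coefficients back to the original p-dimensional space."""
    return coeffs @ basis + mean

def rms_error(original, approx):
    """Root-mean-square reconstruction error of each spectrum."""
    return np.sqrt(np.mean((original - approx) ** 2, axis=1))

# Hypothetical smooth spectra: 200 Gaussian bumps sampled at 31 wavelengths.
rng = np.random.default_rng(0)
wavelengths = np.arange(400, 701, 10)
centers = rng.uniform(400, 700, size=(200, 1))
widths = rng.uniform(40, 120, size=(200, 1))
spectra = np.exp(-0.5 * ((wavelengths - centers) / widths) ** 2)

# Mean RMS error for a few choices of the number of basis vectors.
errors = {}
for k in (1, 3, 6):
    mean, basis = fit_pca_basis(spectra, k)
    approx = reconstruct(compress(spectra, mean, basis), mean, basis)
    errors[k] = rms_error(spectra, approx).mean()
    print(k, "basis vectors, mean RMS error:", errors[k])
```

Because the subspaces are nested, the residual of every spectrum can only shrink as more basis vectors are kept; how many are needed in practice depends on the data, as noted in the text.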
It should be noted that the projected data can be reconstructed in the original space; however, the compression process will usually lead to some error in the reconstructed data. According to Laamanen et al. [12], the number of basis vectors required for effective recovery of reflectance depends entirely on the type of data involved and the basis vectors that are used. The more correlated the input data, the lower the reconstruction error achievable with PCA. Applying weighting factors to individual samples [13] and clustering the main dataset according to a predefined criterion [14-15] are techniques that have been used to enhance the efficiency of linear models by increasing the similarity of the elements in the dataset.

It is worth noting that a dataset may contain elements that lie a long way from the remainder of the data or do not conform to its correlation structure. Such elements are known as outliers, and they can have a substantial effect on the results of the dataset analysis. Therefore, it is desirable to remove or reduce the effect of such observations before applying PCA to a dataset [10]. Analysis of the spectral reconstruction of 1269 Matte Munsell color chips [16] indicates that some color samples, mostly in the family of purples, have a detrimental effect on the spectral and colorimetric reconstruction error of the whole dataset. Almost half of these samples are statistical outliers with respect to the other samples. Further investigation also shows that nearly 70% of the Munsell spectra whose reconstruction error (in terms of RMS) exceeds the median error of the whole dataset also have a large robust Mahalanobis distance from the mean. If we omit the purples from the Munsell dataset, extract eigenvectors from the remaining samples, and use these eigenvectors to reconstruct all 1269 samples, the error is lower than for reconstruction with bases extracted from all samples (including the purples).
This observation motivated us to study the effect of outlier spectra in large datasets of reflectance spectra, including those derived from multi-spectral images, and then to propose a new method of compressing spectra based on the following steps: (1) separate the outliers from the non-outliers; (2) use standard PCA data reduction on the non-outliers; (3) apply k-means clustering to the outliers; (4) apply PCA data reduction to the clusters individually.

Outlier Detection in a Spectral Dataset

The Mahalanobis distance is a measure based on the correlation between variables and has been widely used to detect multivariate outliers. For a multivariate vector $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]^T$ from a dataset with mean $\boldsymbol{\mu} = [\mu_1, \mu_2, \ldots, \mu_p]^T$ and covariance matrix $\mathbf{S}$, the Mahalanobis distance is defined as

$$MD(\mathbf{x}_i) = \sqrt{(\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{S}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})} \qquad (1)$$

Multivariate outliers can be defined as observations having a large Mahalanobis distance. A quantile of the chi-squared distribution with $p$ degrees of freedom, $\chi^2_{p,0.975}$, is usually used to set the cutoff value. However, this approach does not provide a reliable measure when there are multiple outliers, because of the masking effect they collectively create: the outliers distort the classical mean and covariance estimates, so they do not necessarily have a large MD. Therefore, it helps to estimate the mean and covariance of the dataset using a robust procedure [17-18]. Several robust estimators of mean and covariance exist. The minimum covariance determinant (MCD) [18-19] is widely known in the literature as a computationally fast algorithm and is the one we employ here. The MCD objective is to find the h observations (out of N) whose classical covariance matrix has the lowest determinant. The MCD estimate of the mean is then the average of these h points, and the MCD estimate of scatter is their covariance matrix. A complete description of the algorithm is presented in [18-19]. A Matlab library for robust analysis is readily available [20].
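The masking effect and the MCD idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the Fast MCD algorithm of [18]: the hypothetical `simple_mcd` below only runs concentration steps from random h-subsets and keeps the subset with the smallest covariance determinant, without the reweighting or consistency correction a full implementation would apply. The two-dimensional test data, the subset size `h=375`, and all function names are invented for the example. For p = 2 the chi-squared 0.975 quantile has the closed form −2 ln(0.025), and its square root is used as the cutoff because Eq. 1 includes the square root.

```python
import numpy as np

def mahalanobis(X, mean, cov):
    """Mahalanobis distance of every row of X from `mean` (Eq. 1)."""
    diff = X - mean
    # Solve cov @ z = diff^T rather than forming the explicit inverse.
    z = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.sum(diff * z, axis=1))

def simple_mcd(X, h, n_starts=10, n_steps=50, seed=0):
    """Approximate MCD: h-subset with the smallest covariance determinant."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best_det, best_mean, best_cov = np.inf, None, None
    for _ in range(n_starts):
        subset = rng.choice(n, size=h, replace=False)
        for _ in range(n_steps):
            mean = X[subset].mean(axis=0)
            cov = np.cov(X[subset], rowvar=False)
            # Concentration step: keep the h points closest to the estimate.
            new_subset = np.argsort(mahalanobis(X, mean, cov))[:h]
            if np.array_equal(np.sort(new_subset), np.sort(subset)):
                break
            subset = new_subset
        mean = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        det = np.linalg.det(cov)
        if det < best_det:
            best_det, best_mean, best_cov = det, mean, cov
    return best_mean, best_cov

# Hypothetical data: 400 regular points plus a tight clump of 100 outliers.
rng = np.random.default_rng(1)
inliers = rng.normal(0.0, 1.0, size=(400, 2))
planted = rng.normal(10.0, 0.3, size=(100, 2))
X = np.vstack([inliers, planted])

# Square root of the chi-squared 0.975 quantile (closed form valid for p = 2).
cutoff = np.sqrt(-2.0 * np.log(1.0 - 0.975))

md_classic = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
mcd_mean, mcd_cov = simple_mcd(X, h=375)
md_robust = mahalanobis(X, mcd_mean, mcd_cov)

flag_classic = md_classic > cutoff
flag_robust = md_robust > cutoff
print("planted outliers flagged (classic):", flag_classic[400:].sum())
print("planted outliers flagged (robust): ", flag_robust[400:].sum())
```

In this toy setting the clump of outliers inflates the classical covariance along the direction pointing to it, which pulls the outliers' own distances under the cutoff; the subset-based estimate is not affected in the same way. Because the sketch omits the consistency correction, its robust distances are somewhat inflated relative to a full MCD implementation.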
In this study we used one multi-spectral image entitled "Fruits and Flowers" from the Joensuu spectral image database [16] and four multi-spectral images available from the database of Hordley et al. [21]. "Fruits and Flowers" is a 160 × 120 pixel image containing 19,200 spectral reflectances sampled at 10 nm intervals over the range 400 nm to 700 nm. The other four multi-spectral images were measured over the same wavelength band with the same sampling rate. The number of spectra in each image is reported in Table II. It should be noted that the border of these images was removed before analysis, so the number of spectra reported in Table II is slightly different from the actual size of the images in [21]. In this paper, we show the steps of our method on "Fruits and Flowers" and report only the final results for the other images in Table II.

The result of using the "Fast MCD" algorithm [18] in conjunction with the Mahalanobis distance (denoted $MD_{MCD}$) on the 19,200 Fruits and Flowers spectra is shown in Fig. 1. As can be seen, there is a substantial difference between the distances measured by $MD_{MCD}$ and those measured by $MD_{classic}$ (i.e., MD as defined in Eq. 1), and this leads to very different sets of outliers. The red line represents the cutoff value $\sqrt{\chi^2_{31,0.975}} = 6.94$ (the 0.975 chi-squared quantile for the 31 wavelengths is about 48.2) used to classify a spectrum as an outlier. Based on this criterion, 7741 of the 19,200 spectra were recognized as outliers by $MD_{MCD}$, compared with only 3358 by $MD_{classic}$. It is worth noting that a multivariate observation that is not an extreme value for any of the original variables (i.e., wavelengths) can still be an outlier if it is inconsistent with the correlation structure of the remainder of the data [10]. The dataset is divided into outliers and non-outliers for the next processing steps, which involve applying PCA to the non-outliers and clustering the outliers.
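Once the spectra have been split into outliers and non-outliers, steps (2) to (4) of the method might look roughly like the NumPy sketch below. It is only an illustration under several assumptions: the outlier mask is taken as given, the k-means routine is a plain Lloyd iteration with a deterministic farthest-first start, every group uses the same number of basis vectors `k`, and the names `pca_roundtrip`, `kmeans_labels` and `compress_split` as well as the synthetic data are hypothetical rather than taken from the paper.

```python
import numpy as np

def pca_roundtrip(X, k):
    """Compress X to k PCA coefficients per row, then reconstruct it."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:k]
    return (X - mean) @ basis.T @ basis + mean

def kmeans_labels(X, n_clusters, n_steps=100):
    """Plain Lloyd's k-means with farthest-first initial centers."""
    centers = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    labels = np.full(len(X), -1)
    for _ in range(n_steps):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(n_clusters):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def compress_split(X, is_outlier, k, n_clusters):
    """PCA on non-outliers, k-means on outliers, then PCA per cluster."""
    approx = np.empty_like(X)
    approx[~is_outlier] = pca_roundtrip(X[~is_outlier], k)
    outliers = X[is_outlier]
    labels = kmeans_labels(outliers, n_clusters)
    rebuilt = np.empty_like(outliers)
    for j in range(n_clusters):
        members = labels == j
        if np.any(members):
            rebuilt[members] = pca_roundtrip(outliers[members], k)
    approx[is_outlier] = rebuilt
    return approx

# Hypothetical data: a main population plus two differently structured groups.
rng = np.random.default_rng(0)
p = 31
main = rng.normal(size=(300, 2)) @ rng.normal(size=(2, p))
group_a = 5.0 + rng.normal(size=(40, 1)) @ rng.normal(size=(1, p))
group_b = -5.0 + rng.normal(size=(40, 1)) @ rng.normal(size=(1, p))
X = np.vstack([main, group_a, group_b])
is_outlier = np.arange(len(X)) >= 300  # mask assumed known for this demo

rms_global = np.sqrt(np.mean((X - pca_roundtrip(X, 2)) ** 2))
rms_split = np.sqrt(np.mean((X - compress_split(X, is_outlier, 2, 2)) ** 2))
print("RMS error, single PCA basis:", rms_global)
print("RMS error, split approach:  ", rms_split)
```

The comparison keeps the number of coefficients per spectrum fixed and ignores the extra storage needed for the additional bases, means and cluster labels, which a fair compression-ratio comparison would have to include.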


