Representing Outliers for Improved Multi-Spectral Data Reduction
Authors: Farnaz Agahian, B. Funt, S. H. Amirshahi
Journal: Computer Graphics, Imaging and Visualization
Publication date: 2012-01-01
DOI: https://doi.org/10.2352/cgiv.2012.6.1.art00064
Citations: 1
Abstract
Large multi-spectral datasets, such as those produced by multi-spectral imaging, require a great deal of storage, so compressing them is an important problem. A common approach is to use principal components analysis (PCA) to reduce the data as part of a lossy compression strategy. In this paper, we employ the fast MCD (Minimum Covariance Determinant) algorithm, a highly robust estimator of multivariate mean and covariance, to detect outlier spectra in a multi-spectral image. We then show that removing the outliers from the main dataset significantly improves the performance of PCA in spectral compression. However, since the outlier spectra are part of the image, they cannot simply be ignored. Our strategy is to cluster the outliers into a small number of groups and then compress each group separately using its own cluster-specific PCA-derived bases. Overall, we show that significantly better compression can be achieved with this approach.

Introduction

Conventional 3-channel color imaging devices capture limited spectral information about each scene location. RGB images are device-dependent in that they depend on the spectral sensitivity functions of the device, which may differ from one device to another. In addition, the RGB color information depends on the scene illuminant, so a change in illuminant leads to the problems of metamerism. The limitations of 3-channel color imagery, especially when high-fidelity color reproduction is required, as in the reproduction and conservation of fine art paintings, are frequently overcome by moving to multi-spectral image capture [1-4]. The spectral reflectance provides an excellent “fingerprint” of a surface and the most useful information for color specification under any illuminant and for any observer.
In the last decade, multi-spectral imaging has attracted growing interest in applications such as color reproduction [4-5], medical imaging [6], art conservation science, and digital image archives with high color accuracy [1-4]. Unlike typical digital photography, multi-spectral imaging systems that acquire the spectral reflectance at each pixel provide a device-independent representation that can be rendered as the correct color under any viewing condition. Although the extra information provided by a multi-spectral imaging device can be very useful, the large amount of data can be a problem in terms of storage and communication requirements. Digital image compression is an important task in image processing and provides efficient solutions for storing large volumes of image data [7-9].

It is well documented that the spectral reflectance of non-fluorescent objects is generally a smooth function of wavelength and can therefore be modeled via dimensionality reduction techniques. In other words, smooth spectral reflectances are usually highly correlated and can be represented as a linear combination of a few basis vectors. Principal component analysis (PCA) is a well-known technique in multivariate data analysis [10] that has been used extensively in spectral imaging for spectral decorrelation as well as spectral dimensionality reduction [11]. PCA determines a linear transformation from the high-dimensional spectral space to a low-dimensional subspace spanned by a small number of basis vectors; among all linear transformations, it guarantees the best possible representation (in the least-squares sense) of the high-dimensional spectral vector in that subspace. This feature has made PCA a powerful tool for spectral compression.
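As a rough illustration of the PCA-based compression described above, the following NumPy sketch projects spectra onto the first k principal directions and reconstructs them. The synthetic spectra, the helper names (`pca_basis`, `compress`, `reconstruct`, `rms_error`) and the choice k = 3 are assumptions made for this example; none of them come from the paper.

```python
import numpy as np

def pca_basis(spectra, k):
    """Return the mean spectrum and the first k principal directions (as rows)."""
    mean = spectra.mean(axis=0)
    # SVD of the centered data; rows of vt are orthonormal principal directions.
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:k]

def compress(spectra, mean, basis):
    """Project each spectrum onto the basis: k coefficients per spectrum."""
    return (spectra - mean) @ basis.T

def reconstruct(coeffs, mean, basis):
    """Map coefficients back to the original spectral space."""
    return coeffs @ basis + mean

def rms_error(original, approx):
    """Root-mean-square error per spectrum."""
    return np.sqrt(np.mean((original - approx) ** 2, axis=1))

# Hypothetical test data: smooth curves sampled at 31 wavelengths (400-700 nm).
rng = np.random.default_rng(0)
wavelengths = np.arange(400, 701, 10)
t = (wavelengths - 400) / 300.0
curves = np.stack([np.ones_like(t), t, np.sin(np.pi * t)])
weights = rng.uniform(0.1, 0.5, size=(500, 3))
spectra = weights @ curves + 0.005 * rng.standard_normal((500, wavelengths.size))

mean, basis = pca_basis(spectra, k=3)
coeffs = compress(spectra, mean, basis)      # 500 x 3 instead of 500 x 31
recon = reconstruct(coeffs, mean, basis)
err = rms_error(spectra, recon)
print("coefficients:", coeffs.shape, "mean RMS error:", err.mean())
```

In this sketch each spectrum is stored as k coefficients, and the mean and basis are stored once for the whole dataset, which is where the data reduction comes from.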
It should be noted that the projected data can be reconstructed in the original space; however, the compression process will usually introduce some error into the reconstructed data. According to Laamanen et al. [12], the number of basis vectors required for effective recovery of reflectance depends entirely on the type of data involved and the basis vectors that are used. The more correlated the input data, the lower the reconstruction error achievable with PCA. Applying weighting factors to individual samples [13] and clustering the main dataset according to a predefined criterion [14-15] are techniques that have been used to enhance the efficiency of linear models by increasing the similarity of the elements in the dataset.

It is worth noting that a dataset may contain elements that lie a long way from the remainder of the data or do not conform to its correlation structure. Such elements are known as outliers, and they can have a substantial effect on the results of the analysis. Therefore, it is desirable to remove or reduce the effect of such observations before applying PCA to a dataset [10]. Analysis of the spectral reconstruction of 1269 matte Munsell color chips [16] indicates that some color samples, mostly in the family of purples, have a detrimental effect on the spectral and colorimetric reconstruction error of the whole dataset. Almost half of these samples are statistical outliers with respect to the other samples. Further investigation also shows that nearly 70% of the Munsell spectra whose reconstruction error (in terms of RMS) exceeds the median error of the whole dataset also have a large robust Mahalanobis distance from the mean. If we omit the purples from the Munsell dataset, extract eigenvectors from the remaining samples, and use these eigenvectors to reconstruct all 1269 samples, the error is lower than when the bases are extracted from all samples (including the purples).
This observation motivated us to study the effect of outlier spectra in large datasets of reflectance spectra, including those derived from multi-spectral images, and to propose a new method of compressing spectra based on the following steps: (1) separate the outliers from the non-outliers; (2) use standard PCA data reduction on the non-outliers; (3) apply k-means clustering to the outliers; (4) apply PCA data reduction to the clusters individually.

CGIV 2012 Final Program and Proceedings 367

Outlier Detection in a Spectral Dataset

The Mahalanobis distance is a measure based on the correlation between variables and has been widely used to detect multivariate outliers. For a multivariate vector $\mathbf{x}_i = [x_1, x_2, \ldots, x_p]^T$ from a dataset with mean $\boldsymbol{\mu} = [\mu_1, \mu_2, \ldots, \mu_p]^T$ and covariance matrix $\mathbf{S}$, the Mahalanobis distance is defined as

$$MD(\mathbf{x}_i) = \sqrt{(\mathbf{x}_i - \boldsymbol{\mu})^T \, \mathbf{S}^{-1} \, (\mathbf{x}_i - \boldsymbol{\mu})} \qquad (1)$$

Multivariate outliers can be defined as observations having a large Mahalanobis distance. The cutoff value is usually taken from a quantile of the chi-squared distribution with $p$ degrees of freedom, $\sqrt{\chi^2_{p,\,0.975}}$. However, this approach does not provide a reliable measure when there are multiple outliers, because collectively they create a masking effect: they distort the classical mean and covariance estimates, so they do not necessarily have a large MD. Therefore, it helps to estimate the mean and covariance of the dataset using a robust procedure [17-18]. Several robust estimators of mean and covariance exist. The minimum covariance determinant (MCD) estimator [18-19], for which a computationally fast algorithm is widely known in the literature, is the one we employ here. The MCD objective is to find the h observations (out of N) whose classical covariance matrix has the lowest determinant. The MCD estimate of the mean is then the average of these h points, and the MCD estimate of scatter is their covariance matrix. A complete description of the algorithm is presented in [18-19]. A Matlab library for robust analysis is readily available [20].
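The difference between the classical distance of Eq. 1 and a robust variant can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the Fast MCD algorithm of [18]: the hypothetical helper `simple_mcd` only runs concentration steps from random starting subsets and omits the consistency correction and reweighting used by full implementations. The two-dimensional synthetic data, the subset size h, and the helper names are likewise assumptions made for the example.

```python
import numpy as np

def mahalanobis(X, mean, cov):
    """Mahalanobis distance (Eq. 1) of every row of X from `mean` under `cov`."""
    diff = X - mean
    # Solve cov @ z = diff.T rather than inverting cov explicitly.
    z = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.maximum(np.sum(diff * z, axis=1), 0.0))

def simple_mcd(X, h, n_starts=50, n_steps=50, seed=0):
    """Approximate the MCD mean and scatter with plain concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_det, best_mean, best_cov = np.inf, None, None
    for _start in range(n_starts):
        # Start from a random (p + 1)-point subset, then concentrate.
        subset = np.sort(rng.choice(n, size=p + 1, replace=False))
        for _step in range(n_steps):
            mean = X[subset].mean(axis=0)
            cov = np.cov(X[subset], rowvar=False)
            # Keep the h observations closest to the current estimate.
            new_subset = np.sort(np.argsort(mahalanobis(X, mean, cov))[:h])
            if np.array_equal(new_subset, subset):
                break
            subset = new_subset
        mean = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        det = np.linalg.det(cov)
        if det < best_det:  # keep the h-subset with the smallest determinant
            best_det, best_mean, best_cov = det, mean, cov
    return best_mean, best_cov

# Hypothetical data: a correlated main population plus a tight group of
# points that violates its correlation structure.
rng = np.random.default_rng(1)
inliers = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=300)
cluster = rng.multivariate_normal([3.0, -3.0], [[0.1, 0.0], [0.0, 0.1]], size=80)
X = np.vstack([inliers, cluster])
n, p = X.shape

# For p = 2 the chi-squared CDF is 1 - exp(-x / 2), so the 0.975 quantile
# has a closed form; the cutoff on MD is its square root.
cutoff = np.sqrt(-2.0 * np.log(1.0 - 0.975))

md_classic = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
mcd_mean, mcd_cov = simple_mcd(X, h=int(0.75 * n))
md_mcd = mahalanobis(X, mcd_mean, mcd_cov)

classic_flags = md_classic > cutoff
robust_flags = md_mcd > cutoff
print("flagged (classic):", classic_flags.sum(), "flagged (robust):", robust_flags.sum())
```

Because the 80 clustered points inflate the classical covariance along their own direction, their classical distances stay modest, which is the masking effect described above; the robust estimate is fitted to the most concentrated h points and is far less affected by them.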
In this study we used one multi-spectral image entitled “Fruits and Flowers” from the Joensuu spectral image database [16] and four multi-spectral images available from the database of Hordley et al. [21]. “Fruits and Flowers” is a 160 × 120 pixel image containing 19,200 spectral reflectances sampled at 10 nm intervals over the range 400 nm to 700 nm. The other four multi-spectral images were measured over the same wavelength range with the same sampling rate. The number of spectra in each image is reported in Table II. It should be noted that the border of these images was removed before analysis, so the number of spectra reported in Table II is slightly different from the actual size of the images in [21]. In this paper, we show the steps of our method on “Fruits and Flowers” and report only the final results for the other images in Table II.

The result of using the “Fast MCD” algorithm [18] in conjunction with the Mahalanobis distance (denoted $MD_{\mathrm{MCD}}$) on the 19,200 Fruits and Flowers spectra is shown in Fig. 1. As can be seen, there is a substantial difference between the distances measured by $MD_{\mathrm{MCD}}$ and those measured by $MD_{\mathrm{classic}}$ (i.e., MD as defined in Eq. 1 with the classical mean and covariance), and this leads to very different sets of outliers. The red line represents the quantile cutoff value of $\sqrt{\chi^2_{31,\,0.975}} = 6.94$ for classification as an outlier. Based on this criterion, 7741 of the 19,200 spectra were recognized as outliers by $MD_{\mathrm{MCD}}$, in comparison to only 3358 by $MD_{\mathrm{classic}}$. It is worth noting that a multivariate outlier need not be an extreme value for any of the original variables (i.e., wavelengths); an observation can still be an outlier if it is inconsistent with the correlation structure of the remainder of the data [10]. The dataset is divided into outliers and non-outliers for the next processing steps, which involve applying PCA to the non-outliers and clustering the outliers.
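The processing steps that follow the outlier split (PCA on the non-outliers, k-means on the outliers, and a separate PCA per outlier cluster) can be sketched as below. This is an illustrative outline under assumptions, not the authors' code: the outlier mask is taken as given (in practice it would come from the robust distance test), the k-means routine is a plain Lloyd iteration, and the synthetic spectra, helper names, k = 3 and two clusters are all choices made for the example.

```python
import numpy as np

def fit_pca(X, k):
    """Mean and first k principal directions (rows) of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def roundtrip(X, mean, basis):
    """Compress X to basis coefficients and reconstruct it again."""
    return ((X - mean) @ basis.T) @ basis + mean

def nearest(X, centers):
    """Index of the closest center for every row of X."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd iteration started from randomly chosen data points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = nearest(X, centers)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest(X, centers)

def reconstruct_with_outlier_clusters(X, is_outlier, k, n_clusters):
    recon = np.empty_like(X)
    # Step 2: standard PCA on the non-outliers.
    mean, basis = fit_pca(X[~is_outlier], k)
    recon[~is_outlier] = roundtrip(X[~is_outlier], mean, basis)
    # Step 3: k-means clustering of the outliers.
    out_idx = np.flatnonzero(is_outlier)
    labels = kmeans(X[out_idx], n_clusters)
    # Step 4: a separate PCA basis for each outlier cluster.
    for c in range(n_clusters):
        members = out_idx[labels == c]
        if members.size:
            mean_c, basis_c = fit_pca(X[members], k)
            recon[members] = roundtrip(X[members], mean_c, basis_c)
    return recon

# Hypothetical spectra: one main population and two small deviating groups.
rng = np.random.default_rng(0)
wl = np.arange(400, 701, 10)

def sample(peaks, n, width=25.0):
    shapes = np.exp(-0.5 * ((wl[None, :] - np.array(peaks)[:, None]) / width) ** 2)
    return rng.uniform(0.1, 0.3, size=(n, 3)) @ shapes + 0.005 * rng.standard_normal((n, wl.size))

X = np.vstack([sample([450, 550, 650], 400),
               sample([420, 480, 510], 50),
               sample([600, 640, 690], 50)])
is_outlier = np.r_[np.zeros(400, dtype=bool), np.ones(100, dtype=bool)]  # assumed known

def rms(a, b):
    return np.sqrt(np.mean((a - b) ** 2, axis=1))

err_global = rms(X, roundtrip(X, *fit_pca(X, 3)))
err_split = rms(X, reconstruct_with_outlier_clusters(X, is_outlier, k=3, n_clusters=2))
print("global PCA:", err_global.mean(), "split PCA:", err_split.mean())
```

A real codec along these lines would also need to store one basis and mean per group and a group label per spectrum, so any gain in reconstruction error has to be weighed against that overhead.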