Ji Cheng, Da Yan, Wenwen Qu, Xiaotian Hao, Cheng Long, Wilfred Ng, Xiaoling Wang
ACM Transactions on Database Systems (TODS), pp. 1-57. Published 2022-03-31. DOI: 10.1145/3524915 (https://doi.org/10.1145/3524915)
Citations: 1
Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation Methods
Given a data matrix \( D \), a submatrix \( S \) of \( D \) is an order-preserving submatrix (OPSM) if there is a permutation of the columns of \( S \) under which the entry values of each row in \( S \) are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preferences. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work on OPSM has considered uncertain value intervals using the well-established possible world semantics. We establish two different definitions of significant OPSMs based on the possible world semantics: (1) expected support-based and (2) probabilistic frequentness-based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, with a closed-form formula derived to efficiently handle the special case of uniform value distribution and an accurate cubic spline approximation approach that works well with any uncertain value distribution. To check probabilistic frequentness efficiently, several pruning rules are designed to eliminate insignificant OPSMs early; two approximation techniques, based on the Poisson and Gaussian distributions, respectively, are proposed for further speedup. These techniques are integrated into our two OPSM mining algorithms, based on prefix-projection and Apriori, respectively. We further parallelize our prefix-projection-based mining algorithm using PrefixFPM, a recently proposed framework for parallel frequent pattern mining, and we achieve a good speedup as the number of CPU cores increases. Extensive experiments on real microarray data demonstrate that the OPSMs found by our algorithms are of much higher quality than those found by existing approaches.
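The two significance notions named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so this assumes the definitions usual under possible world semantics: the expected support of a column permutation is the sum, over rows, of the probability that the row supports it, and a permutation is probabilistically frequent when the probability that at least `min_sup` rows support it reaches a threshold. Rows are assumed independent, and the per-row probabilities are taken as given inputs rather than computed by the paper's dynamic programming, closed-form, or spline methods. All function names are hypothetical.

```python
from typing import List, Sequence


def supports(row: Sequence[float], perm: Sequence[int]) -> bool:
    """Deterministic case: True if the row's values are strictly
    increasing when its columns are visited in the order given by perm."""
    return all(row[a] < row[b] for a, b in zip(perm, perm[1:]))


def expected_support(probs: Sequence[float]) -> float:
    """Assumed definition: sum of per-row support probabilities."""
    return sum(probs)


def support_tail_probability(probs: Sequence[float], min_sup: int) -> float:
    """P(at least min_sup rows support the permutation), assuming rows
    are independent. Exact dynamic programming over the support count
    (a Poisson-binomial distribution); a generic technique, not
    necessarily the optimized procedure used in the paper."""
    # dist[k] = probability that exactly k of the rows seen so far support
    dist: List[float] = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this row does not support
            nxt[k + 1] += q * p       # this row supports
        dist = nxt
    return sum(dist[min_sup:])


def is_probabilistically_frequent(
    probs: Sequence[float], min_sup: int, threshold: float
) -> bool:
    """Assumed definition: the tail probability reaches the threshold."""
    return support_tail_probability(probs, min_sup) >= threshold


if __name__ == "__main__":
    # Hypothetical per-row probabilities for one column permutation.
    row_probs = [0.9, 0.5, 0.2]
    print(expected_support(row_probs))
    print(support_tail_probability(row_probs, 2))
    print(is_probabilistically_frequent(row_probs, 2, 0.5))
```

The exact dynamic program takes time quadratic in the number of rows. That cost is presumably what motivates the Poisson and Gaussian approximations mentioned in the abstract, which would replace the exact tail probability with a cheaper estimate.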