An Information Theoretic Approach to Detection of Minority Subsets in Database

S. Ando, Einoshin Suzuki
{"title":"An Information Theoretic Approach to Detection of Minority Subsets in Database","authors":"S. Ando, Einoshin Suzuki","doi":"10.1109/ICDM.2006.19","DOIUrl":null,"url":null,"abstract":"Detection of rare and exceptional occurrences in large- scale databases have become an important practice in the field of knowledge discovery and information retrieval. Many databases include large amount of noise or irrelevant data, whose distribution often overlaps with the subsets of exceptional data containing useful knowledge. This paper addresses the problem of finding a small subset of \"minority\" data whose distribution overlaps with, but are exceptional to or inconsistent with that of the majority of the database. In such a case, conventional distance-based or density-based approaches in Outlier Detection are ineffective due to their dependence on the structure of the majority or the prerequisite of critical parameters. We formalize the task as an estimation of a model of the minority subset which provides a simple description of the subset and yet maintains divergence from that of the majority. This estimation is formalized as a minimization problem using an information theoretic framework of Rate Distortion theory. We further introduce conditions of the majority to derive an objective function which factorizes the property of the minority and dependence to the structure of the majority. The proposed method shows improvements from conventional approaches in artificial data and a promising result in document retrieval problem.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Sixth International Conference on Data Mining (ICDM'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2006.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

Detection of rare and exceptional occurrences in large- scale databases have become an important practice in the field of knowledge discovery and information retrieval. Many databases include large amount of noise or irrelevant data, whose distribution often overlaps with the subsets of exceptional data containing useful knowledge. This paper addresses the problem of finding a small subset of "minority" data whose distribution overlaps with, but are exceptional to or inconsistent with that of the majority of the database. In such a case, conventional distance-based or density-based approaches in Outlier Detection are ineffective due to their dependence on the structure of the majority or the prerequisite of critical parameters. We formalize the task as an estimation of a model of the minority subset which provides a simple description of the subset and yet maintains divergence from that of the majority. This estimation is formalized as a minimization problem using an information theoretic framework of Rate Distortion theory. We further introduce conditions of the majority to derive an objective function which factorizes the property of the minority and dependence to the structure of the majority. The proposed method shows improvements from conventional approaches in artificial data and a promising result in document retrieval problem.
数据库中少数子集检测的信息论方法
大型数据库中罕见异常事件的检测已成为知识发现和信息检索领域的一项重要实践。许多数据库包含大量的噪声或不相关数据,其分布往往与包含有用知识的异常数据子集重叠。本文解决了寻找“少数”数据的一个小子集的问题,这些数据的分布与数据库的大多数数据重叠,但不例外或不一致。在这种情况下,传统的基于距离或基于密度的离群检测方法由于依赖于多数的结构或关键参数的先决条件而无效。我们将任务形式化为对少数子集的模型的估计,该模型提供了子集的简单描述,但保持了与多数子集的差异。利用率失真理论的信息论框架,将该估计形式化为最小化问题。我们进一步引入多数人的条件,推导出一个考虑少数人的性质和对多数人结构的依赖的目标函数。该方法在处理人工数据方面比传统方法有所改进,在处理文档检索问题方面也取得了令人满意的结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信