Automatic Discovery of Abnormal Values in Large Textual Databases

P. Christen, Ross W. Gayler, Khoi-Nguyen Tran, Jeffrey Fisher, Dinusha Vatsalan
{"title":"Automatic Discovery of Abnormal Values in Large Textual Databases","authors":"P. Christen, Ross W. Gayler, Khoi-Nguyen Tran, Jeffrey Fisher, Dinusha Vatsalan","doi":"10.1145/2889311","DOIUrl":null,"url":null,"abstract":"Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media posts and bibliographic records. With online services, individuals are increasingly required to enter their personal details for example when purchasing products online or registering for government services, while many social network and e-commerce sites allow users to post short comments. Many online sites leave open the possibility for people to enter unintended or malicious abnormal values, such as names with errors, bogus values, profane comments, or random character sequences. In other applications, such as online bibliographic databases or comparative online shopping sites, databases are increasingly populated in (semi-) automatic ways through Web crawls. This practice can result in low quality data being added automatically into a database. In this article, we develop three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases. Following recent work in categorical outlier detection, our assumption is that “normal” values are those that occur frequently in a database, while an individual abnormal value is rare. Our techniques are unsupervised and address the challenge of discovering abnormal values as an outlier detection problem. Our first technique is a basic but efficient q-gram set based technique, the second is based on a probabilistic language model, and the third employs morphological word features to train a one-class support vector machine classifier. Our aim is to investigate and develop techniques that are fast, efficient, and automatic. The output of our techniques can help in the development of rule-based data cleaning and information extraction systems, or be used as training data for further supervised data cleaning procedures. We evaluate our techniques on four large real-world datasets from different domains: two US voter registration databases containing personal details, the 2013 KDD Cup dataset of bibliographic records, and the SNAP Memetracker dataset of phrases from social networking sites. Our results show that our techniques can efficiently and automatically discover abnormal textual values, allowing an organization to conduct efficient data exploration, and improve the quality of their textual databases without the need of requiring explicit training data.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"21 1","pages":"1 - 31"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Quality (JDIQ)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2889311","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Textual databases are ubiquitous in many application domains. Examples of textual data range from names and addresses of customers to social media posts and bibliographic records. With online services, individuals are increasingly required to enter their personal details, for example when purchasing products online or registering for government services, while many social network and e-commerce sites allow users to post short comments. Many online sites leave open the possibility for people to enter unintended or malicious abnormal values, such as names with errors, bogus values, profane comments, or random character sequences. In other applications, such as online bibliographic databases or comparative online shopping sites, databases are increasingly populated in (semi-)automatic ways through Web crawls. This practice can result in low-quality data being added automatically to a database. In this article, we develop three techniques to automatically discover abnormal (unexpected or unusual) values in large textual databases. Following recent work in categorical outlier detection, our assumption is that “normal” values are those that occur frequently in a database, while an individual abnormal value is rare. Our techniques are unsupervised and address the challenge of discovering abnormal values as an outlier detection problem. Our first technique is a basic but efficient q-gram set based technique, the second is based on a probabilistic language model, and the third employs morphological word features to train a one-class support vector machine classifier. Our aim is to investigate and develop techniques that are fast, efficient, and automatic. The output of our techniques can help in the development of rule-based data cleaning and information extraction systems, or be used as training data for further supervised data cleaning procedures. We evaluate our techniques on four large real-world datasets from different domains: two US voter registration databases containing personal details, the 2013 KDD Cup dataset of bibliographic records, and the SNAP MemeTracker dataset of phrases from social networking sites. Our results show that our techniques can efficiently and automatically discover abnormal textual values, allowing an organization to conduct efficient data exploration and improve the quality of its textual databases without requiring explicit training data.
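The abstract only names the first technique as a q-gram set based approach. To make the underlying intuition concrete (frequent values are normal, rare values are abnormal), the following is a minimal sketch, not the authors' implementation: the padding scheme, the rarity score based on how many records contain each q-gram, and the names qgrams and abnormality_scores are all assumptions introduced for this illustration.

# Minimal, illustrative sketch of a q-gram based abnormality score.
# Assumption (not taken from the paper): a value is "abnormal" when its
# character q-grams are rare across all records of the attribute.

from collections import Counter


def qgrams(value: str, q: int = 2) -> set[str]:
    """Return the set of character q-grams of a lower-cased, padded value."""
    padded = f"#{value.lower()}#"  # padding so start/end characters form q-grams
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def abnormality_scores(values: list[str], q: int = 2) -> dict[str, float]:
    """Score each distinct value by the average rarity of its q-grams.

    A q-gram's rarity is 1 / (number of records containing it), so values
    built from q-grams that occur in many records score low (normal), while
    values whose q-grams appear in few records score high (abnormal).
    """
    freq = Counter()                      # record frequency of each q-gram
    for v in values:
        freq.update(qgrams(v, q))

    scores = {}
    for v in set(values):
        grams = qgrams(v, q)
        if not grams:                     # only possible for q > 2 and very short values
            scores[v] = 1.0
            continue
        scores[v] = sum(1.0 / freq[g] for g in grams) / len(grams)
    return scores


if __name__ == "__main__":
    names = ["smith", "smith", "smyth", "jones", "jones", "asdfqwzx"]
    # Expected ranking on this toy list: asdfqwzx scores highest (none of its
    # bigrams appear in any other record), smith scores lowest (its bigrams
    # are the most frequent in the data).
    for value, score in sorted(abnormality_scores(names).items(),
                               key=lambda kv: kv[1], reverse=True):
        print(f"{score:.3f}  {value}")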