噪声非结构化文本数据分析2

L. V. Subramaniam, Shourya Roy
{"title":"噪声非结构化文本数据分析2","authors":"L. V. Subramaniam, Shourya Roy","doi":"10.4018/9781599048499.ch016","DOIUrl":null,"url":null,"abstract":"The importance of text mining applications is growing proportionally with the exponential growth of electronic text. Along with the growth of internet many other sources of electronic text have become really popular. With increasing penetration of internet, many forms of communication and interaction such as email, chat, newsgroups, blogs, discussion groups, scraps etc. have become increasingly popular. These generate huge amount of noisy text data everyday. Apart from these the other big contributors in the pool of electronic text documents are call centres and customer relationship management organizations in the form of call logs, call transcriptions, problem tickets, complaint emails etc., electronic text generated by Optical Character Recognition (OCR) process from hand written and printed documents and mobile text such as Short Message Service (SMS). Though the nature of each of these documents is different but there is a common thread between all of these—presence of noise. An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: “Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp.” Opinion(product1,good), from a blog post such as: “I absolutely liked the texture of SheetK quilts.” At superficial level, there are two ways for information extraction from noisy text. The first one is cleaning text by removing noise and then applying existing state of the art techniques for information extraction. There in lies the importance of techniques for automatically correcting noisy text. In this chapter, first we will review some work in the area of noisy text correction. The second approach is to devise extraction techniques which are robust with respect to noise. Later in this chapter, we will see how the task of information extraction is affected by noise.","PeriodicalId":320314,"journal":{"name":"Encyclopedia of Artificial Intelligence","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Analytics for Noisy Unstructured Text A Data II\",\"authors\":\"L. V. Subramaniam, Shourya Roy\",\"doi\":\"10.4018/9781599048499.ch016\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The importance of text mining applications is growing proportionally with the exponential growth of electronic text. Along with the growth of internet many other sources of electronic text have become really popular. With increasing penetration of internet, many forms of communication and interaction such as email, chat, newsgroups, blogs, discussion groups, scraps etc. have become increasingly popular. These generate huge amount of noisy text data everyday. Apart from these the other big contributors in the pool of electronic text documents are call centres and customer relationship management organizations in the form of call logs, call transcriptions, problem tickets, complaint emails etc., electronic text generated by Optical Character Recognition (OCR) process from hand written and printed documents and mobile text such as Short Message Service (SMS). Though the nature of each of these documents is different but there is a common thread between all of these—presence of noise. An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: “Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp.” Opinion(product1,good), from a blog post such as: “I absolutely liked the texture of SheetK quilts.” At superficial level, there are two ways for information extraction from noisy text. The first one is cleaning text by removing noise and then applying existing state of the art techniques for information extraction. There in lies the importance of techniques for automatically correcting noisy text. In this chapter, first we will review some work in the area of noisy text correction. The second approach is to devise extraction techniques which are robust with respect to noise. Later in this chapter, we will see how the task of information extraction is affected by noise.\",\"PeriodicalId\":320314,\"journal\":{\"name\":\"Encyclopedia of Artificial Intelligence\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Encyclopedia of Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4018/9781599048499.ch016\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Encyclopedia of Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4018/9781599048499.ch016","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

随着电子文本的指数级增长,文本挖掘应用的重要性也越来越大。随着互联网的发展,许多其他的电子文本来源已经变得非常受欢迎。随着互联网的日益普及,许多形式的交流和互动,如电子邮件、聊天、新闻组、博客、讨论组、废料等变得越来越流行。这些每天都会产生大量嘈杂的文本数据。除此之外,电子文本文档的其他主要贡献者是呼叫中心和客户关系管理机构,其形式包括通话记录、通话记录、问题单、投诉电子邮件等,通过光学字符识别(OCR)过程从手写和打印文档生成的电子文本,以及移动文本,如短信服务(SMS)。虽然这些文件的性质不同,但它们之间有一个共同的线索——噪音的存在。信息提取的一个例子是从在线新闻句子中提取公司合并的实例,更正式的例子是MergerBetween(company1,company2,date),例如:“昨天,总部位于纽约的Foo Inc.宣布收购Bar Corp.”Opinion(product1,good),例如:“我非常喜欢SheetK被子的质地。”从表面上看,从噪声文本中提取信息有两种方法。第一种是通过去除噪声来清理文本,然后应用现有的最先进的技术进行信息提取。自动纠错噪声文本的技术的重要性就在于此。在本章中,我们将首先回顾噪声文本校正领域的一些工作。第二种方法是设计对噪声具有鲁棒性的提取技术。在本章的后面,我们将看到噪声如何影响信息提取的任务。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Analytics for Noisy Unstructured Text A Data II
The importance of text mining applications is growing proportionally with the exponential growth of electronic text. Along with the growth of internet many other sources of electronic text have become really popular. With increasing penetration of internet, many forms of communication and interaction such as email, chat, newsgroups, blogs, discussion groups, scraps etc. have become increasingly popular. These generate huge amount of noisy text data everyday. Apart from these the other big contributors in the pool of electronic text documents are call centres and customer relationship management organizations in the form of call logs, call transcriptions, problem tickets, complaint emails etc., electronic text generated by Optical Character Recognition (OCR) process from hand written and printed documents and mobile text such as Short Message Service (SMS). Though the nature of each of these documents is different but there is a common thread between all of these—presence of noise. An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company1,company2,date), from an online news sentence such as: “Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp.” Opinion(product1,good), from a blog post such as: “I absolutely liked the texture of SheetK quilts.” At superficial level, there are two ways for information extraction from noisy text. The first one is cleaning text by removing noise and then applying existing state of the art techniques for information extraction. There in lies the importance of techniques for automatically correcting noisy text. In this chapter, first we will review some work in the area of noisy text correction. The second approach is to devise extraction techniques which are robust with respect to noise. Later in this chapter, we will see how the task of information extraction is affected by noise.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信