Unsupervised discovery and extraction of semi-structured regions in text via self-information

Eric Yeh, J. Niekrasz, Dayne Freitag
Conference on Automated Knowledge Base Construction, published 2013-10-27
DOI: 10.1145/2509558.2509576 (https://doi.org/10.1145/2509558.2509576)
Citations: 3

Abstract

We describe a general method for identifying and extracting information from semi-structured regions of text embedded within a natural language document. These regions encode information according to ad hoc schemas and visual cues rather than the grammatical and presentational conventions of normal sentential language. Examples include tables, key-value listings, and repeated enumerations of properties. Because they are generally non-sentential, these regions can present problems for standard information extraction algorithms. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of structure types. Our approach to identifying semi-structured regions is unsupervised and is based on scoring unusual regularity inside the document. Because content in semi-structured regions is governed by a schema, the features that occur there, which encompass textual content and visual appearance, are unusual compared with those seen in sentential language. Regularity refers to repetition of these unusual features, as semi-structured regions commonly encode more than a single row or group of information. To score this, we present a measure based on expected self-information, derived from statistics over patterns of textual categories and visual layout. We describe the results of an initial study assessing the ability of this measure to detect semi-structured text in a corpus culled from the web, and show that it outperforms baseline methods on average precision. We also present initial work that uses these significant patterns to generate extraction rules, and conclude with a discussion of future directions.
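
The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of "unusual regularity", not the authors' measure. It assumes that a line's pattern is its leading indentation plus a sequence of coarse token categories, that unusualness is the self-information -log2 P(pattern) estimated from a background sample of sentential text with add-one smoothing, and that regularity is the number of times the pattern repeats in the document. The function names, the category set, and the way the two quantities are combined are all hypothetical.

```python
import math
import re
from collections import Counter


def token_category(tok):
    """Map a token to a coarse textual category (hypothetical category set)."""
    if tok.isdigit():
        return "NUM"
    if tok.isalpha():
        return "CAP" if tok[0].isupper() else "LOW"
    if all(not c.isalnum() for c in tok):
        return "PUNCT"
    return "MIX"


def line_pattern(line):
    """Pattern = (leading indent width, tuple of token categories)."""
    indent = len(line) - len(line.lstrip(" "))
    tokens = re.findall(r"\w+|[^\w\s]", line)
    return (indent, tuple(token_category(t) for t in tokens))


def self_information(pattern, background_counts, total, n_patterns):
    """-log2 of the add-one-smoothed background probability of a pattern."""
    p = (background_counts[pattern] + 1) / (total + n_patterns)
    return -math.log2(p)


def score_lines(doc_lines, background_lines):
    """Score each document line: unusual under the background AND repeated."""
    background = Counter(line_pattern(l) for l in background_lines)
    total = sum(background.values())
    doc_patterns = [line_pattern(l) for l in doc_lines]
    doc_counts = Counter(doc_patterns)
    n_patterns = len(set(background) | set(doc_counts))
    scores = []
    for pat in doc_patterns:
        info = self_information(pat, background, total, n_patterns)
        # Assumed combination: a pattern seen only once scores zero.
        scores.append(info * (doc_counts[pat] - 1))
    return scores


background = [
    "The committee met on Tuesday.",
    "Nobody expected the result.",
    "It rained for most of the afternoon.",
]
document = [
    "Name: Alice",
    "Age: 30",
    "Name: Bob",
    "Age: 41",
    "This is a plain sentence about nothing.",
]
for line, score in zip(document, score_lines(document, background)):
    print(f"{score:6.2f}  {line}")
```

In this toy input the repeated key-value lines receive positive scores while the single sentence scores zero, which is the behavior the abstract describes qualitatively. A real system would need richer layout features and a far larger background sample.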