Navigable Topic Maps for Overlaying Multiple Acquired Semantic Classifications

Helka Folch, B. Habert, S. Lahlou
Journal: Markup Lang. Published: 2000-08-01. DOI: https://doi.org/10.1162/109966200750363625
Citations: 3

Abstract

We present work carried out within the framework of the Scriptorium project, developed at the Research & Development division of Électricité de France (EDF), the French electricity company. We are exploring issues related to knowledge acquisition from very large, heterogeneous corpora, and to the semantic annotation of these corpora, with the aim of facilitating browsing and navigation. Semantic access to heterogeneous, evolving text collections has become a crucial issue in the world of online information: the increasing availability of electronic text enables the construction (and dispersion) of heterogeneous text collections. Current navigation tools such as thesauri, glossaries, and indexes, which are based on pre-defined semantic categories or taxonomies, are inadequate for describing or browsing this kind of dynamic, loosely structured text collection. We have therefore adopted an inductive, data-driven approach aimed at extracting semantic classes from a corpus through the statistical analysis of textual data. We create different views or 'slices' of the document collection by extracting sub-corpora of manageable size, which we submit to the statistical software. We then build a navigable topic map of our document collection using the Topic Map Standard (ISO/IEC 13250), which provides a semantic interface to the document collection and enables navigation through the inductively acquired viewpoints and classes. Navigation is aided by a 3D geometric representation of the semantic space of the corpus... The aim of this project is to identify prominent and emerging topics through automatic analysis of the discourse of the company's (EDF's) different social agents (managers, trade unions, employees, etc.), using textual data analysis methods.
The corpus under study in this project contains eight million words and is very heterogeneous (it includes book extracts, corporate press, union press, summaries of corporate meetings, transcriptions of taped trade union messages, etc.). This diversity makes the corpus prototypical of the electronic documents available nowadays in a given domain. All documents are SGML-tagged following the TEI (Text Encoding Initiative) recommendations... We are exploring issues related to semantic acquisition from large, heterogeneous corpora and content-based access to these corpora on the basis of inductively acquired categories. We feel that data-driven, inductive approaches to building semantic interfaces to text collections will become increasingly necessary for efficiently managing the unrestricted, dynamic online information available today.
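As a rough illustration of how several inductively acquired classifications might be overlaid on one collection, the following Python sketch models a very small topic-map-like structure. It is not the Scriptorium implementation and does not follow the ISO/IEC 13250 interchange syntax: the class names `Topic` and `TopicMap`, the methods, and the sample viewpoint, class, and text-unit identifiers are all hypothetical. The sketch assumes that each semantic class is a topic scoped by the viewpoint (slice) it was acquired in, and that its occurrences are identifiers of text units.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A semantic class acquired within one viewpoint (assumed model)."""
    name: str
    viewpoint: str  # plays the role of a scope in this sketch
    occurrences: set = field(default_factory=set)  # ids of text units


class TopicMap:
    """Hypothetical container overlaying classes from several viewpoints."""

    def __init__(self):
        self.topics = {}                 # (viewpoint, name) -> Topic
        self.by_unit = defaultdict(set)  # unit id -> {(viewpoint, name)}

    def add_occurrence(self, viewpoint, name, unit_id):
        """Record that a text unit belongs to a class in a given viewpoint."""
        key = (viewpoint, name)
        topic = self.topics.setdefault(key, Topic(name, viewpoint))
        topic.occurrences.add(unit_id)
        self.by_unit[unit_id].add(key)

    def classes_of(self, unit_id):
        """All (viewpoint, class) pairs a text unit belongs to, sorted."""
        return sorted(self.by_unit.get(unit_id, set()))

    def related(self, viewpoint, name):
        """Other classes sharing text units with the given class.

        Returns a dict mapping (viewpoint, class) to the number of
        shared text units, which gives one way to move between viewpoints.
        """
        source_key = (viewpoint, name)
        overlap = defaultdict(int)
        for unit in self.topics[source_key].occurrences:
            for key in self.by_unit[unit]:
                if key != source_key:
                    overlap[key] += 1
        return dict(overlap)


# Illustrative data only; none of these names come from the corpus.
tm = TopicMap()
tm.add_occurrence("slice_A", "class_1", "unit_1")
tm.add_occurrence("slice_A", "class_1", "unit_2")
tm.add_occurrence("slice_B", "class_7", "unit_2")
tm.add_occurrence("slice_B", "class_9", "unit_3")
print(tm.classes_of("unit_2"))
print(tm.related("slice_A", "class_1"))
```

Indexing occurrences in both directions (class to text units, and text unit to classes) is what lets a reader start from one viewpoint's class and arrive at classes from another viewpoint that cover the same text.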