A deep and uniform model for semantic annotation of semi structured documents based on SHIRI

2016 4th International Conference on Control Engineering & Information Technology (CEIT). Pub Date: 2016-12-01. DOI: 10.1109/CEIT.2016.7929020

M. Thiam


Citations: 1

Abstract

In building the semantic web, researchers annotate existing web content to improve the precision with which applications handle documents. The rapid growth of the web makes it impossible to do this manually. Many annotation techniques address the first and easiest problem of information search, which is finding the documents that contain the searched data. In this work we propose a deep annotation model for locating and extracting the more precise parts of documents that answer a query. This work extends SHIRI, an ontology-based system for integrating semi-structured documents related to a specific domain. The ontology is described by a set of concepts, relations and their properties, and also contains a lexical part. SHIRI relies on an automatic, unsupervised and ontology-driven approach to extraction, alignment and querying for the semantic annotation of tagged document elements. In this paper we focus on two major improvements: (1) we apply statistical techniques to filter the extracted terms and named entities, and (2) we annotate document parts with a single metadata element. Experiments on real datasets show that these improvements greatly increase recall, and that the returned answers are more precise and are ranked according to their precision.

