Domain-specific Evaluation Dataset Generator for Multilingual Text Analysis

E. Inan, Vahab Mostafapour, Fatif Tekbacak
Journal of Intelligent Systems with Applications. Journal article, published 2019-12-27. DOI: 10.54856/jiswa.201912084 (https://doi.org/10.54856/jiswa.201912084)

Abstract

The Web makes it possible to retrieve concise information about specific entities, including people, organizations, and movies, along with their features. However, a large share of Web resources is in unstructured form, which makes it difficult to find critical information about specific entities. Text analysis approaches such as Named Entity Recognition and Entity Linking aim to identify entity mentions and link them to the relevant entities in a given knowledge base. Many general-purpose benchmark datasets exist for evaluating these approaches. Domain-specific approaches, however, are difficult to evaluate because evaluation datasets for specific domains are lacking. This study presents WeDGeM, a multilingual evaluation set generator for specific domains that exploits Wikipedia category pages and the DBpedia hierarchy. Wikipedia disambiguation pages are also used to adjust the ambiguity level of the generated texts. As a use case, well-known Entity Linking systems that support Turkish texts are evaluated in the movie domain on the generated test data.
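
The abstract does not say how the ambiguity level is adjusted, so the following is only an illustrative sketch of one way the idea could work, not WeDGeM's actual procedure. It assumes a hypothetical in-memory mapping from surface forms to candidate entities (the kind of information a disambiguation page lists) and keeps only mentions whose candidate count reaches a requested threshold. The names DISAMBIGUATION, ambiguity, select_mentions and mean_ambiguity, and the toy entries, are all assumptions introduced for illustration.

```python
from typing import Dict, List

# Hypothetical toy data: surface form -> candidate entities, as might be
# gathered from disambiguation pages. These entries are illustrative only
# and are not taken from the paper.
DISAMBIGUATION: Dict[str, List[str]] = {
    "Titanic": ["Titanic_(1997_film)", "Titanic_(1953_film)", "RMS_Titanic"],
    "Avatar": ["Avatar_(2009_film)", "Avatar_(computing)"],
    "Inception": ["Inception"],
}


def ambiguity(mention: str) -> int:
    """Return the number of candidate entities for a mention.

    A mention with no entry is treated as unambiguous (one candidate).
    """
    return len(DISAMBIGUATION.get(mention, [mention]))


def select_mentions(mentions: List[str], min_candidates: int) -> List[str]:
    """Keep mentions whose candidate count is at least min_candidates."""
    return [m for m in mentions if ambiguity(m) >= min_candidates]


def mean_ambiguity(mentions: List[str]) -> float:
    """Average candidate count over a list of mentions (0.0 if empty)."""
    if not mentions:
        return 0.0
    return sum(ambiguity(m) for m in mentions) / len(mentions)


if __name__ == "__main__":
    mentions = ["Titanic", "Avatar", "Inception"]
    # Raising the threshold yields a harder, more ambiguous test set.
    print(select_mentions(mentions, min_candidates=2))
    print(mean_ambiguity(mentions))
```

Raising min_candidates in this sketch produces a test set made up of more ambiguous mentions, which is one plausible reading of "adjusting the ambiguity level".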