An Unsupervised Approach of Paraphrase Discovery from Large Crime Corpus

Priyanka Das, A. Das
Published in: 2018 International Conference on Computer Communication and Informatics (ICCCI)
DOI: 10.1109/ICCCI.2018.8441265
Citations: 1

Abstract

Large collections of crime reports often contain valuable structured information about crime patterns, but processing such massive datasets manually is strenuous and error-prone. These datasets can best be exploited by identifying relational clusters of named entities in the crime reports. Often, however, a cluster contains phrases that do not express the relation that characterises the cluster as a whole. Paraphrasing is therefore performed to filter out those phrases. Paraphrases are, for the most part, phrases that express the same context in different wording. Discovering paraphrases in a large corpus is a demanding task with many applications in natural language processing, and researchers have been working on it for a long time, but none has attempted the paraphrasing task on crime data. To deal with the ambiguity of such phrases, the present work proposes an unsupervised approach for recognising synonymous phrases, or paraphrases, in an untagged crime corpus. The work focuses on sentences that contain two entities, and each entity pair from the different domains is represented as a shallow parse tree. The headword of each parse tree captures the core meaning of the phrase, and all phrases sharing a headword are collected for each domain of entity pairs. However, many phrases convey the same meaning without sharing a headword. The objective is therefore to cluster phrases that carry the same meaning using an agglomerative hierarchical clustering technique. The method is unsupervised and requires no training samples.
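The pipeline described in the abstract (group phrases by headword, then merge headword groups that express the same relation) can be illustrated with a minimal sketch. The abstract does not state the similarity measure or linkage criterion, so everything specific here is an assumption made for illustration: headwords are taken as already extracted by a shallow parser, similarity between two headwords is the Jaccard overlap of the entity pairs they connect, clusters are merged by average linkage, and the function names, toy records, and threshold value are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_by_headword(records):
    """Map each headword to the set of entity pairs its phrases connect."""
    groups = defaultdict(set)
    for entity_pair, headword in records:
        groups[headword].add(entity_pair)
    return dict(groups)


def jaccard(a, b):
    """Jaccard overlap between two sets of entity pairs (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerative(groups, threshold):
    """Average-linkage agglomerative clustering over headwords.

    Starts with one cluster per headword and repeatedly merges the two
    most similar clusters until no pair reaches `threshold`.
    """
    clusters = [[h] for h in sorted(groups)]

    def linkage(c1, c2):
        sims = [jaccard(groups[x], groups[y]) for x in c1 for y in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical (entity pair, headword) records for one entity-pair domain.
records = [
    (("suspect_A", "Delhi"), "arrest"),
    (("suspect_B", "Mumbai"), "arrest"),
    (("suspect_A", "Delhi"), "nab"),
    (("suspect_B", "Mumbai"), "nab"),
    (("suspect_C", "Pune"), "nab"),
    (("victim_D", "Kolkata"), "murder"),
    (("victim_E", "Chennai"), "murder"),
    (("victim_D", "Kolkata"), "kill"),
]

clusters = agglomerative(group_by_headword(records), threshold=0.4)
print(clusters)
```

No training samples are involved: the only inputs are the untagged records and a merge threshold. In this toy data, headwords that link the same entity pairs end up in one cluster even though they share no surface form, which is the case the abstract singles out.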