An Unsupervised Approach of Paraphrase Discovery from Large Crime Corpus
Authors: Priyanka Das, A. Das
Venue: 2018 International Conference on Computer Communication and Informatics (ICCCI)
DOI: 10.1109/ICCCI.2018.8441265 (https://doi.org/10.1109/ICCCI.2018.8441265)
Citations: 1
Abstract
Large collections of crime reports often contain valuable structured information about crime patterns, but processing such massive datasets manually is strenuous and error-prone. These datasets can best be exploited by identifying relational clusters of named entities in the crime reports. Often, however, a cluster contains phrases that do not express the relation that characterises the cluster as a whole. Paraphrasing is therefore performed to filter out those phrases. Paraphrases are, for the most part, phrases that express the same context in different wording. Discovering paraphrases in a large corpus is a demanding task for many applications of natural language processing, and researchers have worked on it for a long time, but none has attempted the paraphrasing task on crime data. To deal with the perplexity of the phrases, the present work proposes an unsupervised approach for recognising synonymous phrases, or paraphrases, in an untagged crime corpus. The work focuses on sentences that contain two entities, and each entity pair from a different domain is represented as a shallow parse tree. The headword of each parse tree captures the actual meaning of the phrase, and all phrases with the same headword are grouped together for each domain of entity pairs. However, many phrases express the same meaning without sharing a headword. The objective is therefore to cluster phrases that express the same meaning using an agglomerative hierarchical clustering technique. Because the method is unsupervised, it does not need any training samples.
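The clustering step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the similarity between phrases is measured, so everything specific here is an assumption made for illustration: each phrase is represented by the set of entity pairs it links, distance is one minus the Jaccard overlap of those sets, clusters are merged by average linkage, and merging stops at a hypothetical threshold. The function names and the toy triples are invented and do not come from the paper.

```python
from itertools import combinations


def build_features(triples):
    """Map each phrase to the set of (entity1, entity2) pairs it links.

    Assumption: phrases that link the same entity pairs tend to express
    the same relation. The paper may use a different representation.
    """
    features = {}
    for entity1, phrase, entity2 in triples:
        features.setdefault(phrase, set()).add((entity1, entity2))
    return features


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(features, threshold):
    """Bottom-up (agglomerative) clustering with average linkage.

    Starts with one cluster per phrase and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold` (a hypothetical stopping criterion).
    """
    clusters = [[phrase] for phrase in sorted(features)]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(
            jaccard_distance(features[p], features[q]) for p in c1 for q in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented toy triples (entity, phrase, entity); not data from the paper.
toy_triples = [
    ("suspect_A", "was arrested by", "police_X"),
    ("suspect_A", "was taken into custody by", "police_X"),
    ("suspect_B", "was arrested by", "police_Y"),
    ("suspect_B", "was taken into custody by", "police_Y"),
    ("suspect_C", "was acquitted by", "court_Z"),
]

toy_clusters = agglomerative_clusters(build_features(toy_triples), threshold=0.5)
```

On the toy input, the two phrases that link the same entity pairs end up in one cluster even though they share no headword, while the third phrase stays on its own. A real system would need a richer phrase representation and a tuned stopping criterion.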