Hindi Multi-document Word Cloud based Summarization through Unsupervised Learning

2019 9th International Conference on Emerging Trends in Engineering and Technology - Signal and Information Processing (ICETET-SIP-19)
Pub Date: 2019-11-01
DOI: 10.1109/ICETET-SIP-1946815.2019.9092259

P. Bafna, Jatinderkumar R. Saini


Citations: 10

Abstract


Managing documents is a critical task that supports many applications, ranging from information retrieval to clustering search engine results. The multilingual support offered by websites has made Hindi a major language in today's digital domain. This work focuses on document management and summarization of a Hindi corpus. The objective is to manage the documents and summarize the corpus by applying token extraction and document clustering. The approach is better in terms of scalability and supports consistent cluster quality for incremental datasets. Most past and contemporary research has targeted document management for English corpora. Hindi corpora have mostly been used by researchers to explore stemming, single-document summarization and classifier design. Implementing unsupervised learning on a Hindi corpus to summarize multiple documents through a word cloud is still an untouched area. Technically, the current work applies TF-IDF, cosine-based document similarity measures and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities. Entropy and precision are used to evaluate the experiments carried out on different live and available/tested datasets and results
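The techniques named in the abstract (TF-IDF weighting, cosine-based document similarity, hierarchical clustering and entropy-based evaluation) can be illustrated with a small sketch. This is not the authors' implementation: the toy Hindi documents, the gold labels, whitespace tokenisation, the smoothed IDF formula, average-linkage merging and every function name below are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (whitespace tokens; assumed smoothed IDF)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append({
            tok: (cnt / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def agglomerate(vectors, n_clusters):
    """Merge the most similar pair of clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        return sum(cosine(vectors[i], vectors[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, x, y = max(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] += clusters.pop(y)  # y > x, so index x is still valid
    return clusters

def cloud_terms(vectors, members, top_n=3):
    """Sum TF-IDF weights over a cluster and return its heaviest tokens."""
    totals = Counter()
    for i in members:
        totals.update(vectors[i])
    return [tok for tok, _ in totals.most_common(top_n)]

def clustering_entropy(clusters, labels):
    """Size-weighted entropy of gold labels within each cluster (0 means pure clusters)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        result += len(c) / total * h
    return result

# Hypothetical toy corpus: two cricket documents and two weather documents.
docs = [
    "क्रिकेट मैच भारत जीत",
    "भारत क्रिकेट टीम मैच",
    "बारिश मौसम मानसून बादल",
    "मौसम बारिश तापमान बादल",
]
labels = ["sports", "sports", "weather", "weather"]

vectors = tfidf_vectors(docs)
clusters = agglomerate(vectors, n_clusters=2)
for members in clusters:
    print(members, cloud_terms(vectors, members))
print("entropy:", clustering_entropy(clusters, labels))
```

The per-cluster token weights are the kind of input a word cloud renderer would scale font sizes by, and recording each merge in `agglomerate` instead of stopping at a fixed cluster count would yield the merge tree that a dendrogram visualises.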

