Hindi Multi-document Word Cloud based Summarization through Unsupervised Learning

P. Bafna, Jatinderkumar R. Saini
Published in: 2019 9th International Conference on Emerging Trends in Engineering and Technology - Signal and Information Processing (ICETET-SIP-19)
Publication date: 2019-11-01
DOI: https://doi.org/10.1109/ICETET-SIP-1946815.2019.9092259
Citations: 10

Abstract

Managing documents is a critical task that supports many applications, ranging from information retrieval to clustering search engine results. The multilingual support provided by websites has made Hindi a major language in today's digital domain. This work focuses on document management and summarization of a Hindi corpus. The objective is to manage the documents and summarize the Hindi corpus by applying token extraction and document clustering. The work offers better scalability and supports consistent cluster quality for incremental data sets. Most past and contemporary research has targeted document management for English corpora. Hindi corpora have mostly been used by researchers to explore stemming, single-document summarization, and classifier design. Applying unsupervised learning to a Hindi corpus to summarize multiple documents through a word cloud is still an untouched area. Technically, the current work is an application of TF-IDF, cosine-based document similarity measures, and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities. Entropy and precision are used to evaluate the experiments carried out on different live and available/tested datasets and results
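The abstract names TF-IDF weighting, cosine-based document similarity, and entropy-based evaluation without giving details. The following is a minimal standard-library sketch of those three ideas, not the authors' implementation. The function names (`tf_idf`, `cosine`, `cloud_weights`, `cluster_entropy`), the whitespace tokenization, the unsmoothed IDF formula, and the toy Hindi corpus are all assumptions made for illustration; the hierarchical clustering step that would produce a dendrogram is omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: relative term frequency times unsmoothed log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cloud_weights(vectors):
    """Sum TF-IDF weights over a cluster's documents.

    The totals could serve as font-size weights in a word cloud.
    """
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def cluster_entropy(labels):
    """Shannon entropy (bits) of the reference labels inside one cluster.

    Zero means every document in the cluster shares one label.
    """
    size = len(labels)
    return -sum(
        (count / size) * math.log2(count / size)
        for count in Counter(labels).values()
    )


# Hypothetical toy corpus: two sports snippets and one politics snippet.
docs = [
    "क्रिकेट मैच भारत जीत".split(),
    "क्रिकेट मैच टीम हार".split(),
    "चुनाव सरकार मतदान नीति".split(),
]
vecs = tf_idf(docs)
sim_sports = cosine(vecs[0], vecs[1])  # documents sharing two terms
sim_mixed = cosine(vecs[0], vecs[2])   # documents sharing no terms
sports_cloud = cloud_weights(vecs[:2])
```

In this sketch the two sports documents share terms and so have a positive similarity, while the sports and politics documents share none and have similarity zero. A clustering routine would group documents by such similarities, and `cloud_weights` would then supply the per-cluster term weights for a word cloud summary.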