Multi-Document Summarization Made Easy: An Abstractive Query-Focused System Using Web Scraping and Transformer Models

Isaac Ritharson P, D. Sujitha Juliet, J. Anitha, S. Immanuel Alex Pandian
Published in: 2023 3rd International Conference on Intelligent Technologies (CONIT)
Publication date: 2023-06-23
DOI: 10.1109/CONIT59222.2023.10205946 (https://doi.org/10.1109/CONIT59222.2023.10205946)
Citations: 0

Abstract

The paper proposes a web-based, abstractive, query-focused multi-document summarization system that simplifies summarizing multiple documents on a given topic. The system combines web scraping, natural language processing, and transformer models to automate summarization and make information more accessible to users. It takes three inputs from the user: a query, the desired summary length in words, and the number of documents to consult. It then uses Google search engine API integration to retrieve the most relevant webpages by ranking, and scrapes the content of their tags with the Beautiful Soup (bs4) and Selenium frameworks. The scraped data is pre-processed: stop words are removed, the text is tokenized with AutoTokenizer, and a frequency matrix and word-cloud plots are visualized with seaborn and matplotlib. The system uses the pretrained transformer model ‘mt5-small’ as the pipeline summarizer. The transformer model ranks the words based on frequency and generates a summary that is coherent, concise, and relevant to the user’s query. The output is a well-structured summary that captures the essential information from multiple documents. The experimental results demonstrate the potential of integrating these technologies and techniques to automate summarization and provide users with high-quality summaries of multiple documents on a given query.
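The scraping and pre-processing stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in Python's standard-library `html.parser` for Beautiful Soup and Selenium so that it runs without extra dependencies, and it stops before the search-API and transformer summarization steps. The class `ParagraphExtractor`, the helper functions, the `STOP_WORDS` list, and the sample pages are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tiny illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "from", "with"}


class ParagraphExtractor(HTMLParser):
    """Collects text inside <p> tags (stdlib stand-in for Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Keep text only while inside a paragraph; headings, divs, etc. are skipped.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the whitespace-normalized text of every non-empty <p> element."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(p.split()) for p in parser.paragraphs if p.strip()]


def content_words(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def word_frequencies(documents):
    """Count content words across the paragraphs of several HTML documents."""
    counts = Counter()
    for html in documents:
        for paragraph in extract_paragraphs(html):
            counts.update(content_words(paragraph))
    return counts


# Hypothetical stand-ins for pages returned by the search step.
pages = [
    "<html><body><h1>Nav</h1><p>Transformer models summarize the text.</p></body></html>",
    "<html><body><p>Web scraping collects text from the pages.</p><div>footer</div></body></html>",
]
frequencies = word_frequencies(pages)
print(frequencies.most_common(3))
```

In a fuller pipeline, the resulting counts would presumably feed the frequency-matrix and word-cloud plots, and the cleaned paragraph text would be passed to the summarization model; both steps are left out here because the abstract does not specify their details.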