Isaac Ritharson P, D. Sujitha Juliet, J. Anitha, S. Immanuel Alex Pandian
Title: Multi-Document Summarization Made Easy: An Abstractive Query-Focused System Using Web Scraping and Transformer Models
DOI: 10.1109/CONIT59222.2023.10205946 (https://doi.org/10.1109/CONIT59222.2023.10205946)
Published in: 2023 3rd International Conference on Intelligent Technologies (CONIT)
Publication date: 2023-06-23
Citations: 0
Abstract
The paper proposes a web-based, abstractive, query-focused multi-document summarization system that simplifies the task of summarizing multiple documents on a given topic. The system combines web scraping, natural language processing, and transformer models to automate summarization and make information more accessible to users. It takes three user inputs: a query, the number of words to be summarized, and the number of documents to consult. It then uses Google search engine API integration to retrieve the most relevant webpages according to their ranking, and scrapes their tags with the Beautiful Soup (bs4) and Selenium frameworks. The scraped data is pre-processed through stop word removal and tokenization with AutoTokenizer, and is visualized as frequency matrix and word-cloud plots using seaborn and matplotlib. A pretrained transformer model, ‘mt5-small’, serves as the pipeline summarizer. The model ranks words by frequency and generates a summary that is coherent, concise, and relevant to the user’s query. The output is a well-structured summary that captures the essential information from the source documents. The experimental results demonstrate the potential of integrating these technologies to automate summarization and provide users with high-quality summaries of multiple documents on a given query.
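The scraping and pre-processing stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. To stay dependency-free it uses Python's standard `html.parser` in place of Beautiful Soup and Selenium, and a simple regular-expression tokenizer in place of the tokenizer named in the abstract. The stop word list, the helper names (`ParagraphExtractor`, `extract_paragraphs`, `word_frequencies`), the choice of `<p>` as the scraped tag, and the two sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop word list; the paper does not
# state which list it uses.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


class ParagraphExtractor(HTMLParser):
    """Collects the text inside <p> tags (stdlib stand-in for bs4)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> (headings, menus, and so on) is ignored.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in one HTML page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


def word_frequencies(pages):
    """Count non-stop-word tokens across the paragraphs of all pages."""
    counts = Counter()
    for html in pages:
        for paragraph in extract_paragraphs(html):
            tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
            counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Made-up pages standing in for the retrieved search results.
pages = [
    "<html><body><h1>Solar</h1><p>Solar panels convert sunlight to power.</p></body></html>",
    "<html><body><p>The cost of solar power is falling.</p><div>menu</div></body></html>",
]
freqs = word_frequencies(pages)
print(freqs.most_common(3))
```

The resulting `Counter` is the kind of word-frequency data that could feed the frequency and word-cloud plots mentioned above. In the described system, the cleaned text would then be passed to the transformer summarizer, a step this sketch leaves out.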