Yangyang Wang, Liping Hua, Hui Zhao, Lingfeng Yang
2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), published 2022-06-01. DOI: 10.1109/COMPSAC54236.2022.00039 (https://doi.org/10.1109/COMPSAC54236.2022.00039)
Unsupervised Markdown Feature-Aware Keywords Extraction Towards Technology Blogs
A vast number of blogs are generated by online technology communities every day, most of them in Markdown format. The growth of Markdown documents has brought opportunities and challenges to many natural language processing tasks. Extracting keywords from technology blogs is of great value for discovering, retrieving, and sharing the knowledge they contain. Mainstream keyword extraction algorithms still rely on the statistical characteristics of words to determine the keywords of a document and seldom consider the structural characteristics of the document, which potentially express semantic information. We argue that both Markdown markup features and the textual content of a document are relevant to keyword extraction. In this paper, we propose a novel unsupervised, Markdown markup feature-aware keyword extraction algorithm for technology blogs. The algorithm integrates Markdown markup syntax information with a blog text representation. In experiments against the TF-IDF, TextRank, and PositionRank algorithms on a real Markdown document dataset, our algorithm achieves higher performance, with a substantial improvement when the number of extracted keywords is greater than 3.
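The abstract does not spell out how markup information is combined with the text representation, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a simple weighted term count in which words inside headings, bold spans, and inline code contribute more than plain words. The weight values, the stopword list, and the function names (`tokenize`, `extract_keywords`) are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Illustrative weights only: these values are assumptions, not from the paper.
MARKUP_WEIGHTS = {"heading": 3.0, "bold": 2.0, "code": 1.5, "plain": 1.0}

# Tiny stopword list, just enough for a demonstration.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "this", "to"}

HEADING_RE = re.compile(r"^#{1,6}\s+(.*)$")   # ATX headings: "# Title"
BOLD_RE = re.compile(r"\*\*(.+?)\*\*")        # **bold** spans
CODE_RE = re.compile(r"`([^`]+)`")            # `inline code` spans
TOKEN_RE = re.compile(r"[a-z][a-z0-9]*")


def tokenize(text):
    """Lowercase the text and return its non-stopword tokens."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def extract_keywords(markdown, top_k=5):
    """Rank words by a markup-weighted count and return the top_k words."""
    scores = defaultdict(float)
    for line in markdown.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            # Every token in a heading line gets the heading weight.
            for tok in tokenize(heading.group(1)):
                scores[tok] += MARKUP_WEIGHTS["heading"]
            continue
        # Every token on a body line gets the plain weight...
        for tok in tokenize(re.sub(r"[*`]", " ", line)):
            scores[tok] += MARKUP_WEIGHTS["plain"]
        # ...and tokens inside bold or inline-code spans get a bonus on top.
        for kind, pattern in (("bold", BOLD_RE), ("code", CODE_RE)):
            for span in pattern.findall(line):
                for tok in tokenize(span):
                    scores[tok] += MARKUP_WEIGHTS[kind] - MARKUP_WEIGHTS["plain"]
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

Calling `extract_keywords(post_text, top_k=5)` on the raw Markdown of a blog post returns up to five candidate keywords. The sketch ignores fenced code blocks, links, and lists, and it scores single words rather than phrases; a fuller treatment would need a real Markdown parser and a stronger text representation.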