{"title":"一种基于Word2Vec的文档摘要新方法","authors":"Zhibo Wang, Long Ma, Yanqing Zhang","doi":"10.1109/ICCI-CC.2016.7862087","DOIUrl":null,"url":null,"abstract":"Texting mining is a process to extract useful patterns and information from large volume of unstructured text data. Unlike other quantitative data, unstructured text data cannot be directly utilized in machine learning models. Hence, data pre-processing is an essential step to remove vague or redundant data such as punctuations, stop-words, low-frequency words in the corpus, and re-organize the data in a format that computers can understand. Though existing approaches are able to eliminate some symbols and stop-words during the pre-processing step, a portion of words are not used to describe the documents' topics. These irrelevant words not only waste the storage that lessen the efficiency of computing, but also lead to confounding results. In this paper, we propose an optimization method to further remove these irrelevant words which are not highly correlated to the documents' topics. Experimental results indicate that our proposed method significantly compresses the documents, while the resulting documents remain a high discrimination in classification tasks; additionally, storage is greatly reduced according to various criteria.","PeriodicalId":135701,"journal":{"name":"2016 IEEE 15th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"A novel method for document summarization using Word2Vec\",\"authors\":\"Zhibo Wang, Long Ma, Yanqing Zhang\",\"doi\":\"10.1109/ICCI-CC.2016.7862087\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Texting mining is a process to extract useful patterns and information from large volume of unstructured text data. Unlike other quantitative data, unstructured text data cannot be directly utilized in machine learning models. Hence, data pre-processing is an essential step to remove vague or redundant data such as punctuations, stop-words, low-frequency words in the corpus, and re-organize the data in a format that computers can understand. Though existing approaches are able to eliminate some symbols and stop-words during the pre-processing step, a portion of words are not used to describe the documents' topics. These irrelevant words not only waste the storage that lessen the efficiency of computing, but also lead to confounding results. In this paper, we propose an optimization method to further remove these irrelevant words which are not highly correlated to the documents' topics. 
Experimental results indicate that our proposed method significantly compresses the documents, while the resulting documents remain a high discrimination in classification tasks; additionally, storage is greatly reduced according to various criteria.\",\"PeriodicalId\":135701,\"journal\":{\"name\":\"2016 IEEE 15th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)\",\"volume\":\"33 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE 15th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCI-CC.2016.7862087\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 15th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCI-CC.2016.7862087","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Text mining is the process of extracting useful patterns and information from large volumes of unstructured text data. Unlike quantitative data, unstructured text cannot be used directly in machine learning models. Hence, data pre-processing is an essential step that removes noisy or redundant elements such as punctuation, stop-words, and low-frequency words from the corpus and re-organizes the data into a format that computers can process. Although existing approaches can eliminate some symbols and stop-words during pre-processing, a portion of the remaining words still do not describe the documents' topics. These irrelevant words not only waste storage and lessen computational efficiency, but also lead to confounding results. In this paper, we propose an optimization method that further removes words that are not highly correlated with the documents' topics. Experimental results indicate that the proposed method significantly compresses the documents while the resulting documents retain high discriminative power in classification tasks; in addition, storage requirements are greatly reduced under various criteria.
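
The abstract describes a two-stage idea: standard pre-processing (removing punctuation, stop-words, and low-frequency words) followed by pruning words that are weakly related to a document's topic using Word2Vec. The sketch below is a minimal, hypothetical illustration of that idea rather than the authors' exact algorithm: it trains a small gensim Word2Vec model (gensim >= 4.0 assumed) on a toy corpus and drops words whose average similarity to the rest of their document falls below an illustrative threshold. The corpus, threshold, and hyperparameters are placeholders.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Toy corpus; in practice this would be the full document collection.
raw_docs = [
    "Text mining extracts useful patterns from unstructured text data.",
    "Pre-processing removes punctuation, stop-words, and low-frequency words.",
    "Word embeddings place semantically related words close together.",
]

# Basic pre-processing: lowercase, strip punctuation, drop very short tokens.
# A real pipeline would also remove stop-words and low-frequency words.
docs = [simple_preprocess(d) for d in raw_docs]

# Train a small Word2Vec model on the pre-processed corpus.
model = Word2Vec(sentences=docs, vector_size=50, window=5, min_count=1, epochs=50)

def filter_irrelevant(doc_tokens, model, threshold=0.0):
    """Keep only words whose mean similarity to the other words in the
    document exceeds the threshold (a stand-in for topic relevance)."""
    kept = []
    for w in doc_tokens:
        others = [o for o in doc_tokens if o != w and o in model.wv]
        if w not in model.wv or not others:
            continue
        score = sum(model.wv.similarity(w, o) for o in others) / len(others)
        if score >= threshold:
            kept.append(w)
    return kept

for tokens in docs:
    print(filter_irrelevant(tokens, model, threshold=0.05))
```

With a corpus this small the similarity scores are noisy, so the threshold is purely illustrative; the point is only to show how a word-level relevance score derived from Word2Vec can be used to compress a document while keeping its topic-bearing words.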