DocNet: A document embedding approach based on neural networks
Z. Mo, Jianhong Ma
2018 24th International Conference on Automation and Computing (ICAC), 2018-09-01
DOI: 10.23919/IConAC.2018.8749095 (https://doi.org/10.23919/IConAC.2018.8749095)
Citations: 1
Abstract
Embedding texts into vector spaces is a common and fundamental preprocessing step. Although there are several approaches to representing documents as vectors, reducing the dimension and improving expressive power can still be a problem when facing large-scale data and sophisticated demands. Distributed dense vectors have been shown to be powerful in capturing token-level semantics. In this paper, we propose a new method, which we call DocNet, for embedding entire documents into a vector space using a deep neural network. DocNet is trained end-to-end to learn the vector space, and in doing so takes all the information, including semantics, into account. Once this space has been produced, tasks such as classification and clustering can be done simply using standard techniques. Our method uses a triplet loss for training. The benefit is that the vector space is learned directly, so we can control the final dimension of the embedding vectors. To demonstrate the performance of our method, we built a clustering system and compared it with several baseline methods. Experiments show that our approach achieves state-of-the-art document clustering performance. Furthermore, they show that complicated clustering or classification demands can be satisfied by our method.
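The abstract names a triplet loss as the training objective but does not give its exact form. As a minimal sketch, assuming the common hinge formulation with squared Euclidean distance, the loss for a batch of (anchor, positive, negative) embeddings could look as follows. The function names, the margin value, and the toy vectors are illustrative assumptions, not details taken from the paper.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, margin=0.5):
    """Mean hinge triplet loss over a batch of embedding triplets.

    Each triplet contributes max(d(a, p) - d(a, n) + margin, 0), so the loss
    is zero once the negative is at least `margin` farther from the anchor
    than the positive. The margin of 0.5 is an assumed value.
    """
    losses = [
        max(squared_distance(a, p) - squared_distance(a, n) + margin, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings (made-up values, not data from the paper).
anchors = [[0.0, 0.0], [1.0, 1.0]]
positives = [[0.0, 1.0], [1.0, 2.0]]   # documents similar to the anchors
negatives = [[3.0, 0.0], [2.0, 1.0]]   # documents dissimilar to the anchors

loss = triplet_loss(anchors, positives, negatives)
print(loss)
```

In this toy batch the first triplet already satisfies the margin and contributes nothing, while the second does not, so only the second would drive an update. Because the loss is defined directly on the output vectors, the embedding dimension is simply the width of the network's final layer, which matches the abstract's remark about controlling the final dimension.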