{"title":"Efficient and accurate Word2Vec implementations in GPU and shared-memory multicore architectures","authors":"Trevor M. Simonton, G. Alaghband","doi":"10.1109/HPEC.2017.8091076","DOIUrl":null,"url":null,"abstract":"Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven to be useful in a variety of machine learning tasks. In this work, we propose new methods to increase the speed of the Word2Vec skip gram with hierarchical softmax architecture on multi-core shared memory CPU systems, and on modern NVIDIA GPUs with CUDA. We accomplish this on multi-core CPUs by batching training operations to increase thread locality and to reduce accesses to shared memory. We then propose new heterogeneous NVIDIA GPU CUDA implementations of both the skip gram hierarchical softmax and negative sampling techniques that utilize shared memory registers and in-warp shuffle operations for maximized performance. Our GPU skip gram with negative sampling approach produces a higher quality of word vectors than previous GPU implementations, and our flexible skip gram with hierarchical softmax implementation achieves a factor of 10 speedup of the existing methods.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC.2017.8091076","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8
Abstract
Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven useful in a variety of machine learning tasks. In this work, we propose new methods to increase the speed of the Word2Vec skip-gram with hierarchical softmax architecture on multi-core shared-memory CPU systems and on modern NVIDIA GPUs with CUDA. On multi-core CPUs, we accomplish this by batching training operations to increase thread locality and to reduce accesses to shared memory. We then propose new heterogeneous NVIDIA GPU CUDA implementations of both the skip-gram hierarchical softmax and negative sampling techniques that exploit shared memory, registers, and in-warp shuffle operations for maximum performance. Our GPU skip-gram with negative sampling approach produces higher-quality word vectors than previous GPU implementations, and our flexible skip-gram with hierarchical softmax implementation achieves a factor of 10 speedup over existing methods.
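The abstract does not include implementation details, but the in-warp shuffle technique it mentions typically replaces shared-memory reductions when lanes of a warp cooperate on one embedding vector. The sketch below is an illustrative assumption, not the authors' code: it shows a warp-level dot product of the kind a skip-gram negative-sampling kernel might compute between a target word vector and a context (or sample) vector, with each lane covering a strided slice of the embedding dimension. The kernel names, launch configuration, and embedding size are hypothetical.

```cuda
// Minimal sketch (not the paper's implementation): warp-cooperative dot
// product using in-warp shuffle reduction, as used when one warp processes
// one (target, context) pair in skip-gram training.
#include <cuda_runtime.h>
#include <cstdio>

#define FULL_MASK 0xffffffffu

__device__ float warpDot(const float *a, const float *b, int dim) {
    int lane = threadIdx.x & 31;
    float sum = 0.0f;
    // Each lane accumulates a strided slice of the embedding dimension.
    for (int i = lane; i < dim; i += 32)
        sum += a[i] * b[i];
    // In-warp shuffle reduction: combine partial sums without shared memory.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(FULL_MASK, sum, offset);
    // Lane 0 now holds the full dot product; broadcast it to all lanes.
    return __shfl_sync(FULL_MASK, sum, 0);
}

__global__ void dotKernel(const float *wordVec, const float *ctxVec,
                          int dim, float *out) {
    float d = warpDot(wordVec, ctxVec, dim);
    if ((threadIdx.x & 31) == 0) *out = d;
}

int main() {
    const int dim = 128;  // hypothetical embedding size
    float h[dim], *dA, *dB, *dOut, result;
    for (int i = 0; i < dim; ++i) h[i] = 1.0f;
    cudaMalloc(&dA, dim * sizeof(float));
    cudaMalloc(&dB, dim * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dA, h, dim * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, h, dim * sizeof(float), cudaMemcpyHostToDevice);
    dotKernel<<<1, 32>>>(dA, dB, dim, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f (expected %d)\n", result, dim);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    return 0;
}
```

In a full training kernel, the resulting dot product would feed a sigmoid and gradient update applied by all lanes in parallel; keeping the reduction inside the warp avoids shared-memory traffic and synchronization, which is the performance rationale the abstract alludes to.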