Zichu Ai, Jie Mei, A. Mohammad, N. Zeh, Meng He, E. Milios
Proceedings of the 2017 ACM Symposium on Document Engineering. Published 2017-08-31. https://doi.org/10.1145/3103010.3121039
Citations: 1
High-performance Computational Framework for Phrase Relatedness
TrWP is a text relatedness measure that computes the semantic similarity between words and phrases using aggregated statistics from the Google Web 1T 5-gram corpus. Phrase similarity computation in TrWP is costly in both time and space, which makes the existing implementation impractical for real-world use. In this work, we present an in-memory computational framework for TrWP that speeds up corpus search with perfect hashing and reduces memory cost with variable-length encoding. In an evaluation on the Google Web 1T 5-gram corpus, our framework runs several orders of magnitude faster than a file-based implementation.
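To make the two techniques named in the abstract concrete, the following Python sketch combines a toy perfect hash over a fixed phrase set with variable-length (7 bits per byte) encoding of the stored counts. It is an illustration under stated assumptions, not the authors' implementation: the function names (`encode_varint`, `decode_varint`, `slot`, `build_perfect_hash`, `lookup`), the brute-force seed search, the table size of twice the key count, and the phrase counts are all hypothetical. A framework at the scale of the Web 1T corpus would need a scalable perfect-hash construction.

```python
import hashlib


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer with 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)  # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def slot(key: str, seed: int, size: int) -> int:
    """Map a key to a table slot using a seeded, deterministic hash."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % size


def build_perfect_hash(keys: list[str], max_seed: int = 100_000) -> tuple[int, int]:
    """Search for a seed that sends every key to a distinct slot.

    Brute force is only practical for tiny key sets; it is used here
    purely to illustrate the idea of a collision-free hash.
    """
    unique = set(keys)
    size = 2 * len(unique)  # assumed load factor of 0.5
    for seed in range(max_seed):
        if len({slot(k, seed, size) for k in unique}) == len(unique):
            return seed, size
    raise RuntimeError("no collision-free seed found")


# Made-up counts for illustration only; not taken from any corpus.
counts = {
    "new york": 1234567,
    "machine learning": 98765,
    "phrase relatedness": 42,
    "web corpus": 300,
}

seed, size = build_perfect_hash(list(counts))
table = [b""] * size
for phrase, count in counts.items():
    table[slot(phrase, seed, size)] = encode_varint(count)


def lookup(phrase: str) -> int:
    """Return the stored count; only valid for phrases in the key set."""
    value, _ = decode_varint(table[slot(phrase, seed, size)])
    return value
```

Because the hash is collision-free on the known key set, each lookup is a single hash evaluation plus one table access, with no key comparison or probing. Small counts occupy one byte and larger counts grow only as needed. The table stores no keys, so a phrase outside the original set either hits an empty slot or silently returns another phrase's count; a real system would need a membership check if such queries can occur.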