短语相关性的高性能计算框架

Proceedings of the 2017 ACM Symposium on Document Engineering Pub Date : 2017-08-31 DOI:10.1145/3103010.3121039

Zichu Ai, Jie Mei, A. Mohammad, N. Zeh, Meng He, E. Milios

{"title":"短语相关性的高性能计算框架","authors":"Zichu Ai, Jie Mei, A. Mohammad, N. Zeh, Meng He, E. Milios","doi":"10.1145/3103010.3121039","DOIUrl":null,"url":null,"abstract":"TrWP is a text relatedness measure that computes semantic similarity between words and phrases utilizing aggregated statistics from the Google Web 1T 5-gram corpus. The phrase similarity computation in TrWP is costly in terms of both time and space, making the existing implementation of TrWP impractical for real-world usage. In this work, we present an in-memory computational framework for TrWP, which optimizes the corpus search using perfect hashing and minimizes the required memory cost using variable length encoding. Evaluated using the Google Web 1T 5-gram corpus, we demonstrate that the computational speed of our framework outperforms a file-based implementation by several orders of magnitude.","PeriodicalId":200469,"journal":{"name":"Proceedings of the 2017 ACM Symposium on Document Engineering","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"High-performance Computational Framework for Phrase Relatedness\",\"authors\":\"Zichu Ai, Jie Mei, A. Mohammad, N. Zeh, Meng He, E. Milios\",\"doi\":\"10.1145/3103010.3121039\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"TrWP is a text relatedness measure that computes semantic similarity between words and phrases utilizing aggregated statistics from the Google Web 1T 5-gram corpus. The phrase similarity computation in TrWP is costly in terms of both time and space, making the existing implementation of TrWP impractical for real-world usage. In this work, we present an in-memory computational framework for TrWP, which optimizes the corpus search using perfect hashing and minimizes the required memory cost using variable length encoding. Evaluated using the Google Web 1T 5-gram corpus, we demonstrate that the computational speed of our framework outperforms a file-based implementation by several orders of magnitude.\",\"PeriodicalId\":200469,\"journal\":{\"name\":\"Proceedings of the 2017 ACM Symposium on Document Engineering\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM Symposium on Document Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3103010.3121039\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3103010.3121039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

TrWP是一种文本相关性度量，它利用来自Google Web 1T 5克语料库的汇总统计数据计算单词和短语之间的语义相似性。TrWP中的短语相似度计算在时间和空间上都是昂贵的，使得TrWP的现有实现不适合实际使用。在这项工作中，我们提出了一个TrWP的内存计算框架，它使用完美哈希优化语料库搜索，并使用可变长度编码最小化所需的内存成本。使用Google Web 1T 5克语料库进行评估，我们证明了我们的框架的计算速度比基于文件的实现高出几个数量级。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

High-performance Computational Framework for Phrase Relatedness

TrWP is a text relatedness measure that computes semantic similarity between words and phrases utilizing aggregated statistics from the Google Web 1T 5-gram corpus. The phrase similarity computation in TrWP is costly in terms of both time and space, making the existing implementation of TrWP impractical for real-world usage. In this work, we present an in-memory computational framework for TrWP, which optimizes the corpus search using perfect hashing and minimizes the required memory cost using variable length encoding. Evaluated using the Google Web 1T 5-gram corpus, we demonstrate that the computational speed of our framework outperforms a file-based implementation by several orders of magnitude.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2017 ACM Symposium on Document Engineering

自引率

0.00%

发文量