Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval: Latest Publications

Semantic Preserving Siamese Autoencoder for Binary Quantization of Word Embeddings
Wouter Mostard, Lambert Schomaker, M. Wiering
DOI: 10.1145/3508230.3508235
Abstract: Word embeddings are used as building blocks for a wide range of natural language processing and information retrieval tasks. These embeddings are usually represented as continuous vectors, requiring significant memory capacity and computationally expensive similarity measures. In this study, we introduce a novel method for semantic hashing continuous vector representations into lower-dimensional Hamming space while explicitly preserving semantic information between words. This is achieved by introducing a Siamese autoencoder combined with a novel semantic preserving loss function. We show that our quantization model induces only a 4% loss of semantic information over continuous representations and outperforms the baseline models on several word similarity and sentence classification tasks. Finally, we show through cluster analysis that our method learns binary representations where individual bits hold interpretable semantic information. In conclusion, binary quantization of word embeddings significantly decreases time and space requirements while offering new possibilities through exploiting the semantic information of individual bits in downstream information retrieval tasks.
Published: 2021-12-17
Citations: 0
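The Hamming-space comparison this abstract relies on can be shown with a toy sketch. This is an editorial illustration, not the paper's model: threshold-at-the-mean binarization stands in for the learned Siamese autoencoder, and the four-dimensional "embeddings" are invented.

```python
# Binarize continuous word vectors by thresholding each dimension at the
# per-dimension mean, then compare codes by Hamming distance instead of
# cosine similarity over floats.

def binarize(vector, means):
    """Map each component to a bit: 1 if above that dimension's mean."""
    return [1 if v > m else 0 for v, m in zip(vector, means)]

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return sum(x != y for x, y in zip(a, b))

# Toy 4-dimensional "embeddings"
vectors = {
    "king":  [0.9, 0.1, 0.8, 0.2],
    "queen": [0.8, 0.2, 0.9, 0.1],
    "apple": [0.1, 0.9, 0.2, 0.8],
}
dims = len(next(iter(vectors.values())))
means = [sum(v[d] for v in vectors.values()) / len(vectors) for d in range(dims)]

codes = {w: binarize(v, means) for w, v in vectors.items()}
print(hamming(codes["king"], codes["queen"]))  # → 0 (related words share a code)
print(hamming(codes["king"], codes["apple"]))  # → 4 (unrelated word flips every bit)
```

Once codes are this compact, similarity search reduces to popcounts over bit strings, which is the time and space saving the abstract claims.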
A Contrastive Study on Linguistic Features between HT and MT based on NLPIR-ICTCLAS: A Case Study of Philosophical Text
Yumei Ge, Bin Xu
DOI: 10.1145/3508230.3508240
Abstract: This paper, with the aid of NLPIR-ICTCLAS, analyzes and compares original English texts and different translation versions of a philosophical text. A 1:6 English-Chinese translation corpus is used to study the linguistic structural features of human translation (HT) and machine translation (MT). The study shows that HT is characterized by more complicated language and complex sentences. In the process of translation, compared with MT engines, human translators can intentionally avoid using too many functional words, conveying the grammatical structures and logical relations of sentences mainly through the meanings of words or clauses. The five MT versions share similarities in their use of notional and functional words.
Published: 2021-12-17
Citations: 0
Text Sentiment Analysis based on BERT and Convolutional Neural Networks
Ping Huang, Huijuan Zhu, Lei Zheng, Ying Wang
DOI: 10.1145/3508230.3508231
Abstract: The rapid development of networks has accelerated the circulation of information. Analyzing the emotional tendency of network text is very helpful for identifying users' needs. However, most existing sentiment classification models rely on manually labeled text features, leaving the deep semantic features hidden in the text insufficiently mined and making significant improvements in classification performance difficult. This paper presents a text sentiment classification model combining BERT and convolutional neural networks (CNN). The model uses BERT to produce word embeddings of the text, then uses a CNN to learn deep semantic information, so as to mine the emotional tendency of the text. Verified on the Large Movie Review dataset, the BERT-CNN model achieves an accuracy of 86.67%, significantly better than the traditional textCNN classification method. The results show that the method performs well in this field.
Published: 2021-12-17
Citations: 3
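The CNN half of the pipeline described above can be sketched in miniature. This is an editorial illustration: the token vectors and filter weights below are made up, the BERT encoder is stubbed out, and a real model would learn many filters of several widths.

```python
# A 1-D convolution slides a window over the token embeddings, and
# max-over-time pooling collapses the feature map to one scalar per filter;
# concatenating such scalars gives the fixed-size vector fed to the classifier.

def conv1d_max_pool(embeddings, kernel, width):
    """Apply one convolution filter of the given width, then max-pool."""
    feats = []
    for i in range(len(embeddings) - width + 1):
        window = embeddings[i:i + width]
        # dot product of the flattened window with the filter weights
        flat = [x for vec in window for x in vec]
        feats.append(sum(w * x for w, x in zip(kernel, flat)))
    return max(feats)

# Pretend these came from a BERT encoder: 5 tokens, 3-dim embeddings.
tokens = [[0.2, 0.1, 0.0], [0.9, 0.7, 0.3], [0.1, 0.0, 0.2],
          [0.5, 0.4, 0.6], [0.0, 0.1, 0.1]]
kernel = [1.0, 0.5, -0.2, 0.3, 1.0, 0.1]  # one width-2 filter over 3-dim vectors

feature = conv1d_max_pool(tokens, kernel, width=2)
print(round(feature, 3))  # → 1.25, the strongest filter response in the sequence
```

Max-over-time pooling is what lets the same filter bank handle reviews of any length, which matters for documents as long as movie reviews.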
Query Disambiguation to Enhance Biomedical Information Retrieval Based on Neural Networks
Wided Selmi, Hager Kammoun, Ikram Amous
DOI: 10.1145/3508230.3508253
Abstract: Information Retrieval Systems (IRS) use a query to find relevant documents. Often a query term has more than one sense; this is known as the ambiguity problem, and it is a cause of poor IRS performance. Word Sense Disambiguation (WSD) deals with choosing the right sense of an ambiguous term, among a set of given candidate senses, according to its context (surrounding text). Obtaining all candidate senses is therefore a challenge for WSD. Word Sense Induction (WSI) automatically induces the different senses of a target word from its different contexts. In this work, we propose a biomedical query disambiguation method in which WSI uses the K-means algorithm to cluster the different contexts of an ambiguous query term (a MeSH descriptor) in order to induce its senses. The contexts are sentences extracted from PubMed containing the target MeSH descriptor. To represent sentences as vectors, we use the contextualized embedding model BioBERT. Our method is derived from the intuitive idea that the correct sense is the candidate sense with the highest similarity to the ambiguous term's context. Experiments conducted on the OHSUMED test collection yielded significant results.
Published: 2021-12-17
Citations: 0
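The selection step, picking the candidate sense most similar to the query's context, can be sketched as follows. This is an editorial illustration: the sense labels and vectors are invented rather than induced from PubMed, and cosine similarity over toy centroids stands in for BioBERT representations.

```python
# Given candidate sense centroids (e.g. from K-means over context sentences),
# assign the sense whose centroid is most similar to the context vector.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def disambiguate(context_vec, sense_centroids):
    """Return the sense label with the highest cosine similarity."""
    return max(sense_centroids,
               key=lambda s: cosine(context_vec, sense_centroids[s]))

# Toy centroids for an ambiguous term such as "culture"
senses = {
    "cell_culture": [0.9, 0.1, 0.0],
    "society":      [0.1, 0.8, 0.3],
}
context = [0.85, 0.15, 0.05]  # context vector leaning toward the lab sense
print(disambiguate(context, senses))  # → cell_culture
```

In the paper's setting the centroids would come from clustering BioBERT sentence vectors; here the principle is the same with hand-picked numbers.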
Retrieval-based End-to-End Tamil language Conversational Agent for Closed Domain using Machine Learning
Kumaran Kugathasan, Uthayasanker Thayasivam
DOI: 10.1145/3508230.3508251
Abstract: Businesses around the world have started to adopt text-based conversational agents to provide a good customer experience as an alternative to expensive customer service agents. Building a conversational agent is comparatively easy for businesses serving customers who speak high-resource languages like English, since many paid and open-source chatbot frameworks are available. For a low-resource language like Tamil, there is no such framework support, and the approaches proposed in research on high-resource-language chatbots are not suitable due to the lack of language resources. This paper proposes a new approach for building a Tamil conversational agent using a dataset scraped from an FAQ corpus and expanded to capture the morphological richness and highly inflectional nature of the Tamil language. Each question is mapped to an intent, and a multiclass intent classifier is built to identify the intent of the user. A CNN-based classifier performed best, with 98.72% accuracy.
Published: 2021-12-17
Citations: 0
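The retrieval-based setup, mapping a new query to the intent of the most similar stored FAQ question, can be sketched as below. This is an editorial illustration with English strings and Jaccard word overlap standing in for the paper's Tamil data and learned CNN classifier; the FAQ entries and intent names are invented.

```python
# Each stored question carries an intent label; a new query receives the
# intent of its most lexically similar stored question.

def jaccard(a, b):
    """Word-overlap similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

faq = {
    "how do i reset my password": "account_reset",
    "what are your opening hours": "hours",
    "how can i change my password": "account_reset",
}

def classify(query):
    best = max(faq, key=lambda q: jaccard(query, q))
    return faq[best]

print(classify("i want to reset my password"))  # → account_reset
```

For a highly inflectional language like Tamil, raw word overlap would miss morphological variants, which is precisely why the paper expands the corpus and trains a classifier instead.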
Method of Graphical User Interface Adaptation Using Reinforcement Learning and Automated Testing
Victor Fyodorov, A. Karsakov
DOI: 10.1145/3508230.3508255
Abstract: Graphical user interface adaptation is becoming an increasingly time-consuming and resource-intensive task due to the complexity of modern programs and the wide variety of information output devices. In this paper we propose a method for adapting a graphical user interface based on a person's workflow with a specific implementation of the interface. The method makes it possible to adapt the interface to the peculiarities of the user's workflow by optimizing navigation between program windows.
Published: 2021-12-17
Citations: 0
Annotation and Evaluation of Utterance Intention Tag for Interview Dialogue Corpus
M. Sasayama, Kazuyuki Matsumoto
DOI: 10.1145/3508230.3508236
Abstract: In this paper, we propose utterance intention tags for an interview dialogue corpus and construct a corpus annotated with the tags we designed. Three or five annotators annotated the corpus (49,999 utterances across 30 dialogues) with the tags. We conducted an evaluation experiment using Fleiss's kappa to assess the reliability of the proposed tags: when three annotators assigned 18 different tags to the corpus, we obtained a kappa value of 0.55.
Published: 2021-12-17
Citations: 1
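Fleiss's kappa, the agreement measure reported above, can be computed as follows. The count matrix here is toy data (4 utterances, 3 annotators, 3 candidate tags), not the corpus annotations.

```python
# Fleiss's kappa: observed per-item agreement corrected by the agreement
# expected from the marginal tag distribution alone.

def fleiss_kappa(matrix):
    """matrix[i][j] = number of annotators assigning tag j to item i."""
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    # mean per-item agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in matrix
    ) / n_items
    # chance agreement from the marginal tag proportions
    totals = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

counts = [
    [3, 0, 0],  # all three annotators chose tag 0
    [0, 3, 0],
    [2, 1, 0],  # one annotator disagreed
    [0, 0, 3],
]
print(round(fleiss_kappa(counts), 3))  # → 0.745
```

A kappa of 0.55 with 18 candidate tags, as the paper reports, sits in the "moderate agreement" band of the usual interpretation scales.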
STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering
Maryam Mousavi, Elena Steiner, S. Corman, Scott W. Ruston, Dylan Weber, H. Davulcu
DOI: 10.1145/3508230.3508247
Abstract: In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010 and 2020. We extracted 32,429 noun phrases (NPs) and split them into two categories: general/ambiguous phrases, which might appear under any topic, and topical/non-ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups using the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and induced a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, the coders mapped all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures, and determined that GloVe embeddings with K-Means achieved the highest performance (74% purity) on this real-world dataset.
Published: 2021-12-17
Citations: 0
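The purity figure cited above is a standard extrinsic clustering measure: each cluster is credited with its most frequent gold label, and purity is the fraction of items covered that way. A minimal sketch with invented phrase ids and labels:

```python
# Purity of a clustering against gold-standard labels.
from collections import Counter

def purity(clusters, gold_labels):
    """clusters: list of lists of item ids; gold_labels: id -> gold class."""
    total = sum(len(c) for c in clusters)
    correct = sum(
        Counter(gold_labels[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return correct / total

# Toy example: two induced clusters over six phrases
clusters = [["p1", "p2", "p3"], ["p4", "p5", "p6"]]
gold = {"p1": "military", "p2": "military", "p3": "politics",
        "p4": "politics", "p5": "politics", "p6": "politics"}
print(purity(clusters, gold))  # 5 of 6 phrases match their cluster's majority label
```

Purity is easy to read but rewards many small clusters, which is why the paper pairs it with intrinsic measures when comparing embedding and clustering choices.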
Named Entity Recognition using Knowledge Graph Embeddings and DistilBERT
Shreya R. Mehta, Mansi A. Radke, Sagar Sunkle
DOI: 10.1145/3508230.3508252
Abstract: Named Entity Recognition (NER) is the Natural Language Processing (NLP) task of identifying entities in natural language text and classifying them into categories such as Person, Location, and Organization. Pre-trained neural language models (PNLMs) based on transformers are state-of-the-art in many NLP tasks, including NER. Analysis of the output of DistilBERT, a popular PNLM, reveals that misclassifications occur when a non-entity word appears in a position contextually suited to an entity. This paper is based on the hypothesis that the performance of a PNLM can be improved by combining it with Knowledge Graph Embeddings (KGE). We show that fine-tuning DistilBERT together with NumberBatch KGE yields performance improvements on various open-domain as well as biomedical-domain datasets.
Published: 2021-12-17
Citations: 1
A Study of Predicting the Sincerity of a Question Asked Using Machine Learning
T. Nguyen, P. Meesad
DOI: 10.1145/3508230.3508258
Abstract: The growth of applications makes it increasingly difficult to assess whether a question is sincere, an assessment that is mandatory for many marketing and financial companies. Many applications, especially those handling text and images, are being reconfigured beyond recognition, while others face potential extinction as a corollary of advances in technology and computer science. Analyzing text and image data is truly needed for extracting valuable insights. In this paper, we analyzed the Quora dataset obtained from Kaggle.com to filter insincere and spam content. We used different preprocessing algorithms and analysis models provided in PySpark, analyzed the way users write their posts via the proposed prediction models, and identified the most accurate of the selected algorithms for classifying questions on Quora. The Gradient Boosted Tree was the best model, with an accuracy of 79.5%, followed by Long Short-Term Memory (LSTM) at 78.0%. Compared with the same models built in Scikit-Learn and with GRU, BiLSTM, and BiGRU, applying the models in PySpark gave better results in classifying questions on Quora.
Published: 2021-12-17
Citations: 4