{"title":"A new benchmark dataset with production methodology for short text semantic similarity algorithms","authors":"J. O'Shea, Z. Bandar, Keeley A. Crockett","doi":"10.1145/2537046","DOIUrl":"https://doi.org/10.1145/2537046","url":null,"abstract":"This research presents a new benchmark dataset for evaluating Short Text Semantic Similarity (STSS) measurement algorithms and the methodology used for its creation. The power of the dataset is evaluated by using it to compare two established algorithms, STASIS and Latent Semantic Analysis. This dataset focuses on measures for use in Conversational Agents; other potential applications include email processing and data mining of social networks. Such applications involve integrating the STSS algorithm in a complex system, but STSS algorithms must be evaluated in their own right and compared with others for their effectiveness before systems integration. Semantic similarity is an artifact of human perception; therefore its evaluation is inherently empirical and requires benchmark datasets derived from human similarity ratings. The new dataset of 64 sentence pairs, STSS-131, has been designed to meet these requirements drawing on a range of resources from traditional grammar to cognitive neuroscience. The human ratings are obtained from a set of trials using new and improved experimental methods, with validated measures and statistics. The results illustrate the increased challenge and the potential longevity of the STSS-131 dataset as the Gold Standard for future STSS algorithm evaluation.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124685172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining co-clustering with noise detection for theme-based summarization","authors":"Xiaoyan Cai, Wenjie Li, Renxian Zhang","doi":"10.1145/2513563","DOIUrl":"https://doi.org/10.1145/2513563","url":null,"abstract":"To overcome the fact that the length of sentences is short and their content is limited, we regard words as independent text objects rather than features of sentences in sentence clustering and develop two co-clustering frameworks, namely integrated clustering and interactive clustering, to cluster sentences and words simultaneously. Since real-world datasets always contain noise, we incorporate noise detection and removal to enhance clustering of sentences and words. Meanwhile, a semisupervised approach is explored to incorporate the query information (and the sentence information in early document sets) in theme-based summarization. Thorough experimental studies are conducted. When evaluated on the DUC2005-2007 datasets and TAC 2008-2009 datasets, the performance of the two noise-detecting co-clustering approaches is comparable with that of the top three systems. The results also demonstrate that the interactive with noise detection algorithm is more effective than the noise-detecting integrated algorithm.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114999154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Artem Sokolov, Guillaume Wisniewski, François Yvon
{"title":"Lattice BLEU oracles in machine translation","authors":"Artem Sokolov, Guillaume Wisniewski, François Yvon","doi":"10.1145/2513147","DOIUrl":"https://doi.org/10.1145/2513147","url":null,"abstract":"The search space of Phrase-Based Statistical Machine Translation (PBSMT) systems can be represented as a directed acyclic graph (lattice). By exploring this search space, it is possible to analyze and understand the failures of PBSMT systems. Indeed, useful diagnoses can be obtained by computing the so-called oracle hypotheses, which are hypotheses in the search space that have the highest quality score. For standard SMT metrics, this problem is, however, NP-hard and can only be solved approximately. In this work, we present two new methods for efficiently computing oracles on lattices: the first one is based on a linear approximation of the corpus bleu score and is solved using generic shortest distance algorithms; the second one relies on an Integer Linear Programming (ILP) formulation of the oracle decoding that incorporates count clipping constraints. It can either be solved directly using a standard ILP solver or using Lagrangian relaxation techniques. These new decoders are evaluated and compared with several alternatives from the literature for three language pairs, using lattices produced by two PBSMT systems.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123022010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Composition of semantic relations: Theoretical framework and case study","authors":"Eduardo Blanco, D. Moldovan","doi":"10.1145/2513146","DOIUrl":"https://doi.org/10.1145/2513146","url":null,"abstract":"Extracting semantic relations from text is a preliminary step towards understanding the meaning of text. The more semantic relations are extracted from a sentence, the better the representation of the knowledge encoded into that sentence. This article introduces a framework for the Composition of Semantic Relations (CSR). CSR aims to reveal more text semantics than existing semantic parsers by composing new relations out of previously extracted relations. Semantic relations are defined using vectors of semantic primitives, and an algebra is suggested to manipulate these vectors according to a CSR algorithm. Inference axioms that combine two relations and yield another relation are generated automatically. CSR is a language-agnostic, inventory-independent method to extract semantic relations. The formalism has been applied to a set of 26 well-known relations and results are reported.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123922069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toyomi Meguro, Yasuhiro Minami, Ryuichiro Higashinaka, Kohji Dohsaka
{"title":"Learning to control listening-oriented dialogue using partially observable markov decision processes","authors":"Toyomi Meguro, Yasuhiro Minami, Ryuichiro Higashinaka, Kohji Dohsaka","doi":"10.1145/2513145","DOIUrl":"https://doi.org/10.1145/2513145","url":null,"abstract":"Our aim is to build listening agents that attentively listen to their users and satisfy their desire to speak and have themselves heard. This article investigates how to automatically create a dialogue control component of such a listening agent. We collected a large number of listening-oriented dialogues with their user satisfaction ratings and used them to create a dialogue control component that satisfies users by means of Partially Observable Markov Decision Processes (POMDPs). Using a hybrid dialog controller where high-level dialog acts are chosen with a statistical policy and low-level slot values are populated by a wizard, we evaluated our dialogue control method in a Wizard-of-Oz experiment. The experimental results show that our POMDP-based method achieves significantly higher user satisfaction than other stochastic models, confirming the validity of our approach. This article is the first to verify, by using human users, the usefulness of POMDP-based dialogue control for improving user satisfaction in nontask-oriented dialogue systems.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124581519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cognitive canonicalization of natural language queries using semantic strata","authors":"S. Roy, W. Zeng","doi":"10.1145/2539053","DOIUrl":"https://doi.org/10.1145/2539053","url":null,"abstract":"Natural language search relies strongly on perceiving semantics in a query sentence. Semantics is captured by the relationship among the query words, represented as a network (graph). Such a network of words can be fed into larger ontologies, like DBpedia or Google Knowledge Graph, where they appear as subgraphs— fashioning the name subnetworks (subnets). Thus, subnet is a canonical form for interfacing a natural language query to a graph database and is an integral step for graph-based searching. In this article, we present a novel standalone NLP technique that leverages the cognitive psychology notion of semantic strata for semantic subnetwork extraction from natural language queries. The cognitive model describes some of the fundamental structures employed by the human cognition to construct semantic information in the brain, called semantic strata. We propose a computational model based on conditional random fields to capture the cognitive abstraction provided by semantic strata, facilitating cognitive canonicalization of the query. Our results, conducted on approximately 5000 queries, suggest that the cognitive canonicals based on semantic strata are capable of significantly improving parsing and role labeling performance beyond pure lexical approaches, such as parts-of-speech based techniques. We also find that cognitive canonicalized subnets are more semantically coherent compared to syntax trees when explored in graph ontologies like DBpedia and improve ranking of retrieved documents.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128651304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic interpretation of noun compounds using verbal and other paraphrases","authors":"Preslav Nakov, Marti A. Hearst","doi":"10.1145/2483969.2483975","DOIUrl":"https://doi.org/10.1145/2483969.2483975","url":null,"abstract":"We study the problem of semantic interpretation of noun compounds such as bee honey, malaria mosquito, apple cake, and stem cell. In particular, we explore the potential of using predicates that make explicit the hidden relation that holds between the nouns that form the noun compound. For example, mosquito that carries malaria is a paraphrase of the compound malaria mosquito in which the verb explicitly states the semantic relation between the two nouns. We study the utility of using such paraphrasing verbs, with associated weights, to build a representation of the semantics of a noun compound, for example, malaria mosquito can be represented as follows: carry (23), spread (16), cause (12), transmit (9), and so on. We also explore the potential of using multiple paraphrasing verbs as features for predicting abstract semantic relations such as CAUSE, and we demonstrate that using explicit paraphrases can help improve statistical machine translation.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122606821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields","authors":"M. Constant, Joseph Le Roux, Anthony Sigogne","doi":"10.1145/2483969.2483970","DOIUrl":"https://doi.org/10.1145/2483969.2483970","url":null,"abstract":"The integration of compounds in a parsing procedure has been shown to improve accuracy in an artificial context where such expressions have been perfectly preidentified. This article evaluates two empirical strategies to incorporate such multiword units in a real PCFG-LA parsing context: (1) the use of a grammar including compound recognition, thanks to specialized annotation schemes for compounds; (2) the use of a state-of-the-art discriminative compound prerecognizer integrating endogenous and exogenous features. We show how these two strategies can be combined with word lattices representing possible lexical analyses generated by the recognizer. The proposed systems display significant gains in terms of multiword recognition and often in terms of standard parsing accuracy. Moreover, we show through an Oracle analysis that this combined strategy opens promising new research directions.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125982897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On collocations and topic models","authors":"Jey Han Lau, Timothy Baldwin, D. Newman","doi":"10.1145/2483969.2483972","DOIUrl":"https://doi.org/10.1145/2483969.2483972","url":null,"abstract":"We investigate the impact of preextracting and tokenizing bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and suggest an alternate measure that penalizes model complexity. We show how the Akaike information criterion is a more appropriate measure, which suggests that using a modest number (up to 1000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1000 bigrams also results in improved topic quality over unigram tokenization. Further increases in topic quality can be achieved by using up to 10,000 bigrams, but this is at the cost of a more complex model. We also show that multiword (bigram and longer) named entities give consistent results, indicating that they should be represented as single tokens. This is the first work to explicitly study the effect of n-gram tokenization on LDA topic models, and the first work to make empirical recommendations to topic modelling practitioners, challenging the standard practice of unigram-based tokenization.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"343 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134209051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beata Beigman Klebanov, J. Burstein, Nitin Madnani
{"title":"Sentiment profiles of multiword expressions in test-taker essays: The case of noun-noun compounds","authors":"Beata Beigman Klebanov, J. Burstein, Nitin Madnani","doi":"10.1145/2483969.2483974","DOIUrl":"https://doi.org/10.1145/2483969.2483974","url":null,"abstract":"The property of idiomaticity vs. compositionality of a multiword expression traditionally pertains to the semantic interpretation of the expression. In this article, we consider this property as it applies to the expression's sentiment profile (relative degree of positivity, negativity, and neutrality). Thus, while heart attack is idiomatic in terms of semantic interpretation, the sentiment profile of the expression (strongly negative) can, in fact, be determined from the strongly negative profile of the head word. In this article, we (1) propose a way to measure compositionality of a multiword expression's sentiment profile, and perform the measurement on noun-noun compounds; (2) evaluate the utility of using sentiment profiles of noun-noun compounds in a sentence-level sentiment classification task. We find that the sentiment profiles of noun-noun compounds in test-taker essays tend to be highly compositional and that their incorporation improves the performance of a sentiment classification system.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121970979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}