{"title":"Automatic construction of direction-aware sentiment lexicon using direction-dependent words","authors":"Jihye Park, Hye Jin Lee, Sungzoon Cho","doi":"10.1007/s10579-024-09737-9","DOIUrl":"https://doi.org/10.1007/s10579-024-09737-9","url":null,"abstract":"<p>Explainability, which is the degree to which an interested stakeholder can understand the key factors that led to a data-driven model’s decision, has been considered an essential consideration in the financial domain. Accordingly, lexicons that can achieve reasonable performance and provide clear explanations to users have been among the most popular resources in sentiment-based financial forecasting. Since deep learning-based techniques have limitations in that the basis for interpreting the results is unclear, lexicons have consistently attracted the community’s attention as a crucial tool in studies that demand explanations for the sentiment estimation process. One of the challenges in the construction of a financial sentiment lexicon is the domain-specific feature that the sentiment orientation of a word can change depending on the application of directional expressions. For instance, the word “cost” typically conveys a negative sentiment; however, when the word is juxtaposed with “decrease” to form the phrase “cost decrease,” the associated sentiment is positive. Several studies have manually built lexicons containing directional expressions. However, they have been hindered because manual inspection inevitably requires intensive human labor and time. In this study, we propose to automatically construct the “sentiment lexicon composed of direction-dependent words,” which expresses each term as a pair consisting of a directional word and a direction-dependent word. Experimental results show that the proposed sentiment lexicon yields enhanced classification performance, proving the effectiveness of our method for the automated construction of a direction-aware sentiment lexicon.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"63 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-linguistically consistent semantic and syntactic annotation of child-directed speech","authors":"Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman","doi":"10.1007/s10579-024-09734-y","DOIUrl":"https://doi.org/10.1007/s10579-024-09734-y","url":null,"abstract":"<p>Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown’s Adam corpus (English; we annotate <span>(approx)</span> 80% of its child-directed utterances), all child-directed utterances from Berman’s Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. 
We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"70 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141063620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
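As a rough illustration of the UD-to-LF transduction step, the sketch below derives a predicate-argument logical form from the core relations of a toy UD analysis. The real transduction method handles far more phenomena; the parse, the relation set, and the output format here are simplifying assumptions.

```python
# Simplified sketch of transducing a logical form from a UD parse.
# Only core arguments (nsubj, obj, iobj) of the root predicate are mapped.

# A toy UD analysis of "you ate the apple": (id, form, head, deprel)
ud_tokens = [
    (1, "you",   2, "nsubj"),
    (2, "ate",   0, "root"),
    (3, "the",   4, "det"),
    (4, "apple", 2, "obj"),
]


def transduce_lf(tokens):
    """Build a predicate(arg1, arg2, ...) string from core UD relations."""
    root = next(t for t in tokens if t[3] == "root")
    args = [t[1] for t in tokens
            if t[2] == root[0] and t[3] in ("nsubj", "obj", "iobj")]
    return f"{root[1]}({', '.join(args)})"


print(transduce_lf(ud_tokens))  # ate(you, apple)
```
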
{"title":"Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain","authors":"Brayan Stiven Lancheros, Gloria Corpas Pastor, Ruslan Mitkov","doi":"10.1007/s10579-024-09738-8","DOIUrl":"https://doi.org/10.1007/s10579-024-09738-8","url":null,"abstract":"<p>Given the increase in production of data for the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one of such IE tasks useful for professionals in different areas. There are several settings where biomedical NER is needed, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges including its high cost of annotation, ambiguity, and lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages the progress in Neural Machine Translation (NMT) to create a synthetic version of the Colorado Richly Annotated Full-Text (CRAFT) dataset in Spanish. Additionally, a new CRAFT dataset is constructed by replacing 20% of the entities in the original dataset generating a new augmented dataset. We evaluate two training methods: concatenation of datasets and continuous training to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best performing NER system in the development set achieved an F-1 score of 86.39%. The novel methodology proposed in this paper presents the first bilingual NER system and it has the potential to improve applications across under-resourced languages.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"47 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140942053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Features in extractive supervised single-document summarization: case of Persian news","authors":"Hosein Rezaei, Seyed Amid Moeinzadeh Mirhosseini, Azar Shahgholian, Mohamad Saraee","doi":"10.1007/s10579-024-09739-7","DOIUrl":"https://doi.org/10.1007/s10579-024-09739-7","url":null,"abstract":"<p>Text summarization has been one of the most challenging areas of research in NLP. Much effort has been made to overcome this challenge by using either abstractive or extractive methods. Extractive methods are preferable due to their simplicity compared with the more elaborate abstractive methods. In extractive supervised single-document approaches, the system will not generate sentences. Instead, via supervised learning, it learns how to score sentences within the document based on some textual features and subsequently selects those with the highest rank. Therefore, the core objective is ranking, which enormously depends on the document structure and context. These dependencies have been unnoticed by many state-of-the-art solutions. In this work, document-related features such as topic and relative length are integrated into the vectors of every sentence to enhance the quality of summaries. Our experiment results show that the system takes contextual and structural patterns into account, which will increase the precision of the learned model. Consequently, our method will produce more comprehensive and concise summaries.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"130 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140930376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mismatching-aware unsupervised translation quality estimation for low-resource languages","authors":"Fatemeh Azadi, Heshaam Faili, Mohammad Javad Dousti","doi":"10.1007/s10579-024-09727-x","DOIUrl":"https://doi.org/10.1007/s10579-024-09727-x","url":null,"abstract":"<p>Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in the practical applications of MT. In this paper, we first propose XLMRScore, which is a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, nevertheless facing two issues: firstly, the untranslated tokens leading to unexpectedly high translation scores, and secondly, the issue of mismatching errors between source and hypothesis tokens when applying the greedy matching in XLMRScore. To mitigate these issues, we suggest replacing untranslated words with the unknown token and the cross-lingual alignment of the pre-trained model to represent aligned words closer to each other, respectively. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English<span>(rightarrow)</span>Persian (En-Fa) test dataset introduced in this paper. Experiments show that our method could get comparable results with the supervised baseline for two zero-shot scenarios, i.e., with less than 0.01 difference in Pearson correlation, while outperforming unsupervised rivals in all the low-resource language pairs for above 8%, on average.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"128 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140930096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Arabic sentiment analysis across context-aware attention deep model based on natural language processing","authors":"Abubakr H. Ombabi, Wael Ouarda, Adel M. Alimi","doi":"10.1007/s10579-024-09741-z","DOIUrl":"https://doi.org/10.1007/s10579-024-09741-z","url":null,"abstract":"<p>With the enormous growth of social data in recent years, sentiment analysis has gained increasing research attention and has been widely explored in various languages. Arabic language nature imposes several challenges, such as the complicated morphological structure and the limited resources, Thereby, the current state-of-the-art methods for sentiment analysis remain to be enhanced. This inspired us to explore the application of the emerging deep-learning architecture to Arabic text classification. In this paper, we present an ensemble model which integrates a convolutional neural network, bidirectional long short-term memory (Bi-LSTM), and attention mechanism, to predict the sentiment orientation of Arabic sentences. The convolutional layer is used for feature extraction from the higher-level sentence representations layer, the BiLSTM is integrated to further capture the contextual information from the produced set of features. Two attention mechanism units are incorporated to highlight the critical information from the contextual feature vectors produced by the Bi-LSTM hidden layers. The context-related vectors generated by the attention mechanism layers are then concatenated and passed into a classifier to predict the final label. To disentangle the influence of these components, the proposed model is validated as three variant architectures on a multi-domains corpus, as well as four benchmarks. Experimental results show that incorporating Bi-LSTM and attention mechanism improves the model’s performance while yielding 96.08% in accuracy. Consequently, this architecture consistently outperforms the other State-of-The-Art approaches with up to + 14.47%, + 20.38%, and + 18.45% improvements in accuracy, precision, and recall respectively. These results demonstrated the strengths of this model in addressing the challenges of text classification tasks.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"8 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140809633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text","authors":"Shankar Biradar, Sunil Saumya, Arun Chauhan","doi":"10.1007/s10579-024-09732-0","DOIUrl":"https://doi.org/10.1007/s10579-024-09732-0","url":null,"abstract":"<p>Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently being dominated by the fake. Therefore, it has become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD) comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate a relation between the presence of fake narratives and its impact on the intensity of the hate, this study presents a statistical analysis using the Chi-square test. The statistical findings indicate that the calculated <span>(chi ^2)</span> value is greater than the value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study present baseline methods for categorizing multi-class and multi-label data set, utilizing syntactical and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms others models with an accuracy of 71% and 58% for binary fake–hate and severity prediction respectively.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"38 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140610404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Depression symptoms modelling from social media text: an LLM driven semi-supervised learning approach","authors":"Nawshad Farruque, Randy Goebel, Sudhakar Sivapalan, Osmar R. Zaïane","doi":"10.1007/s10579-024-09720-4","DOIUrl":"https://doi.org/10.1007/s10579-024-09720-4","url":null,"abstract":"<p>A fundamental component of user-level social media language based clinical depression modelling is depression symptoms detection (DSD). Unfortunately, there does not exist any DSD dataset that reflects both the clinical insights and the distribution of depression symptoms from the samples of self-disclosed depressed population. In our work, we describe a semi-supervised learning (SSL) framework which uses an initial supervised learning model that leverages (1) a state-of-the-art large mental health forum text pre-trained language model further fine-tuned on a clinician annotated DSD dataset, (2) a Zero-Shot learning model for DSD, and couples them together to harvest depression symptoms related samples from our large self-curated depressive tweets repository (DTR). Our clinician annotated dataset is the largest of its kind. Furthermore, DTR is created from the samples of tweets in self-disclosed depressed users Twitter timeline from two datasets, including one of the largest benchmark datasets for user-level depression detection from Twitter. This further helps preserve the depression symptoms distribution of self-disclosed tweets. Subsequently, we iteratively retrain our initial DSD model with the harvested data. We discuss the stopping criteria and limitations of this SSL process, and elaborate the underlying constructs which play a vital role in the overall SSL process. We show that we can produce a final dataset which is the largest of its kind. Furthermore, a DSD and a Depression Post Detection model trained on it achieves significantly better accuracy than their initial version.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"86 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140577863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A longitudinal multi-modal dataset for dementia monitoring and diagnosis","authors":"","doi":"10.1007/s10579-023-09718-4","DOIUrl":"https://doi.org/10.1007/s10579-023-09718-4","url":null,"abstract":"<h3>Abstract</h3> <p>Dementia affects cognitive functions of adults, including memory, language, and behaviour. Standard diagnostic biomarkers such as MRI are costly, whilst neuropsychological tests suffer from sensitivity issues in detecting dementia onset. The analysis of speech and language has emerged as a promising and non-intrusive technology to diagnose and monitor dementia. Currently, most work in this direction ignores the multi-modal nature of human communication and interactive aspects of everyday conversational interaction. Moreover, most studies ignore changes in cognitive status over time due to the lack of consistent longitudinal data. Here we introduce a novel fine-grained longitudinal multi-modal corpus collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions. The corpus consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We present the data collection process and describe the corpus in detail. Furthermore, we establish baselines for capturing longitudinal changes in language across different modalities for two cohorts, healthy controls and people with dementia, outlining future research directions enabled by the corpus.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"42 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140577783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DILLo: an Italian lexical database for speech-language pathologists","authors":"Federica Beccaria, Angela Cristiano, Flavio Pisciotta, Noemi Usardi, Elisa Borgogni, Filippo Prayer Galletti, Giulia Corsi, Lorenzo Gregori, Gloria Gagliardi","doi":"10.1007/s10579-024-09722-2","DOIUrl":"https://doi.org/10.1007/s10579-024-09722-2","url":null,"abstract":"<p>A novel lexical resource for treating speech impairments from childhood to senility: DILLo—<i>Database Italiano del Lessico per Logopedisti</i> (i.e., Italian Database for Speech-Language Pathologists) is presented. DILLo is a free online web application that allows extraction of filtered wordlists for flexible rehabilitative purposes. Its major aim is to provide Italian speech-language pathologists (SLPs) with a resource that takes advantage of Information and Communication Technologies for language in a healthcare setting. DILLo’s design adopts an integrated approach that envisages fruitful cooperation between clinical and linguistic professionals. The 7690 Italian words in the database have been selected based on phonological, phonotactic, and morphological properties, and their frequency of use. These linguistic features are encoded in the tool, which includes the orthographic and phonological transcriptions, and the phonotactic structure of each word. Moreover, most of the entries are associated with their respective ARASAAC pictogram, providing an additional and inclusive tool for treating speech impairments. The user-friendly interface is structured to allow for different and adaptable search options. DILLo allows Speech-Language Pathologists (SLPs) to obtain a rich, tailored, and varied selection of suitable linguistic stimuli. It can be used to customize the treatment of many impairments, e.g., Speech Sound Disorders, Childhood Apraxia of Speech, Specific Learning Disabilities, aphasia, dysarthria, dysphonia, and the auditory training that follows cochlear implantations.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"80 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140200878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}