{"title":"Conversations in the wild: Data collection, automatic generation and evaluation","authors":"Nimra Zaheer , Agha Ali Raza , Mudassir Shabbir","doi":"10.1016/j.csl.2024.101699","DOIUrl":"10.1016/j.csl.2024.101699","url":null,"abstract":"<div><p>The aim of conversational speech processing is to analyze human conversations in natural settings. It finds numerous applications in personality traits identification, speech therapy, speaker identification and verification, speech emotion detection, and speaker diarization. However, large-scale annotated datasets required for feature extraction and conversational model training only exist for a handful of languages (e.g. English, Mandarin, and French) as the gathering, cleaning, and annotation of such datasets is tedious, time-consuming, and expensive. We propose two scalable, language-agnostic algorithms for automatically generating multi-speaker, variable-length, spontaneous conversations. These algorithms synthesize conversations using existing non-conversational speech datasets. We also contribute the resulting datasets (283 hours, 50 speakers). As a comparison, we also gathered the first spontaneous conversational dataset for Urdu (24 hours, 212 speakers) from public talk shows. Using speaker diarization as an example, we evaluate our datasets and report the first baseline diarization error rates (DER) for Urdu (25% for synthetic dataset-based models, and 29% for natural conversations). 
Our conversational speech generation technique allows training speaker diarization pipelines without the need for preparing huge conversational repositories.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101699"},"PeriodicalIF":3.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000822/pdfft?md5=3c965afd5ed1a80b86a1318a77699ef7&pid=1-s2.0-S0885230824000822-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141947077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prompting large language models for user simulation in task-oriented dialogue systems","authors":"Atheer Algherairy , Moataz Ahmed","doi":"10.1016/j.csl.2024.101697","DOIUrl":"10.1016/j.csl.2024.101697","url":null,"abstract":"<div><p>Large Language Models (LLMs) have gained widespread popularity due to their instruction-following abilities. In this study, we evaluate their ability in simulating user interactions for task-oriented dialogue (TOD) systems. Our findings demonstrate that prompting LLMs reveals their promising capabilities for training and testing dialogue policies, eliminating the need for domain expertise in crafting complex rules or relying on large annotated data, as required by traditional simulators. The results show that the dialogue system trained with the ChatGPT simulator achieves a success rate of 59%, comparable to a 62% success rate of the dialogue system trained with the manual-rules, agenda-based user simulator (ABUS). Furthermore, the dialogue system trained with the ChatGPT simulator demonstrates better generalization ability compared to the dialogue system trained with the ABUS. Its success rate outperforms that of the dialogue system trained with the ABUS by 4% on GenTUS, 5% on the ChatGPT Simulator, and 3% on the Llama simulator. Nevertheless, LLM-based user simulators provide challenging environment, lexically rich, diverse, and random responses. Llama simulator outperforms the human reference in all lexical diversity metrics with a margin of 0.66 in SE, 0.39 in CE, 0.01 in MSTTR, 0.04 in HDD, and 0.55 in MTLD, while the ChatGPT simulator achieves comparable results. 
This ultimately contributes to enhancing the system’s ability to generalize more effectively.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101697"},"PeriodicalIF":3.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000809/pdfft?md5=81b644a0e6ced84bc9ba93092c2f49b3&pid=1-s2.0-S0885230824000809-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141848167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demystifying large language models in second language development research","authors":"Yan Cong","doi":"10.1016/j.csl.2024.101700","DOIUrl":"10.1016/j.csl.2024.101700","url":null,"abstract":"<div><p>Evaluating students' textual response is a common and critical task in language research and education practice. However, manual assessment can be tedious and may lack consistency, posing challenges for both scientific discovery and frontline teaching. Leveraging state-of-the-art large language models (LLMs), we aim to define and operationalize LLM-Surprisal, a numeric representation of the interplay between lexical diversity and syntactic complexity, and to empirically and theoretically demonstrate its relevance for automatic writing assessment and Chinese L2 (second language) learners’ English writing development. We developed an LLM-based natural language processing pipeline that can automatically compute text Surprisal scores. By comparing Surprisal metrics with the widely used classic indices in L2 studies, we extended the usage of computational metrics in Chinese learners’ L2 English writing. Our analyses suggested that LLM-Surprisals can distinguish L2 from L1 (first language) writing, index L2 development stages, and predict scores provided by human professionals. This indicated that the Surprisal dimension may manifest itself as critical aspects in L2 development. The relative advantages and disadvantages of these approaches were discussed in depth. We concluded that LLMs are promising tools that can enhance L2 research. Our showcase paves the way for more nuanced approaches to computationally assessing and understanding L2 development. 
Our pipelines and findings will inspire language teachers, learners, and researchers to operationalize LLMs in an innovative and accessible manner.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101700"},"PeriodicalIF":3.1,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000834/pdfft?md5=88083b1a8544dcbd7f01cce3a7d527d7&pid=1-s2.0-S0885230824000834-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141843458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The effect of preference elicitation methods on the user experience in conversational recommender systems","authors":"Liv Ziegfeld , Daan Di Scala , Anita H.M. Cremers","doi":"10.1016/j.csl.2024.101696","DOIUrl":"10.1016/j.csl.2024.101696","url":null,"abstract":"<div><p>The prevalence of conversational interfaces is rapidly rising, since improved algorithms allow for remarkable proficiency in understanding and generating natural language. This also holds for Conversational Recommender Systems (CRS), that benefit from information being provided by the user in the course of the dialogue to offer personalized recommendations. However, the challenge remains eliciting the user's characteristics and preferences in a way that leads to the most optimal user experience. Hence, the current research was aimed at investigating the effect of different Preference Elicitation (PE) methods on the user experience of a CRS. We introduce two axes across which PE methods can be classified, namely the degree of system prompt guidance and the level of user input restriction. We built three versions of a CRS to conduct a between-subjects experiment which compared three conditions: high guidance-high restriction, high guidance-low restriction and low guidance-low restriction. We tested their effect on ten constructs of user experience measures on 66 European participants, all working in agriculture or forestry.</p><p>The study did not find any significant effects of the three preference elicitation methods on all user experience constructs collected through questionnaires. However, we did find significant differences in terms of the objective measures chat duration (Speed), response time (Cognitive Demand) and recommendation performance (Accuracy of Recommended Items). Regarding the recommendation performance, it was found that the preference elicitation methods with high guidance led to a higher match score than the condition with low guidance. 
The certainty score was highest in the condition with high guidance and high input restriction. Finally, we found through a question at the end of the conversation that users who were satisfied with the recommendation responded more positively to six out of ten user experience constructs. This suggests that satisfaction with the recommendation performance is a crucial factor in the user experience of CRSs.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101696"},"PeriodicalIF":3.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000792/pdfft?md5=2468411a22f6c0a2ba9f84281b96dacc&pid=1-s2.0-S0885230824000792-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141840842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theory of mind performance of large language models: A comparative analysis of Turkish and English","authors":"Burcu Ünlütabak, Onur Bal","doi":"10.1016/j.csl.2024.101698","DOIUrl":"10.1016/j.csl.2024.101698","url":null,"abstract":"<div><p>Theory of mind (ToM), understanding others’ mental states, is a defining skill belonging to humans. Research assessing LLMs’ ToM performance yields conflicting findings and leads to discussions about whether and how they could show ToM understanding. Psychological research indicates that the characteristics of a specific language can influence how mental states are represented and communicated. Thus, it is reasonable to expect language characteristics to influence how LLMs communicate with humans, especially when the conversation involves references to mental states. This study examines how these characteristics affect LLMs’ ToM performance by evaluating GPT 3.5 and 4 performances in English and Turkish. Turkish provides an excellent contrast to English since Turkish has a different syntactic structure and special verbs, san- and zannet-, meaning “falsely believe.” Using Open AI's Chat Completion API, we collected responses from GPT models for first- and second-order ToM scenarios in English and Turkish. Our innovative approach combined completion prompts and open-ended questions within the same chat session, offering deep insights into models’ reasoning processes. Our data showed that while GPT models can respond accurately to standard ToM tasks (100% accuracy), their performance deteriorates (below chance level) with slight modifications. This high sensitivity suggests a lack of robustness in ToM performance. GPT 4 outperformed its predecessor, GPT 3.5, showing improvement in ToM performance to some extent. The models generally performed better when tasks were presented in English than in Turkish. 
These findings indicate that GPT models cannot reliably pass first-order and second-order ToM tasks in either of the languages yet. The findings have significant implications for <em>Explainability</em> of LLMs by highlighting challenges and biases that they face when simulating human-like ToM understanding in different languages.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101698"},"PeriodicalIF":3.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000810/pdfft?md5=e4a1b003e652ef2e0a652d3d4eaf2c3d&pid=1-s2.0-S0885230824000810-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141848847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChatMatch: Exploring the potential of hybrid vision–language deep learning approach for the intelligent analysis and inference of racket sports","authors":"Jiawen Zhang , Dongliang Han , Shuai Han , Heng Li , Wing-Kai Lam , Mingyu Zhang","doi":"10.1016/j.csl.2024.101694","DOIUrl":"10.1016/j.csl.2024.101694","url":null,"abstract":"<div><p>Video understanding technology has become increasingly important in various disciplines, yet current approaches have primarily focused on lower comprehension level of video content, posing challenges for providing comprehensive and professional insights at a higher comprehension level. Video analysis plays a crucial role in athlete training and strategy development in racket sports. This study aims to demonstrate an innovative and higher-level video comprehension framework (ChatMatch), which integrates computer vision technologies with the cutting-edge large language models (LLM) to enable intelligent analysis and inference of racket sports videos. To examine the feasibility of this framework, we deployed a prototype of ChatMatch in the badminton in this study. A vision-based encoder was first proposed to extract the meta-features included the locations, actions, gestures, and action results of players in each frame of racket match videos, followed by a rule-based decoding method to transform the extracted information in both structured knowledge and unstructured knowledge. A set of LLM-based agents included namely task identifier, coach agent, statistician agent, and video manager, was developed through a prompt engineering and driven by an automated mechanism. The automatic collaborative interaction among the agents enabled the provision of a comprehensive response to professional inquiries from users. 
The validation findings showed that our vision models had excellent performances in meta-feature extraction, achieving a location identification accuracy of 0.991, an action recognition accuracy of 0.902, and a gesture recognition accuracy of 0.950. Additionally, a total of 100 questions were gathered from four proficient badminton players and one coach to evaluate the performance of the LLM-based agents, and the outcomes obtained from ChatMatch exhibited commendable results across general inquiries, statistical queries, and video retrieval tasks. These findings highlight the potential of using this approach that can offer valuable insights for athletes and coaches while significantly improve the efficiency of sports video analysis.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101694"},"PeriodicalIF":3.1,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000779/pdfft?md5=2c72701b559ac872232548320e08722b&pid=1-s2.0-S0885230824000779-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141853772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On improving conversational interfaces in educational systems","authors":"Yuyan Wu, Romina Soledad Albornoz-De Luise, Miguel Arevalillo-Herráez","doi":"10.1016/j.csl.2024.101693","DOIUrl":"10.1016/j.csl.2024.101693","url":null,"abstract":"<div><p>Conversational Intelligent Tutoring Systems (CITS) have drawn increasing interest in education because of their capacity to tailor learning experiences, improve user engagement, and contribute to the effective transfer of knowledge. Conversational agents employ advanced natural language techniques to engage in a convincing human-like tutorial conversation. In solving math word problems, a significant challenge arises in enabling the system to understand user utterances and accurately map extracted entities to the essential problem quantities required for problem-solving, despite the inherent ambiguity of human natural language. In this study, we propose two possible approaches to enhance the performance of a particular CITS designed to teach learners to solve arithmetic–algebraic word problems. Firstly, we propose an ensemble approach to intent classification and entity extraction, which combines the predictions made by two distinct individual models that use constraints defined by human experts. This approach leverages the intertwined nature of the intents and entities to yield a comprehensive understanding of the user’s utterance, ultimately aiming to enhance semantic accuracy. Secondly, we introduce an adapted Term Frequency-Inverse Document Frequency technique to associate entities with problem quantity descriptions. The evaluation was conducted on the AWPS and MATH-HINTS datasets, containing conversational data and a collection of arithmetical and algebraic math problems, respectively. 
The results demonstrate that the proposed ensemble approach outperforms individual models, and the proposed method for entity–quantity matching surpasses the performance of typical text semantic embedding models.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101693"},"PeriodicalIF":3.1,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000767/pdfft?md5=56f2f2395571e332090191dc68fc5505&pid=1-s2.0-S0885230824000767-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141851561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus","authors":"Francesco Sigona , Daniele P. Radicioni , Barbara Gili Fivela , Davide Colla , Matteo Delsanto , Enrico Mensa , Andrea Bolioli , Pietro Vigorelli","doi":"10.1016/j.csl.2024.101691","DOIUrl":"10.1016/j.csl.2024.101691","url":null,"abstract":"<div><h3>Introduction</h3><p>Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence impact positively on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyzes in finding correlations with the progression of dementia in individuals with cognitive decline as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. 
The main innovation and strength of the work consists in the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); in the use of Italian (there is little corpora for Italian); in the size of the analyzed data (more than 200 conversations were considered); in the adoption of a wide range of NLP methods, that span from traditional morphosyntactic investigation to deep linguistic models for conducting analyzes such as through perplexity, sentiment (polarity) and emotions.</p></div><div><h3>Methods</h3><p>Analyzing real-world interactions not designed with computational analysis in mind, such as is the case of the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools were employed. These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. Analyzes were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian Medicines Agency) guidelines, based on MMSE threshold values.</p></div><div><h3>Results and discussion</h3><p>Correlations between MMSE and individual DLBs were weak, up to 0.19 for positive, and -0.21 for negative correlation values. Nevertheless, some correlations were statistically significant and consistent with the literature, suggesting that people with a greater degree of impairment tend to show a reduced vocabulary, to have anomia, to adopt a more informal linguist register, and to display a simplified use of verbs, with a decrease in the use of participles, gerunds, subjunctive moods, modal verbs, as well as a flattening in the use of the tenses towards the present to the detriment of the past. 
The -0.26 inverse correlation between perplexity and MMSE suggests that perplexity captures slightly more specific linguistic information, which can complement the MMSE scores. In the categorization tasks, the clas","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101691"},"PeriodicalIF":3.1,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000743/pdfft?md5=5a1457a7753032d3fdc01ffd4b14e74e&pid=1-s2.0-S0885230824000743-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141844241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PaSCoNT - Parallel Speech Corpus of Northern-central Thai for automatic speech recognition","authors":"Supawat Taerungruang , Phimphaka Taninpong , Vataya Chunwijitra , Sumonmas Thatphithakkul , Sawit Kasuriya , Viroj Inthanon , Pawat Paksaranuwat , Salinee Thumronglaohapun , Nawapon Nakharutai , Papangkorn Inkeaw , Jakramate Bootkrajang","doi":"10.1016/j.csl.2024.101692","DOIUrl":"10.1016/j.csl.2024.101692","url":null,"abstract":"<div><p>This paper proposed a Parallel Speech Corpus of Northern-central Thai (PaSCoNT). The purpose of this research is not only to understand the different linguistic characteristics between Northern and Central Thai, but also to utilize this corpus for automatic speech recognition. The corpus is composed of speech data from dialogues of daily life among northern Thai people. We designed 2,000 Northern Thai sentences covering all phonemes, in collaboration with linguists specialized in the Northern Thai dialect. The samples in this study are 200 Northern Thai dialect speakers who had been living in Chiang Mai province for more than 18 years. The speech was recorded in both open and closed environments. In the speech recording, each speaker must read 100 pairs of Northern-Central Thai sentences to ensure that the speech data comes from the same speaker. In total, 100 h of speech were recorded: 50 h of Northern Thai and 50 h of Central Thai. Overall, PaSCoNT consists of 907,832 words and 6,279 vocabulary items. Statistical analysis of the PaSCoNT corpus revealed that 49.64 % of words in the lexicon belongs to the Northern Thai dialect, 50.36 % from the Central Thai dialect, and 1,621 vocabulary items appeared in both Northern and Central Thai. Statistical analysis is used to examine the difference in speech tempo, i.e. time per phoneme (TTP), syllable per minute (SPM), between Northern and Central Thai. The results revealed that there were statistically significant differences speech tempo between Central and Northern Thai. 
The TTP speaking and articulation rate of Central Thai is lower than Northern Thai whereas SPM speaking and articulation rate of Central Thai is higher than Northern Thai. The results also showed that the ASR model training using Northern Thai speech corpus provides the lower WER% when testing using Northern Thai testing speech data and provides the higher WER% when testing using Central Thai Testing speech data and vice versa. However, the ASR model training using the PaSCoNT speech corpus provides the lower WER% for both Northern Thai and Central Thai testing speech data.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101692"},"PeriodicalIF":3.1,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000755/pdfft?md5=f97afe2aa357037c83c6473c50174543&pid=1-s2.0-S0885230824000755-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141839086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalizing Hate Speech Detection Using Multi-Task Learning: A Case Study of Political Public Figures","authors":"Lanqin Yuan, Marian-Andrei Rizoiu","doi":"10.1016/j.csl.2024.101690","DOIUrl":"10.1016/j.csl.2024.101690","url":null,"abstract":"<div><p>Automatic identification of hateful and abusive content is vital in combating the spread of harmful online content and its damaging effects. Most existing works evaluate models by examining the generalization error on train–test splits on hate speech datasets. These datasets often differ in their definitions and labeling criteria, leading to poor generalization performance when predicting across new domains and datasets. This work proposes a new Multi-task Learning (MTL) pipeline that trains simultaneously across multiple hate speech datasets to construct a more encompassing classification model. Using a dataset-level leave-one-out evaluation (designating a dataset for testing and jointly training on all others), we trial the MTL detection on new, previously unseen datasets. Our results consistently outperform a large sample of existing work. We show strong results when examining the generalization error in train–test splits and substantial improvements when predicting on previously unseen datasets. Furthermore, we assemble a novel dataset, dubbed <span>PubFigs</span>, focusing on the problematic speech of American Public Political Figures. We crowdsource-label using Amazon MTurk more than 20,000 tweets and machine-label problematic speech in all the 305,235 tweets in <span>PubFigs</span>. We find that the abusive and hate tweeting mainly originates from right-leaning figures and relates to six topics, including Islam, women, ethnicity, and immigrants. 
We show that MTL builds embeddings that can simultaneously separate abusive from hate speech, and identify its topics.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101690"},"PeriodicalIF":3.1,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000731/pdfft?md5=e169fb47936a2284a9d518194884b197&pid=1-s2.0-S0885230824000731-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141853188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}