Language is not a data set—Why overcoming ideologies of dataism is more important than ever in the age of AI

Impact factor 1.5 · Zone 1 (Literature) · Q2 Linguistics
Iker Erdocia, Bettina Migge, Britta Schneider
Journal of Sociolinguistics, 28(5), 20-25. Published 2024-11-10. DOI: 10.1111/josl.12680. https://onlinelibrary.wiley.com/doi/10.1111/josl.12680
Citation count: 0

Abstract

Helen Kelly-Holmes’ call to explore the implications for sociolinguistics arising from the increased commercially driven digitalization of society is very timely. We share Kelly-Holmes’ view that the growing prevalence of online and artificial intelligence (AI) technologies in all aspects of our lives requires a critical assessment of assumptions, approaches, and practices that have grounded sociolinguistic research since its inception. While our discussion confirms Helen's observations, we also urge the development of a general critical attitude toward understanding language as digital data. The starting point for our argument is Helen's claim that there is an erasure of “authentic” languages from public digital spaces, “making it more difficult to gather data on real usage because it would be necessary to rely on public areas and/or negotiate access to these private spaces” (p. 5). For us, her observation brings to the fore that treating language as data has always been problematic. We want to raise two issues: the general epistemological limitations of using digital user data as a representation of language and community, and the consequent need for methods that take seriously the study of language in its social, political, and technological context. We suggest ethnography as a method for understanding what speakers actually do, and an opening of language research to also consider the workings and socio-political embeddings of digital and generative AI language technologies. We offer our discussion in the spirit of a fruitful and constructive joint debate.

Let us start with a general critique of approaching language as “data” that correlates with social groups, which is so far a neglected aspect in the debates surrounding language, sociolinguistics, and AI. Historically, this discussion links to the colonial backgrounds of Western science and linguistics specifically. Colonial or missionary linguistic research (e.g., Deumert & Storch, 2020; Errington, 2008) demonstrates that dominant Western epistemologies of language and research methods in linguistics were shaped during the period of European colonialism. An important legacy of European colonialism is that it “sought to fundamentally change and reorganize the social and economic order of the societies it colonized, as opposed to satisfy itself with extracting tribute” (Couldry & Mejias, 2019, p. 70). Part of this endeavor involved language “development” activities aimed at the goal of Bible translation and turning the colonized into Christian disciples. This was based on constructions of language that are still dominant today. They developed on the grounds of “collecting data” (in colonial times, often from single speakers) and then transforming the human capacity of embodied, interactive and collaborative meaning-making into word lists, grammar books, or dictionaries (e.g., Deumert & Storch, 2020; Gal & Irvine, 2019, Chap. 9). Linguistic research is therefore grounded in what is now called the ideology of “dataism” (Bode & Goodlad, 2023). This is the belief in data as representing human behavior. In the age of digital technologies, this is coupled with the aim of tracking human behavior to predict and ultimately shape social life (Rushkoff, 2019). Dataism implies assumptions such as the belief in objectivity of quantification and trust in data processing agents. The resulting datafication of everyday life then consists of extracting information from the flow of social life, matching it to imagined social realities and categories and fixing such relationships. 
In the context of linguistics, the understanding of language as data, collected from oral practices transformed into writing, led to conceptualizing language as referential code and languages as “natural,” given objects that are systematically and neatly structured (e.g., Pennycook, 2004). The outcomes of these activities are summarized in typologies, developmental hierarchies, and a canon of methods for optimal data extraction and analysis (for critical discussion, see Deumert & Storch, 2020).

Sociolinguistics is part of this tradition. But unlike missionaries who aimed to impose their social imaginations on people through their linguistic activities, sociolinguists’ goal is to discover and explain what people do with language, specifically variation, to deepen our understanding of language and its sociocultural embeddedness and to raise awareness and fight discrimination. Early work aggregated people's practices into externally defined, homogenizing macro social (e.g., age, class) and structural linguistic categories and derived meaning from statistical correlations between them. Over time, however, sociolinguistics has also successively questioned objectification. Subsequent research emphasized the construction of meaning as a local, interactional process that required understanding people's actions and views and advocated for detailed participant observation of people's everyday activities and semi-guided discussion. The aim is to discover locally relevant social and linguistic categories (e.g., practices, linguistic phenomena) and their local indexicalities and relationships in a holistic manner (Eckert, 2012). It is now widely accepted that social and linguistic categories are complex and dynamic across contexts and interactions. Language is a fuzzy network of social acts whose meanings emerge in context as it is fundamentally pragmatic and indexical, and meaning-making is a localized process (Eckert, 2008; Gal & Irvine, 2019; Silverstein, 2014). People are social agents who pick linguistic practices based on their identity, on the goals that they want to (temporarily) foreground, and on their current understanding of an interaction based on the indexicalities that they perceive.
In addition, humans “dynamically reshape the context that provides organization for their actions within the interaction itself” (Duranti & Goodwin, 1992, p. 5), and in literate cultures, writing and printing have co-constructed these contexts, in particular, normative ideas and epistemological approaches (Linell, 2005). Overall, sociolinguistic diversity is therefore rich and dynamic and defies neat correlative relationships.

What does all this suggest for the future of sociolinguistic research in an environment enriched by digital and AI technologies? The short answer is that we cannot limit our analyses to the study of digital language data, be it the effect of user interaction or the output of generative AI. On the one hand, we need to continue and enhance what we have been doing: observing people, understanding meaning-making in practice and considering the social embeddedness of language, while critically assessing, critiquing, and recalibrating our tools to avoid essentializing practices, contexts, and communities. The current social reality, rife with unfamiliar tools, processes, and logics, requires us to upskill and engage. Ethnography's dedication to a multi-perspective and holistic understanding is well suited to grasp, for example, how, when, and where people actually use generative AI tools and how this use affects language attitudes and language ideologies and therefore contributes to sociolinguistic realities. Instead of only examining the outputs of technological applications, that is, the linguistic data, we need to cast our net more widely. Given their intertwined nature, we have to explore people's offline and online experiences, activities and ideologies, the technological infrastructures and affordances, and their intersections in an integrated manner. Ethnography is here useful to study micro-practices, but at the same time, as it is limited to the observation of locally visible conditions, it needs to be complemented by other methodological approaches. The social, cultural, political, technological, and interactional situatedness and dynamic nature of any language activity need to be embraced in a multi-methodological fashion (e.g., Page et al., 2022).
For example, young(er) people in the Global North often stay connected throughout most of their waking hours and frequently blend offline activities with simultaneous online activity, leading to at times intensely intertwined experiences because the technological and social contingencies of their public, private, and educational lives “allow” or even mandate a convergence of online and offline activities. For others, there might be a greater difference between online and offline worlds due to the lack of appropriate devices, data or network coverage, non-digitized contexts, or just a preference for offline interaction (Deumert, 2014, Chap. 3). Due to being involved in different communities of online and offline practice, individuals also develop, use, and learn to interpret linguistic and sociopragmatic indexicalities differently and develop different metapragmatic realities. Language is also not the only meaning-making resource. Type of technology, ways of using technologies (e.g., voice vs. written messages; multimodal vs. plain text) may also become contextualization cues and their indexicalities are not constant as different contexts have different affordances in terms of devices, literacy, and ideologies of language and media (Gershon, 2010). Without ethnographic observation and a consideration of the social and technological contexts, local meanings of language and the social indexicality of language and technology choices can easily be misinterpreted. This also applies to the linguistic output of AI tools, which is typically edited by users, according to their audiences and language ideologies, the latter increasingly influenced by the ascription of authority to data and algorithms, but possibly also by their rejection. The edited language feeds back into systems so that the whole AI arrangement becomes a complex socio-technical human–machine assemblage (Fester-Seeger et al., in preparation; Pennycook, 2024). 
In this, it is impossible to know what people do and why without engaging with people—the belief in objectified, decontextualized data as the sole source of knowledge creation has been problematic in the past and becomes even more so in an age of digital transnational interaction and AI interventions.

This also means that we need new conceptual tools, categories, and methodological approaches to study language in a society in which digital platforms, owned by a handful of American companies, make enormous profits with their data collection activities. They feed these into AI systems, which, in turn, impact language use, language ideologies, and the formation of communities worldwide. We concur with Kelly-Holmes (2023) that we therefore cannot neglect the macro level in our research and need to put a greater focus on investigating and critically exploring sociopolitical structures and systems of commercialization and technology, and how they affect language practices, language ideologies, linguistic research, and language policies. Our call for engagement with language in a holistic manner is thus not only a call for ethnography. We have to deepen our understanding of how language technologies are built and why. Understanding the ideological underpinnings of the market activity of the tech sector is of particular importance to fully capture the processes in which language technologies are embedded. The actual workings and motivations of digital technologies have received the least attention in linguistic research despite their impact on language practices (but see, e.g., Jones et al., 2015). Critical sociological research (Couldry & Mejias, 2019) characterizes digital AI technologies as built on data colonialism, dominated by big tech companies’ desire for maximization of profits through digital dispossession and data surveillance (Zuboff, 2019). Language data are central to making AI infrastructures a highly potent tool for structuring but also controlling humans and for commercially exploiting our life in the form of data.
A research agenda that reacts to this could include, for example, exploring the implications for our field of recent critical accounts in the social sciences of the tech industry's global pillaging of human practices in the context of extractive capitalism (e.g., Couldry & Mejias, 2019; Zuboff, 2019). These critical insights can help inform a more nuanced understanding of the modus operandi of corporate language technologies, whose interests they serve and how they are monetized.

In the overall context of changing socio-technological conditions of society, we must not forget the sociopolitical and economic context. The state has traditionally played a crucial role in the framing of sociolinguistic economies (Blommaert, 2010, p. 195). More recently, many governments of both the Global North and South appear to have adopted a techno-solutionist approach to AI, including language technologies. Public authorities have long delegated the development of digital technologies to the market, replacing government language technology policy with the strategies of the commercially driven private sector, in the belief that it would achieve social goods for all (Birhane, 2020; Morozov, 2013). This situation has resulted in an acute digital inequality among languages, where the technological readiness of populations (e.g., use of smartphones), the degree of language norming (e.g., uniform/roman scripts), the size of data sets, and/or the decision by companies to create artificial data sets (see, e.g., NLLB et al., 2022) impact on whether or not a language is provided with critical AI tools and thus becomes reified and visible in digital space. It has also created tension between the private sector of commercial providers of language technologies and the increasingly blurred role of public institutions as traditional regulators and exclusive holders of normative authority in language matters (Erdocia et al., under review). We have become utterly dependent on private technologies manufactured and controlled by a handful of opaque companies. Like the raw resource mining industries, they appear mostly indifferent to the social consequences of their activities and only invest minimally if obliged by government regulations to enhance their public image. The state, including within supranational organizations, is expected to regain a more active role as a guarantor of users' fundamental rights through regulatory and supervisory frameworks (see the EU's Digital Services Act). 
In the language field, this includes public–private partnerships to develop accurate, ethical, and unbiased data sets and technologies for all (particularly “low-resource”) languages in an attempt to reduce the technology gap between English and other languages (see “Language Equality in the Digital Age” resolution, European Parliament, 2018; Rehm & Way, 2023). In contexts like Spain, state-sponsored language academies are beginning to collaborate with tech corporations to extend their language authority to AI (see Erdocia et al., under review).

And yet, overall, we still know little about how companies resource, compile, and turn language practices into data. This is due to big tech secrecy but probably also due to our own disciplinary orientation and biases (we often avoid interaction with computational linguistics). Academic and commercial computational publications that assess processes, models, and procedures such as data scraping, “curating,” “cleaning,” debiasing processes, or training and testing of language models are useful for gaining a deeper insight into the politics of language technologies. However, since they are oriented to computationally trained specialists and are typically framed in ideologies of dataism, they need to be triangulated with narrative explanations from different people involved in and affected by these processes. Ideally, this should be complemented by observation of their activities. Thus, we suggest developing studies that use semi-guided interviews and observation to investigate professionals’ practices (e.g., researchers, localizers, technology designers, annotators, data curators, and CEOs) and commercial, public, and lay users’ experiences with these infrastructures. This would re-visibilize the “invisibilization of technology” that Helen Kelly-Holmes observes (p. 2).

Looking at sociolinguistic phenomena across macro-meso-micro levels may help us not only to take the pulse of traditional concepts in our field in the digital space, such as standard language, language authenticity, or linguistic authority. It might also contribute to our understanding of people's value attributions to the legacies of modernist conceptions of language and nation-state in a technologized world of late modern communication. This would paint a rich picture of the language ideologies and social and commercial actors that co-construct sociolinguistic economies today; their social, political, financial, and linguistic dynamics; and their material affordances, practices, understandings, and the web of indexical relationships between them. Paying attention to the entire sociopolitical and technological structure that enables the existence of AI and its penetration into all spheres of life (not just users' data, activities, and views) will confront us with our own disciplinary assumptions, biases of dataism, categories, practices, and colonial ideologies and can only enhance our work. Sociolinguistic findings and expertise are increasingly sought out by the tech industry to help fine-tune the functioning of AI technologies. Comprehensive engagement with the intertwined online and offline context will put us in a better position to engage with this interest in our work, to understand the role of language data in contemporary socio-political contexts, and, more broadly, to see how our work can contribute to a critically engaged understanding of the sociolinguistics of AI.

The authors declare no conflicts of interest.

About the journal: Journal of Sociolinguistics promotes sociolinguistics as a thoroughly linguistic and thoroughly social-scientific endeavour. The journal is concerned with language in all its dimensions, macro and micro, as formal features or abstract discourses, as situated talk or written text. Data in published articles represent a wide range of languages, regions and situations - from Alune to Xhosa, from Cameroun to Canada, from bulletin boards to dating ads.