A Data Driven Approach for Language Acquisition

Ting Liu, S. Small, James R. Kubricht, Peter Tu, Harry Shen, Lydia Cartwright, Samuil Orlioglu
Proceedings of the 6th International Conference on Intelligent Human Systems Integration (IHSI 2023) Integrating People and Intelligent Systems, February 22–24, 2023, Venice, Italy
DOI: https://doi.org/10.54941/ahfe1002842
Citations: 0

Abstract
Automatic Language Acquisition focuses on teaching an agent to acquire the knowledge it needs to understand its surrounding environment and to adapt to a new one. Traditional language understanding models fall into three main categories: supervised, semi-supervised, and unsupervised. A supervised approach is usually accurate but requires a large training dataset, which is expensive and time-consuming to build. In addition, the trained model is difficult to transfer to other domains. An unsupervised model, on the other hand, is cheap and flexible to build, but its performance is usually significantly lower than that of a supervised one. Starting from a relatively small set of guidance, a semi-supervised approach can teach itself from the unlabeled dataset and achieve performance comparable to that of a supervised model. However, building the guidance is not a trivial task, since the effectiveness of the learning process depends on the relationship between the labeled data and the unlabeled data. Unlike these traditional models, children do not require large amounts of training data when they learn. Instead, they can accurately generalize their knowledge from one object to other objects. In addition, communication with their parents, teachers, and peers helps to correct wrong claims that arise from the generalization. In this paper, we present a multimodal system that simulates the children's learning process to acquire knowledge of entities by studying three types of attributes: descriptive (the appearance of an entity), defining (the components of an entity), and affordance (how an entity can be used). We first use an unsupervised Emergent Language (EL) approach to generate a symbolic language (EL codes) that interprets the given images of the entities (10 images per entity). We then use K-means clustering to group entity images that share similar EL codes.
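The clustering step can be illustrated with a minimal sketch. The abstract does not specify how EL codes are encoded or compared, so everything here is an assumption made for illustration: each EL code is represented as a fixed-length tuple of integers, distance is squared Euclidean, and the first `k` points seed the centroids so that the result is deterministic. The function name `kmeans` and the sample codes are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain K-means over numeric tuples; returns one cluster label per point."""
    # Assumption: seed centroids with the first k points (deterministic, not robust).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical EL codes for five entity images (not data from the paper).
el_codes = [(3, 1, 4), (9, 8, 7), (3, 1, 5), (9, 8, 6), (3, 2, 4)]
labels = kmeans(el_codes, k=2)
print(labels)
```

Images whose codes differ in only one position end up in the same cluster, which is the grouping behaviour the abstract describes; a real system would likely need a distance suited to symbolic codes rather than the numeric one assumed here.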
Next, we employ a data-driven approach to teach the agent the attributes of the entities in the clusters. We first calculate the tf-idf scores of the words in the text pieces that contain an entity (extracted from the Corpus of Contemporary American English, a balanced corpus of American English). Given the top-ranked words and a few sample text pieces, a human expert identifies which words are attributes and what type each attribute is. The human expert also marks the text pieces in which the attributes occur. For example, "red" is a descriptive attribute of "cup" in "the red cup" but not in "a cup of red wine". The learned knowledge is then sent to a bootstrapping model that finds not only new attributes but also new entities. Our results show that the data-driven approach took much less time yet learned more attributes than our baseline, which teaches the agent the defining attributes of the entities using a carefully designed curriculum.
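The tf-idf ranking step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: all text pieces for one entity are pooled into a single document, tokens are split on whitespace, and the score is term frequency times `log(N / df)`. The function name `tfidf_rank` and the sample text pieces are hypothetical (only "the red cup" and "a cup of red wine" come from the abstract).

```python
import math
from collections import Counter


def tfidf_rank(pieces_by_entity, entity, top_n=5):
    """Rank the words in one entity's text pieces by tf-idf against the other entities."""
    # Assumption: pool all text pieces of an entity into one bag of words.
    bags = {
        name: Counter(word for piece in pieces for word in piece.lower().split())
        for name, pieces in pieces_by_entity.items()
    }
    n_docs = len(bags)
    target = bags[entity]
    total = sum(target.values())
    scores = {}
    for word, count in target.items():
        # Document frequency: number of entities whose pieces contain the word.
        df = sum(1 for bag in bags.values() if word in bag)
        scores[word] = (count / total) * math.log(n_docs / df)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical text pieces; a real run would draw them from a corpus.
pieces = {
    "cup": ["the red cup", "a cup of red wine", "the cup has a handle"],
    "car": ["the red car", "the car has a wheel"],
}
ranked = tfidf_rank(pieces, "cup")
for word, score in ranked:
    print(f"{word}\t{score:.3f}")
```

The ranking only proposes candidate words. Deciding which candidates are attributes, and of which type, is left to the human expert as described in the abstract.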