Beyond Big Data: What Can We Learn from AI Models?: Invited Keynote

Aylin Caliskan
{"title":"超越大数据:我们能从人工智能模型中学到什么?:特邀演讲","authors":"Aylin Caliskan","doi":"10.1145/3128572.3140452","DOIUrl":null,"url":null,"abstract":"My research involves the heavy use of machine learning and natural language processing in novel ways to interpret big data, develop privacy and security attacks, and gain insights about humans and society through these methods. I do not use machine learning only as a tool but I also analyze machine learning models? internal representations to investigate how the artificial intelligence perceives the world. This work [3] has been recently featured in Science where I showed that societal bias exists at the construct level of machine learning models, namely semantic space word embeddings which are dictionaries for machines to understand language. When I use machine learning as a tool to uncover privacy and security problems, I characterize and quantify human behavior in language, including programming languages, by coming up with a linguistic fingerprint for each individual. By extracting linguistic features from natural language or programming language texts of humans, I show that humans have unique linguistic fingerprints since they all learn language on an individual basis. Based on this finding, I can de-anonymize humans that have written certain text, source code, or even executable binaries of compiled code [2, 4, 5]. This is a serious privacy threat for individuals that would like to remain anonymous, such as activists, programmers in oppressed regimes, or malware authors. Nevertheless, being able to identify authors of malicious code enhances security. On the other hand, identifying authors can be used to resolve copyright disputes or detect plagiarism. The methods in this realm [1] have been used to identify so called doppelgängers to link the accounts that belong to the same identities across platforms, especially underground forums that are business platforms for cyber criminals. By analyzing machine learning models? internal representation and linguistic human fingerprints, I am able to uncover facts about the world, society, and the use of language, which have implications for privacy, security, and fairness in machine learning.","PeriodicalId":318259,"journal":{"name":"Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security","volume":"408 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Beyond Big Data: What Can We Learn from AI Models?: Invited Keynote\",\"authors\":\"Aylin Caliskan\",\"doi\":\"10.1145/3128572.3140452\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"My research involves the heavy use of machine learning and natural language processing in novel ways to interpret big data, develop privacy and security attacks, and gain insights about humans and society through these methods. I do not use machine learning only as a tool but I also analyze machine learning models? internal representations to investigate how the artificial intelligence perceives the world. This work [3] has been recently featured in Science where I showed that societal bias exists at the construct level of machine learning models, namely semantic space word embeddings which are dictionaries for machines to understand language. 
When I use machine learning as a tool to uncover privacy and security problems, I characterize and quantify human behavior in language, including programming languages, by coming up with a linguistic fingerprint for each individual. By extracting linguistic features from natural language or programming language texts of humans, I show that humans have unique linguistic fingerprints since they all learn language on an individual basis. Based on this finding, I can de-anonymize humans that have written certain text, source code, or even executable binaries of compiled code [2, 4, 5]. This is a serious privacy threat for individuals that would like to remain anonymous, such as activists, programmers in oppressed regimes, or malware authors. Nevertheless, being able to identify authors of malicious code enhances security. On the other hand, identifying authors can be used to resolve copyright disputes or detect plagiarism. The methods in this realm [1] have been used to identify so called doppelgängers to link the accounts that belong to the same identities across platforms, especially underground forums that are business platforms for cyber criminals. By analyzing machine learning models? internal representation and linguistic human fingerprints, I am able to uncover facts about the world, society, and the use of language, which have implications for privacy, security, and fairness in machine learning.\",\"PeriodicalId\":318259,\"journal\":{\"name\":\"Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security\",\"volume\":\"408 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3128572.3140452\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3128572.3140452","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

My research involves the heavy use of machine learning and natural language processing in novel ways to interpret big data, develop privacy and security attacks, and gain insights about humans and society through these methods. I do not use machine learning only as a tool; I also analyze machine learning models' internal representations to investigate how artificial intelligence perceives the world. This work [3] was recently featured in Science, where I showed that societal bias exists at the construct level of machine learning models, namely semantic-space word embeddings, which serve as dictionaries for machines to understand language. When I use machine learning as a tool to uncover privacy and security problems, I characterize and quantify human behavior in language, including programming languages, by deriving a linguistic fingerprint for each individual. By extracting linguistic features from humans' natural language or programming language texts, I show that humans have unique linguistic fingerprints, since they all learn language on an individual basis. Based on this finding, I can de-anonymize humans who have written certain text, source code, or even executable binaries of compiled code [2, 4, 5]. This is a serious privacy threat for individuals who would like to remain anonymous, such as activists, programmers in oppressive regimes, or malware authors. Nevertheless, being able to identify authors of malicious code enhances security. On the other hand, identifying authors can be used to resolve copyright disputes or detect plagiarism. The methods in this realm [1] have been used to identify so-called doppelgängers, linking accounts that belong to the same identity across platforms, especially underground forums that serve as business platforms for cyber criminals. By analyzing machine learning models' internal representations and linguistic human fingerprints, I am able to uncover facts about the world, society, and the use of language, with implications for privacy, security, and fairness in machine learning.
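
The bias finding in [3] rests on measuring differential association between word vectors. The sketch below is a minimal, illustrative version of such an association test, not the exact implementation or word lists from the Science paper: it scores how much more strongly one target word set associates with one attribute set than another, using cosine similarity in the embedding space, and the toy random vectors stand in for trained embeddings.

```python
# Minimal WEAT-style sketch: differential association in word embeddings.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    """s(w, A, B): mean similarity of word w to attribute set A minus to set B."""
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def effect_size(X, Y, A, B, emb):
    """Cohen's-d-style effect size of the differential association
    between target sets X, Y and attribute sets A, B."""
    x_assoc = [association(x, A, B, emb) for x in X]
    y_assoc = [association(y, A, B, emb) for y in Y]
    return (np.mean(x_assoc) - np.mean(y_assoc)) / np.std(x_assoc + y_assoc, ddof=1)

# Toy example: random vectors stand in for trained word embeddings,
# and the word lists are illustrative placeholders, not the sets from [3].
rng = np.random.default_rng(0)
vocab = ["flower", "rose", "insect", "ant", "pleasant", "love", "unpleasant", "hate"]
emb = {w: rng.normal(size=50) for w in vocab}
X, Y = ["flower", "rose"], ["insect", "ant"]
A, B = ["pleasant", "love"], ["unpleasant", "hate"]
print("effect size:", effect_size(X, Y, A, B, emb))
```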
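The linguistic-fingerprint idea can likewise be pictured with a deliberately simplified sketch, assuming scikit-learn is available: character n-gram features extracted from code samples feed a random forest classifier that predicts the author of an unseen snippet. This is only a toy stand-in for the published pipeline in [2, 4, 5], which combines richer lexical, layout, and syntactic features derived from parse trees; the author names and code snippets below are hypothetical.

```python
# Simplified authorship-attribution sketch over source code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical training data: code samples with known authors.
code_samples = [
    "for i in range(n):\n    total += data[i]",
    "total = sum(data[i] for i in range(n))",
    "while idx < n:\n    acc = acc + values[idx]\n    idx += 1",
    "acc = 0\nfor v in values:\n    acc += v",
]
authors = ["alice", "bob", "alice", "bob"]

# Character n-grams capture low-level habits (spacing, naming, operators)
# that contribute to an author's linguistic fingerprint.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(code_samples, authors)

# De-anonymization: predict the most likely author of an unseen snippet.
print(model.predict(["result = sum(xs[i] for i in range(len(xs)))"]))
```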