Beyond Big Data: What Can We Learn from AI Models? (Invited Keynote)
Aylin Caliskan
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, November 3, 2017
DOI: 10.1145/3128572.3140452
Citations: 1
Abstract
My research makes heavy use of machine learning and natural language processing in novel ways to interpret big data, develop privacy and security attacks, and gain insights about humans and society. I do not use machine learning only as a tool; I also analyze machine learning models' internal representations to investigate how artificial intelligence perceives the world. This work [3] was recently featured in Science, where I showed that societal bias exists at the construct level of machine learning models, namely semantic-space word embeddings, which serve as dictionaries through which machines understand language. When I use machine learning as a tool to uncover privacy and security problems, I characterize and quantify human behavior in language, including programming languages, by deriving a linguistic fingerprint for each individual. By extracting linguistic features from natural-language or programming-language texts, I show that humans have unique linguistic fingerprints, since each person learns language individually. Based on this finding, I can de-anonymize the humans who wrote a given text, source code, or even executable binaries of compiled code [2, 4, 5]. This is a serious privacy threat for individuals who wish to remain anonymous, such as activists, programmers in oppressive regimes, or malware authors. Nevertheless, being able to identify authors of malicious code enhances security, and author identification can also be used to resolve copyright disputes or detect plagiarism. The methods in this realm [1] have been used to identify so-called doppelgängers, linking accounts that belong to the same identity across platforms, especially underground forums that serve as business platforms for cybercriminals. By analyzing machine learning models' internal representations and linguistic human fingerprints, I am able to uncover facts about the world, society, and the use of language, with implications for privacy, security, and fairness in machine learning.
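The bias-in-embeddings finding rests on measuring how strongly a word's vector associates with one attribute set versus another. A minimal sketch of that association score, in the spirit of the word-embedding association test from [3], is below. The four 3-dimensional "embeddings" are invented toy vectors for illustration; real experiments use pretrained vectors (e.g. GloVe or word2vec) and full target and attribute word sets.

```python
import math

# Toy 3-dimensional "embeddings" -- invented for illustration only;
# real association tests use pretrained vectors such as GloVe or word2vec.
embeddings = {
    "flower":     [0.9, 0.1, 0.0],
    "insect":     [0.1, 0.9, 0.0],
    "pleasant":   [0.8, 0.2, 0.1],
    "unpleasant": [0.2, 0.8, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def association(word, attrs_a, attrs_b):
    """Mean similarity of `word` to attribute set A minus its mean
    similarity to attribute set B; a positive score means `word`
    leans toward A."""
    sim_a = sum(cosine(embeddings[word], embeddings[a]) for a in attrs_a) / len(attrs_a)
    sim_b = sum(cosine(embeddings[word], embeddings[b]) for b in attrs_b) / len(attrs_b)
    return sim_a - sim_b

A, B = ["pleasant"], ["unpleasant"]
print(association("flower", A, B) > 0)  # True: "flower" leans pleasant
print(association("insect", A, B) > 0)  # False: "insect" leans unpleasant
```

With real pretrained embeddings, the same differential-association score surfaces human-like biases, which is what makes the construct-level claim measurable rather than anecdotal.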
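The linguistic-fingerprint idea can likewise be sketched as a toy stylometric classifier: build a character n-gram frequency profile from each author's known writing and attribute an anonymous sample to the author with the most similar profile. The sample texts, the overlap measure, and the nearest-profile rule here are hypothetical simplifications; the actual systems in [2, 4, 5] use far richer lexical, layout, and syntactic features with trained classifiers.

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Relative character n-gram frequencies of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def similarity(p, q):
    """Overlap of two frequency profiles: sum of shared minimum frequencies."""
    return sum(min(freq, q.get(g, 0.0)) for g, freq in p.items())

# Hypothetical known writing samples, one per author.
known = {
    "alice": "for i in range(10): print(i)  # loop over the first ten numbers",
    "bob":   'while (count < 10) { printf("%d\\n", count); count++; }',
}
profiles = {author: ngram_profile(text) for author, text in known.items()}

def attribute(sample):
    """Attribute an anonymous sample to the author with the closest profile."""
    p = ngram_profile(sample)
    return max(profiles, key=lambda author: similarity(p, profiles[author]))

print(attribute("for j in range(5): print(j)"))  # prints "alice"
```

The anonymous snippet shares many trigrams ("for", "ran", "nge", "pri", ...) with the Python-style sample and few with the C-style one, so the nearest profile wins; scaling the same idea to thousands of features is what makes individual fingerprints separable.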