Sociolinguistics and programming
F. Naz, J. Rice
2015 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM)
Published: 2015-11-30
DOI: https://doi.org/10.1109/PACRIM.2015.7334812
Citation count: 2
Abstract
This paper focuses on the use of machine learning techniques to analyze computer programs in order to acquire information about an author's gender. Few existing studies address the relationship between linguistics and programming; however, in many areas where language is analyzed, it is possible to mine important information about the users of that language that is associated with a set of attributes or a coding style. In this work we use open source implementations of machine learning algorithms, specifically a nearest neighbor classifier (K*), a decision tree (J48), and a Bayes classifier (Naïve Bayes). These algorithms were applied to C++ programs that were associated with sociolinguistic information about the program authors. Our goal was to classify the programs according to the gender of the author. Our initial results achieve a precision of 72.3%, a recall of 72%, and an F-measure of 71.9%, which demonstrates that we can predict the gender of the authors of C++ programs.
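For readers unfamiliar with the three reported metrics, the following minimal Python sketch shows how precision, recall, and F-measure can be computed for a two-class problem. The abstract does not say how the figures were averaged across the two classes, so the sketch assumes support-weighted averaging. The labels and predictions are invented toy data, not the paper's data, and the function names are hypothetical.

```python
from typing import List, Tuple


def per_class_metrics(y_true: List[str], y_pred: List[str], label: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def weighted_average_metrics(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Average per-class metrics, weighting each class by its share of true labels.

    Assumption: support-weighted averaging; the paper may have averaged differently.
    """
    n = len(y_true)
    totals = [0.0, 0.0, 0.0]
    for label in sorted(set(y_true)):
        weight = sum(1 for t in y_true if t == label) / n
        for i, value in enumerate(per_class_metrics(y_true, y_pred, label)):
            totals[i] += weight * value
    return totals[0], totals[1], totals[2]


# Toy data for illustration only (hypothetical, NOT from the paper).
TOY_TRUE = ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"]
TOY_PRED = ["F", "F", "F", "M", "M", "M", "M", "M", "F", "F"]

if __name__ == "__main__":
    precision, recall, f_measure = weighted_average_metrics(TOY_TRUE, TOY_PRED)
    print(f"precision={precision:.3f} recall={recall:.3f} f-measure={f_measure:.3f}")
```

Under this averaging scheme, the weighted recall equals overall accuracy, which is one reason the three reported numbers can sit so close together.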