Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Session: 3
Harmit Raval - Princeton University
Co-Author(s): Nicolas Agostini, Northeastern University, Boston; David Kaeli, Northeastern University, Boston
A common situation involves a programmer trying to update a program that lacks proper documentation or comments. The programmer must then "reverse engineer" the code, deciphering its purpose or characteristics through time-consuming code reading. To avoid this laborious task, we consider how well we can represent code and automate this kind of analysis by using natural language processing algorithms to transform the code into natural language, and by exploiting supervised machine learning models to identify the original programmer's intent. The focus of our contribution is to perform this translation of code by generating a natural-language-friendly representation (pseudo-English) before tokenizing the data, converting specific tokens, such as operators, into readable phrases. We evaluate the utility of this approach to understanding code intention by measuring the accuracy with which our model answers the question "What problem was this code originally written to solve?", using a corpus of programming examples taken from a coding competition website (Codeforces). Employing our novel "code to natural language translation" approach, different natural language processing algorithms, paired with different classification models, show a consistent increase in classification accuracy, with a 5–15% improvement across our study and a best-case cross-validation classification accuracy of 85.95% on a 52-class classification problem. This method can be transferred to other high-level programming languages and easily applied to other classification problems that aim to better understand a programmer's intention (e.g., malware classification).
References:
McCormick, Chris. "Word2Vec Tutorial - The Skip-Gram Model." 2016, http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Le, Quoc, and Tomas Mikolov. "Distributed Representations of Sentences and Documents." Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
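To make the conversion step concrete, the following is a minimal sketch in Python of the kind of operator-to-pseudo-English rewriting described in the abstract; the token mapping, the replacement wording, and the function name are illustrative assumptions rather than the exact mapping used in the study.

    import re

    # Hypothetical mapping from C operators and punctuation to readable
    # phrases; the study's actual token set and wording are assumptions here.
    OPERATOR_WORDS = {
        "==": " equals ",
        "!=": " does not equal ",
        "<=": " is less than or equal to ",
        ">=": " is greater than or equal to ",
        "&&": " and ",
        "||": " or ",
        "++": " increment ",
        "--": " decrement ",
        "=": " assign ",
        "<": " is less than ",
        ">": " is greater than ",
        "+": " plus ",
        "-": " minus ",
        "*": " times ",
        "/": " divided by ",
        "%": " modulo ",
        ";": " end statement ",
        "{": " begin block ",
        "}": " end block ",
    }

    def code_to_pseudo_english(source):
        """Rewrite C source into a pseudo-English string that tokenizes
        like natural language. Note: a naive string replace also rewrites
        operators inside string literals; a production version would lex
        the code first (e.g., with pycparser)."""
        # Replace multi-character operators first so that ">=" is not
        # split into "is greater than" followed by "assign".
        for op in sorted(OPERATOR_WORDS, key=len, reverse=True):
            source = source.replace(op, OPERATOR_WORDS[op])
        # Collapse the extra whitespace introduced by the replacements.
        return re.sub(r"\s+", " ", source).strip()

    print(code_to_pseudo_english("if (a >= b) { a = a - b; }"))
    # -> if (a is greater than or equal to b) begin block a assign a minus b end statement end block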
Funder Acknowledgement(s): I would like to thank my advisor Dr. Kaeli and mentor Nico for their help in the field as well as for providing the opportunity to work in Northeastern's Electrical and Computer Engineering lab for the summer. The National Science Foundation provided the funding for this project.
Faculty Advisor: David Kaeli, kaeli@ece.neu.edu
Role: I focused on both the Word2Vec and Doc2Vec models. In particular, I learned the Python libraries for these two models and implemented them. I also implemented scripts for the machine learning algorithms we compared: random forest, decision tree, support vector machine, and neural network classifiers. In addition, I wrote a script that converts raw C code to natural language. Finally, I cleaned, parsed, and split the data we used and provided general test cases for the code I wrote.
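A minimal sketch of the embedding-plus-classification pipeline described above, assuming gensim (4.x) for the Doc2Vec model and scikit-learn for the classifiers; the hyperparameters, the function name, and the choice to show only the random forest are illustrative assumptions, not the project's exact configuration.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def embed_and_classify(pseudo_english_docs, labels):
        """pseudo_english_docs: converted source files as strings;
        labels: which Codeforces problem each file solves."""
        # Tag each document so Doc2Vec learns one vector per source file.
        corpus = [TaggedDocument(words=doc.split(), tags=[i])
                  for i, doc in enumerate(pseudo_english_docs)]
        model = Doc2Vec(corpus, vector_size=100, window=5,
                        min_count=2, epochs=40, workers=4)
        # One fixed-length feature vector per program.
        X = [model.dv[i] for i in range(len(corpus))]
        # Random forest stands in for the four classifiers compared
        # (random forest, decision tree, SVM, neural network).
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        # Mean cross-validated accuracy on the multi-class task.
        return cross_val_score(clf, X, labels, cv=5).mean()

Swapping RandomForestClassifier for scikit-learn's DecisionTreeClassifier, SVC, or MLPClassifier reproduces the rest of the classifier comparison.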