Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Room: Virginia B
John Heaps - University of Texas at San Antonio
Co-Author(s): Dr. Rocky Slavin, University of Texas at San Antonio, San Antonio, TX; Dr. Xiaoyin Wang, University of Texas at San Antonio, San Antonio, TX; Dr. Jianwei Niu, University of Texas at San Antonio, San Antonio, TX.
Software bugs can lead to delays in development, high costs, and security risks to all stakeholders. In 2018, software bugs cost the world economy over $1.7 trillion and impacted over 3.7 billion people. Currently, most bug detection tools use static analysis techniques to detect software bugs. However, static analysis has notable limitations: code patterns or specifications must be manually defined, the analysis is conservative, and it does not always scale. Deep learning models and techniques have overcome similar limitations in other domains, and so may be successful in mitigating them here as well. However, there are many obstacles in the application of deep learning to code: the complex syntactic structures of code, the constant definition of new methods and variables, the lack of a well-curated dataset for learning, and the data sparsity problem. Deep learning on code works by learning a vector representation for every code element in a vocabulary, which together form a language model. Current language models are statistical models based on the probability of occurrence of code elements. However, such a language model does not represent the bug detection problem well, where it is more important to model the meaning and logic behind code elements. To address this, we define a behavioral language model for code that learns each code element using only the code elements that affect its behavior. The two main evaluations performed to determine vector representation quality are perplexity (or entropy) and an exploratory analysis of the vector space, where tighter clustering of similar code elements indicates better vector representations. To our knowledge, the vector representations produced by our model achieved quality similar to the state of the art, with a perplexity of 6.45, while our model was significantly smaller and far simpler than those in the current literature.
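The two evaluation measures above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the token probabilities and the embeddings are made-up toy values, and the identifier names (`ArrayList`, `LinkedList`, `println`) are assumed examples of code elements.

```python
import math

def perplexity(probs):
    """Perplexity of a language model over a sequence: the exponential
    of the average negative log-likelihood the model assigns to each
    observed code element. Lower means the model is less 'surprised'."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors; values near 1
    mean the code elements lie close together in the vector space."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# A model assigning higher probability to the observed code elements
# scores a lower (better) perplexity:
assert perplexity([0.5, 0.4, 0.6]) < perplexity([0.1, 0.05, 0.2])

# Toy, hand-made embeddings (not from the actual model): similar code
# elements should cluster, i.e., be more cosine-similar to each other
# than to unrelated elements.
emb = {
    "ArrayList":  [0.90, 0.10, 0.20],
    "LinkedList": [0.85, 0.15, 0.25],
    "println":    [0.10, 0.90, 0.30],
}
assert (cosine_similarity(emb["ArrayList"], emb["LinkedList"])
        > cosine_similarity(emb["ArrayList"], emb["println"]))
```

In an exploratory analysis, pairwise similarities like these (typically after dimensionality reduction) reveal whether semantically related code elements form coherent clusters.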
Exploratory analysis showed many good clusters of similar code elements, but some code elements did not appear to cluster properly. Further, the model does not mitigate all of the above limitations; the most severe remaining issues are the data sparsity problem and the handling of previously unseen code elements, which significantly impede its application to code analysis. In our future work, we plan to implement a different semantic language model based on code element definitions, which has the potential to mitigate all of the above limitations and would allow it to be feasibly applied to bug detection analysis. References: Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In 2012 34th International Conference on Software Engineering, pages 837–847. IEEE, 2012. John Heaps, Xiaoyin Wang, Travis Breaux, and Jianwei Niu. Toward detection of access control models from source code via word embedding. In Proceedings of the 24th ACM Symposium on Access Control Models and Technologies, pages 103–112. ACM, 2019.
Funder Acknowledgement(s): This research was funded, in part, by the CREST Center for Security and Privacy Enhanced Cloud Computing (C-SPECC) through the National Science Foundation (NSF) (Grant #1736209).
Faculty Advisor: Dr. Jianwei Niu, email@example.com
Role: I was heavily involved in every aspect of the research, including the definition of the new language model, the implementation of the deep learning model, the collection of training data, and the evaluation of results.