Machine Learning Methods for Software Vulnerability Detection






Software vulnerabilities are a primary concern in the IT security industry, because malicious hackers who discover them can often exploit them for nefarious purposes. Numerous countermeasures, such as stack canaries, data execution prevention, and address space layout randomization, have been implemented to deter attackers from gaining full control over systems, but thus far most of these techniques have proven to be only minor hurdles for a determined adversary. Currently, the only way to prevent systems from being exploited is to write secure code. However, complex programs, particularly those written in a relatively low-level language like C, are difficult to scan fully for bugs, even when both manual and automated techniques are used. Because analyzing code and verifying that it is securely written has proven to be a non-trivial task, improving the existing techniques for automated bug detection is an important area of research. Both static analysis and dynamic analysis techniques have been heavily investigated, and this work focuses on the former. The contribution of this paper is a demonstration that a large percentage of bugs can be caught by extracting features from C source code and analyzing them with a machine learning classifier. Both simple and complex features were extracted from the functions in that source code, and the simple features unexpectedly performed better than the complex ones. This suggests that simple features are worth researching further, because they are very cheap to analyze and appear to have considerable potential for vulnerability detection.
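As a rough illustration of what extracting cheap features from a C function could look like, the following minimal sketch computes token n-gram counts and two basic size metrics. It is not the method used in this work: the abstract does not say which features were extracted, and the choice of n-grams and software metrics is only suggested by the keywords. The tokenizer, the function names, and the example C snippet are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately crude tokenizer: identifiers, integer literals,
# or any single non-whitespace character. It does not handle string
# literals, comments, or multi-character operators.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(c_source):
    """Split C source text into a flat list of tokens."""
    return TOKEN_RE.findall(c_source)


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_metrics(c_source):
    """Two inexpensive size metrics (hypothetical choices)."""
    nonblank = [line for line in c_source.splitlines() if line.strip()]
    return {
        "nonblank_lines": len(nonblank),
        "token_count": len(tokenize(c_source)),
    }


# Hypothetical function with an unchecked copy into a fixed-size buffer.
example = """void copy(char *src) {
    char buf[16];
    strcpy(buf, src);
}"""

tokens = tokenize(example)
bigrams = ngram_counts(tokens, n=2)
metrics = simple_metrics(example)
```

In a setup like this, the n-gram counts and metrics for each function would form one feature vector, and a classifier trained on labeled functions would then predict whether an unseen function is likely to be vulnerable.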



N-grams, Software metrics, Machine learning, Buffer overflow, Vulnerabilities, Suffix trees