Methodologies for Evaluating and Interpreting Neural Code Intelligence Models

Date

2023-04-24

Abstract

Deep neural models are increasingly being used for various code intelligence tasks, such as code summarization, automatic code generation, and software bug detection. Researchers commonly apply these models to downstream tasks that improve developer productivity and code quality. Despite the continuing development of code intelligence models, it remains largely unclear how reliable these models are in real-world scenarios. This issue is further complicated by the fact that these models are opaque black boxes and depend on noise-prone data sources for learning. Therefore, to adopt such models reliably, researchers often need to reason about their underlying behaviors and the factors that affect them. However, it is still largely unknown how well these models generalize to unseen data and which relevant features they learn for making predictions. A lack of knowledge in these areas may lead to an exaggerated view of what the models have learned and to their reckless deployment in safety-critical applications. Moreover, state-of-the-art analysis approaches are typically specific to a particular set of architectures and require access to the model's parameters, which hinders their reliable adoption by most researchers.

To address these challenges, we propose a set of model-agnostic methodologies that inspect models by analyzing their inputs and observing their outputs, without accessing the models' parameters. The overarching goal is to enhance our understanding of a model's inference by exploring its learning behaviors in terms of generalizability and interpretability. Specifically, we assess a model's ability to generalize with respect to noise-induced memorization and semantic-preserving program transformations. Additionally, we identify critical features in input programs, through prediction-preserving program reduction, to interpret a model's predictions. Our results indicate that neural code intelligence models are prone to memorizing noisy data because of their excessive number of parameters, are often vulnerable to even very small semantic-preserving changes, and typically rely on only a few syntactic features for making their predictions; consequently, these models usually generalize poorly to unseen scenarios. These observations can help researchers better understand the underlying behavior of such models and prompt them to focus their efforts on devising new techniques that alleviate the shortcomings of existing models.
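
To make the two probes concrete, the following minimal sketch shows, under stated assumptions, how such black-box analyses can be set up. Here, predict is a hypothetical stand-in for any code intelligence model's input-output interface; rename_variable applies a deliberately simplified semantic-preserving transformation (a whole-word identifier rename via regex, whereas a tool like the dissertation's ProgramTransformer operates on ASTs, which is safer around strings and comments); and reduce_preserving_prediction is a simplified delta-debugging-style loop, in the spirit of prediction-preserving input minimization, rather than the dissertation's actual implementation.

    import re

    def rename_variable(code: str, old: str, new: str) -> str:
        """Semantic-preserving transformation (simplified): rename one
        identifier via whole-word match. An AST-based implementation
        would correctly skip string literals and comments."""
        return re.sub(rf"\b{re.escape(old)}\b", new, code)

    def is_robust(predict, code: str, old: str, new: str) -> bool:
        """Generalizability probe: does the model's prediction survive a
        change that leaves program semantics untouched?"""
        return predict(code) == predict(rename_variable(code, old, new))

    def reduce_preserving_prediction(predict, tokens, sep=" "):
        """Interpretability probe: greedily drop chunks of tokens while the
        model's prediction stays unchanged (a simplified ddmin-style loop);
        whatever survives approximates the features the model relies on."""
        target = predict(sep.join(tokens))
        n = 2  # number of chunks to partition the token list into
        while len(tokens) >= 2:
            chunk = max(1, len(tokens) // n)
            reduced = False
            for i in range(0, len(tokens), chunk):
                candidate = tokens[:i] + tokens[i + chunk:]
                if candidate and predict(sep.join(candidate)) == target:
                    tokens, n, reduced = candidate, max(n - 1, 2), True
                    break
            if not reduced:
                if n >= len(tokens):
                    break  # already at single-token granularity
                n = min(len(tokens), n * 2)  # refine the partition
        return tokens

    if __name__ == "__main__":
        # Hypothetical model that keys its output on one lexical feature,
        # mirroring the finding that models often rely on few tokens.
        predict = lambda code: "sort" if "sort" in code else "other"
        program = "def sort_items ( xs ) : return sorted ( xs )"
        print(is_robust(predict, program, "xs", "values"))            # True
        print(reduce_preserving_prediction(predict, program.split())) # ['sorted']

Operating purely on the predict interface is what makes such probes model-agnostic: any model, from a small RNN to a large pretrained transformer, can be analyzed this way without access to its parameters.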

Keywords

Models of code, Evaluation, Transparency, Generalizability, Interpretability

Citation

Portions of this document appear in:

Md Rafiqul Islam Rabin, Aftab Hussain, Sahil Suneja, and Mohammad Amin Alipour. Study of Distractors in Neural Models of Code. Proceedings of the 1st IEEE/ACM International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE), Melbourne, Australia, May 2023.
Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, and Vincent J Hellendoorn. Memorization and Generalization in Neural Code Intelligence Models. Information and Software Technology (IST), Volume 153, January 2023. https://doi.org/10.1016/j.infsof.2022.107066
Md Rafiqul Islam Rabin and Mohammad Amin Alipour. Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code. Proceedings of the 21st IEEE International Conference on Machine Learning and Applications (ICMLA), Nassau, The Bahamas, December 2022. https://doi.org/10.1109/ICMLA55696.2022.00140
Md Rafiqul Islam Rabin, Aftab Hussain, and Mohammad Amin Alipour. Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS), San Diego, CA, USA, June 2022. https://doi.org/10.1145/3520312.3534869
Md Rafiqul Islam Rabin, Vincent J Hellendoorn, and Mohammad Amin Alipour. Understanding Neural Code Intelligence through Program Simplification. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Athens, Greece, August 2021. https://doi.org/10.1145/3468264.3468539
Md Rafiqul Islam Rabin, Nghi D.Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, and Mohammad Amin Alipour. On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations. Information and Software Technology (IST), Volume 135, July 2021. Journal First Track of the 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Virtual Event, USA, March 2022. https://doi.org/10.1016/j.infsof.2021.106552
Md Rafiqul Islam Rabin and Mohammad Amin Alipour. Configuring Test Generators Using Bug Reports: A Case Study of GCC Compiler and Csmith. Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC), Virtual Event, Republic of Korea, March 2021. https://doi.org/10.1145/3412841.3442047
Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, and Mohammad Amin Alipour. Towards Demystifying Dimensions of Source Code Embeddings. Proceedings of the 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages (RL+SE&PL), Virtual Event, USA, November 2020. https://doi.org/10.1145/3416506.3423580
Md Rafiqul Islam Rabin and Mohammad Amin Alipour. K-CONFIG: Using Failing Test Cases to Generate Test Cases in GCC Compilers. Late Breaking Results Track of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, November 2019. https://doi.org/10.48550/arXiv.1908.10481
Md Rafiqul Islam Rabin, Ke Wang, and Mohammad Amin Alipour. Testing Neural Program Analyzers. Late Breaking Results Track of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, November 2019. https://doi.org/10.48550/arXiv.1908.10711
Md Rafiqul Islam Rabin and Mohammad Amin Alipour. FeatureExtractor: A tool for Extracting Key Input Features of Code Intelligence Models. Software Impacts, Volume 14, December 2022. https://doi.org/10.1016/j.simpa.2022.100432
Md Rafiqul Islam Rabin and Mohammad Amin Alipour. ProgramTransformer: A tool for Generating Semantically Equivalent Transformed Programs. Software Impacts, Volume 14, December 2022. https://doi.org/10.1016/j.simpa.2022.100429
Md Rafiqul Islam Rabin, Aftab Hussain, and Mohammad Amin Alipour. Artifact for Article (CI-DD-Perses): Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models. ACM Digital Library, Zenodo 6630188, June 2022. https://doi.org/10.5281/zenodo.6630188
Md Rafiqul Islam Rabin, Vincent J Hellendoorn, and Mohammad Amin Alipour. Artifact for Article (SIVAND): Understanding Neural Code Intelligence Through Program Simplification. ACM Digital Library, Zenodo 5154090, August 2021. https://doi.org/10.1145/3462296