Methodologies for Evaluating and Interpreting Neural Code Intelligence Models
Abstract
Deep neural models are increasingly being used for various code intelligence tasks, such as code summarization, automated code generation, and software bug detection. Researchers commonly use these models to solve downstream tasks that improve developer productivity and code quality. Despite the continuing development of code intelligence models, it remains largely unclear how reliable these models are in real-world scenarios. This issue is further complicated by the fact that these models are opaque black boxes that depend on noise-prone data sources for learning. Therefore, to adopt such models reliably, researchers often need to reason about their underlying behaviors and the factors that affect them. However, how well these models generalize to unseen data and what relevant features they learn for making predictions remain largely unknown. A lack of knowledge in these areas may lead to exaggerated assessments of the models' learning behaviors and to their reckless deployment in safety-critical applications. Moreover, state-of-the-art analysis approaches are typically specific to a particular set of architectures and require access to a model's parameters, which hinders their reliable adoption by most researchers.
To address these challenges, we propose a set of model-agnostic methodologies that inspect models by analyzing inputs and observing outputs, without accessing a model's parameters. The overarching goal is to enhance our understanding of model inference by exploring learning behaviors in terms of generalizability and interpretability. Specifically, we assess a model's ability to generalize its performance with respect to noise-inducing memorization and semantic-preserving transformations. Additionally, we identify critical features in input programs for interpreting a model's predictions through prediction-preserving reduction. Our results indicate that neural code intelligence models are prone to memorizing noisy data owing to their excessive parameters, are often vulnerable to very small semantic-preserving changes in input programs, and typically rely on only a few syntactic features for making predictions; consequently, these models usually generalize poorly to unseen scenarios. These observations could help researchers better understand the underlying behavior of these models and prompt them to focus their efforts on devising new techniques that alleviate the shortcomings of existing models.
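The sketch below (Python) illustrates, in simplified form, the kind of black-box probes described above: a semantic-preserving identifier rename paired with a prediction check, and a prediction-preserving reduction that greedily removes tokens while the model's output stays the same. The predict interface, the toy model, and the greedy removal loop are illustrative assumptions made for this sketch, not the actual tooling or reduction strategy used in the dissertation.

# Illustrative sketch only: `Predictor` is an assumed black-box interface that maps
# a token sequence to a label without exposing any model parameters.
from typing import Callable, List

Predictor = Callable[[List[str]], str]


def rename_identifiers(tokens: List[str], mapping: dict) -> List[str]:
    """Semantic-preserving transformation: consistently rename identifiers."""
    return [mapping.get(tok, tok) for tok in tokens]


def prediction_is_preserved(predict: Predictor, tokens: List[str], mapping: dict) -> bool:
    """Check whether a semantic-preserving rename changes the model's prediction."""
    return predict(tokens) == predict(rename_identifiers(tokens, mapping))


def prediction_preserving_reduction(predict: Predictor, tokens: List[str]) -> List[str]:
    """Greedily drop tokens while the original prediction is preserved; the
    surviving tokens approximate the features the model relies on."""
    original = predict(tokens)
    reduced = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            candidate = reduced[:i] + reduced[i + 1:]
            if candidate and predict(candidate) == original:
                reduced = candidate
                changed = True
                break
    return reduced


if __name__ == "__main__":
    # Toy black-box "method name prediction" model, used only for illustration.
    def toy_predict(tokens: List[str]) -> str:
        return "sort" if "sorted" in tokens else "other"

    program = "def f ( xs ) : return sorted ( xs )".split()
    print(prediction_is_preserved(toy_predict, program, {"xs": "values", "f": "g"}))  # True
    print(prediction_preserving_reduction(toy_predict, program))                      # ['sorted']

In this toy example, the reduction collapses the program to the single token the model actually depends on, mirroring the observation above that models often rely on only a few syntactic features for their predictions.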