Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary

Published in arXiv, 2023

Decompilation aims to recover the source code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is to recover variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases and potential hallucinations. We build a prototype, GENNM, from the pre-trained generative models CodeGemma-2B and CodeLlama-7B. We finetune GENNM on decompiled functions and mitigate model biases by incorporating symbol preference into the training pipeline. When querying a function, GENNM includes names from its callers and callees, providing rich contextual information within the model’s input token limit. It further leverages program analysis to validate the consistency of names produced by the generative model. Our results show that GENNM improves the state-of-the-art name recovery precision by 8.6 and 11.4 percentage points on two commonly used datasets and improves the state-of-the-art from 8.5% to 22.8% in the most challenging setup where ground-truth variable names are not seen in the training dataset.
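The idea of adding caller and callee names to a query under a token limit can be illustrated with a minimal sketch. This is not GENNM's implementation: the `Function` record, the `build_query` helper, the prompt layout, and the whitespace-based `count_tokens` are all hypothetical stand-ins chosen for illustration, and a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """Hypothetical record for a decompiled function and names predicted so far."""
    name: str
    body: str
    callers: list = field(default_factory=list)    # names of functions calling this one
    callees: list = field(default_factory=list)    # names of functions this one calls
    predicted: dict = field(default_factory=dict)  # placeholder -> predicted name


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the model's real tokenizer.
    return len(text.split())


def build_query(target: Function, functions: dict, token_budget: int) -> str:
    """Put the target body first, then add neighbor names while they fit the budget."""
    prompt = f"// function: {target.name}\n{target.body}\n"
    used = count_tokens(prompt)
    for neighbor_name in target.callers + target.callees:
        neighbor = functions.get(neighbor_name)
        if neighbor is None or not neighbor.predicted:
            continue  # nothing known about this neighbor yet
        names = ", ".join(f"{k}->{v}" for k, v in sorted(neighbor.predicted.items()))
        line = f"// context {neighbor_name}: {names}\n"
        cost = count_tokens(line)
        if used + cost > token_budget:
            break  # stop once the next context line would exceed the budget
        prompt += line
        used += cost
    return prompt
```

In this sketch the target function always takes priority, and context from callers is added before context from callees; that ordering is an arbitrary choice made here, not one stated in the abstract.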

Recommended citation: Xu, X., Zhang, Z., Su, Z., Huang, Z., Feng, S., Ye, Y., ... & Zhang, X. (2023). Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary. arXiv preprint arXiv:2306.02546.