Reading Seminar
-
Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation (USENIX, 2024) (Discussion) In many ways, compilation is destructive: high-level constructs, variable names, and comments present in the source code are often lost. Nevertheless, decompilers aim to recover the source from a binary, and results vary between decompilers. So, what is a good decompilation? One could argue that fewer goto statements indicate better decompilation, because many gotos signify that the decompiler failed to structure the control flow. This paper argues instead that good decompilation resembles the source code, and points out that 3,754 goto statements are present in the Linux kernel. Since a goto can be intentional on the developer's part, the authors argue we should separate intended gotos from unintended ones. They investigate the gcc compiler to identify the transformation passes that cause unintended gotos, and then propose a structuring algorithm that deoptimizes the code to remove unintended gotos while preserving intended ones. SAILR, the proposed structuring algorithm, is implemented in the angr decompiler and performs competitively against state-of-the-art decompilers. The paper evaluates SAILR on 7,355 functions from 26 popular Debian packages; however, the number of functions seems oddly low considering the number of packages used. -
DETECTING PRETRAINING DATA FROM LARGE LANGUAGE MODELS (ICLR, 2024) (Discussion) This paper proposes MIN-K% PROB, a membership inference method for Large Language Models (LLMs). The problem it tackles is detecting pretraining data, which is challenging because LLM developers do not disclose the data used for pretraining, and because pretraining typically sees each data instance only once, leaving a weak memorization signal. The paper therefore assumes that the membership inference detector cannot know the distribution of the pretraining data, meaning there is no reference model (e.g., a shadow model) of the kind used in prior membership inference techniques. Consequently, the paper proposes the reference-free method MIN-K% PROB: it selects the k% of tokens with the lowest probability as a set of outliers, averages their log-probabilities, and compares that score against a threshold to infer membership or non-membership. The method seems efficient, as it infers membership from token-level probabilities alone. Additionally, using this method to evaluate unlearning techniques, as demonstrated in the case study, would be beneficial. -
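The scoring rule above fits in a few lines of code. This is a minimal sketch of my reading of it; the function names, the default k, and the thresholding convention are my assumptions, not the paper's exact implementation:

```python
def min_k_prob_score(token_logprobs, k=0.2):
    """Score a text by averaging the log-probabilities of its k% least-likely tokens.

    `token_logprobs` is the per-token log-probability sequence produced by the
    target model. A low (very negative) score means the text contains many
    surprising tokens, suggesting it was NOT in the pretraining data; a high
    score suggests membership.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the k% outlier (lowest-probability) tokens
    return sum(lowest) / n

def is_member(token_logprobs, threshold, k=0.2):
    # Texts scoring above the threshold are inferred to be pretraining members.
    return min_k_prob_score(token_logprobs, k) > threshold
```

Intuitively, a member text contains few surprising tokens, so even its worst-scoring tokens keep a relatively high probability, pushing the average above the threshold.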
You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content (S&P, 2024) (Discussion) The paper investigates the application of prompt learning with Large Language Models (LLMs) such as GPT-3 and T5 to address the issue of toxic online content. It focuses on three primary tasks: toxicity classification, toxic span detection, and detoxification, showing that prompt learning can perform as well as or even better than traditional models specifically trained for these tasks. Notably, this approach achieves a 10% improvement in toxicity classification and significantly lowers toxicity scores in detoxification tasks while maintaining the semantic integrity of the content. The study is distinguished by its innovative use of prompt tuning for managing toxic content and its thorough evaluation across five model architectures and eight different datasets, which verifies the method's effectiveness and efficiency. However, despite these strengths, the approach's dependency on dataset quality and its uncertain ability to generalize to unseen toxic content remain potential weaknesses, as do the complexity of designing effective prompts and the possibility of the techniques being misused. -
ANUBIS: a provenance graph-based framework for advanced persistent threat detection (SAC, 2022) (Discussion) This paper proposes ANUBIS, a provenance graph-based supervised APT detection framework. ANUBIS leverages a Bayesian Neural Network (BNN) to overcome limitations of current APT detection methods that use provenance graphs. The authors assume that an attacker can neither corrupt the provenance graphs used to train ANUBIS nor tamper with the event-logging mechanism of the host system. These assumptions give the security scenario a strong premise, but they are limiting because they may not hold in practice. Nevertheless, the paper is significant in demonstrating that its new graph-neighborhood encoding method is effective at predicting the nature of an attacker's activity. -
The Circle Of Life: A Large-Scale Study of The IoT Malware Lifecycle (USENIX, 2021) (Discussion) This paper is a measurement study of the IoT malware lifecycle. It analyzes around 166K Linux-based IoT malware samples collected over a year to characterize IoT malware. The paper concludes with some characteristics that differentiate IoT malware from traditional malware, such as the fact that most IoT malware is a variant of the Mirai botnet. However, there do not seem to be any unexpected findings. A slight complaint is that the figures in the paper have unclear captions; although the paper explains the concepts, some figures are difficult to understand. -
How Machine Learning Is Solving the Binary Function Similarity Problem (USENIX, 2022) (Discussion) This paper proposes a dataset composition for testing the latest Binary Code Similarity Detection (BCSD) techniques and analyzes the test results of various BCSD models, including recent GNN-based and embedding-based approaches. However, in the evaluation section, the results for similar and dissimilar function pairs should be split into separate tables to improve clarity. Additionally, many of the abbreviations used in the dataset composition are left unexplained, making the paper difficult to follow. -
Quark: Controllable Text Generation with Reinforced [Un]learning (NeurIPS, 2022) (Discussion) The paper proposes an approach that optimizes a reward function, in the spirit of Reinforcement Learning (RL), to mitigate undesired generation behaviors. The authors detail a three-stage algorithm comprising exploration, quantization, and learning, each critical to the model's development. Notably, the paper demonstrates the model's effectiveness through three distinct evaluations: reducing toxicity, steering away from unwanted sentiment, and minimizing repetition. This structured evaluation underscores the model's capability to unlearn specific unwanted behaviors, outperforming both baselines and state-of-the-art RL methods. The paper defines three undesirable behaviors and proposes a reward function for each; however, training a separate model for each behavior seems inefficient. -
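The quantization stage can be sketched in a few lines. This is a hedged illustration of the idea only; the bin count, the reward-token spellings, and the tie handling are mine, not Quark's exact implementation:

```python
def quantize_by_reward(samples, rewards, num_bins=5):
    """Sort (sample, reward) pairs and split them into equal-size quantile bins.

    Each bin is tagged with a reward token r_0 (worst) .. r_{K-1} (best).
    During the learning stage the model is trained to generate each sample
    conditioned on its bin's token; at inference time, conditioning on the
    best token steers generation away from the unwanted behavior.
    """
    order = sorted(range(len(samples)), key=lambda i: rewards[i])
    bins = {f"r_{b}": [] for b in range(num_bins)}
    for rank, i in enumerate(order):
        b = min(rank * num_bins // len(order), num_bins - 1)
        bins[f"r_{b}"].append(samples[i])
    return bins
```

Because the reward token is just another input token, a single exploration/quantization/learning loop per behavior suffices, which is also why a separate model per behavior feels wasteful.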
A Comprehensive Detection Method for the Lateral Movement Stage of APT Attacks (IEEE, 2023) (Discussion) This paper designs a multidimensional detection framework, based on the SMB protocol, to detect lateral movement behavior of APT attacks in an intranet environment. However, contrary to this stated purpose, the experimental results are disappointing: they measure how well malware is classified rather than how well lateral movement is detected, so the purpose and the results diverge. -
Anomaly detection in a forensic timeline with deep autoencoders (JISA, 2021) (Discussion) Systems generate logs to record internal events, and these logs are used for forensic investigation after a cyber incident. Unfortunately, a system generates numerous logs during its runtime, and the majority of the analysis is manual. This paper proposes a deep autoencoder for anomaly detection that assists analysts by highlighting anomalous events in Linux kernel system logs. Anomaly detection is a common application for autoencoders, used to analyze network traffic, logs, etc., and it is difficult to identify what distinguishes this paper from previous work. Furthermore, the proposed model is evaluated on an old dataset against classical ML methods such as SVMs; it would have been preferable to see the model's performance on an up-to-date dataset against state-of-the-art models. -
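The detection step in this line of work boils down to thresholding reconstruction error: an autoencoder trained on normal activity reconstructs normal events well, and poorly reconstructed events are surfaced to the analyst. Here is a minimal sketch assuming a simple mean-plus-z-sigma cutoff; the paper's actual threshold rule may differ:

```python
import statistics

def flag_anomalies(reconstruction_errors, z=3.0):
    """Flag events whose autoencoder reconstruction error is an outlier.

    `reconstruction_errors` holds one per-event error (e.g., mean squared
    error between a log event's feature vector and its reconstruction).
    Events whose error exceeds the mean by more than z standard deviations
    are reported as anomalous. The z-score cutoff is a common convention,
    not necessarily the paper's.
    """
    mu = statistics.mean(reconstruction_errors)
    sigma = statistics.pstdev(reconstruction_errors)
    cutoff = mu + z * sigma
    return [i for i, e in enumerate(reconstruction_errors) if e > cutoff]
```

The returned indices point the analyst at the handful of timeline events worth manual inspection, which is the labor-saving claim of this kind of system.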
Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures (ASIA CCS, 2023) (Discussion) There are already efficient ways to represent binary functions with a tokenizer (e.g., BPE) over assembly language. I am curious why the authors instead divide functions into seven features, such as Vex-IR instructions, LibcCalls, and Constants, and create a specific tokenizer for each feature to generate one-hot vectors, and I wonder whether this approach is genuinely effective. The explanation of whether this tokenization scheme positively impacts Binary Code Similarity Detection (BCSD) performance seems insufficient. -
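To make the design in question concrete, here is a hedged sketch of what a per-feature tokenizer producing one-hot vectors might look like; the feature names and token values below are illustrative, and the paper's seven features and vocabularies are more elaborate:

```python
def build_vocab(tokens):
    """Assign each distinct token an index: a minimal per-feature tokenizer."""
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

def one_hot(token, vocab):
    """Encode a token as a one-hot vector over its own feature's vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

# Two of the seven feature streams, with made-up token values.
libc_vocab = build_vocab(["memcpy", "strlen", "memcpy", "free"])
const_vocab = build_vocab(["0x0", "0x10", "0xff"])
```

Each feature gets its own small vocabulary, so a constant and a libc call never share an embedding slot; whether this separation actually improves BCSD accuracy over a single BPE tokenizer is exactly the question the discussion raises.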
Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers (AAAI, 2024) (Discussion) The paper utilizes adversarial examples to retain the original decision boundary during unlearning. This approach is intriguing, yet it raises the question of what it means to “forget” a data point: the paper treats misclassification as forgetting, yet the “Right to be Forgotten” seems to imply removing the influence of the training data itself. -
From Grim Reality to Practical Solution: Malware Classification in Real-World Noise (S&P, 2023) (Abstract) Malware datasets inevitably contain incorrect labels due to the shortage of expertise and experience needed for sample labeling. Previous research demonstrated that a training dataset with incorrectly labeled samples would result in inaccurate model learning. To address this problem, researchers have proposed various noise learning methods to offset the impact of incorrectly labeled samples, and in image recognition and text mining applications, these methods demonstrated great success. In this work, we apply both representative and state-of-the-art noise learning methods to real-world malware classification tasks. We surprisingly observe that none of the existing methods could minimize incorrect labels’ impact. Through a carefully designed experiment, we discover that the inefficacy mainly results from extreme data imbalance and the high percentage of incorrectly labeled data samples. As such, we further propose a new noise learning method and name it after MORSE. Unlike existing methods, MORSE customizes and extends a state-of-the-art semi-supervised learning technique. It takes possibly incorrectly labeled data as unlabeled data and thus avoids their potential negative impact on model learning. In MORSE, we also integrate a sample re-weighting method that balances the training data usage in the model learning and thus handles the data imbalance challenge. We evaluate MORSE on both our synthesized and real-world datasets. We show that MORSE could significantly outperform existing noise learning methods and minimize the impact of incorrectly labeled data. -
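The sample re-weighting idea in the abstract can be illustrated with a generic class-balancing scheme. This is a sketch of the concept only, not MORSE's exact formulation:

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Give each sample a weight inversely proportional to its class frequency.

    Rare malware families then contribute as much total weight to the loss as
    common ones, countering the extreme data imbalance the paper identifies.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return [total / (num_classes * counts[y]) for y in labels]
```

Under this scheme the weights of each class sum to the same value, so during training a minority family is no longer drowned out by a dominant one.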
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (EMNLP, 2021) (Abstract) Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5. -
InCoder: A Generative Model for Code Infilling and Synthesis (ICLR, 2023) (Abstract) Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via masking and infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first large generative code model that is able to infill arbitrary regions of code, which we evaluate in a zero-shot setting on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. Our models and code are publicly released. -
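The mask-and-move training transformation described in the abstract is easy to sketch: a region of the file is cut out, replaced by a sentinel, and appended at the end, so a left-to-right model still gets to condition on both sides of the hole. Sentinel token spellings here are illustrative, not InCoder's exact vocabulary:

```python
def mask_for_infilling(code, start, end, mask_token="<MASK:0>", eom="<EOM>"):
    """Move a region of code to the end of the file behind a sentinel.

    The model sees the left AND right context before the masked span, so it
    learns to infill with bidirectional context while still being trained
    with an ordinary left-to-right objective.
    """
    left, span, right = code[:start], code[start:end], code[end:]
    return left + mask_token + right + mask_token + span + eom
```

At inference time, generation is prompted with everything up to the second sentinel, and the tokens emitted before the end-of-mask token become the infilled region.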
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (NeurIPS, 2022) (Abstract) Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such paradigm largely ignores some important but potentially useful signals in the problem specification such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose “CodeRL”, a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark. -
Ground Truth for Binary Disassembly is Not Easy (USENIX, 2022) (Abstract) Modern disassembly tools often rely on empirical evaluations to validate their performance and discover their limitations, thus promoting long-term evolvement. To support the empirical evaluation, a foundation is the right approach to collect the ground truth knowledge. However, there has been no unanimous agreement on the approach we should use. Most users pick an approach based on their experience or will, regardless of the properties that the approach presents. In this paper, we perform a study on the approaches to building the ground truth for binary disassembly, aiming to shed light on the right way for the future. We first provide a taxonomy of the approaches used by past research, which unveils five major mechanisms behind those approaches. Following the taxonomy, we summarize the properties of the five mechanisms from two perspectives: (i) the coverage and precision of the ground truth produced by the mechanisms and (ii) the applicable scope of the mechanisms (e.g., what disassembly tasks and what types of binaries are supported). The summarization, accompanied by quantitative evaluations, illustrates that many mechanisms are ill-suited to support the generation of disassembly ground truth. The mechanism best serving today’s need is to trace the compiling process of the target binaries to collect the ground truth information. Observing that the existing tool to trace the compiling process can still miss ground truth results and can only handle x86/x64 binaries, we extend the tool to avoid overlooking those results and support ARM32/AArch64/MIPS32/MIPS64 binaries. We envision that our extension will make the tool a better foundation to enable universal, standard ground truth for binary disassembly. -
Black-box Attacks Against Neural Binary Function Detection (RAID, 2023) (Abstract) Binary analyses based on deep neural networks (DNNs), or neural binary analyses (NBAs), have become a hotly researched topic in recent years. DNNs have been wildly successful at pushing the performance and accuracy envelopes in the natural language and image processing domains. Thus, DNNs are highly promising for solving binary analysis problems that are hard due to a lack of complete information resulting from the lossy compilation process. Despite this promise, it is unclear that the prevailing strategy of repurposing embeddings and model architectures originally developed for other problem domains is sound given the adversarial contexts under which binary analysis often operates. In this paper, we empirically demonstrate that the current state of the art in neural function boundary detection is vulnerable to both inadvertent and deliberate adversarial attacks. We proceed from the insight that current generation NBAs are built upon embeddings and model architectures intended to solve syntactic problems. We devise a simple, reproducible, and scalable black-box methodology for exploring the space of inadvertent attacks – instruction sequences that could be emitted by common compiler toolchains and configurations – that exploits this syntactic design focus. We then show that these inadvertent misclassifications can be exploited by an attacker, serving as the basis for a highly effective black-box adversarial example generation process. We evaluate this methodology against two state-of-the-art neural function boundary detectors: XDA and DeepDi. We conclude with an analysis of the evaluation data and recommendations for ho -
SelectiveTaint: Efficient Data Flow Tracking With Static Binary Rewriting (USENIX, 2021) (Abstract) Taint analysis has been widely used in many security applications such as exploit detection, information flow tracking, malware analysis, and protocol reverse engineering. State-of-the-art taint analysis tools are usually built atop dynamic binary instrumentation, which instruments at every possible instruction, and rely on runtime information to decide whether a particular instruction involves taint or not, thereby usually having high performance overhead. This paper presents SELECTIVETAINT, an efficient selective taint analysis framework for binary executables. The key idea is to selectively instrument the instructions involving taint analysis using static binary rewriting instead of dynamic binary instrumentation. At a high level, SELECTIVETAINT statically scans taint sources of interest in the binary code, leverages value set analysis to conservatively determine whether an instruction operand needs to be tainted or not, and then selectively taints the instructions of interest. We have implemented SELECTIVETAINT and evaluated it with a set of binary programs including 16 coreutils (focusing on file I/O) and five network daemon programs (focusing on network I/O) such as nginx web server. Our evaluation results show that the binaries statically instrumented by SELECTIVETAINT has superior performance compared to the state-of-the-art dynamic taint analysis frameworks (e.g., 1.7x faster than that of libdft).