

LLM-Assisted Static Analysis for LLM Serving Framework Vulnerabilities

IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities (ICLR 2025)

The paper introduces IRIS, a neuro-symbolic approach that combines Large Language Models (LLMs) with static analysis to detect security vulnerabilities. Key features:

  • Uses LLMs to infer taint specifications and perform contextual analysis
  • Works on whole-repository level for comprehensive security analysis
  • Reduces false positives through systematic filtering

The researchers created a new dataset comprising:

  • 120 manually validated security vulnerabilities
  • Real-world Java projects averaging 300K lines of code
  • Four common vulnerability classes

The evaluation showed significant improvements:

  • IRIS with GPT-4 detected 55 vulnerabilities (28 more than CodeQL)
  • Reduced the false discovery rate by 5 percentage points compared to CodeQL
  • Discovered 4 previously unknown vulnerabilities in latest versions of Java projects

This work demonstrates the potential of combining LLMs with traditional static analysis tools to improve vulnerability detection while reducing false positives and manual effort.
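As a rough illustration of the spec-inference idea (not IRIS's actual prompts or its CodeQL integration), the sketch below asks an LLM to label third-party APIs as taint sources, sinks, or sanitizers in a form a taint analyzer could consume. The `call_llm` helper and the JSON output format are assumptions.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; swap in any chat-completion API."""
    raise NotImplementedError

def infer_taint_specs(api_signatures):
    """Ask the LLM to classify external APIs as taint sources/sinks/sanitizers.

    Returns a list of dicts like {"method": ..., "role": "source|sink|sanitizer|none"}
    that could be rendered into analyzer-specific taint specification rows.
    """
    prompt = (
        "For each Java method signature below, answer with a JSON list of objects "
        '{"method": ..., "role": "source" | "sink" | "sanitizer" | "none"} '
        "indicating its role in taint tracking for injection vulnerabilities.\n\n"
        + "\n".join(api_signatures)
    )
    return json.loads(call_llm(prompt))

# Example usage (signatures are illustrative):
# specs = infer_taint_specs([
#     "javax.servlet.http.HttpServletRequest.getParameter(String)",
#     "java.sql.Statement.executeQuery(String)",
# ])
# sources = [s["method"] for s in specs if s["role"] == "source"]
```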

LLMSA: A Compositional Neuro-Symbolic Approach to Compilation-free and Customizable Static Analysis

We propose LLMSA, a compositional neuro-symbolic approach for compilation-free, customizable static analysis with reduced hallucinations. Specifically, we propose an analysis policy language that lets users decompose an analysis problem into several sub-problems, each targeting a simple syntactic or semantic property of a smaller code snippet. The decomposition lets the LLMs handle more manageable semantic sub-problems, while the syntactic ones are resolved by parsing-based analysis without hallucinations. An analysis policy is evaluated with lazy, incremental, and parallel prompting, which mitigates hallucinations and improves performance. LLMSA achieves comparable and even superior performance to existing techniques across various analysis clients; for instance, it attains 66.27% precision and 78.57% recall in taint vulnerability detection, surpassing an industrial approach in F1 score by 0.20.
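The following toy decomposition (not LLMSA's actual policy language) illustrates the idea: a parsing-based sub-analysis finds syntactic candidates without any LLM involvement, and only the semantic question is delegated to an LLM per snippet. Python's `ast` stands in for the parser, and `call_llm` and the sink name are assumptions.

```python
import ast

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def syntactic_sink_calls(source: str, sink_name: str):
    """Parsing-based sub-problem: find calls to `sink_name` (no LLM, no hallucination)."""
    tree = ast.parse(source)
    return [n for n in ast.walk(tree)
            if isinstance(n, ast.Call)
            and isinstance(n.func, ast.Name)
            and n.func.id == sink_name]

def semantic_tainted(source: str, call: ast.Call) -> bool:
    """Semantic sub-problem on a small snippet, delegated to the LLM."""
    snippet = ast.get_source_segment(source, call)
    answer = call_llm(
        "Does the argument of this call derive from untrusted user input? "
        "Answer yes or no.\n\n" + source + "\n\nCall under review: " + str(snippet)
    )
    return answer.strip().lower().startswith("yes")

def analyze(source: str, sink_name: str = "execute_sql"):
    # Lazily evaluate the semantic check only for the syntactic candidates.
    return [c for c in syntactic_sink_calls(source, sink_name)
            if semantic_tainted(source, c)]
```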

Harnessing the Power of LLM to Support Binary Taint Analysis

  • First, LATTE is fully automated, whereas prior static binary taint analyzers need to rely on human expertise to manually customize taint propagation rules and vulnerability inspection rules (see the sketch after this list).
  • Second, LATTE is highly effective in vulnerability detection, as demonstrated by our comprehensive evaluations.

For example, LATTE has found 37 new bugs in real-world firmware, which the baselines failed to find. Moreover, 10 of them have been assigned CVE numbers.

  • Lastly, LATTE incurs remarkably low engineering cost, making it a cost-efficient and scalable solution for security researchers and practitioners.
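A heavily simplified sketch of the automation idea: instead of hand-written propagation rules, an LLM is asked how taint flows through one decompiled function. The `call_llm` helper, the prompt wording, and the JSON answer format are assumptions, not LATTE's actual interface.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def infer_propagation(decompiled_fn: str, tainted_args: list[str]) -> dict:
    """Ask the LLM how taint propagates through one decompiled function.

    Returns e.g. {"return_value": true, "out_params": ["a2"], "sinks_reached": ["strcpy"]}
    so a binary taint engine can keep tracking without hand-written rules.
    """
    prompt = (
        "The following is decompiled C-like code from a firmware binary.\n"
        f"Assume these parameters carry attacker-controlled data: {tainted_args}.\n"
        "Report, as JSON with keys return_value (bool), out_params (list), and "
        "sinks_reached (list), where the taint flows.\n\n" + decompiled_fn
    )
    return json.loads(call_llm(prompt))
```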

LSAST--Enhancing Cybersecurity through LLM-supported Static Application Security Testing

We propose LSAST, a novel approach that integrates LLMs with SAST scanners to enhance vulnerability detection. LSAST leverages a locally hostable LLM, combined with a state-of-the-art knowledge retrieval system, to provide up-to-date vulnerability insights without compromising data privacy. We set a new benchmark for static vulnerability analysis, offering a robust, privacy-conscious solution that bridges the gap between traditional scanners and advanced AI-driven analysis. Our evaluation demonstrates that incorporating SAST results into LLM analysis significantly improves detection accuracy, identifying vulnerabilities missed by conventional methods.
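A minimal sketch of the combination, assuming a locally hosted model and a retrieval helper (both hypothetical names): scanner findings and retrieved vulnerability knowledge are folded into one prompt for the local LLM.

```python
def call_local_llm(prompt: str) -> str:
    """Hypothetical client for a locally hosted model (on-prem inference server)."""
    raise NotImplementedError

def retrieve_vuln_knowledge(query: str, k: int = 3) -> list[str]:
    """Hypothetical retrieval over an up-to-date vulnerability knowledge base."""
    raise NotImplementedError

def analyze_with_scanner_and_knowledge(code: str, sast_findings: list[str]) -> str:
    """Fold scanner findings and retrieved knowledge into one local-LLM query."""
    knowledge = retrieve_vuln_knowledge(code)
    prompt = (
        "You are reviewing the code below for security vulnerabilities.\n\n"
        "SAST scanner findings:\n- " + "\n- ".join(sast_findings) + "\n\n"
        "Relevant vulnerability knowledge:\n- " + "\n- ".join(knowledge) + "\n\n"
        "Code:\n" + code + "\n\n"
        "List confirmed vulnerabilities, including any the scanner missed."
    )
    return call_local_llm(prompt)
```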

SkipAnalyzer: A Tool for Static Code Analysis with Large Language Model

We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above.

To evaluate SkipAnalyzer, we focus on two typical and critical bug types targeted by static bug detection, namely Null Dereference and Resource Leak.

We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%.
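A sketch of the false-positive-filter component, assuming a simplified Infer-style JSON report (the field names `bug_type`, `file`, `line`, `qualifier` and the `call_llm` / `source_lookup` helpers are assumptions, not SkipAnalyzer's actual interface):

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def load_warnings(report_path: str) -> list[dict]:
    """Load warnings from a simplified Infer-style JSON report (field names assumed)."""
    with open(report_path) as f:
        return json.load(f)

def is_false_positive(warning: dict, method_source: str) -> bool:
    """Ask the LLM whether a reported Null Dereference / Resource Leak warning is spurious."""
    prompt = (
        f"A static analyzer reported a {warning['bug_type']} at line {warning['line']}:\n"
        f"{warning['qualifier']}\n\n"
        "Here is the enclosing method:\n" + method_source + "\n\n"
        "Is this warning a false positive? Answer yes or no, then explain briefly."
    )
    return call_llm(prompt).strip().lower().startswith("yes")

def filter_warnings(warnings, source_lookup):
    """Keep only warnings the LLM does not judge to be false positives.

    `source_lookup(file, line)` is a hypothetical helper returning the enclosing
    method's source code.
    """
    return [w for w in warnings
            if not is_false_positive(w, source_lookup(w["file"], w["line"]))]
```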

The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning

PrimeVul--Vulnerability Detection with Code Language Models: How Far Are We?

We introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings.

This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
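A minimal sketch of the chronological-splitting idea mentioned above (the `commit_date` field name is an assumption): ordering samples by commit date before splitting keeps post-cutoff code out of training, which is one way to limit temporal leakage.

```python
from datetime import datetime

def chronological_split(samples, train_frac=0.8):
    """Split vulnerability samples by commit date instead of randomly.

    Each sample is assumed to carry a 'commit_date' ISO string; everything before
    the cutoff goes to training and everything after to evaluation, so the model
    is never evaluated on code that predates its training data.
    """
    ordered = sorted(samples, key=lambda s: datetime.fromisoformat(s["commit_date"]))
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Example (dates are illustrative):
# train, test = chronological_split([
#     {"commit_date": "2019-05-01", "func": "...", "label": 1},
#     {"commit_date": "2022-11-20", "func": "...", "label": 0},
# ])
```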

AIBugHunter: A Practical tool for predicting, classifying and repairing software vulnerabilities

We propose in this article AIBugHunter, a novel machine-learning-based software vulnerability analysis tool for C/C++ that is integrated into the Visual Studio Code (VS Code) IDE. AIBugHunter helps software developers achieve real-time vulnerability detection, explanation, and repair during programming. In particular, AIBugHunter scans through developers’ source code to

(1) locate vulnerabilities,

(2) identify vulnerability types,

(3) estimate vulnerability severity, and

(4) suggest vulnerability repairs.

We integrate our previous works (i.e., LineVul and VulRepair) to achieve vulnerability localization and repairs. In this article, we propose a novel multi-objective optimization (MOO)-based vulnerability classification approach and a transformer-based estimation approach to help AIBugHunter accurately identify vulnerability types and estimate severity.
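A rough sketch of how the four steps above might be orchestrated behind an IDE extension. Every helper name here is a placeholder for the underlying models (line-level localization, type classification, severity estimation, repair), not the actual LineVul/VulRepair APIs.

```python
from dataclasses import dataclass

@dataclass
class VulnReport:
    line: int
    cwe_type: str
    severity: float          # e.g. a CVSS-like score
    suggested_fix: str

# Placeholders for the underlying models (hypothetical signatures).
def locate_vulnerable_lines(func_src: str) -> list[int]: ...
def classify_vulnerability(func_src: str, line: int) -> str: ...
def estimate_severity(func_src: str, line: int) -> float: ...
def suggest_repair(func_src: str, line: int) -> str: ...

def analyze_function(func_src: str) -> list[VulnReport]:
    """Four-step flow: locate, classify, estimate severity, suggest repair."""
    return [
        VulnReport(
            line=line,
            cwe_type=classify_vulnerability(func_src, line),
            severity=estimate_severity(func_src, line),
            suggested_fix=suggest_repair(func_src, line),
        )
        for line in locate_vulnerable_lines(func_src)
    ]
```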

Our empirical experiments on a large dataset consisting of 188K+ C/C++ functions confirm that our proposed approaches are more accurate than other state-of-the-art baseline methods for vulnerability classification and estimation. Furthermore, we conduct qualitative evaluations, including a survey study and a user study, to obtain software practitioners’ perceptions of our AIBugHunter tool and assess the impact that AIBugHunter may have on developers’ productivity in security aspects. Our survey study shows that AIBugHunter is perceived as useful, with 90% of the participants considering adopting AIBugHunter during their software development.

LLbezpeky: Leveraging Large Language Models for Vulnerability Detection

We dive into the efficacy of LLMs for detecting vulnerabilities in the context of Android security. We focus on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities. Our experiments show that LLMs exceed our expectations in finding issues within applications, correctly flagging insecure apps in 91.67% of cases on the Ghera benchmark. We use inferences from our experiments to build a robust and actionable vulnerability detection system and demonstrate its effectiveness. Our experiments also shed light on how various simple configurations can affect the True Positive (TP) and False Positive (FP) rates.

LLMDFA: Analyzing Dataflow in Code with Large Language Models

This paper presents LLMDFA, an LLM-powered compilation-free and customizable dataflow analysis framework. To address hallucinations for reliable results, we decompose the problem into several subtasks and introduce a series of novel strategies. Specifically, we leverage LLMs to synthesize code that outsources delicate reasoning to external expert tools, such as using a parsing library to extract program values of interest and invoking an automated theorem prover to validate path feasibility. Additionally, we adopt few-shot chain-of-thought prompting to summarize dataflow facts in individual functions, aligning the LLMs with the program semantics of small code snippets to mitigate hallucinations.
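A minimal illustration of the "outsource to an expert tool" idea: the branch conditions collected along a candidate source-to-sink path are handed to the Z3 theorem prover, and infeasible paths are discarded. The LLM prompting and code-synthesis parts are omitted; the constraints here are hard-coded for illustration and require `pip install z3-solver`.

```python
from z3 import Int, Solver, sat

def path_feasible(constraints) -> bool:
    """Return True iff the conjunction of path constraints is satisfiable.

    In an LLMDFA-style pipeline the constraints would come from LLM-synthesized
    extraction code; here they are supplied directly for illustration.
    """
    s = Solver()
    s.add(*constraints)
    return s.check() == sat

x = Int("x")
# A candidate dataflow path guarded by `x > 10` and later `x < 5` is infeasible,
# so the corresponding source-to-sink report can be discarded.
print(path_feasible([x > 10, x < 5]))   # False
print(path_feasible([x > 10, x < 50]))  # True
```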

We evaluate LLMDFA on synthetic programs to detect three representative types of bugs and on real-world Android applications for customized bug detection. On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35. We have open-sourced LLMDFA at https://github.com/chengpeng-wang/LLMDFA.

Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach

We present LLift, a pioneering framework that synergizes static analysis and LLMs, with a spotlight on identifying use-before-initialization (UBI) bugs within the Linux kernel. Drawing from our insights into variable usage conventions in Linux, we enhance path analysis using post-constraint guidance. This approach, combined with our methodically crafted procedures, empowers LLift to adeptly handle the challenges of bug-specific modeling, extensive codebases, and the unpredictable nature of LLMs.
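A very loose sketch of what post-constraint guidance could look like in a prompt: the static analysis supplies a fact that must hold at the use site (for example, the initializer's return value), so the LLM only reasons about the feasible path. The prompt wording and `call_llm` helper are assumptions, not LLift's actual design.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def check_ubi_with_post_constraint(initializer_src: str, caller_src: str,
                                   post_constraint: str) -> str:
    """Ask whether a variable may still be uninitialized on the feasible path.

    `post_constraint` encodes what static analysis already knows at the use site
    (e.g. "the initializer returned 0"), narrowing the LLM's reasoning.
    """
    prompt = (
        "Initializer function:\n" + initializer_src + "\n\n"
        "Caller that later uses the possibly-initialized variable:\n" + caller_src + "\n\n"
        f"Assume this holds at the use site: {post_constraint}.\n"
        "Under that assumption, can the variable still be uninitialized when used? "
        "Answer 'may be uninitialized' or 'must be initialized', with a short reason."
    )
    return call_llm(prompt)
```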

Our real-world evaluations identified four previously undiscovered UBI bugs in the mainstream Linux kernel, which the Linux community has acknowledged. This study reaffirms the potential of marrying static analysis with LLMs, setting a compelling direction for future research in this area.

Interleaving Static Analysis and LLM Prompting

This paper presents a new approach for using Large Language Models (LLMs) to improve static program analysis. Specifically, during program analysis, we interleave calls to the static analyzer and queries to the LLM: the prompt used to query the LLM is constructed using intermediate results from the static analysis, and the result from the LLM query is used for subsequent analysis of the program. We apply this novel approach to the problem of error-specification inference of functions in systems code written in C; i.e., inferring the set of values returned by each function upon error, which can aid in program understanding as well as in finding error-handling bugs.

  • Building Prompts

We construct a prompt that consists of the Common Context, the Function Context, and the Question (see the sketch below).

  • Error Specification Inference algorithm

We evaluate our approach on real-world C programs, such as MbedTLS and zlib, by incorporating LLMs into EESI, a state-of-the-art static analysis for error-specification inference. Compared to EESI, our approach achieves higher recall across all benchmarks (from average of 52.55% to 77.83%) and higher F1-score (from average of 0.612 to 0.804) while maintaining precision (from average of 86.67% to 85.12%).
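A sketch of assembling the three prompt parts named above for error-specification inference; the wording and helper names are illustrative, not the paper's exact prompts. The Function Context carries intermediate static-analysis results (already-inferred callee specifications), which is what makes the interleaving useful.

```python
COMMON_CONTEXT = (
    "You are analyzing C systems code. An error specification of a function is "
    "the set of values it returns to indicate failure (e.g. NULL, negative ints)."
)

def build_function_context(fn_name: str, fn_src: str, callee_specs: dict) -> str:
    """Summarize the function plus already-inferred specs of its callees
    (intermediate results from the static analysis)."""
    callee_lines = "\n".join(f"- {c}: returns {spec} on error"
                             for c, spec in callee_specs.items())
    return (f"Function {fn_name}:\n{fn_src}\n\n"
            f"Known callee error specifications:\n{callee_lines}")

def build_question(fn_name: str) -> str:
    return (f"Which values does {fn_name} return on error? "
            "Answer with a set such as {{NULL}}, {{<0}}, or {{0}}.")

def build_prompt(fn_name, fn_src, callee_specs):
    return "\n\n".join([COMMON_CONTEXT,
                        build_function_context(fn_name, fn_src, callee_specs),
                        build_question(fn_name)])
```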

LLM-Enhanced Static Analysis for Precise Identification of Vulnerable OSS Versions

This paper presents Vercation, an approach designed to identify vulnerable versions of OSS written in C/C++. Vercation combines program slicing with a Large Language Model (LLM) to identify vulnerability-relevant code from vulnerability patches. It then backtraces historical commits to gather previous modifications of the identified vulnerability-relevant code. We propose semantic-level code clone detection to compare the differences between pre-modification and post-modification code, thereby locating the vulnerability-introducing commit (vic) and enabling identification of the vulnerable versions between the patch commit and the vic.

We curate a dataset linking 74 OSS vulnerabilities and 1013 versions to evaluate Vercation. On this dataset, our approach achieves an F1 score of 92.4%, outperforming current state-of-the-art methods. More importantly, Vercation detected 134 incorrect vulnerable OSS versions in NVD reports.

Vulnhuntr: Zero-Shot Vulnerability Discovery Using LLMs

https://github.com/protectai/vulnhuntr

Vulnhuntr leverages the power of LLMs to automatically create and analyze entire code call chains starting from remote user input and ending at server output for detection of complex, multi-step, security-bypassing vulnerabilities that go far beyond what traditional static code analysis tools are capable of performing. This tool is designed to analyze a GitHub repository for potential remotely exploitable vulnerabilities.

  • Context windows

Context-window constraints make it impossible to feed entire projects, or even just multiple files, into one LLM request. Vulnhuntr therefore breaks the code into small, manageable chunks and smartly requests only the relevant portions of the code. It automatically searches the project for files that are likely to be the first to handle user input, ingests each such file in full, and responds with all the potential vulnerabilities. Using this list of potential vulnerabilities, it then completes the entire function call chain from user input to server output for each potential vulnerability, traversing the project one function or class at a time until it is satisfied it has the entire call chain for final analysis.

  • LLM-powered Call Chain Search

Vulnhuntr analyzes and reanalyzes in a loop as it requests functions, classes, or other related snippets of code it needs to confirm or deny a vulnerability. This goes on until Vulnhuntr has seen enough code to map out the complete path from user input to server output.

Once the full picture is clear, it returns a detailed final analysis, pointing out trouble spots, providing a proof-of-concept exploit, and attaching a confidence rating for each vulnerability. The beauty of this approach is that even when Vulnhuntr is not 100% certain there’s a vulnerability, it explicitly tells the user exactly why it’s concerned about certain areas of the code. This analysis has shown extremely accurate results in narrowing down entire projects’ worth of code to just a few simple functions that bug hunters should be focusing on when looking for vulnerabilities.

  • Advanced Prompt Engineering

Vulnhuntr applies prompt-engineering best practices to guide the LLM efficiently: XML-based prompts to keep responses structured, chain-of-thought prompting to walk the model through complex reasoning, and prefilled responses for standard output formats. A minimal sketch of the call-chain loop and this prompt style is shown below.
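The sketch compresses the iterative call-chain search into a single loop with an XML-ish structured prompt. The tags, helper names, and round limit are assumptions for illustration; see the repository above for the actual prompts and schema.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def fetch_snippet(project_root: str, symbol: str) -> str:
    """Hypothetical lookup of a function/class body by name within the project."""
    raise NotImplementedError

def analyze_call_chain(project_root: str, entry_file_src: str, max_rounds: int = 10) -> str:
    """Iteratively ask the LLM for more context until the input-to-output chain is complete."""
    context = entry_file_src
    for _ in range(max_rounds):
        prompt = (
            "<instructions>Trace user input to server output and report vulnerabilities. "
            "If you need another function or class, reply only with "
            "<request_symbol>NAME</request_symbol>; otherwise give your "
            "<final_analysis>...</final_analysis>.</instructions>\n"
            f"<code>{context}</code>"
        )
        reply = call_llm(prompt)
        if "<final_analysis>" in reply:
            return reply
        if "<request_symbol>" not in reply:
            return reply  # Unexpected format; surface the raw reply.
        # Extract the requested symbol and append its source to the context.
        symbol = reply.split("<request_symbol>")[1].split("</request_symbol>")[0]
        context += "\n\n" + fetch_snippet(project_root, symbol)
    return "Call chain not completed within the round limit."
```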

The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs

We introduce BugLens, a post-refinement framework that significantly improves static analysis precision. BugLens guides an LLM to follow traditional analysis steps by assessing buggy code patterns for security impact and validating the constraints associated with static warnings.

Evaluated on real-world Linux kernel bugs, BugLens raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives and revealing four previously unreported vulnerabilities. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools.

LLM-Driven Multi-step Translation from C to Rust using Static Analysis

We propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool that uses a two-step translation methodology: an "unidiomatic" step to translate C into Rust while preserving semantics, and an "idiomatic" step to refine the code to follow Rust's semantic standards. SACTOR utilizes information provided by static analysis of the source C program to address challenges such as pointer semantics and dependency resolution. To validate the correctness of the translated result from each step, we use end-to-end testing via the foreign function interface to embed our translated code segment into the original code. We evaluate the translation of 200 programs from two datasets and two case studies, comparing the performance of GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3 70B and DeepSeek-R1 in SACTOR. Our results demonstrate that SACTOR achieves high correctness and improved idiomaticity, with the best-performing model (DeepSeek-R1) reaching 93% correctness on one dataset and the top models (GPT-4o, Claude 3.5 Sonnet, DeepSeek-R1) reaching 84% on the other, while producing more natural and Rust-compliant translations than existing methods.

KNighter: Transforming Static Analysis with LLM-Synthesized Checkers

We present KNighter, the first approach that unlocks practical LLM-based static analysis by automatically synthesizing static analyzers from historical bug patterns. Our key insight is to leverage LLMs to generate specialized static analyzers guided by historical patch knowledge. KNighter implements this vision through a multi-stage synthesis pipeline that validates checker correctness against the original patches and employs an automated refinement process to iteratively reduce false positives. Our evaluation on the Linux kernel demonstrates that KNighter generates high-precision checkers capable of detecting diverse bug patterns overlooked by existing human-written analyzers. To date, KNighter-synthesized checkers have discovered 70 new bugs/vulnerabilities in the Linux kernel, with 56 confirmed and 41 already fixed; 11 of these findings have been assigned CVE numbers.
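A hedged sketch of the synthesize-validate-refine loop described above: a checker is accepted only if it flags the pre-patch code, stays quiet on the patched code, and survives a false-positive triage pass; otherwise feedback is fed back into the next synthesis attempt. All helper names are hypothetical, not KNighter's actual pipeline.

```python
def synthesize_checker(patch_diff: str, feedback: str = "") -> str:
    """Hypothetical: ask an LLM to emit static-analyzer checker code from a bug-fix patch."""
    raise NotImplementedError

def run_checker(checker_code: str, source_tree: str) -> list[str]:
    """Hypothetical: build and run the synthesized checker, returning reported findings."""
    raise NotImplementedError

def confirmed_false_positives(findings: list[str]) -> list[str]:
    """Hypothetical triage of findings (e.g. by an LLM or heuristics)."""
    raise NotImplementedError

def synthesize_and_refine(patch_diff, buggy_tree, fixed_tree, target_tree, max_iters=3):
    """Accept a checker only if it flags the pre-patch code but not the patched code,
    then iteratively refine it against false positives found on the target tree."""
    feedback = ""
    for _ in range(max_iters):
        checker = synthesize_checker(patch_diff, feedback)
        if not run_checker(checker, buggy_tree):
            feedback = "The checker missed the original bug; make it more sensitive."
            continue
        if run_checker(checker, fixed_tree):
            feedback = "The checker still fires on the patched code; tighten its pattern."
            continue
        fps = confirmed_false_positives(run_checker(checker, target_tree))
        if not fps:
            return checker
        feedback = "Avoid these false positives:\n" + "\n".join(fps)
    return None
```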

Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection?

We investigate various prompting strategies for vulnerability detection and, as part of this exploration, propose a prompting strategy that integrates natural language descriptions of vulnerabilities with a contrastive chain-of-thought reasoning approach, augmented using contrastive samples from a synthetic dataset. Our study highlights the potential of LLMs to detect vulnerabilities by integrating natural language descriptions, contrastive reasoning, and synthetic examples into a comprehensive prompting framework. Our results show that this approach can enhance LLM understanding of vulnerabilities. On a high-quality vulnerability detection dataset such as SVEN, our prompting strategies can improve accuracies, F1-scores, and pairwise accuracies by 23%, 11%, and 14%, respectively.
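A sketch of what such a prompt could look like: a natural-language vulnerability description, a contrastive pair (vulnerable vs. patched synthetic example), and chain-of-thought instructions, assembled before the target code. The wording is illustrative, not the paper's exact template.

```python
def build_contrastive_cot_prompt(cwe_description: str,
                                 vulnerable_example: str,
                                 patched_example: str,
                                 target_code: str) -> str:
    """Combine a vulnerability description, a contrastive example pair, and
    step-by-step reasoning instructions into a single detection prompt."""
    return (
        f"Vulnerability description:\n{cwe_description}\n\n"
        "Contrastive examples:\n"
        f"[VULNERABLE]\n{vulnerable_example}\n"
        f"[PATCHED]\n{patched_example}\n\n"
        "Explain, step by step, what makes the first example vulnerable and the "
        "second safe. Then apply the same reasoning to the target code and answer "
        "VULNERABLE or SAFE with a one-sentence justification.\n\n"
        f"Target code:\n{target_code}\n"
    )
```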

Feasibility Study for Supporting Static Malware Analysis Using LLM

Effectiveness of ChatGPT for Static Analysis: How Far Are We?

In static bug detection, ChatGPT achieves accuracy and precision values of up to 68.37% and 63.76% for detecting Null Dereference bugs and 76.95% and 82.73% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, ChatGPT can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs, surpassing existing state-of-the-art false-positive warning removal tools.

Assisting Static Analysis with Large Language Models: A ChatGPT Experiment

In this paper, we investigate where and how LLMs can assist static analysis by asking appropriate questions. In particular, we target a specific bug-finding tool, which produces many false positives from the static analysis.

In our evaluation, we find that these false positives can be effectively pruned by asking carefully constructed questions about function-level behaviors or function summaries. Specifically, in a pilot study of 20 false positives, GPT-3.5 successfully pruned 8 out of 20, whereas GPT-4 had a near-perfect result of 16 out of 20; the four failures involve cases not currently considered/supported by our questions, e.g., concurrency. Additionally, it identified one false negative case (a missed bug).

Better Debugging: Combining Static Analysis and LLMs for Explainable Crashing Fault Localization

We propose an explainable crashing fault localization approach that combines static analysis and LLM techniques. Our primary insight is that understanding the semantics of exception-throwing statements in the framework code can help find and apprehend the buggy methods in the app code. Based on this idea:

  • First, we design the exception-thrown summary (ETS), which describes the key elements related to each framework-specific exception, and extract ETSs by performing static analysis.
  • Then, we track the data flow of its key elements to identify and rank buggy candidates for the given crash.
  • After that, we introduce LLMs to improve the explainability of the localization results.
  • To construct effective LLM prompts, we design the candidate information summary (CIS), which describes multiple types of explanation-related contexts, and extract CISs via static analysis.

We apply our approach to one typical scenario, i.e., locating Android framework-specific crashing faults, and implement a tool, CrashTracker. For fault localization, it achieved an overall MRR of 0.91. For fault explanation, compared to the naive explanation produced by static analysis alone, the LLM-powered explanation achieved a 67.04% improvement in users' satisfaction score.

GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning

We propose GRACE, a novel vulnerability detection approach that empowers LLM-based software vulnerability detection by incorporating graph structural information from the code and in-context learning. We also design an effective demonstration retrieval approach that identifies highly relevant code examples by considering semantic, lexical, and syntactic similarities with the target code, providing better demonstrations for in-context learning.
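A sketch of the retrieval idea: candidate demonstrations are ranked by a weighted mix of semantic, lexical, and syntactic similarity. The concrete measures below (an embedding placeholder, token Jaccard, a structural placeholder) and the weights are stand-ins, not GRACE's exact formulation.

```python
import re

def lexical_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity as a simple lexical measure."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def semantic_similarity(a: str, b: str) -> float:
    """Placeholder for an embedding-based cosine similarity."""
    raise NotImplementedError

def syntactic_similarity(a: str, b: str) -> float:
    """Placeholder for a structural measure (e.g. over ASTs or code graphs)."""
    raise NotImplementedError

def retrieve_demonstrations(target: str, pool: list[dict], k: int = 3,
                            weights=(0.5, 0.25, 0.25)) -> list[dict]:
    """Rank labeled examples by a weighted combination of the three similarities
    and return the top-k as in-context demonstrations."""
    ws, wl, wy = weights
    scored = sorted(
        pool,
        key=lambda ex: (ws * semantic_similarity(target, ex["code"])
                        + wl * lexical_similarity(target, ex["code"])
                        + wy * syntactic_similarity(target, ex["code"])),
        reverse=True,
    )
    return scored[:k]
```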

To evaluate the effectiveness of GRACE, we conducted an empirical study on three vulnerability detection datasets (i.e., Devign, Reveal, and Big-Vul). The results demonstrate that GRACE outperforms six state-of-the-art vulnerability detection baselines by at least 28.65% in terms of the F1 score across these three datasets.

Vulnerability Detection and Monitoring Using LLM

The current study used the capabilities of the GPT-3.5-Turbo model to conduct a detailed assessment of various code snippets to find any vulnerabilities. The main objective of the experiment was to introduce continuous monitoring technologies to enhance software security and release control. To obtain reliable results, we used a classification report and a confusion matrix. Among these validation methods, we chose accuracy as the key metric, because in this experiment the model must predict the vulnerabilities present in the 2,740 test cases and should focus primarily on true positives (TP). The ideal goal of this experiment was to predict any kind of vulnerability from real-world data.

Out of all test cases, we were able to have an accuracy of 0.77. This demonstrates the approach's potential efficacy in discovering vulnerabilities. Nonetheless, the study found certain parts that require improvement, emphasizing the importance of continual refinement in the model's methodology to ensure more thorough security assessments. This study lays the groundwork for future research into the use of powerful machine learning models in the assessment of software vulnerabilities.

LProtector: An LLM-driven Vulnerability Detection System

This paper presents LProtector, an automated vulnerability detection system for C/C++ codebases driven by the large language model (LLM) GPT-4o and Retrieval-Augmented Generation (RAG). LProtector leverages GPT-4o's powerful code comprehension and generation capabilities to perform binary classification and identify vulnerabilities within target codebases. We conducted experiments on the Big-Vul dataset, showing that LProtector outperforms two state-of-the-art baselines in terms of F1 score, demonstrating the potential of integrating LLMs with vulnerability detection.

Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

This paper thoroughly analyses LLMs' capabilities in detecting vulnerabilities within source code by testing models beyond their usual applications to study their potential in cybersecurity tasks. We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs, three of which were further fine-tuned on a dataset that we compiled. Our dataset, alongside five state-of-the-art benchmark datasets, was used to create a pipeline for a binary classification task, namely classifying code into vulnerable and non-vulnerable. The findings highlight significant variations in classification accuracy across benchmarks, revealing the critical influence of fine-tuning in enhancing the detection capabilities of small LLMs over their larger counterparts, yet only in the specific scenarios in which they were trained. Further experiments and analysis also underscore the issues with current benchmark datasets, particularly around mislabeling and their impact on model training and performance, which raises concerns about the current state of practice. We also discuss the road ahead in the field, suggesting strategies for improved model training and dataset curation.

Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG

In this work, we propose Vul-RAG, a novel LLM-based vulnerability detection technique that leverages a knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerabilities in the given code in three phases.

  • First, Vul-RAG constructs a vulnerability knowledge base by extracting multi-dimension knowledge via LLMs from existing CVE instances;
  • second, for a given code snippet, Vul-RAG retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics;
  • third, Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning about the vulnerability causes and fixing solutions from the retrieved vulnerability knowledge.

Our evaluation of Vul-RAG on our constructed benchmark PairVul shows that Vul-RAG substantially outperforms all baselines by 12.96%/110% relative improvement in accuracy/pairwise-accuracy. In addition, our user study shows that the vulnerability knowledge generated by Vul-RAG can serve as high-quality explanations which can improve the manual detection accuracy from 0.60 to 0.77.
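A minimal sketch of the three-phase flow described above: distill knowledge from CVE instances, retrieve entries by functional semantics, then let the LLM reason about causes and fixes against the target code. Helper names, knowledge dimensions, and prompt wording are assumptions, not Vul-RAG's actual interface.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def extract_knowledge(cve_description: str, patch_diff: str) -> dict:
    """Phase 1: distill multi-dimension knowledge from one CVE instance."""
    # The exact dimensions are illustrative (functional semantics, root cause, fix).
    return {
        "functional_semantics": call_llm("Summarize what this code does:\n" + patch_diff),
        "vulnerability_cause": call_llm("Why was this code vulnerable?\n" + cve_description),
        "fixing_solution": call_llm("How does the patch fix it?\n" + patch_diff),
    }

def retrieve(knowledge_base: list[dict], code: str, k: int = 3) -> list[dict]:
    """Phase 2 (placeholder): retrieve entries whose functional semantics match the code."""
    raise NotImplementedError

def detect(code: str, knowledge_base: list[dict]) -> str:
    """Phase 3: reason about the code against the retrieved knowledge."""
    entries = retrieve(knowledge_base, code)
    context = "\n\n".join(
        f"Cause: {e['vulnerability_cause']}\nFix: {e['fixing_solution']}" for e in entries
    )
    return call_llm(
        "Given this prior vulnerability knowledge:\n" + context +
        "\n\nDoes the following code exhibit the same cause without the fix? "
        "Answer vulnerable or benign, with reasoning.\n\n" + code
    )
```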

Datasets

A Vulnerability Code Intent Summary Dataset

The dataset contains X code examples covering Y categories of vulnerability. Our data come from Z open-source projects and CVE entries, and compared to existing work, our dataset contains not only the original code but also a code function summary and a security intent summary, providing context for research in code security analysis. All information is in CSV format. The contributions of this paper are four-fold: the establishment of a high-quality, multi-perspective Code Intent Summary Dataset; an innovative method for data collection and processing; a new multi-perspective code analysis framework that promotes cross-disciplinary research in software engineering and cybersecurity; and improved practicality and scalability of the research outcomes by considering the code-length limitations of real-world applications. Our dataset and related tools have been publicly released on GitHub.

CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection

We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs. We propose the CASTLE Score, a novel evaluation metric to ensure fair comparison. Our results reveal key differences: ESBMC (a formal verification tool) minimizes false positives but struggles with vulnerabilities beyond model checking, such as weak cryptography or SQL injection. Static analyzers suffer from high false positives, increasing manual validation efforts for developers. LLMs perform exceptionally well in the CASTLE dataset when identifying vulnerabilities in small code snippets. However, their accuracy declines, and hallucinations increase, as the code size grows. These results suggest that LLMs could play a pivotal role in future security solutions, particularly within code completion frameworks, where they can provide real-time guidance to prevent vulnerabilities. The dataset is publicly accessible as the CASTLE Benchmark.

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful to improve the generalization ability to unseen projects. We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.