Benchmarking and Improving Rust Code Generation for Windows API Tasks Using Verified Feedback

Faculty Mentor

Sanmeet Kaur

Presentation Type

Poster

Start Date

4-14-2026 11:30 AM

End Date

4-14-2026 1:30 PM

Location

PUB NCR

Primary Discipline of Presentation

Computer Science

Abstract

Large Language Models (LLMs) have made significant advances in code generation but still struggle to generate Rust code that compiles and passes tests for low-resource systems tasks, such as Windows API programming. This project evaluates whether unit-test-guided supervision can improve the Rust code generation of an open-weight LLM, with a focus on low-resource tasks. Rust is particularly well suited to execution-based evaluation because the compiler, borrow checker, and linters all provide strong signals about type safety, lifetime validity, and idiomatic correctness before runtime. We construct a benchmark of Rust problems with automated tests covering single-operation Windows API tasks, multi-API tasks that require correct control flow and memory management, and general Rust problem solving drawn from established execution-based evaluation benchmarks to assess generalization. Baseline performance will be measured using compile success, lint thresholds, and test pass rates (e.g., pass@k). An agent with documentation access and iterative compile/test feedback generates a verified subset of correct solutions, which is used to fine-tune the base model. We then re-evaluate on a held-out split using the same metrics to determine whether verified supervision improves correctness on low-resource Rust tasks without reducing performance on general Rust problems.
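The pass@k metric mentioned above is conventionally computed with the unbiased estimator from the HumanEval line of work: given n generated samples per problem, of which c pass all tests, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch in Rust (the abstract does not specify the project's exact implementation; this is the standard numerically stable form):

```rust
// Unbiased pass@k estimator: n samples generated, c of them correct.
// Computes 1 - C(n-c, k) / C(n, k) as a running product to avoid
// overflow in the binomial coefficients.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n - c < k {
        // Every size-k draw must contain at least one correct sample.
        return 1.0;
    }
    let mut all_fail = 1.0_f64;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - k as f64 / i as f64;
    }
    1.0 - all_fail
}

fn main() {
    // Hypothetical numbers: 10 samples per problem, 3 pass the tests.
    println!("pass@1 = {:.3}", pass_at_k(10, 3, 1)); // prints 0.300
    println!("pass@5 = {:.3}", pass_at_k(10, 3, 5)); // prints 0.917
}
```

The per-problem estimates are then averaged over the benchmark to report an overall pass@k.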
