VLA Autonomous Driving

Fine-tuning a 957M-parameter multimodal model for real-time vehicle control - MS Thesis

Solo Researcher 2025 - 2026
PyTorch · DeepSpeed · LoRA/PEFT · Hydra · W&B · PyTorch Lightning

Vision-language-action (VLA) models trained in one simulator don't transfer to another. SimLingo, a state-of-the-art driving policy trained in CARLA's photorealistic environment, achieves only 4.7% route coverage when deployed directly in QLabs - a flat-shaded educational simulator with fundamentally different visual characteristics.

The thesis frames this as a distribution alignment problem: how do you adapt a 957M-parameter model to a new visual domain with limited data, while maintaining real-time control at 4Hz on embedded hardware?

End-to-end pipeline: data collection → fine-tuning → inference → evaluation

The architecture threads a frozen visual encoder into a lightweight language model adapted with LoRA - keeping the pretrained driving knowledge intact while remapping visual representations to the new simulator's domain.

Camera images enter InternViT-300M (frozen visual encoder), pass through a projection layer, then feed into Qwen2-0.5B with LoRA adapters (17.6M trainable parameters). Thirty learnable query tokens extract route waypoints (20) and speed waypoints (10) from the LLM hidden states. The LLM serves as a feature aggregator, not a text generator - autoregressive output is unused.
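The feature-aggregation path can be sketched as follows. This is an illustrative PyTorch module, not the real InternViT/Qwen2 code: the hidden size (896), the head layouts, and the module names are placeholder assumptions; only the 30-query / 20-route / 10-speed split comes from the architecture described above.

```python
import torch
import torch.nn as nn

class WaypointHead(nn.Module):
    """Sketch of the query-token readout: the LLM acts as a feature
    aggregator, and waypoints are regressed from its hidden states."""

    def __init__(self, hidden=896, n_queries=30, n_route=20, n_speed=10):
        super().__init__()
        # 30 learnable query tokens appended to the LLM input sequence
        self.queries = nn.Parameter(torch.randn(n_queries, hidden))
        self.route_head = nn.Linear(hidden, 2)  # (x, y) per route waypoint
        self.speed_head = nn.Linear(hidden, 1)  # one speed value per waypoint
        self.n_route = n_route

    def forward(self, llm_hidden):
        # llm_hidden: (batch, n_queries, hidden) - hidden states at the
        # query-token positions; autoregressive decoding is never invoked.
        route = self.route_head(llm_hidden[:, : self.n_route])  # (B, 20, 2)
        speed = self.speed_head(llm_hidden[:, self.n_route :])  # (B, 10, 1)
        return route, speed

head = WaypointHead()
route, speed = head(torch.randn(1, 30, 896))
```

Because the outputs are read directly from hidden states, inference cost is a single forward pass, with no token-by-token generation loop.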

The visual domain gap: CARLA's photorealistic rendering (left) vs QLabs' flat-shaded graphics (right)
LoRA fine-tuning (rank 32) instead of full fine-tuning

With only 7,498 training frames, full fine-tuning would overfit and destroy the pretrained driving priors embedded in the base model. LoRA restricts updates to low-rank adapter matrices, preserving the model's existing knowledge of scene understanding and motion planning while adapting its visual-linguistic representations to QLabs' flat-shaded environment.

WHY → preserves pretrained knowledge while adapting to new visual domain
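The mechanism can be shown with a minimal hand-rolled LoRA layer (a sketch for intuition; the thesis uses the PEFT library, and the dimensions here are placeholders). The base weight is frozen, and only two rank-32 matrices train:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight plus a rank-32 update."""

    def __init__(self, base: nn.Linear, rank=32, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay intact
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.02)  # B stays zero -> identity at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(896, 896))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

At initialization the adapter contributes exactly zero (B is zero), so training starts from the pretrained model's behavior and only gradually deviates - the property that protects the driving priors.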
Frozen vision encoder

InternViT-300M was pretrained on a massive and diverse visual corpus. Freezing it entirely keeps low-level and mid-level visual grounding stable across training. Only the language reasoning pathway adapts - which is precisely where the sim-to-sim gap manifests as misinterpreted scene semantics rather than raw feature failure.

WHY → keeps visual grounding stable; only language reasoning adapts
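Freezing is mechanically simple but easy to get half-right: gradients must be disabled *and* the module switched to eval mode, or batch-norm statistics and dropout would still drift during fine-tuning. A small sketch (the tiny `nn.Sequential` stands in for InternViT-300M):

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so normalization
    statistics and dropout stay fixed during fine-tuning."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()
    return module

# hypothetical stand-in for the real InternViT-300M encoder
vit = freeze(nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()))
```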
Single-frame velocity estimation at inference

The training data was collected with instantaneous velocity derived from a single frame's odometry reading. Using multi-frame optical flow or temporal differencing at inference would introduce a distribution shift within the simulator itself - the model would see velocity signals it was never trained on. Single-frame estimation matches the training distribution exactly.

WHY → matches training distribution to avoid sim-to-real shift within the sim
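The estimator itself reduces to the magnitude of one odometry velocity vector. A sketch, with an illustrative telemetry layout (the real logged schema may name these fields differently):

```python
import math

def instantaneous_speed(odom: dict) -> float:
    """Speed from a single odometry sample - the same signal the model
    saw during training, with no temporal differencing across frames."""
    return math.hypot(odom["vx"], odom["vy"])

instantaneous_speed({"vx": 3.0, "vy": 4.0})  # 5.0
```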

Key features

Custom Data Collection Pipeline
Teleoperation system running at 30Hz with synchronized logging at 4Hz. Outputs in SimLingo-compatible format: JPEG frames and gzip-compressed JSON telemetry. 40 training runs + 15 validation runs yielded 10,495 total frames (7,498 training).
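The logging path above can be sketched in a few lines: a time-based gate decimates the 30Hz teleop stream to 4Hz, and each sample is written as a JPEG frame plus gzip-compressed JSON. File naming and telemetry field names here are illustrative, not the exact SimLingo schema:

```python
import gzip
import json
from pathlib import Path

def should_log(t_now: float, t_last: float, hz: float = 4.0) -> bool:
    """Gate that decimates the 30 Hz teleop stream to the 4 Hz log rate."""
    return (t_now - t_last) >= 1.0 / hz

def log_sample(out_dir: Path, idx: int, frame_jpeg: bytes, telemetry: dict):
    """Write one synchronized sample: JPEG frame + gzipped JSON telemetry."""
    (out_dir / f"{idx:06d}.jpg").write_bytes(frame_jpeg)
    with gzip.open(out_dir / f"{idx:06d}.json.gz", "wt") as f:
        json.dump(telemetry, f)
```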
Parameter-Efficient Fine-Tuning
LoRA adapters on Qwen2-0.5B trained via PyTorch Lightning and DeepSpeed. The model reaches near-expert performance after just 1 epoch, with marginal gains over the remaining 15. Average Displacement Error drops from 0.114m to 0.085m, matching the human expert baseline of 0.087m.
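The headline metric is straightforward to define. A minimal sketch of Average Displacement Error over 2D waypoints (this is the standard definition; the thesis's exact evaluation code may differ in details):

```python
import math

def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance, in metres,
    between predicted and ground-truth waypoints."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

ade([(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (0.0, 5.0)])  # 5.0
```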
Real-Time Inference Engine
4Hz control loop: camera capture → velocity estimation → route lookahead → model inference → PID-tuned steering/speed conversion → actuation. Designed for embedded hardware constraints while sustaining closed-loop vehicle control.
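The rate-keeping and PID pieces of that loop can be sketched as below. Gains and the loop body are placeholders, not the tuned values from the thesis; the point is the fixed 250ms budget per tick, with inference time subtracted before sleeping:

```python
import time

class PID:
    """Simple PID used to convert waypoint errors into steering/throttle."""

    def __init__(self, kp: float, ki: float = 0.0, kd: float = 0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, None

    def step(self, err: float, dt: float) -> float:
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def run_loop(tick, hz: float = 4.0, n_steps: int = 8):
    """Fixed-rate loop: sleep out whatever remains of each period after
    capture -> inference -> actuation, so the cadence holds at `hz`."""
    period = 1.0 / hz
    for _ in range(n_steps):
        start = time.monotonic()
        tick()
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```

If a tick overruns its budget, the `max(0.0, ...)` clamp skips the sleep rather than blocking, so the loop degrades gracefully instead of stalling actuation.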
Rigorous Evaluation Harness
5 baseline + 5 obstacle scenarios with quantitative metrics: route coverage, lateral deviation, and collision detection. Results compared against a LiDAR-based Adaptive Cruise Control baseline to establish task-appropriate benchmarks.
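Two of those metrics have simple geometric definitions, sketched here (illustrative implementations; the thesis's harness may compute them against a denser route representation):

```python
import math

def route_coverage(progress_m: float, route_len_m: float) -> float:
    """Fraction of the route completed before termination or collision."""
    return min(progress_m / route_len_m, 1.0)

def lateral_deviation(pos, seg_a, seg_b) -> float:
    """Perpendicular distance from the vehicle to the route segment a->b,
    clamped to the segment endpoints."""
    ax, ay = seg_a
    bx, by = seg_b
    px, py = pos
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy  # closest point on the segment
    return math.hypot(px - cx, py - cy)
```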
Obstacle scenario 1 - VLA model navigating around a stationary obstacle in QLabs
Obstacle scenario 2 - alternate obstacle placement in QLabs
Trajectory overlays across evaluation scenarios - model predictions vs ground truth
Pass/fail obstacle avoidance results across all test scenarios

By the numbers

99%+ Route Coverage (baseline)
957M Total Parameters (17.6M trainable)
0.085m Avg Displacement Error (vs 0.087m human)
4Hz Real-time Control Loop
10,495 Training Frames Collected

The biggest insight was that the pretrained model's 4.7% route coverage wasn't a model problem - it was a distribution problem. LoRA fine-tuning with only 7,498 frames was sufficient to bridge the visual gap, which suggests that the pretrained model's driving knowledge is remarkably transferable once you align the input distribution.

The failure cases (curved sections at high speed with late obstacle visibility) point to limitations of single-camera perception, not the VLA architecture itself. If I were to extend this work, I'd explore multi-frame temporal context and evaluate transfer to the physical QCar2 hardware.