QCar2 Autonomous Driving with Natural Language
Vision-Language-Action Model for Intelligent Vehicle Control
Demonstration
Demonstration of QCar2 autonomous driving with natural language command processing. The vehicle navigates the QLabs environment using the SimLingo Vision-Language-Action model, responding to real-time commands while generating natural language commentary explaining its driving decisions.
Project Overview
This project adapts the SimLingo Vision-Language-Action (VLA) model from the CARLA simulator to Quanser's QCar2 platform in QLabs simulation. SimLingo is a state-of-the-art vision-language model for autonomous driving that uses the InternVL2-1B backbone to predict driving waypoints from camera images, target points, and optional natural language instructions.
The system enables autonomous vehicles to understand and execute natural language commands while providing real-time explanations of driving decisions. By integrating computer vision with natural language processing, this research advances the field of intelligent vehicle systems beyond mechanical automation toward truly intelligent robotic behavior.
The complete pipeline processes camera input through JPEG compression and dynamic preprocessing, feeds it to the SimLingo model for waypoint prediction, converts the predictions to control commands using PID controllers and a kinematic bicycle model, and executes them on the QCar2 platform. Throughout, the system generates natural language commentary explaining the vehicle's actions.
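The loop below is a minimal sketch of that pipeline, included only to make the data flow concrete. All component names and method signatures (route, camera, preprocessor, converter, and the qcar wrapper) are hypothetical stand-ins, not the project's actual modules.

```python
# Minimal sketch of the camera-to-control loop described above. Every component
# name and method signature here is a hypothetical stand-in, not the project API.

def drive_loop(route, camera, preprocessor, model, converter, qcar):
    """One pass per control step: perceive, predict, convert, actuate."""
    while not route.finished():
        # 1. Grab a frame and preprocess it (JPEG round-trip, resize, normalize).
        frame = camera.read()
        pixel_values = preprocessor.process(frame)

        # 2. Select a target point ahead of the vehicle, in ego-centric coordinates.
        target_point = route.lookahead_target(qcar.pose())

        # 3. SimLingo inference: predicted waypoints, speeds, optional commentary.
        waypoints, speeds, commentary = model.predict(pixel_values, target_point)

        # 4. Convert the 2-second trajectory into low-level commands
        #    (lateral PID, longitudinal model, kinematic bicycle conversion).
        forward_velocity, turn_angle = converter.to_controls(
            waypoints, speeds, current_speed=qcar.speed()
        )

        # 5. Send the command to the QCar2 in QLabs and surface the commentary.
        qcar.command(forward_velocity, turn_angle)
        if commentary:
            print(commentary)
```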
Methodology
Vision-Language-Action Architecture
The system employs the InternVL2-1B vision-language model with LoRA (Low-Rank Adaptation) adapters loaded from the SimLingo checkpoint (epoch=013.ckpt). The model processes multi-modal inputs to generate driving actions:
- Visual Input: Camera images preprocessed into 448×448 patches with JPEG compression matching the CARLA training data (see the preprocessing sketch after this list)
- Spatial Input: Target waypoints converted to ego-centric coordinates using a lookahead algorithm
- Language Input: Optional natural language instructions via interactive commentary window
- Output: Predicted waypoints, speed profiles, and natural language commentary
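The visual-input step can be sketched as follows. This simplified version assumes a single 448×448 tile, a JPEG quality of 95, and standard ImageNet statistics; the actual SimLingo/InternVL2 dynamic preprocessing can tile a frame into several 448×448 patches.

```python
import io

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# ImageNet statistics used by the InternVL2 vision encoder.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Simplified sketch: a single 448x448 tile. The real pipeline applies a dynamic
# tiling scheme; the resize and JPEG quality below are illustrative choices.
_transform = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])


def preprocess_frame(frame_rgb: np.ndarray, jpeg_quality: int = 95) -> torch.Tensor:
    """JPEG round-trip (to match CARLA training artifacts), then resize + normalize."""
    # Encode/decode through JPEG so compression artifacts resemble the training data.
    buffer = io.BytesIO()
    Image.fromarray(frame_rgb).save(buffer, format="JPEG", quality=jpeg_quality)
    buffer.seek(0)
    image = Image.open(buffer).convert("RGB")

    # Shape: (1, 3, 448, 448), ready for the vision-language model.
    return _transform(image).unsqueeze(0)
```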
Control System Pipeline
The complete autonomous driving pipeline consists of five integrated components:
- Route Manager: Predefined waypoint routes with lookahead algorithm for target point selection
- Camera Processor: JPEG compression, dynamic preprocessing, and ImageNet normalization
- SimLingo Model: Vision-language inference predicting 2-second trajectory and speeds
- Control Converter: Lateral PID controller (Kp=3.25, Ki=1.0, Kd=1.0) and a linear-regression-based longitudinal controller (see the sketch after this list)
- QCar2 Interface: Kinematic bicycle model converting the control outputs to the forward velocity and turn angle commands expected by the QCar2
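The last two stages are sketched below. The PID gains are the ones listed above; the wheelbase value, the interpretation of the PID output as a path curvature, and the direct use of the predicted speed as the forward command are illustrative assumptions rather than the project's exact controller.

```python
import math

# Gains from the pipeline description; everything else here is an illustrative
# assumption rather than the project's exact controller.
KP, KI, KD = 3.25, 1.0, 1.0
WHEELBASE_M = 0.256  # assumed QCar2 wheelbase; check the platform documentation


class LateralPID:
    """PID on the heading error toward a near-term waypoint on the predicted path."""

    def __init__(self, kp: float = KP, ki: float = KI, kd: float = KD):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt if dt > 0.0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def to_qcar2_controls(waypoints, speeds, pid: LateralPID, dt: float):
    """Map ego-centric waypoints (x forward, y left) and speeds to (forward, turn)."""
    # Heading error toward an early waypoint on the 2-second trajectory.
    x, y = waypoints[1]
    heading_error = math.atan2(y, x)

    # PID output interpreted here as a desired path curvature (1/m).
    curvature = pid.step(heading_error, dt)

    # Kinematic bicycle model: turn angle that produces the desired curvature.
    turn_angle = math.atan(WHEELBASE_M * curvature)

    # Longitudinal command: near-term predicted speed (the project maps speeds
    # through a fitted linear regression instead of using them directly).
    forward_velocity = float(speeds[0])

    return forward_velocity, turn_angle
```

The bicycle relation turn_angle = atan(WHEELBASE_M × curvature) is what lets a curvature-level correction map onto the turn angle command the QCar2 interface expects.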
Natural Language Command Processing
The system supports high-level commands (HLC) through an interactive commentary window. Users can issue instructions like "Turn left at the intersection" or "Slow down and prepare to stop." The model switches to INSTRUCTION_FOLLOWING mode, incorporating the command into its prompt alongside visual and spatial inputs to generate appropriate driving behavior while maintaining safety constraints.
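One way the mode switch and prompt assembly could look is sketched below; aside from the INSTRUCTION_FOLLOWING mode named above, the mode names and prompt wording are simplified placeholders, not the actual SimLingo template.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class DrivingMode(Enum):
    """Free driving with commentary versus explicit instruction following."""
    COMMENTARY = auto()
    INSTRUCTION_FOLLOWING = auto()


def build_prompt(target_point: Tuple[float, float],
                 instruction: Optional[str] = None) -> Tuple[DrivingMode, str]:
    """Illustrative prompt assembly; the real SimLingo template differs in detail."""
    tx, ty = target_point
    base = f"Current target point: ({tx:.1f}, {ty:.1f}) m in ego coordinates."

    if instruction:
        # A command typed into the commentary window switches the model into
        # INSTRUCTION_FOLLOWING mode and is appended to the prompt.
        return (DrivingMode.INSTRUCTION_FOLLOWING,
                f"{base} Follow this instruction: {instruction}")

    # Default mode: predict waypoints and explain the driving decision.
    return (DrivingMode.COMMENTARY,
            f"{base} Describe what the vehicle should do next.")
```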
Experimental Results
Obstacle Avoidance Demonstrations
The following images demonstrate the system's ability to navigate obstacle-rich environments while processing natural language commands. Each scenario shows the ego vehicle, an interaction window for real-time command input, and commentary describing the vehicle's actions.
Figure 1: Roundabout obstacle avoidance scenario. The ego vehicle, positioned slightly to the left of the desired green polyline route, approaches a roundabout with a car parked at the entrance. The model's commentary output states: "Go back to your route after avoiding the obstacle. Accelerate to drive with the target speed." This demonstrates the system's ability to detect obstacles, plan safe avoidance maneuvers, and provide real-time natural language explanations of its driving decisions.
Figure 2: Complex multi-obstacle scenario. The ego vehicle, positioned slightly to the right of the polyline, approaches a parked vehicle on the curb while oncoming traffic is present. The model generates two commentary outputs: (1) "Go around the accident in your lane" (a hallucination/false positive indicating that the model requires fine-tuning for the QLabs environment), and (2) "Go around the parked vehicle" (correct detection). This highlights both the system's perception capabilities and areas for improvement through domain-specific fine-tuning.
Figure 3: Trajectory comparison showing the expected route (blue polyline representing defined waypoints) versus the actual path followed by the vehicle (red line). The graph includes experimental statistics demonstrating the system's tracking accuracy and performance metrics across the complete test route.
Key Findings
- Vision-Language Integration: Successfully adapted SimLingo model from CARLA to QCar2 platform while preserving core AI capabilities
- Natural Language Understanding: Demonstrated ability to interpret and execute diverse natural language commands in real-time
- Trajectory Accuracy: Waypoint following closely tracked the planned routes, as shown by the expected-versus-actual path comparison in Figure 3
- Real-time Performance: GPU-accelerated inference on an RTX 5070 enables responsive control at practical driving speeds
- Explainable AI: Generated natural language commentary provides transparency into autonomous decision-making
Technical Contributions
VLA Model Adaptation
Successfully adapted SimLingo vision-language-action model from CARLA to QCar2 platform
Natural Language Control
Implemented high-level command interface for intuitive vehicle control via natural language
Integrated Pipeline
Developed complete autonomous driving pipeline from camera to control with real-time commentary
GPU Acceleration
Optimized inference for CUDA 12.8 enabling real-time performance on RTX 5070
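For reference, the snippet below shows a generic half-precision CUDA setup for the InternVL2-1B backbone, following the usage pattern published on the public InternVL2 model card; loading the SimLingo LoRA checkpoint and the driving-specific heads on top of this is project-specific and not shown, and the generation settings are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Generic sketch of half-precision GPU inference with the InternVL2-1B backbone.
# SimLingo adds LoRA adapters and driving-specific heads on top; loading that
# checkpoint (epoch=013.ckpt) is project-specific and not shown here.
MODEL_ID = "OpenGVLab/InternVL2-1B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = (
    AutoModel.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # half precision keeps the 1B model responsive on an RTX 5070
        trust_remote_code=True,
    )
    .eval()
    .cuda()
)


@torch.inference_mode()
def run_inference(pixel_values: torch.Tensor, prompt: str) -> str:
    """Single forward pass; pixel_values comes from the preprocessing sketch above."""
    pixel_values = pixel_values.to(device="cuda", dtype=torch.bfloat16)
    return model.chat(
        tokenizer,
        pixel_values,
        prompt,
        generation_config={"max_new_tokens": 128, "do_sample": False},
    )
```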
Repository
QCar2 SimLingo Integration
GitHub Repository - 2025
Autonomous Driving, Vision-Language Models, Natural Language Processing