SafeVL: Driving Safety Evaluation via Meticulous Reasoning in Vision Language Models

Anonymous Author(s)

Abstract


Safety remains a fundamental challenge in autonomous driving, and a key step toward it is a safety evaluator that can reliably identify unsafe (i.e., collision-prone) scenarios. Existing methods, however, either rely heavily on object trajectories or use only language-based reasoning, neglecting crucial visual cues and generalizing poorly to rare unsafe events. Vision–Language Models (VLMs) have recently shown strong generalization across various autonomous driving tasks, yet their application to safety evaluation remains limited by scarce unsafe driving data and insufficient instance-level visual grounding. In this work, we present SafeVL, a VLM-based safety evaluator for autonomous driving that takes video as input, produces structured chain-of-thought reasoning traces, and ultimately outputs a Safe/Unsafe decision. Our framework consists of two key components: (1) a Road-Graph Counterfactual Data Generation Engine, which synthesizes diverse counterfactual unsafe scenarios, and (2) an Object-centric Visual Reasoning Framework, which combines these counterfactual unsafe scenarios with existing safe driving datasets to train for safety prediction. We conduct comprehensive experiments on the Nexar real-world collision dataset and show that SafeVL achieves 76% accuracy in the zero-shot setting, a 20% improvement over existing models. Finally, we integrate SafeVL into an end-to-end driving policy (UniAD) as a planning-trajectory filter, reducing the closed-loop collision rate by 8% on the NeuroNCAP benchmark and demonstrating its practical benefit for safer autonomous driving.

Our Approach


SafeVL Reasoning Pipeline

SafeVL performs structured, interpretable reasoning over real and counterfactual driving scenarios to evaluate safety in autonomous driving. The framework first employs a Road-Graph Counterfactual Data Generation Engine that perturbs agent actions such as acceleration and lane changes to synthesize diverse unsafe outcomes, broadening coverage of rare yet critical collision-prone scenarios. An Object-centric Visual Reasoning Framework then processes each driving video through four stages: (S1) scene understanding, (S2) key object detection, (S3) behavior prediction for the ego vehicle and surrounding agents, and (S4) safety analysis. By aligning safe trajectories with their counterfactual unsafe variants under the same reasoning pipeline, SafeVL learns to identify and explain high-risk events before collisions occur, producing transparent and reliable Safe/Unsafe decisions. Minimal sketches of both components follow below.
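
To make the counterfactual generation concrete, the following is a minimal sketch of the perturb-and-relabel idea, assuming agents are represented as simple 2D waypoint trajectories. All names (Trajectory, perturb_speed, label_counterfactual) and the fixed collision radius are illustrative assumptions, not the paper's actual engine, which operates on a road graph and supports richer perturbations such as lane changes.

# Minimal sketch: perturb one agent's speed along its path and relabel
# the scenario Safe/Unsafe. Hypothetical code, not the paper's engine.
from dataclasses import dataclass
import numpy as np

@dataclass
class Trajectory:
    xy: np.ndarray   # (T, 2) waypoints in meters, one per timestep
    dt: float = 0.1  # seconds between waypoints

def perturb_speed(traj: Trajectory, scale: float) -> Trajectory:
    """Move the agent scale-times faster (or slower) along the same path."""
    # Cumulative arc length reached at each timestep (assumes a moving agent).
    seg = np.linalg.norm(np.diff(traj.xy, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # Scaled progress, clipped so the agent stops at the end of its path.
    s_new = np.clip(scale * s, 0.0, s[-1])
    xy_new = np.stack(
        [np.interp(s_new, s, traj.xy[:, k]) for k in range(2)], axis=1)
    return Trajectory(xy_new, traj.dt)

def label_counterfactual(ego: Trajectory, agent: Trajectory,
                         scale: float, collision_radius: float = 2.0) -> str:
    """Perturb one agent, then relabel by the minimum ego-agent gap.
    Both trajectories are assumed to have the same number of timesteps."""
    perturbed = perturb_speed(agent, scale)
    min_gap = np.linalg.norm(ego.xy - perturbed.xy, axis=1).min()
    return "Unsafe" if min_gap < collision_radius else "Safe"

Sweeping scale over a range of values turns a single recorded safe scene into many labeled variants; this is the basic mechanism by which rare collision-prone cases can be synthesized from abundant safe driving logs.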
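
The four-stage reasoning loop can likewise be sketched as sequential structured prompting, where each stage's answer is fed into the next prompt. The query_vlm callable and the stage instructions below are placeholders for whatever VLM backend and prompts are used, not SafeVL's exact wording.

# Minimal sketch of the S1-S4 reasoning loop as sequential structured
# prompting. `query_vlm` stands in for any VLM backend; the instructions
# are illustrative, not the paper's actual prompts.
from typing import Callable, Dict, List

STAGES = [
    ("S1", "Describe the driving scene: road layout, weather, traffic density."),
    ("S2", "List the key objects that could interact with the ego vehicle, "
           "with approximate positions."),
    ("S3", "Predict the behavior of the ego vehicle and each key object "
           "over the next few seconds."),
    ("S4", "Based on the predicted behaviors, analyze collision risk and "
           "answer with one word: Safe or Unsafe."),
]

def evaluate_safety(frames: List,
                    query_vlm: Callable[[List, str], str]) -> Dict[str, str]:
    """Run S1-S4 in order, carrying earlier answers forward as context."""
    trace: Dict[str, str] = {}
    context = ""
    for stage_id, instruction in STAGES:
        prompt = f"{context}\n[{stage_id}] {instruction}".strip()
        answer = query_vlm(frames, prompt)
        trace[stage_id] = answer
        context = f"{context}\n[{stage_id}] {answer}"
    # Collapse the free-form S4 answer into the final binary decision.
    trace["decision"] = "Unsafe" if "unsafe" in trace["S4"].lower() else "Safe"
    return trace

Because the full trace is returned alongside the decision, every Safe/Unsafe verdict is accompanied by the intermediate scene description, key-object list, and behavior predictions that justify it.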

BibTeX


@inproceedings{safevl2025,
  title     = {SafeVL: Driving Safety Evaluation via Meticulous Reasoning in Vision Language Models},
  author    = {Anonymous Author(s)},
  booktitle = {Submitted to ICRA 2026},
  year      = {2025}
}