Introducing veri_wm_v1: An Interactive World Model for Driving Simulation
2026-02-12
Today we're releasing veri_wm_v1, our first world model for interactive driving simulation. It generates realistic driving video from a single dashcam image, a text prompt, and real-time WASD camera controls.
Overview
veri_wm_v1 takes three inputs:
- A dashcam image — the starting frame
- A text prompt — describing the scene or driving conditions
- WASD controls — mapped to camera trajectories for interactive steering
From these, it generates coherent driving video at 480p or 720p resolution, up to a minute long. The model runs on a single H100 GPU.
Architecture
veri_wm_v1 is built on the Wan 2.2 video generation backbone. The key architectural components:
Backbone: 40 transformer layers with unified 3D attention over spatial and temporal dimensions. The model uses a patch size of (1, 2, 2) and a VAE stride of (4, 8, 8), with a hidden dimension of 5120 and 40 attention heads.
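To make these numbers concrete, here is a minimal config sketch assembled from the figures above. The field names and the `token_count` helper are illustrative assumptions, not the actual veri_wm_v1 code; it shows how the VAE stride and patch size jointly determine the sequence length the 3D attention operates over.

```python
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    # Values quoted in the post; names are hypothetical.
    num_layers: int = 40              # transformer depth
    hidden_dim: int = 5120            # model width
    num_heads: int = 40               # 3D attention heads
    patch_size: tuple = (1, 2, 2)     # (t, h, w) patchification
    vae_stride: tuple = (4, 8, 8)     # (t, h, w) VAE compression

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads

    def token_count(self, frames: int, height: int, width: int) -> int:
        """Tokens the unified 3D attention sees for a raw video clip."""
        t = frames // (self.vae_stride[0] * self.patch_size[0])
        h = height // (self.vae_stride[1] * self.patch_size[1])
        w = width // (self.vae_stride[2] * self.patch_size[2])
        return t * h * w
```

Each raw pixel dimension is divided first by the VAE stride and then by the patch size, so a 480p clip compresses by a factor of 16 spatially before attention.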
Mixture-of-Experts (MoE): The model uses a dual-expert architecture — a high-noise expert and a low-noise expert — with 14B parameters each (28B total). During diffusion, the boundary between experts is at timestep t=0.947. The high-noise expert handles the initial coarse structure generation, while the low-noise expert refines details. This split allows each expert to specialize in its noise regime, improving generation quality without increasing per-step compute.
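The routing rule above can be sketched in a few lines. This is a simplified stand-in, assuming t runs from 1.0 (pure noise) down to 0.0 (clean video); the real model swaps the 14B expert weights rather than returning a label.

```python
def select_expert(t: float, boundary: float = 0.947) -> str:
    """Route a diffusion timestep to one of the two experts.

    High noise (t near 1.0) is handled by the high-noise expert,
    which lays down coarse structure; the low-noise expert takes
    over below the boundary to refine details. Sketch only.
    """
    return "high_noise" if t >= boundary else "low_noise"
```

Because only one expert is active per denoising step, the 28B total parameters cost the same per-step compute as a single 14B model.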
Camera Control: We use Plucker ray embeddings to represent camera poses. These 6D ray coordinates (origin + direction per pixel) are injected into each transformer block via Adaptive Layer Normalization (AdaLN). This gives the model precise geometric understanding of camera motion — when you press W to go forward, the model receives a physically grounded camera trajectory, not just a text description.
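As a rough illustration of the AdaLN injection, the sketch below normalizes block features and then modulates them with a scale and shift predicted from the camera embedding. Shapes, the single projection matrix `w`, and the `(1 + scale)` form are assumptions modeled on common AdaLN implementations, not the actual veri_wm_v1 code.

```python
import numpy as np

def adaln(x, cam_emb, w):
    """AdaLN sketch: x (tokens, dim) features, cam_emb (cam_dim,)
    camera conditioning, w (cam_dim, 2*dim) projection to scale/shift."""
    # LayerNorm without a learned affine term
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + 1e-6)
    # camera embedding produces per-channel scale and shift
    scale, shift = np.split(cam_emb @ w, 2, axis=-1)
    return x_norm * (1 + scale) + shift
```

With a zero camera embedding this reduces to plain layer normalization, which is why the `(1 + scale)` parameterization is a common default: the conditioning starts as an identity perturbation.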
Three-Stage Training
veri_wm_v1 is trained in three stages:
Stage I: Pre-training (Video Prior)
The base Wan 2.2 model is pre-trained on large-scale video data to learn general video generation. This gives us a strong motion and appearance prior.
Stage II: Middle-training (World Knowledge + Action Control)
This is where the model learns to be a world model. We train on driving datasets with:
- Camera pose conditioning via Plucker embeddings
- MoE activation — both experts are active, with classifier-free guidance (CFG=5.0) over 70 denoising steps
- Bidirectional attention — the model can attend to all frames, producing the highest quality output
The Stage II checkpoint (LingBot-World-Base) is the foundation: it requires both experts and CFG to produce clean output.
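The CFG = 5.0 sampling mentioned above follows the standard classifier-free guidance recipe: each denoising step runs the model with and without the text condition and extrapolates between the two predictions. `model` here is a hypothetical stand-in callable, not the real sampler API.

```python
def cfg_step(model, x_t, t, cond, scale=5.0):
    """One guided denoising step (sketch).

    model(x_t, t, cond) returns a noise prediction; passing None
    as the condition gives the unconditional branch.
    """
    eps_uncond = model(x_t, t, None)   # unconditional prediction
    eps_cond = model(x_t, t, cond)     # text-conditioned prediction
    # extrapolate away from the unconditional prediction
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Note the cost: every one of the 70 denoising steps requires two forward passes, which is part of what Stage III's distillation removes.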
Stage III: Post-training (Causal + Fast)
For interactive use, we need causal generation — the model should generate frames in temporal order without seeing the future. Stage III converts the bidirectional model to causal:
- Initialize from the high-noise expert (the coarser expert generalizes better to the causal setting)
- Block causal attention — each chunk of frames can only attend to previous chunks
- Diffusion forcing — training with mixed noise levels across chunks for robust autoregressive generation
- DMD distillation — reducing the number of denoising steps for faster inference
- KV caching — FIFO eviction cache for streaming generation without recomputing past frames
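The block-causal attention pattern from the list above can be expressed as a simple boolean mask: tokens within a chunk attend to each other freely, and every chunk can also attend to all earlier chunks, but never to future ones. The chunk size here is an arbitrary assumption for illustration.

```python
import numpy as np

def block_causal_mask(num_tokens: int, chunk: int) -> np.ndarray:
    """True where attention is allowed (row = query, column = key)."""
    idx = np.arange(num_tokens) // chunk   # chunk id of each token
    # query may attend to any key in the same or an earlier chunk
    return idx[:, None] >= idx[None, :]
```

This is coarser than strict per-frame causality — frames inside one chunk still see each other bidirectionally — which preserves more of the Stage II model's quality while enabling streaming.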
Camera Control in Detail
The WASD mapping works as follows:
- W/S — translate the camera forward/backward along the viewing direction
- A/D — rotate the camera left/right (yaw)
Each keypress generates a camera trajectory delta, which is converted to a sequence of Plucker embeddings. These embeddings encode the exact 3D camera transformation per frame, giving the model geometric grounding rather than relying on ambiguous text descriptions.
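One way to picture the keypress-to-delta step: each held key produces a small rigid transform applied to the camera pose every frame. The step size and yaw rate below are made-up illustrative values, and the 4x4 homogeneous-transform representation is an assumption, not the model's actual control format.

```python
import numpy as np

def key_to_delta(key: str, step: float = 0.5, yaw: float = 0.05) -> np.ndarray:
    """Per-frame camera pose delta for a held WASD key (sketch)."""
    T = np.eye(4)
    if key == "W":
        T[2, 3] = step        # forward along the viewing (+z) axis
    elif key == "S":
        T[2, 3] = -step       # backward
    elif key in ("A", "D"):
        a = yaw if key == "A" else -yaw
        c, s = np.cos(a), np.sin(a)
        T[0, 0], T[0, 2] = c, s   # yaw rotation about the vertical axis
        T[2, 0], T[2, 2] = -s, c
    return T
```

Composing these deltas over frames yields the camera trajectory that gets converted to Plucker embeddings.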
The Plucker embedding for each pixel is a 6D vector: the camera origin crossed with the ray direction, concatenated with the ray direction. This representation is invariant to camera intrinsics and directly encodes the epipolar geometry between frames.
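The 6D construction described above — camera origin crossed with the ray direction, concatenated with the direction — can be computed per pixel as follows. The focal length, principal point, and image size are assumed placeholder intrinsics for illustration.

```python
import numpy as np

def plucker_map(origin, R, fx=256.0, cx=32.0, cy=32.0, h=64, w=64):
    """Per-pixel Plucker embeddings for one camera pose (sketch).

    origin: (3,) camera center in world coordinates.
    R: (3, 3) camera-to-world rotation.
    Returns an (h, w, 6) array of (origin x direction, direction).
    """
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    # unproject pixel grid to camera-frame ray directions
    dirs_cam = np.stack([(u - cx) / fx, (v - cy) / fx, np.ones_like(u)], -1)
    dirs = dirs_cam @ R.T                                 # rotate to world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit directions
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([moment, dirs], axis=-1)
```

The moment term (origin x direction) is what distinguishes rays that are parallel but pass through different points, so together the six numbers pin down each ray as a line in 3D space.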
What's Next
We're working on several improvements:
- Real-time interactive generation — streaming output fast enough for live WASD control
- Driving-specific LoRA — fine-tuning on curated driving datasets for improved realism
- Playground access — a web-based demo where you can upload a dashcam image and drive through the generated world
The model weights are available on HuggingFace. Everything is open source.
Questions? daniel@veri.studio