WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation

Jing Wang^{1,2,*}, Ao Ma^{2,*,†}, Ke Cao^{2,*}, Jun Zheng^{1}, Zhanjie Zhang^{2}, Jiasong Feng^{2},
Shanyuan Liu^{2}, Yuhang Ma^{2}, Bo Cheng^{2}, Daiwei Leng^{2,‡}, Yuhui Yin^{2}, Xiaodan Liang^{1,3,‡}

^{1} Shenzhen Campus of Sun Yat-Sen University  ^{2} 360 AI Research  ^{3} Peng Cheng Laboratory
^{*} Equal Contribution. ^{†} Project Lead. ^{‡} Corresponding Authors.

Abstract

Recent advances in text-to-video (T2V) generation, exemplified by models such as Sora and Kling, have demonstrated strong potential for constructing world simulators. However, existing T2V models still struggle to understand abstract physical principles and to generate videos that faithfully obey physical laws. This limitation stems primarily from the lack of explicit physical guidance, caused by a significant gap between high-level physical concepts and the generative capabilities of current models. To address this challenge, we propose the World Simulator Assistant (WISA), a novel framework designed to systematically decompose and integrate physical principles into T2V models. Specifically, WISA decomposes physical knowledge into three hierarchical levels: textual physical descriptions, qualitative physical categories, and quantitative physical properties. It then incorporates several carefully designed modules, such as Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, to encode these attributes and enhance the model’s adherence to physical laws during generation. In addition, most existing video datasets feature only weak or implicit representations of physical phenomena, limiting their utility for learning explicit physical principles. To bridge this gap, we present WISA-80K, a new dataset comprising 80,000 human-curated videos that depict 17 fundamental physical laws across three core domains of physics: dynamics, thermodynamics, and optics. Experimental results show that WISA substantially improves the alignment of T2V models (such as CogVideoX and Wan2.1) with real-world physical laws, achieving notable gains on the VideoPhy benchmark. Our data, code, and models will be released as open source.