PlanGen:

Towards Unified Layout Planning and Image Generation

in Auto-Regressive Vision Language Models

360 AI Research

PlanGen jointly models layout planning and image generation, planning the spatial layout before generating the corresponding image, with both processes completed in a single unified model. PlanGen performs multiple layout-related tasks, including a) layout-image joint generation, b) layout-to-image generation, c) image layout understanding, and d) layout-guided image manipulation.

Abstract

In this paper, we propose PlanGen, a unified layout planning and image generation model that pre-plans spatial layout conditions before generating images. Unlike previous diffusion-based approaches that split layout planning and layout-to-image generation across two separate models, PlanGen jointly models the two tasks in a single autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context, without requiring specialized encoding of local captions and bounding box coordinates; this provides significant advantages over the embed-and-pool operations applied to layout conditions in prior work, particularly for complex layouts. Unified prompting allows PlanGen to train on multiple layout-related tasks, including layout planning, layout-to-image generation, and image layout understanding. In addition, thanks to this modeling design, PlanGen extends seamlessly to layout-guided image manipulation via a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks, showing its great potential.

Overview


Upper: PlanGen models layout planning and image generation jointly in an autoregressive vision-language model through a unified prompting design with the next-token prediction training objective. Lower: Illustration of PlanGen's layout-related multitasking: a) layout-image joint generation, b) layout-to-image generation, c) image layout understanding, and d) layout-guided image manipulation.

Layout-Image Joint Generation

PlanGen completes layout planning and layout-to-image generation in a single unified model. Much as a person first decides which object should occupy each region before drawing, this explicit planning process gives the model more powerful image generation capabilities.
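The joint process can be pictured as one left-to-right decode in which layout tokens are produced before image tokens. The `next_token` stand-in and all token names below are purely illustrative, not PlanGen's actual interface:

```python
# Toy sketch of layout-image joint generation as a single autoregressive
# decode: layout tokens are planned first, then image tokens are generated
# conditioned on the prompt and the planned layout. Illustrative only.

def decode(next_token, prompt_tokens, layout_end, n_image_tokens):
    seq = list(prompt_tokens)
    # Phase 1: plan the layout until the end-of-layout marker appears.
    while seq[-1] != layout_end:
        seq.append(next_token(seq))
    # Phase 2: generate image tokens conditioned on prompt + planned layout.
    for _ in range(n_image_tokens):
        seq.append(next_token(seq))
    return seq

# Mock "model" scripted to emit two layout tokens, the end marker,
# then two image tokens.
script = iter(["L1", "L2", "<eol>", "I1", "I2"])
seq = decode(lambda s: next(script), ["prompt"], "<eol>", 2)
# seq == ["prompt", "L1", "L2", "<eol>", "I1", "I2"]
```

Because both phases share one next-token predictor, no hand-off between separate models is needed.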

Layout-to-Image Generation

PlanGen takes layout conditions as context input, offering strong flexibility and scalability. Unlike previous methods, it does not need to compress the layout conditions, so local descriptions can be as detailed and complex as needed, yielding better-aligned layout-to-image generation thanks to the transformer's long-context dependencies.
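As an illustration of layout-as-context, local captions and bounding boxes could be serialized directly into the prompt as plain text. The `<obj>`/`<box>` tag format below is a hypothetical example, not the paper's actual prompt template:

```python
# Sketch: serializing layout conditions as plain-text context for an
# autoregressive VLM. The <obj>/<box> tags are hypothetical -- PlanGen's
# real prompt template may differ.

def serialize_layout(global_caption, regions):
    """Build a unified text prompt from a global caption and a list of
    (local_caption, bbox) pairs, with bbox as (x1, y1, x2, y2) in [0, 1]."""
    parts = [global_caption]
    for local_caption, bbox in regions:
        coords = ",".join(f"{v:.2f}" for v in bbox)
        parts.append(f"<obj>{local_caption}<box>{coords}</box></obj>")
    return " ".join(parts)

prompt = serialize_layout(
    "a cat and a dog in a park",
    [("an orange cat sitting on grass", (0.05, 0.40, 0.45, 0.95)),
     ("a brown dog running", (0.50, 0.35, 0.95, 0.90))],
)
```

Since the layout is just more text in the context window, there is no pooling step to lose detail: each local caption can be arbitrarily long.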

Image Layout Understanding

Understanding the layout of real images helps the model generate images that better conform to layout conditions, intuitively because it deepens the model's understanding of the relationship between layout conditions and the corresponding images. Adding the image layout understanding task moves PlanGen further toward a general-purpose layout VLM.
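For layout understanding, the textual layout the model emits can be parsed back into structured boxes. This sketch assumes a hypothetical `<obj>`/`<box>` tag format for the model's output, chosen for illustration:

```python
import re

# Sketch: parsing a layout description emitted as text back into
# structured (caption, bbox) pairs. The <obj>/<box> tag format is an
# assumption, not necessarily PlanGen's actual output format.

LAYOUT_RE = re.compile(r"<obj>(.*?)<box>([\d.,]+)</box></obj>")

def parse_layout(text):
    regions = []
    for caption, coords in LAYOUT_RE.findall(text):
        bbox = tuple(float(v) for v in coords.split(","))
        regions.append((caption.strip(), bbox))
    return regions

out = parse_layout("<obj>a red car<box>0.10,0.50,0.60,0.90</box></obj>")
# out == [("a red car", (0.1, 0.5, 0.6, 0.9))]
```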

Layout-guided Image Manipulation

Benefiting from the layout-conditioned modeling paradigm, PlanGen can control the content generated in a local area based on the corresponding local caption and bounding box, which allows us to easily extend PlanGen to layout-guided image manipulation without further task-specific training.
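One way to picture the negative layout guidance mentioned in the abstract is a classifier-free-guidance-style combination of token logits conditioned on the desired layout versus an unwanted (negative) layout. The formulation below is an assumption for illustration, not the paper's exact method:

```python
# Sketch of negative layout guidance as a CFG-style logit combination:
# push token logits away from those conditioned on the unwanted layout.
# Assumed formulation for illustration only.

def guided_logits(pos_logits, neg_logits, scale=3.0):
    """logits = neg + scale * (pos - neg); scale > 1 strengthens the
    desired layout condition relative to the negative one."""
    return [n + scale * (p - n) for p, n in zip(pos_logits, neg_logits)]

out = guided_logits([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], scale=2.0)
# out == [3.0, 0.0, -2.0]
```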

BibTeX

@misc{he2025plangen,
      title={PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models}, 
      author={Runze He and Bo Cheng and Yuhang Ma and Qingxiang Jia and Shanyuan Liu and Ao Ma and Xiaoyu Wu and Liebucha Wu and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2503.10127},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10127}, 
}