PlanGen:

Towards Unified Layout Planning and Image Generation

in Auto-Regressive Vision Language Models

360 AI Research

PlanGen jointly models layout planning and image generation, planning the spatial layout before generating the corresponding image, with both processes completed in a single unified model. PlanGen performs multiple layout-related tasks, including a) layout-image joint generation, b) layout-to-image generation, c) image layout understanding, and d) layout-guided image manipulation.

Abstract

In this paper, we propose PlanGen, a unified layout planning and image generation model that pre-plans spatial layout conditions before generating images. Unlike previous diffusion-based approaches that split layout planning and layout-to-image generation across two separate models, PlanGen jointly models the two tasks in a single autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context, without requiring specialized encoding of local captions and bounding box coordinates; this provides significant advantages over the embed-and-pool operations applied to layout conditions in prior work, particularly for complex layouts. Unified prompting allows PlanGen to train on multiple layout-related tasks, including layout planning, layout-to-image generation, and image layout understanding. In addition, thanks to this modeling design, PlanGen extends seamlessly to layout-guided image manipulation via a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks, showing its great potential.

Overview


Upper: PlanGen models layout planning and image generation jointly in an autoregressive vision-language model through a unified prompting design with the next-token prediction training objective. Lower: Illustration of PlanGen's layout-related multitasking: a) layout-image joint generation, b) layout-to-image generation, c) image layout understanding, and d) layout-guided image manipulation.

Layout-Image Joint Generation

PlanGen completes layout planning and layout-to-image generation in a single unified model. Much as a person first decides which object should occupy each region before drawing, this explicit planning process gives the model more powerful image generation capabilities.
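The joint process can be pictured as one left-to-right decode in which layout tokens are produced before image tokens. The `next_token` stand-in and all token names below are purely illustrative, not PlanGen's actual interface:

```python
# Toy sketch of layout-image joint generation as a single autoregressive
# decode: layout tokens are planned first, then image tokens are generated
# conditioned on the prompt and the planned layout. Illustrative only.

def decode(next_token, prompt_tokens, layout_end, n_image_tokens):
    seq = list(prompt_tokens)
    # Phase 1: plan the layout until the end-of-layout marker appears.
    while seq[-1] != layout_end:
        seq.append(next_token(seq))
    # Phase 2: generate image tokens conditioned on prompt + planned layout.
    for _ in range(n_image_tokens):
        seq.append(next_token(seq))
    return seq

# Mock "model" scripted to emit two layout tokens, the end marker,
# then two image tokens.
script = iter(["L1", "L2", "<eol>", "I1", "I2"])
seq = decode(lambda s: next(script), ["prompt"], "<eol>", 2)
# seq == ["prompt", "L1", "L2", "<eol>", "I1", "I2"]
```

Because both phases share one next-token predictor, no hand-off between separate models is needed.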

Layout-to-Image Generation

PlanGen takes layout conditions as context input, offering strong flexibility and scalability. Unlike previous methods, it does not need to compress the layout conditions, so local descriptions can be as detailed and complex as needed, yielding better-aligned layout-to-image generation thanks to the transformer's long-context dependencies.
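As an illustration of layout-as-context, local captions and bounding boxes could be serialized directly into the prompt as plain text. The `<obj>`/`<box>` tag format below is a hypothetical example, not the paper's actual prompt template:

```python
# Sketch: serializing layout conditions as plain-text context for an
# autoregressive VLM. The <obj>/<box> tags are hypothetical -- PlanGen's
# real prompt template may differ.

def serialize_layout(global_caption, regions):
    """Build a unified text prompt from a global caption and a list of
    (local_caption, bbox) pairs, with bbox as (x1, y1, x2, y2) in [0, 1]."""
    parts = [global_caption]
    for local_caption, bbox in regions:
        coords = ",".join(f"{v:.2f}" for v in bbox)
        parts.append(f"<obj>{local_caption}<box>{coords}</box></obj>")
    return " ".join(parts)

prompt = serialize_layout(
    "a cat and a dog in a park",
    [("an orange cat sitting on grass", (0.05, 0.40, 0.45, 0.95)),
     ("a brown dog running", (0.50, 0.35, 0.95, 0.90))],
)
```

Since the layout is just more text in the context window, there is no pooling step to lose detail: each local caption can be arbitrarily long.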

Image Layout Understanding

Understanding the layout of real images helps the model generate images that better conform to layout conditions, intuitively because it deepens the model's understanding of the relationship between layout conditions and the corresponding images. Adding the image layout understanding task moves PlanGen further toward a general-purpose layout VLM.
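For layout understanding, the textual layout the model emits can be parsed back into structured boxes. This sketch assumes a hypothetical `<obj>`/`<box>` tag format for the model's output, chosen for illustration:

```python
import re

# Sketch: parsing a layout description emitted as text back into
# structured (caption, bbox) pairs. The <obj>/<box> tag format is an
# assumption, not necessarily PlanGen's actual output format.

LAYOUT_RE = re.compile(r"<obj>(.*?)<box>([\d.,]+)</box></obj>")

def parse_layout(text):
    regions = []
    for caption, coords in LAYOUT_RE.findall(text):
        bbox = tuple(float(v) for v in coords.split(","))
        regions.append((caption.strip(), bbox))
    return regions

out = parse_layout("<obj>a red car<box>0.10,0.50,0.60,0.90</box></obj>")
# out == [("a red car", (0.1, 0.5, 0.6, 0.9))]
```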

Layout-guided Image Manipulation

Benefiting from the layout-conditioned modeling paradigm, PlanGen can control the content generated in a local area based on the corresponding local caption and bounding box, which allows us to easily extend PlanGen to layout-guided image manipulation without further task-specific training.
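One way to picture the negative layout guidance mentioned in the abstract is a classifier-free-guidance-style combination of token logits conditioned on the desired layout versus an unwanted (negative) layout. The formulation below is an assumption for illustration, not the paper's exact method:

```python
# Sketch of negative layout guidance as a CFG-style logit combination:
# push token logits away from those conditioned on the unwanted layout.
# Assumed formulation for illustration only.

def guided_logits(pos_logits, neg_logits, scale=3.0):
    """logits = neg + scale * (pos - neg); scale > 1 strengthens the
    desired layout condition relative to the negative one."""
    return [n + scale * (p - n) for p, n in zip(pos_logits, neg_logits)]

out = guided_logits([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], scale=2.0)
# out == [3.0, 0.0, -2.0]
```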

BibTeX

@misc{he2025plangen,
      title={PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models}, 
      author={Runze He and Bo Cheng and Yuhang Ma and Qingxiang Jia and Shanyuan Liu and Ao Ma and Xiaoyu Wu and Liebucha Wu and Dawei Leng and Yuhui Yin},
      year={2025},
      eprint={2503.10127},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10127}, 
}