Abstract

Global self-attention in diffusion transformers performs redundant computation because visual information is sparse and redundant: the attention maps of tokens within the same spatial window are highly similar. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which models global visual information efficiently through sparse representative-token attention, where the number of representative tokens is much smaller than the total number of tokens. Specifically, within each transformer block, we average the tokens in each spatial-temporal window to obtain a proxy token for that region. Global semantics are captured through self-attention among these proxy tokens and then injected into all latent tokens via cross-attention. In addition, we introduce window attention and shifted window attention to compensate for the limited detail modeling of the sparse attention mechanism. Building on PT-DiT, we further develop the Qihoo-T2X family, which includes models for text-to-image (T2I), text-to-video (T2V), and text-to-multi-view (T2MV) generation. Experiments show that PT-DiT achieves competitive performance while reducing computational complexity in both image and video generation (e.g., a 48% reduction compared to DiT and a 35% reduction compared to PixArt-α).

Qihoo-T2I / T2V / T2MV

Method


The image or video is first encoded by a 3D VAE; noise addition, patch embedding, and positional encoding then produce the latent tokens. We replace global attention with proxy-tokenized attention to establish contextual associations and employ visual cross-attention to propagate this information to all tokens, thereby reducing computational redundancy. Texture detail modeling is further enhanced through window attention and shifted window attention.
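The proxy-token path described above can be illustrated with a minimal NumPy sketch. It is not the PT-DiT implementation: it uses a single head, omits all learned projections, normalization, and the window and shifted window attention branch, and adds the injected context through a plain residual sum. The function names, the window size, and the residual connection are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention: q is (Nq, D), k and v are (Nk, D).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def proxy_tokens(latent, window):
    # Average every (wf, wh, ww) window of a (F, H, W, D) latent grid,
    # giving one proxy token per spatial-temporal window.
    f, h, w, d = latent.shape
    wf, wh, ww = window
    assert f % wf == 0 and h % wh == 0 and w % ww == 0
    blocks = latent.reshape(f // wf, wf, h // wh, wh, w // ww, ww, d)
    return blocks.mean(axis=(1, 3, 5)).reshape(-1, d)

def proxy_tokenized_attention(latent, window):
    # Hypothetical simplification: no learned Q/K/V projections.
    f, h, w, d = latent.shape
    tokens = latent.reshape(-1, d)
    proxies = proxy_tokens(latent, window)
    # Global semantics: self-attention among the (few) proxy tokens.
    proxies = attention(proxies, proxies, proxies)
    # Injection: every latent token cross-attends to the proxy tokens.
    injected = attention(tokens, proxies, proxies)
    # Residual sum is an assumption of this sketch.
    return (tokens + injected).reshape(f, h, w, d)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8, 16))   # (frames, height, width, channels)
window = (2, 4, 4)                            # illustrative window size
print(proxy_tokens(latent, window).shape)               # (8, 16)
print(proxy_tokenized_attention(latent, window).shape)  # (4, 8, 8, 16)
```

In this toy setting, 256 latent tokens are summarized by 8 proxy tokens, so neither attention call ever forms a 256 × 256 score matrix.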

Comparison of Computational Complexity


PT-DiT reduces computational complexity in both image and video generation. Across multiple scales, PT-DiT/H requires significantly fewer GFLOPs than Lumina-Next: at resolutions of 512 and 2048, the reduction is 82.0% and 82.5%, respectively. Because the T2V version of EasyAnimateV4 uses HunyuanDiT with 3D full attention, its memory consumption grows rapidly with the number of video frames. In contrast, PT-DiT, which also performs 3D spatial-temporal modeling, shows only a slight increase in memory consumption thanks to its proxy-tokenized attention mechanism.
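To give a rough sense of where the savings come from, the sketch below counts query-key pairs for full self-attention and for the proxy-token path (proxy self-attention plus token-to-proxy cross-attention). It is a back-of-the-envelope count only: it ignores projections, feed-forward layers, and the window attention branch, so it does not reproduce the GFLOPs figures reported above. The function names, token counts, and window size are hypothetical values chosen for illustration.

```python
def full_attention_pairs(num_tokens):
    # Every token attends to every token.
    return num_tokens * num_tokens

def proxy_attention_pairs(num_tokens, window_size):
    # One proxy token per window of `window_size` latent tokens.
    num_proxies = num_tokens // window_size
    self_pairs = num_proxies * num_proxies   # proxy self-attention
    cross_pairs = num_tokens * num_proxies   # tokens attend to proxies
    return self_pairs + cross_pairs

window_size = 16  # e.g. a 4 x 4 spatial window (assumed, not from the source)
for side in (32, 64, 128):
    n = side * side
    full = full_attention_pairs(n)
    proxy = proxy_attention_pairs(n, window_size)
    print(f"{n:6d} tokens: full={full:12d} proxy={proxy:12d} ratio={proxy / full:.4f}")
```

With a fixed window size, the ratio of proxy pairs to full pairs stays the same as the token count grows, which is consistent with a relative saving that holds across resolutions.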

More cases of the Qihoo-T2I

A small plant grows in the mud. The sun shines on it, providing warmth and light.
A black and white drawing of a man with a beard, wearing a jacket, a sketchy style.
Two pages in the book show a cityscape, with a large spaceship flying over a city.
A cute owl sitting on a table surrounded by books, cups, spoons and bowls.
A huge pink planet in the sky lies between large green pyramids.
A woman wearing brown clothes with bow and arrow standing in front of a city background.
A futuristic city shrouded in green light.
A forest with a blend of red, yellow and orange tones and a wide variety of tree species.
A large colorful castle in cartoon style.
A beautiful lady is looking up. There are stars in the sky and a flying bird.
A large future residential building that combines green plants with architecture.
A large, brightly colored spiral nebula, with a starry background.
A scene of a river flowing through a grassy field during a beautiful sunset. a watercolor style.
A woman with red hair, wearing a white sweater. The background is blurry.
An intricate flame-shaped ornament.
Close up of a fox with orange and white fur.
Portrait of a man with leaves surrounding his face.
A chieftain wearing a feather headdress.
A vast open field under a blue sky with misty mountains in the distance.
A futuristic cityscape with a huge bright sun in the center of the scene.
Close-up of a man's face wearing glasses against a colorful background.
In a dark purple forest, there is a lady holding a staff, and behind her are his teammates.

More cases of the Qihoo-T2V

The bustling city night scene, tall buildings, the camera shifts from the left side of the frame to right.
Blonde woman with sunflowers, smiling in a sunflower field under blue sky.
Futuristic tunnel with pink-purple lights, conveying speed and movement.
Sunset cityscape with spires, buildings, clouds, warm glow, and trees.
Woman back to camera in hat and purple dress among purple flowers.
A school of fish is seen wandering in the deep sea, and below the video is a coral reef.

More cases of the Qihoo-T2MV

Pixelated Minecraft sword with a yellow handle.
Low poly duck model with orange beak and green cap.
Pixel art model of a robot-doll with red shoes.
A yellow toy fox character.

BibTeX

@misc{wang2024qihoot2xefficiencyfocuseddiffusiontransformer,
    title={Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task},
    author={Jing Wang and Ao Ma and Jiasong Feng and Dawei Leng and Yuhui Yin and Xiaodan Liang},
    year={2024},
    eprint={2409.04005},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2409.04005},
}