The global self-attention mechanism in diffusion transformers involves redundant computation because visual information is sparse and repetitive, and the attention maps of tokens within the same spatial window are highly similar. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we average the tokens in each spatio-temporal window to obtain a proxy token for that region. Global semantics are captured through self-attention among these proxy tokens and then injected into all latent tokens via cross-attention. In addition, we introduce window attention and shifted-window attention to compensate for the limited detail modeling of the sparse attention mechanism. Building on the well-designed PT-DiT, we develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing computational complexity in both image and video generation tasks (e.g., a 48% reduction compared to DiT and a 35% reduction compared to PixArt-α).
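To make the mechanism concrete, the following is a minimal PyTorch sketch of proxy-tokenized attention as described above. The module name, tensor layout, and use of nn.MultiheadAttention are illustrative assumptions rather than the released implementation, which additionally interleaves this branch with window attention and feed-forward layers.

```python
import torch
import torch.nn as nn

class ProxyTokenAttention(nn.Module):
    """Sketch of proxy-tokenized attention: average-pool each
    spatio-temporal window into one proxy token, run self-attention
    over the (much smaller) proxy set, then broadcast the resulting
    global context to every latent token via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.proxy_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, window: tuple) -> torch.Tensor:
        # x: (B, T, H, W, C) latent tokens; window: (wt, wh, ww) sizes,
        # assumed to evenly divide T, H, and W for this sketch.
        B, T, H, W, C = x.shape
        wt, wh, ww = window
        # Average each non-overlapping spatio-temporal window into one proxy.
        proxies = (
            x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
             .mean(dim=(2, 4, 6))   # (B, T/wt, H/wh, W/ww, C)
             .flatten(1, 3)         # (B, M, C) with M << N
        )
        # Global semantics: self-attention among the sparse proxy tokens.
        proxies, _ = self.proxy_self_attn(proxies, proxies, proxies)
        # Inject the global context into all N latent tokens via cross-attention.
        tokens = x.flatten(1, 3)    # (B, N, C) with N = T * H * W
        out, _ = self.visual_cross_attn(tokens, proxies, proxies)
        return out.reshape(B, T, H, W, C)

# Toy usage: 8x16x16 latent grid with a 2x4x4 window -> 64 proxies for 2048 tokens.
x = torch.randn(2, 8, 16, 16, 64)
y = ProxyTokenAttention(dim=64)(x, window=(2, 4, 4))
```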
The input image or video is first encoded by a 3D VAE; noise addition, patch embedding, and positional encoding then yield the latent tokens. We replace global attention with proxy-tokenized attention to establish contextual associations and employ visual cross-attention to propagate this global information to all tokens, thereby reducing computational redundancy. Texture detail modeling is further enhanced through window attention and shifted-window attention.
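The detail-modeling branch can be sketched in the same spirit. Below is a hedged, Swin-style approximation of the window and shifted-window attention mentioned above; the cyclic torch.roll shift and the omission of a boundary mask (which a full shifted-window implementation would use to block attention across rolled borders) are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Sketch of local detail modeling: multi-head attention restricted
    to non-overlapping windows, with an optional cyclic shift so that
    successive blocks exchange information across window borders."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, window=(4, 4), shift: bool = False):
        # x: (B, H, W, C) latent tokens of one frame; window: (wh, ww).
        B, H, W, C = x.shape
        wh, ww = window
        if shift:  # shifted-window variant: roll tokens by half a window
            x = torch.roll(x, shifts=(-(wh // 2), -(ww // 2)), dims=(1, 2))
        # Partition into windows: (B * num_windows, wh * ww, C).
        xw = (x.view(B, H // wh, wh, W // ww, ww, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(-1, wh * ww, C))
        xw, _ = self.attn(xw, xw, xw)  # attention only within each window
        # Undo the partition (and the shift, if applied).
        x = (xw.view(B, H // wh, W // ww, wh, ww, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C))
        if shift:
            x = torch.roll(x, shifts=(wh // 2, ww // 2), dims=(1, 2))
        return x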
PT-DiT effectively reduces computational complexity in both image and video generation tasks. The GFLOPs of PT-DiT/H are significantly lower than those of Lumina-Next across multiple resolutions: at resolutions of 512 and 2048, PT-DiT/H achieves complexity reductions of 82.0% and 82.5%, respectively. Because the T2V version of EasyAnimateV4 employs HunyuanDiT with 3D full attention, its memory consumption grows dramatically with the number of video frames. In contrast, PT-DiT, which also performs 3D spatio-temporal modeling, sees only a slight increase in memory consumption thanks to its proxy-tokenized attention mechanism.
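A back-of-the-envelope calculation shows where the savings come from: global self-attention scales with N² in the token count N, whereas proxy-tokenized attention replaces this with M² + N·M for M proxy tokens (one per window). The snippet below counts only the attention matmul terms and uses an illustrative latent grid with a hypothetical 2×4×4 window, not the paper's measured configuration.

```python
def attn_flops(q_len: int, kv_len: int, dim: int) -> int:
    # QK^T plus the attention-weighted V product: ~2 * q_len * kv_len * dim each.
    return 2 * (2 * q_len * kv_len * dim)

T, H, W, C = 16, 64, 64, 1152   # hypothetical latent grid and channel width
N = T * H * W                   # 65,536 latent tokens
M = N // (2 * 4 * 4)            # 2,048 proxy tokens with a 2x4x4 window

full = attn_flops(N, N, C)                        # global self-attention
proxy = attn_flops(M, M, C) + attn_flops(N, M, C) # proxy self-attn + cross-attn
print(f"proxy/full attention cost ratio: {proxy / full:.3f}")  # ~0.032 here
```

The attention term itself shrinks by more than an order of magnitude in this toy setting; the overall reductions reported above are smaller because projections, window attention, and MLPs still scale with N.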
@misc{wang2024qihoot2xefficiencyfocuseddiffusiontransformer,
title={Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task},
author={Jing Wang and Ao Ma and Jiasong Feng and Dawei Leng and Yuhui Yin and Xiaodan Liang},
year={2024},
eprint={2409.04005},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.04005},
}