FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance


Jiasong Feng¹,* Ao Ma¹,*,† Jing Wang¹,²,* Bo Cheng¹ Xiaodan Liang² Dawei Leng¹,‡ Yuhui Yin¹

360 AI Research¹ Sun Yat-sen University²

* Equal Contribution. † Project Lead. ‡ Corresponding Authors.

Abstract

Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially for extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, which guides every frame with the same text condition and offers no frame-specific textual guidance. This restricts the model's capacity to comprehend the temporal logic conveyed in prompts and to generate videos with coherent motion. To tackle this limitation, we introduce FancyVideo, a video generator that improves the existing text-control mechanism with a Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. First, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark and facilitates the synthesis of dynamic and consistent videos.
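To make the abstract's description concrete, here is a toy NumPy sketch contrasting plain spatial cross-attention with the three-stage idea of CTGM. It is an illustration under loud assumptions, not the FancyVideo implementation: the function names (`spatial_cross_attention`, `ctgm_sketch`, `smoothing_matrix`) are hypothetical, learned projections are omitted, and a fixed neighbour-averaging matrix stands in for whatever learned temporal layers TII, TAR, and TFB actually use.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smoothing_matrix(num_frames):
    """Row-normalized tridiagonal matrix: each frame is averaged with its
    neighbours. ASSUMPTION: a stand-in for a learned temporal layer."""
    w = np.eye(num_frames) + np.eye(num_frames, k=1) + np.eye(num_frames, k=-1)
    return w / w.sum(axis=1, keepdims=True)

def spatial_cross_attention(latents, text):
    """Baseline: every frame attends to the same text tokens.
    latents: (F, N, D) frames x spatial tokens x channels; text: (T, D)."""
    d = latents.shape[-1]
    logits = latents @ text.T / np.sqrt(d)        # (F, N, T)
    return softmax(logits) @ text                 # (F, N, D)

def ctgm_sketch(latents, text):
    """Toy version of the three stages named in the abstract."""
    f, _, d = latents.shape
    mix = smoothing_matrix(f)

    # Beginning of cross-attention (TII stand-in): let the text tokens read
    # from each frame's latents, giving one text condition per frame.
    read = softmax(text @ latents.transpose(0, 2, 1) / np.sqrt(d)) @ latents
    frame_text = text[None] + read                # (F, T, D)

    # Middle (TAR stand-in): refine the latent/text correlation matrix
    # along the time (frame) axis before normalizing it.
    logits = latents @ frame_text.transpose(0, 2, 1) / np.sqrt(d)  # (F, N, T)
    logits = np.tensordot(mix, logits, axes=(1, 0))
    attn = softmax(logits)

    # End (TFB stand-in): mix the attended features across frames to
    # encourage temporal consistency.
    out = np.tensordot(mix, attn @ frame_text, axes=(1, 0))        # (F, N, D)
    return out, frame_text, attn

# Example usage with random data: 4 frames, 6 spatial tokens, 5 text tokens.
rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 6, 8))
text = rng.normal(size=(5, 8))
baseline = spatial_cross_attention(latents, text)
guided, frame_text, attn = ctgm_sketch(latents, text)
```

In the baseline, the keys and values are identical for every frame, so the text cannot say anything different about frame 1 than about frame 4. In the sketch, `frame_text` carries a separate condition per frame, which is the property the abstract calls cross-frame textual guidance.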

Multiple Resolution Video Generation

768 × 768

A golden retriever has a picnic on a beautiful tropical beach at sunset.

A happy elephant wearing a birthday hat walking under the sea.

A panda standing on a surfboard in the ocean in sunset, 4k, high resolution.

Red sports car coming around a bend in a mountain road.

1024 × 768

A monkey is playing bass guitar, stage background.

Aerial view of a hiker man standing on a mountain peak.

New York Skyline with 'Hello World' written with fireworks on the sky.

Teddy bear surfer rides the wave in the tropics.

768 × 1024

A bear wearing sunglasses and hosting a talk show.

A photo of a Corgi dog riding a bike in Times Square. It is wearing sunglasses and a beach hat.

Impressionist style, a yellow rubber duck floating on the wave on the sunset

1024 × 1024

A cat wearing sunglasses and working as a lifeguard at a pool.

A confused grizzly bear in calculus class.

A dog wearing virtual reality goggles in sunset, 4k, high resolution.

Personalized Video Generation

Realcartoon3d

Ultra detailed background ((Cherry Blossoms)),(22 years old Spanish woman),medium breast,wearing flowing dress,golden brown flowing hair glamour,(green eyes),beautiful face,((white mists:1.4)),(pink dust:1.2),mysterious,mysteries of universe,yellow lightnings,volumetric lightnings,dark and blurry background.

Girl with really wild hair,mane,multicolored hairlighting,(from front:0.6).

Technological sense,best quality, masterpiece, illustration, wallpaper, official art, Amazing, finely detail, an extremely delicate and beautiful,extremely detailed,highly detailed,sharp focus,rich background,blurry background,(real person,photograph).

Firefox,award winning illustration.

Toonyou

1girl, collarbone, wavy hair, looking at viewer, blurry foreground, upper body, necklace, contemporary, plain pants, ((intricate, print, pattern)), ponytail, freckles, red hair, dappled sunlight, smile, happy,

1girl, tube top, sideboob, wind, strapless, dolphin shorts, upper body, french braid, looking at viewer, wet, ((solo focus, intricate, paw pose, splashing, in water)), smile, pubis, beach, wide hips, blonde, medium breasts, blurry, mountainous horizon, cloudy sky.

1girl, (curly hair), short hair, cafe, barista, dark skin, happy, looking at viewer, hairband, waist apron, earrings, upper body, ((intricate, print)).

1boy, jacket, beard, walking, beanie, sunglasses, ((from below, looking up, fisheye)), upper body, wasteland, sunset, solo focus, cloudy sky, backpack, hands in pockets.

PixarsRendman

A woman wearing denim jacket and cowboy hat

A happy dog running in a meadow at dusk

A curious cat perching itself in the front of a windowsill

Astronaut walking on the surface of the moon

BibTeX

@misc{feng2024fancyvideodynamicconsistentvideo,
  title={FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance},
  author={Jiasong Feng and Ao Ma and Jing Wang and Bo Cheng and Xiaodan Liang and Dawei Leng and Yuhui Yin},
  year={2024},
  eprint={2408.08189},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.08189},
}