🏀 GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

¹Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences · ²University of Chinese Academy of Sciences · ³VIVO AI Lab

Contact: jx.lv1@siat.ac.cn

Comparisons with Baselines

[Video grid] Prompt: “A basketball free falls in the air”. Columns: GPT4Motion (Ours), AnimateDiff, ModelScope, Text2Video-Zero, DirecT2V.

Comparison of the video results generated by different text-to-video models with the prompt “A basketball free falls in the air”.


More Results

All videos are 1080p, 24 fps.


Basketball (Rigid Objects)

A basketball spins out of the air and falls.


Four basketballs spin randomly in the air and fall.


A basketball is thrown towards the camera.

Water (Liquid)

Water flows into a white mug on a table.


Viscous water flows into a white mug on a table.


Very viscous water flows into a white mug on a table.

T-shirt (Irregular Cloth Object)

A white T-shirt flutters in light wind.


A white T-shirt flutters in the wind.


A white T-shirt flutters in strong wind.

Flag (Cloth Object)

A white flag flaps in light wind.


A white flag flaps in the wind.


A white flag flaps in strong wind.

Abstract

Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, these models usually incur high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT-4, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script from a user's textual prompt; the script commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. These components are then fed into Stable Diffusion to generate a video aligned with the prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can efficiently generate high-quality videos while maintaining motion coherence and entity consistency. GPT4Motion offers new insights into text-to-video research, enhancing its quality and broadening the horizon for further exploration.
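The training-free pipeline can be pictured as a thin driver around GPT-4 and headless Blender. Below is a minimal sketch of that loop, assuming the OpenAI Python client and a local Blender install; the prompt template wording, the model name, and the file name sim_script.py are illustrative placeholders, not the paper's released code.

```python
# Minimal sketch of a GPT4Motion-style driver loop. The prompt template and
# file names are illustrative placeholders, not the paper's actual code.
import subprocess
from openai import OpenAI

PROMPT_TEMPLATE = (
    "You are an expert in Blender's Python API. Write a complete bpy script "
    "that uses Blender's physics engine to simulate the following scene, "
    "then renders per-frame edge maps and depth maps:\n\n{user_prompt}"
)

def generate_blender_script(user_prompt: str) -> str:
    """Ask GPT-4 to turn a textual motion description into a bpy script."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(user_prompt=user_prompt)}],
    )
    return response.choices[0].message.content

def run_simulation(script_path: str) -> None:
    """Run the generated script in headless Blender to bake and render."""
    subprocess.run(["blender", "--background", "--python", script_path],
                   check=True)

if __name__ == "__main__":
    script = generate_blender_script("A basketball free falls in the air")
    with open("sim_script.py", "w") as f:
        f.write(script)
    run_simulation("sim_script.py")
```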

Architecture


The architecture of GPT4Motion. First, the user prompt is inserted into our prompt template. The Python script generated by GPT-4 then drives Blender's physics engine to simulate the corresponding motion, producing sequences of edge maps and depth maps. Finally, two ControlNets constrain the physical motion of the video frames generated by Stable Diffusion, and a temporal consistency constraint enforces coherence across frames.
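To make the Blender side concrete, here is a hedged sketch of the kind of bpy script GPT-4 could emit for the basketball prompt: a rigid-body sphere dropped under gravity, with Freestyle enabled for edge renders and the Z pass enabled for per-frame depth, one plausible way to obtain the two control signals. The ball radius, mass, camera pose, and output path are illustrative assumptions.

```python
# Sketch of a GPT-4-style Blender script for "A basketball free falls in the
# air" (run via `blender --background --python this_file.py`). Sizes, camera
# pose, and output paths are illustrative assumptions.
import bpy

scene = bpy.context.scene
scene.frame_start, scene.frame_end = 1, 72   # 3 s at 24 fps
scene.render.fps = 24

# Ground plane: passive rigid body so the ball has something to land on.
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'PASSIVE'

# Basketball: active rigid body dropped from 3 m (radius ~0.12 m).
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.12, location=(0, 0, 3))
ball = bpy.context.object
bpy.ops.rigidbody.object_add()
ball.rigid_body.type = 'ACTIVE'
ball.rigid_body.mass = 0.6
ball.rigid_body.restitution = 0.8  # bounciness on impact

# Camera looking at the drop path.
bpy.ops.object.camera_add(location=(0, -6, 1.5), rotation=(1.45, 0, 0))
scene.camera = bpy.context.object

# Freestyle gives stylized edge renders; the Z pass gives per-frame depth.
scene.render.use_freestyle = True
scene.view_layers["ViewLayer"].use_pass_z = True

# Bake the physics cache, then render the animation frame by frame.
bpy.ops.ptcache.bake_all(bake=True)
scene.render.filepath = "/tmp/basketball/frame_"
bpy.ops.render.render(animation=True)
```

Writing the Z pass out as image files would additionally go through Blender's compositor (e.g. a File Output node fed by the render layer's depth socket); that plumbing is omitted here for brevity.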
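On the diffusion side, conditioning Stable Diffusion on the two per-frame control signals can be sketched with the diffusers library's multi-ControlNet support. The checkpoints below are public Canny and depth ControlNets used as stand-ins, the frame file names are illustrative, and the paper's temporal consistency constraint is not part of stock diffusers, so this sketch denoises each frame independently.

```python
# Sketch of per-frame generation with two ControlNets (Canny edges + depth)
# using Hugging Face diffusers. Checkpoints are public stand-ins; the paper's
# temporal consistency constraint is NOT implemented here.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
                                    torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth",
                                    torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "A basketball free falls in the air"
frames = []
for i in range(1, 73):  # one denoising pass per simulated frame
    edge = Image.open(f"/tmp/basketball/edge_{i:04d}.png")
    depth = Image.open(f"/tmp/basketball/depth_{i:04d}.png")
    # Reusing one seed across frames is a simple aid to appearance stability.
    generator = torch.Generator("cuda").manual_seed(42)
    frame = pipe(prompt, image=[edge, depth],
                 num_inference_steps=30, generator=generator).images[0]
    frames.append(frame)
```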