FreeNoise: Tuning-Free Longer Video Diffusion
via Noise Rescheduling

Anonymous authors*

* Paper under double-blind review


✅ completely tuning-free      ✅ less than 20% extra inference time      ✅ supports up to 512 frames     

Abstract

With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, so they cannot generate high-fidelity long videos at inference time. Furthermore, these models support only single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of the initial noise in video diffusion models. Building upon this observation, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noise independently for all frames, we reschedule a sequence of noises to establish long-range correlation and perform temporal attention over them via window-based fusion. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurred about 255% extra time cost, our method incurs only a negligible time cost of approximately 17%.
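As a rough illustration of the two operations above (noise rescheduling and window-based temporal attention), the sketch below reuses one training-length window of per-frame noises with local shuffling and enumerates sliding attention windows. The helper names (`reschedule_noise`, `window_indices`) and the specific shuffle, window, and stride choices are assumptions made for illustration, not the exact FreeNoise implementation.

```python
import torch

def reschedule_noise(base_frames, total_frames, shape, seed=0):
    """Illustrative noise rescheduling: reuse one training-length window of
    per-frame noises for long-range correlation, locally shuffling each
    repetition so repeated windows stay correlated but are not identical."""
    g = torch.Generator().manual_seed(seed)
    base = torch.randn(base_frames, *shape, generator=g)   # [T, C, H, W]
    chunks = []
    for _ in range(0, total_frames, base_frames):
        perm = torch.randperm(base_frames, generator=g)    # local shuffle
        chunks.append(base[perm])
    return torch.cat(chunks, dim=0)[:total_frames]

def window_indices(total_frames, window=16, stride=4):
    """Sliding windows over the frame axis. Temporal attention would be run
    per window and the overlapping outputs fused (e.g. averaged), so every
    attention call stays within the training length."""
    last_start = max(total_frames - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:                            # cover the tail
        starts.append(last_start)
    return [list(range(s, s + window)) for s in starts]

# Example: 64-frame latent noise for a model trained on 16 frames.
noise = reschedule_noise(base_frames=16, total_frames=64, shape=(4, 40, 64))
windows = window_indices(total_frames=64, window=16, stride=4)
```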

Comparisons of Longer Video Generation

Comparisons of Multi-Prompt Video Generation

Ablation for Noise Rescheduling

Ablation for Motion Injection

Longer Results with 512 Frames

Multi-Prompt Results with 256 Frames

A. Other Noise Scheduling

We explored some other strategies in our early experiments. We tried mixed noise and progressive noise [1] to make the fragments generated by each window more correlated (both are sketched below). However, they produce poor-quality results due to the training-inference gap. In addition, we tried flipping noise frames spatially. Although this introduces more new content, it also introduces abrupt changes in content.
[1] Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
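A minimal sketch of these two priors, roughly following the formulation in [1]: mixed noise combines one noise tensor shared across frames with per-frame noise, while progressive noise builds an AR(1)-style chain so neighbouring frames are more correlated than distant ones. The mixing weight `alpha` and the normalization are assumptions chosen so that each frame's marginal stays unit Gaussian; they may differ from the exact settings used in our early experiments.

```python
import torch

def mixed_noise(num_frames, shape, alpha=1.0, generator=None):
    """Mixed noise prior (roughly following [1]): shared + per-frame noise,
    normalized so each frame's noise remains unit Gaussian."""
    shared = torch.randn(*shape, generator=generator)
    indiv = torch.randn(num_frames, *shape, generator=generator)
    return (alpha * shared + indiv) / (1.0 + alpha ** 2) ** 0.5

def progressive_noise(num_frames, shape, alpha=1.0, generator=None):
    """Progressive noise prior (roughly following [1]): an AR(1)-style chain,
    giving stronger correlation between neighbouring frames."""
    eps = [torch.randn(*shape, generator=generator)]
    for _ in range(num_frames - 1):
        new = torch.randn(*shape, generator=generator)
        eps.append((alpha * eps[-1] + new) / (1.0 + alpha ** 2) ** 0.5)
    return torch.stack(eps, dim=0)
```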

B. Case Analysis of Significant Movement

Videos with significant movement mainly fall into three types: (1) the camera moving with the subject, (2) the subject moving off the screen, and (3) the subject moving within the screen. Which type occurs is determined automatically during inference by the sampled random noise and the given prompt. Since the base model (inference without FreeNoise) struggles to handle the latter two types effectively, our previous results only showcased instances of the camera moving with the subject for videos with significant movement.

B1. Effect of Noise Rescheduling (with Training Length 16 Frames)

Noise rescheduling is able to generate new motions while maintaining the main subjects and scenes.

B2. Real Video of Running Horse (Source)

In real videos with significant movement, the camera moving with the subject is a common case.

B3. Camera Moving with the Subject

For videos of this type, the position of the subject changes little, and the movement is conveyed by the background receding past the camera.

B4. Subject Moving off the Screen

For videos of this type, the subject moves off the screen. However, the subject suddenly reappears due to semantic constraints.

B5. Subject Moving within the Screen

For videos of this type, the subject moves within the screen. Due to the limited size of the screen, the subject has to turn around. However, the current pretrained model behaves unnaturally when the subject turns (even for inference without FreeNoise).

B6. Depth2Video

FreeNoise works with ControlNet, and the additional condition helps generate more diverse motions. However, naively applying FreeNoise with ControlNet does not work perfectly, because the frame-wise varying depth conditions introduce extra variation into the context. Further exploration is required to make this combination work properly.

FreeNoise+AnimateDiff

Our FreeNoise is also applicable to another Video LDM framework, AnimateDiff.
w/o Noise Rescheduling
w/ Noise Rescheduling