Breathing Life Into Sketches Using Text-to-Video Priors

We recommend watching the video with sound on

Abstract

A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.

Gallery

How does it work?

Representation

We represent a sketch as a set of strokes placed over a white background. Each stroke is a two-dimensional Bézier curve (blue) with four control points (red). Our method predicts an offset for each point (green) for every frame. These offsets deform the sketch to create the appearance of motion.
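As a rough illustration of this representation, the sketch below evaluates a cubic Bézier stroke from its four control points and applies one set of per-frame offsets. The function names and the sample coordinates are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bezier_point(ctrl: List[Point], t: float) -> Point:
    """Evaluate a cubic Bezier curve with four control points at t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = ctrl
    s = 1.0 - t
    # Bernstein basis weights for a cubic curve
    b0, b1, b2, b3 = s**3, 3 * s**2 * t, 3 * s * t**2, t**3
    return (b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3,
            b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3)


def deform_stroke(ctrl: List[Point], offsets: List[Point]) -> List[Point]:
    """Add one (dx, dy) offset to each control point to get a frame's stroke."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(ctrl, offsets)]


# One stroke and hypothetical offsets for a single frame (illustrative values)
stroke = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
frame_offsets = [(0.0, 0.0), (0.0, 0.5), (0.0, 0.5), (0.0, 0.0)]
frame_stroke = deform_stroke(stroke, frame_offsets)
```

Because the curve is fully determined by its control points, moving the points is enough to move the whole stroke, and the output stays in vector form.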

Network

To predict the offsets, we train a "neural displacement field", a small MLP that maps the initial sketch coordinates to their per-frame offsets. The network has two paths: a local path (green), which predicts an unconstrained offset for each point, and a global path (blue), which predicts the parameters of a global affine transformation matrix for each frame. This allows the model to focus on small local changes (e.g., bending an arm) while simultaneously creating large global movements or synchronized effects such as shrinking an object as it moves away from the camera.
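The following sketch shows one way the two outputs could be combined for a single frame: a global affine transform applied to the initial points, plus an unconstrained local offset per point. The parameterization (scale, shear, rotation, translation), the composition order, and the choice to transform about the origin are all assumptions made for illustration; the page above does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def affine_matrix(sx: float, sy: float, shear: float,
                  theta: float, tx: float, ty: float) -> List[List[float]]:
    """Build a 2x3 affine matrix: rotation @ shear @ scale, then translation."""
    c, s = math.cos(theta), math.sin(theta)
    # shear @ scale = [[sx, shear * sy], [0, sy]]
    a, b = sx, shear * sy
    d, e = 0.0, sy
    return [[c * a - s * d, c * b - s * e, tx],
            [s * a + c * d, s * b + c * e, ty]]


def frame_points(points: List[Point], local_offsets: List[Point],
                 params: Sequence[float]) -> List[Point]:
    """Compose one frame: global affine on the initial points plus local offsets."""
    m = affine_matrix(*params)
    out = []
    for (x, y), (dx, dy) in zip(points, local_offsets):
        gx = m[0][0] * x + m[0][1] * y + m[0][2]
        gy = m[1][0] * x + m[1][1] * y + m[1][2]
        out.append((gx + dx, gy + dy))
    return out


# Hypothetical frame: shrink the sketch slightly while bending one point
pts = [(1.0, 0.0), (0.0, 1.0)]
frame = frame_points(pts, [(0.0, 0.0), (0.1, 0.0)], (0.9, 0.9, 0.0, 0.0, 0.0, 0.0))
```

In this form a handful of global parameters moves every point coherently, while the local offsets remain free to handle small deformations.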

Training

To train this network, we leverage the motion prior encapsulated in a pre-trained text-to-video model. We begin by predicting the offset for each control point, adjusting the sketch to create all video frames, and rendering them using a differentiable rasterizer. We then use the standard SDS loss in order to extract a signal from the pre-trained diffusion model.
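For reference, the standard SDS gradient can be written as follows. The symbols are our own notation and are assumptions, not taken from the page: x is the rendered video (all frames), \phi the parameters of the displacement network, y the text prompt, \epsilon_\theta the pretrained denoiser, and w(t) a timestep-dependent weight. If the backbone operates in a latent space, x would stand for the encoded frames instead.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\nabla_{\phi} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t, \epsilon}\left[ w(t)\,
    \bigl( \epsilon_{\theta}(x_t;\, y,\, t) - \epsilon \bigr)\,
    \frac{\partial x}{\partial \phi} \right]
```

The gradient reaches \phi through the differentiable rasterizer, since x depends on the predicted control-point offsets; the diffusion model itself stays frozen.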

Comparisons to Prior Work

[Video comparison, two examples; panels: Gen2, ModelScope, VideoCrafter, ZeroScope, Animated Drawings, Ours]

We compare our method to five baselines: four image-to-video diffusion models (ZeroScope, ModelScope, VideoCrafter1 and Gen-2 by Runway) and one method tailored for animating children's drawings of the human figure. The image-conditioned diffusion models fail to maintain the unique characteristics of the sketch and suffer from multiple visual artifacts. The Animated Drawings method fares better, but it is specifically designed for a humanoid skeleton and for a fixed set of animations. Hence, it cannot create unseen movements (a ballerina's dance) or handle non-human figures such as a fish.
Below we provide the videos used in our quantitative experiment.

[Video grid, one row per sketch; columns: Input, ZeroScope, VideoCrafter, ModelScope, Ours]

Varying the Prompts

Input

"The boxer is running."

"The boxer is jumping."

"The boxer is punching."

Input

"The gazelle galloped through the grass."

"The gazelle looks around."

"The gazelle jumps."

Input

"The cat is playing."

"The cat is curled up."

"The cat walks forward."

Input

"The biker is pedaling, each leg pumping up and down."

"The bicycle rider jumps over an obstacle."

"The bicycle rider makes a turn at high speed."

Input

"The wine in the wine glass sways from side to side."

"The glass is being filled with wine."

Our method is based on signals from pre-trained text-to-video models. As such, it offers a degree of control over the generated results by simply modifying the prompts that describe the movement. These changes are limited to small motions that the model can create, and that align with the semantics of the initial sketch. Hence, we can fill a glass with wine or ask a boxer to jump, but we may struggle to make the same boxer perform a back-flip.

Fail Case Analysis

Input

"The ballerina is dancing."

"The ballerina performed a grand jete."

"The ballerina bowed."

Prompt-only

Text Prompt Effect

Input Sketch

Baseline

Generic Prompt

Empty Prompt

+"A sketch of .."

+"Abstract sketch. Line drawing"

We examine how the specified prompt affects the animation. We first verify that the text itself influences the results in a meaningful way. To do so, we apply our method to several example sketches, using two alternatives: a "generic prompt" ("the object is moving") and the empty prompt (""). We further examine the impact of modifying the prompt in a way that would encourage the text-to-video model to create a sketch. Specifically, we either prepend the string "A sketch of" or append the string "Abstract sketch. Line drawing" to the prompts.
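The prompt variants described above can be assembled as in the sketch below. The helper name and dictionary keys are hypothetical, and joining the strings with a single space without changing capitalization is an assumption; the page does not state how the strings are concatenated.

```python
from typing import Dict


def prompt_variants(prompt: str) -> Dict[str, str]:
    """Return the prompt variants compared in this experiment (hypothetical helper)."""
    return {
        "baseline": prompt,
        "generic": "the object is moving",
        "empty": "",
        # Assumed joining: a single space, no change to capitalization
        "prepend_sketch": f"A sketch of {prompt}",
        "append_sketch": f"{prompt} Abstract sketch. Line drawing",
    }


variants = prompt_variants("The cat is playing.")
```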

Abstraction Levels

We demonstrate the performance of our method on sketches with varying levels of abstraction. We employed CLIPasso to generate three sketches for each subject, covering three abstraction levels. These abstractions are achieved by using different numbers of strokes, specifically 16, 8, and 4 strokes. Note that the motion remains apparent even under very abstract settings.

Sketch Representation

Input

Baseline

-lr local

Input

Baseline

-lr local

We illustrate the impact of changing the sketch representation. We applied our method to sketches from the TU-Berlin sketch dataset, a human-drawn, class-based sketch dataset, and showcase the results for four representative sketches. Our method was applied directly to the provided SVG files. As can be seen, our method successfully animates the sketches; however, their appearance is not fully preserved when using the default hyperparameters. This can be improved by using lower learning rates.

Trade-off

Input Sketch

0.0001

0.0005

0.001

0.005

0.01

We demonstrate the trade-off between the quality of the generated motion and the capacity to retain the appearance of the initial sketch. We show the impact of varying the local learning rate within the range of 0.0001 to 0.01, keeping all other parameters constant. Observe that as we move from left (0.0001) to right (0.01), the motion in the animations increases, better aligning with the text prompt but at the cost of preserving the original sketch's appearance. This trade-off introduces additional control for the user, who may prioritize stronger motion over sketch fidelity.
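To see why a larger learning rate lets the sketch drift further in a fixed training budget, consider the toy example below. It is a one-dimensional stand-in, not the actual training loop: a single coordinate is pulled by plain gradient descent toward a hypothetical "motion" target, and we measure how far it ends up from its starting position. The step count, target, and objective are all assumptions for illustration.

```python
def deviation_after_training(lr: float, steps: int = 100) -> float:
    """Toy 1-D stand-in: gradient descent pulling a control point toward a target."""
    start, target = 0.0, 1.0  # initial coordinate and a hypothetical "motion" target
    p = start
    for _ in range(steps):
        grad = p - target  # gradient of 0.5 * (p - target) ** 2
        p -= lr * grad
    return abs(p - start)  # distance from the original sketch position


# The learning rates shown in the figure columns
local_lrs = [0.0001, 0.0005, 0.001, 0.005, 0.01]
deviations = [deviation_after_training(lr) for lr in local_lrs]
```

With the same number of steps, the deviation from the initial position grows monotonically with the learning rate, mirroring the left-to-right trend described above.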

Hyperparameter Effects

Input Sketch

Baseline

+ lr local

+ translation

+ scale

As described in the main paper, there is an inherent trade-off between the components of our method. Here, we demonstrate how this trade-off can be used to provide further user control over the appearance of the output video by adjusting the method's parameters. The effects differ across sketches, which may be attributed to the video model's prior or to the quality of the initial sketch. In the third column ("+ lr local"), we show the impact of increasing the learning rate of the local MLP. In some cases (biking and butterfly), this results in stronger motion without compromising the sketch's appearance. In other cases (cobra and boat), it harms fidelity, completely altering the original sketch. In the fourth column ("+ translation"), we increase the translation prediction weight, which causes the objects to move more across the frame than in the baseline.

Comparing Video Models

Input

Baseline

zeroscope_v1_320s

zeroscope_v1-1_320s

zeroscope_v2_30x448x256

zeroscope_v2_576w

zeroscope_v2_dark_30x448x256

In the main paper, we used the publicly available ModelScope pretrained video model. Here, we examine whether our method carries over to other backbones. In particular, we look at a set of ZeroScope models, tuned across a range of resolutions and frame rates (see https://huggingface.co/cerspense for more details). As observed, our method successfully generalizes to these models with no additional changes. Note that different models do lead to different motion patterns, and some of them may result in different trade-offs between the level of motion and the ability to preserve the sketch. For example, zeroscope-v1-320s (third column) produces slower motions, while zeroscope-v2-576w (sixth column) produces more "jumpy" videos.

Ablation

Input Sketch

No Local

No Global

No Neural

Ours

We evaluate the main components of our method. Disabling the local path severely restricts the model's ability to capture natural motion, leading to wobbling and sliding effects rather than the abstract motion that fits the sketch. Disabling the global path, or replacing the neural network with direct optimization, leads to results that largely align with the prompts but contain a significant amount of temporal jitter and larger deviations from the input sketch.

Limitations

Sketch Representation

There are many ways to represent sketches in vector format, including different types of curves or primitive shapes (such as lines or polygons), different shape attributes (such as stroke width, closed shapes, and more), and different numbers of parameters. Our selection of hyperparameters and network design is based on one specific sketch representation. Below is an example of a sketch of a surfer, defined by a sequence of closed segments of cubic Bézier curves and containing a relatively high number of control points. As can be seen, the resulting translation of the sketch is significantly larger than in our typical results. In addition, the surfer's appearance is not well preserved, as its scale changes significantly.

Input

Output

Two Objects

Our method assumes that the input sketch depicts a single subject (a common scenario in character animation techniques). When applied directly to sketches involving multiple objects, we observe a degradation in result quality due to this design constraint. Here, for example, we expect the basketball to separate from the player's hand to achieve a natural dribbling motion. However, with our current settings it is impossible to achieve such separation, since the translation parameters are relative to the object, of which the basketball is a part. This limitation could be addressed with further technical development.

Input

Output

Scene Sketches

In a similar manner, we observe a degradation in result quality when our method is applied directly to scene sketches. As can be seen in this example, the entire scene moves unnaturally because of the single-object assumption.

Input

Output

Shape Preservation

While the trade-off between the motion quality and the sketch's fidelity can be controlled by altering the hyperparameters, we still observe that sometimes the sketch's identity is harmed. Here for example, the squirrel's motion is good, but the aspect ratio of the original squirrel changed. It may be possible to improve on this front by leveraging a mesh-based representation of the sketch, and using an approximate rigidity loss.

Input

Output
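One simple way to think about an approximate rigidity term is to penalize changes in the distances between points of the sketch. The sketch below does this over all pairs of control points. This is a deliberately simplified stand-in: it is not mesh-based, it is not the authors' method, and the function name and sample points are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def pairwise_distance_penalty(initial: List[Point], deformed: List[Point]) -> float:
    """Mean squared change in pairwise distances between corresponding points."""
    total, count = 0.0, 0
    for i, j in combinations(range(len(initial)), 2):
        d0 = math.dist(initial[i], initial[j])
        d1 = math.dist(deformed[i], deformed[j])
        total += (d1 - d0) ** 2
        count += 1
    return total / count if count else 0.0


# Illustrative points: a pure translation versus a horizontal stretch
shape = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
translated = [(x + 2.0, y + 3.0) for x, y in shape]
stretched = [(2.0 * x, y) for x, y in shape]
```

A penalty of this kind is zero for translations and rotations but grows when the aspect ratio changes, which is the failure described for the squirrel. Applied to every pair of points it would also discourage intended articulation, so a practical version would likely restrict it to nearby points or mesh edges.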

Video Model Prior

Our approach inherits the general nature of text-to-video priors, but it also suffers from their limitations. Such models are trained on large-scale data, but they may be unaware of specific motions, exhibit strong biases due to their data, or produce severe artifacts. Here, for example, we show the video produced by our text-to-video backbone model for the text "The ballerina is dancing". As can be seen, the video is of very low quality and contains artifacts, such as in the ballerina's face and hands. However, our method is agnostic to the backbone model and hence could likely use newer models as they become available.

BibTeX

@InProceedings{Gal_2024_CVPR,
            author    = {Gal, Rinon and Vinker, Yael and Alaluf, Yuval and Bermano, Amit and Cohen-Or, Daniel and Shamir, Ariel and Chechik, Gal},
            title     = {Breathing Life Into Sketches Using Text-to-Video Priors},
            booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            month     = {June},
            year      = {2024},
            pages     = {4325-4336}
        }

Gallery

A dolphin swimming and leaping [..]

The hypnotized cobra snake swa [..]

The cat is playing. [..]

The crab scuttled sideways alo [..]

The flower is moving and growing [..]

The penguin is shuffling along [..]

A surfer riding and maneuvering [..]

The runner runs with rhythmic [..]

A hummingbird hovers in mid-ai [..]

The lizard moves with a sinuou [..]

The squirrel uses its dexterou [..]

A hummingbird hovers in mid-ai [..]

A butterfly fluttering its win [..]

A dolphin swimming and leaping [..]

A gazelle galloping and jumpin [..]

The squirrel uses its dexterou [..]

The cat is playing. [..]

The eagle soars majestically, [..]

A parachute descending slowly [..]

The flower is moving and growi [..]

A butterfly fluttering its win [..]

A gazelle galloping and jumpin [..]

The spaceship accelerates rapi [..]

The cat is playing. [..]

The ballerina is dancing. [..]

A ceiling fan rotating blades [..]

A clock hands ticking and rota [..]

The two dancers are passionate [..]

The goldenfish is gracefully m [..]

The jazz saxophonist performs [..]

A galloping horse. [..]

The wine in the wine glass swa [..]

The squirrel uses its dexterou [..]

A basketball player dribbling [..]

The biker is pedaling, each le [..]

A waving flag fluttering and r [..]

The man sailing the boat, his [..]

The hypnotized cobra snake swa [..]

The ballerina is dancing. [..]

The airplane moves swiftly and [..]

A camel is walking

"[...] flowed through her yoga routine [...]"
