Breathing Life Into Sketches Using
Text-to-Video Priors


1Tel Aviv University, 2NVIDIA, 3Reichman University
*Indicates Equal Contribution. Order determined by coin flip (:
CVPR 2024 (Highlight)

We recommend watching the video with sound on

Abstract

A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.

How does it work?


Representation

We represent a sketch as a set of strokes placed over a white background. Each stroke is a two-dimensional cubic Bézier curve (blue) with four control points (red). Our method predicts an offset for each point (green) for every frame. These offsets deform the sketch to create the appearance of motion.
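For illustration only, the short Python sketch below shows how per-frame control-point offsets deform a set of cubic Bézier strokes. The offsets here are random placeholders standing in for the values our network predicts, and the stroke and frame counts are arbitrary.

import torch

def cubic_bezier(ctrl_pts: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Evaluate a cubic Bézier curve given its 4 control points (shape (4, 2))
    at parameters t in [0, 1] (shape (T,)). Returns points of shape (T, 2)."""
    p0, p1, p2, p3 = ctrl_pts
    t = t[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

num_frames, num_strokes = 24, 16
strokes = torch.rand(num_strokes, 4, 2)                      # initial control points in [0, 1]^2
offsets = 0.01 * torch.randn(num_frames, num_strokes, 4, 2)  # placeholder for predicted offsets
deformed = strokes[None] + offsets                           # control points for every frame

# Sample each deformed stroke densely, e.g. as input to a rasterizer.
t = torch.linspace(0, 1, 32)
frame0_curves = [cubic_bezier(s, t) for s in deformed[0]]    # list of (32, 2) polylines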

Network

To predict the offsets we train a "neural displacement field", a small MLP that maps the initial sketch coordinates to their per-frame offsets. The network has two paths: a local path (green), which predicts an unconstrained offset for each point, and a global path (blue), which predicts the parameters of a global affine transformation matrix for each frame. This allows the model to focus on small local changes (e.g., bending an arm) while simultaneously creating large global movements or synchronized effects, such as shrinking an object as it moves away from the camera.
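A minimal PyTorch sketch of such a two-path network is shown below. It is an illustration only: the layer sizes, the frame conditioning, and the way the affine parameters are composed are assumptions rather than the exact architecture used in the paper.

import torch
import torch.nn as nn

class DisplacementField(nn.Module):
    """Illustrative neural displacement field with a shared backbone, a local head
    (per-point offsets) and a global head (per-frame affine parameters)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Input: (x, y) of a control point plus a normalized frame index.
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.local_head = nn.Linear(hidden, 2)   # unconstrained (dx, dy) per point
        self.global_head = nn.Linear(hidden, 6)  # affine matrix + translation per frame

    def forward(self, points: torch.Tensor, frame_t: torch.Tensor) -> torch.Tensor:
        # points: (P, 2) initial control points; frame_t: scalar tensor in [0, 1].
        x = torch.cat([points, frame_t.reshape(1, 1).expand(points.shape[0], 1)], dim=-1)
        h = self.backbone(x)
        local_offset = self.local_head(h)                 # (P, 2), small local deformations
        a, b, c, d, tx, ty = self.global_head(h.mean(0))  # one affine transform per frame
        M = torch.stack([torch.stack([1 + a, b]), torch.stack([c, 1 + d])])
        return (points + local_offset) @ M.T + torch.stack([tx, ty])

# Example: predict the control points of frame 5 out of 24 for a 16-stroke sketch.
field = DisplacementField()
points = torch.rand(16 * 4, 2)
frame_points = field(points, torch.tensor(5 / 23))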

Training

To train this network, we leverage the motion prior encapsulated in a pre-trained text-to-video model. We begin by predicting the offset for each control point, deforming the sketch accordingly to create all video frames, and rendering them using a differentiable rasterizer. We then use the standard score-distillation (SDS) loss to extract a training signal from the pre-trained video diffusion model.
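The snippet below sketches a single SDS step under the assumption of a diffusers-style video UNet and scheduler (a noise prediction exposed via .sample, and scheduler.add_noise for the forward process). The latents are assumed to come from encoding the rasterized frames, and text_emb is assumed to stack the unconditional and conditional prompt embeddings; this is not the paper's implementation.

import torch
import torch.nn.functional as F

def sds_loss(latents, text_emb, unet, scheduler, guidance_scale=30.0):
    """Score-distillation loss on video latents of shape (1, C, F, H, W).
    text_emb holds [unconditional, conditional] prompt embeddings (batch of 2)."""
    # Sample a diffusion timestep and noise the rendered latents accordingly.
    t = torch.randint(50, 950, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Query the frozen video model for its noise prediction, with classifier-free guidance.
    with torch.no_grad():
        eps_uncond, eps_cond = unet(noisy.repeat(2, 1, 1, 1, 1), t,
                                    encoder_hidden_states=text_emb).sample.chunk(2)
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # SDS gradient w(t) * (eps - noise), injected through a detached target so that
    # backpropagation pushes the control points toward the model's score direction.
    w = 1.0 - scheduler.alphas_cumprod.to(latents.device)[t]
    grad = w.reshape(-1, 1, 1, 1, 1) * (eps - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")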

Comparisons to Prior Work


We compare our method to five baselines: four image-to-video diffusion models (ZeroScope, ModelScope, VideoCrafter1, and Gen-2 by Runway) and one method tailored for animating children's drawings of the human figure. The image-conditioned diffusion models fail to maintain the unique characteristics of the sketch and suffer from multiple visual artifacts. The animated-drawings method fares better, but it is specifically designed for a humanoid skeleton and a fixed set of animations. Hence, it cannot create unseen movements (e.g., a ballerina's dance) or handle non-human figures such as a fish.
Below we provide the videos used in our quantitative experiment.

Varying the Prompts


Our method is based on signals from pre-trained text-to-video models. As such, it offers a degree of control over the generated results by simply modifying the prompts that describe the movement. These changes are limited to small motions that the model can create, and that align with the semantics of the initial sketch. Hence, we can fill a glass with wine or ask a boxer to jump, but we may struggle to make the same boxer perform a back-flip.

Failure Case Analysis


Text Prompt Effect


We examine how the specified prompt affects the animation. We first verify that the text itself influences the results in a meaningful way. To do so, we apply our method to several example sketches using two alternatives: a "generic prompt" ("the object is moving") and the empty prompt (""). We further examine the impact of modifying the prompt in a way that would motivate the text-to-video model to create a sketch. Specifically, we either prepend the string "A sketch of" or append the string "Abstract sketch. Line drawing" to the prompts.
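Concretely, these variants amount to simple string edits on the prompt. In the illustrative snippet below, the base prompt is only an example and the punctuation used when concatenating is an assumption.

prompt = "The ballerina is dancing"                    # example base prompt

generic_prompt = "the object is moving"                # "generic prompt" variant
empty_prompt = ""                                      # empty-prompt variant
prefixed = f"A sketch of {prompt}"                     # prepend a sketch hint
suffixed = f"{prompt}. Abstract sketch. Line drawing"  # append a sketch hint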

Abstraction Levels




We demonstrate the performance of our method on sketches with varying levels of abstraction. We employed CLIPasso to generate three sketches for each subject, covering three abstraction levels. These abstractions are achieved by using different numbers of strokes, specifically 16, 8, and 4 strokes. Note that the motion remains apparent even under very abstract settings.

Sketch Representation


We illustrate the impact of changing the sketch representation. We applied our method to sketches from the TU-Berlin sketch dataset, a human-drawn, class-based sketch dataset, and showcase the results on four representative sketches. Our method was applied directly to the provided SVG files. As can be seen, our method successfully animates the sketches; however, their appearance is not fully preserved when using the default hyperparameters. This can be improved by using lower learning rates.

Trade-off


We demonstrate the trade-off between the quality of the generated motion and the capacity to retain the appearance of the initial sketch. We show the impact of scaling the local learning rate, keeping all other parameters constant. Observe that as we move from the left (0.0001) to the right (0.1), the motion in the animations increases, better aligning with the text prompt, but at the cost of preserving the original sketch's appearance. This trade-off introduces additional control for the user, who may prioritize stronger motion over sketch fidelity.
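In practice, this control amounts to assigning the local path its own learning rate. The snippet below illustrates this with optimizer parameter groups, reusing the hypothetical DisplacementField sketch from above; the specific values are placeholders.

import torch

field = DisplacementField()  # illustrative network defined earlier
lr_local = 0.005             # larger -> stronger motion, weaker appearance preservation
lr_global = 0.0001           # placeholder value for the remaining parameters

optimizer = torch.optim.Adam([
    {"params": field.local_head.parameters(), "lr": lr_local},
    {"params": field.global_head.parameters(), "lr": lr_global},
    {"params": field.backbone.parameters(), "lr": lr_global},
])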

Hyperparameter Effects


As described in the main paper, there is an inherent trade-off between the components of our method. Here, we demonstrate how this trade-off can be utilized to provide further user control over the appearance of the output video by adjusting the method's parameters. Naturally, we observe different effects across sketches, which may be attributed to the video model's prior or to the quality of the initial sketch. In the third column ("+lr local"), we showcase the impact of increasing the learning rate of the local MLP. In some cases (biking and butterfly), this results in stronger motion without compromising the sketch's appearance; in other cases (cobra and boat), it harms the fidelity of the sketch, leading to a complete alteration of the original drawing. In the fourth column ("+translation"), we increased the translation prediction weight. As observed, this indeed causes the objects to move more across the frame compared to the baseline.

Comparing Video Models


In the main paper we utilized the publicly available ModelScope pre-trained video model. Here, we examine how well our approach generalizes to other pre-trained backbones. In particular, we look at a set of ZeroScope models, tuned across a range of resolutions and frame rates (see https://huggingface.co/cerspense for more details). As observed, our method successfully generalizes to these models with no additional changes. Note that different models do lead to different motion patterns, and some of them may result in different trade-offs between the level of motion and the ability to preserve the sketch. For example, zeroscope-v1-320s (third column) results in slower motions, while zeroscope-v2-576w (sixth column) produces more "jumpy" videos.
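Swapping backbones is straightforward when the model exposes a diffusers-style pipeline. The snippet below is a hedged example: the identifiers are Hugging Face hub repositories, and the component names may differ slightly between pipelines.

import torch
from diffusers import DiffusionPipeline

# Example hub ids: a ZeroScope variant, or the ModelScope backbone used in the main paper.
model_id = "cerspense/zeroscope_v2_576w"  # or "damo-vilab/text-to-video-ms-1.7b"
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# An SDS step like the one sketched above only needs the denoising UNet, the scheduler,
# and the text encoder with its tokenizer.
unet, scheduler = pipe.unet, pipe.scheduler
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder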

Ablation


We evaluate the main components of our method. Disabling the local path severely restricts the model's ability to capture natural motion, leading to wobbling and sliding effects rather than abstract motion that fits the sketch. Disabling the global path, or replacing the neural network with direct optimization, leads to results that largely align with the prompts but contain a significant amount of temporal jitter and larger deviations from the input sketch.

Limitations


Sketch Representation

There exist many ways to represent sketches in vector format, including different types of curves or primitive shapes (such as lines or polygons), different attributes for the shapes (such as stroke width, closed shapes, and more), and different numbers of parameters. Our selection of hyperparameters and network design is based on one specific sketch representation. Below is an example of a sketch of a surfer, defined by a sequence of closed segments of cubic Bézier curves, which contains a relatively high number of control points. As can be seen, the resulting translation of the sketch is significantly larger than in our typical results. In addition, the surfer's appearance is not well preserved, as its scale changes significantly.

Two Objects

Our method assumes that the input sketch depicts a single subject (a common scenario in character animation techniques). When applied directly to sketches involving multiple objects, we observe a degradation in result quality due to this inherent design constraint. Here, for example, we expect the basketball to separate from the player's hand to achieve a natural dribbling motion. However, with our current settings such a separation is impossible, since the translation parameters are relative to the object as a whole, of which the basketball is a part. This limitation could likely be addressed with further technical development.

Scene Sketches

In a similar manner, we observe a degradation in result quality when our method is applied directly to scene sketches. As can be seen in this example, the entire scene moves unnaturally because of the single-object assumption.

Shape Preservation

While the trade-off between the quality of the motion and the sketch's fidelity can be controlled by altering the hyperparameters, we still observe that the sketch's identity is sometimes harmed. Here, for example, the squirrel's motion is good, but the aspect ratio of the original squirrel has changed. It may be possible to improve on this front by leveraging a mesh-based representation of the sketch and using an approximate rigidity loss.
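As a rough illustration of this idea (not something implemented in our method), an approximate rigidity term could penalize changes in the edge lengths of a mesh built over the sketch:

import torch

def approximate_rigidity_loss(points_t: torch.Tensor, points_0: torch.Tensor,
                              edges: torch.Tensor) -> torch.Tensor:
    """Penalize edge-length changes between the deformed mesh vertices (points_t)
    and the original ones (points_0); edges is an (E, 2) tensor of vertex indices."""
    len_t = (points_t[edges[:, 0]] - points_t[edges[:, 1]]).norm(dim=-1)
    len_0 = (points_0[edges[:, 0]] - points_0[edges[:, 1]]).norm(dim=-1)
    return ((len_t - len_0) ** 2).mean()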

Video Model Prior

Our approach inherits the general nature of text-to-video priors, but it also suffers from their limitations. Such models are trained on large-scale data, but they may be unaware of specific motions, exhibit strong biases stemming from their training data, or produce severe artifacts. Here, for example, we show the video produced by our text-to-video backbone model for the prompt "The ballerina is dancing". As can be seen, the video is of very low quality and contains artifacts, such as in the ballerina's face and hands. However, our method is agnostic to the backbone model and hence could likely use newer models as they become available.

BibTeX

@InProceedings{Gal_2024_CVPR,
            author    = {Gal, Rinon and Vinker, Yael and Alaluf, Yuval and Bermano, Amit and Cohen-Or, Daniel and Shamir, Ariel and Chechik, Gal},
            title     = {Breathing Life Into Sketches Using Text-to-Video Priors},
            booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            month     = {June},
            year      = {2024},
            pages     = {4325-4336}
        }