A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.
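The two-component motion model described above can be illustrated with a minimal sketch. It assumes that each frame's control points are obtained by first adding a small per-point offset (the local component) and then applying a single affine transform shared by all points (the global component). The function names, the parameter set (scale, shear, rotation, translation), the order of composition, and the choice to pivot around the sketch centroid are all assumptions made for illustration, not the authors' implementation.

```python
import math

def affine_matrix(sx, sy, shear, theta):
    """Return a 2x2 matrix equal to rotation(theta) @ shear @ scale.

    The composition order is an assumption for this illustration.
    """
    c, s = math.cos(theta), math.sin(theta)
    # shear @ scale = [[sx, shear * sy], [0, sy]]
    a, b = sx, shear * sy
    d, e = 0.0, sy
    return [[c * a - s * d, c * b - s * e],
            [s * a + c * d, s * b + c * e]]

def animate_frame(points, local_offsets, sx=1.0, sy=1.0, shear=0.0,
                  theta=0.0, tx=0.0, ty=0.0):
    """Compute one frame of control points (hypothetical helper).

    points        -- list of (x, y) control points of the input sketch
    local_offsets -- list of (dx, dy), one small displacement per point
    remaining     -- global affine parameters shared by every point
    """
    n = len(points)
    # Pivot the global transform around the sketch centroid (assumption).
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    m = affine_matrix(sx, sy, shear, theta)
    frame = []
    for (x, y), (dx, dy) in zip(points, local_offsets):
        # Local component: small per-point deformation.
        lx, ly = x + dx - cx, y + dy - cy
        # Global component: one affine transform for the whole sketch.
        gx = m[0][0] * lx + m[0][1] * ly + cx + tx
        gy = m[1][0] * lx + m[1][1] * ly + cy + ty
        frame.append((gx, gy))
    return frame
```

In a learned setting, the offsets and the affine parameters would be predicted per frame and optimized against the video prior; here they are plain arguments, so the separation between the two components is easy to inspect.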
There are many ways to represent sketches in vector format: different types of curves or primitive shapes (such as lines or polygons), different shape attributes (such as stroke width or closed shapes), and different numbers of parameters. Our selection of hyperparameters and network design is based on one specific sketch representation. Below is an example of a sketch of a surfer, defined by a sequence of closed segments of cubic Bézier curves with a relatively high number of control points. As can be seen, the translation of the resulting sketch is significantly larger than in our typical results. In addition, the surfer's appearance is not well preserved, as its scale changes significantly.
Our method assumes that the input sketch depicts a single subject (a common scenario in character animation techniques). When applied directly to sketches involving multiple objects, we observe a degradation in result quality due to this design constraint. Here, for example, we expect the basketball to separate from the player's hand to achieve a natural dribbling motion. However, with our current settings it is impossible to achieve such separation, since the translation parameters are relative to the object, which the basketball is part of. This limitation can be solved with further technical developments.
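The reasoning behind this limitation can be shown with a small, self-contained sketch. When one translation is shared by the whole sketch, the gap between two parts (here a hand and a ball) cannot change. Giving the ball its own translation, which is a hypothetical extension and not part of the method described here, is what would allow the gap to change. The coordinates and helper names below are invented for illustration.

```python
import math

def translate(points, tx, ty):
    """Shift every point by the same (tx, ty)."""
    return [(x + tx, y + ty) for x, y in points]

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def gap(part_a, part_b):
    """Distance between the centroids of two groups of control points."""
    (ax, ay), (bx, by) = centroid(part_a), centroid(part_b)
    return math.hypot(bx - ax, by - ay)

# Toy control points (made up for this example).
hand = [(0.0, 0.0), (1.0, 0.0)]
ball = [(0.5, -1.0), (1.5, -1.0)]
rest_gap = gap(hand, ball)

# Single-subject setting: one translation moves hand and ball together.
shared_gap = gap(translate(hand, 0.0, -2.0), translate(ball, 0.0, -2.0))

# Hypothetical per-object setting: only the ball is translated.
separate_gap = gap(hand, translate(ball, 0.0, -2.0))
```

Under the shared translation, `shared_gap` equals `rest_gap`, so in this toy setup any separation would have to come from the small local deformations alone.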
Similarly, we observe a degradation in result quality when our method is applied directly to scene sketches. As can be seen in this example, the entire scene moves unnaturally because of the single-object assumption.
While the trade-off between motion quality and the sketch's fidelity can be controlled by altering the hyperparameters, we still observe that the sketch's identity is sometimes harmed. Here, for example, the squirrel's motion is good, but the aspect ratio of the original squirrel has changed. It may be possible to improve on this front by leveraging a mesh-based representation of the sketch and using an approximate rigidity loss.
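To make the idea of an approximate rigidity loss concrete, the following is a simplified stand-in rather than the mesh-based formulation suggested above. It penalizes changes in pairwise distances between control points, so rotations and translations cost nothing while stretching (such as an aspect-ratio change) is penalized. The function name and the choice of pairwise distances are assumptions for illustration only.

```python
import math
from itertools import combinations

def rigidity_penalty(original, deformed):
    """Mean squared change in pairwise point distances (hypothetical loss).

    original, deformed -- lists of (x, y) control points in the same order.
    Returns 0.0 for any rigid motion of the original points.
    """
    pairs = list(combinations(range(len(original)), 2))
    total = 0.0
    for i, j in pairs:
        d0 = math.dist(original[i], original[j])
        d1 = math.dist(deformed[i], deformed[j])
        total += (d1 - d0) ** 2
    return total / len(pairs)

# Toy unit square, a rigidly rotated copy, and a horizontally stretched copy.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rotated = [(0.0, 0.0), (0.0, 1.0), (-1.0, 1.0), (-1.0, 0.0)]
stretched = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]

rotated_cost = rigidity_penalty(square, rotated)
stretched_cost = rigidity_penalty(square, stretched)
```

A term like this would be weighted against the motion objective, which mirrors the trade-off between motion quality and sketch fidelity discussed above.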
Our approach inherits the general nature of the text-to-video priors, but it also suffers from their limitations. Such models are trained on large-scale data, but they may be unaware of specific motions, exhibit strong biases due to their data, or produce severe artefacts. Here, for example, we show the video produced by our text-to-video backbone model for the text "The ballerina is dancing". As can be seen, the video is of very low quality and contains artefacts, such as in the ballerina's face and hands. However, our method is agnostic to the backbone model and hence could likely use newer models as they become available.
@InProceedings{Gal_2024_CVPR,
author = {Gal, Rinon and Vinker, Yael and Alaluf, Yuval and Bermano, Amit and Cohen-Or, Daniel and Shamir, Ariel and Chechik, Gal},
title = {Breathing Life Into Sketches Using Text-to-Video Priors},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {4325-4336}
}