Last month I watched Loving Vincent. The movie tells the story of Van Gogh after Saint-Rémy, in his final year in Auvers-sur-Oise. It reconstructs the lonely artist's breakdown through the perspectives of the people close to him, or at least those who appear in his paintings. Beyond the narrative, the movie is also a reconstruction in material: it is composed entirely of animations of reimagined Van Gogh paintings. The production recreated 94 of Van Gogh's originals while painting roughly 67,000 oil paintings around them as animation keyframes. Needless to say, it was a huge collaboration of artists and dedicated hours. You can find the behind-the-scenes details here.
This blog is about my experiments with Van Gogh's paintings and the Neural Style Transfer (NST) algorithm. Most of my previous attempts with NST had produced mediocre results. The problem is that without a proper metric for judging styles that are intrinsically aesthetic, or without benchmark images to compare against, I didn't know how or what to improve in the algorithm. Loving Vincent provided exactly that: a test bench. In this blog we will use NST to recreate some keyframes of the film, compare the results with the movie's recreations of Van Gogh paintings, and maybe ponder whether the NST technique could have aided the production of the movie and saved countless hours. It is said that the pen is mightier than the sword; at times, the brush is mightier than the pen. This blog explores what happens if we bring a machine gun to a brush fight.
This section is my attempt to simplify some concepts that I found confusing about NST. I will assume you know what a neural network is and the lingo around it. NST uses a Convolutional Neural Network to combine the style of one image with the content of another. Most style transfer mechanisms before neural networks fared poorly, since the problem is highly non-linear in nature (I am planning to dedicate another blog to the non-linearity and the mathematics of art). The original NST paper does a great job of explaining the architecture, and a number of tutorials are available to help one build it. I played with a couple of them but found the Keras implementation [here] to be simple and effective to begin with. God is always in the details, but at its core NST is a simple a + b = c algorithm: we take a pre-trained CNN architecture, VGG-19, and combine losses from a Content Image (a) and a Style Image (b) that add up to the final result (c).
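To make the a + b = c skeleton concrete before diving into the pieces, here is a toy numpy sketch of how the two losses combine. The random arrays merely stand in for VGG-19 feature maps, and the weights here are illustrative, not the tuned values used later:

```python
import numpy as np

rng = np.random.default_rng(0)
content_feat = rng.standard_normal((64, 32, 32))  # stand-in for P^l (content)
style_feat   = rng.standard_normal((64, 32, 32))  # stand-in for style features
combo_feat   = rng.standard_normal((64, 32, 32))  # stand-in for F^l (generated)

def content_loss(P, F):
    # squared distance between content and generated features
    return 0.5 * np.mean((F - P) ** 2)

def gram(F):
    flat = F.reshape(F.shape[0], -1)   # channels x pixels
    return flat @ flat.T               # channel-by-channel inner products

def style_loss(S, F):
    n_l, m_l = F.shape[0], F.shape[1] * F.shape[2]
    return np.mean((gram(F) - gram(S)) ** 2) / (4.0 * n_l ** 2 * m_l ** 2)

content_weight, style_weight = 1.0, 10.0   # illustrative weights only
total = (content_weight * content_loss(content_feat, combo_feat)   # a
         + style_weight * style_loss(style_feat, combo_feat))      # + b = c
```

In the real algorithm the feature maps come from VGG-19 activations and the total loss is minimized with respect to the generated image's pixels, as detailed below.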
a) Content Loss
Let the input image be denoted by the vector p. After a feed-forward step, an output image x is generated. The pre-trained VGG-19 filters each image, encoding it layer by layer into feature representations; let F^l denote the encoding of x and P^l the encoding of the input image p at a particular convolution layer l. The content loss is the loss measure that minimizes the squared distance between P^l and F^l:

$$\mathcal{L}_{content}(p, x, l) = \frac{1}{2}\sum_{i,j}\left(F^{l}_{ij} - P^{l}_{ij}\right)^{2}$$
where F^l_{ij} is the activation of the i-th filter at position j in layer l. Below is the chunk of the Keras code specific to the content loss. However, instead of the sum I used the mean, which gave the best results. The code uses 'block5_conv2' (layer 15 in VGG-19) instead of 'block4_conv2' as the paper recommended.
model = vgg19.VGG19(input_tensor=input_tensor,
                    weights='imagenet', include_top=False)

# get the symbolic outputs of each "key" layer (we gave them unique names)
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])

def content_loss(base, combination):
    # return K.sum(K.square(combination - base))
    return 0.5 * K.mean(K.square(combination - base))

# combine these loss functions into a single scalar
loss = K.variable(0.0)
layer_features = outputs_dict['block5_conv2']
base_image_features = layer_features[0, :, :, :]
combination_features = layer_features[2, :, :, :]
loss += content_weight * content_loss(base_image_features,
                                      combination_features)
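A quick numpy sanity check of the sum-versus-mean switch: the mean only rescales the loss by the number of elements, so it behaves like a change of content_weight rather than a different objective (the random arrays stand in for block5_conv2 features):

```python
import numpy as np

rng = np.random.default_rng(1)
base  = rng.standard_normal((512, 14, 14))   # stand-in for block5_conv2 features
combo = rng.standard_normal((512, 14, 14))

loss_sum  = 0.5 * np.sum((combo - base) ** 2)
loss_mean = 0.5 * np.mean((combo - base) ** 2)

# mean = sum / N: the two losses differ only by a constant factor,
# so using the mean effectively rescales the loss weight by 1/N
assert np.isclose(loss_sum / base.size, loss_mean)
```

The practical benefit is that the mean keeps the loss magnitudes in a comparable range across layers of different sizes, which makes the weights easier to tune.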
Below are some content-only loss outputs. The first 10 are the outputs for 10 iterations of block5_conv2; the next three are the 10th iterations for block4_conv2, block3_conv3 and block2_conv2; the final image is the content image used. Note that earlier convolutions are more easily reconstructed by the network.
b) Style Loss
The style loss is generated with the Gram matrix G^l, whose entry (i, j) is the inner product between the feature representations F^l_i and F^l_j of features i and j in layer l:

$$G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}$$
Let w_l be the weight of each style layer, A^l the Gram matrix of the style image, and G^l the Gram matrix of the generated image at layer l. The contribution of layer l to the style loss is E_l, the squared distance between G^l and A^l, normalized by N_l, the total number of feature maps in layer l, and M_l, the total pixel size of a feature map:

$$E_l = \frac{1}{4 N_l^2 M_l^2}\sum_{i,j}\left(G^{l}_{ij} - A^{l}_{ij}\right)^{2}$$

The total style loss is the weighted sum of all E_l:

$$\mathcal{L}_{style} = \sum_{l} w_l E_l$$
Below is the code pertaining to the style loss. The minor modification is to use user-defined weights for each layer instead of equal normalized weights for all layers.
# compute the neural style loss
# first we need to define 4 util functions

# the gram matrix of an image tensor (feature-wise outer product)
def gram_matrix(x):
    assert K.ndim(x) == 3
    if K.image_data_format() == 'channels_first':
        features = K.batch_flatten(x)
    else:
        features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    gram = K.dot(features, K.transpose(features))
    return gram

# the "style loss" is designed to maintain
# the style of the reference image in the generated image.
# It is based on the gram matrices (which capture style) of
# feature maps from the style reference image
# and from the generated image
def style_loss(style, combination):
    assert K.ndim(style) == 3
    assert K.ndim(combination) == 3
    S = gram_matrix(style)
    C = gram_matrix(combination)
    channels = 3
    size = img_nrows * img_ncols
    # return K.sum(K.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))
    return 0.25 * K.mean(K.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))

feature_layers = ['block1_conv1', 'block2_conv1',
                  'block3_conv1', 'block4_conv1',
                  'block5_conv1']
style_layer_weights = [0.2, 0.4, 0.5, 0.1, 0.6]
# style_layer_weights = [0.2, 0.2, 0.3, 0.5, 0.6]  # playing with diff params
# style_layer_weights = [0.2, 0.2, 0.2, 0.2, 0.2]

i_ = 0
for layer_name in feature_layers:
    layer_features = outputs_dict[layer_name]
    style_reference_features = layer_features[1, :, :, :]
    combination_features = layer_features[2, :, :, :]
    # sl = style_loss(style_reference_features, combination_features)
    # loss += (style_weight / len(feature_layers)) * sl
    sl = style_layer_weights[i_] * style_loss(style_reference_features,
                                              combination_features)
    loss += style_weight * sl
    i_ += 1
Below are some style-only loss outputs. The first 10 are the outputs for 10 iterations of block2_conv1; the next three are the 10th iterations for block3_conv1, block4_conv1 and block5_conv1; the final image is the style image used. Note that earlier convolutions show more localized feature representations.
c) Total Loss
In addition to the content and style losses, a total variation loss is added that works as a regularizer, encouraging spatial smoothness. Hence the total loss:

$$\mathcal{L}_{total} = \alpha\,\mathcal{L}_{content} + \beta\,\mathcal{L}_{style} + \gamma\,\mathcal{L}_{tv}$$
which is the a + b = c as promised. Below is the Keras code for the total variation loss and the final combined loss:
def total_variation_loss(x):
    assert K.ndim(x) == 4
    if K.image_data_format() == 'channels_first':
        a = K.square(
            x[:, :, :img_nrows - 1, :img_ncols - 1] - x[:, :, 1:, :img_ncols - 1])
        b = K.square(
            x[:, :, :img_nrows - 1, :img_ncols - 1] - x[:, :, :img_nrows - 1, 1:])
    else:
        a = K.square(
            x[:, :img_nrows - 1, :img_ncols - 1, :] - x[:, 1:, :img_ncols - 1, :])
        b = K.square(
            x[:, :img_nrows - 1, :img_ncols - 1, :] - x[:, :img_nrows - 1, 1:, :])
    # return K.sum(K.pow(a + b, 1.25))
    return K.mean(K.pow(a + b, 1.25))

loss += total_variation_weight * total_variation_loss(combination_image)
The first four images resulted from some non-Keras implementations of NST and different optimization attempts. The fourth used the original Keras version with sums instead of means for all the loss functions, with style_weight/content_weight ≈ 10000 as recommended in the paper. The fifth and sixth images resulted from using means instead of sums in the style and content loss functions respectively. The parameters for the eighth image were fine-tuned to --content_weight 1.0 --style_weight 10 --tv_weight 0.1 with style_layer_weights = [0.2, 0.4, 0.5, 0.1, 0.6], with means substituted for sums in all the loss functions. Almost all the other images worked well in this range. All the results are the 10th iteration of the L-BFGS-B function minimizer.
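For reference, the L-BFGS-B loop follows the pattern of scipy's fmin_l_bfgs_b, which expects a function returning the loss and its gradient on a flat vector. Below is a minimal sketch where a toy quadratic stands in for the NST loss and gradient (in the real Keras example these come from the network):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Toy stand-in for the NST objective: the minimizer wants a function
# returning (loss, gradient) over a flat float64 vector, the same
# interface the Keras example's Evaluator class provides.
target = np.linspace(0.0, 1.0, 64)

def loss_and_grad(x):
    diff = x - target
    return float(np.sum(diff ** 2)), 2.0 * diff

x = np.random.default_rng(2).standard_normal(64)  # "generated image", flattened
for i in range(10):                               # 10 iterations, as in the post
    x, loss, _ = fmin_l_bfgs_b(loss_and_grad, x, maxfun=20)
```

In the actual pipeline, x is the flattened pixel buffer of the combination image and loss_and_grad wraps a backend function evaluating the total loss and its gradient with respect to the pixels.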
The strength of the NST algorithm, at least of the best fine-tuned implementation I found, decreased significantly as the style image departed from the content image, i.e. when the pixel histogram distances were high. As a quick fix, I added some similar imagery to the background and maintained ambient color consistency. Observe the results in each pair below (the left being the content image and the right the NST solution).
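One way to put a number on that departure is a per-channel histogram distance between the content and style images. The helper below is my own illustration (not part of the Keras example), shown here on synthetic "dark" and "bright" images:

```python
import numpy as np

def histogram_distance(img_a, img_b, bins=32):
    """Mean per-channel L1 distance between normalized pixel histograms.
    Inputs are H x W x 3 uint8 arrays; larger values suggest the two
    images are further apart in overall color statistics."""
    dist = 0.0
    for c in range(3):
        ha, _ = np.histogram(img_a[..., c], bins=bins, range=(0, 255), density=True)
        hb, _ = np.histogram(img_b[..., c], bins=bins, range=(0, 255), density=True)
        dist += np.abs(ha - hb).sum()
    return dist / 3.0

rng = np.random.default_rng(3)
dark    = rng.integers(0, 80, size=(64, 64, 3), dtype=np.uint8)     # dim image
light   = rng.integers(150, 255, size=(64, 64, 3), dtype=np.uint8)  # bright image
similar = rng.integers(0, 80, size=(64, 64, 3), dtype=np.uint8)     # dim again

# a dim content image is histogram-close to a dim style, far from a bright one
assert histogram_distance(dark, light) > histogram_distance(dark, similar)
```

A check like this could flag content/style pairs likely to need the background-similarity fix before running the full optimization.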
The Keras version of the algorithm is written such that the style image is resized to the content image. So, to get more granular brush strokes, I used a quick hack of making the content image smaller (there must be a better way). For each triad of images below, the second is the result of the reduced content-image size, while the third is the result of a customized overlay between the first and the second, with the second resized back up to the first.
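The resize-and-overlay step can be sketched with Pillow. The solid-color images below are stand-ins for the two NST results, and alpha=0.5 is an assumed mixing weight, not the exact blend I used:

```python
from PIL import Image

# Synthetic stand-ins: "first" is the full-size NST result, "second" the
# reduced-content-size result with coarser strokes.
first  = Image.new("RGB", (400, 300), (70, 90, 160))
second = Image.new("RGB", (200, 150), (200, 170, 60))

# Resize the second result back up to the first, then blend the two.
second_up = second.resize(first.size, Image.LANCZOS)
overlay = Image.blend(first, second_up, alpha=0.5)  # 0.5 = equal mix
```

In practice the alpha could be varied, or the overlay done with masks, to keep the large-scale structure of the first result and the stroke texture of the second.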
So did Neural Style Transfer do justice to the artist? I think we tried our best. The neural paintings look structurally and aesthetically good after the optimizations, and pretty post-impressionist, like Van Gogh's originals. As for the brush strokes, maybe I need to fine-tune the architecture better; I will leave that to your judgment. If you did not know the source of the style images, would you have guessed the neural paintings were Van Gogh styled? How far are they from Gauguin, Cezanne, Monet, or Degas?
Could NST have helped in creating keyframes for Loving Vincent? At this point, it's a yes for me, but the algorithm would quickly break down when subjects from multiple Van Gogh paintings had to interact in a frame. Not to mention, the movie's soul rests in all the artists' dedicated work.
The issue of multiple painting subjects relates to what we observed in the Experiments section: the strength of the algorithm decreased significantly as the style image departed from the content image. That makes sense, as the algorithm only learns the style of an individual painting, not that of an artist, which would have to be learned from a collection of the artist's work. I will try to tackle this issue of Artist Style Transfer in my next Fine Arts post, where we bring some bad boy GANs into the brush fight. Until then, I hope you keep looking up at the starry nights..
- Gatys, Leon A.; Ecker, Alexander S.; Bethge, Matthias (2016). "Image Style Transfer Using Convolutional Neural Networks". Cv-foundation.org. pp. 2414–2423. Retrieved 13 February 2019.
- Keras Examples Directory, Neural Style Transfer (2018), GitHub repository, https://github.com/keras-team/keras/blob/master/examples/neural_style_transfer.py
- Loving Vincent, official movie site, http://lovingvincent.com