• Xuehao Liu

The first image neural style transfer

Updated: Mar 14, 2019

First, let's define the concept. Suppose we have two images: a style image A and a content image B. Style transfer applies the texture of style image A to content image B. The result is a generated image C that has the contours (shapes) of image B and the texture (style) of image A. We can do this job using Convolutional Neural Networks (CNNs).


The features in different levels of layers

Before we go into the details of the method, we need to talk about the features extracted by CNNs. We can do classification and object detection with CNNs without designing a feature extractor by hand, because the CNN learns the feature extraction for us. People are curious about what exactly the features extracted by a CNN are. Neural style transfer is a byproduct of this work on exploring (or explaining) CNN features.

Yosinski, Jason, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. "Understanding neural networks through deep visualization." arXiv preprint arXiv:1506.06579 (2015).

This visualization work shows what the features in CNNs look like. In the lower layers (1 or 2), the features are basic: lines or corners, which make up texture. As the layers get higher, the features become more complicated. At the topmost layers, the features can be semantic, and the active areas take the shape of objects.

So the features extracted by a higher layer may contain the shapes in the image, while the lower layers' features hold the texture information. How could we combine those two kinds of features?

Combining a contour with a texture

The idea of style transfer is to find a middle ground between the style and the content. This middle ground is itself an image. If we can find an image that contains both the contour features and the texture features, that image is the style transfer of style image A onto content image B.

The work of Gatys et al. provides a method for doing this. This is the result:

I bet you all have already seen this.

This is how they do it:

We start with a pre-trained CNN, such as VGG-19, which already extracts useful features. It was trained as a classifier, so it can predict a label for an input image, no matter whether that is a style image or a content image. If we tap some of the layers of VGG-19 and look at their outputs, we get the content features when the input is the content image, and the style features when the input is the style image. (Note: "style features" and "content features" are both just outputs of VGG-19 layers. They are the same kind of thing; the names are only convenient in this context.) If we feed a third "result" image into VGG-19, it gives us the result features. We get the content loss by comparing the result features with the content features, and the style loss by comparing them with the style features. If we keep the weights of VGG-19 fixed and run gradient descent on the pixels of the result image, minimizing the content and style losses, the result image becomes the middle ground we are looking for. That image is the style transfer, a mixture of the two images.

Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. "Perceptual losses for real-time style transfer and super-resolution." In European conference on computer vision, pp. 694-711. Springer, Cham, 2016.

So which features do we use? As mentioned before, the higher layers carry the contour features, so they suit the content loss. We calculate the Euclidean distance between the higher-layer features of the result image and those of the content image to get the content loss. The style loss is based on the lower layers, as they extract texture information. Unfortunately, we cannot calculate the Euclidean distance directly between lower-layer features, because that would also compare where things are in the image and not just what the texture looks like. In the work of Gatys et al., they instead calculate the Euclidean distance between the Gram matrices of the layer outputs. The Gram matrix measures the correlations between the feature channels of a layer, summed over all pixel positions, and it can be seen as a representation of texture. That is why we compare Gram matrices. They also calculate the style loss on multiple layers at different levels. Since we cannot compare the style features directly, we need these correlations to match at every chosen level.
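To make the Gram matrix idea concrete, here is a minimal NumPy sketch of the two losses. It is only an illustration, not the authors' code: the function names (gram_matrix, content_loss, style_loss) are made up for this post, the feature maps are random arrays standing in for real VGG-19 layer outputs, and the scaling constants are one common choice that I am assuming here. The small demo at the end shuffles the pixel positions of a feature map. The Gram matrix does not change, but the content loss does, which is why the Gram matrix describes texture and not shape.

```python
import numpy as np


def gram_matrix(feature_map):
    """Channel-by-channel correlations of one layer output.

    feature_map has shape (channels, height, width); the result has
    shape (channels, channels). Spatial positions are summed away.
    """
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w)
    return flat @ flat.T


def content_loss(result_feat, content_feat):
    """Squared Euclidean distance between two feature maps (assumed 1/2 scaling)."""
    return 0.5 * np.sum((result_feat - content_feat) ** 2)


def style_loss(result_feats, style_feats, layer_weights=None):
    """Weighted sum over layers of squared distances between Gram matrices."""
    if layer_weights is None:
        layer_weights = [1.0 / len(result_feats)] * len(result_feats)
    total = 0.0
    for w_l, r, s in zip(layer_weights, result_feats, style_feats):
        c, h, w = r.shape
        n_pixels = h * w
        diff = gram_matrix(r) - gram_matrix(s)
        # Assumed normalisation: divide by 4 * channels^2 * pixels^2.
        total += w_l * np.sum(diff ** 2) / (4.0 * c ** 2 * n_pixels ** 2)
    return total


# Demo on a fake "layer output": 4 channels on a 6x6 grid (hypothetical data).
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 6, 6))

# Shuffle the pixel positions, keeping each pixel's channel vector intact.
perm = rng.permutation(6 * 6)
shuffled = feat.reshape(4, -1)[:, perm].reshape(4, 6, 6)

same_texture = np.allclose(gram_matrix(feat), gram_matrix(shuffled))
print("Gram matrix unchanged by shuffling:", same_texture)
print("content loss after shuffling:", content_loss(feat, shuffled))
print("style loss after shuffling:", style_loss([feat], [shuffled]))
```

The layout of the image is destroyed by the shuffle, so the content loss is large, while the style loss stays at (numerically) zero.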

Before the first iteration, the result image is initialized as Gaussian noise. After gradient descent, it becomes the image we are looking for.
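The whole procedure can be sketched as a toy in NumPy. Everything here is an assumption made for illustration: two fixed random matrices (W_content, W_style) stand in for the frozen VGG-19 layers, images are small (channels, pixels) arrays, the weights alpha and beta are arbitrary, and the step-halving rule is just a simple way to keep plain gradient descent stable. A real implementation would use an autodiff framework and actual VGG-19 features. What the sketch does show is the key point: the network weights never change, only the pixels of the result image do.

```python
import numpy as np


def gram(features):
    """Channel correlations averaged over pixels; features is (channels, pixels)."""
    return features @ features.T / features.shape[1]


def loss_and_grad(x, w_content, w_style, target_content, target_gram, alpha, beta):
    """Total loss and its gradient with respect to the image x.

    The "layers" are fixed linear maps (a stand-in for a frozen CNN),
    so the gradient can be written by hand.
    """
    f_c = w_content @ x                      # "higher layer" features
    f_s = w_style @ x                        # "lower layer" features
    d_c = f_c - target_content
    d_g = gram(f_s) - target_gram            # symmetric matrix
    loss = alpha * np.sum(d_c ** 2) + beta * np.sum(d_g ** 2)
    grad = w_content.T @ (2.0 * alpha * d_c)
    grad += w_style.T @ ((4.0 * beta / f_s.shape[1]) * (d_g @ f_s))
    return loss, grad


def stylize(content, style, w_content, w_style,
            alpha=1.0, beta=10.0, steps=200, lr=0.1, seed=0):
    """Gradient descent on the image only; the weight matrices stay fixed."""
    target_content = w_content @ content
    target_gram = gram(w_style @ style)

    # The result image starts as Gaussian noise.
    x = np.random.default_rng(seed).normal(size=content.shape)
    loss, grad = loss_and_grad(x, w_content, w_style,
                               target_content, target_gram, alpha, beta)
    history = [loss]
    for _ in range(steps):
        candidate = x - lr * grad
        new_loss, new_grad = loss_and_grad(candidate, w_content, w_style,
                                           target_content, target_gram, alpha, beta)
        if new_loss < loss:
            x, loss, grad = candidate, new_loss, new_grad
            history.append(loss)
        else:
            lr *= 0.5                        # step was too large, try a smaller one
    return x, history


# Hypothetical toy data: 3-channel "images" with 64 pixels each.
rng = np.random.default_rng(1)
W_content = rng.normal(size=(6, 3))
W_style = rng.normal(size=(8, 3))
content = rng.uniform(size=(3, 64))
style = rng.uniform(size=(3, 64))

result, history = stylize(content, style, W_content, W_style)
print("initial loss:", history[0])
print("final loss:  ", history[-1])
```

The ratio of alpha to beta controls where the middle ground lands: a larger alpha keeps the result closer to the content image, a larger beta pushes it toward the texture of the style image.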
