
pix2pix Image Translation

Updated: Mar 14, 2019

Image generation with Generative Adversarial Networks (GANs) has achieved great success in the past few years; BigGAN, for example, can generate very vivid images. This blog covers a more basic GAN, used for image translation: you put an image into the network, and the output is the corresponding image in another category. For example, you can turn a doodle of an object, such as a bag, into an actual photo of that bag.

Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125-1134. 2017.

So, how does it translate the images?


This is the structure of the conditional GAN (cGAN). It is called a conditional GAN because the discriminator judges the generated output conditioned on the input image. We first have two categories, X and Y, like the examples above: one is the aerial photo and the other is the street map, or one is a photo taken during the day and the other is taken at night. The two datasets must have a mapping relationship: every instance in X must have an exact corresponding version in Y. In other words, we must know the target of the translation; we must know the end of the road.
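The paired-dataset requirement can be sketched in a few lines. The file names and the pairing-by-stem convention below are hypothetical, purely for illustration:

```python
# Toy sketch of the paired-dataset requirement: every image in domain X must
# have a counterpart in domain Y sharing the same identifier. The file names
# here are made up for illustration.
aerial_photos = ["tile_001.jpg", "tile_002.jpg", "tile_003.jpg"]   # domain X
street_maps   = ["tile_002.png", "tile_001.png", "tile_003.png"]   # domain Y

def make_pairs(xs, ys):
    """Match each X image to its Y counterpart by the shared file-name stem."""
    stem = lambda name: name.rsplit(".", 1)[0]
    y_by_stem = {stem(y): y for y in ys}
    # Raises KeyError if some X image has no Y counterpart -- the mapping
    # must be complete for training.
    return [(x, y_by_stem[stem(x)]) for x in xs]

pairs = make_pairs(aerial_photos, street_maps)
print(pairs[0])  # ('tile_001.jpg', 'tile_001.png')
```

If any X image lacks its Y counterpart, the pairing fails loudly rather than silently dropping data.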


The idea of the cGAN is that before you can fool the human eye with a generated image, you need to fool a computer first. From X we can generate a fake Y, called Y', using the generative network (G). Y' is monitored by the discriminator network (D). The goal of G is to deceive D, while D tries to tell the truth: that the output of G is not a real image. When these two networks are trained at the same time, they push each other to do better. Finally, G is able to generate a solid fake image that looks very vivid.


The input of G is an m × n × 3 image. The output is an m × n × 3 image of the same size. G = Encoder + Residual blocks + Decoder. The Encoder is a three-layer convolutional neural network (CNN). There are 9 residual blocks in G, each with the same structure as the blocks in ResNet. The Decoder is another three-layer deconvolutional (transposed-convolution) network.
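A key property of the residual blocks is that they preserve the feature-map shape, so only the Encoder and Decoder change the spatial size. The sketch below uses 1×1 convolutions (a per-pixel matmul over channels) purely to keep the example small; the real blocks use 3×3 convolutions as in ResNet:

```python
import numpy as np

# Residual block sketch: output = input + F(input), shape preserved.
# 1x1 convolutions are simulated as channel-wise matmuls for brevity.
def residual_block(x, W1, W2):
    h = np.maximum(x @ W1, 0.0)   # "conv" + ReLU
    return x + h @ W2             # skip connection keeps the shape

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 64))            # a feature map inside G
W1 = rng.normal(size=(64, 64)) * 0.01
W2 = rng.normal(size=(64, 64)) * 0.01

y = x
for _ in range(9):                          # the 9 residual blocks
    y = residual_block(y, W1, W2)

print(y.shape)  # (8, 8, 64) -- unchanged after all 9 blocks
```

Because the 9 blocks leave the shape alone, the Decoder only has to undo the Encoder's downsampling to return to m × n × 3.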

The input of D is the concatenation of X and Y, or of X and Y', along the channel dimension. D is a five-layer CNN. The size of the output is 1, giving the probability that the input is a real pair.
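Concretely, the two m × n × 3 images are stacked channel-wise into one m × n × 6 tensor before being fed to D:

```python
import numpy as np

# D sees the input and the (real or generated) output stacked along the
# channel axis, not added together element-wise.
m, n = 256, 256
x = np.zeros((m, n, 3))    # input image from domain X
y = np.ones((m, n, 3))     # real target Y, or G's fake Y'

pair = np.concatenate([x, y], axis=-1)
print(pair.shape)  # (256, 256, 6)
```

This is what makes D conditional: it never judges Y (or Y') in isolation, only together with the X it is supposed to correspond to.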

Loss Function

The loss function is the sum of the losses of both G and D. The G part tests the accuracy of G: the output should be equal to Y, so we calculate the distance between Y and Y' (the paper uses the L1 distance). The D part says that D should be able to tell which pair is real, so we calculate a classification loss.
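The two terms can be written out on dummy data. Here `d_real_logit`/`d_fake_logit` stand in for D's scalar outputs, and `lam` is the weight on the reconstruction term (the paper uses lambda = 100); the shapes and values below are made up for illustration:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# D's loss: binary cross-entropy. D wants D(real pair) -> 1, D(fake pair) -> 0.
def d_loss(d_real_logit, d_fake_logit):
    return -np.log(sigmoid(d_real_logit)) - np.log(1 - sigmoid(d_fake_logit))

# G's loss: fool D, plus keep Y' close to the real Y (L1 reconstruction term).
def g_loss(d_fake_logit, y, y_fake, lam=100.0):
    adv = -np.log(sigmoid(d_fake_logit))
    l1 = np.mean(np.abs(y - y_fake))
    return adv + lam * l1

y = np.ones((4, 4, 3))          # dummy real target
y_fake = 0.9 * np.ones((4, 4, 3))  # dummy G output, slightly off
print(round(g_loss(2.0, y, y_fake), 3))
```

With a large `lam`, the L1 term dominates early in training, which is what keeps G's output aligned with the target rather than just "realistic".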

More examples:

