Let's talk about the glorious tale of AI-based human face generation and showcase an absolutely unbelievable new paper in this area. You may be surprised, but generating human faces with neural networks is not recent at all. This is four-year-old news! Insanity. Later, researchers turned this whole problem around and performed something that was previously thought to be impossible: they started using these networks to generate photorealistic images from a written text description. We could create new bird species by specifying that they should have orange legs and a short yellow bill. Then, researchers at NVIDIA recognized and addressed two shortcomings: one, the images were not that detailed, and two, even though we could input text, we couldn't exert much artistic control over the results. In came StyleGAN to the rescue, which was able to perform both of these difficult tasks really well. However, some features remained highly localized: as we exert control over these images, you can see how parts of the teeth and eyes are pinned to a particular location and the algorithm simply refuses to let them go, sometimes to the detriment of their surroundings.
A follow-up work titled StyleGAN2 addresses all of these problems in one go. StyleGAN2 was able to perform near-perfect synthesis of human faces, and remember, none of the people you see here really exist. Quite remarkable. So, how can we improve this magnificent technique? Well, this new work can do so many things, I don't even know where to start. First, and most importantly, we now have much, much more intuitive artistic control over the output images. We can add or remove a beard, make the subject younger or older, change their hairstyle, make their hairline recede, put a smile on their face, or even make their nose pointier. Absolute witchcraft. So why can we do all this with this new method? The key idea is that it is not using a Generative Adversarial Network, or GAN for short. A GAN consists of two competing neural networks, where one is trained to generate new images and the other is used to tell whether the generated images are real or fake. GANs dominated this field for a long while because of their powerful generation capabilities, but, on the other hand, they are quite difficult to train and we have only limited control over their output.
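For the more technically inclined, here is a minimal, hypothetical sketch of that two-network tug of war in PyTorch. The tiny architectures and the single training step below are illustrative placeholders only, not the networks used by StyleGAN2 or by the paper discussed here.

```python
import torch
import torch.nn as nn

# Heavily simplified, hypothetical GAN: the generator maps random noise to an
# image, the discriminator guesses whether an image is real or generated.
latent_dim, img_dim = 128, 64 * 64 * 3

generator = nn.Sequential(
    nn.Linear(latent_dim, 1024), nn.ReLU(),
    nn.Linear(1024, img_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)
    fake_images = generator(z)

    # Discriminator: push predictions for real images toward 1, fakes toward 0.
    opt_d.zero_grad()
    loss_d = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to fool the discriminator into predicting "real".
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake_images), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```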
Among other changes, this work disassembles the generator network into F and G, and the discriminator network into E and D, or in other words, adds an encoder and a decoder network here. Why? The key idea is that the encoder compresses the image data down into a representation that we can edit more easily. This is the land of beards and smiles; in other words, all of these intuitive features that we can edit live here, and when we are done, we can decompress the output with the decoder network and produce these beautiful images. This is already incredible, but what else can we do with this new architecture? A lot more. For instance, two: if we take a source and a destination subject, their coarse, middle, or fine styles can also be mixed. What does that mean exactly? The coarse part means that high-level attributes, like pose, hairstyle, and face shape, will resemble the source subject; in other words, the child will remain a child and inherit some of the properties of the destination subject. However, as we transition to the "fine from source" part, the effect of the destination subject becomes stronger, and the source is only used to change the color scheme and microstructure of the image. Interestingly, it also changes the background of the subject. Three, it can also perform image interpolation. This means that we have these four images as starting points, and it can compute intermediate images between them.
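To make the editing idea a little more concrete, here is a hypothetical sketch of the workflow that this encoder/decoder split enables. E and G are assumed to be pretrained networks following the paper's decomposition (E compresses an image into an editable latent code, G decompresses a code back into an image); the function names, the "attribute direction" vector, and the per-layer latent input to G are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical editing workflow, assuming pretrained callables E and G.

def edit_attribute(image, E, G, direction, strength=2.0):
    """Add or remove a feature (beard, smile, age, ...) by nudging the
    latent code along a direction associated with that feature."""
    w = E(image)                         # compress into the "land of beards and smiles"
    w_edited = w + strength * direction  # move along the chosen attribute axis
    return G(w_edited)                   # decompress back into a photorealistic image


def mix_styles(source, destination, E, G, coarse_layers=4, total_layers=14):
    """Coarse/fine style mixing sketch: early (coarse) layers take the source's
    latent code (pose, hairstyle, face shape), later (fine) layers take the
    destination's code (color scheme, microstructure). Assumes G accepts one
    latent code per synthesis layer, StyleGAN-style."""
    w_src, w_dst = E(source), E(destination)
    per_layer = [w_src if i < coarse_layers else w_dst for i in range(total_layers)]
    return G(per_layer)
```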
You see here that as we slowly become Bill Gates, somewhere along the way, glasses appear. Now, note that interpolating between images is not difficult in the slightest and has been possible for a long, long time: all we need to do is compute averaged results between these images. So what makes a good interpolation process? Well, we are talking about a good interpolation when each of the intermediate images makes sense and can stand on its own. I think this technique does amazingly well at that. I'll stop the process at different places so you can see for yourself, and let me know in the comments if you agree or not. I also kindly thank the authors for creating more footage just for us to showcase in this series. That is a huge honor, thank you so much! And note that StyleGAN2 appeared around December of 2019, and this paper, by the name "Adversarial Latent Autoencoders", appeared only four months later. Four months later. My goodness! This is so much progress in so little time, it truly makes my head spin. What a time to be alive!
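To close, here is a minimal sketch of the difference between the naive pixel averaging mentioned above and interpolation in a learned latent space. As before, E and G are assumed pretrained encoder and generator networks; the function names are placeholders for illustration, not the paper's code.

```python
import numpy as np

def pixel_interpolate(img_a, img_b, t):
    """Naive approach: blend pixel values directly. Easy, but the in-between
    frames look like ghostly double exposures rather than faces that can
    stand on their own."""
    return (1.0 - t) * img_a + t * img_b

def latent_interpolate(img_a, img_b, t, E, G):
    """Latent-space approach (sketch, assuming pretrained encoder E and
    generator G): blend the compressed codes, then decode, so every
    intermediate frame is itself a plausible face, which is how features
    like glasses can gradually appear along the way."""
    w_a, w_b = E(img_a), E(img_b)
    return G((1.0 - t) * w_a + t * w_b)

# Example: ten frames morphing from one portrait toward the other.
# frames = [latent_interpolate(img_a, img_b, t, E, G) for t in np.linspace(0.0, 1.0, 10)]
```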