Project 5: Diffusion
Part A: Fun With Diffusion
Part 0: Setup
We’ll start by loading DeepFloyd, a text-to-image model. Here are a few of DeepFloyd’s results given various prompts and inference steps:
In general, increasing the number of inference steps increases the quality of the image. Moreover, the images are pretty high quality, with very accurate depictions of the input prompts.
For the purpose of reproducibility, we will be using a random seed of 180.
Part 1: Sampling Loops
Throughout this part, we use the noise schedule defined by DeepFloyd to generate $\bar{\alpha}_t$.
1.1: Noising
Our first goal will be to denoise an image. To start, we implement a forward noising function, which follows $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. In essence, we create the noisy image by sampling from a normal distribution whose mean is a scaled copy of our image and whose variance grows with the timestep. Here are the results of the forward process:
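For reference, here is a minimal PyTorch sketch of such a forward function. The name `alphas_cumprod` for DeepFloyd's $\bar{\alpha}_t$ schedule is an assumption of this sketch, not the library's API:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image `im` to timestep `t`:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I).

    `alphas_cumprod` is assumed to be a 1-D tensor of cumulative products
    of alphas (abar_t), indexed by integer timestep.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # fresh Gaussian noise
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
    return x_t, eps
```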
1.2: Gaussian Blur Denoising
Classical denoising methods often apply a Gaussian blur filter, which suppresses noise by averaging over a kernel region. We first try this method on the noisy images from above:
As we can see, as we tend towards pure noise, classical methods are unable to reconstruct the image.
1.3: One Step Denoising
We next try one-step denoising: we predict the noise $\epsilon$ with the DeepFloyd U-Net, then estimate the clean image by solving for $x_0$ in the equation from section 1.1 with a known $\bar{\alpha}_t$: $x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$. Here are the results:
This does better at recreating the original image.
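A sketch of this inversion, continuing the conventions of the forward sketch above (the helper names are placeholders, not DeepFloyd's API):

```python
def one_step_denoise(x_t, t, eps_pred, alphas_cumprod):
    """Estimate the clean image x_0 from the noisy image x_t and the
    U-Net's predicted noise eps_pred by inverting the forward equation:
    x_0 = (x_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t).
    """
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)
```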
1.4: Iterative Denoising
If we view the less-noisy "previous" image as a convex combination of the current noisy image and the predicted clean image at that timestep (found using the method in 1.3), we can iteratively denoise an image timestep by timestep. Concretely, for a less-noisy timestep $t' < t$:

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}\,x_t \;+\; v_\sigma,$$

where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current clean-image estimate, and $v_\sigma$ is added noise/variance.
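A sketch of a single update step following this equation (the argument names, and passing the variance term $v_\sigma$ in explicitly, are assumptions of this sketch):

```python
def denoise_step(x_t, x0_est, t, t_prev, alphas_cumprod, v_sigma):
    """Move from timestep t to a less-noisy timestep t_prev (< t) by
    blending the current noisy image x_t with the clean-image estimate
    x0_est, then adding back the variance term v_sigma.
    """
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    return (
        torch.sqrt(abar_prev) * beta_t / (1 - abar_t) * x0_est
        + torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t) * x_t
        + v_sigma
    )
```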
Here are the results of iterative denoising:
We can also view the iterative denoising predictions at different timesteps:
1.5: Diffusion Model Sampling
In the previous part, we passed in a partially noised photo of the Campanile, starting at timestep 690. If we instead pass pure noise, $x_T \sim \mathcal{N}(0, I)$, into our iterative denoising loop, we get a series of different, yet semi-meaningful, pictures:
1.6: Classifier-Free Guidance (CFG) For Diffusion
Unfortunately, the images above have distortions, and some (like the bottom #5) seem to have no meaning. To improve image quality, we use classifier-free guidance (widely called a "cheat code" for diffusion models), where we run a forward pass of our U-Net with two different prompt inputs: a conditional one and an unconditional one. We then form the noise estimate for iterative denoising as a combination of these two estimates: $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$, where $\gamma$ is the guidance scale. This gives much better results. Here are samples with $\gamma > 1$:
In contrast to 1.5, these images are significantly more colorful and realistic.
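A sketch of the CFG noise estimate, treating `unet` as a callable that takes the noisy image, timestep, and a prompt embedding and returns a noise prediction (a simplification of the actual DeepFloyd interface):

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma):
    """Classifier-free guidance: run the U-Net with the conditional and
    unconditional prompt embeddings, then extrapolate past the
    unconditional estimate by the guidance scale gamma.
    """
    eps_c = unet(x_t, t, cond_emb)      # conditional noise estimate
    eps_u = unet(x_t, t, uncond_emb)    # unconditional (null-prompt) estimate
    return eps_u + gamma * (eps_c - eps_u)
```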
1.7.0: SDE Edit
We can use the SDE edit algorithm to "hallucinate" new images on a manifold close to our original image. By changing how far we run the forward (noising) process before we start denoising (i.e., the starting noise level labelled on the images below), we can produce images that look more and more similar to the original, since we move toward a manifold closer to the original image!
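SDE edit reuses the machinery above: partially noise the input, then run the iterative denoising loop from that point. A rough sketch, assuming the `forward` helper from 1.1 and a hypothetical `iterative_denoise` loop built from the step function in 1.4:

```python
def sdedit(im, i_start, timesteps, alphas_cumprod, **denoise_kwargs):
    """Noise the original image up to timesteps[i_start], then denoise back
    down. `timesteps` is assumed to run from noisiest to cleanest, so a
    larger i_start adds less noise and stays closer to the input image.
    """
    x_t, _ = forward(im, timesteps[i_start], alphas_cumprod)   # partially noise the input
    return iterative_denoise(x_t, timesteps[i_start:], **denoise_kwargs)
```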
Reference images:
SDE Edits:
My favorite part is the little ghost Totoro at the end of the last generation!
And here it is with the American flag. Works pretty well!
1.7.1: Editing Hand-Drawn and Web Images
We can also transform drawings into more realistic photos by noising the drawing and projecting it back onto a more realistic manifold. Here are three images we try this method on: one from Goya and two from me!
1.7.2: Inpainting
We can also use our model to inpaint selected regions according to a prompt: at each denoising step we force the pixels outside the mask to match an appropriately noised copy of the original image, so only the masked region is generated (a sketch of this masking step appears at the end of this subsection). Here are the input image, the mask, the region to replace, and the final image:
Here are 2 results:
And:
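A sketch of the masking step mentioned above, applied after every denoising update. It reuses the hypothetical `forward` helper from 1.1, and assumes the mask is 1 where new content should be generated:

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generation inside the mask only: pixels where mask == 0 are
    overwritten with a freshly noised copy of the original image, i.e.
    x_t <- mask * x_t + (1 - mask) * forward(x_orig, t).
    """
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)   # noise the original to level t
    return mask * x_t + (1 - mask) * x_orig_t
```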
1.7.3: Text-Conditional Image-to-image Translation
We now change the conditioning prompt. Here's the Campanile as a rocket ship:
Ronaldo wearing a hat:
Totoro as a pencil:
1.8: Visual Anagrams
We can also make visual anagrams, images that show one thing right-side up and another upside down (a sketch of the noise-averaging trick appears at the end of this subsection). Here are a few variants of an anagram of a campfire and an old man.
Here’s a man wearing a hat and a rocket ship:
And a photo of a hipster barista or a dog:
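A sketch of the noise-estimate averaging behind these anagrams, reusing the `cfg_noise_estimate` sketch from 1.6. Flipping along the height axis is an implementation assumption here:

```python
def anagram_noise_estimate(unet, x_t, t, emb1, emb2, uncond_emb, gamma):
    """Visual anagram: estimate noise for prompt 1 on the image as-is,
    estimate noise for prompt 2 on the image flipped upside down, flip
    that estimate back, and average the two.
    """
    eps1 = cfg_noise_estimate(unet, x_t, t, emb1, uncond_emb, gamma)
    x_flipped = torch.flip(x_t, dims=[-2])               # flip along the height axis
    eps2 = cfg_noise_estimate(unet, x_flipped, t, emb2, uncond_emb, gamma)
    eps2 = torch.flip(eps2, dims=[-2])                   # un-flip the estimate
    return (eps1 + eps2) / 2
```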
1.9: Hybrid Images
We can also combine two prompts into a hybrid image using high and low frequencies: we keep the low frequencies of the noise estimate for one prompt and the high frequencies of the estimate for the other (see the sketch at the end of this subsection). Here is a lithograph of a skull and a waterfall:
A rocket ship and mountains:
The Amalfi coast and a pupper:
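A sketch of the frequency-blending step for hybrid images, again reusing the `cfg_noise_estimate` sketch; the Gaussian blur parameters are placeholders, not values confirmed by this write-up:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, emb_low, emb_high, uncond_emb, gamma,
                          kernel_size=33, sigma=2.0):
    """Hybrid image: low frequencies of the noise estimate for one prompt
    plus high frequencies of the estimate for the other. The low-frequency
    prompt dominates from far away, the high-frequency one up close.
    """
    eps_low = cfg_noise_estimate(unet, x_t, t, emb_low, uncond_emb, gamma)
    eps_high = cfg_noise_estimate(unet, x_t, t, emb_high, uncond_emb, gamma)
    lowpass = TF.gaussian_blur(eps_low, kernel_size, sigma)               # keep low frequencies
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)  # keep high frequencies
    return lowpass + highpass
```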
Part B: Diffusion Models from Scratch!
Part 1: Training a Single-Step Denoising UNet
1.1: Implementing the UNet
We start by implementing a UNet, the same class of architecture used by the diffusion model in the previous part. The architecture we implement is shown below:
After implementing our architecture, our goal is to recover the noiseless image $x$ from its noised version $z$ by minimizing the L2 loss $L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$. To do this, we first noise a set of images to different levels $\sigma$, following $z = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\sigma$ increases as we move down the rows.
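A sketch of one training step under this objective; the `denoiser` is the UNet above, and the optimizer setup is assumed:

```python
import torch
import torch.nn.functional as F

def train_step(denoiser, x, sigma, optimizer):
    """One step of training the single-step denoiser: form the noised
    input z = x + sigma * eps and minimize the L2 loss ||D(z) - x||^2.
    """
    eps = torch.randn_like(x)
    z = x + sigma * eps                   # noised input at level sigma
    loss = F.mse_loss(denoiser(z), x)     # L2 reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```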
The model was able to successfully train:
And we can display the results from epochs 1 and 5:
We can also try out-of-distribution noise levels, corresponding to $\sigma$ values in decreasing order:
Part 2: Training a Time-Conditioned, Class-Conditioned UNet
2.3: Time Conditioning
We add time conditioning. It is trained with the following loss: $L = \mathbb{E}_{x_0, \epsilon, t}\big[\|\epsilon_\theta(x_t, t) - \epsilon\|^2\big]$, where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $t$ is sampled uniformly.
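A sketch of one training step for the time-conditioned UNet, continuing the sketch above. Normalizing $t$ before passing it into the network, and the argument names, are assumptions:

```python
def train_step_time(unet, x0, alphas_cumprod, optimizer, num_timesteps):
    """One step of training the time-conditioned UNet: pick a random
    timestep per image, noise x0 to x_t with the forward process, and
    regress the network's output onto the noise that was added.
    """
    t = torch.randint(0, num_timesteps, (x0.shape[0],))          # random t per image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps     # forward-noise x0
    loss = F.mse_loss(unet(x_t, t.float() / num_timesteps), eps) # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```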
We can view samples from the diffusion model at different training epochs:
As we can see, training improved performance!
2.5: Class and Time Conditioning