Project 5: Diffusion
Part A: Fun With Diffusion
Part 0: Setup
We’ll start by loading DeepFloyd, a text-to-image model. Here are a few of DeepFloyd’s results given various prompts and inference steps:
In general, increasing the number of inference steps increases the quality of the image. Moreover, the images are pretty high quality, with very accurate depictions of the input prompts.
For the purpose of reproducibility, we will be using a random seed of 180.
Part 1: Sampling Loops
Throughout this part, we use the noise schedule defined by DeepFloyd to generate $\bar{\alpha}_t$.
1.1: Noising
Our first goal will be to denoise an image. To start, we implement a forward noising function, which follows $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. In essence, we create the noisy image by sampling from a normal distribution whose mean is a scaled copy of our image and whose variance grows with the timestep. Here are the results of the forward process:
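For reference, here is a minimal PyTorch sketch of such a forward function. The name `alphas_cumprod` for DeepFloyd's $\bar{\alpha}_t$ schedule is an assumption of this sketch, not the library's API:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image `im` to timestep `t`:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I).

    `alphas_cumprod` is assumed to be a 1-D tensor of cumulative products
    of alphas (abar_t), indexed by integer timestep.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # fresh Gaussian noise
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
    return x_t, eps
```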
1.2: Gaussian Blur Denoising
Classical denoising methods often apply a Gaussian blur filter, which suppresses noise by averaging over a kernel region. We first try this method on the noisy images from above:
As we can see, as we tend towards pure noise, classical methods are unable to reconstruct the image.
1.3: One Step Denoising
We next try one-step denoising: we predict the noise $\epsilon$ with the DeepFloyd U-Net, then estimate the clean image by solving for $x_0$ in the equation from section 1.1 with a known $\bar{\alpha}_t$: $x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$. Here are the results:
This does better at recreating the original image.
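A sketch of this inversion, continuing the conventions of the forward sketch above (the helper names are placeholders, not DeepFloyd's API):

```python
def one_step_denoise(x_t, t, eps_pred, alphas_cumprod):
    """Estimate the clean image x_0 from the noisy image x_t and the
    U-Net's predicted noise eps_pred by inverting the forward equation:
    x_0 = (x_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t).
    """
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)
```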
1.4: Iterative Denoising
If we view the less-noisy "previous" image as a convex combination of the current noisy image and the predicted clean image at that timestep (found using the method in 1.3), we can iteratively denoise an image timestep by timestep. Concretely, for a less-noisy timestep $t' < t$:

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}\,x_t \;+\; v_\sigma,$$

where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current clean-image estimate, and $v_\sigma$ is added noise/variance.
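A sketch of a single update step following this equation (the argument names, and passing the variance term $v_\sigma$ in explicitly, are assumptions of this sketch):

```python
def denoise_step(x_t, x0_est, t, t_prev, alphas_cumprod, v_sigma):
    """Move from timestep t to a less-noisy timestep t_prev (< t) by
    blending the current noisy image x_t with the clean-image estimate
    x0_est, then adding back the variance term v_sigma.
    """
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    return (
        torch.sqrt(abar_prev) * beta_t / (1 - abar_t) * x0_est
        + torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t) * x_t
        + v_sigma
    )
```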
Here are the results of iterative denoising:
We can also view the iterative denoising predictions at different timesteps:
1.5: Diffusion Model Sampling
In the previous part, we passed in a partially noised photo of the Campanile, starting at timestep 690. If we instead pass pure noise, $x_T \sim \mathcal{N}(0, I)$, into our iterative denoising loop, we get a series of different, yet semi-meaningful, pictures:
1.6: Classifier-Free Guidance (CFG) For Diffusion
Unfortunately, the images above have distortions, and some (like the bottom #5) seem to have no meaning. To improve image quality, we use classifier-free guidance (widely called a "cheat code" for diffusion models), where we run a forward pass of our U-Net with two different prompt inputs: a conditional one and an unconditional one. We then form the noise estimate for iterative denoising as a combination of these two estimates: $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$, where $\gamma$ is the guidance scale. This gives much better results. Here are samples with $\gamma > 1$:
In contrast to 1.5, these images are significantly more colorful and realistic.
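A sketch of the CFG noise estimate, treating `unet` as a callable that takes the noisy image, timestep, and a prompt embedding and returns a noise prediction (a simplification of the actual DeepFloyd interface):

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma):
    """Classifier-free guidance: run the U-Net with the conditional and
    unconditional prompt embeddings, then extrapolate past the
    unconditional estimate by the guidance scale gamma.
    """
    eps_c = unet(x_t, t, cond_emb)      # conditional noise estimate
    eps_u = unet(x_t, t, uncond_emb)    # unconditional (null-prompt) estimate
    return eps_u + gamma * (eps_c - eps_u)
```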
1.7.0: SDE Edit
We can use the SDE edit algorithm to "hallucinate" new images on a manifold close to our original image. By changing how far we run the forward (noising) process before we start denoising (i.e., the starting noise level labelled on the images below), we can produce images that look more and more similar to the original, since we move toward a manifold closer to the original image!
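SDE edit reuses the machinery above: partially noise the input, then run the iterative denoising loop from that point. A rough sketch, assuming the `forward` helper from 1.1 and a hypothetical `iterative_denoise` loop built from the step function in 1.4:

```python
def sdedit(im, i_start, timesteps, alphas_cumprod, **denoise_kwargs):
    """Noise the original image up to timesteps[i_start], then denoise back
    down. `timesteps` is assumed to run from noisiest to cleanest, so a
    larger i_start adds less noise and stays closer to the input image.
    """
    x_t, _ = forward(im, timesteps[i_start], alphas_cumprod)   # partially noise the input
    return iterative_denoise(x_t, timesteps[i_start:], **denoise_kwargs)
```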
Reference images:
SDE Edits:
My favorite part is the little ghost Totoro at the end of the last generation!
And here it is with the American flag. Works pretty well!
1.7.1: Editing Hand-Drawn and Web Images
We can also transform drawings into more realistic photos by noising the drawing and projecting it back onto a more realistic manifold. Here are three images we try this method on: one from Goya and two from me!
1.7.2: Inpainting
We can also use our model to inpaint selected regions according to a prompt: at each denoising step we force the pixels outside the mask to match an appropriately noised copy of the original image, so only the masked region is generated (a sketch of this masking step appears at the end of this subsection). Here are the input image, the mask, the region to replace, and the final image:
Here are 2 results:
And:
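A sketch of the masking step mentioned above, applied after every denoising update. It reuses the hypothetical `forward` helper from 1.1, and assumes the mask is 1 where new content should be generated:

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generation inside the mask only: pixels where mask == 0 are
    overwritten with a freshly noised copy of the original image, i.e.
    x_t <- mask * x_t + (1 - mask) * forward(x_orig, t).
    """
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)   # noise the original to level t
    return mask * x_t + (1 - mask) * x_orig_t
```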
1.7.3: Text-Conditional Image-to-image Translation
We now change the conditioning prompt. Here's the Campanile as a rocket ship:
Ronaldo wearing a hat:
Totoro as a pencil:
1.8: Visual Anagrams
We can also make visual anagrams, images that show one thing right-side up and another upside down (a sketch of the noise-averaging trick appears at the end of this subsection). Here are a few variants of an anagram of a campfire and an old man.
Here’s a man wearing a hat and a rocket ship:
And a photo of a hipster barista or a dog:
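A sketch of the noise-estimate averaging behind these anagrams, reusing the `cfg_noise_estimate` sketch from 1.6. Flipping along the height axis is an implementation assumption here:

```python
def anagram_noise_estimate(unet, x_t, t, emb1, emb2, uncond_emb, gamma):
    """Visual anagram: estimate noise for prompt 1 on the image as-is,
    estimate noise for prompt 2 on the image flipped upside down, flip
    that estimate back, and average the two.
    """
    eps1 = cfg_noise_estimate(unet, x_t, t, emb1, uncond_emb, gamma)
    x_flipped = torch.flip(x_t, dims=[-2])               # flip along the height axis
    eps2 = cfg_noise_estimate(unet, x_flipped, t, emb2, uncond_emb, gamma)
    eps2 = torch.flip(eps2, dims=[-2])                   # un-flip the estimate
    return (eps1 + eps2) / 2
```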
1.9: Hybrid Images
We can also combine two prompts into a hybrid image using high and low frequencies: we keep the low frequencies of the noise estimate for one prompt and the high frequencies of the estimate for the other (see the sketch at the end of this subsection). Here is a lithograph of a skull and a waterfall:
A rocket ship and mountains:
The Amalfi coast and a pupper:
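A sketch of the frequency-blending step for hybrid images, again reusing the `cfg_noise_estimate` sketch; the Gaussian blur parameters are placeholders, not values confirmed by this write-up:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, emb_low, emb_high, uncond_emb, gamma,
                          kernel_size=33, sigma=2.0):
    """Hybrid image: low frequencies of the noise estimate for one prompt
    plus high frequencies of the estimate for the other. The low-frequency
    prompt dominates from far away, the high-frequency one up close.
    """
    eps_low = cfg_noise_estimate(unet, x_t, t, emb_low, uncond_emb, gamma)
    eps_high = cfg_noise_estimate(unet, x_t, t, emb_high, uncond_emb, gamma)
    lowpass = TF.gaussian_blur(eps_low, kernel_size, sigma)               # keep low frequencies
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)  # keep high frequencies
    return lowpass + highpass
```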
Part B: Diffusion Models from Scratch!
Part 1: Training a Single-Step Denoising UNet
1.1: Implementing the UNet
We start by implementing a UNet, the same class of architecture used by the diffusion model in the previous part. The architecture we implement is shown below:
After implementing our architecture, our goal is to recover the noiseless image $x$ from its noised version $z$ by minimizing the L2 loss $L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$. To do this, we first noise a set of images to different levels $\sigma$, following $z = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\sigma$ increases as we move down the rows.
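A sketch of one training step under this objective; the `denoiser` is the UNet above, and the optimizer setup is assumed:

```python
import torch
import torch.nn.functional as F

def train_step(denoiser, x, sigma, optimizer):
    """One step of training the single-step denoiser: form the noised
    input z = x + sigma * eps and minimize the L2 loss ||D(z) - x||^2.
    """
    eps = torch.randn_like(x)
    z = x + sigma * eps                   # noised input at level sigma
    loss = F.mse_loss(denoiser(z), x)     # L2 reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```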
The model was able to successfully train:
And we can display the results from epochs 1 and 5:
We can also try out-of-distribution noise levels, corresponding to $\sigma$ values in decreasing order:
Part 2: Training a Time-Conditioned, Class-Conditioned UNet
2.3: Time Conditioning
We add time conditioning. It is trained with the following loss: $L = \mathbb{E}_{x_0, \epsilon, t}\big[\|\epsilon_\theta(x_t, t) - \epsilon\|^2\big]$, where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $t$ is sampled uniformly.
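A sketch of one training step for the time-conditioned UNet, continuing the sketch above. Normalizing $t$ before passing it into the network, and the argument names, are assumptions:

```python
def train_step_time(unet, x0, alphas_cumprod, optimizer, num_timesteps):
    """One step of training the time-conditioned UNet: pick a random
    timestep per image, noise x0 to x_t with the forward process, and
    regress the network's output onto the noise that was added.
    """
    t = torch.randint(0, num_timesteps, (x0.shape[0],))          # random t per image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps     # forward-noise x0
    loss = F.mse_loss(unet(x_t, t.float() / num_timesteps), eps) # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```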
We can view samples from the diffusion model at different training epochs:
As we can see, training improved performance!
2.5: Class and Time Conditioning