In this part of the project I will be using pretrained diffusion models to create images from noise. I will also make some cool things like orientation-based illusion images and frequency-based illusion images.
We will start by creating a function that adds noise to an image. Here is the result of running it on our test image, the Campanile, at several noise levels.
Noise Level 250
Noise Level 500
Noise Level 750
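For reference, here is a rough sketch of what this noising (forward-process) function looks like, assuming the model's noise schedule is available as a tensor of cumulative alpha products (the name `alphas_cumprod` and the call signature are illustrative):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image im ([B, C, H, W]) to timestep t:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)  # fresh Gaussian noise
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```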
We can start off by trying a classical denoising method: applying a Gaussian blur. As the results show, this doesn't work very well; it removes some of the noise but sacrifices a lot of image detail along with it.
Noise Level 250
Noise Lvl 250, Gaussian Blurred
Noise Level 500
Noise Lvl 500, Gaussian Blurred
Noise Level 750
Noise Lvl 750, Gaussian Blurred
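The blur itself is a one-liner; a sketch using torchvision, with the kernel size and sigma chosen by eye:

```python
import torchvision.transforms.functional as TF

# Smooth out high-frequency noise with a Gaussian kernel.
# This also smooths out the image content, hence the quality loss.
blurred = TF.gaussian_blur(noisy_im, kernel_size=7, sigma=2.0)
```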
We can then use the pretrained UNet to one-step denoise the image from various noise levels.
Noise Level 250
Noise Lvl 250, Denoised by UNet
Noise Level 500
Noise Lvl 500, Denoised by UNet
Noise Level 750
Noise Lvl 750, Denoised by UNet
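A sketch of the one-step estimate, assuming `unet` returns a prediction of the noise in the image (the call signature here is illustrative):

```python
def one_step_denoise(x_t, t, unet, alphas_cumprod, prompt_embeds):
    """Estimate the clean image x_0 from a noisy x_t in a single step."""
    eps_hat = unet(x_t, t, prompt_embeds)  # predicted noise (interface assumed)
    abar_t = alphas_cumprod[t]
    # Invert the forward process: x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```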
One-step denoising usually doesn't yield great results, so in practice we iteratively denoise the image, removing a bit of the noise at each step.
Iterative Denoise i=0
Iterative Denoise i=5
Iterative Denoise i=10
Iterative Denoise i=15
Iterative Denoise i=20
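Roughly, each step blends the current estimate of the clean image with the noisy image, following the standard DDPM update. A sketch with the same assumed `unet` interface as above (I leave out the small added variance noise for brevity):

```python
def iterative_denoise(x, timesteps, unet, alphas_cumprod, prompt_embeds):
    """Denoise along a strided schedule (timesteps sorted from high to low)."""
    for i in range(len(timesteps) - 1):
        t, t_prev = timesteps[i], timesteps[i + 1]  # t > t_prev
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha, beta = abar_t / abar_prev, 1 - abar_t / abar_prev
        # Estimate the clean image from the predicted noise at this step.
        eps_hat = unet(x, t, prompt_embeds)
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        # DDPM posterior mean: blend the x_0 estimate with the current x_t.
        x = ((abar_prev.sqrt() * beta / (1 - abar_t)) * x0_hat
             + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x)
    return x
```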
Here is the image cleaned with iterative denoising, compared to one-step denoising and Gaussian blur:
Iteratively Denoised Campanile
One-step Denoised Campanile
Gaussian-Blur-Denoised Campanile
Using pure noise as the model's starting input, we can prompt it to generate an image entirely from scratch.
Generated Image 1
Generated Image 2
Generated Image 3
Generated Image 4
Generated Image 5
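In code this is just the same iterative loop started from random noise (the shape here is assumed for illustration):

```python
# Start from pure Gaussian noise and denoise along the full schedule.
x = torch.randn(1, 3, 64, 64)  # image shape assumed for illustration
sample = iterative_denoise(x, timesteps, unet, alphas_cumprod, prompt_embeds)
```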
We can get better results by using classifier-free guidance (CFG). Using the prompt "a high quality image" and a gamma of 5, we get the following results:
CFG Generated Image 1
CFG Generated Image 2
CFG Generated Image 3
CFG Generated Image 4
CFG Generated Image 5
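The CFG combination itself is one line: run the UNet with and without the prompt, then extrapolate past the unconditional estimate. A sketch, with the same assumed interface as before:

```python
def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, gamma=5.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the conditional one by a factor gamma."""
    eps_uncond = unet(x, t, uncond_embeds)  # e.g. the empty prompt ""
    eps_cond = unet(x, t, cond_embeds)      # e.g. "a high quality image"
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```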
By adding some noise to an existing image and feeding it back into the model, we can generate edits of that image; smaller start_i means starting from more noise and therefore a heavier edit.
SDEdit with start_i=1
SDEdit with start_i=3
SDEdit with start_i=5
SDEdit with start_i=7
SDEdit with start_i=10
SDEdit with start_i=20
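A sketch of this procedure (often called SDEdit), reusing the forward and iterative-denoise functions from above:

```python
def sdedit(im, start_i, timesteps, unet, alphas_cumprod, prompt_embeds):
    """Noise a real image up to timesteps[start_i], then denoise it back.
    Smaller start_i = starting from more noise = a heavier edit."""
    x = forward(im, timesteps[start_i], alphas_cumprod)
    return iterative_denoise(x, timesteps[start_i:], unet,
                             alphas_cumprod, prompt_embeds)
```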
We can do the same with hand-drawn images or non-realistic images from the web to generate realistic versions of them.
Online Image with start_i=1
Online Image with start_i=3
Online Image with start_i=5
Online Image with start_i=7
Online Image with start_i=10
Online Image with start_i=20
Hand-drawn Snowman with start_i=1
Hand-drawn Snowman with start_i=3
Hand-drawn Snowman with start_i=5
Hand-drawn Snowman with start_i=7
Hand-drawn Snowman with start_i=10
Hand-drawn Snowman with start_i=20
Hand-drawn Face with start_i=1
Hand-drawn Face with start_i=3
Hand-drawn Face with start_i=5
Hand-drawn Face with start_i=7
Hand-drawn Face with start_i=10
Hand-drawn Face with start_i=20
We can also do inpainting by blocking out a patch of the image and prompting the model to fill in the blank, keeping the rest of the image the same.
Clean Image of Campanile
Mask for Inpainting
Part of the image to be painted over
Inpainted Image
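The key extra step is a single line inside the denoising loop (mask conventions assumed here: 1 where new content should be generated, 0 where the original is kept):

```python
# Inside the iterative denoising loop: after each update, force the region
# outside the mask back to the original image (noised to the current
# timestep), so the model only generates content inside the mask.
x = mask * x + (1 - mask) * forward(orig_im, t, alphas_cumprod)
```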
We can also use a text prompt to steer the edit, transforming one image toward another subject by adding some noise and running it through the model.
Campanile to Rocket with start_i=1
Campanile to Rocket with start_i=3
Campanile to Rocket with start_i=5
Campanile to Rocket with start_i=7
Campanile to Rocket with start_i=10
Campanile to Rocket with start_i=20
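This is the same SDEdit loop as before, just conditioned on the target prompt. A hypothetical call (`embed` stands in for whatever text encoder produces the prompt embedding):

```python
# Same noising/denoising loop as SDEdit, conditioned on the target prompt.
rocket_embeds = embed("a rocket ship")
edits = [sdedit(campanile, i, timesteps, unet, alphas_cumprod, rocket_embeds)
         for i in (1, 3, 5, 7, 10, 20)]
```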
We can also make visual anagrams by prompting the model to generate different things when the image is upright versus upside down.
Upright Image of Old Man
Upside Down Image of People Around Campfire
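A sketch of the trick, assuming "upside down" means a vertical flip and leaving out the CFG bookkeeping:

```python
def anagram_noise_estimate(x, t, unet, embeds_up, embeds_down):
    """Average two noise estimates: prompt A on the upright image and
    prompt B on the flipped image (flipped back before averaging)."""
    eps_up = unet(x, t, embeds_up)
    flipped = torch.flip(x, dims=[2])  # flip along the height axis
    eps_down = torch.flip(unet(flipped, t, embeds_down), dims=[2])
    return (eps_up + eps_down) / 2
```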
In a similar vein, we can create hybrid-image illusions by prompting the model differently for the low and high frequencies of the image.
From Afar: Image of Old Man
Up Close: Image of People Around Campfire
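A sketch of this hybrid noise estimate, using a Gaussian blur as the low-pass filter (the kernel size and sigma here are choices, not requirements):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x, t, unet, embeds_far, embeds_near):
    """Low frequencies follow the 'far' prompt, high frequencies the 'near' one."""
    eps_far = unet(x, t, embeds_far)
    eps_near = unet(x, t, embeds_near)
    lowpass = lambda e: TF.gaussian_blur(e, kernel_size=33, sigma=2.0)
    return lowpass(eps_far) + (eps_near - lowpass(eps_near))
```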
In this part of the project I will be building a diffusion model from scratch using PyTorch.
First we take the MNIST handwritten digits dataset and add varying levels of noise to it, like so:
A sample image from the dataset with varying amounts of noise [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
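The noising here is plain additive Gaussian noise, so a sketch is tiny:

```python
def add_noise(x, sigma):
    """Additive Gaussian noise: z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```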
We're trying to create a model that takes a noisy image and predicts the clean image it came from. We can achieve this using a UNet architecture with skip connections. To train the model, we feed in the noisy digit images and evaluate the output against the clean image using an L2 loss. After training for 5 epochs we get the following loss curve:
Loss curve for one step denoising UNet.
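For reference, a sketch of one training epoch; `denoiser`, `optimizer`, and `dataloader` (yielding clean MNIST digits) are assumed to exist:

```python
import torch.nn.functional as F

# One epoch of training the one-step denoiser.
for x, _ in dataloader:
    z = add_noise(x, 0.5)              # fresh noise every batch, sigma = 0.5
    loss = F.mse_loss(denoiser(z), x)  # L2 loss against the clean image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```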
Here is what the model's denoising capabilities look like after one, three, and five epochs of training:
One-step denoising output after 1 epoch
One-step denoising output after 3 epochs
One-step denoising output after 5 epochs
Despite the model being trained only at noise level 0.5, we can see that it also performs quite well on noise levels it wasn't trained on:
Now that the model can successfully denoise an image at one particular noise level (0.5), we can generalize it to work at any noise level by conditioning it on how much noise the image is expected to have. This will be useful for iterative denoising, where we start from a very noisy image and slowly denoise it one step at a time.
By modifying the model architecture to condition its output on an additional input t, representing the stage of denoising the current image is at, we can train it in a similar way as before and get the following results:
Loss curve for UNet with time conditioning
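One common way to inject t into a UNet is through a small fully connected block whose output shifts the feature maps; a sketch (the exact placement inside the UNet is a design choice, and the names here are illustrative):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP mapping the normalized timestep to a per-channel shift."""
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, out_channels), nn.GELU(),
                                 nn.Linear(out_channels, out_channels))

    def forward(self, t):
        # [B, 1] -> [B, C, 1, 1] so it broadcasts over spatial dimensions.
        return self.net(t)[:, :, None, None]

# Inside the UNet's decoder, features are then shifted by the timestep:
#   h = h + self.t_block(t / T)   # T = total number of timesteps
```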
We can generate some images by using iterative denoising and starting from images of pure noise. We can see the model getting progressively better at generating digits, but it ultimately doesn't do a stellar job.
Epoch 1
Epoch 5
Epoch 10
Epoch 20
Here is a GIF of the iterative denoising process:
Lastly, we also want our model to be able to generate a specific digit instead of just a random one. This should also yield better results, since the model is guided toward a particular digit and should be able to generate more realistic images.
To achieve this, we simply condition the model on c as well, which is a one-hot encoding of the digit the image should be. Training it the same way as above yields the following loss curve:
Loss curve for UNet with class conditioning as well
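A sketch of the conditioning change during training, assuming the UNet predicts the added noise eps and now takes c alongside t; zeroing c for a fraction of samples is what later makes classifier-free guidance possible:

```python
# Training step for the class-conditioned UNet; `labels` are the digit classes.
c = F.one_hot(labels, num_classes=10).float()
# Zero out the conditioning for ~10% of samples so the model also learns
# the unconditional distribution (needed for classifier-free guidance).
drop = torch.rand(c.shape[0], device=c.device) < 0.1
c[drop] = 0.0
eps_hat = unet(z, t, c)          # UNet now takes both t and c
loss = F.mse_loss(eps_hat, eps)  # eps is the noise that was added to get z
```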
For image generation, we use classifier-free guidance to produce better images. The course website said to use a gamma of 5, but my model performed better with gamma = 2, so I will stick with that. Here are the results:
5 epochs
20 epochs
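For completeness, the guidance combination at sampling time, with gamma = 2 (the same formula as earlier, just conditioned on the class vector):

```python
# At each sampling step, blend unconditional and conditional estimates.
eps_uncond = unet(x, t, torch.zeros_like(c))          # zeroed c = unconditional
eps_cond = unet(x, t, c)                              # c = one-hot target digit
eps_hat = eps_uncond + 2.0 * (eps_cond - eps_uncond)  # gamma = 2
```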