Final Project - Owen Gozali

Neural Radiance Fields


Introduction

This project focuses on using Neural Radiance Fields (NeRFs) to reconstruct 3D structures programmatically from 2D images taken from various perspectives. By leveraging deep learning and computer vision techniques, NeRFs enable high-quality, photorealistic 3D renderings with detailed lighting and geometry.

Part 1: 2D NeRFs

Overview

We will start with a 2D version of NeRFs, a simplified setup to get us started. The goal is to train a network on a single image so that, when queried with a pixel coordinate, it returns the color of that pixel.

Network

The core of our implementation is a Multilayer Perceptron (MLP) designed to take 2D pixel coordinates as input and output RGB color values. To achieve this, the model utilizes Sinusoidal Positional Encoding (PE), which expands the 2D coordinates into a higher-dimensional representation. This encoding enhances the model’s ability to learn complex spatial patterns by including higher-frequency information.
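
As a rough sketch, this kind of encoding maps each normalized coordinate x to [x, sin(2⁰πx), cos(2⁰πx), …, sin(2^(L-1)πx), cos(2^(L-1)πx)]. The function below is illustrative; the exact frequency convention and names are assumptions, not our exact implementation:

    import math
    import torch

    def positional_encoding(x, L=10):
        """Sinusoidal positional encoding of 2D coordinates.

        x: tensor of shape (N, 2) with coordinates normalized to [0, 1].
        Returns shape (N, 2 * (1 + 2 * L)): the original coordinates plus
        sin/cos features at L frequencies per coordinate.
        """
        features = [x]
        for i in range(L):
            freq = (2.0 ** i) * math.pi
            features.append(torch.sin(freq * x))
            features.append(torch.cos(freq * x))
        return torch.cat(features, dim=-1)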

The MLP consists of multiple fully connected layers with non-linear activations (ReLU), followed by a final Sigmoid activation layer to ensure that the output RGB values remain in the range [0, 1]. Normalization of input image data is critical—pixel values are scaled from [0, 255] to [0, 1], aligning with the output constraints. For hyperparameters, we chose to use an Adam optimizer with lr = 1e-2, batch_size = 10000, and num_iterations = 2000.

[Figure: Model architecture]

Model Architecture:

This architecture allows the MLP to represent complex spatial patterns and predict the color of any pixel in the image.
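
A minimal sketch of such an MLP is shown below; the hidden width and depth are assumptions, and only the encoded-coordinate input, ReLU hidden layers, and final Sigmoid are taken from the description above:

    import torch.nn as nn

    class PixelMLP(nn.Module):
        """MLP mapping positionally encoded 2D coordinates to RGB in [0, 1]."""

        def __init__(self, L=10, hidden=256, depth=4):
            super().__init__()
            in_dim = 2 * (1 + 2 * L)  # size of the encoded coordinate
            layers = []
            for _ in range(depth):
                layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
                in_dim = hidden
            layers += [nn.Linear(hidden, 3), nn.Sigmoid()]  # keep RGB in [0, 1]
            self.net = nn.Sequential(*layers)

        def forward(self, encoded_coords):
            return self.net(encoded_coords)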

Dataloader

To optimize training on high-resolution images, we implemented a custom dataloader that randomly samples a subset of pixels at each iteration. This approach is sample-efficient and lets the model learn what the image looks like without seeing every pixel. It becomes even more important for 3D NeRFs, where feeding the model every possible point is intractable.

How It Works: at each iteration, we draw a random batch of pixel coordinates, normalize them to [0, 1], and pair them with the corresponding ground-truth colors from the image.
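
A minimal sketch of this kind of random-pixel sampler, assuming the image is stored as an (H, W, 3) float array scaled to [0, 1] (function and variable names are illustrative, not our exact implementation):

    import numpy as np
    import torch

    def sample_pixels(image, batch_size=10000):
        """Randomly sample pixel coordinates and their colors from one image.

        image: float array of shape (H, W, 3) with values in [0, 1].
        Returns (coords, colors): coords normalized to [0, 1] with shape
        (batch_size, 2), and colors with shape (batch_size, 3).
        """
        h, w, _ = image.shape
        ys = np.random.randint(0, h, size=batch_size)
        xs = np.random.randint(0, w, size=batch_size)
        coords = np.stack([xs / w, ys / h], axis=-1).astype(np.float32)
        colors = image[ys, xs].astype(np.float32)
        return torch.from_numpy(coords), torch.from_numpy(colors)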

Loss Function, Optimizer, and Metric

Loss Function:

We used Mean Squared Error (MSE) as the loss function to quantify the difference between the predicted RGB values and the ground truth. MSE is simple and effective for measuring pixel-wise reconstruction quality.

Optimizer:

The model was trained using the Adam optimizer with a learning rate of 1 × 10⁻². Adam's adaptive learning rate ensures efficient convergence even with non-linear activations and positional encodings.

Metric:

While MSE measures reconstruction accuracy during training, Peak Signal-to-Noise Ratio (PSNR) is used as the primary evaluation metric. PSNR is better suited for comparing image quality and is calculated from the MSE using the formula:

            PSNR = 20 × log10(1 / sqrt(MSE))
        

PSNR values provide a clearer picture of the network’s ability to reconstruct fine details as training progresses.
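
Putting the pieces together, one training loop could look like the sketch below. It assumes the positional_encoding, PixelMLP, and sample_pixels helpers sketched earlier and an image array already scaled to [0, 1]; it is illustrative rather than our exact code:

    import torch
    import torch.nn.functional as F

    # image: (H, W, 3) float array in [0, 1], loaded elsewhere
    model = PixelMLP(L=10)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

    def psnr_from_mse(mse):
        # PSNR = 20 * log10(1 / sqrt(MSE)) for images scaled to [0, 1]
        return 20.0 * torch.log10(1.0 / torch.sqrt(mse))

    for step in range(2000):
        coords, colors = sample_pixels(image, batch_size=10000)
        pred = model(positional_encoding(coords, L=10))
        loss = F.mse_loss(pred, colors)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: PSNR = {psnr_from_mse(loss.detach()).item():.2f} dB")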

Results

We then trained the model using our dataloader; the predicted image becomes progressively better as the number of training steps increases.

[Figure: Predicted images at increasing training steps]

This improvement in quality is also reflected quantitatively in the PSNR curve:

[Figure: PSNR for Fox Image]

We also ran the model on a different image, using the same architecture and hyperparameters; the method works well for other images, too.

[Figure: Temple image results]

[Figure: PSNR for Temple Image]

Hyperparameter Tuning

Exploring Hyperparameters:

We experimented with two key hyperparameters:

Varying L

After trying values of L = [0, 2, 5, 10, 20, 40], we saw that increasing L generally improves the network's ability to capture finer details, but it also increases model complexity, which can slow training and lead to overfitting if set too high.

[Figures: Results for varying L]
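
One practical note on why larger L costs more (assuming the encoding convention sketched in the Network section above): the MLP's input width grows linearly with L.

    # Encoded input size for 2D coordinates: 2 * (1 + 2 * L)
    for L in [0, 2, 5, 10, 20, 40]:
        print(L, 2 * (1 + 2 * L))  # -> 2, 10, 22, 42, 82, 162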