This project focuses on using Neural Radiance Fields (NeRFs) to reconstruct 3D structures programmatically from 2D images taken from various perspectives. By leveraging deep learning and computer vision techniques, NeRFs enable high-quality, photorealistic 3D renderings with detailed lighting and geometry.
We start with a 2D version of NeRFs as a simplified warm-up. The goal is to train a network on a single image so that, when queried with a specific pixel coordinate, it returns that pixel's color.
The core of our implementation is a Multilayer Perceptron (MLP) designed to take 2D pixel coordinates as input and output RGB color values. To achieve this, the model utilizes Sinusoidal Positional Encoding (PE), which expands the 2D coordinates into a higher-dimensional representation. This encoding enhances the model’s ability to learn complex spatial patterns by including higher-frequency information.
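As a concrete illustration, here is a minimal sketch of such an encoding in PyTorch; the function name positional_encoding and the exact frequency convention are our own illustrative choices, not necessarily the ones used in our implementation:

import math
import torch

def positional_encoding(x, L):
    """Expand each coordinate with sin/cos terms at L frequencies.

    x: tensor of shape (N, d) with coordinates normalized to [0, 1].
    Returns a tensor of shape (N, d * (1 + 2 * L)): the raw coordinates
    concatenated with sin(2^k * pi * x) and cos(2^k * pi * x) for
    k = 0, ..., L - 1.
    """
    encodings = [x]
    for k in range(L):
        freq = (2.0 ** k) * math.pi
        encodings.append(torch.sin(freq * x))
        encodings.append(torch.cos(freq * x))
    return torch.cat(encodings, dim=-1)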
The MLP consists of multiple fully connected layers with non-linear activations (ReLU), followed by a final Sigmoid activation layer to ensure that the output RGB values remain in the range [0, 1]. Normalization of input image data is critical—pixel values are scaled from [0, 255] to [0, 1], aligning with the output constraints. For hyperparameters, we chose to use an Adam optimizer with lr = 1e-2, batch_size = 10000, and num_iterations = 2000.
Model Architecture
This architecture allows the MLP to represent complex spatial patterns and predict the color of any pixel in the image.
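A compact sketch of what such an MLP could look like in PyTorch, reusing the positional_encoding helper above; the layer count and hidden width here are illustrative rather than the exact values from the architecture figure:

import torch
import torch.nn as nn

class Nerf2D(nn.Module):
    """MLP mapping positionally encoded 2D pixel coordinates to RGB."""

    def __init__(self, L=10, hidden_dim=256, num_layers=4):
        super().__init__()
        self.L = L
        in_dim = 2 + 2 * 2 * L  # raw (x, y) plus sin/cos at L frequencies
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers += [nn.Linear(dim, 3), nn.Sigmoid()]  # keep RGB in [0, 1]
        self.mlp = nn.Sequential(*layers)

    def forward(self, xy):
        # positional_encoding is the helper sketched above
        return self.mlp(positional_encoding(xy, self.L))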
To optimize training on high-resolution images, we implemented a custom dataloader that randomly samples a subset of pixels at each iteration. This keeps training sample-efficient and lets the model learn what the image looks like without ever seeing every pixel. The idea becomes even more important for 3D NeRFs, where feeding the model every possible point is intractable.
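A minimal sketch of such a random-pixel dataloader, assuming a PyTorch image tensor already scaled to [0, 1]; the function name and the coordinate-normalization convention are our own choices:

import torch

def sample_pixels(image, batch_size=10_000):
    """Randomly sample pixel coordinates and colors from an H x W x 3 image.

    Returns (coords, colors): coords are (x, y) normalized to [0, 1] so they
    can be fed to the positional encoding; colors are the ground-truth RGB.
    """
    H, W, _ = image.shape
    idx = torch.randint(0, H * W, (batch_size,))
    ys, xs = idx // W, idx % W
    coords = torch.stack([xs / W, ys / H], dim=-1)
    colors = image[ys, xs]
    return coords, colors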
How It Works:
At each training iteration, the dataloader samples N random pixels from the image.
Loss Function:
We used Mean Squared Error (MSE) as the loss function to quantify the difference between the predicted RGB values and the ground truth. MSE is simple and effective for measuring pixel-wise reconstruction quality.
Optimizer:
The model was trained using the Adam optimizer with a learning rate of 1 × 10⁻². Adam's adaptive learning rate ensures efficient convergence even with non-linear activations and positional encodings.
Metric:
While MSE measures reconstruction accuracy during training, Peak Signal-to-Noise Ratio (PSNR) is used as the primary evaluation metric. PSNR is better suited for comparing image quality and is calculated from the MSE using the formula:
PSNR = 20 × log10(1 / sqrt(MSE))
PSNR values provide a clearer picture of the network’s ability to reconstruct fine details as training progresses.
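For reference, a small helper implementing this conversion, assuming mse is a PyTorch tensor and images are normalized to [0, 1]:

import torch

def psnr(mse):
    """Peak Signal-to-Noise Ratio for images in [0, 1]; equals -10 * log10(MSE)."""
    return 20.0 * torch.log10(1.0 / torch.sqrt(mse))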
We then trained the model using our dataloader, and the predicted image gets progressively better as the number of training steps increases.
We can see this improvement in quality reflected in the PSNR as well through a quantitative graph:
PSNR for Fox Image
We also ran the model on a different image using the same architecture and hyperparameters; the method should work well for other images, too.
PSNR for Temple Image
Exploring Hyperparameters:
We experimented with two key hyperparameters:
Varying L
After trying values of L = [0, 2, 5, 10, 20, 40], we saw that increasing L generally improves the network's ability to capture fine details, but it also increases model complexity, which can slow training and lead to overfitting if set too high.
Varying LR
After trying values of lr = [1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4], we saw that learning rates between 0.01 and 0.001 yielded the best results. This is likely because the model never settles into a minimum when the learning rate is too high, and converges toward one too slowly when it is too low.
We can now move on to fully 3D NeRFs, which take in 2D images from different perspectives and generate a 3D structure from them. The approach is similar to the 2D case in that we still query the model for the color at a particular point, but now the query is a 3D point together with a ray direction, and the model returns both the color and the density at that point.
To convert camera coordinates to world coordinates, we use the camera's extrinsic matrix, which encapsulates the rotation and translation needed to align the camera's coordinate system with the world coordinate system. Specifically:
Relationship between world coordinates and camera coordinates
We can use the inverse of the matrix above to get the "c2w" matrix, which is luckily already provided to us in the training data.
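A small NumPy sketch of applying the c2w matrix to points expressed in the camera frame; the helper name and the homogeneous-coordinate handling are our own:

import numpy as np

def camera_to_world(points_cam, c2w):
    """Transform 3D points from camera coordinates to world coordinates.

    points_cam: (N, 3) points in the camera frame.
    c2w:        (4, 4) camera-to-world matrix, i.e. the inverse of the
                world-to-camera extrinsic [R | t].
    """
    ones = np.ones((points_cam.shape[0], 1))
    points_h = np.concatenate([points_cam, ones], axis=-1)  # homogeneous coords
    return (points_h @ c2w.T)[:, :3]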
Pixel coordinates in an image are converted to camera coordinates using the camera's intrinsic matrix. This matrix accounts for the focal length and principal point offset of the camera.
Camera Intrinsic Matrix, K
We can construct such a matrix and use it to convert image pixels into camera coordinates.
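A sketch of how such a matrix might be built and inverted to back-project pixels, assuming a pinhole camera with the principal point at the image center; the helper names are hypothetical:

import numpy as np

def intrinsic_matrix(focal, H, W):
    """Build K from the focal length, with the principal point at the image center."""
    return np.array([[focal, 0.0,   W / 2.0],
                     [0.0,   focal, H / 2.0],
                     [0.0,   0.0,   1.0]])

def pixel_to_camera(K, uv, depth=1.0):
    """Back-project (N, 2) pixel coordinates to camera-frame points at the given depth."""
    ones = np.ones((uv.shape[0], 1))
    uv_h = np.concatenate([uv, ones], axis=-1)        # homogeneous pixel coords
    return depth * (np.linalg.inv(K) @ uv_h.T).T      # (N, 3) camera coordinates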
To convert pixel coordinates into rays:
The ray origin (ray_o) is the camera's position in world coordinates, obtained directly from the extrinsic matrix.
The ray direction (ray_d) is computed as the normalized vector pointing from the ray origin to the transformed world coordinate of the pixel.
This results in a set of rays (ray_o, ray_d) for each pixel in the image. We then combine this with the color associated with each queried pixel to return the (ray_o, ray_d, color) tuples.
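Putting the two previous sketches together, one possible pixel_to_ray helper; this is illustrative and reuses pixel_to_camera and camera_to_world from above, and depending on the dataset's camera convention axis flips may additionally be needed:

import numpy as np

def pixel_to_ray(K, c2w, uv):
    """Convert (N, 2) pixel coordinates into (ray_o, ray_d) in world coordinates."""
    # Ray origin: the camera center, i.e. the translation part of c2w.
    ray_o = np.broadcast_to(c2w[:3, 3], (uv.shape[0], 3))
    # Back-project pixels to depth 1 in the camera frame, move them to world
    # coordinates, then normalize the direction from the origin to those points.
    pts_cam = pixel_to_camera(K, uv, depth=1.0)
    pts_world = camera_to_world(pts_cam, c2w)
    ray_d = pts_world - ray_o
    ray_d = ray_d / np.linalg.norm(ray_d, axis=-1, keepdims=True)
    return ray_o, ray_d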
To efficiently train the NeRF model, we sample rays from images by selecting random pixels. This involves:
Selecting random pixels across the training images.
Computing the corresponding rays (ray_o and ray_d) for the selected pixels.
This approach ensures that the model is trained on diverse parts of the image while keeping memory usage manageable.
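A sketch of what this ray sampler could look like, assuming NumPy arrays and a single intrinsic matrix K shared by all cameras; the grouping of pixels by source image is our own implementation choice:

import numpy as np

def sample_rays(images, K, c2ws, num_rays=10_000):
    """Sample random rays and their ground-truth colors from a set of images.

    images: (M, H, W, 3) training images scaled to [0, 1].
    c2ws:   (M, 4, 4) camera-to-world matrices, one per image.
    """
    M, H, W, _ = images.shape
    img_idx = np.random.randint(0, M, size=num_rays)
    xs = np.random.randint(0, W, size=num_rays)
    ys = np.random.randint(0, H, size=num_rays)
    colors = images[img_idx, ys, xs]

    rays_o = np.empty((num_rays, 3))
    rays_d = np.empty((num_rays, 3))
    # Offset by 0.5 so rays pass through pixel centers.
    uv = np.stack([xs + 0.5, ys + 0.5], axis=-1)
    for i in np.unique(img_idx):                   # group pixels by source image
        mask = img_idx == i
        o, d = pixel_to_ray(K, c2ws[i], uv[mask])  # from the sketch above
        rays_o[mask], rays_d[mask] = o, d
    return rays_o, rays_d, colors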
To feed the NeRF model, we sample 3D points along each ray by generating points spaced equally along it. This can be done by generating t values equally spaced over a specified depth range and computing x = ray_o + t * ray_d for each t.
To prevent systematic overfitting, we add some randomness to the t values so that we do not keep sampling the exact same points at every iteration.
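A possible implementation of this perturbed sampling; the near and far bounds below are placeholders for the scene's actual depth range:

import numpy as np

def sample_points_along_rays(rays_o, rays_d, near=2.0, far=6.0,
                             n_samples=64, perturb=True):
    """Sample 3D points along each ray via x = ray_o + t * ray_d.

    t values are spaced evenly in [near, far]; when perturb is True, each t
    is jittered within its interval so the same depths are not reused every
    iteration.
    """
    t = np.linspace(near, far, n_samples)                    # (n_samples,)
    t = np.broadcast_to(t, (rays_o.shape[0], n_samples)).copy()
    if perturb:
        t += np.random.uniform(0.0, (far - near) / n_samples, size=t.shape)
    points = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]
    return points, t                                         # (N, n_samples, 3)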
In this section, we integrate the components we've built so far to form a complete pipeline for sampling rays and visualizing them.
Now that we've implemented the ray sampling functionality, we can query a set of random rays from the input images. For each ray, we can sample a set of points along its path to be fed into the model. This process ensures that we have a diverse set of rays and corresponding 3D points that can be used for training the NeRF model.
Specifically, the ray sampler generates rays by selecting random pixels from the input image, converting these pixel coordinates into rays, and then sampling points along each ray. The resulting rays and sampled points are used as input for the model, helping it learn to generate realistic 3D reconstructions.
Once we have sampled the rays and points, we can visualize the results using Viser, an open-source visualization tool developed by NerfStudio. Viser provides a convenient way to visualize rays and sampled points in 3D space: it renders the rays as lines and the sampled points as small spheres, making it easy to inspect the ray sampling process.
Viser also allows us to interactively explore the rays and sampled points, providing a deeper understanding of how the rays traverse the 3D scene. This visualization tool is essential for debugging and optimizing the sampling process.
We visualized the rays sampled randomly (from a random subsample of ~20 images) as well as the rays sampled from a single image for visualization purposes:
Rays Sampled From Multiple Images
Rays Sampled From One Image
Close-Up of Single-Image Sampled Rays
In this section, we build a neural network model that extends the concepts from Part 1, but with key differences suited for 3D space. We list the differences as follows:
The specific model architecture details are shown as follows:
Model Architecture
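Since the exact architecture is given in the figure above, the sketch below is only indicative: it shows one common way to structure such a network, with a density head (ReLU, so density stays non-negative) and a color head (Sigmoid) conditioned on the encoded ray direction, reusing positional_encoding from the 2D part. Layer counts and widths are illustrative:

import torch
import torch.nn as nn

class Nerf3D(nn.Module):
    """Maps an encoded 3D point and ray direction to (RGB, density)."""

    def __init__(self, L_pos=10, L_dir=4, hidden_dim=256):
        super().__init__()
        self.L_pos, self.L_dir = L_pos, L_dir
        pos_dim = 3 + 2 * 3 * L_pos
        dir_dim = 3 + 2 * 3 * L_dir
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.ReLU())
        self.color_head = nn.Sequential(
            nn.Linear(hidden_dim + dir_dim, hidden_dim // 2), nn.ReLU(),
            nn.Linear(hidden_dim // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(positional_encoding(xyz, self.L_pos))
        sigma = self.density_head(h)
        rgb = self.color_head(
            torch.cat([h, positional_encoding(view_dir, self.L_dir)], dim=-1))
        return rgb, sigma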
Volume rendering involves using the model's output to generate a realistic image by simulating how light travels through a scene. Given the model's output—RGB values and density at sampled points along each ray—we apply a rendering equation that takes into account the density and color along the ray to see what a view from that angle would look like. Explicitly we calculate it with the following:
Discrete Volumetric Rendering Equation
This is a discrete (and thus computable) version of the true volumetric rendering equation, which would otherwise involve an integral.
Using this function, we can render what the image would look like from a specified angle, and use it to visualize how the model's generated image evolves throughout training.
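A sketch of this discrete rendering step in PyTorch; the function name volrend, the 1e-10 numerical-stability term, and the uniform step_size are our own choices:

import torch

def volrend(sigmas, rgbs, step_size):
    """Discrete volume rendering along each ray.

    sigmas:    (N_rays, n_samples, 1) densities from the network.
    rgbs:      (N_rays, n_samples, 3) colors from the network.
    step_size: spacing delta between consecutive samples along the ray.
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)                    # (N, S, 1)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j): probability the ray
    # reaches sample i without being absorbed earlier.
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(
        torch.cat([ones, 1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alphas                                         # (N, S, 1)
    return (weights * rgbs).sum(dim=1)                               # (N, 3)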
After 100 Steps
After 200 Steps
After 300 Steps
After 400 Steps
After 500 Steps
After 1000 Steps
After 2500 Steps
We can also track the PSNR values on the validation set throughout training and see it steadily increasing over time before reaching >24 PSNR near the end of training.
After fully training the network, we can also visualize the scene in a full 360° sweep using a novel test set of camera angles. We then generate a GIF by combining all the rendered images into a sequence.
Final Result
B&W Result
Lastly, for Bells and Whistles we can change the background color of the scene by modifying our volumetric rendering function as follows:
Original Volume Rendering Formula
Modified Volume Rendering Formula
where c_bg represents our desired background color.
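One way this modification could look in code, building on the volrend sketch above: whatever transmittance remains after the last sample along a ray is assigned to the background color.

import torch

def volrend_with_background(sigmas, rgbs, step_size, bg_color):
    """Volume rendering where rays that never hit the object show bg_color.

    bg_color: tensor of shape (3,) with the desired background RGB in [0, 1].
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(
        torch.cat([ones, 1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alphas
    # Leftover transmittance after the final sample goes to the background.
    leftover = 1.0 - weights.sum(dim=1)                  # (N_rays, 1)
    return (weights * rgbs).sum(dim=1) + leftover * bg_color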