tl;dr
We introduce an initialization for MLP networks that works for any activation function, thus generalizing Xavier [1] and Kaiming [2] initialization (which were derived for Tanh and ReLU respectively).
This is important for implicit neural representations (INRs), as the literature uses (and keeps proposing) activation functions that are very different from the classic ones.
SIREN [4], which uses sine activations and provides an initialization tailored to them, demonstrates the importance of matching the initialization to the activation function, and we build heavily on its approach.
We show that our initialization does a better job than previous initializations (it maintains stable variance in both the forward and backward passes rather than just one), and demonstrate significant performance improvements on INR tasks.
Our Initialization
Consider a general MLP network with activation function \(f\):
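Concretely, writing out the setup these formulas assume (our restatement; \(M_i\) denotes the fan-in of layer \(i\), i.e. the dimension of \(x_i\), and any biases are taken to be initialized to zero), each layer acts as
\[ z_i = W_i x_i + b_i, \qquad x_{i+1} = f(z_i), \]
so \(x_i\) is the input to layer \(i\) and \(z_i\) its preactivations.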
A standard approach to initializing networks is to ensure that the variance of the preactivations is the same at every layer, and that the variance of the gradients of the loss w.r.t. the preactivations is the same at every layer.
We first show that the preactivations at layer \(i\) (i.e. the elements of \(z_i\)) converge in distribution to
\[ \mathcal{N}\left(0, M_i\left(\mu^2(x_i)+\sigma^2(x_i)\right)\sigma^2(W_i)\right) \]
(note that this was first shown by Kumar (2017) [3]; however, we provide a more rigorous proof by generalizing Sitzmann et al.'s proof for SIREN networks [4]).
This means that the variance of the preactivations at layer \(i\) (\(z_i\)) depends on the variance of the weights at that layer (\(W_i\)) and on the distribution
of the input to that layer (\(x_i\)). The key to our approach is to set the distribution of the preactivations at each layer to \(\mathcal{N}\left(0, \sigma_p^2\right)\),
where \(\sigma_p\) is a value we choose. To do this, we initialize the weights at layer \(i\) to have variance
\[ \sigma^2(W_i) = \frac{\sigma_p^2}{M_i\left(\mu^2(x_i)+\sigma^2(x_i)\right)} \]
where the statistics of the output of the previous layer can be computed using the fact that the preactivations in that layer have been set to have distribution
\(\mathcal{N}\left(0, \sigma_p^2\right)\):
\[ \mu(x_i) = \mu(f(z_{i-1})) = \mathbf{E}_{z\sim \mathcal{N}(0,\sigma^2_p)}\left[f(z)\right] \]
\[ \sigma^2(x_i) = \sigma^2(f(z_{i-1})) = \mathbf{Var}_{z\sim \mathcal{N}(0,\sigma^2_p)}\left[f(z)\right]. \]
Depending on \(f\), calculating these statistics analytically can be difficult.
We instead use Monte Carlo sampling, which we show is efficient and accurate, unlike previous approaches.
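As a rough sketch of what this looks like in practice (our own NumPy illustration, not the paper's reference code; `forward_init_std` and its parameters are placeholder names), note that \(\mu^2(x_i)+\sigma^2(x_i)\) is simply the second moment \(\mathbf{E}_{z\sim\mathcal{N}(0,\sigma_p^2)}[f(z)^2]\):

```python
import numpy as np

def forward_init_std(f, fan_in, sigma_p=1.0, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of the weight std satisfying the forward-pass condition."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma_p, n_samples)   # previous-layer preactivations, z ~ N(0, sigma_p^2)
    x = f(z)                                  # inputs to the current layer, x_i = f(z_{i-1})
    second_moment = np.mean(x ** 2)           # mu^2(x_i) + sigma^2(x_i) = E[f(z)^2]
    var_w = sigma_p ** 2 / (fan_in * second_moment)
    return np.sqrt(var_w)

# e.g. a sine activation (as in SIREN) with fan-in M_i = 256 and sigma_p = 1;
# the layer's weights would then be drawn i.i.d. from N(0, w_std^2)
w_std = forward_init_std(np.sin, fan_in=256, sigma_p=1.0)
```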
Similarly, we derive the condition for the backward pass:
\[ \sigma^2(W_i) = \frac{1}{M_{i+1}\left(\mu^2(f'(z_i)) + \sigma^2(f'(z_i))\right)}. \]
Unlike previous approaches (e.g. Xavier, Kaiming), we can make the conditions for the forward and backward pass hold simultaneously, because \(\sigma_p\) is a free parameter for us to set.
Thus we need to find a \(\sigma_p\) that satisfies
\[ \sigma_p^2 \frac{M_{i+1}}{M_i} \frac{\mu^2(f'(z_i)) + \sigma^2(f'(z_i))}{\mu^2(x_i)+\sigma^2(x_i)} = 1. \]
This is non-trivial to do analytically, as many of the terms are expectations over the distribution \(\mathcal{N}\left(0, \sigma_p^2\right)\); instead, we perform a fast grid search
to find the \(\sigma_p\) that makes the left-hand side as close to 1 as possible.
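A minimal sketch of such a grid search (again our own illustration; `find_sigma_p`, the grid range, and the sample count are arbitrary placeholder choices), where `fan_out` plays the role of \(M_{i+1}\):

```python
import numpy as np

def find_sigma_p(f, f_prime, fan_in, fan_out,
                 grid=np.linspace(0.1, 5.0, 500), n_samples=200_000, seed=0):
    """Grid search for the sigma_p whose combined forward/backward condition is closest to 1."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n_samples)        # base samples, rescaled for each candidate sigma_p
    best_sigma_p, best_gap = None, np.inf
    for sigma_p in grid:
        z = sigma_p * u                       # z ~ N(0, sigma_p^2)
        # left-hand side: sigma_p^2 * (M_{i+1}/M_i) * E[f'(z)^2] / E[f(z)^2]
        lhs = (sigma_p ** 2) * (fan_out / fan_in) \
              * np.mean(f_prime(z) ** 2) / np.mean(f(z) ** 2)
        if abs(lhs - 1.0) < best_gap:
            best_sigma_p, best_gap = sigma_p, abs(lhs - 1.0)
    return best_sigma_p

# e.g. sine activations, where f'(z) = cos(z), and equal layer widths:
sigma_p = find_sigma_p(np.sin, np.cos, fan_in=256, fan_out=256)
```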
Deriving Xavier and Kaiming init
Our init is a more general form of Xavier and Kaiming init, so both can easily be derived from ours.
Note that when \(\sigma_p=1\) and the assumptions for Xavier init hold (\(f(x)\approx x \implies \mu(x_i)=0,\sigma^2(x_i)=1\)), our conditions reduce to Xavier init's conditions
\[ \sigma^2(W_i) = \frac{1}{M_i} \text{ (Forward Pass)}\]
\[ \sigma^2(W_i) = \frac{1}{M_{i+1}} \text{ (Backward Pass)} \]
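For completeness, a one-line check (our own substitution, using that \(f(x)\approx x\) also implies \(f'(x)\approx 1\), so \(\mu(f'(z_i))=1\) and \(\sigma^2(f'(z_i))=0\)):
\[ \sigma^2(W_i) = \frac{\sigma_p^2}{M_i\left(0+1\right)} = \frac{1}{M_i}, \qquad \sigma^2(W_i) = \frac{1}{M_{i+1}\left(1+0\right)} = \frac{1}{M_{i+1}}. \]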
Similarly, when \(\sigma_p=1\) and the activation is ReLU, we recover Kaiming init (see the paper for the full explanation; a one-line sketch follows the equations below)
\[ \sigma^2(W_i) = \frac{2}{M_i} \text{ (Forward Pass)}\]
\[ \sigma^2(W_i) = \frac{2}{M_{i+1}} \text{ (Backward Pass)}\]
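As a quick check of the ReLU case (our own one-line derivation, not reproduced from the paper): for \(z \sim \mathcal{N}(0, \sigma_p^2)\) and \(f(z)=\max(0,z)\), symmetry gives
\[ \mu^2(x_i)+\sigma^2(x_i) = \mathbf{E}\left[\max(0,z)^2\right] = \tfrac{1}{2}\sigma_p^2, \qquad \mu^2(f'(z_i))+\sigma^2(f'(z_i)) = \mathbf{E}\left[\mathbf{1}_{z>0}\right] = \tfrac{1}{2}, \]
so with \(\sigma_p=1\) the forward condition becomes \(\sigma^2(W_i)=1/(M_i\cdot\tfrac{1}{2})=2/M_i\), and the backward condition likewise becomes \(2/M_{i+1}\).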
Since neither initialization can satisfy both conditions simultaneously the way ours can, Xavier init takes the average of its two conditions, while Kaiming init suggests using either one.
Comparison to PyTorch, and proper "gain"
PyTorch [5] generalizes Xavier and Kaiming init by introducing a "gain", so that forward-pass Kaiming (which they call fan-in Kaiming) becomes
\[ \sigma^2(W_i) = \text{gain}^2(f)\frac{1}{M_i} \]
While never explicitly defined, it is motivated as a scaling term on the weights' variance that compensates for the activation function, and is often implied to be
\[ \text{gain}^2(f) \approx \frac{\sigma^2(z_i)}{\sigma^2(f(z_i))} \]
which is not well defined. As a result, it is often treated as a hyperparameter, and a value that keeps the variance through the network stable is found by brute-force search.
Our formulation makes it well defined thanks to \(\sigma_p\) (we set the preactivation variance of the current layer given that it has already been set for the previous layer);
thus, for the forward pass,
\[ \text{gain}^2(f, \sigma_p) = \frac{\sigma_p^2}{\mathbf{E}_{z\sim \mathcal{N}(0,\sigma^2_p)}\left[f(z)\right]^2+\mathbf{Var}_{z\sim \mathcal{N}(0,\sigma^2_p)}\left[f(z)\right]}. \]
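For example, estimating this gain by Monte Carlo for ReLU with \(\sigma_p=1\) recovers PyTorch's tabulated value of \(\sqrt{2}\) (a small sketch assuming NumPy and PyTorch are available; `mc_gain` is a placeholder name):

```python
import numpy as np
import torch

def mc_gain(f, sigma_p=1.0, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of gain(f, sigma_p) = sigma_p / sqrt(E[f(z)^2]), z ~ N(0, sigma_p^2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma_p, n_samples)
    return sigma_p / np.sqrt(np.mean(f(z) ** 2))

def relu(z):
    return np.maximum(z, 0.0)

print(mc_gain(relu, sigma_p=1.0))              # ~1.414, i.e. sqrt(2)
print(torch.nn.init.calculate_gain('relu'))    # sqrt(2): PyTorch's tabulated ReLU gain agrees here
```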