Paper Review - DreamFusion: Text-to-3D using 2D Diffusion

DreamFusion represents a breakthrough in text-to-3D synthesis by leveraging pretrained 2D text-to-image diffusion models to generate high-quality 3D assets without any 3D training data. This novel approach circumvents the limitation of scarce 3D datasets by distilling knowledge from large-scale 2D models, enabling the creation of coherent 3D objects and scenes from natural language descriptions.

Key Innovation

Tags: 3D synthesis text-to-3D diffusion models neural radiance fields computer vision

Paper Review - A Primer on the Inner Workings of Transformer-Based Language Models

This work surveys the current techniques used to interpret the inner workings of transformer language models. Summarizing a summary is a challenge, but we will give it a try nonetheless. The paper starts by introducing the transformer architecture and notation. The authors adopt the residual stream perspective on interpretability: in this view, each input embedding gets updated via vector additions from the attention and feed-forward blocks, producing a sequence of residual stream states. The final-layer residual stream state is then projected into vocabulary space via the unembedding matrix and normalized via softmax to obtain a distribution over the vocabulary, from which the output token is sampled.
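The residual stream view can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's code: the block functions, dimensions, and random weights are all placeholder assumptions, chosen only to show how each block *adds* its output to the stream before the unembedding projection.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
d_model, vocab_size, n_layers = 16, 100, 4
rng = np.random.default_rng(0)

def attn_block(x):
    # Stand-in for an attention block's output: a vector written
    # into the residual stream (real blocks depend on all positions).
    return rng.standard_normal(d_model) * 0.01

def ffn_block(x):
    # Stand-in for a feed-forward block's output.
    return rng.standard_normal(d_model) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Residual stream perspective: each block contributes additively.
x = rng.standard_normal(d_model)      # input embedding
for _ in range(n_layers):
    x = x + attn_block(x)             # attention writes into the stream
    x = x + ffn_block(x)              # feed-forward writes into the stream

# Final state is projected into vocabulary space and normalized.
W_U = rng.standard_normal((d_model, vocab_size))  # unembedding matrix
probs = softmax(x @ W_U)              # distribution over the vocabulary
```

Because every update is a plain vector addition, the final state decomposes exactly into a sum of per-block contributions, which is what makes this framing convenient for attribution analyses.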

Transformer Architecture

Transformer Layer

Layer Norm

A common operation used to stabilize training, today typically applied before the attention block due to empirical success. Given a representation $z$, layer norm computes $\frac{z-\mu(z)}{\sigma(z)} \odot \gamma + \beta$, where $\mu$ and $\sigma$ compute the mean and standard deviation, and $\gamma \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$ are a learned element-wise scaling and bias respectively. The authors discuss a geometric interpretation of layer norm: the mean subtraction is a projection onto the hyperplane defined by the normal vector $[1,1,\dots,1] \in \mathbb{R}^d$, and the subsequent scaling maps the resulting representation onto a hypersphere (look into this further).
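The geometric reading above can be checked numerically. A minimal sketch, assuming identity $\gamma$ and zero $\beta$ for simplicity: mean subtraction coincides with projecting out the all-ones direction, and the normalized vector lands on a hypersphere of radius $\sqrt{d}$.

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
z = rng.standard_normal(d)
gamma, beta = np.ones(d), np.zeros(d)  # identity scaling, zero bias (assumption)

def layer_norm(z, gamma, beta, eps=1e-8):
    mu, sigma = z.mean(), z.std()
    return (z - mu) / (sigma + eps) * gamma + beta

# Mean subtraction == projection onto the hyperplane with normal [1,...,1]:
ones = np.ones(d)
proj = z - (z @ ones) / (ones @ ones) * ones  # project out the ones direction
assert np.allclose(proj, z - z.mean())

# After dividing by the std, the result lies on a sphere of radius sqrt(d),
# since std(z) = ||z - mean(z)|| / sqrt(d).
y = layer_norm(z, gamma, beta)
assert np.isclose(np.linalg.norm(y), np.sqrt(d), atol=1e-4)
```

With learned $\gamma$ and $\beta$, the output is this sphere stretched element-wise and translated, which is why the projection-plus-sphere picture only describes the normalization step itself.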

Attention Block

The attention block is composed of multiple attention heads. At a decoding step $i$, each attention head reads from the residual streams across previous positions, decides which positions to attend to, gathers information from those positions, and finally writes it into the current residual stream. Using tensor rearrangement operations, one can decompose the block's output into per-head and per-position contributions, which simplifies the analysis of what each head writes into the residual stream.
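The read/gather/write cycle of a single head can be sketched as follows. All names, dimensions, and weights here are illustrative assumptions; the point is that the head's write is a weighted sum over source positions, so per-position contributions fall out directly.

```python
import numpy as np

# Toy single attention head at decoding step i, reading from the
# residual streams at positions <= i. Dimensions are illustrative.
d_model, d_head, seq_len = 16, 4, 5
rng = np.random.default_rng(2)

X = rng.standard_normal((seq_len, d_model))   # residual stream states
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

i = seq_len - 1
q = X[i] @ W_Q                                     # read the current stream
scores = (X[: i + 1] @ W_K) @ q / np.sqrt(d_head)  # causal: positions <= i
attn = softmax(scores)                             # decide where to attend
gathered = attn @ (X[: i + 1] @ W_V)               # gather information
write = gathered @ W_O                             # map back to d_model
x_i_new = X[i] + write                             # write into the stream

# Rearrangement: the write is a weighted sum of per-position terms
# attn[j] * (X[j] @ W_V @ W_O), giving each source position's contribution.
per_pos = np.einsum("j,jd->jd", attn, (X[: i + 1] @ W_V) @ W_O)
assert np.allclose(per_pos.sum(axis=0), write)
```

Summing `per_pos` over heads and positions recovers the block's full additive update, which is the decomposition the residual stream analysis relies on.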

Tags: transformers language models attention mechanisms NLP

Paper Review - 3D Gaussian Splatting for Real-Time Radiance Field Rendering

3D Gaussian Splatting introduces a groundbreaking approach to novel view synthesis that achieves both state-of-the-art quality and real-time rendering speeds. The method represents scenes using anisotropic 3D Gaussians that are optimized from multi-view images, combining the quality of neural volumetric rendering with the speed of traditional rasterization pipelines.

Key Innovation

The core innovation of 3D Gaussian Splatting is a hybrid representation that bridges the gap between continuous volumetric radiance fields and discrete, explicit primitives:

Tags: computer vision 3D rendering machine learning graphics neural rendering

Retrieval Head Mechanistically Explains Long-Context Factuality

In this paper the researchers show that there exists a special kind of attention head responsible for retrieving information from long contexts. They outline a few intriguing properties that these heads possess:

Tags: mechanistic interpretability model compression summary