Paper Review - CAASL: Amortized Active Causal Induction with Deep Reinforcement Learning
CAASL (Causal Amortized Active Structure Learning) presents an approach for selecting informative interventions in causal discovery tasks without requiring access to the data likelihood. The method bridges the gap between traditional causal structure learning and intervention design by using reinforcement learning to train an amortized policy, offering both rapid inference and sample-efficient causal discovery.

Key Innovation

The core innovations of CAASL address fundamental challenges in causal intervention design:
Amortized Intervention Policy: A single transformer-based network that directly proposes interventions without intermediate causal graph inference
Likelihood-Free Approach: Elimination of the need to compute or approximate data likelihoods during intervention design
Reinforcement Learning Framework: Training with SAC to maximize cumulative information gain about the causal structure
Symmetry-Preserving Architecture: A neural architecture that encodes key design-space symmetries for generalization
This combination yields speed and sample efficiency that exceed those of previous approaches, which typically required either explicit likelihood computations or intermediate causal graph inference steps.

Implementation

Transformer-Based Policy

The policy leverages a transformer architecture with several key features:
Alternating Self-Attention: Applied over both the variable axis and the samples axis to ensure permutation equivariance and invariance
History Encoding: Represents past interventional data and outcomes in a format that preserves causal information
Permutation Invariance: The transformer’s max pooling over samples ensures the ordering of past data doesn’t affect intervention selection
Gaussian-Tanh Distribution: Policy outputs are parameterized as a Gaussian-Tanh distribution over both intervention targets and values (a sketch follows this list)
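Concretely, these pieces can be combined roughly as in the following PyTorch sketch. The module names, feature layout, hidden sizes, and the two-headed Gaussian-Tanh parameterization are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Self-attention applied alternately over the samples axis and the
    variables axis, giving permutation equivariance over both."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_samples = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_vars = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, h):                                    # h: [B, N samples, D variables, E]
        B, N, D, E = h.shape
        # Attend over the samples axis, treating each variable independently.
        x = h.permute(0, 2, 1, 3).reshape(B * D, N, E)
        q = self.norm1(x)
        x = x + self.attn_samples(q, q, q, need_weights=False)[0]
        h = x.reshape(B, D, N, E).permute(0, 2, 1, 3)
        # Attend over the variables axis, treating each sample independently.
        y = h.reshape(B * N, D, E)
        q = self.norm2(y)
        y = y + self.attn_vars(q, q, q, need_weights=False)[0]
        return y.reshape(B, N, D, E)

class CAASLStylePolicy(nn.Module):
    """Maps an interaction history directly to an intervention proposal."""
    def __init__(self, in_features=2, dim=64, depth=2):
        super().__init__()
        self.embed = nn.Linear(in_features, dim)
        self.blocks = nn.ModuleList(AlternatingAttentionBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, 4)   # per variable: (target mean, target log-std, value mean, value log-std)

    def forward(self, history):                              # history: [B, N, D, in_features]
        h = self.embed(history)
        for block in self.blocks:
            h = block(h)
        h = h.max(dim=1).values                              # max-pool over samples: order invariance
        params = self.head(h)                                 # [B, D, 4]
        mean = params[..., 0::2]
        log_std = params[..., 1::2].clamp(-5.0, 2.0)
        base = torch.distributions.Normal(mean, log_std.exp())
        return torch.tanh(base.rsample())                     # Gaussian-Tanh action: [B, D, 2]

# Example: 32 past samples over 5 variables, 2 features per entry
# (observed value, intervened-or-not indicator).
policy = CAASLStylePolicy()
action = policy(torch.randn(1, 32, 5, 2))                     # -> shape [1, 5, 2]
```

Because pooling happens only over the samples axis, the per-variable outputs stay permutation-equivariant in the variables, which is what allows the same policy to be applied to problems with more variables than seen in training.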
Reward Function

The reward function is derived from improvement in causal graph estimation:
AVICI Integration: Uses a pre-trained amortized causal discovery model (AVICI) that directly estimates the graph posterior from the gathered data
Adjacency Matrix Accuracy: Measures improvement in correctly identified edges rather than log-likelihood
Telescoping Reward Structure: Each reward is the incremental improvement in expected accuracy at that step
Mathematical Formulation: R(ht, ht−1, A) = Acc(ht, A) − Acc(ht−1, A), where Acc(h, A) = E_{Â∼q(·|h)}[∑i,j 1[Âij = Aij]] is the expected number of correctly recovered adjacency entries under the AVICI posterior q given history h (a code sketch follows this list)
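A minimal sketch of this reward, assuming the AVICI posterior is exposed as a matrix of per-edge Bernoulli probabilities (the function names here are illustrative, not the released API):

```python
import numpy as np

def expected_accuracy(edge_probs: np.ndarray, true_adj: np.ndarray) -> float:
    """E_{Â~q}[ Σ_ij 1[Â_ij = A_ij] ] under independent Bernoulli edges:
    an edge contributes q_ij when A_ij = 1 and (1 - q_ij) when A_ij = 0."""
    return float(np.sum(np.where(true_adj == 1, edge_probs, 1.0 - edge_probs)))

def telescoping_reward(edge_probs_t, edge_probs_prev, true_adj):
    """Reward at step t: the incremental improvement in expected adjacency accuracy."""
    return expected_accuracy(edge_probs_t, true_adj) - expected_accuracy(edge_probs_prev, true_adj)

# Tiny example: after the latest intervention the posterior concentrates on the
# true 2-node graph 0 -> 1, so the reward is positive.
A = np.array([[0, 1], [0, 0]])
q_prev = np.full((2, 2), 0.5)
q_t = np.array([[0.1, 0.9], [0.2, 0.1]])
print(telescoping_reward(q_t, q_prev, A))   # ≈ 1.5
```

Because the rewards telescope, the undiscounted return of an episode equals the final expected accuracy minus the accuracy obtained from the initial observational data.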
Training Process

The policy is trained through reinforcement learning in simulated environments:
Simulator-Based Training: Environments sample causal graphs from a prior, so the ground truth is known during training
REDQ/SAC Algorithm: Uses an off-policy RL algorithm that improves sample efficiency
Hidden-Parameter MDP: Formulates the problem as a HiP-MDP whose hidden parameters are the unknown causal graph
Q-Function Networks: Multiple Q-function approximators with transformer-based history encoding (an environment sketch follows this list)
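The sketch below illustrates this setup with an assumed Erdős–Rényi prior, linear-Gaussian mechanisms, and a placeholder standing in for the AVICI posterior. It is meant to show the HiP-MDP structure (a hidden graph sampled at reset, a telescoping reward at each step), not the paper's actual synthetic or SERGIO environments.

```python
import numpy as np

class LinearSCMEnv:
    """Sketch of a simulated HiP-MDP for training: the hidden parameters are
    the DAG and mechanism weights sampled afresh at every reset."""

    def __init__(self, d=5, samples_per_step=16, edge_prob=0.3, seed=0):
        self.d, self.n, self.edge_prob = d, samples_per_step, edge_prob
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Sample a fresh DAG (upper-triangular adjacency) and mechanism weights.
        self.adj = np.triu(self.rng.random((self.d, self.d)) < self.edge_prob, k=1).astype(float)
        self.weights = self.adj * self.rng.normal(0.0, 1.0, (self.d, self.d))
        self.history = [self._simulate(intervention=None)]
        self.prev_acc = self._posterior_accuracy()
        return np.concatenate(self.history)

    def step(self, action):
        # action: per-variable (target score, value) in (-1, 1); intervene on the
        # highest-scoring variable, clamping it to the proposed value.
        target = int(np.argmax(action[:, 0]))
        self.history.append(self._simulate(intervention=(target, action[target, 1])))
        acc = self._posterior_accuracy()
        reward, self.prev_acc = acc - self.prev_acc, acc     # telescoping reward
        return np.concatenate(self.history), reward

    def _simulate(self, intervention):
        # Ancestral sampling: parents always have a lower index than their children.
        x = np.zeros((self.n, self.d))
        for j in range(self.d):
            if intervention is not None and j == intervention[0]:
                x[:, j] = intervention[1]                    # hard intervention
            else:
                x[:, j] = x @ self.weights[:, j] + self.rng.normal(0.0, 0.1, self.n)
        return x

    def _posterior_accuracy(self):
        # Placeholder: the real reward queries the pre-trained AVICI posterior on
        # the full history and scores its per-edge probabilities against self.adj.
        q = np.full((self.d, self.d), 0.5)
        return float(np.sum(np.where(self.adj == 1, q, 1.0 - q)))

# Rolling out an episode with a placeholder policy; during training these
# transitions would go to a replay buffer for off-policy SAC/REDQ updates.
env = LinearSCMEnv()
obs = env.reset()
for _ in range(10):
    action = np.random.uniform(-1.0, 1.0, size=(env.d, 2))  # stand-in for the learned policy
    obs, reward = env.step(action)
```

In the full method, REDQ-style updates draw minibatches of (history, action, reward, next history) tuples from the buffer and train the transformer-based Q-functions and the policy off-policy.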
Results

CAASL demonstrates strong performance across the paper's comparisons:
Synthetic Environment Performance:
Higher returns and lower structural Hamming distance than random interventions and purely observational data
Competitive with or outperforming likelihood-based methods such as DiffCBED and SS Finite
Converges to optimal intervention strategies in ~20 iterations
Single-Cell Gene Expression Performance:
Successfully applies to the complex SERGIO simulator with differential-equation-based mechanisms
Handles significant technical noise and data missingness (~74% dropout rate)
Outperforms random intervention strategies despite the biological complexity
Generalization Capabilities:
Robust to distribution shifts in the graph prior (Erdős–Rényi to scale-free)
Handles changes in noise distributions and mechanism parameters
Zero-shot generalization to higher-dimensional problems (up to 3x the training dimensionality)
Adapts to intervention types not seen during training
Theoretical Foundations

The method’s success is grounded in several key theoretical insights:
Connection to Bayesian Experimental Design: CAASL’s reward function is related to a lower bound on the Expected Information Gain (EIG)
Barber-Agakov Bound: The theoretical foundation is the bound EIG(A; πϕ) ≥ E[log q(A | ht)] + const.
From Log-Likelihood to Accuracy: Replacing the log-likelihood term with adjacency-matrix accuracy provides a more direct and effective reward signal
Bernoulli Representation of Causal Graphs: The amortized posterior treats each potential edge as an independent Bernoulli random variable (the bound and this factorization are written out below)
Feature Integration: Alternating attention in the transformer ensures that both variable-level and sample-level symmetries are preserved
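Written out, this is a standard restatement of the Barber-Agakov argument combined with the edge-wise Bernoulli factorization (notation follows the bound quoted above):

```latex
\begin{align*}
\mathrm{EIG}(A;\pi_\phi)
  &= \mathbb{E}_{p(A)\,p(h_t \mid A,\pi_\phi)}\!\left[\log \frac{p(A \mid h_t)}{p(A)}\right] \\
  &\geq \mathbb{E}_{p(A)\,p(h_t \mid A,\pi_\phi)}\!\left[\log q(A \mid h_t)\right] + \mathrm{const.}
  \qquad \text{(drop the nonnegative KL between } p(A \mid h_t) \text{ and } q\text{)} \\[4pt]
\log q(A \mid h_t)
  &= \sum_{i,j} \Bigl( A_{ij}\log q_{ij}(h_t) + (1 - A_{ij})\log\bigl(1 - q_{ij}(h_t)\bigr) \Bigr)
  \qquad \text{(independent Bernoulli edges)}
\end{align*}
```

The constant is the prior entropy of A, which does not depend on the policy. CAASL's reward then swaps the per-edge log term for the expected count of correct entries, ∑i,j (Aij qij + (1 − Aij)(1 − qij)), which pushes the posterior toward the true graph edge by edge while remaining bounded and easy to interpret.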
Key Lessons
Amortization Matters: Training a single policy network that generalizes across causal graphs dramatically reduces computational cost compared to per-instance optimization
Likelihood-Free Approach: Eliminating the need for likelihood computation enables application to complex domains where likelihoods are intractable
Symmetry Encoding: Designing the network architecture to respect fundamental symmetries of the causal discovery problem enables generalization
Reinforcement Learning for Design: RL provides an effective framework for sequential decision-making under uncertainty, particularly valuable for intervention design
Simulator-Based Training: Using simulated environments where the ground truth is known enables effective policy learning that transfers to real-world scenarios
Adjacency Accuracy as Reward: Directly rewarding correct graph structure identification provides a more effective signal than complex information-theoretic quantities
Points to Remember
Likelihood-Free vs. Likelihood-Based: Traditional intervention design requires computing p(data | causal model), which is intractable in many real-world settings. CAASL eliminates this requirement, making it applicable to complex biological systems.
Bernoulli Graph Representation: Representing the causal graph posterior as independent Bernoulli random variables for each edge provides a probabilistic framework that quantifies uncertainty rather than just making point estimates.
Simulation for Training: While we can’t simulate environments whose structure we don’t know, we can train on many simulated environments where we do know the structure, and the learned strategies transfer to new, unknown environments.
Adjacency Matrix Accuracy: Using the number of correctly identified edges as a reward is a practical simplification that still maximizes information gain about the graph structure while being more numerically stable and interpretable.
Two Model Roles: The system uses separate models for graph inference (AVICI) and intervention selection (CAASL), with the policy learning to propose interventions that help the inference model discover the true graph structure (a sketch of this loop follows).
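As a rough illustration of how the two models interact at acquisition time, the loop below uses hypothetical stand-ins for the policy, the AVICI posterior, and the experiment interface; none of these names are the released APIs of either model.

```python
def acquire(policy, avici_posterior, experiment, n_rounds=10):
    """Alternate between proposing interventions (CAASL) and estimating the graph (AVICI)."""
    history = [experiment(None)]                       # start from observational data
    for _ in range(n_rounds):
        intervention = policy(history)                 # amortized: one forward pass, no likelihoods
        history.append(experiment(intervention))       # run the chosen experiment
    edge_probs = avici_posterior(history)              # per-edge posteriors q(A_ij = 1 | history)
    return edge_probs > 0.5, edge_probs                # point estimate plus edge-wise uncertainty
```

Note that the true graph never appears in this loop: ground truth is only needed for the training reward inside the simulator, which is what makes the approach deployable on unknown systems.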