Paper Review - CAASL: Amortized Active Causal Induction with Deep Reinforcement Learning

CAASL (Causal Amortized Active Structure Learning) presents an approach for selecting informative interventions in causal discovery tasks without requiring access to the data likelihood. The method bridges the gap between traditional causal structure learning and intervention design by using reinforcement learning to train an amortized policy, offering both rapid inference and sample-efficient causal discovery.

Key Innovation

The core innovations of CAASL address fundamental challenges in causal intervention design:

  • Amortized Intervention Policy: A single transformer-based network that directly proposes interventions without intermediate causal graph inference
  • Likelihood-Free Approach: Eliminates the need to compute or approximate data likelihoods during intervention design
  • Reinforcement Learning Framework: Trains with SAC to maximize cumulative information gain about causal structure
  • Symmetry-Preserving Architecture: A neural architecture that encodes key design-space symmetries for generalization

This combination achieves speed and sample efficiency exceeding those of previous approaches, which typically required either explicit likelihood computations or intermediate causal-graph inference steps.

Implementation

Transformer-Based Policy

The policy leverages a transformer architecture with several key features (a minimal architecture sketch follows the list):

  • Alternating Self-Attention: Attention applied over both the variable axis and the sample axis to ensure permutation equivariance and invariance
  • History Encoding: Represents past interventions and their outcomes in a format that preserves causal information
  • Permutation Invariance: Max pooling over the sample axis ensures that the ordering of data does not affect intervention selection
  • Gaussian-Tanh Distribution: Policy outputs are parameterized as a Gaussian-Tanh distribution over both intervention targets and values
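As a rough illustration of how alternating attention and max pooling fit together, here is a minimal PyTorch sketch. The tensor layout, dimensions, and module names are our assumptions, not the paper's released code.

```python
# Minimal sketch (not the authors' code) of an alternating-attention policy
# trunk with a Gaussian-Tanh head. Dimensions and names are illustrative.
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One block of axial self-attention: first across samples, then across variables."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.sample_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.var_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_samples, n_vars, dim)
        b, n, d, h = x.shape
        # Attend over the sample axis (variables treated as batch).
        xs = x.permute(0, 2, 1, 3).reshape(b * d, n, h)
        xs = xs + self.sample_attn(xs, xs, xs)[0]
        x = xs.reshape(b, d, n, h).permute(0, 2, 1, 3)
        # Attend over the variable axis (samples treated as batch).
        xv = x.reshape(b * n, d, h)
        xv = xv + self.var_attn(xv, xv, xv)[0]
        return xv.reshape(b, n, d, h)

class GaussianTanhPolicy(nn.Module):
    def __init__(self, dim: int = 64, n_blocks: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(AlternatingAttentionBlock(dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, 2)  # per-variable (mean, log_std) of the action

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, n_samples, n_vars, dim) embedding of past data
        x = history
        for block in self.blocks:
            x = block(x)
        x = x.max(dim=1).values            # max-pool over samples -> permutation invariance
        mean, log_std = self.head(x).unbind(-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw = dist.rsample()
        return torch.tanh(raw)             # squashed per-variable action (targets/values)
```

The max pooling over the sample axis is what makes the policy invariant to the ordering of past interventional samples, while the shared attention over variables keeps it equivariant to variable permutations.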

Reward Function

The reward function is derived from the improvement in causal graph estimation:

  • AVICI Integration: Uses a pre-trained amortized causal discovery model (AVICI) that directly estimates graph posteriors
  • Adjacency Matrix Accuracy: Measures improvement in correctly identified edges rather than log-likelihood
  • Telescoping Reward Structure: Each reward is the incremental improvement in expected adjacency accuracy at that step
  • Mathematical Formulation: $R(h_t, I_t, h_{t-1}, \{A, \theta\}) = \mathbb{E}_{q(\hat{A} \mid h_t)}\big[\sum_{i,j} \mathbb{1}[\hat{A}_{ij} = A_{ij}]\big] - \mathbb{E}_{q(\hat{A} \mid h_{t-1})}\big[\sum_{i,j} \mathbb{1}[\hat{A}_{ij} = A_{ij}]\big]$, i.e., the gain in expected adjacency accuracy from step $t-1$ to step $t$ (a NumPy sketch of this computation follows the list)
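A hedged sketch of the telescoping adjacency-accuracy reward: `edge_probs_*` stands in for AVICI's per-edge posterior probabilities $q(A_{ij} = 1 \mid h)$, and the helper name `expected_accuracy` is our own.

```python
import numpy as np

def expected_accuracy(edge_probs: np.ndarray, true_adj: np.ndarray) -> float:
    """E_q[sum_ij 1[A_hat_ij = A_ij]] under independent Bernoulli edges.

    For a true edge (A_ij = 1) the expected indicator is p_ij; for a
    non-edge it is 1 - p_ij, so the expectation has a closed form.
    """
    return float(np.sum(np.where(true_adj == 1, edge_probs, 1.0 - edge_probs)))

def telescoping_reward(edge_probs_t: np.ndarray,
                       edge_probs_prev: np.ndarray,
                       true_adj: np.ndarray) -> float:
    """Reward = expected accuracy after the new intervention minus before it."""
    return expected_accuracy(edge_probs_t, true_adj) - expected_accuracy(edge_probs_prev, true_adj)

# Toy usage: 3-variable graph, posterior sharpens on the true edge (0 -> 1).
A = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])
q_prev = np.full((3, 3), 0.5)
q_t = np.where(A == 1, 0.9, 0.1)
print(telescoping_reward(q_t, q_prev, A))  # positive: the estimate improved
```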

Training Process

The policy is trained through reinforcement learning in simulated environments (a schematic training loop follows the list):

  • Simulator-Based Training: Environments sample causal graphs from a prior for which the ground truth is known
  • REDQ/SAC Algorithm: Uses an off-policy RL algorithm (REDQ, a randomized-ensemble variant of SAC) to improve sample efficiency
  • Hidden-Parameter MDP: Formulates the problem as a HiP-MDP in which the hidden parameters are the unknown causal graph and mechanisms
  • Q-Function Networks: Multiple Q-function approximators with transformer-based history encoding
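To show the HiP-MDP structure concretely, here is a schematic of the training loop. This is our own simplification: the environment, policy, reward, and update are stubs (the paper uses REDQ/SAC with transformer critics and the accuracy reward above), but the control flow of "new hidden graph per episode, sequential interventions within it" is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

class CausalEnvStub:
    """Stand-in simulator: samples a random DAG (the 'hidden parameter') per episode."""
    def __init__(self, n_vars: int = 5):
        self.n_vars = n_vars

    def reset(self):
        self.adj = np.triu(rng.random((self.n_vars, self.n_vars)) < 0.3, k=1).astype(int)
        return []  # empty intervention history

    def step(self, intervention):
        return rng.normal(size=self.n_vars)  # stand-in interventional sample

def propose_intervention(history, n_vars):
    # Placeholder for the amortized transformer policy.
    return rng.integers(n_vars), rng.normal()

def rl_update(transitions):
    # Placeholder for the REDQ/SAC critic and actor updates.
    pass

env = CausalEnvStub()
for episode in range(3):                       # a fresh graph each episode
    history = env.reset()
    transitions = []
    for t in range(10):                        # intervention budget
        target, value = propose_intervention(history, env.n_vars)
        outcome = env.step((target, value))
        reward = 0.0                           # telescoping accuracy reward (see above)
        transitions.append((list(history), (target, value), reward))
        history.append(((target, value), outcome))
    rl_update(transitions)
```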

Results

CAASL demonstrates strong performance across several comparisons:

Synthetic Environment Performance:

  • Superior returns and structural Hamming distance compared to random interventions and observational data
  • Competitive with or outperforming likelihood-based methods such as DiffCBED and SS Finite
  • Converges to optimal intervention strategies in ~20 iterations

Single-Cell Gene Expression Performance:

  • Successfully applies to the complex SERGIO simulator with differential-equation mechanics
  • Handles significant technical noise and data missingness (~74% dropout rate)
  • Outperforms random intervention strategies despite the biological complexity

Generalization Capabilities:

  • Robust to distribution shifts in graph priors (Erdős–Rényi to Scale-Free)
  • Handles changes in noise distributions and mechanism parameters
  • Zero-shot generalization to higher-dimensional problems (up to 3x the training dimensionality)
  • Adapts to intervention types not seen during training

Theoretical Foundations

The method's success is grounded in several key theoretical insights:

  • Connection to Bayesian Experimental Design: CAASL's reward function is related to a lower bound on the Expected Information Gain (EIG)
  • Barber-Agakov Bound: The theoretical foundation is the bound $\mathrm{EIG}(A; \pi_\phi) \geq \mathbb{E}[\log q(A \mid h_t)] + \text{const}$ (a short derivation follows the list below)

  • From Log-Likelihood to Accuracy: Replacing log-likelihood with adjacency matrix accuracy provides a more direct and effective reward signal
  • Bernoulli Representation of Causal Graphs: The posterior treats each potential edge as an independent Bernoulli random variable
  • Feature Integration: Conditioning the transformer through alternating attention ensures that both variable and sample symmetries are preserved
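To make the bound and its constant concrete, here is a standard derivation of the Barber-Agakov lower bound (our reconstruction of the usual argument, not copied from the paper), where $q(A \mid h_t)$ is any approximate posterior, e.g. the one produced by AVICI:

```latex
\begin{align*}
\mathrm{EIG}(A;\pi_\phi)
  &= \mathbb{E}_{p(A,h_t)}\left[\log \frac{p(A \mid h_t)}{p(A)}\right] \\
  &= \mathbb{E}\left[\log q(A \mid h_t)\right]
     + \mathbb{E}_{p(h_t)}\left[\mathrm{KL}\!\left(p(A \mid h_t)\,\|\,q(A \mid h_t)\right)\right]
     + H(A) \\
  &\geq \mathbb{E}\left[\log q(A \mid h_t)\right] + \underbrace{H(A)}_{\text{constant in } \pi_\phi}.
\end{align*}
```

Since the KL term is non-negative and the prior entropy $H(A)$ does not depend on the policy, maximizing $\mathbb{E}[\log q(A \mid h_t)]$ maximizes a lower bound on the EIG; CAASL then swaps this log-probability for expected adjacency accuracy, which the paper argues is a more direct and effective reward signal.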

Key Lessons

  • Amortization Matters: Training a single policy network that generalizes across causal graphs dramatically reduces computational cost compared to per-instance optimization
  • Likelihood-Free Approach: Eliminating likelihood computation enables application to complex domains where likelihoods are intractable
  • Symmetry Encoding: Designing the network architecture to respect fundamental symmetries of the causal discovery problem enables generalization
  • Reinforcement Learning for Design: RL provides an effective framework for sequential decision-making under uncertainty, which is particularly valuable for intervention design
  • Simulator-Based Training: Using simulated environments where the ground truth is known enables effective policy learning that transfers to real-world scenarios
  • Adjacency Accuracy as Reward: Directly rewarding correct graph-structure identification provides a more effective signal than complex information-theoretic quantities

Points to Remember

  • Likelihood-Free vs. Likelihood-Based: Traditional intervention design requires computing p(data | causal model), which is intractable in many real-world settings. CAASL eliminates this requirement, making it applicable to complex biological systems.
  • Bernoulli Graph Representation: Representing the causal graph posterior as independent Bernoulli random variables for each edge provides a probabilistic framework that quantifies uncertainty rather than producing only point estimates.
  • Simulation for Training: While we cannot simulate environments whose structure we do not know, we can train on many simulated environments where we do know the structure, and the learned strategies transfer to new, unknown environments.
  • Adjacency Matrix Accuracy: Using the number of correctly identified edges as the reward is a practical simplification that still drives information gain about the graph structure while being more numerically stable and interpretable.
  • Two Model Roles: The system uses separate models for graph inference (AVICI) and intervention selection (CAASL), with the policy learning to propose interventions that help the inference model discover the true graph.

Tags: reinforcement learning causality graph learning

Paper Review - pixelNeRF: Neural Radiance Fields from One or Few Images

pixelNeRF presents a learning framework that enables predicting Neural Radiance Fields (NeRF) from just one or a few images in a feed-forward manner. This approach overcomes key limitations of the original NeRF, which requires many calibrated views and significant per-scene optimization time, by introducing an architecture that conditions a neural radiance field on image features in a fully convolutional manner.

Key Innovation

The core insight of pixelNeRF is that by conditioning a neural radiance field on image features, the network can learn scene priors across multiple scenes, enabling it to:

  • Synthesize novel views from as few as one input image in a single feed-forward pass
  • Generalize to new scenes without the lengthy per-scene optimization of the original NeRF
  • Condition on image features in a fully convolutional, pixel-aligned manner

A minimal sketch of this pixel-aligned conditioning follows.
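This is our illustration, not the released pixelNeRF code: a query point is projected into the input view with the camera intrinsics `K`, a pixel-aligned CNN feature is bilinearly sampled there, and an MLP maps the raw 3D point (positional encoding omitted for brevity) plus that feature to color and density. The network shapes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelConditionedNeRF(nn.Module):
    def __init__(self, feat_dim: int = 64, pos_dim: int = 3):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)  # stand-in CNN
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 4),  # (r, g, b, sigma)
        )

    def forward(self, image: torch.Tensor, points: torch.Tensor, K: torch.Tensor):
        # image: (1, 3, H, W); points: (N, 3) in camera coords; K: (3, 3) intrinsics
        feats = self.encoder(image)                        # (1, C, H, W)
        uv = (K @ points.T).T                              # project points to pixels
        uv = uv[:, :2] / uv[:, 2:3]
        h, w = image.shape[-2:]
        grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
        sampled = F.grid_sample(feats, grid.view(1, -1, 1, 2), align_corners=True)
        sampled = sampled.view(feats.shape[1], -1).T       # (N, C): one feature per point
        out = self.mlp(torch.cat([points, sampled], dim=-1))
        rgb, sigma = torch.sigmoid(out[:, :3]), F.relu(out[:, 3])
        return rgb, sigma

# Toy usage with a simple pinhole camera looking down +z:
model = PixelConditionedNeRF()
img = torch.rand(1, 3, 32, 32)
pts = torch.rand(5, 3) + torch.tensor([0.0, 0.0, 1.0])    # points in front of the camera
K = torch.tensor([[32.0, 0, 16], [0, 32.0, 16], [0, 0, 1]])
rgb, sigma = model(img, pts, K)
```

Because the feature is sampled at the point's projection, nearby 3D points reuse local image evidence, which is what lets a prior learned across scenes transfer to a new scene from a single view.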

Tags: 3D rendering neural radiance fields novel view synthesis computer vision machine learning

Paper Review - The Geometry of Categorical and Hierarchical Concepts in Large Language Models

In this work the authors attempt to extend the linear representation hypothesis to cases where concepts no longer have natural binary contrasts and hence do not admit a straightforward notion of direction (e.g., {male, female} vs. {is_animal}). They extend the theory to binary, categorical, and hierarchical concepts by replacing the notion of a representation as a direction with that of a representation as a vector. The authors summarize their main contributions as:

  1. Showing that categorical concepts are represented as polytopes where each vertex is the vector representation of one of the elements of the concept.
  2. Showing that the semantic hierarchy between concepts is encoded geometrically as orthogonality between representations.
  3. Validating these results on Gemma and LLaMA-3 using the WordNet hierarchy.

Preliminaries

Concepts

The authors formalize a concept as a latent variable W that is caused by the context X and causes the output Y. Further, they introduce the notion of causally separable concepts: concepts W and Z are causally separable if the potential outcome $Y(W=w, Z=z)$ is well defined for all w, z. That is, two concepts are causally separable if they can be manipulated freely and independently of each other.

The Causal Inner Product

Following the results of Park et al., the authors unify the two distinct representation spaces in play, the context embeddings and the token unembeddings, by applying an invertible affine transformation to each. The Euclidean inner product in the transformed spaces is the causal inner product: an inner product under which causally separable concepts are orthogonal, and under which the Riesz isomorphism between the embedding and unembedding spaces is the vector transpose operation. A small sketch of one standard choice of this transformation follows.
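As an illustration, one concrete choice along the lines of Park et al. whitens the unembedding vectors, mapping $\gamma \mapsto \mathrm{Cov}(\gamma)^{-1/2}(\gamma - \mathbb{E}[\gamma])$ so that the Euclidean inner product afterwards behaves as a causal inner product. The matrix sizes and variable names here are our assumptions for the sketch.

```python
# Sketch (our illustration) of the whitening transform behind the causal
# inner product: unembedding vectors g are mapped to Cov(g)^{-1/2} (g - mean).
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5_000, 256))          # stand-in unembedding matrix (vocab x dim)

mu = G.mean(axis=0)
cov = np.cov(G - mu, rowvar=False)

# Inverse matrix square root via eigendecomposition (cov is symmetric PSD).
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # cov^{-1/2}

def causal_inner_product(g1: np.ndarray, g2: np.ndarray) -> float:
    """Euclidean inner product of the two vectors after the whitening map."""
    return float(((g1 - mu) @ W) @ ((g2 - mu) @ W))

print(causal_inner_product(G[0], G[1]))
```

Under this transform, the claim being tested empirically is that vectors for causally separable concepts come out (approximately) orthogonal.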

Representation of concepts

Categorical concepts

Definition 1

Tags: language models representation geometry AI interpretability concept learning

Paper Review - Targeted Cause Discovery with Data-Driven Learning

The paper introduces Targeted Cause Discovery with Data-Driven Learning (TCD-DL), a machine learning-based approach to identify all causal variables—direct and indirect—of a target variable in large-scale systems, such as gene regulatory networks (GRNs). Traditional causal discovery methods often falter due to scalability issues and error propagation when tracing indirect causes. TCD-DL overcomes these challenges by leveraging a pre-trained neural network and a scalable inference strategy.

Key Components and Methodology
  • Pre-Trained Feature Extractor:
    The core of TCD-DL is a feature extractor, implemented as a multi-layer axial Transformer. This component is trained on simulated graphs with diverse structures (e.g., Erdős–Rényi, Scale-Free) to learn generalizable features of causal relationships. These features are then used by a score calculator to infer causal structures in new systems. The training assumes some similarity in causal dynamics between the simulated graphs and real-world systems, but the learned features are transferable, allowing the model to generalize to unseen graphs like biological networks.

  • Scalability via Local Inference:
    TCD-DL tackles large systems by subsampling the data into smaller subsets. It performs local inference on each subset and aggregates the results through ensembling, reducing computational complexity from exponential or quadratic to linear. This makes it practical for systems with thousands of variables (a schematic of this subsample-and-ensemble loop follows).
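A schematic of the divide-and-ensemble inference, heavily simplified from the paper: causal-relevance scores are computed on many small variable subsets that always include the target, then averaged. The helper `local_score_fn` stands in for the pre-trained axial-Transformer feature extractor plus score calculator; all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def targeted_cause_scores(data: np.ndarray, target: int,
                          local_score_fn, subset_size: int = 20,
                          n_subsets: int = 200) -> np.ndarray:
    """Average local causal scores for `target` over random variable subsets."""
    n_vars = data.shape[1]
    scores = np.zeros(n_vars)
    counts = np.zeros(n_vars)
    candidates = np.array([v for v in range(n_vars) if v != target])
    for _ in range(n_subsets):
        subset = rng.choice(candidates, size=subset_size - 1, replace=False)
        cols = np.concatenate(([target], subset))
        local = local_score_fn(data[:, cols], target_index=0)  # per-subset scores
        scores[subset] += local[1:]
        counts[subset] += 1
    return np.divide(scores, counts, out=np.zeros_like(scores), where=counts > 0)

# Toy usage with a correlation-based stand-in for the learned scorer:
def corr_score(sub_data, target_index=0):
    c = np.corrcoef(sub_data, rowvar=False)
    return np.abs(c[target_index])

X = rng.normal(size=(500, 100))
scores = targeted_cause_scores(X, target=7, local_score_fn=corr_score)
```

The cost grows linearly in the number of subsets rather than with the full joint graph, which is the source of the scalability claim.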

Tags: machine learning causal discovery gene regulatory networks scalable algorithms neural networks

Paper Review - TabPFN: Understanding and Advancing Tabular Foundation Models

The Core Problem

Traditional deep learning has revolutionized domains like computer vision and NLP, but tabular data remains dominated by classical approaches like gradient-boosted trees. This stems from tabular data’s unique challenges: heterogeneous features, complex dependencies, and small dataset sizes.

Tags: machine learning tabular data deep learning transformer architecture distributional prediction