The Mathematics and Philosophy Behind MSE

When we perform linear regression, we’re making two fundamental modeling choices that are worth examining separately:

  1. Assuming our data follows a linear relationship
  2. Using mean squared error (MSE) as our loss function
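
To make the two choices concrete, here is a minimal NumPy sketch (my own illustration, not taken from the post): the model class is linear, and the quantity being minimized is the mean of squared residuals. The synthetic data, coefficients, and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Choice 1: assume a linear relationship (here, one feature plus a bias).
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.5 + rng.normal(scale=0.5, size=100)  # synthetic noisy linear data

# Choice 2: fit by minimizing mean squared error.
# The closed-form least-squares solution is exactly the MSE minimizer.
A = np.column_stack([X, np.ones(len(X))])       # design matrix with bias column
theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # argmin_theta ||A @ theta - y||^2

mse = np.mean((y - A @ theta) ** 2)             # the loss we chose to minimize
print(f"slope={theta[0]:.2f}, intercept={theta[1]:.2f}, MSE={mse:.3f}")
```

Keeping the two choices separate makes it easier to see that they can be swapped out independently, e.g. keeping the linear model but minimizing absolute error instead of squared error.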

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Compute is a significant bottleneck for current AI model performance and scale ((cite)), making model compression increasingly important. In this paper, the authors propose a method for dynamically allocating compute across tokens and layers, reducing total compute cost while maintaining performance.
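
As a rough illustration of the idea (a sketch under my own assumptions, not the paper's reference implementation): a small learned router scores every token, only the top-k tokens per sequence are sent through the expensive block, and the rest ride the residual stream unchanged. The module names, the capacity fraction, and the sigmoid gating below are illustrative choices.

```python
import torch
import torch.nn as nn

class MoDStyleBlock(nn.Module):
    """Mixture-of-Depths-style routing around an arbitrary transformer sub-block (sketch)."""

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
        super().__init__()
        self.block = block                   # e.g. attention + MLP; treated as a black box here
        self.router = nn.Linear(d_model, 1)  # scalar routing score per token
        self.capacity = capacity             # fraction of tokens that get full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))

        scores = self.router(x).squeeze(-1)        # (b, t) routing scores
        topk = scores.topk(k, dim=-1).indices      # indices of tokens to process
        idx = topk.unsqueeze(-1).expand(b, k, d)

        selected = torch.gather(x, 1, idx)         # (b, k, d) tokens given full compute
        delta = self.block(selected)               # heavy computation on k tokens only

        # Gate the block output by a function of its router score so the router
        # receives gradient, then add it back into the residual stream; unrouted
        # tokens pass through unchanged.
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        return x.scatter_add(1, idx, gate * delta)
```

Any module mapping (batch, k, d_model) to the same shape can stand in for `block`. With `capacity=0.125`, the expensive computation runs on only 12.5% of positions, which is where the savings come from; the paper's actual routing and training details differ in their specifics.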

Tags: model compression dynamic compute allocation summary

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

The field of artificial intelligence faces significant computational constraints, particularly in the deployment and training of Large Language Models (LLMs). This makes LLM compression a crucial area of research, where successful techniques could substantially reduce computational requirements while preserving model performance.
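
The title's claim about layer redundancy can be probed with a very simple diagnostic (my own sketch, not the paper's exact metric): if a layer barely changes its input, its output hidden states will be almost identical, in cosine terms, to its input hidden states, suggesting the layer could be dropped with little impact.

```python
import numpy as np

def layer_redundancy(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Average cosine similarity between a layer's input and output hidden states.

    h_in, h_out: arrays of shape (num_tokens, hidden_dim) collected from a forward
    pass. Values near 1.0 mean the layer changed its input very little, i.e. it
    looks redundant by this (deliberately simplified) criterion.
    """
    num = np.sum(h_in * h_out, axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + 1e-8
    return float(np.mean(num / den))

# Toy check: a near-identity "layer" scores close to 1 (redundant-looking).
h = np.random.randn(512, 1024)
print(layer_redundancy(h, h + 0.01 * np.random.randn(512, 1024)))
```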

Tags: model compression summary

Unifying Language Learning Paradigms - UL2

This paper presents a novel framework for model pre-training. To date, there is no clear consensus on what the optimal pre-training procedure is, and different procedures generalize differently to downstream tasks.
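
To make "different pre-training procedures" concrete, here is a toy illustration (my own, not from the paper) of two common objectives applied to the same text: causal next-token prediction versus span corruption with sentinel tokens (the sentinel naming follows the T5 convention).

```python
# Toy illustration of two pre-training objectives on the same token sequence.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Objective A: causal language modeling -- predict each next token from its prefix.
causal_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the", "cat"], "sat")

# Objective B: span corruption -- replace a span with a sentinel, predict the span.
masked_input = ["the", "cat", "<extra_id_0>", "the", "mat"]   # span "sat on" masked out
target = ["<extra_id_0>", "sat", "on"]

print(causal_examples[2])            # (['the', 'cat', 'sat'], 'on')
print(masked_input, "->", target)
```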

Tags: pretraining summary

On Writing to Think

Many successful intellectuals mention a consistent writing schedule[^1] as an integral part of their daily routine and their success. I’ve always found this a bit weird: there’s something surprising about the notion that to learn, you need to sit in solitude, with no outside interaction, and put words on paper (or screen). If that’s the case, why can’t we just “know it”? Why do we need this whole ordeal of being alone and thinking?

Tags: knowledge learning meditation