ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

2 minute read

The field of artificial intelligence faces significant computational constraints, particularly in the deployment and training of Large Language Models (LLMs). This makes LLM compression a crucial area of research, where successful techniques could substantially reduce computational requirements while preserving model performance.

The authors of this paper introduce a novel pruning method based on a metric called Block Influence (BI), which evaluates layer importance by measuring the similarity between a layer’s input and output states. The BI metric is defined as:

\[BI_i = 1 - \mathbb{E}_{X,t}\frac{X^T_{i,t}X_{i+1,t}}{||X_{i,t}||_2||X_{i+1,t}||_2}\]

where $X_{i,t}$ denotes the $t^{th}$ row of the hidden states of the $i^{th}$ layer.
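For concreteness, here is a minimal sketch of how BI could be computed from a model's per-layer hidden states in PyTorch. The function name and the assumption that `hidden_states[i]` is the input to layer $i$ (e.g., as returned by a HuggingFace model called with `output_hidden_states=True`) are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_states):
    """Compute a BI score for each layer from its input/output hidden states.

    `hidden_states` is a list of tensors of shape (seq_len, hidden_dim),
    where hidden_states[i] is the input to layer i and hidden_states[i + 1]
    is its output. Returns a tensor of shape (num_layers,).
    """
    scores = []
    for x_in, x_out in zip(hidden_states[:-1], hidden_states[1:]):
        # Cosine similarity between each token's input and output representation.
        cos = F.cosine_similarity(x_in, x_out, dim=-1)  # (seq_len,)
        # BI_i = 1 - E_t[cos]; averaging over a calibration set is left to the caller.
        scores.append(1.0 - cos.mean())
    return torch.stack(scores)
```

In practice the expectation would be taken over a small calibration corpus, accumulating the per-layer scores across many sequences before ranking layers.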

The authors demonstrate several key findings:

  1. Their method removes approximately 25% of model parameters while retaining roughly 90% of performance (a layer-pruning sketch follows this list)
  2. The approach outperforms existing pruning methods across multiple benchmarks
  3. The technique generalizes beyond the Transformer to alternative architectures such as RWKV and Mamba
  4. The method can be combined with quantization techniques, with the two approaches working independently (orthogonally) to achieve further compression
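Given per-layer BI scores, the pruning step amounts to dropping the layers with the lowest scores. The sketch below assumes a LLaMA-style HuggingFace checkpoint whose decoder blocks live in `model.model.layers`; the function name and attribute path are assumptions for illustration, not the paper's released code.

```python
import torch
from torch import nn

def prune_lowest_bi_layers(model, bi_scores, num_to_remove):
    """Remove the `num_to_remove` decoder layers with the lowest BI scores.

    Assumes a LLaMA-style model whose transformer blocks are stored in
    `model.model.layers` (a torch.nn.ModuleList); adapt the attribute path
    for other architectures.
    """
    # Layers with the smallest BI change their hidden states the least.
    drop = set(torch.topk(bi_scores, num_to_remove, largest=False).indices.tolist())
    kept = nn.ModuleList(
        layer for i, layer in enumerate(model.model.layers) if i not in drop
    )
    model.model.layers = kept
    model.config.num_hidden_layers = len(kept)
    return model
```

Because the remaining layers keep their original weights, the pruned model can be used directly or lightly fine-tuned to recover some of the lost performance.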

Personal Opinion

The discovery of significant redundancy in LLM architectures is unsurprising given the field’s recent trajectory. The “scaling hypothesis” drove us toward a methodology of “make it bigger,” throwing more compute and data at our problems with the expectation that they would magically solve themselves. This brute-force approach seems unlikely to be optimal from the perspective of parameter efficiency. Recent work reinforces this view, showing that smaller models can match or exceed the performance of their larger counterparts through architectural innovations and intelligent data curation.

However, the dramatic performance degradation on generative tasks raises serious concerns about this pruning approach. The fact that relatively small changes in parameter count can lead to catastrophic failures in more complex tasks suggests that our understanding of parameter importance may be incomplete. We need a more nuanced approach to parameter reduction that goes beyond removing entire layers, perhaps guided by information-theoretic principles that better capture the true importance of each parameter in the network.

The relationship between pre-normalization, model performance, and layer redundancy is particularly intriguing. The fact that pre-norm simultaneously improves model performance while increasing layer redundancy suggests there are fundamental principles about neural network optimization that we have yet to understand. Until we develop training dynamics that naturally encourage parameter efficiency and better understand these architectural trade-offs, we may be stuck with suboptimal solutions to the compression problem. While this work represents progress toward more efficient models, it also highlights how much we still have to learn about the underlying principles of neural network computation.
