Looking inside LLMs

Published: October 7, 2024

Meta launched its Llama 3.1 model not long ago, claiming it to be the largest open-source LLM to date, and it arrives amid a flurry of open-source LLMs hitting the market. Companies are spending thousands on training these models and millions on the infrastructure needed to train and deploy them.

From a personal perspective, I am fascinated by the architecture of these vast networks: the layers, the weight distributions, and more. In my previous work [1], we used the activations of different layers to prune a neural network, eliminating redundant neurons and thereby shrinking the model for edge deployment. A key insight from that study was that the internal representations of these networks, often sparse despite the billions of parameters behind them, contain valuable information that can further our understanding of the models. Interestingly, a significant fraction of these parameters are zero or near-zero, underscoring the inherent sparsity of these networks. Yet this has not deterred corporate investment in discovering these “giganormous” sparse matrices. Over time, I expect we will learn more efficient methods to uncover such structures, which may also shed light on LLM explainability, as several studies have attempted by analyzing network activations.
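To make the sparsity observation concrete, here is a minimal sketch of how one might measure the fraction of near-zero parameters in an open checkpoint. The model name (gpt2) and the 1e-3 threshold are placeholders of my choosing, not values from the study above; any model loadable through Hugging Face transformers would work the same way.

```python
# Minimal sketch: estimate the fraction of near-zero parameters in a model.
# "gpt2" and the 1e-3 threshold are illustrative choices, not from the study.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

threshold = 1e-3
total, near_zero = 0, 0
for param in model.parameters():
    w = param.detach().abs()
    total += w.numel()                         # count every parameter
    near_zero += (w < threshold).sum().item()  # count the near-zero ones

print(f"{near_zero / total:.1%} of parameters have |w| < {threshold}")
```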

As an illustrative exercise, I have plotted Kernel Density Estimates (KDEs), a smooth alternative to traditional histograms for visualizing distributions (for newcomers to KDE, see here [2]). These plots reveal various patterns in the parameters of several open-source LLMs. I encourage you to explore the graphs and share your observations in the comments below.
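If you want to produce similar plots yourself, the sketch below shows one way to do it: load a small open model through Hugging Face transformers and draw KDEs of a few attention weight matrices with seaborn. This is not the exact script behind the figures that follow; the model name (gpt2), the layer filter, and the sample size are all illustrative.

```python
# Minimal sketch: KDEs of selected weight matrices from a small open model.
# Model name, layer filter, and sample size are illustrative choices.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoModelForCausalLM

model_name = "gpt2"  # stand-in for a larger open-source LLM
model = AutoModelForCausalLM.from_pretrained(model_name)

rng = np.random.default_rng(0)
plt.figure(figsize=(8, 5))
for name, param in model.named_parameters():
    # Keep only the attention projection weights of the first three blocks.
    if "attn.c_attn.weight" in name and any(f".h.{i}." in name for i in range(3)):
        weights = param.detach().float().flatten().numpy()
        # Subsample so the KDE stays cheap to compute.
        sample = rng.choice(weights, size=min(50_000, weights.size), replace=False)
        sns.kdeplot(sample, label=name, fill=False)

plt.xlabel("parameter value")
plt.ylabel("density")
plt.title(f"KDE of attention weights ({model_name})")
plt.legend(fontsize=7)
plt.tight_layout()
plt.show()
```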

In the coming days, I plan to dig deeper into these open-source LLMs and learn more about their underlying parameters. Stay tuned for more insights.

Kernel Density Estimates for LLM parameters