“big llms are memory bound”

22 May 2025

There is wisdom oft repeated that “big neural nets are limited by memory bandwidth.” This is utter horseshit and I will show why.

LLMs are typically implemented as autoregressive feed-forward neural nets. This means that to generate a sentence, you provide a prompt, which the neural net uses to generate the next token. The prompt plus that token is fed back into the neural net repeatedly until it produces an EOS (end-of-sequence) token, marking the end of generation.
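
To make the loop concrete, here is a minimal sketch in Python. `model.predict_next` and the `tokenizer` object are hypothetical stand-ins for whatever inference stack you actually use; the only point is that every generated token costs one full forward pass.

```python
def generate(model, tokenizer, prompt, max_tokens=256):
    # Hypothetical model/tokenizer objects; the shape of the loop is what matters.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_tokens):
        next_token = model.predict_next(tokens)  # one forward pass = one read of every active parameter
        if next_token == tokenizer.eos_id:       # end-of-sequence marker
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens)
```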

We want to derive an equation predicting token rate $T$. Let’s define some variables:

$T$: token rate (tokens / second)

$M$: memory bandwidth (bytes / second)

$P$: model size (parameters)

$C$: compute throughput (parameters / second)

$Q$: model quantization (bytes / parameter)

Since generating each token requires reading the entire model’s parameters from memory, on an infinitely powerful computer:

$$T = \frac{M}{P \cdot Q}$$

As the model size $P$ grows, token rate $T$ drops; as memory bandwidth $M$ grows, token rate $T$ increases. Likewise, quantizing the model eases memory pressure, so reducing bytes/param $Q$ increases token rate $T$. This is all expected.

However, most of our computers do not have infinite compute throughput. We must then adjust our equation:

$$T = \frac{\min\!\left(\frac{M}{Q},\, C\right)}{P}$$

Token rate $T$ increases until we saturate either compute $C$ or memory bandwidth $\frac{M}{Q}$ (expressed in parameters per second), then it stops. Totally reasonable.
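
As a sanity check, here is the equation as a tiny Python function. The names are mine; the units follow the definitions above (bytes/s, params/s, params, bytes/param).

```python
def token_rate(mem_bw_bytes_s, compute_params_s, params, bytes_per_param):
    """Predicted tokens/s: the slower of the memory and compute paths, divided by model size."""
    mem_params_s = mem_bw_bytes_s / bytes_per_param  # params/s the memory system can deliver
    return min(mem_params_s, compute_params_s) / params
```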

Notably, token rate uniformly drops as parameter count $P$ increases, no matter which side of the $\min$ you land on. Whether a given machine is memory bound or compute bound depends on how $C$ compares to $\frac{M}{Q}$, not on how big the model is. The common wisdom that “big models are memory bound lol” is complete horseshit.

This equation helps you balance your compute against your memory bandwidth. You can calculate your system’s memory bandwidth as follows, assuming you have DDR5:

$M_c$: memory channels

$M_s$: memory speed (GT/s)

$$M = M_s \cdot 8 \cdot M_c$$

(Source: wikipedia)

So if you have 12 channels of DDR5 @ 6000 MT/s (i.e. 6 GT/s), that works out to $12 \cdot 8 \cdot 6 = 576$ GB/s.
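
The same arithmetic as a throwaway helper (a sketch; the 8 is just bytes per transfer on a 64-bit channel):

```python
def ddr5_bandwidth_gb_s(channels, speed_mt_s):
    # MT/s -> GT/s, times 8 bytes per transfer per 64-bit channel, times channel count
    return (speed_mt_s / 1000) * 8 * channels

print(ddr5_bandwidth_gb_s(12, 6000))  # 576.0 GB/s
```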

Consider a model like DeepSeek-V3-0324 in a 2.42-bit quant. This bad boy is a mixture of experts (MoE) with 37B activated parameters per token. At 2.42 bits / parameter (about 0.30 bytes / parameter), that works out to ~11.19 GB read per token. Assuming infinite compute, the upper bound on token generation rate is $576 / 11.19 \approx 51.5$ tokens / second.
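
Plugging those same numbers through the formula:

```python
bytes_per_param = 2.42 / 8                   # ~0.3025 bytes/param
gb_per_token = 37e9 * bytes_per_param / 1e9  # ~11.19 GB read per token
print(576 / gb_per_token)                    # ~51.5 tok/s upper bound, assuming infinite compute
```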

I hate to be the bearer of bad news. You will not see this token rate. On my shitass server with an EPYC 9115 CPU and 12 channels of ECC DDR5 @ 6000 MT/s, I only see 4.6 tok/s. That implies my CPU delivers less than a tenth of the compute throughput needed to saturate my memory subsystem. I’m using a recent build of llama-cli for this test, with a relatively small context window (8k max).
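
A quick back-of-the-envelope check on that claim, using the observed 4.6 tok/s:

```python
observed_compute = 4.6 * 37e9              # ~1.7e11 params/s the CPU actually chews through
mem_fed_compute  = 576e9 / (2.42 / 8)      # ~1.9e12 params/s the RAM could feed it
print(mem_fed_compute / observed_compute)  # ~11.2x shortfall: compute bound, not memory bound
```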

In conclusion:

  1. The theory behind token rate is very simple once you grok that LLMs are just autoregressors: they have to read every active parameter out of memory once per token to operate.
  2. You can extrapolate expected performance from smaller models, since for a fixed memory bandwidth and compute budget, throughput scales inversely with model size (see the sketch after this list).
  3. People on the internet (especially redditors) are fucking stupid.
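
On point 2, the extrapolation is just a ratio of active parameter counts. Made-up numbers below, assuming the same machine, same quant, and same bottleneck:

```python
small_tok_s  = 40.0    # measured on a 7B-active-param model (hypothetical figure)
small_active = 7e9
big_active   = 70e9
print(small_tok_s * small_active / big_active)  # ~4 tok/s expected for the bigger model
```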
