22 May 2025
There is wisdom oft repeated that “big neural nets are limited by memory bandwidth.” This is utter horseshit and I will show why.
LLMs are typically implemented as autoregressive feed-forward neural nets. This means that to generate a sentence, you provide a prompt, which the neural net uses to generate the next token. That prompt + token is fed back into the neural net repeatedly until it produces an EOS (end-of-sequence) token, marking the end of generation.
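In code, that loop looks roughly like this. It's a minimal sketch: `next_token`, the token ids, and `eos_id` are hypothetical stand-ins for a real model and tokenizer, not any particular library's API.

```python
from typing import Callable

def generate(prompt: list[int],
             next_token: Callable[[list[int]], int],
             eos_id: int,
             max_tokens: int = 256) -> list[int]:
    """Feed prompt + generated tokens back through the model until it emits EOS."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)   # one full forward pass: touches every active parameter
        tokens.append(tok)
        if tok == eos_id:          # end-of-sequence token: generation stops
            break
    return tokens

# Toy usage with a fake "model" that emits 42 twice, then EOS (id 0).
fake = iter([42, 42, 0])
print(generate([1, 5, 7], lambda _tokens: next(fake), eos_id=0))
```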
We want to derive an equation predicting token rate $R$. Let’s define some variables:
$R$: token rate (tokens / second)
$B$: memory bandwidth (bytes / second)
$S$: model size (parameters)
$C$: compute throughput (parameters / second)
$Q$: model quantization (bytes / parameter)
Since each token requires accessing the entire model’s parameters, then on an infinitely powerful computer:

$$R = \frac{B}{S \cdot Q}$$
As the model size $S$ grows, token rate $R$ drops; as memory bandwidth $B$ grows, token rate increases. Likewise, quantizing the model eases memory pressure, so reducing bytes/parameter $Q$ increases token rate $R$. This is all expected.
However, most of our computers do not have infinite compute throughput $C$. We must then adjust our equation:

$$R = \min\!\left(\frac{B}{S \cdot Q},\ \frac{C}{S}\right)$$
Token rate $R$ increases until we saturate compute $C$ or memory bandwidth $B$, then it stops. Totally reasonable.
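If you want to poke at this, here is the same equation as a tiny Python helper (a sketch; the function and parameter names are mine, not from any library):

```python
def token_rate(bandwidth_bytes_s: float, params: float,
               compute_params_s: float, bytes_per_param: float) -> float:
    """Predicted tokens/second: whichever of the memory-bound and
    compute-bound rates is lower wins."""
    memory_bound = bandwidth_bytes_s / (params * bytes_per_param)   # B / (S * Q)
    compute_bound = compute_params_s / params                       # C / S
    return min(memory_bound, compute_bound)
```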
Notably, token rate $R$ uniformly drops as parameter count $S$ increases: both $B / (S \cdot Q)$ and $C / S$ scale as $1/S$, so a big model generating tokens slowly tells you nothing about whether it is memory bound or compute bound. The common wisdom that “big models are memory bound lol” is complete horseshit.
This equation helps you balance your compute against your memory bandwidth. You can calculate your system’s memory bandwidth as follows, assuming you have DDR5:
$n$: memory channels
$s$: memory speed (GT/s)

$$B = n \cdot s \cdot 8 \;\mathrm{GB/s}$$

(each 64-bit DDR5 channel moves 8 bytes per transfer; source: Wikipedia)
So if you have 12 channels of DDR5 @ 6000 MT/s, that works out to $12 \times 6 \times 8 = 576$ GB/s.
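Or as a trivial Python check (assuming 8 bytes per transfer per 64-bit DDR5 channel, as above):

```python
def ddr5_bandwidth_gb_s(channels: int, gigatransfers_s: float) -> float:
    # Each 64-bit DDR5 channel moves 8 bytes per transfer.
    return channels * gigatransfers_s * 8

print(ddr5_bandwidth_gb_s(12, 6.0))   # 12 channels @ 6000 MT/s -> 576.0 GB/s
```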
Consider a model like DeepSeek-V3-0324 in a 2.42-bit quant. This bad boy is a mixture of experts (MoE) with 37B activated parameters per token. So at 2.42 bits / parameter, that works out to ~11.19 GB / token. Assuming infinite compute, the upper bound on token generation rate is 576 / 11.19 ≈ 51.5 tokens / second.
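The same back-of-the-envelope math in Python, using the figures quoted above:

```python
active_params = 37e9        # activated parameters per token (MoE)
bits_per_param = 2.42       # quantization level quoted above
bandwidth_gb_s = 576        # from the DDR5 example above

gb_per_token = active_params * bits_per_param / 8 / 1e9   # ~11.19 GB per token
max_tok_s = bandwidth_gb_s / gb_per_token                  # ~51.5 tok/s upper bound
print(f"{gb_per_token:.2f} GB/token, {max_tok_s:.1f} tok/s max")
```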
I hate to be the bearer of bad news. You will not see this token rate. On my shitass server with an EPYC 9115 CPU and 12 channels of ECC DDR5 @ 6000 MT/s, I only see 4.6 tok/s. That implies my CPU delivers less than a tenth of the compute needed to saturate my memory subsystem. I’m using a recent build of llama-cli for this test, and a relatively small context window (8k max).
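Plugging the measured rate back into the same rough numbers shows how little of the memory subsystem is actually being touched:

```python
measured_tok_s = 4.6        # what llama-cli actually achieves on this box
gb_per_token = 11.19        # from the calculation above
peak_bandwidth_gb_s = 576   # theoretical DDR5 peak from above

used_gb_s = measured_tok_s * gb_per_token        # ~51 GB/s actually streamed
utilization = used_gb_s / peak_bandwidth_gb_s    # ~9% of peak bandwidth
print(f"{used_gb_s:.0f} GB/s used, {utilization:.0%} of peak")
```

Roughly 9% of theoretical peak bandwidth is in use; the bottleneck on this box is compute.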
In conclusion: