Today, basically any language model you can name is a Transformer model. OpenAI’s ChatGPT, Google’s Gemini, and GitHub’s Copilot are all powered by Transformers, to name a few. However, Transformers suffer from a fundamental flaw: they are powered by Attention, which scales quadratically with sequence length. Simply put, for quick exchanges (asking ChatGPT to tell a joke), this is fine. But for queries that require lots of words (asking ChatGPT to summarize a 100-page document), Transformers can become prohibitively slow. […] Mamba appears to outperform similarly-sized Transformers while scaling linearly with sequence length.
— Mamba: The Easy Way, by Jack Cook
A wonderful explanation of the architectural differences in Mamba and why it is so much faster than existing Transformer implementations. May require some CNN/RNN background to fully understand.
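The quadratic-versus-linear claim is easy to see with a back-of-envelope cost count. The sketch below is not from the article; the simplified FLOP formulas and the chosen dimensions (`d_model`, `d_state`) are illustrative assumptions, but they show why attention blows up on long inputs while a fixed-size recurrent/state-space update does not.

```python
def attention_flops(seq_len: int, d_model: int) -> int:
    # Naive self-attention forms an (L x L) score matrix: QK^T costs roughly
    # L * L * d multiply-adds, and applying the scores to V costs about the same.
    return 2 * seq_len * seq_len * d_model


def recurrent_scan_flops(seq_len: int, d_state: int, d_model: int) -> int:
    # A recurrent / state-space layer updates a fixed-size hidden state once per
    # token, so its cost grows linearly with sequence length.
    return seq_len * d_state * d_model


# Illustrative sizes (assumptions, not the actual Mamba configuration).
d_model, d_state = 1024, 16
for L in (1_000, 10_000, 100_000):
    attn = attention_flops(L, d_model)
    scan = recurrent_scan_flops(L, d_state, d_model)
    print(f"L={L:>7}: attention ~{attn:.1e} FLOPs, linear scan ~{scan:.1e} FLOPs")
```

Going from 1k to 100k tokens multiplies the attention cost by 10,000x but the scan cost by only 100x, which is the gap the article's "prohibitively slow on a 100-page document" remark is pointing at.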