<aside>

TLDR:

There are no known alternatives to LLMs clearly on the horizon that would “blow everyone out of the water” on performance (although a breakthrough can come from anywhere and happen at any time). Transformer models are expected to keep getting better through 2030.

What models should you expect to see in 2026?

I would closely watch the development of the Continuous Thought Machine.

Written on Dec 19, 2025

</aside>

Diffusion Models

TLDR: A text diffusion model typically keeps the Transformer architecture but changes how it generates. Instead of producing one token at a time, left to right, it starts from a fully masked sequence and fills in multiple tokens at once, gradually unmasking positions as it becomes confident in their predictions.

Figure: progressive unmasking during diffusion decoding, from https://nathan.rs/posts/roberta-diffusion/
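A minimal sketch of that unmasking loop, with a random stand-in for the model (the `predict` function, the toy vocabulary, and the `per_step` parameter are illustrative assumptions, not from the linked post; a real model would replace `predict` with a forward pass over the masked sequence):

```python
import numpy as np

MASK = -1
VOCAB = ["the", "cat", "sat", "on", "mat", "a"]
rng = np.random.default_rng(0)

def predict(tokens):
    """Stand-in for the model: returns (token_id, confidence) per position.
    Here it's random; a real model would return softmax probabilities."""
    probs = rng.dirichlet(np.ones(len(VOCAB)), size=len(tokens))
    return probs.argmax(axis=1), probs.max(axis=1)

def diffusion_decode(length=6, per_step=2):
    tokens = np.full(length, MASK)            # start fully masked
    while (tokens == MASK).any():
        ids, conf = predict(tokens)
        masked = np.where(tokens == MASK)[0]
        # unmask only the positions the model is most confident about
        best = masked[np.argsort(-conf[masked])[:per_step]]
        tokens[best] = ids[best]              # commit several tokens at once
    return [VOCAB[t] for t in tokens]

print(diffusion_decode())
```

The key contrast with autoregressive decoding is that each pass can commit several tokens anywhere in the sequence, so the number of model calls can be far smaller than the sequence length.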

Sub-Quadratic Models

TLDR: Hyperscalers are likely to soon ship hybrid models that combine standard (quadratic) attention for shorter context windows with linear attention, which compresses older context into a fixed-size state instead of attending over every past token. This can enable much larger context windows, or make large, unwieldy contexts cheaper to process.
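A minimal sketch of the linear-attention recurrence behind that compression; the feature map, variable names, and shapes here are illustrative assumptions, not any particular lab's implementation:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention: the entire history is compressed into a
    fixed-size state S (d x d) updated once per token, so cost grows
    O(n) with sequence length instead of O(n^2)."""
    n, d = Q.shape
    phi = lambda x: np.maximum(x, 0) + 1.0    # simple positive feature map
    S = np.zeros((d, d))                      # running sum of phi(k) v^T
    z = np.zeros(d)                           # running normalizer
    out = np.zeros_like(V)
    for t in range(n):                        # one O(d^2) update per token
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)                   # fold this token into the state
        z += k
        out[t] = (q @ S) / (q @ z + eps)      # read out against compressed past
    return out

n, d = 8, 4
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)        # (8, 4): same output shape as attention
```

Because the state S never grows with sequence length, a hybrid model can hand the distant past to this mechanism while reserving exact quadratic attention for the most recent tokens.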