<aside>
TLDR:
There are no known alternative models to LLMs clearly on the horizon that would “blow everyone out of the water” in terms of performance (although a breakthrough can come from anywhere, at any time). Transformer models are expected to keep getting better through 2030.
What Models should you expect to see in 2026?
- Diffusion → 10x faster than transformers; the model self-corrects its guesses
- Sub-Quadratic → uses Linear Attention for large context windows + Transformers for small ones
- RNNs → would allow models to “think” longer and use their own representations.
- Continual Learning → learning at inference time & memory. Updates based on surprise.
- Continuous Thought Machine → much slower, but builds a deeper understanding of the properties it models. Can estimate its own certainty. Not parallelizable (yet).
⭐ I would closely watch the development of the Continuous Thought Machine.
Written on Dec 19, 2025
</aside>
Diffusion Models
TLDR: A diffusion model works a lot like a Transformer model, except that it can fill in multiple tokens at once (rather than one token at a time, left to right): it starts with a fully masked sequence and gradually unmasks positions as it becomes confident in its predictions.

https://nathan.rs/posts/roberta-diffusion/
- What they do: start with pure noise (all masks), fill in a rough guess, and continually refine the response
- Produces tokens in parallel: starts with all tokens masked and progressively unmasks them
- Predicts masked and corrupted tokens
- Iterative like a Transformer’s autoregressive loop, but with far fewer loops / iterations (~10x faster)
- Full attention at each step (does not use a KV cache)
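To make the loop concrete, here is a minimal sketch of masked-diffusion decoding. Everything here is illustrative: `toy_model` is a hypothetical stand-in for a trained denoiser, and real implementations (such as the dLLM repo linked under Resources) pick unmasking schedules far more carefully.

```python
import random

MASK = -1  # sentinel value for a still-masked position

def toy_model(tokens):
    # Hypothetical stand-in for a trained denoiser: returns a
    # (token_id, confidence) guess for every position. Here both are random.
    return [(random.randrange(100), random.random()) for _ in tokens]

def diffusion_decode(length, steps=8, model=toy_model):
    tokens = [MASK] * length               # start with every token masked
    for step in range(steps):              # far fewer loops than one-per-token
        guesses = model(tokens)            # full attention pass, no KV cache
        # Keep only the most confident guesses; on the final step accept all.
        confidences = sorted(conf for _, conf in guesses)
        threshold = 0.0 if step == steps - 1 else confidences[length // 2]
        for i, (tok, conf) in enumerate(guesses):
            if conf >= threshold:          # may overwrite earlier guesses too,
                tokens[i] = tok            # which is the "self-correction"
        if MASK not in tokens:
            break
    return tokens

print(diffusion_decode(length=12))
```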
- How a Transformer model works (for comparison)
    - Builds from left to right, one token at a time
    - The first few words shape the rest of the output, so it is forced to build on (potential) mistakes
    - Correcting a mistake requires checking the output and re-running inference, or trying to fix a section of the response
    - The KV cache may become a memory bottleneck at context lengths beyond ~10k tokens
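For contrast, the same toy setup decoded autoregressively (again with a hypothetical `toy_next_token` stand-in): one forward pass per token, each guess conditioned only on earlier ones, and a KV cache that grows with the context.

```python
import random

def toy_next_token(context):
    # Hypothetical stand-in for a trained LM head.
    return random.randrange(100)

def autoregressive_decode(length):
    tokens, kv_cache = [], []
    for _ in range(length):           # one full loop iteration per token
        tok = toy_next_token(tokens)  # conditioned only on earlier tokens,
        tokens.append(tok)            # so early mistakes propagate forward
        kv_cache.append(("k", "v"))   # cache grows with context length,
    return tokens                     # hence the >10k-token memory concern

print(autoregressive_decode(12))
```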
- Pros
    - Refinement is natural to the process (as opposed to LLMs, which need to finish inference first).
    - More natural support for insertion / deletion / substitution, which corrects responses holistically.
    - May be better for long-context prompts (>10k tokens)
    - Flexibility: can competently rewrite a single section within a prompt.
    - Fixes mistakes at inference time, giving it an opportunity for deeper coherence
- Resources
    - https://x.com/asapzzhou/status/1998098118827770210?s=20
    - https://github.com/ZHZisZZ/dllm
Sub-Quadratic Models
TLDR: Hyperscalers are likely to soon use a combination of a transformer (for smaller context windows) and a linear attention model (for larger ones), where the linear attention path essentially compresses part of the context window.
This may enable larger context windows, or help tame large, unwieldy ones.
- Uses a hybrid of both a linear attention model and a transformer
    - Small context window → transformer
    - Large context window → linear attention model
- Linear Attention (sketched below)
    - Calculates a representation of the whole sequence, which takes O(n) time upfront and only O(1) memory to store
    - Multiplies each token by that representation, which takes O(n) overall
    - In English: it computes a fixed-size representation of the input sequence of tokens (words) and multiplies each word by it. This compresses the meaning, losing some detail, but runs much faster than full transformer attention.
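Here is a minimal NumPy sketch of the two passes described above, in the non-causal form. The feature map `phi(x) = elu(x) + 1` is a common choice from the kernel-attention literature, not something specified in this note; the fixed-size matrix `S` plays the role of the whole-sequence representation.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: keeps every feature strictly positive so the
    # normalizer below never divides by zero.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Pass 1: build the whole-sequence representation upfront.
    # S is a fixed d x d_v matrix: O(n) time, O(1) memory w.r.t. length.
    S = phi(K).T @ V              # sum over tokens of phi(k_i) v_i^T
    z = phi(K).sum(axis=0)        # fixed-size normalizer
    # Pass 2: "multiply each token by the representation", another O(n) pass.
    return (phi(Q) @ S) / (phi(Q) @ z + 1e-6)[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (16, 8): same output shape as
                                        # softmax attention, but no n x n matrix
```

In a hybrid sub-quadratic model, a router would send short sequences to ordinary softmax attention and long ones down a path like this; the detail lost by compressing everything into `S` is the price of the speedup.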