Google launches DiffusionGemma open model for faster local AI workflows

Google has introduced DiffusionGemma, an experimental open model designed to generate text faster by using diffusion instead of the token-by-token process behind most large language models.

DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs.

Instead of predicting word-by-word, it generates entire blocks of text simultaneously. This lets the model self-correct and format complex markdown in real time.

— Google DeepMind (@GoogleDeepMind) June 10, 2026

The model is a 26-billion-parameter Mixture of Experts system released under an Apache 2.0 license. Google said DiffusionGemma activates only 3.8 billion parameters during inference and can run within 18GB of VRAM when quantized, making it suitable for high-end consumer GPUs.
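To put the 18GB figure in context, the following back-of-the-envelope sketch estimates how much memory the weights alone would need at different precisions. The bit widths are assumptions for illustration (the announcement does not say which quantization level Google used), and the estimate ignores activations and runtime overhead. It also assumes all 26 billion parameters stay resident in memory, as is typical for Mixture of Experts models even though only a subset is active at each step.

```python
# Rough weight-memory estimate. Ignores activations and runtime overhead.
TOTAL_PARAMS = 26e9     # total parameters, from the article
ACTIVE_PARAMS = 3.8e9   # parameters active during inference, from the article
VRAM_BUDGET_GB = 18     # quantized footprint quoted in the article


def weight_gb(params: float, bits_per_param: int) -> float:
    """Memory needed to store `params` weights, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


# The bit widths below are illustrative assumptions, not published settings.
for bits in (16, 8, 4):
    size = weight_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active share of parameters per step: {active_fraction:.1%}")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) leaves headroom inside an 18GB budget, which would be consistent with the "when quantized" qualifier.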

Unlike traditional autoregressive models, which generate one token at a time from left to right, DiffusionGemma generates blocks of text in parallel. Google said the model can draft 256 tokens at once and refine them over multiple passes, allowing the full text block to be evaluated as it is being produced.
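The draft-and-refine loop can be illustrated with a toy sketch. This is not Google's algorithm: the "model" is a random stand-in, the pass count of 8 is hypothetical, and the rule of committing the most confident positions first is borrowed from masked-diffusion decoding in general. The point is only that a 256-token block is filled in over a handful of passes rather than 256 sequential steps.

```python
import random

BLOCK_SIZE = 256   # tokens drafted at once, from the article
NUM_PASSES = 8     # hypothetical; the article does not give a pass count
MASK = None        # marks a position that has not been committed yet


def toy_denoiser(block, rng):
    """Stand-in for the model: propose a (token, confidence) pair per position."""
    return [(f"tok{i}", rng.random()) for i in range(len(block))]


def generate_block(block_size=BLOCK_SIZE, num_passes=NUM_PASSES, seed=0):
    rng = random.Random(seed)
    block = [MASK] * block_size
    per_pass = -(-block_size // num_passes)  # ceiling division
    for _ in range(num_passes):
        # One pass scores every position in the block at the same time.
        proposals = toy_denoiser(block, rng)
        # Rank the still-masked positions by confidence, highest first.
        open_positions = [i for i, tok in enumerate(block) if tok is MASK]
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        # Commit the most confident proposals; the rest wait for a later pass.
        for i in open_positions[:per_pass]:
            block[i] = proposals[i][0]
    return block


block = generate_block()
committed = sum(tok is not MASK for tok in block)
print(committed, "tokens committed in", NUM_PASSES, "passes")
```

A real diffusion model could also revise tokens it had already committed, which is what the self-correction claim refers to; this sketch leaves that out to stay short.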

Google claims the approach yields a substantial speed gain for local AI workflows. The company said DiffusionGemma can generate more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090, and that the model can deliver up to four times faster output on dedicated GPUs.
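As a rough illustration of what those throughput figures mean for latency, the sketch below converts them into the time needed to produce one 256-token block. The baseline comparison is an assumption: it simply divides each quoted rate by four, whereas the announcement does not say which model or hardware the "up to 4x" figure was measured against.

```python
# Quoted lower bounds on throughput, in tokens per second, from the article.
THROUGHPUT_TOK_PER_S = {
    "NVIDIA H100": 1000,
    "NVIDIA GeForce RTX 5090": 700,
}
SPEEDUP = 4          # "up to" figure from the article
BLOCK_TOKENS = 256   # block size from the article


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time to emit `tokens` at a steady rate of `tok_per_s`."""
    return tokens / tok_per_s


for gpu, rate in THROUGHPUT_TOK_PER_S.items():
    fast = seconds_for(BLOCK_TOKENS, rate)
    # Assumed baseline: the quoted rate divided by the headline speedup.
    slow = seconds_for(BLOCK_TOKENS, rate / SPEEDUP)
    print(f"{gpu}: {fast:.3f} s per block vs {slow:.3f} s at an assumed 4x slower rate")
```

Because the quoted rates are lower bounds and the speedup is an upper bound, the numbers this prints are best read as order-of-magnitude guides rather than benchmarks.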

Google is positioning the model for researchers and developers building latency-sensitive tools, including inline editing, code infilling, rapid iteration, and nonlinear text generation.

Its bidirectional attention allows every token in a block to attend to the others, which Google said could help in areas such as math graphs, amino acid sequences, and structured editing.

The company was also careful to frame DiffusionGemma as experimental. Google said its standard Gemma 4 models remain the better option for applications that require maximum output quality, while DiffusionGemma is aimed at developers exploring interactive local AI systems where speed matters more.

The speed advantage is also not universal. Google said the model is most useful for local and low-concurrency inference, while traditional autoregressive models may remain more efficient in high-volume cloud deployments where requests can be batched at scale.

DiffusionGemma is available through Hugging Face, with support for MLX, vLLM, Hugging Face Transformers, Unsloth, NVIDIA NeMo, and other developer tools. Google said official llama.cpp support is coming soon.

Disclosure: This article was edited by Estefano Gomez. For more information on how we create and review content, see our Editorial Policy.
