What changed, and our verdict.

Fast takes on major model drops, published within ~24 hours. Each piece is dated, opinionated, and short: what shipped, what it actually changes, and where it fits in the planner.

The quarterly state-of-local-AI snapshot lands at /methodology. The dated diff for each quarter lands here. The RSS feed is at /changes/feed.xml.

Recent drops

2026-06-04 · NVIDIA NEMOTRON-3 ULTRA 550B-A55B
Nemotron-3 Ultra: NVIDIA ships the best US open-weight model, fast, and fully open
NVIDIA released Nemotron-3 Ultra 550B-A55B on June 4 (announced at Jensen Huang's Computex keynote on June 1): a 550B-total / 55B-active hybrid Mamba-Transformer MoE under the new OpenMDW 1.1 license, with 1M context and native NVFP4. Artificial Analysis scores it 48 on its Intelligence Index, making it the strongest US/Western open-weight model to date: ahead of Gemma 4 31B (39) and gpt-oss-120b (33), but behind the Chinese-led frontier (Kimi K2.6 at 54). The headline is speed-for-intelligence: 300+ tok/s, several times faster than DeepSeek/Kimi peers.

2026-06-03 · GOOGLE GEMMA 4 12B
Gemma 4 12B: frontier-ish multimodal that actually runs on a 16 GB laptop, under Apache 2.0
Google shipped Gemma 4 12B on June 3: a ~12B dense, encoder-free unified multimodal model (text + image + audio + video in, text out) under Apache 2.0. Google's pitch is the deployment envelope: it runs locally on 16 GB of VRAM or unified memory while landing benchmark numbers Google says approach its 26B-A4B MoE, at under half the total memory footprint. It slots into the existing Gemma 4 family (31B dense + 26B MoE, April) as the laptop-friendly multimodal pick.

2026-06-02 · MICROSOFT MAI FAMILY (BUILD 2026)
Microsoft MAI at Build 2026: seven first-party models, all hosted-only, none you can download
At Build 2026 on June 2, Mustafa Suleyman unveiled Microsoft AI's MAI family: seven first-party models spanning reasoning (MAI-Thinking-1), coding (MAI-Code-1, MAI-Code-1-Flash), image (MAI-Image-2.5 + Flash), transcription (MAI-Transcribe-1.5), and voice (MAI-Voice-2). Trained from scratch, no distillation. The story is Microsoft reducing its reliance on OpenAI, but for a local-AI site the headline is the asterisk: every MAI model is Microsoft Foundry / Azure-hosted, closed-weight, API-only.
None are on Hugging Face, none are downloadable, none change a single local pick.

2026-06-02 · XAI GROK BUILD (GROK-BUILD-0.1)
Grok Build: xAI ships a terminal coding agent on the hosted grok-build-0.1 model
xAI opened Grok Build to public beta via the xAI API in early June: a Rust, terminal-native coding agent and CLI (xAI's answer to Claude Code, Codex, and Gemini CLI), powered by the new grok-build-0.1 model. It is a fast agentic coding model: 256K context, text + image in, function calling + structured outputs + reasoning, at $1 / 1M input and $2 / 1M output ($0.20 cached). The catch for a local-AI site: grok-build-0.1 is hosted / API-only, with no open weights, so it shifts the cloud landscape, not your local picks.

2026-05-28 · ANTHROPIC CLAUDE OPUS 4.8
Claude Opus 4.8: price held flat, fast mode got 3× cheaper, and it catches its own code bugs now
Anthropic shipped Opus 4.8 on May 28 at unchanged standard pricing ($5 in / $25 out per 1M). No tokenizer surprise this time: it keeps the 4.7 tokenizer. Two things actually changed: fast mode dropped to $10/$50 (3× cheaper than 4.6/4.7), and Anthropic claims the model is ~4× less likely to let flaws in code it wrote slip past unremarked. Still hosted-only; nothing changes for local picks today.

2026-05-25 · GRANITE-SWITCH 4.1 8B PREVIEW (IBM)
Granite-Switch 4.1: IBM's 12-adapter-in-one-checkpoint pattern is the deployment story
IBM uploaded the Granite-Switch 4.1 family (3B / 8B / 30B previews) to Hugging Face on May 25. Each checkpoint is the base Granite 4.1 dense model with 12 task-specialized LoRA adapters embedded, activated per-token via control tokens in the chat template. Three libraries: Core (requirement check, context attribution, uncertainty), RAG (query rewrite, query clarification, answerability, hallucination detection, citation generation), Guardian (safety detection, factuality detection + correction, policy guardrails).
Apache 2.0, 128K context, 12 languages.

2026-05-22 · MINICPM5-1B (OPENBMB)
MiniCPM5-1B: OpenBMB's On-Policy Distillation pipeline lands at 1B
OpenBMB uploaded MiniCPM5-1B and MiniCPM5-1B-SFT to Hugging Face on May 22: a 1.08B-parameter dense Llama-class model trained with a three-stage SFT → RL → On-Policy Distillation pipeline. Apache 2.0, 128K context, English + Chinese, hybrid `<think>` reasoning toggle, native XML-style tool calling. Claims 1B-class open-source SOTA against LFM2.5-1.2B-Thinking, Qwen3-0.6B/think, and Qwen3.5-0.8B/think.

2026-05-20 · COMMAND A+ (COHERE)
Command A+: Cohere ships the Apache 2.0 frontier MoE the open-weight stack was missing
Cohere released Command A+ on May 20: a 218B-param / 25B-active sparse MoE under Apache 2.0, with hybrid sliding-window + global attention, 128K input / 64K generation, native multimodal (text + image), and 48-language coverage. Available in BF16, FP8, and W4A4 quants on day one. The W4A4 build fits 2× H100 80 GB; the FP8 fits 1× MI300X. The first time you can self-host a 218B-class frontier MoE without a hosted-only or research-license caveat.

2026-05-19 · GEMINI 3.5 FLASH (GOOGLE)
Gemini 3.5 Flash: Google's new daily-driver API price shifts the cloud-vs-local math
Google launched Gemini 3.5 Flash GA at I/O 2026 (May 19) at $1.50/$9.00 per 1M tokens with 1M context. Reported to beat Gemini 3.1 Pro on Terminal-Bench 2.1 / MCP Atlas / CharXiv at roughly 60% of Pro's blended cost. Alongside it: a tiered consumer subscription restructure, with a new $100/mo Google AI Ultra tier (5× Pro limits, Antigravity 2.0 access) and the top Ultra tier cut from $250 to $200.

2026-05-18 · BITCPM4-CANN FAMILY (OPENBMB)
BitCPM4-CANN: OpenBMB ships the first native ternary 8B LLM family
OpenBMB released the BitCPM4-CANN family (0.5B / 1B / 3B / 8B) in mid-May: the first publicly reported end-to-end 1.58-bit (ternary {-1, 0, 1}) training stack at 8B scale, trained natively on Huawei Ascend NPU. Apache 2.0.
The 8B model retains 95.7% of full-precision MiniCPM4 performance at ~6× memory reduction; the 0.5B variant retains 90.1% of its full-precision baseline at ~100 MB on-disk. Not the strongest model at its size, but the smallest credible model at this quality level.

2026-05-15 · MINICPM-V-4.6 (OPENBMB)
MiniCPM-V-4.6: vision-language at 1B that prices like a sub-billion text model
OpenBMB shipped MiniCPM-V-4.6 on May 15: a 1B-param vision-language model built on SigLIP2-400M + Qwen3.5-0.8B that scores higher than its own LLM backbone on Artificial Analysis Intelligence Index (13 vs 10) at ~19× lower token cost. Apache 2.0. Day-one GGUF, BNB, AWQ, GPTQ quants plus a Thinking variant. The newest entry in the V (vision-only) branch, parallel to the MiniCPM-o omnimodal line.

2026-05-08 · HIDREAM-O1-IMAGE
HiDream-O1-Image: pixel-space generation finally lands as an open-weight, MIT-licensed model
HiDream open-sourced the O1 series on May 8: an 8B image foundation model that generates in pixel space, with no VAE and no separate text encoder. One Pixel-level Unified Transformer handles text-to-image, edit, and subject-driven personalization at up to 2,048². Debuted top-10 on Artificial Analysis T2I Arena. Both undistilled and distilled Dev variants ship same-day under MIT.

2026-05-01 · MOSS-MUSIC-8B (OPENMOSS)
MOSS-Music-8B: open-weight music understanding lands at usable accuracy
OpenMOSS released MOSS-Music-8B (Instruct + Thinking) on May 1: an Apache 2.0 audio-text-to-text model that does lyrics ASR with time-aligned transcription, music captioning, key/tempo/chord reasoning, structural analysis, instrument recognition, and music QA at production accuracy. 80.38% average on music-QA benchmarks; 4.36/5.0 on MusicCaps captioning.
No open-weight model previously covered this category usably.

2026-04-29 · MISTRAL MEDIUM 3.5
Mistral Medium 3.5: one 128B dense model replaces three specialist Mistrals
Mistral retired Magistral (reasoning) and Devstral 2 (coding) and folded both into Medium 3.5: a 128B dense weight set with 256K context, native multimodal vision, 77.6% on SWE-Bench Verified, and a per-request `reasoning_effort` toggle that swaps modes without swapping checkpoints. Modified MIT license: commercial use OK below a revenue threshold.

2026-04-29 · IBM GRANITE 4.1
IBM Granite 4.1: Apache 2.0 dense at 3B/8B/30B with 512K context
IBM dropped the Granite 4.1 family on April 29: three dense sizes (3B / 8B / 30B), Apache 2.0, 128K default context extending to 512K via training-stage continuation, native Ollama tags live the same day. The headline claim from IBM: the new 8B instruct matches their prior Granite 4.0 32B-A9B MoE. Believable but unverified by community benchmarks at time of writing.

2026-04-24 · DEEPSEEK V4 (PRO + FLASH)
DeepSeek V4: the architecture is the story, not the size
A 1.6T Pro and a 284B Flash sibling, both MIT, both 1M context, released the same day. Skip the size headlines: the real news is the architectural change that cuts single-token FLOPs by ~73% and KV cache by ~90% relative to V3.2.

2026-04-23 · OPENAI GPT-5.5
GPT-5.5: OpenAI doubled GPT-5-line pricing, and local hardware just got cheaper by comparison
GPT-5.5 launched at $5 input / $30 output per 1M, exactly 2× the GPT-5.4 rates. OpenAI claims ~40% fewer output tokens on Codex tasks, which offsets most of the increase and leaves a ~20% net rise. Crucially, that ~20% comes out of cloud users' pockets; local-hardware buyers get a free improvement to their break-even math.

2026-04-22 · QWEN 3.6-27B (DENSE)
Qwen 3.6-27B: a dense Q4 model that claims to beat the prior 397B MoE flagship
Six days after Qwen 3.6-35B-A3B, Alibaba shipped the dense 27B. Apache 2.0, 262K native context, multimodal, ~17 GB at Q4.
The community claim worth weighing: it beats the prior generation's 397B MoE on coding while staying single-card.

2026-04-20 · KIMI K2.6 (MOONSHOT)
Kimi K2.6: open-weights 1T MoE, but the headline is agent orchestration, not raw size
1T parameters total, 32B activated per token, Modified MIT (commercial OK below 100M MAU / $20M MRR). Tops SWE-Bench Pro at 58.6 (vs GPT-5.4 xhigh 57.7, Opus 4.6 max 53.4) and lands #4 on the Artificial Analysis Intelligence Index. The real differentiator is Agent Swarm scaling to 300 sub-agents over 4,000 coordinated steps, a different shape of capability than "bigger model, better single-step."

2026-04-16 · ANTHROPIC CLAUDE OPUS 4.7
Claude Opus 4.7: same per-token price, but the new tokenizer raises real cost ~35%
Anthropic shipped Opus 4.7 at unchanged $5/$25 per 1M, paired with Claude Design (visual collaboration). Most coverage missed the tokenizer change: Opus 4.7 produces ~35% more tokens than 4.6 for the same prompts. Effective cost rose without the price tag moving.

2026-04-16 · QWEN 3.6-35B-A3B
Qwen 3.6-35B-A3B: the MoE sibling that pairs with the dense 27B
Alibaba shipped the 35B-A3B MoE first, then the dense 27B six days later. 3B active per token, 35B total, Apache 2.0, 262K context. ~17 GB at Q4, which fits 24 GB cards comfortably. Picks between the sibling pair come down to dense-vs-MoE tradeoffs.

2026-04-14 · VOXCPM2 (OPENBMB)
VoxCPM2: the first open-weight TTS that designs voices from text alone
OpenBMB shipped a 2B Apache-2.0 TTS that does what no other open-weight model does: generate a voice from a natural-language description, no reference audio required. Plus 30 languages, 48 kHz output, tokenizer-free diffusion AR.

2026-04-10 · MOSS-TTS-NANO (OPENMOSS)
MOSS-TTS-Nano: multilingual voice cloning runs on 4 CPU cores
A 100M Apache-2.0 model that fills the gap Kokoro-82M doesn't cover: multilingual TTS with voice cloning from a short reference audio, real-time on 4 CPU cores.
The first time those three properties co-existed in an open-weight pick.

2026-04-07 · GLM-5.1 (Z.AI)
GLM-5.1: first open-weight model to lead SWE-Bench Pro
Z.ai's 744B MoE shipped under MIT and hit 58.4 on SWE-Bench Pro, narrowly beating GPT-5.4 and Claude Opus 4.6 on that benchmark. Crucial caveat: at 466 GB at Q4, it is realistically hosted-only. The 'open-weight' framing matters less than the SWE-Bench Pro leadership.

2026-04-02 · GOOGLE GEMMA 4
Gemma 4: the license change is a bigger story than the model
Google shipped Gemma 4 (31B dense + 26B MoE / 3.8B active) under Apache 2.0, moving off the custom Gemma Terms that constrained commercial work in Gemma 1–3. For builders shipping commercial products, this matters more than the benchmark gains.
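
Two of the entries above (GPT-5.5 and Claude Opus 4.7) turn on effective cost rather than list price. The arithmetic behind both can be sketched as below. This is a minimal illustration, not the planner's actual method: the function name `effective_cost_multiplier` is hypothetical, and it assumes spend scales as per-token price times token count, with output tokens dominating the bill. The ~40% and ~35% inputs are the vendor-claimed and reported figures quoted in those entries.

```python
def effective_cost_multiplier(price_multiplier: float, token_multiplier: float) -> float:
    """Relative spend after a change in per-token price and in tokens used.

    Assumes spend = price per token * number of tokens (hypothetical helper).
    """
    return price_multiplier * token_multiplier


# GPT-5.5 entry: 2x the GPT-5.4 rates, ~40% fewer output tokens (OpenAI's claim).
gpt_55 = effective_cost_multiplier(2.0, 1.0 - 0.40)

# Opus 4.7 entry: unchanged rates, ~35% more tokens from the new tokenizer.
opus_47 = effective_cost_multiplier(1.0, 1.35)

print(f"GPT-5.5 net change:  {gpt_55 - 1:+.0%}")   # +20%
print(f"Opus 4.7 net change: {opus_47 - 1:+.0%}")  # +35%
```

The GPT-5.5 figure applies only to the output side under these assumptions; input tokens are billed at the doubled rate with no offsetting reduction, so an input-heavy workload would see a larger rise.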