the AI bench
VERIFIED JUNE 2026

MODEL · META · 70B DENSE

Llama 3.3 70B Instruct

The community-standard 70B dense for local. Reliable, well-supported across llama.cpp / vLLM / TensorRT-LLM / MLX, and the proven daily driver for 48 GB+ discrete or 96 GB+ unified. Qwen 3.5 has no 70B dense (jumps from 27B to 122B-A10B), so Llama 3.3 still owns this slot in 2026.

License: Llama 3.3 Community License (custom — not Apache; commercial OK with attribution + 700M MAU cap) · Context: 128K · Released: December 6, 2024

The decision in five lines

The call
Skip for local — for coding
Best for
coding · chat · docs · agents
Runs on
8 hardware picks fit (cheapest: Dual RTX 3090 (used) · $1,800)
Watch out
Tight VRAM budgets — at Q4 it needs ~46 GB total with KV at 32K context.
Evidence
Estimated · last verified April 2026

PARAMETERS
70B dense
TYPE
Dense
CONTEXT
128K
VRAM
~40 GB (Q4_K_M) / ~70 GB (8-bit on DGX Spark)

Where we recommend this

Every tier slot in the planner where this model is a top or alternate pick. Pulled live from planner.js — when the planner refreshes, this table stays current.

CODING
Llama 3.3 70B Q4 dense: Community-standard 70B dense. Fits 96 GB Mac unified cleanly (no tweak), 22 tok/s on M5 Max 128 GB, 8-bit (~70 GB) on DGX Spark. Mature across llama.cpp / vLLM / TensorRT-LLM / MLX.
CHAT
Llama 3.3 70B Q4 dense: Proven daily-driver 70B dense. Focused, reliable, and well-supported across all runners. The community pick for "what should my $4K rig run for chat."
DOCS
Llama 3.3 70B Q4 + RAG (128K reliable): Community consensus is that no locally runnable model has truly reliable 1M context as of June 2026. Llama 3.3 70B at 128K context with proper RAG is the stable, proven path. In practice, recall stays reliable over roughly 32-64K tokens of input.
AGENTS
Llama 3.3 70B Q4 dense: The reliable 70B for production agent loops. Battle-tested, with broad framework support (LangGraph / CrewAI / AutoGen / Qwen Code).

The call

When not to use: Tight VRAM budgets. At Q4 it needs ~46 GB total with KV at 32K context, so a single RTX 5090 (32 GB) cannot fit Q4; it would need an IQ2-class quant, with a clear quality cost. Also skip it for multimodal work: Llama 3.3 is text-only.
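
The VRAM figure above is weights plus KV cache, and the KV part can be estimated by hand. The sketch below is a rough check, not a measurement: it assumes the published Llama 3 70B layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128), takes the quantized weight size as an input, uses decimal GB throughout, and ignores runtime compute buffers. The function names are illustrative.

```python
GB = 10 ** 9  # decimal gigabytes, to match the weight sizes quoted on this page

def kv_cache_bytes(n_ctx, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: one K and one V vector per layer, per KV head, per token.

    Defaults are the assumed Llama 3 70B layout; bytes_per_elem=2 means FP16 KV.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

def fits(vram_gb, weights_gb, n_ctx, bytes_per_elem=2):
    """True if weights plus KV cache fit in vram_gb (runtime overhead not counted)."""
    kv_gb = kv_cache_bytes(n_ctx, bytes_per_elem=bytes_per_elem) / GB
    return weights_gb + kv_gb <= vram_gb

if __name__ == "__main__":
    n_ctx = 32 * 1024
    for label, width in (("FP16 KV", 2), ("8-bit KV", 1)):
        kv_gb = kv_cache_bytes(n_ctx, bytes_per_elem=width) / GB
        print(f"{label} at 32K context: {kv_gb:.1f} GB")
    print("32 GB card, Q4 weights:", fits(32, 40, n_ctx))
    print("48 GB pair, Q4 weights, 8-bit KV:", fits(48, 40, n_ctx, bytes_per_elem=1))
```

Under these assumptions a 32K context adds roughly 10.7 GB of KV cache at FP16 or 5.4 GB at 8-bit, so KV precision decides whether a 48 GB setup has headroom.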

Runner notes

Ollama tag `llama3.3:70b` (Q4_K_M default). On DGX Spark 128 GB, an 8-bit quant (~70 GB) fits; BF16 weights (~140 GB at 2 bytes per parameter) do not. M5 Max 128 GB: ~22 tok/s at Q4. M4 Max 96 GB: 8-15 tok/s with sysctl wired-memory tweak. AMD ROCm path: HIP build of llama.cpp is reliable at this size.
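
To call the `llama3.3:70b` tag from a script, one option is Ollama's local HTTP API. The sketch below assumes a default Ollama install listening on `localhost:11434` and that the model has already been pulled. The helper names, the 32K `num_ctx` setting, and the prompt are illustrative choices, not settings taken from this page.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="llama3.3:70b", num_ctx=32768):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window to allocate
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("Summarize the Llama 3.3 license in one sentence."))
```

Keeping request construction separate from the network call makes it easy to inspect the payload, or lower `num_ctx`, before committing memory to a large context.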

License
Llama 3.3 Community License (custom — not Apache; commercial OK with attribution + 700M MAU cap)
Released
December 6, 2024
Maker
Meta

Hardware that fits

Every hardware pick whose memory fits this model at the quant we recommend. Sorted cheapest-first — the top row is your best-value fit. Click through for the full buyer’s guide.
