Llama 3.3 70B Instruct

The community-standard 70B dense for local. Reliable, well-supported across llama.cpp / vLLM / TensorRT-LLM / MLX, and the proven daily driver for 48 GB+ discrete or 96 GB+ unified. Qwen 3.5 has no 70B dense (jumps from 27B to 122B-A10B), so Llama 3.3 still owns this slot in 2026.

License: Llama 3.3 Community License (custom — not Apache; commercial OK with attribution + 700M MAU cap) · Context: 128K · Released: December 6, 2024

The decision in five lines

The call: Skip for local — for coding
Best for: coding · chat · docs · agents
Runs on: 8 hardware picks fit (cheapest: Dual RTX 3090 (used) · $1,800)
Watch out: Tight VRAM budgets — at Q4 it needs ~46 GB total with KV at 32K context.
Evidence: Estimated · last verified April 2026

PARAMETERS: 70B dense
TYPE: Dense
CONTEXT: 128K
VRAM AT Q4: ~40 GB (Q4_K_M) / ~70 GB (BF16 on DGX Spark)

Where we recommend this

Every tier slot in the planner where this model is a top or alternate pick. Pulled live from planner.js — when the planner refreshes, this table stays current.

CODING

Llama 3.3 70B Q4 dense: Community-standard 70B dense. Fits 96 GB Mac unified memory cleanly (no tweak), 22 tok/s on M5 Max 128 GB, full BF16 on DGX Spark. Mature across llama.cpp / vLLM / TensorRT-LLM / MLX.

CHAT

Llama 3.3 70B Q4 dense: Proven daily-driver 70B dense. Focused, reliable, and well-supported across all runners. The community pick for "what should my $4K rig run for chat."

DOCS

Llama 3.3 70B Q4 + RAG (128K reliable): Community consensus is that no locally runnable model has truly reliable 1M context as of June 2026. Llama 3.3 70B at 128K context with proper RAG is the stable, proven path. In practice, the model attends reliably to roughly 32-64K tokens of input.

AGENTS

Llama 3.3 70B Q4 dense: The reliable 70B for production agent loops. Battle-tested, with broad framework support (LangGraph / CrewAI / AutoGen / Qwen Code).
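The DOCS entry above suggests keeping retrieved input within roughly 32-64K tokens. A minimal sketch of how a RAG pipeline might enforce that budget is shown here. The function names, the 4-characters-per-token estimate, and the reserve size are all illustrative assumptions, not part of the planner or of any particular framework; a real pipeline would count tokens with the model's tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token)."""
    return max(1, round(len(text) / chars_per_token))


def pack_chunks(ranked_chunks, budget_tokens=32_000, reserve_tokens=2_000):
    """Greedily keep the highest-ranked chunks that fit the input budget.

    ranked_chunks: list of strings, best match first.
    reserve_tokens: room kept back for the question and the answer.
    """
    remaining = budget_tokens - reserve_tokens
    kept = []
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if cost > remaining:
            continue  # too large for what is left; a later, smaller chunk may still fit
        kept.append(chunk)
        remaining -= cost
    return kept
```

Starting at the low end of the 32-64K range and raising `budget_tokens` only if answers stay accurate is one cautious way to use it.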

The call

The community-standard 70B dense for local. Reliable, well-supported across llama.cpp / vLLM / TensorRT-LLM / MLX, and the proven daily driver for 48 GB+ discrete or 96 GB+ unified. Qwen 3.5 has no 70B dense (jumps from 27B to 122B-A10B), so Llama 3.3 still owns this slot in 2026.
When not to use: Tight VRAM budgets. At Q4 it needs ~46 GB total, including the KV cache at 32K context. An RTX 5090 (32 GB) cannot fit Q4 and would have to drop to IQ2, at a cost in quality. Also skip it for multimodal work: Llama 3.3 is text-only.
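To see where a total like ~46 GB can come from, here is a rough back-of-the-envelope sketch: weights plus KV cache. The layer count, KV-head count, and head dimension are taken from the published Llama 3 70B configuration. The ~40 GB weight figure is the page's own Q4_K_M number. The cache precision (bytes per value) is an assumption you should set to match your runner, and runtime overhead such as compute buffers is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    n_layers: int
    n_kv_heads: int
    head_dim: int


# Llama 3 70B-class shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
LLAMA_70B = ModelShape(n_layers=80, n_kv_heads=8, head_dim=128)


def kv_cache_bytes(shape: ModelShape, context_tokens: int, bytes_per_value: int) -> int:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * context_tokens * bytes_per_value)


def total_gb(weights_gb: float, shape: ModelShape,
             context_tokens: int, bytes_per_value: int) -> float:
    """Weights plus KV cache in decimal GB; ignores runtime overhead."""
    return weights_gb + kv_cache_bytes(shape, context_tokens, bytes_per_value) / 1e9


if __name__ == "__main__":
    # Assumption: 8-bit cache (1 byte per value) vs. 16-bit cache (2 bytes per value).
    print(round(total_gb(40.0, LLAMA_70B, 32_768, 1), 1))
    print(round(total_gb(40.0, LLAMA_70B, 32_768, 2), 1))
```

Under these assumptions the 8-bit cache case lands at about 45.4 GB, close to the ~46 GB quoted above, while a 16-bit cache pushes the same setup to about 50.7 GB. Whether the page's figure assumes a quantized cache is an inference from the arithmetic, not something the page states.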

Runner notes

Ollama tag `llama3.3:70b` (Q4_K_M default). On DGX Spark 128 GB, BF16 fits without quantization. M5 Max 128 GB: ~22 tok/s at Q4. M4 Max 96 GB: 8-15 tok/s with sysctl wired-memory tweak. AMD ROCm path: HIP build of llama.cpp is reliable at this size.
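As a quick way to try the Ollama tag above, the following sketch sends one prompt to a local Ollama server over its REST API. It assumes Ollama is running on its default port (11434) and that `llama3.3:70b` has already been pulled. The helper names and the `num_ctx` value of 32768 are illustrative choices, not recommendations from this page.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "llama3.3:70b",
                  num_ctx: int = 32768) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window requested for this call
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize grouped-query attention in two sentences.")` returns the model's reply as a string. A larger `num_ctx` increases KV-cache memory, so it is worth raising it only as far as the hardware allows.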

License: Llama 3.3 Community License (custom — not Apache; commercial OK with attribution + 700M MAU cap)
Released: December 6, 2024
Maker: Meta
Model card: huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Hardware that fits

Every hardware pick whose memory fits this model at the quant we recommend. Sorted cheapest-first — the top row is your best-value fit. Click through for the full buyer’s guide.
