Which LLMs Actually Run on CPU-Only Hardware — And When It Makes Sense
You don’t need a $10,000 GPU to run a local LLM. Here’s a practical breakdown of what runs on CPU, how fast it goes, and where it fits in a real stack.
Why CPU-Only LLM Inference Matters
The default assumption in AI infrastructure is that you need GPUs. For training, that’s true. For inference — running a model that’s already been trained — it’s not always the case.
CPU-only inference solves real problems that GPUs can’t: air-gapped deployments, edge hardware, always-on services where power draw matters, and budget-constrained environments.
I run models on CPU in my homelab and for prototyping AI agents at 45Squared. Here’s the landscape as it stands in 2026 — what works, what doesn’t, and where the performance boundaries are.
The Models: What Actually Runs on CPU
Not every LLM is practical on CPU. The sweet spot is 1B to 14B parameters with 4-bit quantization. Beyond that, you’re waiting more than generating.
Tier 1: Fast and Practical (1–3B Parameters)
These models run at 20–50 tokens/sec on consumer hardware. Good enough for real-time chat on a laptop.
| Model | Creator | Params | RAM (Q4) | Best For |
|---|---|---|---|---|
| SmolLM2 | Hugging Face | 135M–1.7B | ~1–2 GB | On-device, IoT, edge |
| Llama 3.2 | Meta | 1B, 3B | ~2–3 GB | General chat, lightweight agents |
| Phi-3 Mini | Microsoft | 3.8B | ~3 GB | Reasoning, instruction following |
| Gemma 3 | Google | 1B, 4B | ~1–3 GB | Multilingual, summarization |
| Qwen 3 | Alibaba | 0.6B–4B | ~1–3 GB | Coding, multilingual |
Tier 2: The Sweet Spot (7–14B Parameters)
8–22 tokens/sec on a decent CPU. This is where CPU inference gets genuinely useful for real work — RAG pipelines, code generation, document processing.
| Model | Creator | Params | RAM (Q4) | Best For |
|---|---|---|---|---|
| Llama 3.1 / 3.3 | Meta | 8B | ~5–6 GB | General-purpose, huge community |
| Mistral 7B v0.3 | Mistral AI | 7B | ~4–5 GB | Best quality/size ratio |
| Mistral Nemo | Mistral / NVIDIA | 12B | ~7–8 GB | 128K context, quantization-aware |
| Qwen 3 | Alibaba | 8B | ~5–6 GB | Strong coding, MoE variants |
| Phi-4 | Microsoft | 14B | ~8–9 GB | Reasoning, math |
| DeepSeek-R1 Distill | DeepSeek | 7B, 8B, 14B | ~4–9 GB | Chain-of-thought reasoning |
Tier 3: Possible But Slow (32–70B Parameters)
2–7 tokens/sec. Requires 32–64 GB of RAM. Usable for batch processing where latency isn’t critical, but not ideal for interactive use.
| Model | Params | RAM (Q4) | Notes |
|---|---|---|---|
| Qwen 3 32B | 32B | ~18 GB | Best quality at this tier |
| DeepSeek-R1 Distill 32B | 32B | ~18 GB | Strong reasoning at 32B |
| Llama 3.1 70B | 70B | ~35 GB | Needs 64 GB+ system RAM; very slow |
| Llama 4 Scout (MoE) | 17B active / 109B total | ~60 GB+ | 16 experts; technically CPU-runnable |
Quantization: The Key to CPU Inference
Full-precision models are too large for CPU inference. Quantization compresses model weights from 16-bit floats to 4-bit or lower integers, shrinking the model by 4x with minimal quality loss.
GGUF is the standard weights format for CPU inference. It replaced the older GGML format and is natively supported by llama.cpp, Ollama, and LM Studio.
The “K-quant” variants use mixed precision — more important weights get higher precision, less important weights get lower. This is why Q4_K_M outperforms a naive 4-bit quantization at the same file size.
| Quant Level | Bits | Quality Retained | Recommendation |
|---|---|---|---|
| Q2_K | 2-bit | ~85–90% | Last resort — noticeable quality loss |
| Q3_K_M | 3-bit | ~90–93% | Tight RAM budgets only |
| Q4_K_M | 4-bit | ~93–96% | Default choice — best balance of size and quality |
| Q5_K_M | 5-bit | ~95–97% | If you have the RAM to spare |
| Q6_K | 6-bit | ~97–98% | Near-lossless |
| Q8_0 | 8-bit | ~99%+ | Virtually no quality loss; 2x the size of Q4 |
Quick RAM Formula
RAM (GB) ≈ Parameters (B) × 0.57 + 1.5 GB overhead
Example: 7B model at Q4_K_M = 7 × 0.57 + 1.5 ≈ 5.5 GB
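The formula is easy to wrap in a helper. A minimal sketch, where 0.57 bytes per parameter is the Q4_K_M approximation from the formula above and the function name is my own:

```python
def estimate_ram_gb(params_b: float,
                    bytes_per_param: float = 0.57,
                    overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a Q4_K_M-quantized model.

    params_b is the parameter count in billions; 0.57 bytes/param
    approximates Q4_K_M's mixed-precision weights, plus a flat
    overhead for KV cache and runtime buffers.
    """
    return params_b * bytes_per_param + overhead_gb

# A 7B model at Q4_K_M lands around 5.5 GB:
print(round(estimate_ram_gb(7), 1))   # 5.5
# A 14B model like Phi-4 lands around 9.5 GB:
print(round(estimate_ram_gb(14), 1))  # 9.5
```

Treat the output as a floor, not a ceiling — long contexts grow the KV cache well past the flat overhead assumed here.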
Inference Engines: How to Run Them
The model is the weights. The inference engine is the runtime that loads those weights and generates tokens. Here are the ones that matter for CPU.
llama.cpp
The foundational C/C++ inference engine. Maximum performance and flexibility. Uses AVX2, AVX512, and ARM NEON vector intrinsics for hardware-level optimization.
Best for: Maximum throughput, fine-grained control
Ollama
A user-friendly wrapper around llama.cpp with a model registry and REST API. One command to download and run any model. Docker-native. Flash Attention enabled by default.
Best for: Quick setup, API integration, Docker deployments
Llamafile
Bundles the model + runtime into a single executable. Zero dependencies. Copy one file to an air-gapped network, run it. Mozilla maintains the project.
Best for: Air-gapped environments, maximum portability
vLLM (CPU Mode)
Production-grade inference server with PagedAttention, dynamic batching, and Intel IPEX optimizations. CPU mode is available but less mature than GPU.
Best for: Multi-user serving on EPYC/Xeon server hardware
Performance: What to Actually Expect
Here’s the reality. CPU inference is slower than GPU. The question is whether it’s fast enough for your use case. Human reading speed is roughly 4–5 tokens/second — anything above that feels responsive for interactive chat.
| Model Size | Consumer CPU (i5/Ryzen 5) | High-End (i9/Ryzen 9) | Server (EPYC/Xeon) |
|---|---|---|---|
| 1–3B | 30–50 tok/s | 40–70 tok/s | 50–100+ tok/s |
| 7–8B | 8–15 tok/s | 15–22 tok/s | 20–40 tok/s |
| 13–14B | 4–8 tok/s | 8–14 tok/s | 12–25 tok/s |
| 32B | 2–4 tok/s | 4–7 tok/s | 8–15 tok/s |
| 70B | <1–2 tok/s | 2–4 tok/s | 5–10 tok/s |
The Bottleneck Is Memory Bandwidth, Not Compute
CPU inference speed is limited by how fast the CPU can read model weights from RAM. DDR5 at 6400 MT/s delivers ~50–100 GB/s depending on channel count. GPU HBM delivers 1–3 TB/s. That 10–30x bandwidth gap is why GPUs are faster — not because they have better math units.
The practical takeaway: If you’re running 7–8B models for a single user, CPU is fine. If you need 70B quality or multi-user serving, you either need server-grade hardware (EPYC with 8+ memory channels) or a GPU.
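The bandwidth ceiling is easy to estimate on the back of an envelope: each generated token streams (roughly) the full set of weights through the CPU once, so throughput is capped at bandwidth divided by model size. A sketch under that simplifying assumption (dense model, no cache effects, single-user decode):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode throughput: every token requires reading
    all model weights from RAM once, so tok/s <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Dual-channel DDR5 (~50 GB/s) with a ~5 GB Q4 7B model:
print(max_tokens_per_sec(50, 5))     # 10.0 tok/s ceiling
# HBM-class GPU memory (~1000 GB/s), same model:
print(max_tokens_per_sec(1000, 5))   # 200.0 tok/s ceiling
```

The numbers line up with the table above: real-world 7–8B CPU speeds sit a bit below the theoretical 10 tok/s ceiling for dual-channel DDR5, and adding memory channels (as EPYC does) raises the ceiling directly.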
Hardware Recommendations
Memory bandwidth matters more than core count or clock speed for CPU inference. Prioritize RAM channels and speed over everything else.
1–3B Models
Any modern CPU (Intel 8th gen+ / Ryzen 3000+) with 8 GB RAM. Good for simple chatbots, text classification, and summarization.
7–8B Models
Intel i5/i7 12th gen+ or Ryzen 5/7 5000+ with 16 GB DDR5. Handles RAG pipelines, code generation, and document processing.
14–32B Models
Intel i9 or Ryzen 9 / Threadripper with 32–64 GB DDR5 (dual-channel minimum). Higher quality responses and complex reasoning tasks.
32–70B Models
AMD EPYC 9000 or Intel Xeon 6th gen with 128–512 GB RAM (8–12 memory channels). Multi-user serving and production API endpoints.
When CPU Is Enough — and When It Isn’t
The 10x rule of thumb: GPU inference is roughly 10x faster than CPU for the same model. An RTX 4070 runs Llama 3.1 8B at ~68 tok/s vs. ~8–12 tok/s on a comparable CPU. The question is whether you need that speed.
CPU Is Sufficient When
- Models up to ~8B for interactive single-user chat
- Models up to ~14B for batch or async processing
- Air-gapped / on-prem with no GPU infrastructure
- Edge deployment on standard hardware
- Budget-constrained environments
- Always-on services where power consumption matters
- Prototyping and development before production deployment
You Need GPU When
- Running 70B+ models at interactive speed
- Serving multiple concurrent users with low latency
- Processing images or video with vision models
- Training or fine-tuning any model
- Throughput needs exceed ~20 tok/s on 8B+ models
- Real-time applications (live transcription, instant code completion)
Real Deployment Scenarios
Homelab / Personal Use
Ollama + Llama 3.2 3B or Qwen 3 8B on a 16 GB desktop. Fast enough for a personal chatbot, code help, and document Q&A. Additional cost beyond existing hardware: $0.
Edge / IoT
SmolLM2 1.7B or Gemma 3 1B on ARM single-board computers or Intel NUCs with 8 GB RAM. Runs at 15–30 tok/s. Use cases include retail kiosks, field-service assistants, and in-vehicle systems.
Air-Gapped / Classified Environments
Llamafile with Mistral 7B or Phi-3 bundled as a single executable. Copy one file to the secure network, run it. No package managers, no internet, no dependencies.
Cost-Sensitive Startup
vLLM on EPYC servers running Qwen 3 14B or Llama 3.1 8B. Serve dozens of users without GPU costs. Monthly server spend: ~$200–400 vs. $2,000+ for GPU instances.
Developer Workstation
LM Studio or Ollama on a dev laptop for local code completion, docs generation, and testing prompts before sending to a production API. Keep the expensive API calls for production.
My Recommendation
If you’re starting out with local LLMs on CPU, here’s the simplest path:
1. Install Ollama.
2. Run `ollama run llama3.2:3b` if you have 8 GB RAM.
3. Run `ollama run qwen3:8b` if you have 16 GB RAM.
4. Use Q4_K_M quantization (Ollama handles this automatically).
5. Hit the REST API at `localhost:11434` from your application.
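Step 5 needs nothing beyond the standard library. A minimal sketch against Ollama's `/api/generate` endpoint, assuming the default port and a model you've already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming generate request for Ollama's REST API."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST to a locally running Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With Ollama running locally:
# print(generate("llama3.2:3b", "Summarize GGUF in one sentence."))
```

Setting `"stream": False` returns one JSON object instead of a stream of partial chunks, which keeps the client trivial; switch to streaming once you want tokens to appear as they're generated.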
That’s it. You’ll get a local LLM running in under 5 minutes with zero GPU spend. From there, you can evaluate whether you need to scale up to a larger model, switch to llama.cpp for more performance, or add a GPU to the mix.
The models and tools have caught up. CPU-only inference isn’t a compromise anymore — it’s a legitimate architecture choice for the right workloads.