45Squared

Which LLMs Actually Run on CPU-Only Hardware — And When It Makes Sense

AI & Infrastructure


You don’t need a $10,000 GPU to run a local LLM. Here’s a practical breakdown of what runs on CPU, how fast it goes, and where it fits in a real stack.

Tags: llama.cpp · Ollama · GGUF · Llamafile · vLLM

At a glance: 8–22 tokens/sec on 7B models · ~5 GB RAM for a quantized 7B model · $0 in GPU cost

Why CPU-Only LLM Inference Matters

The default assumption in AI infrastructure is that you need GPUs. For training, that’s true. For inference — running a model that’s already been trained — it’s not always the case.

CPU-only inference solves real problems that GPUs can’t:

Cost. Data-center GPUs like NVIDIA's A100 and H100 cost thousands of dollars each. CPU inference runs on hardware you already own.
Privacy. Air-gapped, on-premises deployment where data never leaves the building. No API calls, no third-party exposure.
Edge Deployment. Retail kiosks, field-service equipment, embedded systems — these have CPUs, not discrete GPUs.
Portability. CPU inference works on x86 and ARM without vendor lock-in to NVIDIA’s CUDA ecosystem.

I run models on CPU in my homelab and for prototyping AI agents at 45Squared. Here’s the landscape as it stands in 2026 — what works, what doesn’t, and where the performance boundaries are.

The Models: What Actually Runs on CPU

Not every LLM is practical on CPU. The sweet spot is 1B to 14B parameters with 4-bit quantization. Beyond that, you’re waiting more than generating.

Tier 1: Fast and Practical (1–3B Parameters)

These models run at 20–50 tokens/sec on consumer hardware. Good enough for real-time chat on a laptop.

| Model | Creator | Params | RAM (Q4) | Best For |
| --- | --- | --- | --- | --- |
| SmolLM2 | Hugging Face | 135M–1.7B | ~1–2 GB | On-device, IoT, edge |
| Llama 3.2 | Meta | 1B, 3B | ~2–3 GB | General chat, lightweight agents |
| Phi-3 Mini | Microsoft | 3.8B | ~3 GB | Reasoning, instruction following |
| Gemma 3 | Google | 1B, 4B | ~1–3 GB | Multilingual, summarization |
| Qwen 3 | Alibaba | 0.6B–4B | ~1–3 GB | Coding, multilingual |

Tier 2: The Sweet Spot (7–14B Parameters)

8–22 tokens/sec on a decent CPU. This is where CPU inference gets genuinely useful for real work — RAG pipelines, code generation, document processing.

| Model | Creator | Params | RAM (Q4) | Best For |
| --- | --- | --- | --- | --- |
| Llama 3.1 | Meta | 8B | ~5–6 GB | General-purpose, huge community |
| Mistral 7B v0.3 | Mistral AI | 7B | ~4–5 GB | Best quality/size ratio |
| Mistral Nemo | Mistral / NVIDIA | 12B | ~7–8 GB | 128K context, quantization-aware |
| Qwen 3 | Alibaba | 8B | ~5–6 GB | Strong coding, MoE variants |
| Phi-4 | Microsoft | 14B | ~8–9 GB | Reasoning, math |
| DeepSeek-R1 Distill | DeepSeek | 7B, 8B, 14B | ~4–9 GB | Chain-of-thought reasoning |

Tier 3: Possible But Slow (32–70B Parameters)

2–7 tokens/sec. Requires 32–64 GB of RAM. Usable for batch processing where latency isn’t critical, but not ideal for interactive use.

| Model | Params | RAM (Q4) | Notes |
| --- | --- | --- | --- |
| Qwen 3 32B | 32B | ~18 GB | Best quality at this tier |
| DeepSeek-R1 Distill 32B | 32B | ~18 GB | Strong reasoning at 32B |
| Llama 3.1 70B | 70B | ~35 GB | Needs 64 GB+ system RAM; very slow |
| Llama 4 Scout (MoE) | 17B active / 109B total | ~60 GB+ | 16 experts; technically CPU-runnable |

Quantization: The Key to CPU Inference

Full-precision models are too large for CPU inference. Quantization compresses model weights from 16-bit floats to 4-bit or lower integers, shrinking the model by 4x with minimal quality loss.

GGUF (GPT-Generated Unified Format) is the standard for CPU inference. It replaced the older GGML format and is natively supported by llama.cpp, Ollama, and LM Studio.

The “K-quant” variants use mixed precision — more important weights get higher precision, less important weights get lower. This is why Q4_K_M outperforms a naive 4-bit quantization at the same file size.

| Quant Level | Bits | Quality Retained | Recommendation |
| --- | --- | --- | --- |
| Q2_K | 2-bit | ~85–90% | Last resort — noticeable quality loss |
| Q3_K_M | 3-bit | ~90–93% | Tight RAM budgets only |
| Q4_K_M | 4-bit | ~93–96% | Default choice — best balance of size and quality |
| Q5_K_M | 5-bit | ~95–97% | If you have the RAM to spare |
| Q6_K | 6-bit | ~97–98% | Near-lossless |
| Q8_0 | 8-bit | ~99%+ | Virtually no quality loss; 2x the size of Q4 |
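The size/quality trade-off above can be expressed as a small helper that picks the lowest-bit quant level meeting a target quality. The quality and effective-bits figures below are illustrative midpoints of the ranges in the table, not benchmarked values, and the function name is my own:

```python
# Approximate quality retained per GGUF quant level (midpoints of the
# ranges in the table above; illustrative, not benchmarked values).
QUANT_QUALITY = {
    "Q2_K": 0.875, "Q3_K_M": 0.915, "Q4_K_M": 0.945,
    "Q5_K_M": 0.960, "Q6_K": 0.975, "Q8_0": 0.990,
}

# Rough effective bits per weight. K-quants mix precisions, so the
# effective figure sits slightly above the nominal bit width.
QUANT_BITS = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.85,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def smallest_quant(min_quality: float) -> str:
    """Return the lowest-bit quant level that still meets min_quality."""
    candidates = [q for q, score in QUANT_QUALITY.items() if score >= min_quality]
    return min(candidates, key=lambda q: QUANT_BITS[q])

print(smallest_quant(0.93))  # Q4_K_M under these assumed numbers
```

Under these assumptions, asking for ~93% quality lands on Q4_K_M, which is why it shows up as the default recommendation.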

Quick RAM Formula

RAM (GB) ≈ Parameters (B) × 0.57 + 1.5 GB overhead

Example: 7B model at Q4_K_M = 7 × 0.57 + 1.5 ≈ 5.5 GB
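The formula translates directly into a few lines of Python. The 0.57 bytes-per-parameter figure approximates Q4_K_M's ~4.85 bits per weight plus metadata, so treat the result as an estimate:

```python
def estimate_ram_gb(params_billion: float, bytes_per_param: float = 0.57,
                    overhead_gb: float = 1.5) -> float:
    """Rough RAM needed to run a Q4_K_M-quantized model of the given size."""
    return params_billion * bytes_per_param + overhead_gb

for size in (3, 7, 14, 32):
    print(f"{size}B model: ~{estimate_ram_gb(size):.1f} GB")  # 7B -> ~5.5 GB
```

Compare the output against your installed RAM, and leave a few GB of headroom for the OS and the KV cache at long context lengths.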

Inference Engines: How to Run Them

The model is the weights. The inference engine is the runtime that loads those weights and generates tokens. Here are the ones that matter for CPU.

llama.cpp

The foundational C/C++ inference engine. Maximum performance and flexibility. Uses AVX2, AVX512, and ARM NEON vector intrinsics for hardware-level optimization.

Best for: Maximum throughput, fine-grained control

Ollama

A user-friendly wrapper around llama.cpp with a model registry and REST API. One command to download and run any model. Docker-native. Flash Attention enabled by default.

Best for: Quick setup, API integration, Docker deployments

Llamafile

Bundles the model + runtime into a single executable. Zero dependencies. Copy one file to an air-gapped network, run it. Mozilla maintains the project.

Best for: Air-gapped environments, maximum portability

vLLM (CPU Mode)

Production-grade inference server with PagedAttention, dynamic batching, and Intel IPEX optimizations. CPU mode is available but less mature than GPU.

Best for: Multi-user serving on EPYC/Xeon server hardware

Performance: What to Actually Expect

Here’s the reality. CPU inference is slower than GPU. The question is whether it’s fast enough for your use case. Human reading speed is roughly 4–5 tokens/second — anything above that feels responsive for interactive chat.

| Model Size | Consumer CPU (i5/Ryzen 5) | High-End (i9/Ryzen 9) | Server (EPYC/Xeon) |
| --- | --- | --- | --- |
| 1–3B | 30–50 tok/s | 40–70 tok/s | 50–100+ tok/s |
| 7–8B | 8–15 tok/s | 15–22 tok/s | 20–40 tok/s |
| 13–14B | 4–8 tok/s | 8–14 tok/s | 12–25 tok/s |
| 32B | 2–4 tok/s | 4–7 tok/s | 8–15 tok/s |
| 70B | <1–2 tok/s | 2–4 tok/s | 5–10 tok/s |

The Bottleneck Is Memory Bandwidth, Not Compute

CPU inference speed is limited by how fast the CPU can read model weights from RAM. DDR5 at 6400 MT/s delivers ~50–100 GB/s depending on channel count. GPU HBM delivers 1–3 TB/s. That 10–30x bandwidth gap is why GPUs are faster — not because they have better math units.

The practical takeaway: If you’re running 7–8B models for a single user, CPU is fine. If you need 70B quality or multi-user serving, you either need server-grade hardware (EPYC with 8+ memory channels) or a GPU.
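Because decoding is bandwidth-bound, you can put a rough ceiling on tokens/sec by dividing memory bandwidth by model size — generating each token requires streaming (roughly) all of the weights from RAM once. A sketch, with illustrative bandwidth numbers:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: each token streams the full weights once."""
    return bandwidth_gb_s / model_size_gb

# ~5 GB Q4 7B model on dual-channel DDR5 (~64 GB/s) vs. GPU HBM (~1000 GB/s):
print(max_tokens_per_sec(64, 5))    # 12.8 — consistent with the 8-15 tok/s row
print(max_tokens_per_sec(1000, 5))  # 200.0 — why HBM-equipped GPUs pull ahead
```

Real throughput lands somewhat below this ceiling (cache effects, compute overhead, sampling), but the estimate explains the table above surprisingly well.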

Hardware Recommendations

Memory bandwidth matters more than core count or clock speed for CPU inference. Prioritize RAM channels and speed over everything else.

Entry Level

1–3B Models

Any modern CPU (Intel 8th gen+ / Ryzen 3000+) with 8 GB RAM. Good for simple chatbots, text classification, and summarization.

Sweet Spot

7–8B Models

Intel i5/i7 12th gen+ or Ryzen 5/7 5000+ with 16 GB DDR5. Handles RAG pipelines, code generation, and document processing.

Homelab / Enthusiast

14–32B Models

Intel i9 or Ryzen 9 / Threadripper with 32–64 GB DDR5 (dual-channel minimum). Higher quality responses and complex reasoning tasks.

Server / Production

32–70B Models

AMD EPYC 9000 or Intel Xeon 6th gen with 128–512 GB RAM (8–12 memory channels). Multi-user serving and production API endpoints.

When CPU Is Enough — and When It Isn’t

The 10x rule of thumb: GPU inference is roughly 10x faster than CPU for the same model. An RTX 4070 runs Llama 3.1 8B at ~68 tok/s vs. ~8–12 tok/s on a comparable CPU. The question is whether you need that speed.

CPU Is Sufficient When

  • Models up to ~8B for interactive single-user chat
  • Models up to ~14B for batch or async processing
  • Air-gapped / on-prem with no GPU infrastructure
  • Edge deployment on standard hardware
  • Budget-constrained environments
  • Always-on services where power consumption matters
  • Prototyping and development before production deployment

You Need GPU When

  • Running 70B+ models at interactive speed
  • Serving multiple concurrent users with low latency
  • Processing images or video with vision models
  • Training or fine-tuning any model
  • Throughput needs exceed ~20 tok/s on 8B+ models
  • Real-time applications (live transcription, instant code completion)

Real Deployment Scenarios

Homelab / Personal Use

Ollama + Llama 3.2 3B or Qwen 3 8B on a 16 GB desktop. Fast enough for a personal chatbot, code help, and document Q&A. Additional cost beyond existing hardware: $0.

Edge / IoT

SmolLM2 1.7B or Gemma 3 1B on ARM single-board computers or Intel NUCs with 8 GB RAM. Runs at 15–30 tok/s. Use cases include retail kiosks, field-service assistants, and in-vehicle systems.

Air-Gapped / Classified Environments

Llamafile with Mistral 7B or Phi-3 bundled as a single executable. Copy one file to the secure network, run it. No package managers, no internet, no dependencies.

Cost-Sensitive Startup

vLLM on EPYC servers running Qwen 3 14B or Llama 3.1 8B. Serve dozens of users without GPU costs. Monthly server spend: ~$200–400 vs. $2,000+ for GPU instances.

Developer Workstation

LM Studio or Ollama on a dev laptop for local code completion, docs generation, and testing prompts before sending to a production API. Keep the expensive API calls for production.

My Recommendation

If you’re starting out with local LLMs on CPU, here’s the simplest path:

1. Install Ollama.
2. Run ollama run llama3.2:3b if you have 8 GB RAM.
3. Run ollama run qwen3:8b if you have 16 GB RAM.
4. Use Q4_K_M quantization (Ollama handles this automatically).
5. Hit the REST API at localhost:11434 from your application.

That’s it. You’ll get a local LLM running in under 5 minutes with zero GPU spend. From there, you can evaluate whether you need to scale up to a larger model, switch to llama.cpp for more performance, or add a GPU to the mix.
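Step 5 can be sketched with nothing but the standard library. This assumes a default Ollama install listening on localhost:11434 and its /api/generate endpoint; the request-building helper is split out so the payload can be inspected without a running server:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for Ollama's REST API."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example (after `ollama run llama3.2:3b` has pulled the model):
# print(generate("llama3.2:3b", "Explain GGUF in one sentence."))
```

Swap in the `requests` library or any HTTP client you prefer; the endpoint and payload shape stay the same.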

The models and tools have caught up. CPU-only inference isn’t a compromise anymore — it’s a legitimate architecture choice for the right workloads.

Need Help Building AI Into Your Stack?

45Squared runs fixed-scope Implementation Sprints — 2 weeks, production-ready, on AWS. Whether it’s local LLM deployment, AI agent architecture, or cloud infrastructure, we build the engine so you can focus on your business.
