
Run Frontier Models on Your Own Hardware

Open-weights models now score within 90-97% of proprietary frontier models on major benchmarks. Here's how to run them locally.

Quick Start: 5 Minutes to Local AI

Ollama (Easiest)

One-command install, pull models, start chatting.

# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a frontier model
ollama run llama4:8b

# Or a coding specialist
ollama run qwen3-coder:7b

LM Studio (GUI)

Desktop app with model browser and chat UI.

# Download from lmstudio.ai
# Then search & download models:
# - llama-4-8b
# - qwen3-coder-7b
# - deepseek-v3

OpenClaw ⚡

Run local models with a full AI agent interface. Chat, automate, and integrate with your tools.

# Install OpenClaw
npm install -g openclaw

# Configure with local model
openclaw config set model local

# Start chatting
openclaw chat

Learn more about OpenClaw →

Open Weights Model Comparison

Top open-weights models ranked by performance

Model | Params | Context | Best For | Min GPU | Score
GPT-OSS-120B | 120B | 128K | General, Coding, Reasoning | 80GB | 9.2
Llama 4 405B | 405B | 128K | Customization, Research | 8×80GB | 9.1
GLM-5 | 744B MoE | 200K | Bilingual, Agents | 80GB | 9.2
Qwen3-Coder-Next | 80B | 128K | Coding | 48GB | 9.3
DeepSeek V3.2 | 37B | 128K | Budget, High-volume | 24GB | 8.9
Llama 4 8B | 8B | 128K | Consumer GPUs | 8GB | 8.5

Scores are weighted averages of coding, reasoning, and tool-use benchmarks.

Local LLM Tools Compared

Ollama

  • + Easiest setup
  • + Large model library
  • + Cross-platform
  • + REST API built-in
  • - Less control over inference
  • - No GPU selection

Best for: Getting started fast
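Ollama's built-in REST API (default port 11434) makes it easy to script against. A minimal sketch using only the Python standard library; the model name is an example and must already be pulled with `ollama pull`:

```python
import json
import urllib.request

def build_payload(prompt, model="llama4:8b"):
    # Ollama's /api/generate endpoint takes model, prompt, and an
    # optional stream flag; stream=False returns one JSON object
    # instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama4:8b", host="http://localhost:11434"):
    # Assumes an Ollama server is running locally on the default port.
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the server also ships an OpenAI-compatible endpoint, the same pattern works with any OpenAI client pointed at `localhost:11434/v1`.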

LM Studio

  • + Beautiful GUI
  • + Model browser
  • + GPU selection
  • + OpenAI-compatible server
  • - Desktop only
  • - No CLI

Best for: Non-developers, experimentation
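LM Studio's local server speaks the OpenAI chat-completions format (default port 1234), so any OpenAI-compatible client can talk to it. A hedged sketch with the standard library; the model name is a placeholder for whatever you loaded in the app:

```python
import json
import urllib.request

def chat_request(messages, model="qwen3-coder-7b"):
    # Standard OpenAI-style chat payload; LM Studio does not require
    # an API key for its local server.
    return {"model": model, "messages": messages}

def chat(messages, host="http://localhost:1234"):
    # Assumes LM Studio's local server has been started in the app.
    data = json.dumps(chat_request(messages)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```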

vLLM

  • + Highest throughput
  • + Production-ready
  • + Paged attention
  • + Multi-GPU support
  • - Complex setup
  • - Requires Python

Best for: Production deployments

llama.cpp

  • + Runs anywhere
  • + Apple Silicon optimized
  • + Minimal dependencies
  • + Quantization support
  • - Manual model management
  • - CLI-focused

Best for: Maximum compatibility

Hardware Requirements by Model Size

8B Models

8GB VRAM

  • RTX 4070 / 3080
  • Apple M1/M2 16GB
  • RTX 3060 12GB

Llama 4 8B, Qwen3 7B

70B Models

48GB VRAM

  • RTX 6000 Ada
  • 2× RTX 4090
  • RTX A6000

Llama 4 70B, Qwen3 72B

120B+ Models

80GB VRAM

  • H100 / H200
  • MI300X
  • RTX 6000 Ada (quantized)

GPT-OSS-120B, GLM-5

400B+ Models

Multi-GPU

  • 8× H100
  • 4× H200
  • Cloud inference

Llama 4 405B (full)

💡 Pro tip: Use quantized models (4-bit, 8-bit) to run larger models on consumer hardware. A 70B model drops to roughly 40GB at 4-bit, and fits a single 24GB card with more aggressive ~2-3-bit quantization or partial CPU offload, with modest quality loss.
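The arithmetic behind the tip is simple: weight memory is parameter count times bytes per parameter, plus headroom for KV cache and activations. A back-of-envelope estimator; the ~20% overhead factor is an assumption, not a measured value, and actual footprints vary with quantization scheme and context length:

```python
def vram_gb(params_billions, bits, overhead=1.2):
    # Weights need params * (bits / 8) bytes; with params in billions
    # the result is in GB. Scale by ~20% for KV cache and activations.
    return params_billions * (bits / 8) * overhead

# 8B at 4-bit   -> ~4.8 GB: comfortable on an 8GB card
# 70B at 4-bit  -> ~42 GB: two 24GB GPUs or one 48GB card
# 120B at 4-bit -> ~72 GB: a single 80GB H100-class GPU
print(round(vram_gb(70, 4), 1))
```

Tighter quantizations (e.g. ~2.5 bits per weight) shrink a 70B model to roughly 22GB, which is how it squeezes onto one consumer card.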

Open Weights vs Proprietary: The Gap is Gone

90% LiveCodeBench score vs frontier
97% AIME 2025 math benchmark
$0 Per-token cost (self-hosted)

Source: WhatLLM.org February 2026 benchmarks. Open-weights models now routinely match or exceed proprietary models on coding and reasoning tasks.

See Full Comparison →