Running large language models locally used to require a PhD and a five-figure GPU budget. Not anymore. Ollama makes running LLMs on your own hardware as simple as ollama run llama3 — no API keys, no cloud costs, no data leaving your network.
In this guide, you’ll set up Ollama on your server, run popular models, expose an OpenAI-compatible API, and integrate it with tools like Open WebUI for a full ChatGPT replacement you own.
Why Run LLMs Locally?
- Privacy: Your prompts never leave your network. No training on your data.
- Cost: Zero per-token fees. Pay once for hardware, run forever.
- Speed: No rate limits. No API outages. Latency = your hardware speed.
- Offline: Works without internet after downloading models.
- Customization: Fine-tune models, create custom system prompts, merge adapters.
Hardware Requirements
Ollama runs on CPU, but GPU acceleration makes it practical for real use:
| Setup | RAM | GPU VRAM | Models You Can Run |
|---|---|---|---|
| Minimum | 8GB | None (CPU) | Phi-3 Mini, TinyLlama, Gemma 2B |
| Recommended | 16GB | 8GB | Llama 3 8B, Mistral 7B, Gemma 7B |
| Power User | 32GB | 16-24GB | Llama 3 70B (quantized), Mixtral 8x7B |
| Homelab Beast | 64GB+ | 48GB+ | Llama 3 70B full, DeepSeek Coder 33B |
Rule of thumb: You need roughly 1GB of RAM/VRAM per billion parameters for Q4 quantized models.
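The rule of thumb above can be turned into a quick back-of-the-envelope calculator. A rough sketch only: the flat overhead term for KV cache and runtime buffers is my assumption, and real usage varies with context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 1.0) -> float:
    """Rough RAM/VRAM estimate for a quantized model.

    Weights take roughly params * bits / 8 bytes; overhead_gb is a loose
    guess for the KV cache and runtime buffers (an assumption, not exact).
    """
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb + overhead_gb, 1)

# Q4 Llama 3 8B lands near the ~1 GB per billion parameters rule:
print(estimate_vram_gb(8))                      # → 5.0
print(estimate_vram_gb(70))                     # → 36.0
print(estimate_vram_gb(8, bits_per_weight=8))   # → 9.0 (Q8 costs more)
```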
Installation
Option 1: Direct Install (Recommended)
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama as a systemd service and auto-detects NVIDIA and AMD GPUs.
Verify the installation:
ollama --version
systemctl status ollama
Option 2: Docker
# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

volumes:
  ollama_data:
docker compose up -d
GPU Setup
NVIDIA: Install the NVIDIA Container Toolkit, then uncomment the GPU section in the compose file.
AMD: Use the rocm tag: ollama/ollama:rocm
CPU-only: Works out of the box, just slower. Expect ~5-10 tokens/second for 7B models on modern CPUs.
Running Your First Model
Pull and run a model:
# Interactive chat
ollama run llama3.1
# Pull without running
ollama pull mistral
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.1
Recommended Models to Start With
| Model | Size | Best For |
|---|---|---|
| llama3.1:8b | 4.7GB | General purpose, great quality/speed balance |
| mistral | 4.1GB | Fast, good at code and reasoning |
| gemma2:9b | 5.4GB | Google’s model, strong at tasks |
| codellama | 3.8GB | Code generation and completion |
| phi3:mini | 2.3GB | Lightweight, runs on anything |
| deepseek-coder-v2 | 8.9GB | Best open-source coding model |
| llama3.1:70b | 40GB | Near-GPT-4 quality (needs beefy hardware) |
The Ollama API
Ollama serves its native API, plus OpenAI-compatible endpoints under /v1, on port 11434:
# Generate a response
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain Docker volumes in one paragraph"
}'
# Chat endpoint (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# List models
curl http://localhost:11434/api/tags
This means any tool that supports the OpenAI API can use your local Ollama. Just point it at http://your-server:11434/v1.
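As a sketch of what that wiring looks like from code, using only the Python standard library: the helper names are mine, and the response shape follows the OpenAI chat-completions format that the /v1 endpoint mimics.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434"  # adjust to your server

def build_chat_request(model: str, user_message: str) -> dict:
    """Payload for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

def chat(model: str, user_message: str) -> str:
    """POST to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, user_message)).encode()
    req = request.Request(
        f"{OLLAMA_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# chat("llama3.1", "Hello!")  # requires a running Ollama server
```

Any OpenAI client library works the same way: set its base URL to your server's /v1 path and pass any non-empty string as the API key.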
Creating Custom Models (Modelfiles)
Ollama’s killer feature: custom models with system prompts baked in.
# Modelfile
FROM llama3.1
SYSTEM """
You are a senior DevOps engineer. You give concise, practical answers
focused on Docker, Kubernetes, and Linux systems. Always include
code examples. Never suggest cloud-managed services when self-hosted
alternatives exist.
"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Build and run:
ollama create devops-assistant -f Modelfile
ollama run devops-assistant
Now you have a custom model tuned for your use case.
Exposing Ollama on Your Network
By default, Ollama only listens on localhost. To make it available on your network:
Systemd Method
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama
Docker Method
Already handled — the compose file maps port 11434 to all interfaces.
Reverse Proxy with Auth (Recommended)
Don’t expose Ollama directly. Use a reverse proxy with authentication:
# Add to your existing Traefik, Caddy, or Nginx setup.
# Caddy example (Caddyfile):
ollama.yourdomain.com {
    basicauth {
        admin $2a$14$your_hashed_password
    }
    reverse_proxy localhost:11434
}
Connecting to Open WebUI
Open WebUI gives you a ChatGPT-like interface for your local models:
# Add to docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  open-webui_data:
Navigate to http://your-server:3000, create an account, and start chatting with your local models through a polished web interface.
Performance Tuning
Increase Context Window
# Default is 2048 tokens. Increase it from inside an interactive session:
ollama run llama3.1
/set parameter num_ctx 8192
# Or bake it into a Modelfile: PARAMETER num_ctx 8192
Keep Models Loaded
By default, models unload after 5 minutes of inactivity:
# Keep loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Or set in systemd environment
Environment="OLLAMA_KEEP_ALIVE=-1"
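The keep-alive can also be set per request through the native API. A small sketch: the payload fields follow Ollama's /api/generate schema, where keep_alive accepts a duration string like "30m" or -1 for "never unload".

```python
import json

def generate_payload(model: str, prompt: str,
                     keep_alive: "str | int" = -1) -> str:
    """JSON body for POST /api/generate that also pins the model in memory.

    keep_alive: a duration string like "10m", or -1 to keep the model
    loaded indefinitely (overrides the server-wide default for this model).
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    })

print(generate_payload("llama3.1", "ping", keep_alive="30m"))
```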
Multiple Models Simultaneously
Ollama can load multiple models if you have the VRAM. Each model needs its own memory allocation. Monitor with:
ollama ps # Show running models
nvidia-smi # GPU memory usage (NVIDIA)
Quantization Choices
Most models in the Ollama library default to 4-bit quantization (Q4_0 or Q4_K_M). For better quality at the cost of more memory:
ollama pull llama3.1:8b-instruct-q8_0 # Higher quality
ollama pull llama3.1:8b-instruct-q4_K_M # Good balance
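To see what those tags cost on disk, here is a rough size estimator. The bits-per-weight figures are approximate averages for each GGUF scheme (an assumption on my part; real files vary with the exact tensor mix):

```python
# Approximate bits per weight for common GGUF quantization schemes
# (rough averages, not exact -- metadata and mixed tensors shift them).
QUANT_BITS = {"q4_0": 4.5, "q4_K_M": 4.8, "q8_0": 8.5, "f16": 16.0}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimated on-disk size: params * bits / 8, in GB."""
    return round(params_billion * QUANT_BITS[quant] / 8, 1)

for q in QUANT_BITS:
    print(f"8B @ {q}: ~{file_size_gb(8, q)} GB")
# q4_K_M comes out near 4.8 GB, matching the ~4.7 GB llama3.1:8b download
```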
Security Considerations
- Network: Never expose port 11434 to the internet without auth
- Models: Only download from ollama.com/library or trusted sources
- Resources: Set memory limits in Docker to prevent OOM crashes
- Logs: Ollama logs prompts — be aware if handling sensitive data
- Updates: ollama pull model-name re-downloads the latest version
Monitoring
Check Ollama’s health and resource usage:
# Service status
systemctl status ollama
# Logs
journalctl -u ollama -f
# API health check
curl http://localhost:11434/api/tags
# GPU usage (update every 1s)
watch -n 1 nvidia-smi
For full monitoring, pipe Ollama metrics into your Grafana + Prometheus stack.
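The API health check above can be scripted. A minimal sketch: the JSON shape (a top-level "models" list with "name" entries) matches what GET /api/tags returns, and the helper names are illustrative.

```python
import json
from urllib import request

def list_models(tags_json: dict) -> "list[str]":
    """Names of installed models from a GET /api/tags response."""
    return [m["name"] for m in tags_json.get("models", [])]

def check_health(base_url: str = "http://localhost:11434") -> "list[str]":
    """Return installed model names, or raise if the server is unreachable."""
    with request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return list_models(json.load(resp))

# The payload shape the parser expects:
sample = {"models": [{"name": "llama3.1:latest", "size": 4700000000}]}
print(list_models(sample))  # → ['llama3.1:latest']
```

Run check_health() from cron or a monitoring probe; an exception means the service is down, an empty list means no models are installed.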
Troubleshooting
Model won’t load: Check available RAM/VRAM with free -h and nvidia-smi. Try a smaller model or quantization.
Slow generation: Confirm the GPU is actually being used: ollama ps shows whether a loaded model is running on GPU or CPU, and ollama run llama3.1 --verbose prints token throughput after each response.
Connection refused: Check if Ollama is running (systemctl status ollama) and listening on the right interface (ss -tlnp | grep 11434).
Docker GPU not working: Verify the NVIDIA Container Toolkit works at all: docker run --rm --gpus all ubuntu nvidia-smi
What’s Next?
Once Ollama is running, you can:
- Add Open WebUI for a ChatGPT-like interface
- Connect Continue.dev for AI-powered coding in VS Code
- Set up MCP servers to give your AI access to files, databases, and tools (browse MCP servers)
- Build custom Modelfiles for specialized assistants
- Run embedding models for local RAG (retrieval-augmented generation)
You now have a private, free, unlimited AI running on hardware you control. No subscriptions. No data leaving your network. Welcome to self-hosted AI.