Running large language models locally used to require a PhD and a five-figure GPU budget. Not anymore. Ollama makes running LLMs on your own hardware as simple as ollama run llama3 — no API keys, no cloud costs, no data leaving your network.
In this guide, you’ll set up Ollama on your server, run popular models, expose an OpenAI-compatible API, and integrate it with tools like Open WebUI for a full ChatGPT replacement you own.
Why Run LLMs Locally?
- Privacy: Your prompts never leave your network. No training on your data.
- Cost: Zero per-token fees. Pay once for hardware, run forever.
- Speed: No rate limits. No API outages. Latency = your hardware speed.
- Offline: Works without internet after downloading models.
- Customization: Fine-tune models, create custom system prompts, merge adapters.
Hardware Requirements
Ollama runs on CPU, but GPU acceleration makes it practical for real use:
| Setup | RAM | GPU VRAM | Models You Can Run |
|---|---|---|---|
| Minimum | 8GB | None (CPU) | Phi-3 Mini, TinyLlama, Gemma 2B |
| Recommended | 16GB | 8GB | Llama 3 8B, Mistral 7B, Gemma 7B |
| Power User | 32GB | 16-24GB | Llama 3 70B (quantized), Mixtral 8x7B |
| Homelab Beast | 64GB+ | 48GB+ | Llama 3 70B full, DeepSeek Coder 33B |
Rule of thumb: You need roughly 1GB of RAM/VRAM per billion parameters for Q4 quantized models.
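The rule of thumb above can be turned into a quick back-of-the-envelope calculator. A rough sketch only: the flat overhead term for KV cache and runtime buffers is my assumption, and real usage varies with context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 1.0) -> float:
    """Rough RAM/VRAM estimate for a quantized model.

    Weights take roughly params * bits / 8 bytes; overhead_gb is a loose
    guess for the KV cache and runtime buffers (an assumption, not exact).
    """
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb + overhead_gb, 1)

# Q4 Llama 3 8B lands near the ~1 GB per billion parameters rule:
print(estimate_vram_gb(8))                      # → 5.0
print(estimate_vram_gb(70))                     # → 36.0
print(estimate_vram_gb(8, bits_per_weight=8))   # → 9.0 (Q8 costs more)
```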
Installation
Option 1: Direct Install (Recommended)
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama as a systemd service and auto-detects NVIDIA and AMD GPUs.
Verify the installation:
ollama --version
systemctl status ollama
Option 2: Docker
# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

volumes:
  ollama_data:
docker compose up -d
GPU Setup
NVIDIA: Install the NVIDIA Container Toolkit, then uncomment the GPU section in the compose file.
AMD: Use the rocm tag: ollama/ollama:rocm
CPU-only: Works out of the box, just slower. Expect ~5-10 tokens/second for 7B models on modern CPUs.
Running Your First Model
Pull and run a model:
# Interactive chat
ollama run llama3.1
# Pull without running
ollama pull mistral
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.1
Recommended Models to Start With
| Model | Size | Best For |
|---|---|---|
| llama3.1:8b | 4.7GB | General purpose, great quality/speed balance |
| mistral | 4.1GB | Fast, good at code and reasoning |
| gemma2:9b | 5.4GB | Google’s model, strong at tasks |
| codellama | 3.8GB | Code generation and completion |
| phi3:mini | 2.3GB | Lightweight, runs on anything |
| deepseek-coder-v2 | 8.9GB | Best open-source coding model |
| llama3.1:70b | 40GB | Near-GPT-4 quality (needs beefy hardware) |
The Ollama API
Ollama serves its native API, plus OpenAI-compatible endpoints under /v1, on port 11434:
# Generate a response
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain Docker volumes in one paragraph"
}'
# Chat endpoint (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# List models
curl http://localhost:11434/api/tags
This means any tool that supports the OpenAI API can use your local Ollama. Just point it at http://your-server:11434/v1.
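As a sketch of what that wiring looks like from code, using only the Python standard library: the helper names are mine, and the response shape follows the OpenAI chat-completions format that the /v1 endpoint mimics.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434"  # adjust to your server

def build_chat_request(model: str, user_message: str) -> dict:
    """Payload for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

def chat(model: str, user_message: str) -> str:
    """POST to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, user_message)).encode()
    req = request.Request(
        f"{OLLAMA_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# chat("llama3.1", "Hello!")  # requires a running Ollama server
```

Any OpenAI client library works the same way: set its base URL to your server's /v1 path and pass any non-empty string as the API key.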
Creating Custom Models (Modelfiles)
Ollama’s killer feature: custom models with system prompts baked in.
# Modelfile
FROM llama3.1
SYSTEM """
You are a senior DevOps engineer. You give concise, practical answers
focused on Docker, Kubernetes, and Linux systems. Always include
code examples. Never suggest cloud-managed services when self-hosted
alternatives exist.
"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
Build and run:
ollama create devops-assistant -f Modelfile
ollama run devops-assistant
Now you have a custom model tuned for your use case.
Exposing Ollama on Your Network
By default, Ollama only listens on localhost. To make it available on your network:
Systemd Method
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama
Docker Method
Already handled — the compose file maps port 11434 to all interfaces.
Reverse Proxy with Auth (Recommended)
Don’t expose Ollama directly. Use a reverse proxy with authentication:
# Add to your existing Traefik, Caddy, or Nginx setup.
# Caddy example (Caddyfile):
ollama.yourdomain.com {
    basicauth {
        admin $2a$14$your_hashed_password
    }
    reverse_proxy localhost:11434
}
Connecting to Open WebUI
Open WebUI gives you a ChatGPT-like interface for your local models:
# Add to docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  open-webui_data:
Navigate to http://your-server:3000, create an account, and start chatting with your local models through a polished web interface.
Performance Tuning
Increase Context Window
# Default is 2048 tokens. Increase it from inside an interactive session:
ollama run llama3.1
/set parameter num_ctx 8192
# Or bake it into a Modelfile: PARAMETER num_ctx 8192
Keep Models Loaded
By default, models unload after 5 minutes of inactivity:
# Keep loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Or set in systemd environment
Environment="OLLAMA_KEEP_ALIVE=-1"
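The keep-alive can also be set per request through the native API. A small sketch: the payload fields follow Ollama's /api/generate schema, where keep_alive accepts a duration string like "30m" or -1 for "never unload".

```python
import json

def generate_payload(model: str, prompt: str,
                     keep_alive: "str | int" = -1) -> str:
    """JSON body for POST /api/generate that also pins the model in memory.

    keep_alive: a duration string like "10m", or -1 to keep the model
    loaded indefinitely (overrides the server-wide default for this model).
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    })

print(generate_payload("llama3.1", "ping", keep_alive="30m"))
```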
Multiple Models Simultaneously
Ollama can load multiple models if you have the VRAM. Each model needs its own memory allocation. Monitor with:
ollama ps # Show running models
nvidia-smi # GPU memory usage (NVIDIA)
Quantization Choices
Most models in the Ollama library default to 4-bit quantization (Q4_0 or Q4_K_M). For better quality at the cost of more memory:
ollama pull llama3.1:8b-instruct-q8_0 # Higher quality
ollama pull llama3.1:8b-instruct-q4_K_M # Good balance
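To see what those tags cost on disk, here is a rough size estimator. The bits-per-weight figures are approximate averages for each GGUF scheme (an assumption on my part; real files vary with the exact tensor mix):

```python
# Approximate bits per weight for common GGUF quantization schemes
# (rough averages, not exact -- metadata and mixed tensors shift them).
QUANT_BITS = {"q4_0": 4.5, "q4_K_M": 4.8, "q8_0": 8.5, "f16": 16.0}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Estimated on-disk size: params * bits / 8, in GB."""
    return round(params_billion * QUANT_BITS[quant] / 8, 1)

for q in QUANT_BITS:
    print(f"8B @ {q}: ~{file_size_gb(8, q)} GB")
# q4_K_M comes out near 4.8 GB, matching the ~4.7 GB llama3.1:8b download
```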
Security Considerations
- Network: Never expose port 11434 to the internet without auth
- Models: Only download from ollama.com/library or trusted sources
- Resources: Set memory limits in Docker to prevent OOM crashes
- Logs: Ollama logs prompts — be aware if handling sensitive data
- Updates: ollama pull model-name re-downloads the latest version
Monitoring
Check Ollama’s health and resource usage:
# Service status
systemctl status ollama
# Logs
journalctl -u ollama -f
# API health check
curl http://localhost:11434/api/tags
# GPU usage (update every 1s)
watch -n 1 nvidia-smi
For full monitoring, pipe Ollama metrics into your Grafana + Prometheus stack.
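The API health check above can be scripted. A minimal sketch: the JSON shape (a top-level "models" list with "name" entries) matches what GET /api/tags returns, and the helper names are illustrative.

```python
import json
from urllib import request

def list_models(tags_json: dict) -> "list[str]":
    """Names of installed models from a GET /api/tags response."""
    return [m["name"] for m in tags_json.get("models", [])]

def check_health(base_url: str = "http://localhost:11434") -> "list[str]":
    """Return installed model names, or raise if the server is unreachable."""
    with request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return list_models(json.load(resp))

# The payload shape the parser expects:
sample = {"models": [{"name": "llama3.1:latest", "size": 4700000000}]}
print(list_models(sample))  # → ['llama3.1:latest']
```

Run check_health() from cron or a monitoring probe; an exception means the service is down, an empty list means no models are installed.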
Troubleshooting
Model won’t load: Check available RAM/VRAM with free -h and nvidia-smi. Try a smaller model or quantization.
Slow generation: Confirm the GPU is actually being used: ollama ps shows whether a loaded model is running on GPU or CPU, and ollama run llama3.1 --verbose prints token throughput after each response.
Connection refused: Check if Ollama is running (systemctl status ollama) and listening on the right interface (ss -tlnp | grep 11434).
Docker GPU not working: Verify the NVIDIA Container Toolkit works at all: docker run --rm --gpus all ubuntu nvidia-smi
What’s Next?
Once Ollama is running, you can:
- Add Open WebUI for a ChatGPT-like interface
- Connect Continue.dev for AI-powered coding in VS Code
- Set up MCP servers to give your AI access to files, databases, and tools (browse MCP servers)
- Build custom Modelfiles for specialized assistants
- Run embedding models for local RAG (retrieval-augmented generation)
You now have a private, free, unlimited AI running on hardware you control. No subscriptions. No data leaving your network. Welcome to self-hosted AI.