Running large language models locally used to require a PhD and a five-figure GPU budget. Not anymore. Ollama makes running LLMs on your own hardware as simple as ollama run llama3 — no API keys, no cloud costs, no data leaving your network.

In this guide, you’ll set up Ollama on your server, run popular models, expose an OpenAI-compatible API, and integrate it with tools like Open WebUI for a full ChatGPT replacement you own.

Why Run LLMs Locally?

  • Privacy: Your prompts never leave your network. No training on your data.
  • Cost: Zero per-token fees. Pay once for hardware, run forever.
  • Speed: No rate limits. No API outages. Latency = your hardware speed.
  • Offline: Works without internet after downloading models.
  • Customization: Fine-tune models, create custom system prompts, merge adapters.

Hardware Requirements

Ollama runs on CPU, but GPU acceleration makes it practical for real use:

Setup           RAM    GPU VRAM    Models You Can Run
Minimum         8GB    None (CPU)  Phi-3 Mini, TinyLlama, Gemma 2B
Recommended     16GB   8GB         Llama 3 8B, Mistral 7B, Gemma 7B
Power User      32GB   16-24GB     Llama 3 70B (quantized), Mixtral 8x7B
Homelab Beast   64GB+  48GB+       Llama 3 70B full, DeepSeek Coder 33B

Rule of thumb: You need roughly 1GB of RAM/VRAM per billion parameters for Q4 quantized models.
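That rule of thumb can be sketched as a quick calculator. This is the article's conservative estimate (it already includes headroom for context and runtime buffers), not an official Ollama formula, and scaling linearly with bit width is an assumption of this sketch:

```python
def estimated_memory_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Rough RAM/VRAM needed to run a quantized model.

    params_billions: model size in billions of parameters
    bits_per_weight: 4 for Q4 quantization, 8 for Q8, 16 for fp16
    Based on the rule of thumb ~1GB per billion parameters at Q4.
    """
    return params_billions * bits_per_weight / 4

# A Q4-quantized 8B model wants roughly 8GB of RAM/VRAM:
print(estimated_memory_gb(8))                      # 8.0
# The same model at Q8 roughly doubles that:
print(estimated_memory_gb(8, bits_per_weight=8))   # 16.0
```

Real download sizes are smaller (weights alone), but this tells you whether a model will fit comfortably at runtime.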

Installation

Option 1: Install Script (Linux)

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama as a systemd service. It auto-detects NVIDIA, AMD, and Intel GPUs.

Verify the installation:

ollama --version
systemctl status ollama

Option 2: Docker

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

volumes:
  ollama_data:

Bring it up:

docker compose up -d

GPU Setup

NVIDIA: Install the NVIDIA Container Toolkit, then uncomment the GPU section in the compose file.

AMD: Use the rocm tag: ollama/ollama:rocm

CPU-only: Works out of the box, just slower. Expect ~5-10 tokens/second for 7B models on modern CPUs.

Running Your First Model

Pull and run a model:

# Interactive chat
ollama run llama3.1

# Pull without running
ollama pull mistral

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.1

Popular models to try:

Model              Size   Best For
llama3.1:8b        4.7GB  General purpose, great quality/speed balance
mistral            4.1GB  Fast, good at code and reasoning
gemma2:9b          5.4GB  Google's model, strong at general tasks
codellama          3.8GB  Code generation and completion
phi3:mini          2.3GB  Lightweight, runs on anything
deepseek-coder-v2  8.9GB  Strong open-source coding model
llama3.1:70b       40GB   Near-GPT-4 quality (needs beefy hardware)

The Ollama API

Ollama serves its native API on port 11434, and also exposes OpenAI-compatible endpoints under /v1:

# Generate a response ("stream": false returns one JSON object
# instead of a stream of chunks)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain Docker volumes in one paragraph",
  "stream": false
}'

# Chat endpoint (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

# List models
curl http://localhost:11434/api/tags

This means any tool that supports the OpenAI API can use your local Ollama. Just point it at http://your-server:11434/v1.
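You don't need a client library at all; any HTTP client works. A minimal sketch using only the Python standard library, assuming the default port and that llama3.1 is pulled (swap in your own host and model):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # adjust to your server

def chat_payload(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # one JSON response instead of a stream
    }

def chat(model: str, user_message: str) -> str:
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/v1/chat/completions",
        data=json.dumps(chat_payload(model, user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Response follows the OpenAI shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("llama3.1", "Explain Docker volumes in one paragraph"))
```

The same script works against the OpenAI cloud API by changing the URL, which is exactly why so many tools integrate with Ollama for free.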

Creating Custom Models (Modelfiles)

Ollama’s killer feature: custom models with system prompts baked in.

# Modelfile
FROM llama3.1

SYSTEM """
You are a senior DevOps engineer. You give concise, practical answers
focused on Docker, Kubernetes, and Linux systems. Always include
code examples. Never suggest cloud-managed services when self-hosted
alternatives exist.
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Build and run:

ollama create devops-assistant -f Modelfile
ollama run devops-assistant

Now you have a custom model tuned for your use case.
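While you're still iterating on a prompt, you can get the same effect without rebuilding the model each time: Ollama's native /api/chat endpoint accepts a system message and an options object that mirror the Modelfile's SYSTEM and PARAMETER lines. A stdlib-only sketch, assuming the default port and llama3.1:

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a senior DevOps engineer. You give concise, practical answers "
    "focused on Docker, Kubernetes, and Linux systems."
)

def chat_with_system(model: str, user_message: str) -> dict:
    """Per-request equivalent of the Modelfile's SYSTEM and PARAMETER lines."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "stream": False,
        # "options" overrides model parameters for this request only
        "options": {"temperature": 0.7, "num_ctx": 4096},
    }

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(chat_with_system("llama3.1", "Why use Docker volumes?")).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.load(urllib.request.urlopen(req))["message"]["content"])
```

Once the prompt settles, bake it into a Modelfile so every client gets it automatically.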

Exposing Ollama on Your Network

By default, Ollama only listens on localhost. To make it available on your network:

Systemd Method

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Then restart:

sudo systemctl restart ollama

Docker Method

Already handled — the compose file maps port 11434 to all interfaces.

Don’t expose Ollama directly. Use a reverse proxy with authentication:

# Add to your existing Traefik, Caddy, or Nginx setup
# Caddy example:
# ollama.yourdomain.com {
#     basicauth {
#         admin $2a$14$your_hashed_password
#     }
#     reverse_proxy localhost:11434
# }

Connecting to Open WebUI

Open WebUI gives you a ChatGPT-like interface for your local models:

# Add to docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  open-webui_data:

Navigate to http://your-server:3000, create an account, and start chatting with your local models through a polished web interface.

Performance Tuning

Increase Context Window

# The default context is small (2048 tokens on many models).
# Raise it for the current session from inside the chat:
ollama run llama3.1
>>> /set parameter num_ctx 8192

# Or make it permanent with a Modelfile line:
# PARAMETER num_ctx 8192

Keep Models Loaded

By default, models unload after 5 minutes of inactivity:

# Keep loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Or set in systemd environment
Environment="OLLAMA_KEEP_ALIVE=-1"

Multiple Models Simultaneously

Ollama can load multiple models if you have the VRAM. Each model needs its own memory allocation. Monitor with:

ollama ps       # Show running models
nvidia-smi      # GPU memory usage (NVIDIA)

Quantization Choices

Most default tags in the Ollama library are 4-bit (Q4) quantized. For better quality at the cost of more memory:

ollama pull llama3.1:8b-instruct-q8_0    # Higher quality
ollama pull llama3.1:8b-instruct-q4_K_M  # Good balance

Security Considerations

  • Network: Never expose port 11434 to the internet without auth
  • Models: Only download from ollama.com/library or trusted sources
  • Resources: Set memory limits in Docker to prevent OOM crashes
  • Logs: Ollama logs prompts — be aware if handling sensitive data
  • Updates: ollama pull model-name re-downloads latest version

Monitoring

Check Ollama’s health and resource usage:

# Service status
systemctl status ollama

# Logs
journalctl -u ollama -f

# API health check
curl http://localhost:11434/api/tags

# GPU usage (update every 1s)
watch -n 1 nvidia-smi

For full monitoring, pipe Ollama metrics into your Grafana + Prometheus stack.
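The /api/tags health check above is easy to script for cron or an uptime monitor. A minimal stdlib sketch, assuming the default port (the parsing helper is split out so it can be tested without a running server):

```python
import json
import urllib.request

def model_names(tags_json: str) -> list:
    """Extract installed model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

def list_models(base_url: str = "http://localhost:11434") -> list:
    """Hit /api/tags; raises on connection failure, so an empty or
    successful result means the service is up."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(resp.read().decode())

if __name__ == "__main__":
    try:
        print("Ollama up, models:", list_models())
    except OSError as e:
        print("Ollama unreachable:", e)
```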

Troubleshooting

Model won’t load: Check available RAM/VRAM with free -h and nvidia-smi. Try a smaller model or quantization.

Slow generation: Check that the GPU is actually in use: ollama ps shows whether a loaded model is running on GPU or CPU, and ollama run llama3.1 --verbose prints token/s timings after each response.

Connection refused: Check if Ollama is running (systemctl status ollama) and listening on the right interface (ss -tlnp | grep 11434).

Docker GPU not working: Verify the NVIDIA Container Toolkit works at all: docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

What’s Next?

Once Ollama is running, you can:

  • Add Open WebUI for a ChatGPT-like interface
  • Connect Continue.dev for AI-powered coding in VS Code
  • Set up MCP servers to give your AI access to files, databases, and tools (browse MCP servers)
  • Build custom Modelfiles for specialized assistants
  • Run embedding models for local RAG (retrieval-augmented generation)

You now have a private, free, unlimited AI running on hardware you control. No subscriptions. No data leaving your network. Welcome to self-hosted AI.