You’re already using OpenAI’s API for chat, embeddings, or image generation. But every request costs money, sends your data to a third party, and depends on their uptime. What if you could run the same API — same endpoints, same format — on your own hardware?
That’s exactly what LocalAI does.
What Is LocalAI?
LocalAI is a drop-in replacement for the OpenAI API. It runs entirely on your server and supports:
- Text generation (LLaMA, Mistral, Phi, and hundreds more)
- Image generation (Stable Diffusion)
- Speech-to-text (Whisper)
- Text-to-speech
- Embeddings (for RAG and vector search)
- Function calling (tool use, just like OpenAI)
- Vision models (multimodal)
Any app that talks to the OpenAI API can talk to LocalAI by changing one environment variable: the base URL.
LocalAI vs Ollama
Both run local LLMs, but they serve different purposes:
| Feature | LocalAI | Ollama |
|---|---|---|
| OpenAI API compatible | ✅ Full | Partial |
| Image generation | ✅ Stable Diffusion | ❌ |
| Speech-to-text | ✅ Whisper | ❌ |
| Text-to-speech | ✅ | ❌ |
| Embeddings | ✅ | ✅ |
| Function calling | ✅ | ✅ |
| GPU support | ✅ CUDA/ROCm/Metal | ✅ |
| Ease of setup | Medium | Easy |
| Resource usage | Higher | Lower |
Use Ollama if you just want to chat with local models. Use LocalAI if you need a full OpenAI API replacement for apps and automation.
Already running Ollama? Check our Ollama guide for getting started with local LLMs.
Prerequisites
- Linux server with 8GB+ RAM (16GB recommended)
- Docker and Docker Compose
- Optional: NVIDIA GPU with CUDA drivers (dramatically improves performance)
- 10-50GB free disk space (for models)
Hardware Recommendations
| Setup | RAM | Models You Can Run |
|---|---|---|
| Minimum | 8GB | Small models (Phi-2, TinyLlama) |
| Recommended | 16GB | 7B models (Mistral 7B, LLaMA 3 8B) |
| Power user | 32GB+ | 13B-30B models, multiple concurrent |
| With GPU | 8GB+ VRAM | Fast inference, larger models |
Step 1: Deploy LocalAI with Docker
Create a project directory:
```bash
mkdir -p ~/localai && cd ~/localai
```
CPU-Only Setup
```yaml
# docker-compose.yml
version: '3.8'
services:
  localai:
    image: localai/localai:latest-cpu
    container_name: localai
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4
      - CONTEXT_SIZE=2048
      - DEBUG=false
```
NVIDIA GPU Setup
```yaml
# docker-compose.yml
version: '3.8'
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: localai
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4
      - CONTEXT_SIZE=4096
      - DEBUG=false
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Deploy:
```bash
docker compose up -d
```
LocalAI starts on port 8080. It ships with no models by default — you’ll add them next.
Step 2: Install Models
From the Gallery (Easiest)
LocalAI has a built-in model gallery. Install models with one command:
```bash
# Install Mistral 7B (great all-rounder)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "mistral-7b-instruct"
}'

# Install LLaMA 3 8B
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "llama3-8b-instruct"
}'

# Install Whisper (speech-to-text)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "whisper-1"
}'

# Install Stable Diffusion (image generation)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "stablediffusion"
}'
```
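Gallery installs run in the background: the `/models/apply` call returns a job identifier rather than blocking until the download finishes. A minimal polling sketch — the `uuid` field, the `/models/jobs/<uuid>` path, and the `processed` flag are assumptions about the gallery API, so verify them against your LocalAI version:

```python
# Assumed LocalAI address, matching the compose file above
BASE = "http://localhost:8080"

def job_url(apply_response: dict) -> str:
    """Build the status URL for a /models/apply job from its response body."""
    return f"{BASE}/models/jobs/{apply_response['uuid']}"

def is_done(job_status: dict) -> bool:
    """True once the install job reports completion."""
    return bool(job_status.get("processed"))

# Against a live server you would GET job_url(...) in a loop until is_done(...).
# Canned response shaped like the gallery API:
print(job_url({"uuid": "abc123", "status": "queued"}))  # http://localhost:8080/models/jobs/abc123
```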
Check Installed Models
```bash
curl http://localhost:8080/v1/models | jq
```
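The response follows OpenAI's list format, so you can pull out just the model names with a few lines of Python. A sketch against a sample payload (the entries here are illustrative, not live output):

```python
def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

# Sample response in the shape OpenAI-compatible servers return
sample = {
    "object": "list",
    "data": [
        {"id": "mistral-7b-instruct", "object": "model"},
        {"id": "whisper-1", "object": "model"},
    ],
}
print(model_ids(sample))  # ['mistral-7b-instruct', 'whisper-1']
```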
Manual Model Installation
Download any GGUF model from Hugging Face and drop it in the ./models directory:
```bash
cd ~/localai/models

# Download a model manually
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```
Create a config file `./models/mistral.yaml`:

```yaml
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
template:
  chat_message: |
    [INST] {{.Input}} [/INST]
```
Restart LocalAI to load the new model.
Step 3: Use the OpenAI-Compatible API
LocalAI speaks the same language as OpenAI. Here are the endpoints:
Chat Completions
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Explain Docker in 3 sentences"}
    ]
  }'
```
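Because the response mirrors OpenAI's chat completion schema, the reply text always lives at `choices[0].message.content`. A small sketch of pulling it out of a decoded response body (the sample is illustrative):

```python
def reply_text(completion: dict) -> str:
    """Extract the assistant's message from an OpenAI-style chat completion."""
    return completion["choices"][0]["message"]["content"]

# Trimmed-down sample in the chat completion shape
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Docker packages apps into containers."}}
    ]
}
print(reply_text(sample))  # Docker packages apps into containers.
```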
Embeddings
```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "Self-hosting is the practice of running your own services"
  }'
```
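Embeddings become useful when you compare them: for RAG, you rank documents by cosine similarity between the query vector and each document vector. A dependency-free sketch (the vectors here are toy values, not real model output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.1, 0.9, 0.2]
doc_a = [0.1, 0.8, 0.3]   # similar direction -> high score
doc_b = [-0.9, 0.1, 0.0]  # different direction -> low score
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```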
Image Generation
```bash
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stablediffusion",
    "prompt": "A cozy home server room with blinking LEDs",
    "size": "512x512"
  }'
```
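Depending on configuration, the `data` array in the response carries either an image URL or base64-encoded image bytes. A sketch that decodes a `b64_json` payload to disk — this assumes the OpenAI images schema with `response_format` set to `b64_json`, and the sample payload is a stand-in, not a real PNG:

```python
import base64
import os
import tempfile

def save_image(images_response: dict, path: str) -> int:
    """Decode the first b64_json image in an OpenAI-style response; returns bytes written."""
    raw = base64.b64decode(images_response["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Tiny illustrative payload; a real response carries a full PNG here
sample = {"data": [{"b64_json": base64.b64encode(b"\x89PNG fake bytes").decode()}]}
out = os.path.join(tempfile.gettempdir(), "localai-thumb.png")
print(save_image(sample, out))  # number of bytes written
```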
Speech-to-Text
```bash
curl http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"
```
Step 4: Connect Your Apps
The magic of LocalAI is that any app using the OpenAI SDK works with zero code changes. Just point it to your server.
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require a key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
Node.js
```javascript
// ES module syntax; top-level await requires ESM ("type": "module" in package.json)
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'mistral-7b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);
```
Environment Variable (Universal)
Most OpenAI-compatible apps respect `OPENAI_BASE_URL`:

```bash
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
```
Now tools like LangChain, AutoGPT, Continue.dev, and hundreds of others just work.
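Your own scripts can honor the same convention: read `OPENAI_BASE_URL` from the environment and fall back to the official endpoint, so a single export flips a tool between LocalAI and OpenAI. A minimal sketch:

```python
import os

def resolve_base_url(default: str = "https://api.openai.com/v1") -> str:
    """Prefer OPENAI_BASE_URL when set, matching the convention most tools follow."""
    return os.environ.get("OPENAI_BASE_URL", default)

# With the export from above in place, requests go to your LocalAI server
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
print(resolve_base_url())  # http://localhost:8080/v1
```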
Step 5: Performance Tuning
CPU Optimization
```yaml
environment:
  - THREADS=8          # Match your CPU core count
  - CONTEXT_SIZE=2048  # Lower = faster, less memory
```
GPU Optimization
For NVIDIA GPUs, ensure you’re using the CUDA image and set:
```yaml
environment:
  - GPU_LAYERS=99  # Offload all layers to GPU
```
Quantization
Use smaller quantized models for better performance:
- Q4_K_M — Best balance of quality and speed (recommended)
- Q5_K_M — Slightly better quality, slower
- Q8_0 — Near-original quality, much more RAM
- Q2_K — Fastest, lowest quality
Memory Usage Guide
| Model Size | Q4_K_M RAM | Q8_0 RAM |
|---|---|---|
| 3B | ~2GB | ~3.5GB |
| 7B | ~4.5GB | ~8GB |
| 13B | ~8GB | ~14GB |
| 30B | ~18GB | ~32GB |
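The table follows a simple rule of thumb you can apply to any model: the weights take roughly (parameter count × bits per weight ÷ 8) bytes, plus overhead for the KV cache and runtime. A rough estimator — the bit widths and the ~10% overhead factor are approximations, not exact llama.cpp numbers:

```python
# Approximate effective bits per weight for common GGUF quantizations (assumption)
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def estimated_ram_gb(params_billions: float, quant: str, overhead: float = 1.1) -> float:
    """Rough RAM to load a model: weights at bits/8 bytes each, plus runtime overhead."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return round(params_billions * bytes_per_weight * overhead, 1)

print(estimated_ram_gb(7, "Q4_K_M"))  # ~4.6, close to the table's ~4.5GB
print(estimated_ram_gb(13, "Q8_0"))
```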
Troubleshooting
Model loading fails
```bash
docker logs localai
```
Common fixes:
- Check disk space: `df -h`
- Verify the model file isn't corrupted: check that its size matches the source
- Ensure there is enough RAM for the model
Slow responses on CPU
- Use Q4_K_M quantization (not Q8 or full precision)
- Reduce context size to 1024-2048
- Use smaller models (7B instead of 13B)
- Set `THREADS` to your physical core count (not hyperthreads)
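Note that `os.cpu_count()` reports logical CPUs, which double-counts hyperthreads; on Linux you can recover the physical count from `/proc/cpuinfo` by de-duplicating (physical id, core id) pairs. A sketch, shown against a canned cpuinfo excerpt so it runs anywhere:

```python
def physical_cores(cpuinfo: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo text."""
    cores = set()
    physical_id = None
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)

# Two hyperthreads on the same core count as one physical core
sample = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 0
"""
print(physical_cores(sample))  # 1
# On a real host: physical_cores(open("/proc/cpuinfo").read())
```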
GPU not detected
```bash
# Check NVIDIA drivers
nvidia-smi

# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```
If that fails, install the NVIDIA Container Toolkit.
Out of memory (OOM)
- Switch to a smaller quantization (Q4_K_M → Q3_K_M → Q2_K)
- Use a smaller model
- Reduce `CONTEXT_SIZE`
- Add swap space as emergency overflow
What to Run on LocalAI
Some practical uses for your self-hosted AI API:
- Personal assistant — Connect to Open WebUI for a ChatGPT-like interface
- Document search — Use embeddings + a vector DB for RAG
- Code completion — Point Continue.dev at your LocalAI instance
- Home automation — Voice commands via Whisper + Home Assistant
- Content generation — Blog drafts, summaries, translations
- Image generation — Stable Diffusion for thumbnails and graphics
Related Guides
- Running Ollama: Local LLMs — Simpler alternative for chat-only use
- Open WebUI: ChatGPT Interface — Web UI for your local models
- Self-Hosted n8n — Automate workflows with your local AI
LocalAI puts the full OpenAI API stack on your own hardware. No API keys, no per-token costs, no data leaving your network. Once it’s running, every tool in the OpenAI ecosystem becomes a local tool.