You’re already using OpenAI’s API for chat, embeddings, or image generation. But every request costs money, sends your data to a third party, and depends on their uptime. What if you could run the same API — same endpoints, same format — on your own hardware?

That’s exactly what LocalAI does.


What Is LocalAI?

LocalAI is a drop-in replacement for the OpenAI API. It runs entirely on your server and supports:

  • Text generation (LLaMA, Mistral, Phi, and hundreds more)
  • Image generation (Stable Diffusion)
  • Speech-to-text (Whisper)
  • Text-to-speech
  • Embeddings (for RAG and vector search)
  • Function calling (tool use, just like OpenAI)
  • Vision models (multimodal)

Any app that talks to the OpenAI API can talk to LocalAI by changing one environment variable: the base URL.

LocalAI vs Ollama

Both run local LLMs, but they serve different purposes:

| Feature | LocalAI | Ollama |
|---|---|---|
| OpenAI API compatible | ✅ Full | Partial |
| Image generation | ✅ Stable Diffusion | ❌ |
| Speech-to-text | ✅ Whisper | ❌ |
| Text-to-speech | ✅ | ❌ |
| Embeddings | ✅ | ✅ |
| Function calling | ✅ | Partial |
| GPU support | ✅ CUDA/ROCm/Metal | ✅ |
| Ease of setup | Medium | Easy |
| Resource usage | Higher | Lower |

Use Ollama if you just want to chat with local models. Use LocalAI if you need a full OpenAI API replacement for apps and automation.

Already running Ollama? Check our Ollama guide for getting started with local LLMs.


Prerequisites

  • Linux server with 8GB+ RAM (16GB recommended)
  • Docker and Docker Compose
  • Optional: NVIDIA GPU with CUDA drivers (dramatically improves performance)
  • 10-50GB free disk space (for models)

Hardware Recommendations

| Setup | RAM | Models You Can Run |
|---|---|---|
| Minimum | 8GB | Small models (Phi-2, TinyLlama) |
| Recommended | 16GB | 7B models (Mistral 7B, LLaMA 3 8B) |
| Power user | 32GB+ | 13B-30B models, multiple concurrent |
| With GPU | 8GB+ VRAM | Fast inference, larger models |

Step 1: Deploy LocalAI with Docker

Create a project directory:

mkdir -p ~/localai && cd ~/localai

CPU-Only Setup

# docker-compose.yml
version: '3.8'
services:
  localai:
    image: localai/localai:latest-cpu
    container_name: localai
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4
      - CONTEXT_SIZE=2048
      - DEBUG=false

NVIDIA GPU Setup

version: '3.8'
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: localai
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4
      - CONTEXT_SIZE=4096
      - DEBUG=false
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Deploy:

docker compose up -d

LocalAI starts on port 8080. It ships with no models by default — you’ll add them next.
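Before moving on, it's worth confirming the container actually answers. A minimal sketch in Python, using only the standard library and the `/v1/models` endpoint (the same one used later to list installed models) to poll until the API responds:

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_localai(base_url="http://localhost:8080", timeout=60.0):
    """Poll the /v1/models endpoint until LocalAI answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                # A valid JSON body means the API is up (the model list may be empty).
                json.load(resp)
                return True
        except (urllib.error.URLError, json.JSONDecodeError):
            time.sleep(2)
    return False
```

Call `wait_for_localai()` right after `docker compose up -d`; the first start can take a minute while the container initializes.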


Step 2: Install Models

LocalAI has a built-in model gallery. Install models with one command:

# Install Mistral 7B (great all-rounder)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "mistral-7b-instruct"
}'

# Install LLaMA 3 8B
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "llama3-8b-instruct"
}'

# Install Whisper (speech-to-text)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "whisper-1"
}'

# Install Stable Diffusion (image generation)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "stablediffusion"
}'
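The same gallery endpoint works from code. A hedged sketch, assuming `/models/apply` returns a JSON body with a job identifier under a `uuid` key (the exact response shape may vary between LocalAI versions):

```python
import json
import urllib.error
import urllib.request

def apply_model(model_id, base_url="http://localhost:8080"):
    """Ask the LocalAI model gallery to install a model.

    Returns the install job's id on success, or None if the request fails.
    """
    payload = json.dumps({"id": model_id}).encode()
    req = urllib.request.Request(
        f"{base_url}/models/apply",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp).get("uuid")
    except (urllib.error.URLError, json.JSONDecodeError):
        return None
```

This is handy for provisioning scripts that install a whole list of models in one pass.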

Check Installed Models

curl http://localhost:8080/v1/models | jq

Manual Model Installation

Download any GGUF model from Hugging Face and drop it in the ./models directory:

cd ~/localai/models

# Download a model manually
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

Create a config file ./models/mistral.yaml:

name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
template:
  chat_message: |
    [INST] {{.Input}} [/INST]

Restart LocalAI to load the new model.


Step 3: Use the OpenAI-Compatible API

LocalAI speaks the same language as OpenAI. Here are the endpoints:

Chat Completions

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Explain Docker in 3 sentences"}
    ]
  }'

Embeddings

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "Self-hosting is the practice of running your own services"
  }'
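The vectors that endpoint returns are what powers RAG: embed your documents once, embed each query, and rank documents by cosine similarity. A pure-Python sketch of the ranking step (fetching the vectors themselves works exactly like the curl call above):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

For more than a few hundred documents, a vector database handles this ranking far more efficiently, but the math is the same.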

Image Generation

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stablediffusion",
    "prompt": "A cozy home server room with blinking LEDs",
    "size": "512x512"
  }'

Speech-to-Text

curl http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

Step 4: Connect Your Apps

The magic of LocalAI is that any app using the OpenAI SDK works with zero code changes. Just point it to your server.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require a key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Node.js

const OpenAI = require('openai');

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'mistral-7b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

Environment Variable (Universal)

Most OpenAI-compatible apps respect OPENAI_BASE_URL:

export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed

Now tools like LangChain, AutoGPT, Continue.dev, and hundreds of others just work.
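Under the hood, most of these tools resolve their endpoint the same way: read the environment variable, fall back to the hosted API. A small sketch of that convention (the exact variable names and defaults differ from tool to tool):

```python
import os

def resolve_openai_base_url(default="https://api.openai.com/v1"):
    """Mimic how many OpenAI-compatible tools pick their endpoint:
    use OPENAI_BASE_URL when set, otherwise fall back to the hosted API."""
    return os.environ.get("OPENAI_BASE_URL", default).rstrip("/")
```

This is why the export above is enough: no code in the app changes, only where its requests go.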


Step 5: Performance Tuning

CPU Optimization

environment:
  - THREADS=8          # Match your CPU core count
  - CONTEXT_SIZE=2048  # Lower = faster, less memory
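Note that "core count" here means physical cores: as the troubleshooting section below notes, counting hyperthreads usually hurts inference throughput. A rough heuristic for picking the value, assuming a typical 2-way SMT (hyperthreaded) x86 machine:

```python
import os

def suggested_threads():
    """Heuristic for the THREADS setting: physical cores, not hyperthreads.

    os.cpu_count() reports logical CPUs; on typical 2-way SMT machines,
    halving it approximates the physical core count.
    """
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```

On machines without SMT (many ARM servers, for example), use `os.cpu_count()` directly.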

GPU Optimization

For NVIDIA GPUs, ensure you’re using the CUDA image and set:

environment:
  - GPU_LAYERS=99  # Offload all layers to GPU

Quantization

Use smaller quantized models for better performance:

  • Q4_K_M — Best balance of quality and speed (recommended)
  • Q5_K_M — Slightly better quality, slower
  • Q8_0 — Near-original quality, much more RAM
  • Q2_K — Fastest, lowest quality

Memory Usage Guide

| Model Size | Q4_K_M RAM | Q8_0 RAM |
|---|---|---|
| 3B | ~2GB | ~3.5GB |
| 7B | ~4.5GB | ~8GB |
| 13B | ~8GB | ~14GB |
| 30B | ~18GB | ~32GB |
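The rule of thumb behind these numbers is simply parameters × bytes-per-parameter. A sketch using assumed average byte counts per quantization level (these are rough figures including runtime overhead; actual usage also grows with context size):

```python
# Rough bytes-per-parameter for common GGUF quantizations (assumed averages,
# including runtime overhead; real usage also depends on context size).
BYTES_PER_PARAM = {
    "Q2_K": 0.40,
    "Q4_K_M": 0.64,
    "Q5_K_M": 0.78,
    "Q8_0": 1.10,
}

def estimate_ram_gb(params_billions, quant="Q4_K_M"):
    """Ballpark resident RAM in GB for a quantized model."""
    return params_billions * BYTES_PER_PARAM[quant]
```

For example, `estimate_ram_gb(7)` lands near the ~4.5GB the table gives for a 7B Q4_K_M model.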

Troubleshooting

Model loading fails

docker logs localai

Common fixes:

  • Check disk space: df -h
  • Verify model file isn’t corrupted: check file size matches source
  • Ensure enough RAM for the model

Slow responses on CPU

  • Use Q4_K_M quantization (not Q8 or full precision)
  • Reduce context size to 1024-2048
  • Use smaller models (7B instead of 13B)
  • Set THREADS to your physical core count (not hyperthreads)

GPU not detected

# Check NVIDIA drivers
nvidia-smi

# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

If that fails, install the NVIDIA Container Toolkit.

Out of memory (OOM)

  • Switch to a smaller quantization (Q4_K_M → Q3_K_M → Q2_K)
  • Use a smaller model
  • Reduce CONTEXT_SIZE
  • Add swap space as emergency overflow

What to Run on LocalAI

Some practical uses for your self-hosted AI API:

  • Personal assistant — Connect to Open WebUI for a ChatGPT-like interface
  • Document search — Use embeddings + a vector DB for RAG
  • Code completion — Point Continue.dev at your LocalAI instance
  • Home automation — Voice commands via Whisper + Home Assistant
  • Content generation — Blog drafts, summaries, translations
  • Image generation — Stable Diffusion for thumbnails and graphics


LocalAI puts the full OpenAI API stack on your own hardware. No API keys, no per-token costs, no data leaving your network. Once it’s running, every tool in the OpenAI ecosystem becomes a local tool.