You’re already using OpenAI’s API for chat, embeddings, or image generation. But every request costs money, sends your data to a third party, and depends on their uptime. What if you could run the same API — same endpoints, same format — on your own hardware?
That’s exactly what LocalAI does.
What Is LocalAI?
LocalAI is a drop-in replacement for the OpenAI API. It runs entirely on your server and supports:
- Text generation (LLaMA, Mistral, Phi, and hundreds more)
- Image generation (Stable Diffusion)
- Speech-to-text (Whisper)
- Text-to-speech
- Embeddings (for RAG and vector search)
- Function calling (tool use, just like OpenAI)
- Vision models (multimodal)
Any app that talks to the OpenAI API can talk to LocalAI by changing one environment variable: the base URL.
LocalAI vs Ollama
Both run local LLMs, but they serve different purposes:
| Feature | LocalAI | Ollama |
|---|---|---|
| OpenAI API compatible | ✅ Full | Partial |
| Image generation | ✅ Stable Diffusion | ❌ |
| Speech-to-text | ✅ Whisper | ❌ |
| Text-to-speech | ✅ | ❌ |
| Embeddings | ✅ | ✅ |
| Function calling | ✅ | ✅ |
| GPU support | ✅ CUDA/ROCm/Metal | ✅ |
| Ease of setup | Medium | Easy |
| Resource usage | Higher | Lower |
Use Ollama if you just want to chat with local models. Use LocalAI if you need a full OpenAI API replacement for apps and automation.
Already running Ollama? Check our Ollama guide for getting started with local LLMs.
Prerequisites
- Linux server with 8GB+ RAM (16GB recommended)
- Docker and Docker Compose
- Optional: NVIDIA GPU with CUDA drivers (dramatically improves performance)
- 10-50GB free disk space (for models)
Hardware Recommendations
| Setup | RAM | Models You Can Run |
|---|---|---|
| Minimum | 8GB | Small models (Phi-2, TinyLlama) |
| Recommended | 16GB | 7B models (Mistral 7B, LLaMA 3 8B) |
| Power user | 32GB+ | 13B-30B models, multiple concurrent |
| With GPU | 8GB+ VRAM | Fast inference, larger models |
Step 1: Deploy LocalAI with Docker
Create a project directory:
```bash
mkdir -p ~/localai && cd ~/localai
```
CPU-Only Setup
```yaml
# docker-compose.yml
version: '3.8'
services:
  localai:
    image: localai/localai:latest-cpu
    container_name: localai
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4
      - CONTEXT_SIZE=2048
      - DEBUG=false
```
NVIDIA GPU Setup
```yaml
# docker-compose.yml
version: '3.8'
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: localai
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
    environment:
      - THREADS=4
      - CONTEXT_SIZE=4096
      - DEBUG=false
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
Deploy:
```bash
docker compose up -d
```
LocalAI starts on port 8080. It ships with no models by default — you’ll add them next.
Step 2: Install Models
From the Gallery (Easiest)
LocalAI has a built-in model gallery. Install models with one command:
```bash
# Install Mistral 7B (great all-rounder)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "mistral-7b-instruct"
}'

# Install LLaMA 3 8B
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "llama3-8b-instruct"
}'

# Install Whisper (speech-to-text)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "whisper-1"
}'

# Install Stable Diffusion (image generation)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
  "id": "stablediffusion"
}'
```
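Gallery installs run in the background: the `/models/apply` call returns a job identifier rather than blocking until the download finishes. A minimal polling sketch — the `uuid` field, the `/models/jobs/<uuid>` path, and the `processed` flag are assumptions about the gallery API, so verify them against your LocalAI version:

```python
# Assumed LocalAI address, matching the compose file above
BASE = "http://localhost:8080"

def job_url(apply_response: dict) -> str:
    """Build the status URL for a /models/apply job from its response body."""
    return f"{BASE}/models/jobs/{apply_response['uuid']}"

def is_done(job_status: dict) -> bool:
    """True once the install job reports completion."""
    return bool(job_status.get("processed"))

# Against a live server you would GET job_url(...) in a loop until is_done(...).
# Canned response shaped like the gallery API:
print(job_url({"uuid": "abc123", "status": "queued"}))  # http://localhost:8080/models/jobs/abc123
```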
Check Installed Models
```bash
curl http://localhost:8080/v1/models | jq
```
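The response follows OpenAI's list format, so you can pull out just the model names with a few lines of Python. A sketch against a sample payload (the entries here are illustrative, not live output):

```python
def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

# Sample response in the shape OpenAI-compatible servers return
sample = {
    "object": "list",
    "data": [
        {"id": "mistral-7b-instruct", "object": "model"},
        {"id": "whisper-1", "object": "model"},
    ],
}
print(model_ids(sample))  # ['mistral-7b-instruct', 'whisper-1']
```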
Manual Model Installation
Download any GGUF model from Hugging Face and drop it in the ./models directory:
```bash
cd ~/localai/models

# Download a model manually
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```
Create a config file `./models/mistral.yaml`:

```yaml
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
template:
  chat_message: |
    [INST] {{.Input}} [/INST]
```
Restart LocalAI to load the new model.
Step 3: Use the OpenAI-Compatible API
LocalAI speaks the same language as OpenAI. Here are the endpoints:
Chat Completions
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Explain Docker in 3 sentences"}
    ]
  }'
```
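Because the response mirrors OpenAI's chat completion schema, the reply text always lives at `choices[0].message.content`. A small sketch of pulling it out of a decoded response body (the sample is illustrative):

```python
def reply_text(completion: dict) -> str:
    """Extract the assistant's message from an OpenAI-style chat completion."""
    return completion["choices"][0]["message"]["content"]

# Trimmed-down sample in the chat completion shape
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Docker packages apps into containers."}}
    ]
}
print(reply_text(sample))  # Docker packages apps into containers.
```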
Embeddings
```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "Self-hosting is the practice of running your own services"
  }'
```
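Embeddings become useful when you compare them: for RAG, you rank documents by cosine similarity between the query vector and each document vector. A dependency-free sketch (the vectors here are toy values, not real model output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.1, 0.9, 0.2]
doc_a = [0.1, 0.8, 0.3]   # similar direction -> high score
doc_b = [-0.9, 0.1, 0.0]  # different direction -> low score
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```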
Image Generation
```bash
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stablediffusion",
    "prompt": "A cozy home server room with blinking LEDs",
    "size": "512x512"
  }'
```
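Depending on configuration, the `data` array in the response carries either an image URL or base64-encoded image bytes. A sketch that decodes a `b64_json` payload to disk — this assumes the OpenAI images schema with `response_format` set to `b64_json`, and the sample payload is a stand-in, not a real PNG:

```python
import base64
import os
import tempfile

def save_image(images_response: dict, path: str) -> int:
    """Decode the first b64_json image in an OpenAI-style response; returns bytes written."""
    raw = base64.b64decode(images_response["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Tiny illustrative payload; a real response carries a full PNG here
sample = {"data": [{"b64_json": base64.b64encode(b"\x89PNG fake bytes").decode()}]}
out = os.path.join(tempfile.gettempdir(), "localai-thumb.png")
print(save_image(sample, out))  # number of bytes written
```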
Speech-to-Text
```bash
curl http://localhost:8080/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"
```
Step 4: Connect Your Apps
The magic of LocalAI is that any app using the OpenAI SDK works with zero code changes. Just point it to your server.
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # LocalAI doesn't require a key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
Node.js
```javascript
// ES module syntax; top-level await requires ESM ("type": "module" in package.json)
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'mistral-7b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);
```
Environment Variable (Universal)
Most OpenAI-compatible apps respect `OPENAI_BASE_URL`:

```bash
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
```
Now tools like LangChain, AutoGPT, Continue.dev, and hundreds of others just work.
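Your own scripts can honor the same convention: read `OPENAI_BASE_URL` from the environment and fall back to the official endpoint, so a single export flips a tool between LocalAI and OpenAI. A minimal sketch:

```python
import os

def resolve_base_url(default: str = "https://api.openai.com/v1") -> str:
    """Prefer OPENAI_BASE_URL when set, matching the convention most tools follow."""
    return os.environ.get("OPENAI_BASE_URL", default)

# With the export from above in place, requests go to your LocalAI server
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
print(resolve_base_url())  # http://localhost:8080/v1
```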
Step 5: Performance Tuning
CPU Optimization
```yaml
environment:
  - THREADS=8          # Match your CPU core count
  - CONTEXT_SIZE=2048  # Lower = faster, less memory
```
GPU Optimization
For NVIDIA GPUs, ensure you’re using the CUDA image and set:
```yaml
environment:
  - GPU_LAYERS=99  # Offload all layers to GPU
```
Quantization
Use smaller quantized models for better performance:
- Q4_K_M — Best balance of quality and speed (recommended)
- Q5_K_M — Slightly better quality, slower
- Q8_0 — Near-original quality, much more RAM
- Q2_K — Fastest, lowest quality
Memory Usage Guide
| Model Size | Q4_K_M RAM | Q8_0 RAM |
|---|---|---|
| 3B | ~2GB | ~3.5GB |
| 7B | ~4.5GB | ~8GB |
| 13B | ~8GB | ~14GB |
| 30B | ~18GB | ~32GB |
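The table follows a simple rule of thumb you can apply to any model: the weights take roughly (parameter count × bits per weight ÷ 8) bytes, plus overhead for the KV cache and runtime. A rough estimator — the bit widths and the ~10% overhead factor are approximations, not exact llama.cpp numbers:

```python
# Approximate effective bits per weight for common GGUF quantizations (assumption)
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def estimated_ram_gb(params_billions: float, quant: str, overhead: float = 1.1) -> float:
    """Rough RAM to load a model: weights at bits/8 bytes each, plus runtime overhead."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return round(params_billions * bytes_per_weight * overhead, 1)

print(estimated_ram_gb(7, "Q4_K_M"))  # ~4.6, close to the table's ~4.5GB
print(estimated_ram_gb(13, "Q8_0"))
```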
Troubleshooting
Model loading fails
```bash
docker logs localai
```
Common fixes:
- Check disk space: `df -h`
- Verify the model file isn't corrupted: check that its size matches the source
- Ensure there is enough RAM for the model
Slow responses on CPU
- Use Q4_K_M quantization (not Q8 or full precision)
- Reduce context size to 1024-2048
- Use smaller models (7B instead of 13B)
- Set `THREADS` to your physical core count (not hyperthreads)
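Note that `os.cpu_count()` reports logical CPUs, which double-counts hyperthreads; on Linux you can recover the physical count from `/proc/cpuinfo` by de-duplicating (physical id, core id) pairs. A sketch, shown against a canned cpuinfo excerpt so it runs anywhere:

```python
def physical_cores(cpuinfo: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo text."""
    cores = set()
    physical_id = None
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)

# Two hyperthreads on the same core count as one physical core
sample = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 0
"""
print(physical_cores(sample))  # 1
# On a real host: physical_cores(open("/proc/cpuinfo").read())
```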
GPU not detected
```bash
# Check NVIDIA drivers
nvidia-smi

# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```
If that fails, install the NVIDIA Container Toolkit.
Out of memory (OOM)
- Switch to a smaller quantization (Q4_K_M → Q3_K_M → Q2_K)
- Use a smaller model
- Reduce `CONTEXT_SIZE`
- Add swap space as emergency overflow
What to Run on LocalAI
Some practical uses for your self-hosted AI API:
- Personal assistant — Connect to Open WebUI for a ChatGPT-like interface
- Document search — Use embeddings + a vector DB for RAG
- Code completion — Point Continue.dev at your LocalAI instance
- Home automation — Voice commands via Whisper + Home Assistant
- Content generation — Blog drafts, summaries, translations
- Image generation — Stable Diffusion for thumbnails and graphics
Related Guides
- Running Ollama: Local LLMs — Simpler alternative for chat-only use
- Open WebUI: ChatGPT Interface — Web UI for your local models
- Self-Hosted n8n — Automate workflows with your local AI
LocalAI puts the full OpenAI API stack on your own hardware. No API keys, no per-token costs, no data leaving your network. Once it’s running, every tool in the OpenAI ecosystem becomes a local tool.