Self-Hosting Paperless-GPT: AI-Powered Document Classification
You’ve scanned your documents into Paperless-ngx. Maybe hundreds, maybe thousands. Now comes the tedious part — naming them, tagging them, sorting them into the right categories. Every receipt, invoice, letter, and tax form needs a sensible title and the right tags, or your digital filing cabinet becomes a digital junk drawer.
Paperless-GPT solves this by connecting your Paperless-ngx instance to a large language model. Drop a document in, and the AI generates a title, assigns tags, identifies the correspondent, and even extracts custom field data. It can also re-OCR your documents using LLM vision, catching text that traditional OCR engines miss on messy or low-quality scans.
What Paperless-GPT Actually Does
Paperless-GPT sits alongside your existing Paperless-ngx installation as a companion service. It watches for documents tagged with a specific tag (like paperless-gpt), processes them through an LLM, and writes back improved metadata.
Here’s the workflow:
- A document arrives in Paperless-ngx (via scanner, email, or file drop)
- You tag it with your trigger tag — or set up an automation rule to do it automatically
- Paperless-GPT picks it up, sends the content to your configured LLM
- The AI generates a title, tags, correspondent, and optionally fills custom fields
- You review the suggestions in Paperless-GPT’s web UI — or let auto-processing handle it
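The tagging step can also be scripted against the Paperless-ngx REST API instead of clicked through the UI. A minimal sketch using Python’s standard library — the base URL, token, document ID, and tag ID are placeholders, and note that Paperless-ngx’s `tags` field replaces the document’s full tag list, so include any existing tag IDs too:

```python
import json
import urllib.request

def build_tag_request(base_url: str, token: str, doc_id: int,
                      tag_ids: list[int]) -> urllib.request.Request:
    """Build a PATCH request that sets the tag list on a Paperless-ngx document."""
    req = urllib.request.Request(
        url=f"{base_url}/api/documents/{doc_id}/",
        data=json.dumps({"tags": tag_ids}).encode(),
        method="PATCH",
    )
    req.add_header("Authorization", f"Token {token}")
    req.add_header("Content-Type", "application/json")
    return req

# To actually send it, on a machine that can reach your instance:
# urllib.request.urlopen(build_tag_request("http://paperless:8000", "abc123", 42, [7]))
```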
The key features:
- LLM-enhanced OCR — Uses OpenAI or Ollama vision models to extract text from images, outperforming traditional OCR on handwritten notes, faded receipts, and skewed scans
- Automatic title generation — Context-aware names like “2026-03-15 Electric Bill - ConEdison” instead of “scan_0047.pdf”
- Smart tagging — Assigns existing Paperless-ngx tags based on document content
- Correspondent detection — Identifies who sent the document
- Custom field extraction — Pull out invoice numbers, amounts, dates, or any structured data you define
- Searchable PDF generation — Creates PDFs with transparent text layers for full-text search
Prerequisites
- A Linux server with Docker and Docker Compose installed
- A running Paperless-ngx instance
- An OpenAI API key or a local Ollama installation
- Basic familiarity with Docker Compose and environment variables
Setting Up Paperless-GPT
Step 1: Get Your Paperless-ngx API Token
You need an API token so Paperless-GPT can communicate with your Paperless-ngx instance.
```bash
# If you're running Paperless-ngx in Docker:
docker exec -it paperless-ngx python3 manage.py shell -c \
  "from rest_framework.authtoken.models import Token; from django.contrib.auth.models import User; t, _ = Token.objects.get_or_create(user=User.objects.first()); print(t.key)"
```
Save this token — you’ll need it in the next step.
Step 2: Create the Docker Compose File
Create a new directory and compose file for Paperless-GPT:
```bash
mkdir -p ~/docker/paperless-gpt && cd ~/docker/paperless-gpt
```
```yaml
# docker-compose.yml
services:
  paperless-gpt:
    image: icereed/paperless-gpt:latest
    container_name: paperless-gpt
    ports:
      - "8080:8080"
    environment:
      - PAPERLESS_BASE_URL=http://paperless-ngx:8000
      - PAPERLESS_API_TOKEN=your-paperless-api-token
      - LLM_PROVIDER=openai
      - OPENAI_API_KEY=sk-your-openai-key
      - LLM_MODEL=gpt-4o-mini
      - AUTO_GENERATE_TITLE=true
      - AUTO_GENERATE_TAGS=true
      - AUTO_GENERATE_CORRESPONDENTS=true
      - PAPERLESS_GPT_MANUAL_TAG=paperless-gpt
    restart: unless-stopped
```
If your Paperless-ngx instance is on the same Docker network, use the container name as the hostname. Otherwise, use your server’s IP or domain.
Step 3: Choose Your LLM Backend
Option A: OpenAI (easiest, best accuracy)
```yaml
environment:
  - LLM_PROVIDER=openai
  - OPENAI_API_KEY=sk-your-key-here
  - LLM_MODEL=gpt-4o-mini
  - VISION_LLM_PROVIDER=openai
  - VISION_LLM_MODEL=gpt-4o-mini
```
Cost is minimal — gpt-4o-mini processes a typical document for under a cent. If you’re processing thousands of documents in bulk, expect a few dollars total.
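The “under a cent” claim is easy to sanity-check. A rough back-of-the-envelope, assuming gpt-4o-mini’s published pricing at the time of writing ($0.15 per million input tokens, $0.60 per million output) and a typical scanned document of a few thousand tokens:

```python
# Assumed gpt-4o-mini pricing in USD per million tokens -- check current rates.
INPUT_PER_M = 0.15
OUTPUT_PER_M = 0.60

def doc_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost in USD for one document."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# A typical 2-3 page document: ~3000 tokens in, ~200 tokens of metadata out.
print(f"${doc_cost(3000, 200):.5f}")  # well under a cent
```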
Option B: Ollama (free, private, runs locally)
```yaml
environment:
  - LLM_PROVIDER=ollama
  - OLLAMA_HOST=http://ollama:11434
  - LLM_MODEL=qwen3:8b
  - VISION_LLM_PROVIDER=ollama
  - VISION_LLM_MODEL=llava:13b
```
For Ollama, you’ll want at least 8GB of RAM for smaller models. Reasoning models like qwen3:8b offer the best balance of accuracy and resource usage. For vision OCR, llava:13b handles most document types well.
Option C: Mix and match
You can use Ollama for text classification and OpenAI for vision OCR, or vice versa. Set LLM_PROVIDER and VISION_LLM_PROVIDER independently.
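A mixed setup might look like this — local Ollama for text classification, OpenAI only for the harder vision OCR work (the key here is a placeholder):

```yaml
environment:
  - LLM_PROVIDER=ollama
  - OLLAMA_HOST=http://ollama:11434
  - LLM_MODEL=qwen3:8b
  - VISION_LLM_PROVIDER=openai
  - VISION_LLM_MODEL=gpt-4o-mini
  - OPENAI_API_KEY=sk-your-key-here
```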
Step 4: Start the Service
```bash
docker compose up -d
```
Access the web UI at http://your-server:8080. You’ll see a clean dashboard showing pending documents and processing status.
Configuring Document Processing
Setting Up the Trigger Tag
By default, Paperless-GPT watches for documents tagged with paperless-gpt. Create this tag in Paperless-ngx:
- Go to your Paperless-ngx web UI → Tags → Add Tag
- Name it `paperless-gpt` (must match the `PAPERLESS_GPT_MANUAL_TAG` variable)
- Pick a color — I use orange for “needs processing”
Automatic Processing with Paperless-ngx Rules
Instead of manually tagging documents, create a consumption rule in Paperless-ngx:
- Go to Settings → Mail or Consumption Templates
- Create a rule that automatically assigns the `paperless-gpt` tag to new documents
- This triggers Paperless-GPT to process every incoming document
Custom Field Extraction
This is where Paperless-GPT gets powerful. Define custom fields in Paperless-ngx (like “Invoice Amount”, “Due Date”, “Account Number”), then enable extraction:
```yaml
environment:
  - AUTO_GENERATE_CUSTOM_FIELDS=true
  - CUSTOM_FIELDS_WRITE_MODE=append
```
The three write modes:
- Append (safest) — Only fills empty fields, never overwrites
- Update — Fills empty fields and overwrites existing ones with new suggestions
- Replace — Clears all custom fields and replaces with new suggestions
Start with append until you trust the AI’s accuracy with your document types.
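The difference between the modes is easiest to see as a merge over field values. A hypothetical sketch (not Paperless-GPT’s actual code) of how each strategy treats existing data versus AI suggestions:

```python
def merge_custom_fields(existing: dict, suggested: dict, mode: str) -> dict:
    """Illustrative semantics of the three write modes (hypothetical sketch)."""
    if mode == "append":
        # Only fill fields that are currently empty; never overwrite a value.
        return {**suggested, **{k: v for k, v in existing.items() if v}}
    if mode == "update":
        # Fill empty fields and overwrite existing values with new suggestions.
        return {**existing, **suggested}
    if mode == "replace":
        # Discard everything and keep only the new suggestions.
        return dict(suggested)
    raise ValueError(f"unknown mode: {mode}")
```

Running all three over the same document shows why append is the safe default: it is the only mode that can never destroy a value you entered by hand.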
Customizing AI Prompts
The web UI includes a Settings page where you can customize the prompts used for title generation, tagging, and correspondent detection. This is useful when:
- Your documents follow a specific naming convention
- You want tags in a particular language
- Certain document types need special handling
The defaults work well for most English-language documents, but tweaking prompts can dramatically improve accuracy for specialized use cases like medical records, legal documents, or non-English paperwork.
Running Behind a Reverse Proxy
If you’re already running Caddy, Nginx Proxy Manager, or Traefik for your other services:
Caddy: add a site block that reverse-proxies your chosen hostname to port 8080.
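A minimal Caddyfile site block, assuming a hypothetical hostname `paperless-gpt.example.com` and Caddy on the same Docker network as the container (use the server’s IP otherwise):

```caddyfile
paperless-gpt.example.com {
    reverse_proxy paperless-gpt:8080
}
```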
Nginx Proxy Manager: Add a proxy host pointing to paperless-gpt on port 8080. Enable SSL.
LLM-Enhanced OCR
Traditional OCR (Tesseract, which Paperless-ngx uses by default) struggles with:
- Handwritten text
- Faded or low-contrast documents
- Skewed or rotated scans
- Documents with complex layouts (tables, multi-column)
Paperless-GPT’s LLM OCR sends page images to a vision model, which understands context and layout. The result is dramatically better text extraction, especially on difficult documents.
Enable it with:
```yaml
environment:
  - VISION_LLM_PROVIDER=openai
  - VISION_LLM_MODEL=gpt-4o-mini
```
The system generates searchable PDFs with transparent text layers positioned over each word — your documents stay visually identical but become fully searchable and selectable.
Alternative OCR Backends
Beyond LLM-based OCR, Paperless-GPT also supports:
- Google Document AI — Google’s enterprise OCR service
- Azure Document Intelligence — Microsoft’s OCR solution
- Docling Server — A self-hosted OCR and document conversion service
These can be useful if you need specific compliance certifications or already have enterprise agreements with these providers.
Troubleshooting
Documents not being picked up
- Verify the trigger tag name matches exactly between Paperless-ngx and the `PAPERLESS_GPT_MANUAL_TAG` variable
- Check that `PAPERLESS_BASE_URL` is reachable from inside the container: `docker exec paperless-gpt wget -qO- http://paperless-ngx:8000/api/`
- Ensure your API token is valid
Poor title/tag suggestions
- Try a more capable model (`gpt-4o` instead of `gpt-4o-mini`, or a larger Ollama model)
- Customize the prompts in the Settings page — add examples of good titles for your document types
- Check that OCR text is actually being extracted — some image-only PDFs need vision OCR enabled
High latency with Ollama
- Larger models need more RAM and preferably a GPU
- `qwen3:8b` is the sweet spot for CPU-only setups
- Consider using OpenAI for the initial bulk processing, then switching to Ollama for ongoing documents
Container can’t reach Paperless-ngx
If they’re in separate Docker Compose files, create a shared network:
```yaml
# In both compose files:
networks:
  default:
    name: paperless-network
    external: true
```

Then create the network once on the host:

```bash
docker network create paperless-network
```
Resource Usage
Paperless-GPT itself is lightweight — it’s a Go application that uses minimal CPU and RAM when idle. The real resource consumption depends on your LLM backend:
| Backend | RAM | CPU | Cost per Document |
|---|---|---|---|
| OpenAI gpt-4o-mini | Negligible | Negligible | ~$0.005 |
| Ollama qwen3:8b | 6-8 GB | Moderate | Free |
| Ollama llava:13b (vision) | 10-16 GB | High | Free |
For a typical household generating 5-10 documents per week, OpenAI costs are negligible — maybe $2-3 per year.
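That annual figure follows directly from the table’s per-document estimate — a quick check, taking the upper end of the household volume:

```python
DOCS_PER_WEEK = 10    # upper end of a typical household
COST_PER_DOC = 0.005  # table estimate for gpt-4o-mini

annual = DOCS_PER_WEEK * 52 * COST_PER_DOC
print(f"${annual:.2f} per year")  # $2.60 per year
```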
Wrapping Up
Paperless-GPT turns your Paperless-ngx instance from a document scanner into an intelligent filing system. The setup takes about 10 minutes if you already have Paperless-ngx running, and the payoff is immediate — no more manually naming and tagging every document that comes through your scanner.
Start with OpenAI and gpt-4o-mini for the best out-of-the-box experience. Once you’re comfortable with the system, you can migrate to Ollama for fully local, private processing. Either way, your documents will be better organized than you’d ever manage by hand.
Useful links:
- Paperless-GPT GitHub
- Paperless-ngx Documentation
- Ollama — for local LLM hosting