Self-Hosting LanguageTool: Grammar Checker API with Docker
Every writing tool wants to phone home. Grammarly reads everything you type. Google Docs analyzes your documents on their servers. Even browser extensions quietly ship your text to cloud APIs for grammar checking. If you write anything sensitive — legal documents, medical notes, proprietary code comments, personal journals — that’s a problem.
LanguageTool is an open-source grammar, style, and spell checker that supports over 30 languages. It powers grammar checking in LibreOffice, and its commercial cloud service competes directly with Grammarly. But unlike Grammarly, you can run your own instance. Your text never leaves your network, you get unlimited checks with no word caps, and you can plug it into browser extensions, text editors, and custom applications via a clean REST API.
The self-hosted version doesn’t include LanguageTool’s newer AI-based rules (those are cloud-only), but the rule-based engine catches the vast majority of grammar, spelling, punctuation, and style issues. Add n-gram datasets and you get context-sensitive spell checking that catches commonly confused words like “their” vs “there” — something basic spell checkers miss entirely.
LanguageTool vs Other Writing Tools
| Feature | LanguageTool (Self-Hosted) | Grammarly | ProWritingAid | Vale | Hunspell |
|---|---|---|---|---|---|
| Open source | ✅ LGPL 2.1 | ❌ Proprietary | ❌ Proprietary | ✅ MIT | ✅ Various |
| Self-hostable | ✅ Docker/Java | ❌ | ❌ | ✅ CLI only | ✅ CLI only |
| Languages | ✅ 30+ | ⚠️ ~12 | ⚠️ English only | ⚠️ English-focused | ✅ Many |
| Grammar checking | ✅ Rule-based | ✅ AI + rules | ✅ AI + rules | ⚠️ Style only | ❌ Spell only |
| Style suggestions | ✅ Built-in | ✅ Premium | ✅ | ✅ Configurable | ❌ |
| Context-aware spelling | ✅ With n-grams | ✅ | ✅ | ❌ | ❌ |
| REST API | ✅ | ⚠️ Paid | ❌ | ❌ | ❌ |
| Browser extension | ✅ Custom server | ✅ | ✅ | ❌ | ❌ |
| Privacy | ✅ 100% local | ❌ Cloud-only | ❌ Cloud-only | ✅ Local | ✅ Local |
| Pricing | Free (self-hosted) | From $12/mo | From $10/mo | Free | Free |
LanguageTool hits the sweet spot: real grammar checking (not just spell check) with full privacy and a proper API. If you write in multiple languages, it’s basically the only self-hosted option that handles them all.
Prerequisites
- Docker and Docker Compose installed (Get Docker)
- At least 2 GB of RAM (4+ GB recommended with n-gram datasets)
- Optional: a domain name for remote access (e.g., grammar.example.com)
- Optional: 8-10 GB of disk space per language for n-gram datasets
Quick Start with Docker Compose
Create a project directory and configuration:
mkdir languagetool && cd languagetool
Create docker-compose.yml:
services:
  languagetool:
    image: erikvl87/languagetool:latest
    container_name: languagetool
    ports:
      - "8010:8010"
    environment:
      - Java_Xms=512m
      - Java_Xmx=2g
      - langtool_pipelinePrewarming=true
      - langtool_maxTextLength=50000
    volumes:
      - languagetool_data:/LanguageTool/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8010/v2/languages"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  languagetool_data:
Start the server:
docker compose up -d
The first startup takes 30-60 seconds as LanguageTool loads its rule database and optionally prewarms the processing pipeline. Once ready, test it:
curl -d "language=en-US" -d "text=Their going to the store yesterday." \
http://localhost:8010/v2/check | python3 -m json.tool
You should see matches flagging “Their” (should be “They’re”) and possibly “going” with “yesterday” (tense inconsistency). That confirms your grammar checker is live.
Understanding the Configuration
The key environment variables control how LanguageTool behaves:
| Variable | Default | Description |
|---|---|---|
| Java_Xms | 256m | Minimum Java heap size |
| Java_Xmx | 512m | Maximum Java heap size — increase for production use |
| langtool_pipelinePrewarming | false | Prewarm language pipelines at startup for faster first checks |
| langtool_maxTextLength | 40000 | Maximum characters per request (increase for long documents) |
| langtool_maxCheckThreads | 10 | Concurrent check threads |
| langtool_cacheSize | 0 | Number of cached results (set to 1000+ for repeated checks) |
| langtool_requestLimit | 0 | Max requests per requestLimitPeriodInSeconds (0 = unlimited) |
| langtool_languageModel | — | Path to the n-gram data directory inside the container |
For a production setup serving a small team, Java_Xmx=2g with pipeline prewarming handles most workloads comfortably.
Adding N-Gram Datasets for Smarter Checking
The base LanguageTool install catches grammar and spelling errors using rules. N-gram datasets add statistical analysis — LanguageTool compares word sequences against billions of real-world text samples to catch errors that rules miss.
The difference is significant. Without n-grams, “I went to there house” might slip through or draw only a vague grammar suggestion. With n-grams, LanguageTool confidently suggests “there” → “their” because “their house” appears orders of magnitude more frequently in real text than “there house.”
Download n-gram data (English shown — repeat for other languages):
mkdir -p ./ngrams
cd ./ngrams
# English (~8 GB unzipped)
wget https://languagetool.org/download/ngram-data/ngrams-en-20150817.zip
unzip ngrams-en-20150817.zip
# Optional: German (~8 GB), French (~3 GB), Spanish (~3 GB)
# wget https://languagetool.org/download/ngram-data/ngrams-de-20150819.zip
# wget https://languagetool.org/download/ngram-data/ngrams-fr-20150913.zip
# wget https://languagetool.org/download/ngram-data/ngrams-es-20150915.zip
Update your docker-compose.yml to mount the n-gram data:
services:
  languagetool:
    image: erikvl87/languagetool:latest
    container_name: languagetool
    ports:
      - "8010:8010"
    environment:
      - Java_Xms=512m
      - Java_Xmx=2g
      - langtool_pipelinePrewarming=true
      - langtool_maxTextLength=50000
      - langtool_languageModel=/ngrams
    volumes:
      - languagetool_data:/LanguageTool/data
      - ./ngrams:/ngrams:ro
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8010/v2/languages"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  languagetool_data:
Restart the container:
docker compose down && docker compose up -d
Startup will be slower with n-gram data loaded (up to 2-3 minutes). Once ready, test the improved checking:
curl -d "language=en-US" \
-d "text=I went to there house and than we went too the store." \
http://localhost:8010/v2/check | python3 -m json.tool
With n-grams, you should see “there” → “their”, “than” → “then”, and “too” → “to” all flagged — confused word pairs that basic spell checkers miss completely.
Connecting Browser Extensions
The LanguageTool browser extension for Chrome and Firefox supports custom servers. This gives you grammar checking across every text field on the web — Gmail, Google Docs, social media, CMS editors — all hitting your private instance.
- Install the LanguageTool extension for your browser
- Click the extension icon → gear icon → Settings
- Scroll to Advanced or Experimental settings
- Select Local server (localhost) or Other server
- Enter your server URL: http://localhost:8010/v2
- For remote access: https://grammar.example.com/v2
If you’re accessing from other machines, you’ll need a reverse proxy with HTTPS — browsers increasingly block mixed HTTP content from extensions.
Reverse Proxy Setup
Caddy (Recommended)
Add to your Caddyfile:
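A minimal example, assuming the hostname grammar.example.com and that Caddy can reach the container as languagetool on a shared Docker network (adjust both to match your setup):

```
grammar.example.com {
    reverse_proxy languagetool:8010
}
```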
Caddy handles HTTPS automatically. If running Caddy in Docker, ensure it’s on the same network as LanguageTool.
Nginx
server {
    listen 443 ssl http2;
    server_name grammar.example.com;

    ssl_certificate /etc/letsencrypt/live/grammar.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grammar.example.com/privkey.pem;

    location / {
        proxy_pass http://languagetool:8010;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # LanguageTool can return large responses for long documents
        proxy_read_timeout 120s;
        proxy_buffer_size 16k;
        proxy_buffers 4 32k;
    }
}
Integrating with Text Editors
VS Code
Install the LTeX extension and add to your settings.json:
{
  "ltex.languageToolHttpServerUri": "http://localhost:8010",
  "ltex.language": "en-US",
  "ltex.enabled": ["markdown", "latex", "plaintext", "html"]
}
This gives you real-time grammar checking in Markdown, LaTeX, and plain text files — perfect for documentation and technical writing.
Neovim
With nvim-lspconfig, configure LTeX:
require('lspconfig').ltex.setup{
  settings = {
    ltex = {
      language = "en-US",
      languageToolHttpServerUri = "http://localhost:8010",
    },
  },
}
Obsidian
The Obsidian LanguageTool Plugin supports custom servers. In settings, set the server URL to http://localhost:8010 and enable auto-checking.
API Usage for Custom Applications
LanguageTool’s REST API is straightforward. Here are common patterns:
Basic Check
curl -X POST http://localhost:8010/v2/check \
-d "language=en-US" \
-d "text=This are a test of the grammar checker."
Auto-Detect Language
curl -X POST http://localhost:8010/v2/check \
-d "language=auto" \
-d "text=Dies ist ein Test."
Check with Specific Rules Disabled
curl -X POST http://localhost:8010/v2/check \
-d "language=en-US" \
-d "text=This is a test." \
-d "disabledRules=UPPERCASE_SENTENCE_START,COMMA_PARENTHESIS_WHITESPACE"
Python Integration
import requests

def check_grammar(text, language="en-US"):
    response = requests.post(
        "http://localhost:8010/v2/check",
        data={"text": text, "language": language}
    )
    result = response.json()
    for match in result.get("matches", []):
        print(f"Issue: {match['message']}")
        print(f"  Context: {match['context']['text']}")
        if match['replacements']:
            print(f"  Suggestion: {match['replacements'][0]['value']}")
        print()

check_grammar("Their going to there house and than leaving.")
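Each match also carries offset and length fields into the original text, so you can apply the top suggestion automatically. A minimal sketch (apply_suggestions is an illustrative helper; it applies replacements right-to-left so earlier offsets stay valid as the text changes length):

```python
def apply_suggestions(text, matches):
    """Apply the first replacement of each match to the text.

    Matches are the objects returned by /v2/check; each has an
    "offset", a "length", and a list of "replacements".
    """
    # Work right-to-left so earlier offsets remain valid after
    # replacements change the text's length.
    for match in sorted(matches, key=lambda m: m["offset"], reverse=True):
        if not match.get("replacements"):
            continue  # nothing to suggest for this match
        start = match["offset"]
        end = start + match["length"]
        text = text[:start] + match["replacements"][0]["value"] + text[end:]
    return text

# Example with a hand-written match (same shape as the API output):
fixed = apply_suggestions(
    "Their going home.",
    [{"offset": 0, "length": 5, "replacements": [{"value": "They're"}]}],
)
print(fixed)  # They're going home.
```

Blindly accepting the first suggestion is fine for a demo, but for real documents you'll want to review each change — some rules offer several plausible replacements.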
List Supported Languages
curl http://localhost:8010/v2/languages | python3 -m json.tool
Adding Custom Words and Rules
You’ll inevitably have words LanguageTool doesn’t recognize — company names, product names, technical jargon. Rather than ignoring the warnings, add them to custom dictionaries.
Create a custom spelling.txt file:
mkdir -p ./config
cat > ./config/spelling.txt << 'EOF'
# Company and product names
Kubernetes
PostgreSQL
Redis
Nginx
selfhostsetup
Cloudflare
# Technical terms
homelab
proxmox
truenas
EOF
Mount the custom dictionary into the container by adding to your volumes:
volumes:
  - languagetool_data:/LanguageTool/data
  - ./ngrams:/ngrams:ro
  - ./config/spelling.txt:/LanguageTool/org/languagetool/resource/en/hunspell/spelling.txt:ro
For rule customization, you can disable specific rules globally by creating a server.properties file:
cat > ./config/server.properties << 'EOF'
# Disable rules that don't apply to technical writing
disabledRuleIds=WHITESPACE_RULE,UPPERCASE_SENTENCE_START
EOF
Mount it:
volumes:
  - ./config/server.properties:/LanguageTool/server.properties:ro
Backup and Restore
LanguageTool’s state is minimal — it’s primarily a stateless API server. Your important data is:
- docker-compose.yml — your configuration
- Custom dictionaries — spelling.txt and any rule overrides
- N-gram datasets — large but downloadable again
Back up the configuration:
#!/bin/bash
BACKUP_DIR="./backups/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
cp docker-compose.yml "$BACKUP_DIR/"
cp -r config/ "$BACKUP_DIR/" 2>/dev/null
echo "Backup saved to $BACKUP_DIR"
N-gram datasets don’t need backing up — they’re static downloads. Just keep note of which languages you installed.
Troubleshooting
Server takes a long time to start:
Pipeline prewarming loads language models into memory at startup. With n-grams, expect 2-3 minutes. Check progress with docker logs -f languagetool. If it hangs beyond 5 minutes, increase Java_Xmx.
Out of memory errors:
With n-gram datasets, 2 GB of Java heap is the practical minimum. For multiple languages with n-grams, allocate 4+ GB. Monitor with docker stats languagetool.
Browser extension shows “Cannot connect to server”:
Verify the server is running: curl http://localhost:8010/v2/languages. If accessing remotely, ensure HTTPS is configured — browser extensions often reject HTTP connections to non-localhost addresses.
Slow response times:
Enable pipeline prewarming (langtool_pipelinePrewarming=true) and increase cache size (langtool_cacheSize=1000). First request after startup is always slower. For large documents, split into smaller chunks.
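A simple way to chunk is to split on paragraph boundaries so each request stays under the server's limit. A minimal sketch (chunk_text and the 10,000-character default are illustrative; tune the limit to your langtool_maxTextLength):

```python
def chunk_text(text, max_chars=10000):
    """Split text on paragraph boundaries so no chunk exceeds max_chars.

    A single paragraph longer than max_chars is kept as one oversized
    chunk rather than split mid-sentence. Note that match offsets
    returned by the API are relative to each chunk, not the full text.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if len(candidate) > max_chars and current:
            # Current chunk is full; start a new one with this paragraph.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Check each chunk with a separate POST to /v2/check; this also keeps individual requests fast enough that the thread pool stays responsive for other clients.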
Language not detected correctly:
Explicitly pass the language parameter instead of using auto. Automatic detection struggles with short text snippets. For better auto-detection, add fastText (requires building from source or using a custom Docker image).
Custom words not recognized:
Ensure the spelling file is mounted to the correct path for your language. English uses /LanguageTool/org/languagetool/resource/en/hunspell/spelling.txt. Check that the file uses UTF-8 encoding with one word per line.
Power User Tips
- Rate limiting for shared instances: Set langtool_requestLimit=20 and langtool_requestLimitPeriodInSeconds=60 to prevent abuse on shared servers
- Multiple languages: Load n-gram data for each language you need — LanguageTool detects the language and uses the appropriate dataset
- CI/CD integration: Use the API in your build pipeline to check documentation PRs. Fail the build on grammar errors above a threshold
- Monitoring: Hit /v2/languages as a health endpoint. If it responds, the server is healthy
- Resource tuning: Start with Java_Xmx=1g without n-grams or Java_Xmx=2g with them. Monitor actual usage with docker stats and adjust
- Docker networking: Put LanguageTool on an internal Docker network with your reverse proxy. No need to expose port 8010 to the host if you’re only accessing through the proxy
- Multi-user setup: LanguageTool is inherently multi-tenant — concurrent requests are handled by the thread pool. No user accounts needed for basic use
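The CI/CD idea above can be sketched as a small gate script. This is a stdlib-only illustration — LT_URL, IGNORED_RULES, THRESHOLD, count_issues, and check_file are all hypothetical names you'd adapt to your pipeline:

```python
#!/usr/bin/env python3
"""Fail a docs build when LanguageTool reports too many issues."""
import json
import sys
import urllib.parse
import urllib.request

LT_URL = "http://localhost:8010/v2/check"  # point at your own instance in CI
IGNORED_RULES = {"WHITESPACE_RULE"}        # rules that don't matter for docs
THRESHOLD = 0                              # fail on any other issue

def count_issues(result, ignored=frozenset(IGNORED_RULES)):
    # Count matches whose rule isn't on the ignore list.
    return sum(
        1 for m in result.get("matches", [])
        if m["rule"]["id"] not in ignored
    )

def check_file(path, language="en-US"):
    # POST the file's contents to /v2/check and count the issues.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    data = urllib.parse.urlencode({"text": text, "language": language}).encode()
    with urllib.request.urlopen(LT_URL, data=data) as resp:
        return count_issues(json.load(resp))

if __name__ == "__main__" and sys.argv[1:]:
    total = sum(check_file(p) for p in sys.argv[1:])
    print(f"{total} grammar issue(s) found")
    sys.exit(1 if total > THRESHOLD else 0)
```

Run it as `python3 lint_docs.py docs/*.md` in a pipeline step; a non-zero exit fails the build.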
Wrapping Up
Self-hosting LanguageTool gives you a private grammar checking API that works with browser extensions, text editors, and custom applications. It’s one of those services where self-hosting makes obvious sense — your writing is some of the most personal data you have, and there’s no reason to send it to a third party for basic grammar checking.
The setup is straightforward: a single container, optional n-gram datasets for smarter checking, and a reverse proxy if you want remote access. Add it to your browser extension config and you’ve got Grammarly-like checking everywhere you type, without the subscription or the privacy tradeoff.
Related guides: