# Performance Tuning

## Auto-Tuning Benchmark
Don't guess: let the benchmark find optimal settings for your hardware:

```bash
npm run tune
```

Creates `tuned_environment_variables.env` with optimal values in ~60-90 seconds.

For a local setup, just run `npm run tune`; the defaults work out of the box:
- `QDRANT_URL` defaults to `http://localhost:6333`
- `EMBEDDING_BASE_URL` defaults to `http://localhost:11434`
- `EMBEDDING_MODEL` defaults to `unclemusclez/jina-embeddings-v2-base-code:latest`
For a remote setup, configure via environment variables:

```bash
QDRANT_URL=http://192.168.1.100:6333 \
EMBEDDING_BASE_URL=http://192.168.1.100:11434 \
npm run tune
```
The benchmark tests 7 parameters and shows estimated indexing times:
```
Phase 1: Embedding Batch Size ... Optimal: 128
Phase 2: Embedding Concurrency ... Optimal: 2
Phase 3: Qdrant Batch Size ... Optimal: 384
```
Then add the tuned values to your MCP config:

```bash
claude mcp add tea-rags -s user -- node /path/to/tea-rags-mcp/build/index.js \
  -e QDRANT_URL=http://localhost:6333 \
  -e EMBEDDING_BATCH_SIZE=256 \
  -e EMBEDDING_CONCURRENCY=2 \
  -e QDRANT_UPSERT_BATCH_SIZE=384
```
## Embeddings-Only Benchmark
For GPU-specific optimization without Qdrant (embedding calibration only):
```bash
npm run benchmark-embeddings
```
This benchmark uses a three-phase plateau detection algorithm designed for production robustness.
### Three-Phase Calibration Algorithm

#### Phase 1: Batch Plateau Detection (CONCURRENCY=1)
- Tests batch sizes: `[256, 512, 1024, 2048, 3072, 4096]`
- Uses an adaptive chunk count of `min(batch×2, MAX_TOTAL_CHUNKS)` per batch
- Stops when improvement < 3% (plateau detected) or the timeout is exceeded
- Plateau timeout: calculated from the previous run's throughput to detect degradation early
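The Phase 1 loop can be sketched in TypeScript as follows. This is a minimal sketch, assuming one measurement per batch size; `findPlateauBatches` and `measure` are hypothetical names, not the tool's API, and the timeout handling is omitted.

```typescript
// Minimal sketch of Phase 1 plateau detection (hypothetical names).
// Batch sizes are tried in ascending order; the search stops at the
// first improvement below 3%, and larger batches are never measured.
type Result = { batch: number; rate: number }; // rate in chunks/s

function findPlateauBatches(
  batchSizes: number[],
  measure: (batch: number) => number, // stand-in for a real benchmark run
  threshold = 0.03,                   // < 3% gain counts as a plateau
): Result[] {
  const results: Result[] = [];
  for (const batch of batchSizes) {
    const rate = measure(batch);
    const prev = results[results.length - 1];
    results.push({ batch, rate });
    if (prev && (rate - prev.rate) / prev.rate < threshold) {
      // Plateau (or degradation) detected: both ends of the flat
      // region move on to the concurrency tests of Phase 2.
      return [prev, { batch, rate }];
    }
  }
  return results.slice(-2); // no plateau found: keep the top end
}
```

With the rates from the example output later in this section (143 chunks/s at batch 256, 145 at batch 512), this returns plateau batches [256, 512] without ever measuring batch 1024.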
#### Phase 2: Concurrency Testing (on plateau only)
- Tests concurrency `[1, 2, 4]` on plateau batches only
- Plateau timeout: calculated from the baseline (CONC=1) throughput
- Stops testing higher concurrency if the timeout is exceeded (degradation)
- Reuses CONC=1 results from Phase 1 (no redundant testing)
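The plateau-timeout idea can be sketched as a time budget derived from a known-good rate. The 1.55× safety margin below is an assumption inferred from the `max …s` values in the example output, not a documented constant, and the function name is hypothetical.

```typescript
// Hypothetical sketch: a run that cannot finish its chunks within the
// time the known-good rate would need (plus a safety margin) is already
// a degradation and can be aborted early. The margin is an assumption.
function plateauTimeoutSec(
  chunks: number,
  knownRate: number, // chunks/s: previous batch (Phase 1) or CONC=1 baseline (Phase 2)
  margin = 1.55,
): number {
  return (chunks / knownRate) * margin;
}
```

For 1024 chunks against the 143 chunks/s baseline this gives about 11.1 s, in line with the `max 11.1s` budget shown in the example output.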
#### Phase 3: Robust Selection
- Selects from configurations within 2% of maximum throughput
- Prefers: Lower concurrency → Lower batch size → Higher throughput
- Avoids overfitting to noise, reduces tail-risk
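The selection rule can be sketched as a filter plus a tie-break sort. The names are hypothetical, and this shows only the generic preference order above; the real tool additionally distinguishes local from remote setups, as the example output shows.

```typescript
// Keep every config within 2% of the best throughput, then prefer the
// simplest one: lower concurrency, then lower batch, then higher rate.
type Config = { batch: number; concurrency: number; rate: number };

function pickRobust(configs: Config[], tolerance = 0.02): Config {
  const max = Math.max(...configs.map((c) => c.rate));
  const acceptable = configs.filter((c) => c.rate >= max * (1 - tolerance));
  acceptable.sort(
    (a, b) =>
      a.concurrency - b.concurrency || // lower concurrency first
      a.batch - b.batch ||             // then lower batch size
      b.rate - a.rate,                 // then higher throughput
  );
  return acceptable[0];
}
```

Filtering at 2% below the maximum discards near-noise "winners", so a config that is 1% slower but runs at half the concurrency is preferred.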
### Key Principles
The algorithm follows engineering best practices:
- Adaptive Workload: each batch size tests `min(batch×2, MAX_TOTAL_CHUNKS)` chunks
- Plateau Over Peak: seeks a stable performance range, not the theoretical maximum
- Noise Tolerance: differences below 2-3% are treated as measurement noise
- Robustness: prefers simpler configs (lower concurrency/batch) when performance is equivalent
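The adaptive-workload rule is a simple clamp; the chunk counts in the example output (512 chunks at batch 256, 1024 at batch 512) follow from it. The function name is illustrative, not the tool's API.

```typescript
// Each batch size is measured on min(batch × 2, MAX_TOTAL_CHUNKS)
// chunks: larger batches get proportionally more work to average over,
// capped so no single run grows unbounded.
function chunksForBatch(batch: number, maxTotalChunks: number): number {
  return Math.min(batch * 2, maxTotalChunks);
}
```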
### When to Use

Use `npm run benchmark-embeddings` when you:
- Want to understand GPU/CPU characteristics without Qdrant
- Are comparing different embedding models
- Need to diagnose embedding bottlenecks
- Want to see the full calibration process with detailed output
Use `npm run tune` (full benchmark) when you:
- Need complete end-to-end optimization (embedding + Qdrant)
- Want a production-ready configuration file
- Are setting up a new deployment
### Example Output (Remote GPU)
```
Phase 1: Batch Plateau Detection (CONCURRENCY=1)
  512 chunks @ batch 256                   143 chunks/s (3.6s)  +100.0%
  1024 chunks @ batch 512 (max 11.1s)      145 chunks/s (7.1s)  +1.4%
  → Plateau detected, stopping
Plateau batches: [256, 512]

Phase 2: Concurrency Effect Test
BATCH=256
  CONC=1 (from Phase 1)                    143 chunks/s
  CONC=2 (1024 chunks, max 11.1s)  STABLE  152 chunks/s  +6.3%
  CONC=4 (2048 chunks, max 22.1s)  STABLE  153 chunks/s  +0.7%
  → Concurrency plateau, stopping
BATCH=512
  CONC=1 (from Phase 1)                    145 chunks/s
  CONC=2 (2048 chunks, max 21.8s)  STABLE  147 chunks/s  +1.4%
  → Concurrency plateau, stopping

Phase 3: Configuration Selection
Acceptable configurations: 5/5
Max throughput: 153 chunks/s

Recommended configurations:
  🏠 Local GPU:  BATCH_SIZE=512 CONCURRENCY=2 (153 chunks/s)
  🌐 Remote GPU: BATCH_SIZE=256 CONCURRENCY=4 (153 chunks/s)

Detected setup: 🌐 Remote
Selected: BATCH_SIZE=256, CONCURRENCY=4
```
Hardware: remote AMD Radeon 7800M GPU via LAN, `jina-embeddings-v2-base-code`.
### Testing Different Models
```bash
# Test jina-embeddings-v2-base-code (768 dims) - default
npm run benchmark-embeddings

# Test a different model
EMBEDDING_MODEL=mxbai-embed-large:latest npm run benchmark-embeddings

# Remote GPU
EMBEDDING_BASE_URL=http://192.168.1.100:11434 npm run benchmark-embeddings
```
## Indexing Benchmarks
| Codebase | LoC | Local Setup | Remote GPU |
|---|---|---|---|
| Small CLI tool | 10K | ~2s | ~2s |
| Medium library | 50K | ~11s | ~9s |
| Large library | 100K | ~21s | ~17s |
| Enterprise app | 500K | ~2 min | ~1.5 min |
| Large codebase | 1M | ~3.5 min | ~3 min |
| VS Code | 3.5M | ~12 min | ~10 min |
| Kubernetes | 5M | ~18 min | ~15 min |
| Linux kernel | 10M | ~36 min | ~29 min |
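These times scale roughly linearly with code size. A back-of-envelope model converts LoC to chunks, then divides by the embedding rate; the ~50 characters/line figure and the embedding-dominated assumption are illustrative choices of mine, not measured by the benchmark.

```typescript
// Rough indexing-time estimate. Chunk count is derived from LoC using
// an assumed ~50 chars/line and the default CODE_CHUNK_SIZE of 2500
// characters; embedding is treated as the only significant cost.
function estimateIndexingSec(
  loc: number,
  embeddingRate: number, // chunks/s, e.g. 156 for the remote GPU setup
  charsPerLine = 50,     // assumption
  chunkSize = 2500,      // default CODE_CHUNK_SIZE
): number {
  const chunks = (loc * charsPerLine) / chunkSize;
  return chunks / embeddingRate;
}
```

For VS Code at 3.5M LoC and 156 chunks/s this predicts roughly 7.5 minutes, the same ballpark as the measured times above.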
## Deployment Topologies

### 🏠 Fully Local Setup
Everything runs on your machine — lowest latency, fully offline.
Best for: Powerful local GPU (M3 Pro or better), offline work, minimal setup.
### ⭐ Remote GPU + Local Qdrant (Recommended)
Embedding runs on a dedicated GPU server; Qdrant runs locally in Docker.
Best for: Most users — fast GPU embedding + fast local storage. Fastest overall: ~7m 39s for VS Code (3.5M LoC).
### 🌐 Full Remote Setup

Both Qdrant and Ollama run on a dedicated server.
Best for: Cannot run Docker locally, indexing from multiple thin clients. Trade-off: Network latency reduces storage rate (~4x slower than local).
## Performance Comparison
| Metric | 🏠 Fully Local | ⭐ Remote GPU + Local Qdrant | 🌐 Full Remote |
|---|---|---|---|
| Optimal batch | 512 | 256 | 256 |
| Optimal concurrency | 1 | 6 | 4 |
| Embedding rate | 87 ch/s | 156 ch/s | 154 ch/s |
| Storage rate | 6273 ch/s | 6966 ch/s | 1810 ch/s |
| VS Code (3.5M LoC) | 13m 36s | 7m 39s | 8m 13s |
Network latency crushes remote Qdrant storage: 6966 ch/s → 1810 ch/s (3.8x drop). Always run Qdrant locally if possible. Embedding benefits from dedicated GPU (1.8x faster than M3 Pro), and concurrency hides network latency.
## Key Performance Insights

### Batch Size
Rule of thumb:
- Local GPU: Use large batches (512) + CONCURRENCY=1 to minimize per-batch overhead
- Remote GPU: Use smaller batches (256) + CONCURRENCY=4-6 to hide network latency
```bash
# Local GPU — GPU-bound, concurrency adds overhead
export EMBEDDING_BATCH_SIZE=512
export EMBEDDING_CONCURRENCY=1

# Remote GPU — network latency hidden by parallel requests
export EMBEDDING_BATCH_SIZE=256
export EMBEDDING_CONCURRENCY=4
```
### Concurrency

Critical insight: concurrency only helps with a remote GPU. A local GPU sees no improvement; parallel requests add overhead without benefit.
| Setup | Concurrency | Why |
|---|---|---|
| Local GPU | 1 | GPU-bound, parallel requests add overhead |
| Remote GPU | 4-6 | Hide network latency with overlapping I/O |
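A toy model makes the difference concrete. Each request costs GPU compute time plus network round-trip time; overlapping requests raise throughput only until the GPU itself saturates. All numbers and names below are illustrative assumptions, not measured values.

```typescript
// Throughput (chunks/s) of C overlapping batch requests: pipelining can
// hide latency, but never beyond what the GPU alone sustains.
function modelRate(
  batch: number,      // chunks per request
  computeSec: number, // GPU time per batch
  rttSec: number,     // network round-trip overhead per batch
  concurrency: number,
): number {
  const pipelined = (concurrency * batch) / (computeSec + rttSec);
  const gpuCap = batch / computeSec; // hard ceiling: the GPU itself
  return Math.min(pipelined, gpuCap);
}
```

With `rttSec = 0` (local GPU) the formula is already at the GPU ceiling at `concurrency = 1`, so extra concurrency buys nothing; with a round-trip comparable to compute time, `concurrency = 2` roughly doubles throughput.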
### Qdrant Ordering
| Mode | Use case | Performance |
|---|---|---|
| `weak` | Local Qdrant, single indexer | Fastest |
| `medium` | Balanced | Default |
| `strong` | Remote Qdrant, multiple indexers | Safest, highest consistency |
```bash
# Local setup
export QDRANT_BATCH_ORDERING=weak

# Remote GPU + Local Qdrant
export QDRANT_BATCH_ORDERING=strong

# Full remote
export QDRANT_BATCH_ORDERING=weak
```
### Bottleneck Analysis
Embedding is always the bottleneck:
- Embedding: 87-156 ch/s
- Storage: 1810-6966 ch/s
Even at 156 ch/s (remote GPU), embedding is 40x slower than storage. Invest in GPU, not Qdrant infrastructure.
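To make the "invest in GPU" point concrete: even if every chunk were embedded and then stored strictly in sequence, storage's share of the total time would be tiny. The rates below are the measured ones from the comparison table above; the function name is illustrative.

```typescript
// Fraction of end-to-end per-chunk time spent in the storage stage,
// assuming the two stages run back to back with no overlap.
function storageShare(embRate: number, storeRate: number): number {
  const embSec = 1 / embRate;     // seconds to embed one chunk
  const storeSec = 1 / storeRate; // seconds to store one chunk
  return storeSec / (embSec + storeSec);
}
```

At 156 chunks/s embedding and 6966 chunks/s storage, storage accounts for only about 2% of the total, so even large Qdrant speedups barely move end-to-end time.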
## Tuning Guidelines

### For Large Codebases (500K+ LOC)
- ✅ Run `npm run tune` to find optimal batch sizes
- ✅ Increase `MAX_IO_CONCURRENCY=100` for SSD
- ✅ Increase Qdrant memory limits in `docker-compose.yml`
- ✅ Use `.contextignore` to exclude `node_modules` and build artifacts
### For Slow Search
- ✅ Use filters: `fileTypes`, `pathPattern` to narrow scope
- ✅ Limit results to the needed count
- ✅ Enable hybrid search for technical queries
- ✅ Verify Qdrant has payload indexes
### For Memory Issues
- ⚠️ Reduce `CODE_CHUNK_SIZE` (default 2500)
- ⚠️ Reduce `QDRANT_UPSERT_BATCH_SIZE`
- ⚠️ Increase Qdrant memory in `docker-compose.yml`
- ⚠️ Index subdirectories separately
## Monitoring & Debug

Enable detailed timing logs:

```bash
export DEBUG=1
```

Logs are written to `~/.tea-rags-mcp/logs/`.

Check index status:

```
/mcp__qdrant__get_index_status /path/to/project
```

Returns: status, chunk count, last update time, collection statistics.
## Hardware Recommendations

### Minimum (Development)
- 4GB RAM, SSD, CPU embedding (slow but works)

### Recommended (Production)
- 8GB RAM, GPU with 8GB+ VRAM, SSD, dedicated Qdrant

### Enterprise (Large Codebases)
- 16GB+ RAM, GPU with 12GB+ VRAM, NVMe SSD, clustered Qdrant
## Pre-configured Settings by Setup

### 🏠 Local Setup (MacBook M3 Pro)
```bash
EMBEDDING_BATCH_SIZE=512
EMBEDDING_CONCURRENCY=1
QDRANT_UPSERT_BATCH_SIZE=192
QDRANT_BATCH_ORDERING=weak
QDRANT_FLUSH_INTERVAL_MS=100
QDRANT_DELETE_BATCH_SIZE=1500
QDRANT_DELETE_CONCURRENCY=12
# Embedding: 87 ch/s, Storage: 6273 ch/s
```
Why these values:
- `BATCH_SIZE=512` + `CONCURRENCY=1`: GPU-bound workload, concurrency adds overhead
- `QDRANT_BATCH_ORDERING=weak`: local Qdrant doesn't need strong ordering
- `FLUSH_INTERVAL_MS=100`: fast flushes for local SSD
### ⭐ Remote GPU + Local Qdrant (AMD 7800M)
```bash
EMBEDDING_BATCH_SIZE=256
EMBEDDING_CONCURRENCY=6
QDRANT_UPSERT_BATCH_SIZE=512
QDRANT_BATCH_ORDERING=strong
QDRANT_FLUSH_INTERVAL_MS=250
QDRANT_DELETE_BATCH_SIZE=500
QDRANT_DELETE_CONCURRENCY=16
# Embedding: 156 ch/s, Storage: 6966 ch/s
```
Why these values:
- `BATCH_SIZE=256` + `CONCURRENCY=6`: network latency hidden by concurrent requests
- `QDRANT_BATCH_ORDERING=strong`: higher ordering ensures data consistency
- Higher storage rate (6966 ch/s) because Qdrant is local
### 🌐 Full Remote Setup
```bash
EMBEDDING_BATCH_SIZE=256
EMBEDDING_CONCURRENCY=4
QDRANT_UPSERT_BATCH_SIZE=256
QDRANT_BATCH_ORDERING=weak
QDRANT_FLUSH_INTERVAL_MS=500
QDRANT_DELETE_BATCH_SIZE=1000
QDRANT_DELETE_CONCURRENCY=12
# Embedding: 154 ch/s, Storage: 1810 ch/s
```
Why these values:
- `BATCH_SIZE=256` + `CONCURRENCY=4`: balance between hiding latency and avoiding queue buildup
- `QDRANT_BATCH_ORDERING=weak`: reduces round-trips over the network
- `FLUSH_INTERVAL_MS=500`: larger flush windows amortize network latency
- Storage bottlenecked by the network (1810 ch/s vs 6966 ch/s local)