Performance Tuning

Auto-Tuning Benchmark

Don't guess — let the benchmark find optimal settings for your hardware:

npm run tune

Creates tuned_environment_variables.env with optimal values in ~60-90 seconds.

For local setup, just run npm run tune — defaults work out of the box:

  • QDRANT_URL defaults to http://localhost:6333
  • EMBEDDING_BASE_URL defaults to http://localhost:11434
  • EMBEDDING_MODEL defaults to unclemusclez/jina-embeddings-v2-base-code:latest

For remote setup, configure via environment variables:

QDRANT_URL=http://192.168.1.100:6333 \
EMBEDDING_BASE_URL=http://192.168.1.100:11434 \
npm run tune

The benchmark tests 7 parameters and shows estimated indexing times (abridged output):

Phase 1: Embedding Batch Size ... Optimal: 128
Phase 2: Embedding Concurrency ... Optimal: 2
Phase 3: Qdrant Batch Size ... Optimal: 384

Then add tuned values to your MCP config:

claude mcp add tea-rags -s user -- node /path/to/tea-rags-mcp/build/index.js \
-e QDRANT_URL=http://localhost:6333 \
-e EMBEDDING_BATCH_SIZE=256 \
-e EMBEDDING_CONCURRENCY=2 \
-e QDRANT_UPSERT_BATCH_SIZE=384

Embeddings-Only Benchmark

For GPU-specific optimization without Qdrant (embedding calibration only):

npm run benchmark-embeddings

This benchmark uses a three-phase plateau detection algorithm designed for production robustness.

Three-Phase Calibration Algorithm

Phase 1: Batch Plateau Detection (CONCURRENCY=1)

  • Tests batch sizes: [256, 512, 1024, 2048, 3072, 4096]
  • Uses adaptive chunk count: min(batch×2, MAX_TOTAL_CHUNKS) per batch
  • Stops when improvement < 3% (plateau detected) or timeout exceeded
  • Plateau timeout: calculated from previous throughput to detect degradation early
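The Phase 1 loop can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: `measureThroughput` stands in for a real benchmark run, and the throughput-derived timeout check is omitted for brevity.

```typescript
// Sketch of Phase 1: sweep batch sizes at CONCURRENCY=1, stop at plateau.
const BATCH_SIZES = [256, 512, 1024, 2048, 3072, 4096];
const MAX_TOTAL_CHUNKS = 8192;  // assumed cap on chunks per test run
const PLATEAU_THRESHOLD = 0.03; // stop when improvement < 3%

interface BatchResult { batch: number; throughput: number }

function detectBatchPlateau(
  measureThroughput: (batch: number, chunks: number) => number,
): BatchResult[] {
  const results: BatchResult[] = [];
  let prev = 0;
  for (const batch of BATCH_SIZES) {
    // Adaptive workload: min(batch × 2, MAX_TOTAL_CHUNKS) chunks per test
    const chunks = Math.min(batch * 2, MAX_TOTAL_CHUNKS);
    const throughput = measureThroughput(batch, chunks);
    results.push({ batch, throughput });
    const improvement = prev > 0 ? (throughput - prev) / prev : Infinity;
    if (improvement < PLATEAU_THRESHOLD) break; // plateau detected
    prev = throughput;
  }
  return results;
}
```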

Phase 2: Concurrency Testing (on plateau only)

  • Tests concurrency: [1, 2, 4] on plateau batches only
  • Plateau timeout: calculated from baseline (CONC=1) throughput
  • Stops testing higher concurrency if timeout exceeded (degradation)
  • Reuses CONC=1 results from Phase 1 (no redundant testing)

Phase 3: Robust Selection

  • Selects from configurations within 2% of maximum throughput
  • Prefers: Lower concurrency → Lower batch size → Higher throughput
  • Avoids overfitting to noise, reduces tail-risk
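The Phase 3 selection rule above can be expressed as a small comparator. A minimal sketch, assuming a hypothetical `Config` shape; the real benchmark's types and names may differ.

```typescript
// Sketch of Phase 3: among configs within 2% of the best throughput,
// prefer lower concurrency, then lower batch, then higher throughput.
interface Config { batch: number; concurrency: number; throughput: number }

const NOISE_BAND = 0.02; // results within 2% of max count as equivalent

function selectRobust(configs: Config[]): Config {
  const max = Math.max(...configs.map((c) => c.throughput));
  const acceptable = configs.filter((c) => c.throughput >= max * (1 - NOISE_BAND));
  acceptable.sort(
    (a, b) =>
      a.concurrency - b.concurrency || // 1. lower concurrency
      a.batch - b.batch ||             // 2. lower batch size
      b.throughput - a.throughput,     // 3. higher throughput
  );
  return acceptable[0];
}
```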

Key Principles

The algorithm follows engineering best practices:

  1. Adaptive Workload: Each batch size tests min(batch×2, MAX_TOTAL_CHUNKS) chunks
  2. Plateau Over Peak: Seeks stable performance range, not theoretical maximum
  3. Noise Tolerance: Differences < 2-3% considered measurement noise
  4. Robustness: Prefers simpler configs (lower concurrency/batch) when performance is equivalent

When to Use

Use npm run benchmark-embeddings when you:

  • Want to understand GPU/CPU characteristics without Qdrant
  • Are comparing different embedding models
  • Need to diagnose embedding bottlenecks
  • Want to see the full calibration process with detailed output

Use npm run tune (full benchmark) when you:

  • Need complete end-to-end optimization (embedding + Qdrant)
  • Want production-ready configuration file
  • Are setting up new deployment

Example Output (Remote GPU)

Phase 1: Batch Plateau Detection (CONCURRENCY=1)
512 chunks @ batch 256 143 chunks/s (3.6s) +100.0%
1024 chunks @ batch 512 (max 11.1s) 145 chunks/s (7.1s) +1.4%
→ Plateau detected, stopping
Plateau batches: [256, 512]

Phase 2: Concurrency Effect Test
BATCH=256
CONC=1 (from Phase 1) 143 chunks/s
CONC=2 (1024 chunks, max 11.1s) STABLE 152 chunks/s +6.3%
CONC=4 (2048 chunks, max 22.1s) STABLE 153 chunks/s +0.7%
→ Concurrency plateau, stopping
BATCH=512
CONC=1 (from Phase 1) 145 chunks/s
CONC=2 (2048 chunks, max 21.8s) STABLE 147 chunks/s +1.4%
→ Concurrency plateau, stopping

Phase 3: Configuration Selection
Acceptable configurations: 5/5
Max throughput: 153 chunks/s

Recommended configurations:
🏠 Local GPU: BATCH_SIZE=512 CONCURRENCY=2 (153 chunks/s)
🌐 Remote GPU: BATCH_SIZE=256 CONCURRENCY=4 (153 chunks/s)

Detected setup: 🌐 Remote
Selected: BATCH_SIZE=256, CONCURRENCY=4

Hardware: Remote AMD Radeon 7800M GPU via LAN, jina-embeddings-v2-base-code

Testing Different Models

# Test jina-embeddings-v2-base-code (768 dims) - default
npm run benchmark-embeddings

# Test different model
EMBEDDING_MODEL=mxbai-embed-large:latest npm run benchmark-embeddings

# Remote GPU
EMBEDDING_BASE_URL=http://192.168.1.100:11434 npm run benchmark-embeddings

Indexing Benchmarks

| Codebase | LoC | Local Setup | Remote GPU |
|---|---|---|---|
| Small CLI tool | 10K | ~2s | ~2s |
| Medium library | 50K | ~11s | ~9s |
| Large library | 100K | ~21s | ~17s |
| Enterprise app | 500K | ~2 min | ~1.5 min |
| Large codebase | 1M | ~3.5 min | ~3 min |
| VS Code | 3.5M | ~12 min | ~10 min |
| Kubernetes | 5M | ~18 min | ~15 min |
| Linux kernel | 10M | ~36 min | ~29 min |
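These times scale roughly linearly, so you can extrapolate. A back-of-envelope estimator fitted to the remote-GPU column (10M LoC in ~29 min works out to ~5,750 LoC/s); the constant is an assumption derived from this table, not a measurement of your hardware.

```typescript
// Rough indexing-time estimator. REMOTE_GPU_LOC_PER_SEC is fitted to the
// benchmark table above and will differ on other hardware.
const REMOTE_GPU_LOC_PER_SEC = 5750;

function estimateIndexingMinutes(
  loc: number,
  locPerSec: number = REMOTE_GPU_LOC_PER_SEC,
): number {
  return loc / locPerSec / 60;
}
```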

Deployment Topologies

🏠 Fully Local Setup

Everything runs on your machine — lowest latency, fully offline.

Best for: Powerful local GPU (M3 Pro or better), offline work, minimal setup.

⭐ Remote GPU + Local Qdrant

Embedding on a dedicated GPU server; Qdrant runs locally in Docker.

Best for: Most users — fast GPU embedding + fast local storage. Fastest overall: ~7m 39s for VS Code (3.5M LoC).

🌐 Full Remote Setup

Both Qdrant and Ollama on dedicated server.

Best for: Cannot run Docker locally, indexing from multiple thin clients. Trade-off: Network latency reduces storage rate (~4x slower than local).

Performance Comparison

| Metric | 🏠 Fully Local | ⭐ Remote GPU + Local Qdrant | 🌐 Full Remote |
|---|---|---|---|
| Optimal batch | 512 | 256 | 256 |
| Optimal concurrency | 1 | 6 | 4 |
| Embedding rate | 87 ch/s | 156 ch/s | 154 ch/s |
| Storage rate | 6273 ch/s | 6966 ch/s | 1810 ch/s |
| VS Code (3.5M LoC) | 13m 36s | 7m 39s | 8m 13s |

Why Remote GPU + Local Qdrant is Fastest

Network latency crushes remote Qdrant storage: 6966 ch/s → 1810 ch/s (3.8x drop). Always run Qdrant locally if possible. Embedding benefits from dedicated GPU (1.8x faster than M3 Pro), and concurrency hides network latency.
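The latency-hiding effect can be sketched as a small worker pool: with concurrency > 1, several batch requests are in flight at once, so network round-trips overlap instead of adding up. `embedAll` and `embedBatch` are illustrative names, not the project's API.

```typescript
// Run embedding requests with a fixed number of in-flight batches.
async function embedAll<T>(
  batches: T[][],
  concurrency: number,
  embedBatch: (batch: T[]) => Promise<number[][]>,
): Promise<number[][]> {
  const out: number[][][] = new Array(batches.length);
  let next = 0;
  // Each worker claims the next unprocessed batch until none remain.
  async function worker(): Promise<void> {
    while (next < batches.length) {
      const i = next++;
      out[i] = await embedBatch(batches[i]);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return out.flat(); // results stay in input order
}
```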

Key Performance Insights

Batch Size

Rule of thumb:

  • Local GPU: Use large batches (512) + CONCURRENCY=1 to minimize per-batch overhead
  • Remote GPU: Use smaller batches (256) + CONCURRENCY=4-6 to hide network latency

# Local GPU — GPU-bound, concurrency adds overhead
export EMBEDDING_BATCH_SIZE=512
export EMBEDDING_CONCURRENCY=1

# Remote GPU — network latency hidden by parallel requests
export EMBEDDING_BATCH_SIZE=256
export EMBEDDING_CONCURRENCY=4

Concurrency

Critical insight: Concurrency is only beneficial for remote GPU. Local GPU sees no improvement — it adds overhead without benefit.

| Setup | Concurrency | Why |
|---|---|---|
| Local GPU | 1 | GPU-bound; parallel requests add overhead |
| Remote GPU | 4-6 | Hides network latency with overlapping I/O |

Qdrant Ordering

| Mode | Use case | Performance |
|---|---|---|
| weak | Local Qdrant, single indexer | Fastest |
| medium | Balanced | Default |
| strong | Remote Qdrant, multiple indexers | Safest, highest consistency |

# Local setup
export QDRANT_BATCH_ORDERING=weak

# Remote GPU + Local Qdrant
export QDRANT_BATCH_ORDERING=strong

# Full remote
export QDRANT_BATCH_ORDERING=weak

Bottleneck Analysis

Embedding is always the bottleneck:

  • Embedding: 87-156 ch/s
  • Storage: 1810-6966 ch/s

Even at 156 ch/s (remote GPU), embedding is roughly 45x slower than storage (6966 ch/s). Invest in GPU, not Qdrant infrastructure.
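The arithmetic is simple: a pipeline runs at the rate of its slowest stage, so speeding up embedding pays off until it catches up with storage. A quick check with the measured rates from the comparison table:

```typescript
// A pipeline's effective throughput is that of its slowest stage.
function pipelineRate(embedRate: number, storageRate: number): number {
  return Math.min(embedRate, storageRate);
}

// Remote GPU + local Qdrant numbers from the comparison table:
const effective = pipelineRate(156, 6966); // embedding-bound: 156 chunks/s
const ratio = 6966 / 156;                  // storage is ~45x faster
```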

Tuning Guidelines

For Large Codebases (500K+ LOC)

  • ✅ Run npm run tune to find optimal batch sizes
  • ✅ Increase MAX_IO_CONCURRENCY=100 for SSD
  • ✅ Increase Qdrant memory limits in docker-compose.yml
  • ✅ Use .contextignore to exclude node_modules, build artifacts
  • ✅ Use filters: fileTypes, pathPattern to narrow scope
  • ✅ Limit results to needed count
  • ✅ Enable hybrid search for technical queries
  • ✅ Verify Qdrant has payload indexes

For Memory Issues

  • ⚠️ Reduce CODE_CHUNK_SIZE (default 2500)
  • ⚠️ Reduce QDRANT_UPSERT_BATCH_SIZE
  • ⚠️ Increase Qdrant memory in docker-compose.yml
  • ⚠️ Index subdirectories separately

Monitoring & Debug

Enable detailed timing logs:

export DEBUG=1

Logs are written to ~/.tea-rags-mcp/logs/.

Check index status:

/mcp__qdrant__get_index_status /path/to/project

Returns: status, chunk count, last update time, collection statistics.

Hardware Recommendations

Minimum (Development)

  • 4GB RAM, SSD, CPU embedding (slow but works)

Recommended

  • 8GB RAM, GPU with 8GB+ VRAM, SSD, Dedicated Qdrant

Enterprise (Large Codebases)

  • 16GB+ RAM, GPU with 12GB+ VRAM, NVMe SSD, Clustered Qdrant

Pre-configured Settings by Setup

🏠 Local Setup (MacBook M3 Pro)

EMBEDDING_BATCH_SIZE=512
EMBEDDING_CONCURRENCY=1
QDRANT_UPSERT_BATCH_SIZE=192
QDRANT_BATCH_ORDERING=weak
QDRANT_FLUSH_INTERVAL_MS=100
QDRANT_DELETE_BATCH_SIZE=1500
QDRANT_DELETE_CONCURRENCY=12
# Embedding: 87 ch/s, Storage: 6273 ch/s

Why these values:

  • BATCH_SIZE=512 + CONCURRENCY=1: GPU-bound workload, concurrency adds overhead
  • QDRANT_BATCH_ORDERING=weak: Local Qdrant doesn't need strong ordering
  • FLUSH_INTERVAL_MS=100: Fast flushes for local SSD

⭐ Remote GPU + Local Qdrant (AMD 7800M)

EMBEDDING_BATCH_SIZE=256
EMBEDDING_CONCURRENCY=6
QDRANT_UPSERT_BATCH_SIZE=512
QDRANT_BATCH_ORDERING=strong
QDRANT_FLUSH_INTERVAL_MS=250
QDRANT_DELETE_BATCH_SIZE=500
QDRANT_DELETE_CONCURRENCY=16
# Embedding: 156 ch/s, Storage: 6966 ch/s

Why these values:

  • BATCH_SIZE=256 + CONCURRENCY=6: Network latency hidden by concurrent requests
  • QDRANT_BATCH_ORDERING=strong: Higher ordering ensures data consistency
  • Higher storage rate (6966 ch/s) because Qdrant is local

🌐 Full Remote Setup

EMBEDDING_BATCH_SIZE=256
EMBEDDING_CONCURRENCY=4
QDRANT_UPSERT_BATCH_SIZE=256
QDRANT_BATCH_ORDERING=weak
QDRANT_FLUSH_INTERVAL_MS=500
QDRANT_DELETE_BATCH_SIZE=1000
QDRANT_DELETE_CONCURRENCY=12
# Embedding: 154 ch/s, Storage: 1810 ch/s

Why these values:

  • BATCH_SIZE=256 + CONCURRENCY=4: Balance between hiding latency and avoiding queue buildup
  • QDRANT_BATCH_ORDERING=weak: Reduce round-trips over network
  • FLUSH_INTERVAL_MS=500: Larger flush windows amortize network latency
  • Storage bottlenecked by network (1810 ch/s vs 6966 for local)