ONNX (Built-in)

Local embedding provider using ONNX Runtime via @huggingface/transformers. Zero external dependencies — no services to install, no API keys.

| | |
|---|---|
| Type | Local |
| Price | 🟢 Free |
| Scale | ~700k LoC* |
| Default model | `jinaai/jina-embeddings-v2-base-code-fp16` |
| Dimensions | 768 |
| URL | — (built-in, no external service) |

* Estimated lines of code for initial full indexing within 45 minutes. Benchmarked on Apple M3 Pro with WebGPU — actual throughput depends on your hardware (GPU, memory bandwidth, device type).

Key Features

  • Zero config — just set EMBEDDING_PROVIDER=onnx and go
  • No external services — runs inside the Node.js process via a daemon
  • WebGPU acceleration — auto-detects Metal (macOS), D3D12 (Windows), Vulkan (Linux)
  • CPU fallback — works everywhere, even without a GPU
  • HuggingFace models — any ONNX-compatible model from HuggingFace Hub
  • Persistent daemon — model loads once, stays warm across indexing runs
  • Adaptive GPU batching — calibration probe auto-detects optimal batch size at startup

Setup

No installation needed. ONNX provider is bundled with TeaRAGs.

```shell
# That's it — just set the provider
export EMBEDDING_PROVIDER=onnx
```

The first run downloads the model (~260 MB) to a local cache. Subsequent runs start instantly.

Configuration

```json
{
  "mcpServers": {
    "tea-rags": {
      "command": "node",
      "args": ["/path/to/tea-rags/build/index.js"],
      "env": {
        "EMBEDDING_PROVIDER": "onnx"
      }
    }
  }
}
```
:::tip

`QDRANT_URL` is not needed — Qdrant is built-in and starts automatically. Add it only if you use an external Qdrant instance.

:::

Optional variables:

| Variable | Description | Default |
|---|---|---|
| `EMBEDDING_MODEL` | HuggingFace model ID | `jinaai/jina-embeddings-v2-base-code-fp16` |
| `EMBEDDING_DIMENSIONS` | Vector dimensions | 768 (auto-detected) |
| `EMBEDDING_TUNE_BATCH_SIZE` | Texts per embedding batch | Auto-calibrated |
| `EMBEDDING_DEVICE` | Compute device: `auto`, `cpu`, `webgpu`, `cuda`, `dml` | `auto` |
| `HF_TOKEN` | HuggingFace access token (for gated/private models) | — |
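For example, the optional variables can be combined like this — the values here are illustrative, not recommendations; pick a model and batch size that suit your hardware:

```shell
# Hypothetical example: a lightweight model, forced to CPU, fixed batch size
export EMBEDDING_PROVIDER=onnx
export EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2   # 384-dim model from the table below
export EMBEDDING_DIMENSIONS=384                  # matches the model's output size
export EMBEDDING_DEVICE=cpu                      # skip GPU detection entirely
export EMBEDDING_TUNE_BATCH_SIZE=16              # bypass auto-calibration
```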

Available Models

| Model | Dimensions | Notes |
|---|---|---|
| `jinaai/jina-embeddings-v2-base-code` | 768 | Default. Code-optimized, 30+ programming languages |
| `nomic-ai/nomic-embed-text-v1.5` | 768 | General purpose, strong quality |
| `Xenova/all-MiniLM-L6-v2` | 384 | Lightweight, fast, good for experiments |
| `Xenova/bge-base-en-v1.5` | 768 | English-focused, MTEB top-ranked |
| `Xenova/multilingual-e5-base` | 768 | 100+ languages |
| `BAAI/bge-small-en-v1.5` | 384 | Smallest footprint, fast inference |

Any ONNX-compatible model from HuggingFace Hub can be used by setting EMBEDDING_MODEL to the repository ID. Models with onnx in the library tag work out of the box.

FP16 quantization

Append -fp16 to the model ID to use FP16-quantized weights (smaller download, faster on GPU):

```shell
EMBEDDING_MODEL=jinaai/jina-embeddings-v2-base-code-fp16
```

Device Options

| Device | Backend | Platform | When to use |
|---|---|---|---|
| `auto` | Best available, with CPU fallback | All | Default, recommended |
| `webgpu` | Metal / D3D12 / Vulkan | macOS, Windows, Linux | Force WebGPU acceleration |
| `cuda` | NVIDIA CUDA | Linux x64 | NVIDIA GPUs |
| `dml` | DirectML | Windows x64/arm64 | Any GPU (NVIDIA, AMD, Intel) |
| `cpu` | CPU only | All | No GPU available or Docker |
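Under `auto`, selection can be pictured as walking a preference list and falling back to CPU. A minimal sketch of that idea — the ordering is assumed from the table above, and `pickDevice` is not TeaRAGs' actual detection code:

```typescript
// Sketch of "auto" device selection with CPU fallback (illustrative only).
type Device = "webgpu" | "cuda" | "dml" | "cpu";

function pickDevice(available: Set<Device>): Device {
  // Preference order mirrors the table above; cpu always works.
  const order: Device[] = ["webgpu", "cuda", "dml", "cpu"];
  return order.find((d) => d === "cpu" || available.has(d))!;
}

console.log(pickDevice(new Set<Device>(["cuda"]))); // a Linux box with an NVIDIA GPU
console.log(pickDevice(new Set<Device>([])));       // no GPU detected: falls back to cpu
```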

Private & Gated Models

Some HuggingFace models require authentication (gated models like Llama, or private repos). To use them:

  1. Create an access token at huggingface.co/settings/tokens
  2. Add HF_TOKEN to your MCP config:
```json
{
  "mcpServers": {
    "tea-rags": {
      "command": "node",
      "args": ["/path/to/tea-rags/build/index.js"],
      "env": {
        "EMBEDDING_PROVIDER": "onnx",
        "EMBEDDING_MODEL": "your-org/private-model",
        "HF_TOKEN": "hf_..."
      }
    }
  }
}
```

Tuning Notes

Batch size is auto-calibrated on first startup. The daemon runs a GPU calibration probe that tests batch sizes [1, 4, 8, 16, 32, 64, 128] and picks the optimal one for your hardware. The result is cached in ~/.tea-rags/onnx-calibration.json — subsequent startups use the cached value instantly. Override with EMBEDDING_TUNE_BATCH_SIZE if needed.
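The probe's selection criterion can be sketched as minimizing amortized per-text latency across the candidate sizes. Everything below — the `measure` stand-in for timing a real embedding batch and the synthetic cost model — is illustrative, not the daemon's actual code:

```typescript
// Calibration sketch: try each candidate batch size, keep the one with the
// lowest per-text latency. `measure(n)` stands in for embedding a batch of
// n texts and returning the wall-clock time in milliseconds.
function calibrateBatchSize(measure: (batch: number) => number): number {
  const candidates = [1, 4, 8, 16, 32, 64, 128];
  let best = candidates[0];
  let bestPerText = Infinity;
  for (const batch of candidates) {
    const perText = measure(batch) / batch; // amortized latency per text
    if (perText < bestPerText) {
      bestPerText = perText;
      best = batch;
    }
  }
  return best;
}

// Synthetic timing model: fixed launch overhead plus linear cost, with a
// penalty once the batch no longer fits comfortably on the GPU.
const fakeMeasure = (batch: number) => 20 + batch * 2 + (batch > 32 ? batch * 5 : 0);
console.log(calibrateBatchSize(fakeMeasure)); // picks 32 under this model
```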

Concurrency (INGEST_PIPELINE_CONCURRENCY) should stay at 1. The ONNX daemon processes requests sequentially on a single model instance. Higher concurrency adds queue overhead without improving throughput.
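The reason is easiest to see with a minimal serializing queue — a sketch of the general pattern, not the daemon's implementation. Tasks chain behind a single promise, so submitting them concurrently only lengthens the queue:

```typescript
// Minimal promise queue: all work funnels through one chain, so a single
// model instance handles requests strictly one at a time.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

// Even if callers fire tasks "concurrently", they execute in order.
const q = new SerialQueue();
const order: number[] = [];
const done = Promise.all(
  [1, 2, 3].map((i) => q.run(async () => { order.push(i); })),
);
done.then(() => console.log(order)); // order is [1, 2, 3] once all tasks ran
```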

Runtime adaptation — the daemon monitors per-text inference latency and dynamically adjusts the internal GPU batch size: halves on pressure spikes, doubles when stable. This handles thermal throttling and competing GPU workloads automatically.
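That halve-on-spike, double-when-stable behavior can be sketched as a tiny controller. The thresholds here (2× baseline latency, three calm batches before growing) are made up for illustration; the daemon's real heuristics are not documented here:

```typescript
// Adaptive batch controller sketch: shrink fast under latency pressure,
// grow slowly once latency stays near the baseline (assumed thresholds).
class AdaptiveBatch {
  constructor(
    private size: number,
    private readonly baselineMs: number,
    private readonly max = 128,
    private stableRuns = 0,
  ) {}

  get current(): number {
    return this.size;
  }

  record(latencyMs: number): void {
    if (latencyMs > this.baselineMs * 2) {
      // Pressure spike (thermal throttling, competing GPU work): halve.
      this.size = Math.max(1, Math.floor(this.size / 2));
      this.stableRuns = 0;
    } else if (++this.stableRuns >= 3) {
      // Three calm batches in a row: try doubling again.
      this.size = Math.min(this.max, this.size * 2);
      this.stableRuns = 0;
    }
  }
}

const b = new AdaptiveBatch(32, 10);
b.record(50); // spike: 50ms > 2 × 10ms baseline, batch halves to 16
[12, 9, 11].forEach((ms) => b.record(ms)); // three stable batches: doubles back
console.log(b.current); // 32
```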

When to Use

  • Small-to-medium projects (up to ~700k LoC for comfortable indexing speed)
  • Air-gapped environments with no internet access (after initial model download)
  • Quick experiments — no setup overhead
  • CI/CD pipelines where installing Ollama is impractical