Embedding Providers

TeaRAGs supports five embedding providers — from zero-config local inference to high-throughput cloud APIs. Choose based on your codebase size, privacy requirements, and budget.

Provider Comparison

Provider	Type	Price	Scale*	Key Feature
ONNX	Local	🟢 Free	~700k LoC	Zero-config, built-in runtime, adaptive GPU batching
Ollama	Local	🟢 Free	~8M+ LoC (depends on hardware)	GPU acceleration, 100+ models
OpenAI	Cloud	🟡 Pay-per-use ($0.02/1M tokens)	~800k–8M LoC (depends on API tier)	Highest quality, easy setup
Cohere	Cloud	🟡 Pay-per-use ($0.10/1M tokens)	~1M LoC	Multilingual support
Voyage	Cloud	🟡 Pay-per-use ($0.12/1M tokens)	~2.4M LoC	Code-specialized models

* Estimated lines of code for initial full indexing within 45 minutes. Benchmarked on Apple M3 Pro with WebGPU — actual throughput depends on your hardware. Incremental reindexing is fast on any provider — typically only 1–5% of files change between runs.

How to Choose

Want zero setup? Start with ONNX — no external services, no API keys, works out of the box. Best for small-to-medium projects.

Have a GPU? Use Ollama — free, private, and handles millions of lines of code. The default choice for serious local development.

Need cloud scale or quality? Pick OpenAI for the best embedding quality and familiar API. Consider Voyage if your codebase is code-heavy — their models are trained specifically on source code. Choose Cohere if you need multilingual embeddings.

Privacy matters? ONNX and Ollama keep everything local. No data leaves your machine.

Common Configuration

All providers share these tuning variables:

Variable	Description	Default
`EMBEDDING_PROVIDER`	Provider name: `onnx`, `ollama`, `openai`, `cohere`, `voyage`	`ollama`
`EMBEDDING_MODEL`	Model name (provider-specific)	Provider default
`EMBEDDING_DIMENSIONS`	Vector dimensions (auto-detected from model)	Auto
`EMBEDDING_TUNE_BATCH_SIZE`	Texts per embedding batch	Provider-specific (see below)
`EMBEDDING_TUNE_RETRY_ATTEMPTS`	Retry count on failure	`3`
`EMBEDDING_TUNE_RETRY_DELAY_MS`	Initial retry delay (exponential backoff)	`1000`

Default Batch Sizes

EMBEDDING_TUNE_BATCH_SIZE is automatically set per provider — you don't need to configure it unless you want to override. Defaults are optimized based on API limits and throughput characteristics:

Provider	Default Batch Size	Rationale
ONNX	Auto-calibrated	GPU probe sets optimal batch size at startup
Ollama	1024	GPU-optimized, native batch API
OpenAI	2048	Max texts per API request
Cohere	96	API limit: 96 texts per request
Voyage	128	Balanced for 120k token/request limit

Override with EMBEDDING_TUNE_BATCH_SIZE if needed.

Pipeline Concurrency

INGEST_PIPELINE_CONCURRENCY controls pipeline worker concurrency (default: 1). The pipeline already handles parallelism via batch accumulation, and increasing concurrency adds complexity without improving throughput for most providers. Leave at 1 unless you have a specific reason to change it.

See individual provider pages for provider-specific variables and setup instructions.

Provider Comparison​

How to Choose​

Common Configuration​

Default Batch Sizes​

Provider Comparison

How to Choose

Common Configuration

Default Batch Sizes