Local Exposed GPU Setup

Use a dedicated machine with a powerful GPU in your local network as an embedding server for TeaRAGs. This setup delivers the best performance — fast GPU embedding + fast local Qdrant storage.

Why This Setup?

Best of both worlds:

  • Dedicated GPU for fast embedding (1.5-2x faster than M3 Pro)
  • Local Qdrant on your development machine (microsecond latency)
  • Ollama accessible from multiple machines in your network
  • No cloud costs, fully local and private

Recommended topology: run Ollama on the GPU server; keep Qdrant and TeaRAGs on your development machine.

GPU Server Setup

1. Choose Your GPU Server

Any machine with a dedicated GPU:

  • Desktop PC with NVIDIA/AMD GPU
  • Laptop with discrete GPU
  • External GPU (eGPU) enclosure
  • Mac Studio / Mac Mini with M-series chip
  • Used gaming PC or workstation

Minimum specs:

  • 8GB+ VRAM
  • Gigabit LAN connection
  • 16GB+ RAM

Recommended:

  • NVIDIA RTX 3060/4060 (12GB VRAM) or better
  • AMD RX 6800/7800 (12GB+ VRAM)
  • Apple M-series (16GB+ unified memory)

2. Install GPU Drivers

NVIDIA (CUDA)

Linux
# Ubuntu/Debian
sudo apt update
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit

# Verify installation
nvidia-smi
Windows
# Download drivers from https://www.nvidia.com/drivers
# Run the installer, then verify in PowerShell:
nvidia-smi

AMD (ROCm)

Linux
# Ubuntu 22.04
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_*.deb
sudo apt install ./amdgpu-install_*.deb
sudo amdgpu-install --usecase=rocm

# Verify installation
rocm-smi
Windows

ROCm works on Windows only with AMD Radeon PRO drivers (blue logo), not Adrenalin (gaming) drivers:

  • Download AMD Radeon PRO Software
  • Supports only RDNA2 (RX 6000) and RDNA3 (RX 7000) architectures
  • RDNA1 (RX 5000 series) and older cards are not supported on Windows
  • Alternative: Use Docker with Linux container + ROCm

Supported GPU architectures:

  • ✅ RDNA3 (RX 7900/7800/7700/7600) — best support
  • ✅ RDNA2 (RX 6900/6800/6700/6600) — good support
  • ⚠️ RDNA1 (RX 5000 series) — Linux only, limited support
  • ❌ GCN (RX Vega, RX 500 and older) — not recommended

Intel Arc

Linux
# Ubuntu 22.04
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | sudo gpg --dearmor -o /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt update
sudo apt install intel-opencl-icd intel-level-zero-gpu level-zero

# Verify installation
clinfo
Windows
# Download the latest Intel Arc graphics driver from
# Intel's download center and run the installer

Driver Compatibility

Ensure GPU drivers are compatible with your OS version. Mismatched drivers can cause crashes or poor performance. Check manufacturer documentation for your specific GPU model.

3. Install Qdrant (Optional)

If you want to run both Qdrant and Ollama on the GPU server:

# Docker (recommended)
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  --memory=4g \
  qdrant/qdrant:latest
Local vs Remote Qdrant

Recommended: Run Qdrant locally on your development machine for best storage performance (6966 ch/s vs 1810 ch/s). Only run Qdrant on GPU server if you can't use Docker on your development machine.

4. Install Ollama

Option 1: Native Install

Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
# Download from https://ollama.com/download
# Run installer
macOS (Mac Studio / Mac Mini)
brew install ollama

Option 2: Docker with GPU

Linux + NVIDIA
# Install NVIDIA Container Toolkit first
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run Ollama with GPU
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama_models:/root/.ollama \
  ollama/ollama:latest
Linux + AMD ROCm
docker run -d \
  --name ollama \
  --device /dev/kfd \
  --device /dev/dri \
  -p 11434:11434 \
  -v ollama_models:/root/.ollama \
  ollama/ollama:rocm

5. Configure Network Access

Enable Ollama Network Access

Native Ollama:

Create or edit Ollama service configuration:

Linux (systemd)
# Create override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama
macOS (launchd)
# Set environment variable
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"

# Restart Ollama app
# Or via command line:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
Windows
# Set environment variable (system-wide)
[System.Environment]::SetEnvironmentVariable('OLLAMA_HOST', '0.0.0.0:11434', 'Machine')

# Restart Ollama service or app
Restart-Service Ollama
Docker
# Already exposed on 0.0.0.0:11434 by default
# No additional configuration needed

Open Firewall Ports

Linux (ufw)
# Ollama
sudo ufw allow 11434/tcp

# Qdrant (if running on GPU server)
sudo ufw allow 6333/tcp

# Check status
sudo ufw status
Windows Firewall
# Ollama
New-NetFirewallRule -DisplayName "Ollama" -Direction Inbound -Protocol TCP -LocalPort 11434 -Action Allow

# Qdrant (if running on GPU server)
New-NetFirewallRule -DisplayName "Qdrant" -Direction Inbound -Protocol TCP -LocalPort 6333 -Action Allow
macOS
# macOS firewall allows local network by default
# If enabled, add Ollama to allowed apps in System Settings → Network → Firewall
Firewall Configuration

Firewall rules vary by OS and distribution. Search for "open port [YOUR_OS]" if commands above don't work for your system. Common tools: ufw (Ubuntu), firewalld (RHEL/Fedora), Windows Defender Firewall, macOS System Settings.

6. Set a Static IP

Assign a static IP to your GPU server to avoid connection issues when the IP changes.

Option 1: Router DHCP Reservation

  1. Log into your router admin panel (usually 192.168.1.1 or 192.168.0.1)
  2. Find DHCP Reservation or Static DHCP settings
  3. Add reservation:
    • MAC Address: Your GPU server's network interface MAC
    • IP Address: e.g., 192.168.1.100
  4. Save and reboot GPU server

How to find MAC address:

Linux
ip link show
# Look for "link/ether XX:XX:XX:XX:XX:XX"
Windows
ipconfig /all
# Look for "Physical Address"
macOS
ifconfig en0 | grep ether

Option 2: Static IP on Server

Linux (netplan)
# /etc/netplan/01-network.yaml
network:
  version: 2
  ethernets:
    eth0:  # or your interface name
      dhcp4: no
      addresses:
        - 192.168.1.100/24
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]

Apply:

sudo netplan apply
Windows
  • Control Panel → Network → Change adapter settings
  • Right-click network adapter → Properties
  • IPv4 → Properties → Use the following IP address
  • Set IP: 192.168.1.100, Subnet: 255.255.255.0, Gateway: 192.168.1.1
macOS
  • System Settings → Network → Ethernet/Wi-Fi → Details
  • TCP/IP → Configure IPv4: Manually
  • Set IP: 192.168.1.100, Subnet Mask: 255.255.255.0, Router: 192.168.1.1
Router vs Server Static IP

Prefer router DHCP reservation — easier to manage, survives OS reinstalls, centralized configuration. Use server-side static IP only if you can't access router settings.

7. Pull Embedding Models

# Default code-specialized model (recommended)
ollama pull unclemusclez/jina-embeddings-v2-base-code:latest

# Alternative models
ollama pull nomic-embed-text:latest
ollama pull mxbai-embed-large:latest

8. Verify Setup

Check Ollama from another machine:

# From your development machine
curl http://192.168.1.100:11434/api/version

# Test embedding
curl http://192.168.1.100:11434/api/embeddings -d '{
"model": "unclemusclez/jina-embeddings-v2-base-code:latest",
"prompt": "test"
}'
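If you prefer a scripted check, the same embedding call can be made from Python. A minimal stdlib-only sketch (the server IP and model name are the examples used above; adjust them for your network):

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.100:11434"  # your GPU server's address
MODEL = "unclemusclez/jina-embeddings-v2-base-code:latest"

def extract_embedding(response: dict) -> list:
    """Pull the vector out of an /api/embeddings response, failing loudly."""
    vector = response.get("embedding")
    if not vector:
        raise ValueError(f"no embedding in response, got keys: {list(response)}")
    return vector

def embed(prompt: str) -> list:
    """POST a prompt to Ollama's /api/embeddings endpoint, return the vector."""
    payload = json.dumps({"model": MODEL, "prompt": prompt}).encode()
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return extract_embedding(json.load(resp))

# Example (requires a reachable server):
#   vec = embed("test")
#   print(len(vec))  # the model's embedding dimension
```

A non-empty vector of the model's expected dimension confirms the GPU server is embedding end to end.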

Check Qdrant (if running on GPU server):

curl http://192.168.1.100:6333/healthz
# Should return: "healthy"

Development Machine Setup

Configure TeaRAGs

On your development machine, point TeaRAGs to the GPU server:

claude mcp add tea-rags -s user \
  -e EMBEDDING_BASE_URL=http://192.168.1.100:11434 \
  -e EMBEDDING_CONCURRENCY=4 \
  -- node /path/to/tea-rags-mcp/build/index.js

If Qdrant also runs on GPU server:

claude mcp add tea-rags -s user \
  -e QDRANT_URL=http://192.168.1.100:6333 \
  -e EMBEDDING_BASE_URL=http://192.168.1.100:11434 \
  -e EMBEDDING_CONCURRENCY=4 \
  -- node /path/to/tea-rags-mcp/build/index.js

For best storage performance, run Qdrant locally on your development machine:

docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant:latest

This gives you:

  • Fast embedding: Remote GPU (154-156 ch/s)
  • Fast storage: Local Qdrant (6966 ch/s)
  • Best overall performance: ~7m 39s for VS Code (3.5M LoC)

Performance Tuning

Auto-Tune for Remote GPU

Run the tuning benchmark pointing to your GPU server:

EMBEDDING_BASE_URL=http://192.168.1.100:11434 npm run tune

Expected optimal settings:

EMBEDDING_BATCH_SIZE=256
EMBEDDING_CONCURRENCY=4-6
QDRANT_UPSERT_BATCH_SIZE=512
QDRANT_BATCH_ORDERING=strong

See Performance Tuning for detailed benchmarks and topology comparison.

Troubleshooting

Cannot Connect to GPU Server

Check network connectivity:

ping 192.168.1.100

Check Ollama is listening on 0.0.0.0:

# On GPU server
sudo ss -tulpn | grep 11434   # or: sudo netstat -tulpn | grep 11434
# Should show 0.0.0.0:11434, NOT 127.0.0.1:11434

Check firewall:

# Linux
sudo ufw status

# Test from development machine
telnet 192.168.1.100 11434
# OR
nc -zv 192.168.1.100 11434
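The checks above can be combined into one portable probe that works on any OS with Python installed. A small stdlib-only sketch (the IP and ports are the defaults used throughout this guide):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_services(host: str) -> dict:
    """Probe the ports TeaRAGs needs on the GPU server."""
    return {
        "ollama (11434)": port_open(host, 11434),
        "qdrant (6333)": port_open(host, 6333),
    }

# Example:
#   print(check_services("192.168.1.100"))
```

A `False` for a service that is running points at a firewall rule or a bind to 127.0.0.1 rather than 0.0.0.0.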

Slow Embedding Performance

Verify GPU is being used:

NVIDIA:

# On GPU server
nvidia-smi
# Should show ollama process using GPU

AMD:

rocm-smi

Intel:

clinfo

If GPU not used:

  • Check drivers are installed correctly
  • Restart Ollama after driver installation
  • For Docker: verify --gpus all flag (NVIDIA) or --device /dev/kfd --device /dev/dri (AMD)

IP Address Changed

If GPU server IP changes after router reboot:

  1. Check current IP: ip addr (Linux) or ipconfig (Windows)
  2. Update TeaRAGs configuration with new IP
  3. Permanent fix: Set static IP via router DHCP reservation (see above)

Connection Drops During Indexing

Increase timeout:

claude mcp add tea-rags -s user \
  -e EMBEDDING_BASE_URL=http://192.168.1.100:11434 \
  -e HTTP_REQUEST_TIMEOUT_MS=600000 \
  -- node /path/to/tea-rags-mcp/build/index.js

Check network stability:

  • Use wired Ethernet instead of Wi-Fi
  • Check router logs for connection drops
  • Disable Wi-Fi power saving on GPU server

Security Considerations

Local Network Only

Do NOT expose Ollama to the internet — it has no authentication by default.

Safe: 0.0.0.0:11434 (listens on all interfaces, accessible in LAN)
Unsafe: Port forwarding 11434 to the internet (⚠️ security risk)

If you need remote access from outside your LAN:

  • Use VPN (WireGuard, Tailscale, OpenVPN)
  • Use SSH tunnel: ssh -L 11434:localhost:11434 user@gpu-server

Firewall Best Practices

Allow only local network:

Linux (ufw):

# Allow from local network only (example: 192.168.1.0/24)
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp

Windows Firewall:

  • Advanced Settings → Inbound Rules → Ollama
  • Scope → Remote IP addresses → Add 192.168.1.0/24

Multi-User Setup

Multiple developers can share the same GPU server:

GPU Server: One instance of Ollama
Each Developer: Runs own Qdrant locally, points to shared Ollama

Benefits:

  • Cost-effective — one GPU serves entire team
  • Consistent performance across team
  • Centralized model management

Configuration (same on all dev machines):

claude mcp add tea-rags -s user \
  -e EMBEDDING_BASE_URL=http://192.168.1.100:11434 \
  -e EMBEDDING_CONCURRENCY=4 \
  -- node /path/to/tea-rags-mcp/build/index.js
Shared GPU Performance

Ollama handles concurrent requests well. 4-6 developers can share a single GPU server without significant slowdown. Monitor GPU usage with nvidia-smi to check load.
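To watch load programmatically during a team indexing run, nvidia-smi's CSV query mode is easy to poll from a script. A sketch for NVIDIA servers (assumes nvidia-smi is on PATH; rocm-smi offers similar queries on AMD):

```python
import subprocess

def parse_utilization(csv_output: str) -> list:
    """Parse `--format=csv,noheader,nounits` output: one integer per GPU line."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]

def gpu_utilization() -> list:
    """Return current GPU utilization (%) for each GPU on this machine."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_utilization(out)

# Example (on the GPU server):
#   print(gpu_utilization())
```

Sustained utilization near 100% during concurrent indexing runs is the signal that the shared GPU, not the network, is the bottleneck.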

Next Steps