@chindris-mihai-alexandru

Summary

Add support for running DeepResearch 100% locally using llama.cpp with Metal (Apple Silicon) or CUDA acceleration. Zero API costs, full privacy.

Why This?

The main inference path requires vLLM with 8x A100 GPUs. This PR adds an alternative for:

  • Mac users (M1/M2/M3/M4 with Metal acceleration)
  • Local/privacy-focused users
  • Developers who want to experiment without GPU server access
  • Anyone who wants free, offline research capabilities

New Files

  • inference/interactive_llamacpp.py: ReAct agent CLI that connects to the llama.cpp server (a minimal sketch of the client loop follows this list)
  • scripts/start_llama_server.sh: Server startup script with optimized Metal settings
  • requirements-local.txt: Minimal dependencies (requests, duckduckgo-search, python-dotenv)
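
For orientation, here is a minimal sketch of how a ReAct-style CLI can talk to the llama.cpp server through its OpenAI-compatible /v1/chat/completions endpoint. The port, model alias, payload, and function names are illustrative assumptions, not the actual contents of interactive_llamacpp.py:

# Minimal sketch of one chat turn against llama.cpp's OpenAI-compatible API.
# Assumes the server started by start_llama_server.sh listens on localhost:8080.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"

def chat(messages, temperature=0.6, max_tokens=1024):
    """Send one chat completion request and return the assistant's text."""
    resp = requests.post(
        LLAMA_SERVER,
        json={
            "model": "deepresearch",  # should match the server's --alias, if set
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Summarize the ReAct pattern in one sentence."}]))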

Features

  • Free web search: Uses DuckDuckGo (no API key required)
  • Page visiting: Uses Jina Reader (optional API key for better results)
  • Loop detection: Prevents infinite tool call cycles (3 consecutive errors → force answer); see the sketch after this list
  • 32K context: Supports long research sessions (the default was later lowered to 16K to save RAM; see Follow-up Changes below)
  • Rate limit handling: Exponential backoff retry for DuckDuckGo
  • URL validation: Validates URLs before attempting to visit
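
The loop-detection and rate-limit guardrails above can be illustrated roughly as follows, assuming the duckduckgo-search package named in requirements-local.txt. The function names, retry counts, and delays are assumptions for illustration, not the PR's actual code:

# Illustrative sketch: exponential-backoff retries for DuckDuckGo rate limits,
# plus a counter that forces a final answer after three consecutive tool errors.
import time

from duckduckgo_search import DDGS
from duckduckgo_search.exceptions import DuckDuckGoSearchException

def search_with_backoff(query, retries=4, base_delay=2.0):
    """Retry a DuckDuckGo text search with exponentially growing delays."""
    for attempt in range(retries):
        try:
            with DDGS() as ddgs:
                return list(ddgs.text(query, max_results=5))
        except DuckDuckGoSearchException:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

MAX_CONSECUTIVE_ERRORS = 3
consecutive_errors = 0

def record_tool_result(ok):
    """Return True when the agent should stop calling tools and answer."""
    global consecutive_errors
    consecutive_errors = 0 if ok else consecutive_errors + 1
    return consecutive_errors >= MAX_CONSECUTIVE_ERRORS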

Requirements

  • llama.cpp built with Metal (-DLLAMA_METAL=ON) or CUDA support
  • A GGUF build of the model from bartowski (a download sketch follows this list)
  • 32GB+ RAM for Q4_K_M quantization (~18GB model)
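
One way to fetch a GGUF file is via the huggingface_hub package; this is a sketch only, and the repo and file names below are placeholders rather than the exact quantization this PR targets:

# Hypothetical download sketch; replace the placeholder repo_id and filename
# with the actual bartowski GGUF repo and the Q4_K_M file you want.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/<model-name>-GGUF",   # placeholder
    filename="<model-name>-Q4_K_M.gguf",     # placeholder (~18GB)
    local_dir="models",
)
print("Model saved to", path)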

Quick Start

# Install minimal dependencies
pip install -r requirements-local.txt

# Terminal 1: Start the server
./scripts/start_llama_server.sh

# Terminal 2: Run research queries
python inference/interactive_llamacpp.py
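
Before running queries, you can sanity-check that the server is reachable. This assumes llama-server's default port 8080 and its /health endpoint; adjust to whatever start_llama_server.sh configures:

# Quick check that llama-server is up before launching the agent.
import requests

try:
    r = requests.get("http://localhost:8080/health", timeout=5)
    print("llama-server responded:", r.status_code, r.text.strip())
except requests.ConnectionError:
    print("llama-server is not reachable; is the server in Terminal 1 still running?")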

Testing

Tested on Apple M1 Max with 32GB RAM:

  • Model loads in ~30-60 seconds
  • Inference runs at ~10-15 tokens/sec
  • Tool calls (search, visit) work correctly
  • Loop detection prevents runaway tool calls

Related

This is a cleaner alternative to PR #220 (MLX support), which had issues with chat template handling. llama.cpp is more mature and widely used.

Follow-up Changes

Search providers:

  • Add Exa, Tavily, Serper, and DuckDuckGo providers with automatic fallback (Exa → Tavily → Serper → DuckDuckGo; sketched below)
  • Show the available search providers on startup
  • Use Bearer token auth for Tavily (per its API docs)
  • Handle quota errors (Exa 402, Tavily 432/433)
  • Add sanitize_query() for input validation
  • Switch Serper from http.client to requests
  • Handle ConnectionError, JSONDecodeError, and DuckDuckGoSearchException
  • Use Exa type: auto for better search quality

Memory and performance:

  • Change the default context size from 32K to 16K (saves ~8GB of RAM)
  • Disable --mlock by default so the ~18GB model is not locked into wired memory; this keeps the system from running out of memory when llama-server runs alongside other apps such as Firefox
  • Add an --mlock flag to opt back in when the extra performance is needed
  • Add a --low-memory flag for constrained systems (8K context)
  • Display the mlock status in the configuration output

Usability:

  • Reduce MAX_ROUNDS from 30 to 10 for faster, more practical research queries
  • Pass --alias to llama-server for cleaner model naming, which fixes compatibility with web UIs such as Open WebUI that expect short model names
  • Remove emojis for cleaner, more professional output
  • Update the documentation with search provider information
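
As a rough illustration of the fallback order above, the providers can be tried in sequence, skipping ahead on quota or connection errors. The stubs and error handling here are illustrative assumptions, not the PR's actual code:

# Hypothetical sketch of the Exa → Tavily → Serper → DuckDuckGo fallback order.
import requests
from duckduckgo_search import DDGS

def exa_search(query):      # stub: the real provider calls the Exa API (type "auto")
    return []

def tavily_search(query):   # stub: the real provider uses Bearer token auth
    return []

def serper_search(query):   # stub: the real provider uses requests, not http.client
    return []

def duckduckgo_search(query):
    with DDGS() as ddgs:
        return list(ddgs.text(query, max_results=5))

PROVIDERS = [
    ("Exa", exa_search),
    ("Tavily", tavily_search),
    ("Serper", serper_search),
    ("DuckDuckGo", duckduckgo_search),
]

def search(query):
    """Try each provider in order; fall through on quota or connection errors."""
    for name, provider in PROVIDERS:
        try:
            results = provider(query)
            if results:
                return name, results
        except requests.HTTPError as e:
            # Quota-style failures (e.g. Exa 402, Tavily 432/433) → try the next one
            print(f"{name}: HTTP {e.response.status_code}, falling back")
        except Exception as e:
            print(f"{name}: {e}, falling back")
    raise RuntimeError("All search providers failed")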