@chindris-mihai-alexandru

Summary

Add support for running DeepResearch 100% locally using llama.cpp with Metal (Apple Silicon) or CUDA acceleration. Zero API costs, full privacy.

Why This?

The main inference path requires vLLM with 8x A100 GPUs. This PR adds an alternative for:

  • Mac users (M1/M2/M3/M4 with Metal acceleration)
  • Local/privacy-focused users
  • Developers who want to experiment without GPU server access
  • Anyone who wants free, offline research capabilities

New Files

  • inference/interactive_llamacpp.py: ReAct agent CLI that connects to the llama.cpp server (a minimal sketch of the client loop follows this list)
  • scripts/start_llama_server.sh: Server startup script with optimized Metal settings
  • requirements-local.txt: Minimal dependencies (requests, duckduckgo-search, python-dotenv)
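
For orientation, here is a minimal sketch of how a ReAct-style CLI can talk to the llama.cpp server through its OpenAI-compatible /v1/chat/completions endpoint. The port, model alias, payload, and function names are illustrative assumptions, not the actual contents of interactive_llamacpp.py:

# Minimal sketch of one chat turn against llama.cpp's OpenAI-compatible API.
# Assumes the server started by start_llama_server.sh listens on localhost:8080.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"

def chat(messages, temperature=0.6, max_tokens=1024):
    """Send one chat completion request and return the assistant's text."""
    resp = requests.post(
        LLAMA_SERVER,
        json={
            "model": "deepresearch",  # should match the server's --alias, if set
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Summarize the ReAct pattern in one sentence."}]))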

Features

  • Free web search: Uses DuckDuckGo (no API key required)
  • Page visiting: Uses Jina Reader (optional API key for better results)
  • Loop detection: Prevents infinite tool call cycles (3 consecutive errors → force answer); see the sketch after this list
  • 32K context: Supports long research sessions (the default was later lowered to 16K to save RAM; see Follow-up Changes below)
  • Rate limit handling: Exponential backoff retry for DuckDuckGo
  • URL validation: Validates URLs before attempting to visit
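
The loop-detection and rate-limit guardrails above can be illustrated roughly as follows, assuming the duckduckgo-search package named in requirements-local.txt. The function names, retry counts, and delays are assumptions for illustration, not the PR's actual code:

# Illustrative sketch: exponential-backoff retries for DuckDuckGo rate limits,
# plus a counter that forces a final answer after three consecutive tool errors.
import time

from duckduckgo_search import DDGS
from duckduckgo_search.exceptions import DuckDuckGoSearchException

def search_with_backoff(query, retries=4, base_delay=2.0):
    """Retry a DuckDuckGo text search with exponentially growing delays."""
    for attempt in range(retries):
        try:
            with DDGS() as ddgs:
                return list(ddgs.text(query, max_results=5))
        except DuckDuckGoSearchException:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

MAX_CONSECUTIVE_ERRORS = 3
consecutive_errors = 0

def record_tool_result(ok):
    """Return True when the agent should stop calling tools and answer."""
    global consecutive_errors
    consecutive_errors = 0 if ok else consecutive_errors + 1
    return consecutive_errors >= MAX_CONSECUTIVE_ERRORS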

Requirements

  • llama.cpp built with Metal (-DLLAMA_METAL=ON) or CUDA support
  • A GGUF build of the model from bartowski (a download sketch follows this list)
  • 32GB+ RAM for Q4_K_M quantization (~18GB model)
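
One way to fetch a GGUF file is via the huggingface_hub package; this is a sketch only, and the repo and file names below are placeholders rather than the exact quantization this PR targets:

# Hypothetical download sketch; replace the placeholder repo_id and filename
# with the actual bartowski GGUF repo and the Q4_K_M file you want.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/<model-name>-GGUF",   # placeholder
    filename="<model-name>-Q4_K_M.gguf",     # placeholder (~18GB)
    local_dir="models",
)
print("Model saved to", path)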

Quick Start

# Install minimal dependencies
pip install -r requirements-local.txt

# Terminal 1: Start the server
./scripts/start_llama_server.sh

# Terminal 2: Run research queries
python inference/interactive_llamacpp.py
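
Before running queries, you can sanity-check that the server is reachable. This assumes llama-server's default port 8080 and its /health endpoint; adjust to whatever start_llama_server.sh configures:

# Quick check that llama-server is up before launching the agent.
import requests

try:
    r = requests.get("http://localhost:8080/health", timeout=5)
    print("llama-server responded:", r.status_code, r.text.strip())
except requests.ConnectionError:
    print("llama-server is not reachable; is the server in Terminal 1 still running?")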

Testing

Tested on Apple M1 Max with 32GB RAM:

  • Model loads in ~30-60 seconds
  • Inference runs at ~10-15 tokens/sec
  • Tool calls (search, visit) work correctly
  • Loop detection prevents runaway tool calls

Related

This is a cleaner alternative to PR #220 (MLX support), which had issues with chat template handling. llama.cpp is more mature and widely used.

Follow-up Changes

Search providers:

  • Add Exa, Tavily, Serper, and DuckDuckGo providers with automatic fallback (Exa → Tavily → Serper → DuckDuckGo; sketched below)
  • Show the available search providers on startup
  • Use Bearer token auth for Tavily (per its API docs)
  • Handle quota errors (Exa 402, Tavily 432/433)
  • Add sanitize_query() for input validation
  • Switch Serper from http.client to requests
  • Handle ConnectionError, JSONDecodeError, and DuckDuckGoSearchException
  • Use Exa type: auto for better search quality

Memory and performance:

  • Change the default context size from 32K to 16K (saves ~8GB of RAM)
  • Disable --mlock by default so the ~18GB model is not locked into wired memory; this keeps the system from running out of memory when llama-server runs alongside other apps such as Firefox
  • Add an --mlock flag to opt back in when the extra performance is needed
  • Add a --low-memory flag for constrained systems (8K context)
  • Display the mlock status in the configuration output

Usability:

  • Reduce MAX_ROUNDS from 30 to 10 for faster, more practical research queries
  • Pass --alias to llama-server for cleaner model naming, which fixes compatibility with web UIs such as Open WebUI that expect short model names
  • Remove emojis for cleaner, more professional output
  • Update the documentation with search provider information
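
As a rough illustration of the fallback order above, the providers can be tried in sequence, skipping ahead on quota or connection errors. The stubs and error handling here are illustrative assumptions, not the PR's actual code:

# Hypothetical sketch of the Exa → Tavily → Serper → DuckDuckGo fallback order.
import requests
from duckduckgo_search import DDGS

def exa_search(query):      # stub: the real provider calls the Exa API (type "auto")
    return []

def tavily_search(query):   # stub: the real provider uses Bearer token auth
    return []

def serper_search(query):   # stub: the real provider uses requests, not http.client
    return []

def duckduckgo_search(query):
    with DDGS() as ddgs:
        return list(ddgs.text(query, max_results=5))

PROVIDERS = [
    ("Exa", exa_search),
    ("Tavily", tavily_search),
    ("Serper", serper_search),
    ("DuckDuckGo", duckduckgo_search),
]

def search(query):
    """Try each provider in order; fall through on quota or connection errors."""
    for name, provider in PROVIDERS:
        try:
            results = provider(query)
            if results:
                return name, results
        except requests.HTTPError as e:
            # Quota-style failures (e.g. Exa 402, Tavily 432/433) → try the next one
            print(f"{name}: HTTP {e.response.status_code}, falling back")
        except Exception as e:
            print(f"{name}: {e}, falling back")
    raise RuntimeError("All search providers failed")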