A FastAPI-based web application for comparing multiple language models side by side using Ollama. Features dynamic model selection, per-pane controls, and stability limits to prevent system freezes.
- Basic Workflow
- Quick Start
- Features
- Usage
- Stability Features
- API Endpoints
- Configuration
- Troubleshooting
- Project Structure
- Dependencies
- Performance Tips
- Contributing
- License
- Acknowledgments
At its core, the LLM Behaviour Lab enables systematic exploration of how deterministic, interpretable and corrigible human-defined parameters extrinsic to the model interact with its intrinsic, probabilistic outputs. These deterministic parameters include both the inference-time configuration and code scaffolds (e.g. system/user prompts, temperature, token limits) and the post-training inputs (e.g. Q&A, instructions, preferences, reinforcements).
- Select models from the multi-select dropdown (hold Ctrl/Cmd for multiple)
- Click "Add Selected" to create comparison panes
- Craft prompts in the system/user input fields - this is where you control the deterministic variables
- Adjust parameters like temperature (0.0-2.0) and max tokens to see their effects
- Click "Generate" on individual panes or "Generate All" for batch
- Use "Stop" buttons to cancel generation
- Add aliases to distinguish between similar models
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS
brew install ollama
# Windows: Download from https://ollama.ai/download

ollama serve  # Start Ollama in another terminal
# Pull some models to compare
ollama pull qwen2.5:7b       # Instruct model
ollama pull llama3.2:3b      # Smaller model
ollama pull gemma2:9b        # Different architecture

Browse all available models: Visit https://ollama.com/search to explore the full catalog of models available for comparison.
# Clone and setup
git clone https://github.com/Leamsi9/llm-behaviour-lab.git
cd llm-behaviour-lab
# Use the setup script
./setup.sh
# Or manually:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Edit the .env file with your system specs:
# For 32GB RAM systems (default)
MAX_INPUT_LENGTH=12000
MAX_CONTEXT_TOKENS=8192
MAX_OUTPUT_TOKENS=4096
# For 16GB RAM systems
MAX_INPUT_LENGTH=8000
MAX_CONTEXT_TOKENS=4096
MAX_OUTPUT_TOKENS=2048

source venv/bin/activate
uvicorn app_ollama:app --host 0.0.0.0 --port 8000 --reload

Navigate to: http://localhost:8000
- ✅ Multi-model comparison: Compare any number of Ollama models simultaneously
- ✅ Dynamic model loading: Automatically detects and lists all pulled Ollama models
- ✅ Per-pane controls: Individual Generate/Stop/Clear/Remove buttons for each model
- ✅ Global controls: Generate All and Stop All buttons for batch operations
- ✅ Real-time streaming: Token-by-token generation with visual indicators
- ✅ Stability limits: Configurable limits to prevent system freezes (.env file)
- ✅ Cancellation support: Properly interrupts generation without leaving orphaned processes
- ✅ Token counting: Detailed metrics (prompt tokens, completion tokens, latency, TPS)
- ✅ Model aliases: Tag each model pane with custom labels
- ✅ Responsive UI: Works on desktop and mobile devices
Each model comparison reveals insights about:
- System Prompt: Defines the AI's role, personality, and behavioral constraints. To compare behaviour under the system prompts of major LLMs and tools, see https://github.com/elder-plinius/CL4R1T4S for a collection of such prompts that you can reuse.
- User Prompt: The specific task or question being asked
- Temperature: Controls randomness (0.0 = deterministic, 1.0 = creative, 2.0 = chaotic)
- Token limits: Limits output length and computational cost
- Post-training inputs: Fine-tuning (instruction, preference and reinforcement), which you can compare in this app by selecting base and fine-tuned models (e.g. qwen2.5:7b vs qwen2.5:7b-instruct) or different fine-tuning approaches to the same base model (e.g. tool-using vs instruction-tuned vs abliterated (uncensored) vs niche-tuned). The post-training becomes intrinsic to the model, but the process relies on deterministic human artefacts extrinsic to the base model: explicit instructions, preferences and reinforcements that are fully interpretable and corrigible.
- Architecture differences: Transformer variants, attention mechanisms, parameter counts
- Training data: What knowledge and patterns each model has learned
- Fine-tuning approach: Base models vs instruction-tuned vs tool-using variants
- Token generation: How each model chooses the next word given identical inputs
Compare the same model at different temperatures:
- llama3.2:3b [Temp 0.1] - Precise, factual responses
- llama3.2:3b [Temp 0.7] - Balanced creativity and coherence
- llama3.2:3b [Temp 1.5] - Highly creative, more unpredictable
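The same temperature sweep can be reproduced outside the UI against Ollama's standard /api/generate endpoint, which is handy for scripted experiments. A minimal sketch, assuming Ollama is running on its default port 11434 and llama3.2:3b has been pulled:

```python
import httpx

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model: str, prompt: str, temperature: float, max_tokens: int = 256) -> str:
    """Send one non-streaming generation request to Ollama and return the text."""
    payload = {
        "model": model,
        "system": "You are a concise assistant.",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }
    resp = httpx.post(OLLAMA_URL, json=payload, timeout=180.0)
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    question = "Describe a sunrise in one sentence."
    for temp in (0.1, 0.7, 1.5):
        print(f"\n--- llama3.2:3b @ temperature {temp} ---")
        print(generate("llama3.2:3b", question, temp))
```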
Compare different model families:
- qwen2.5:7b vs llama3.2:3b vs gemma2:9b - Same prompt, different architectures
- Base vs instruction-tuned variants of the same model
- Small vs large parameter counts within the same family
Compare different training approaches:
- Base models (raw, pre-training only)
- Instruction-tuned (RLHF, aligned for helpfulness)
- Tool-using variants (function calling, API integration)
- Domain-specific fine-tunes (coding, medical, legal)
Each pane can have a custom alias (displayed in brackets):
- qwen2.5:7b [Base] - for base model comparisons
- qwen2.5:7b [Creative] - for creative writing tests
- llama3.2:3b [Fast] - for quick iterations
- mistral:7b [Temp 0.1] - for precise, factual responses
- Generate All: Start generation on all panes simultaneously
- Stop All: Cancel all active generations
- Model Status: Shows number of active WebSocket connections
- Generate: Start generation for this model
- Stop: Cancel generation (with "Stopping..." feedback)
- Clear: Reset output and metrics
- Remove: Delete this pane and close its WebSocket
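Under the hood, per-pane Stop requires the server to interrupt an in-flight Ollama stream cleanly. The sketch below shows one common FastAPI pattern for this; it is an illustration only, not a copy of app_ollama.py, and the route path and generate/stop message shape are assumptions. The idea is to run each generation as an asyncio task and cancel it when a stop message arrives, letting the httpx context managers close the upstream connection:

```python
import asyncio
import httpx
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/demo")  # hypothetical route, for illustration only
async def demo(ws: WebSocket) -> None:
    await ws.accept()
    task: asyncio.Task | None = None

    async def stream_tokens(payload: dict) -> None:
        # The async context managers guarantee the connection to Ollama is
        # closed even if this task is cancelled mid-stream.
        async with httpx.AsyncClient(timeout=180.0) as client:
            async with client.stream("POST", "http://localhost:11434/api/generate", json=payload) as resp:
                async for line in resp.aiter_lines():
                    await ws.send_text(line)

    while True:
        msg = await ws.receive_json()
        if msg.get("action") == "generate":
            task = asyncio.create_task(stream_tokens(msg["payload"]))
        elif msg.get("action") == "stop" and task is not None:
            task.cancel()  # interrupts the stream ...
            await asyncio.gather(task, return_exceptions=True)  # ... and waits for cleanup
            task = None
```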
- Character limits: MAX_INPUT_LENGTH prevents memory exhaustion
- Token capping: MAX_OUTPUT_TOKENS limits generation length
- Context windows: MAX_CONTEXT_TOKENS prevents overflow
- Thread limiting: Caps CPU usage to 4 threads
- Request timeouts: REQUEST_TIMEOUT prevents infinite hangs
- HTTP cleanup: Properly closes connections on cancellation
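These limits are plain values read from .env and applied before each request reaches Ollama. The snippet below is an illustrative sketch of that pattern; the environment variable names match the configuration above, but the clamping helper itself is an assumption rather than code lifted from app_ollama.py:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

MAX_INPUT_LENGTH = int(os.getenv("MAX_INPUT_LENGTH", "8000"))
MAX_CONTEXT_TOKENS = int(os.getenv("MAX_CONTEXT_TOKENS", "4096"))
MAX_OUTPUT_TOKENS = int(os.getenv("MAX_OUTPUT_TOKENS", "2048"))
REQUEST_TIMEOUT = float(os.getenv("REQUEST_TIMEOUT", "180.0"))

def clamp_request(system: str, user: str, requested_tokens: int) -> dict:
    """Apply the configured limits before a request is forwarded to Ollama."""
    prompt = (system + "\n\n" + user)[:MAX_INPUT_LENGTH]  # character limit
    return {
        "prompt": prompt,
        "options": {
            "num_ctx": MAX_CONTEXT_TOKENS,                             # context window
            "num_predict": min(requested_tokens, MAX_OUTPUT_TOKENS),   # output cap
            "num_thread": 4,                                           # thread limiting
        },
    }
```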
If you experience freezes:
# Kill processes
pkill -9 ollama
pkill -9 python
# Reduce limits in .env
MAX_INPUT_LENGTH=4000
MAX_CONTEXT_TOKENS=2048
# Restart
ollama serve
uvicorn app_ollama:app --reload

Streaming inference endpoint with cancellation support.
Request payload:
{
  "model_name": "qwen2.5:7b",
  "system": "You are a helpful assistant.",
  "user": "Explain quantum computing.",
  "temp": 0.7,
  "max_tokens": 1024,
  "stop": ["USER:", "ASSISTANT:", "</s>"]
}

Response stream:
{"token": "Quantum"}
{"token": " computing"}
{"token": " is"}
{"token": "..."}
{"token": "[DONE]", "done": true, "metrics": {...}}Returns available Ollama models.
Returns available Ollama models.

Response:
{
  "models": ["qwen2.5:7b", "llama3.2:3b", "gemma2:9b"],
  "current": {
    "base": "qwen2.5:7b-base",
    "instruct": "qwen2.5:7b"
  }
}

Health check endpoint.
Response:
{
  "status": "ok",
  "ollama": true,
  "websocket": true,
  "models": {
    "base": "qwen2.5:7b-base",
    "instruct": "qwen2.5:7b"
  }
}

Create a .env file in the project root from the .env-example file:
# Stability limits
MAX_INPUT_LENGTH=8000          # Character limit for prompts
MAX_CONTEXT_TOKENS=4096        # Ollama context window
MAX_OUTPUT_TOKENS=2048         # Maximum generation length
REQUEST_TIMEOUT=180.0          # Seconds before timeout

Recommended limits by available RAM:

| RAM | Input Length | Context Tokens | Output Tokens | Example Models |
|---|---|---|---|---|
| 8GB | 4,000 | 2,048 | 1,024 | llama3.2:1b, phi3:mini |
| 16GB | 8,000 | 4,096 | 2,048 | llama3.2:3b, mistral:7b |
| 32GB | 16,000 | 16,384 | 8,192 | llama3:8b, mixtral:8x7b |
| 64GB | 32,000 | 32,768 | 16,384 | llama3:70b, qwen2.5:72b |
# Ensure Ollama is running
ollama serve
# Check connection
curl http://localhost:11434/api/tags
# Change port if needed
export OLLAMA_HOST=0.0.0.0:11435

# Pull models
ollama pull qwen2.5:7b
ollama pull llama3.2:3b
# List available
ollama list

- Reduce limits in .env: MAX_INPUT_LENGTH=4000, MAX_CONTEXT_TOKENS=2048
- Use smaller models: ollama pull llama3.2:1b
- Monitor resources: htop (CPU/RAM), watch -n 1 nvidia-smi (GPU, if available)
- Ollama constraints: see below
To modify constraints directly in Ollama for better stability, set these environment variables before running ollama serve:
export OLLAMA_NUM_THREADS=4       # Limit CPU threads to 4
export OLLAMA_GPU_LAYERS=35       # Limit GPU layers (0 disables GPU)
export OLLAMA_MAX_LOADED_MODELS=3 # Limit concurrent loaded models

These environment variables let you tune Ollama's resource consumption to match your system's capabilities, preventing freezes and ensuring stable operation.
- Check browser console for connection issues
- Ensure no firewall blocks WebSocket connections
- Try different browser (Chrome recommended)
llm-behaviour-lab/
├── app_ollama.py          # FastAPI application with Ollama integration
├── static/
│   └── ui_multi.html      # Multi-model comparison UI
├── .env-example          # Environment configuration template
├── .gitignore            # Git ignore rules
├── requirements.txt      # Python dependencies
├── setup.sh             # Automated setup script
├── README.md            # This file
└── Stability.md         # Detailed stability configuration
- FastAPI: Web framework
- Uvicorn: ASGI server
- httpx: HTTP client for Ollama API
- python-dotenv: Environment configuration
- Ollama: Local LLM inference server
- Model caching: Pull frequently used models for faster startup
- Concurrent limits: Don't run too many large models simultaneously
- GPU acceleration: Ollama automatically uses GPU if available
- Memory management: Clear unused models with ollama stop <model>
- Fork the repository
- Create a feature branch
- Make your changes
- Test with different model configurations
- Submit a pull request
This project is licensed under the MIT License.
This software is fully Free and Open Source. You are free to:
- ✅ Use it for any purpose (personal, commercial, educational)
- ✅ Modify and distribute your changes
- ✅ Include it in other projects
- ✅ Use it in production environments
Ismael Velasco - Original developer and maintainer
- Ollama for efficient local LLM inference
- FastAPI for the web framework
- Meta Llama and other model providers