Fine-tune small language models (Qwen2.5 0.5B/1.5B/3B) to classify observability metric tag keys by cardinality using LoRA.
Observability systems need to detect high-cardinality tags (like timestamps and UUIDs) that can explode metric storage costs. Zero-shot, small LLMs handle this poorly: they tend toward "model collapse," largely ignoring the input and predicting the same class almost every time.
Baseline Results (Binary Classification - Zero-shot):
- Qwen2.5-0.5B: 48.0% accuracy (random predictions, 54.5% HIGH / 42.9% LOW)
- Qwen2.5-1.5B: 60.0% accuracy (severe model collapse: 9.1% HIGH / 100% LOW)
- Qwen2.5-3B: 80.0% accuracy (best baseline: 63.6% HIGH / 92.9% LOW)
- Qwen2.5-7B: 80.0% accuracy (balanced: 81.8% HIGH / 78.6% LOW)
Goal: Beat 80% baseline through fine-tuning smaller models (0.5B, 1.5B) for efficient deployment.
Binary Classification (Clear cases only):
- HIGH: Tags with many unique values (timestamps, IDs, UUIDs, IP addresses, session IDs)
- LOW: Tags with few unique values (environment, region, status, method, service)
- Removed: BORDERLINE tags (pod_name, container_id, hostname), excluded as ambiguous training examples
- Total: 163 examples (28 BORDERLINE tags removed from original 191)
- Split: 70% train (114), 15% val (24), 15% test (25)
- Balance:
  - Train: 62 LOW, 52 HIGH
  - Val: 10 LOW, 14 HIGH
  - Test: 14 LOW, 11 HIGH
- Format: Instruction/Response format (based on LoRA best practices)
- Source: tag_cardinality_training_data.json
# Install dependencies
pip3 install torch transformers peft datasets accelerate
# Verify installation
python3 -c "import torch; print(f'PyTorch: {torch.__version__}')"python3 prepare_training_data.py \
--input tag_cardinality_training_data.json \
--output-dir .This generates:
train_data.jsonl(114 examples)val_data.jsonl(24 examples)test_data.jsonl(25 examples)
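As a quick sanity check, you can count examples and label balance in the generated files. A minimal sketch, assuming each JSONL line is an object with a "text" field that ends in the `### Response:` label (the format shown later in this README):

```python
# Count examples and HIGH/LOW balance in each generated split.
# Assumes the Instruction/Response format described in this README.
import json

for split in ("train_data.jsonl", "val_data.jsonl", "test_data.jsonl"):
    counts = {"HIGH": 0, "LOW": 0}
    with open(split) as f:
        for line in f:
            label = json.loads(line)["text"].rsplit("### Response:\n", 1)[-1].strip()
            counts[label] += 1
    print(f"{split}: {sum(counts.values())} examples, {counts}")
```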
Test zero-shot models with explicit prompting via Ollama:
# Ensure Ollama is running with models installed:
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
python3 test_baselines.py --test-data test_data.jsonl

Expected Output:
qwen2.5:0.5b: 48.0% overall (54.5% HIGH, 42.9% LOW)
qwen2.5:1.5b: 60.0% overall (9.1% HIGH, 100% LOW) - Model collapse
qwen2.5:3b: 80.0% overall (63.6% HIGH, 92.9% LOW) - Best baseline!
Train all three model sizes:
# 0.5B model (~5-10 minutes on Apple Silicon)
python3 train_lora.py \
--model Qwen/Qwen2.5-0.5B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints_0.5b_clean
# 1.5B model (~20-30 minutes on Apple Silicon)
python3 train_lora.py \
--model Qwen/Qwen2.5-1.5B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints_1.5b_clean
# 3B model (~40-60 minutes on Apple Silicon)
python3 train_lora.py \
--model Qwen/Qwen2.5-3B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints_3b_clean

Training Configuration:
- LoRA rank: 16, alpha: 32
- Batch size: 4 (effective 16 with gradient accumulation)
- Learning rate: 1e-4
- Early stopping: patience=2 eval steps (stops if no improvement for 2 consecutive evaluations)
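For reference, the configuration above maps roughly onto the following PEFT/Transformers setup. This is a sketch of the intent, not a copy of train_lora.py; in particular, target_modules and lora_dropout are assumptions.

```python
# Rough equivalent of the training configuration listed above.
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # --lora-rank 16
    lora_alpha=32,       # alpha 32
    lora_dropout=0.05,   # assumption: not specified in this README
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

training_args = TrainingArguments(
    output_dir="checkpoints_1.5b_clean",
    num_train_epochs=20,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size 16
    learning_rate=1e-4,
    eval_strategy="steps",            # "evaluation_strategy" on older transformers
    eval_steps=20,                    # evaluate every 20 steps
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,      # keep the best checkpoint by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```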
Compare base vs fine-tuned performance:
# Evaluate 0.5B
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-0.5B \
--adapter-path checkpoints_0.5b_clean/final_model \
--test-data test_data.jsonl \
--compare
# Evaluate 1.5B
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-1.5B \
--adapter-path checkpoints_1.5b_clean/final_model \
--test-data test_data.jsonl \
--compare
# Evaluate 3B
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-3B \
--adapter-path checkpoints_3b_clean/final_model \
--test-data test_data.jsonl \
--compare

Success Criteria:
- Overall accuracy: >80% (beat best baseline)
- Per-label accuracy: >75% for both HIGH and LOW
- No severe model collapse (both classes predicted)
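Under the hood, the comparison loads the base model, attaches the saved LoRA adapter, and completes the prompt for each test example. A minimal sketch, assuming test_data.jsonl stores the full prompt plus label in a "text" field (the format shown below); paths come from the commands above, and the generation settings are assumptions:

```python
# Load base model + LoRA adapter and classify one held-out example.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, "checkpoints_1.5b_clean/final_model")
model.eval()

# Take one test example and strip the gold label that follows "### Response:\n"
record = json.loads(open("test_data.jsonl").readline())
prompt, sep, gold = record["text"].rpartition("### Response:\n")
prompt += sep

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
print(f"gold={gold.strip()} predicted={pred}")
```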
Instruction/Response Format (matches training best practices):
{
"text": "### Instruction:\nClassify the cardinality of this observability metric tag: 'timestamp'\n\nRespond with ONLY ONE WORD:\n- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n- LOW (for tags with few unique values like environment, region, status)\n\nAnswer:\n\n### Response:\nHIGH"
}

This format:
- Makes task expectations crystal clear
- Provides inline examples in the prompt
- Uses explicit `### Instruction:` and `### Response:` markers
- Matches the format from LoRA fine-tuning best practices
qwen_tests/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore patterns
│
├── tag_cardinality_training_data.json # Source dataset (191 examples)
│
├── train_data.jsonl # Training data (114 examples)
├── val_data.jsonl # Validation data (24 examples)
├── test_data.jsonl # Test data (25 examples)
│
├── prepare_training_data.py # Generate training data
├── train_lora.py # Fine-tune with LoRA
├── evaluate_lora.py # Evaluate fine-tuned model
├── test_baselines.py # Test baseline via Ollama
└── export_to_ollama.py # Export to Ollama format (TODO)
Generate training data with binary classification (HIGH/LOW only).
python3 prepare_training_data.py \
--input tag_cardinality_training_data.json \
--output-dir .

Features:
- Filters out BORDERLINE tags (28 ambiguous examples removed)
- Generates Instruction/Response format
- Creates balanced 70/15/15 train/val/test split
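The core of the script amounts to filtering and splitting. A minimal sketch, assuming the source JSON is a list of objects with a "label" field (the exact field names are an assumption, and the real script additionally balances labels across splits):

```python
# Drop BORDERLINE tags and split the remainder 70/15/15.
import json
import random

with open("tag_cardinality_training_data.json") as f:
    examples = json.load(f)

clear = [ex for ex in examples if ex["label"] != "BORDERLINE"]  # 191 -> 163

random.seed(42)   # assumption: any fixed seed, for a reproducible split
random.shuffle(clear)

n = len(clear)
train = clear[: int(0.70 * n)]
val = clear[int(0.70 * n): int(0.85 * n)]
test = clear[int(0.85 * n):]
print(len(train), len(val), len(test))   # 114 / 24 / 25 for 163 examples
```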
Fine-tune Qwen models with LoRA.
python3 train_lora.py \
--model Qwen/Qwen2.5-1.5B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints

Key Features:
- Auto-detects device (MPS for Apple Silicon, CUDA, or CPU)
- Early stopping to prevent overfitting (patience=2)
- Gradient checkpointing to save memory
- Saves best model based on validation loss
- Evaluates every 20 steps
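Device auto-detection follows the standard PyTorch pattern, roughly:

```python
# Pick MPS (Apple Silicon), then CUDA, then CPU.
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Training on: {device}")
```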
Evaluate fine-tuned models on test data.
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-1.5B \
--adapter-path checkpoints/final_model \
--test-data test_data.jsonl \
--compare    # Compare base vs fine-tuned

Output:
- Overall accuracy
- Per-label accuracy (HIGH/LOW)
- Detailed per-example results with raw model output
- Improvement summary (if --compare used)
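The per-label breakdown is just counting correct predictions per gold label; a self-contained sketch (accuracy_report is a hypothetical helper, not part of evaluate_lora.py):

```python
# Compute overall and per-label accuracy from (expected, predicted) pairs.
from collections import defaultdict

def accuracy_report(pairs):
    total = correct = 0
    per_label = defaultdict(lambda: [0, 0])   # label -> [correct, total]
    for expected, predicted in pairs:
        total += 1
        per_label[expected][1] += 1
        if predicted == expected:
            correct += 1
            per_label[expected][0] += 1
    print(f"Overall: {correct / total:.1%}")
    for label, (c, t) in sorted(per_label.items()):
        print(f"  {label}: {c / t:.1%} ({c}/{t})")

# Hypothetical results, for illustration only
accuracy_report([("HIGH", "HIGH"), ("HIGH", "LOW"), ("LOW", "LOW"), ("LOW", "LOW")])
```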
Test baseline models via Ollama API with explicit prompting.
python3 test_baselines.py --test-data test_data.jsonl

Requirements:
- Ollama running locally
- Models installed:
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
Output:
- Baseline accuracy for all models
- Per-label breakdown
- Comparison table
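A single zero-shot call against the local Ollama API looks roughly like this (classify_zero_shot is a hypothetical helper, not the actual test_baselines.py code; the prompt mirrors the template used for training):

```python
# Zero-shot classification of one tag via Ollama's /api/generate endpoint.
import json
import urllib.request

def classify_zero_shot(model: str, tag: str) -> str:
    prompt = (
        f"Classify the cardinality of this observability metric tag: '{tag}'\n\n"
        "Respond with ONLY ONE WORD:\n"
        "- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n"
        "- LOW (for tags with few unique values like environment, region, status)\n\n"
        "Answer:"
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

print(classify_zero_shot("qwen2.5:3b", "user_id"))   # expected: HIGH
```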
Install dependencies:

pip3 install torch transformers peft datasets accelerate

If MPS/CUDA isn't available, training falls back to CPU automatically. To force CPU:

python3 train_lora.py ... --device cpu

If you run out of memory, reduce the batch size:

python3 train_lora.py ... --batch-size 2

If training stops early: this is expected when validation loss doesn't improve for 2 consecutive evaluations. The best model is saved automatically. This prevents overfitting.
If accuracy after fine-tuning is still low, check for:
- Dataset imbalance (should be roughly balanced: 52 HIGH / 62 LOW in training)
- Training too few epochs (increase --epochs)
- Learning rate too high (try 5e-5 instead of 1e-4)
- Prompt format mismatch between training and evaluation
| Model | Baseline | After Fine-tuning | Improvement |
|---|---|---|---|
| Qwen2.5-0.5B | 48.0% (near-random) | TBD | TBD |
| Qwen2.5-1.5B | 60.0% (model collapse) | TBD | TBD |
| Qwen2.5-3B | 80.0% | TBD | TBD |
Target: Beat 80% baseline with smaller models (0.5B, 1.5B) for efficient deployment.
- Evaluate fine-tuned models: Run the evaluation commands above
- Compare model sizes: which offers the best balance of accuracy vs. inference speed?
- Export best model: Use export_to_ollama.py to deploy
- Add more training data: Collect examples from production logs
- Re-introduce BORDERLINE: If needed, train separate 3-class model
We removed BORDERLINE tags (pod_name, hostname, container_id) because:
- They're ambiguous: High cardinality but legitimate
- Unclear training signal for the model
- Better to focus on clear HIGH vs LOW cases first
If you need to handle BORDERLINE tags:
- Keep using the binary classifier (HIGH/LOW)
- Add heuristics: if a tag matches patterns like `*_name`, `*_id`, or `host*`, classify it manually (see the sketch below)
- Or train a separate 3-class model later
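A tiny sketch of that heuristic (needs_manual_review and the pattern list are hypothetical, shown only to illustrate the idea):

```python
# Route BORDERLINE-looking tag keys to manual review; everything else goes to
# the binary HIGH/LOW classifier.
from fnmatch import fnmatch

BORDERLINE_PATTERNS = ["*_name", "*_id", "host*"]

def needs_manual_review(tag: str) -> bool:
    return any(fnmatch(tag, pattern) for pattern in BORDERLINE_PATTERNS)

print(needs_manual_review("pod_name"))     # True  -> classify manually
print(needs_manual_review("environment"))  # False -> binary classifier
```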
Based on LoRA best practices:
- Explicit markers (`### Instruction:`, `### Response:`) help the model understand task structure
- Inline examples in the prompt reduce the need for few-shot prompting
- Clearer than conversational format for classification tasks
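Concretely, the same template can be generated with a small helper (build_prompt is hypothetical; only the template text comes from this README):

```python
# Build the Instruction/Response prompt for a tag; append the label for
# training examples, leave it off at inference time.
def build_prompt(tag: str, label: str = "") -> str:
    return (
        "### Instruction:\n"
        f"Classify the cardinality of this observability metric tag: '{tag}'\n\n"
        "Respond with ONLY ONE WORD:\n"
        "- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n"
        "- LOW (for tags with few unique values like environment, region, status)\n\n"
        "Answer:\n\n"
        "### Response:\n"
    ) + label

print(build_prompt("timestamp", "HIGH"))   # training example
print(build_prompt("region"))              # inference prompt: the model completes the label
```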
Aggressive early stopping (patience=2) because:
- Small dataset (114 training examples) → overfits quickly
- Evaluation every 20 steps → 2 evals = ~40 steps without improvement
- Better to stop early than overfit on small dataset
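The stopping rule itself is simple; an illustration in plain Python (the actual script most likely relies on the Trainer's early-stopping callback rather than this function):

```python
# Stop once validation loss has failed to improve for `patience` consecutive evaluations.
def should_stop(eval_losses, patience=2):
    best = float("inf")
    evals_since_best = 0
    for loss in eval_losses:
        if loss < best:
            best, evals_since_best = loss, 0
        else:
            evals_since_best += 1
        if evals_since_best >= patience:
            return True
    return False

print(should_stop([0.90, 0.62, 0.55, 0.56, 0.57]))  # True: two evals without improvement
print(should_stop([0.90, 0.62, 0.55, 0.56, 0.51]))  # False: still improving
```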
MIT