Fine-tune small language models (Qwen2.5 0.5B/1.5B/3B) to classify observability metric tag keys by cardinality using LoRA.
Observability systems need to detect high-cardinality tags (like timestamps and UUIDs) that can explode metric storage costs. Zero-shot, small LLMs handle this poorly: they tend toward "model collapse," largely ignoring the input and predicting the same class almost every time.
Baseline Results (Binary Classification - Zero-shot):
- Qwen2.5-0.5B: 48.0% accuracy (random predictions, 54.5% HIGH / 42.9% LOW)
- Qwen2.5-1.5B: 60.0% accuracy (severe model collapse: 9.1% HIGH / 100% LOW)
- Qwen2.5-3B: 80.0% accuracy (best baseline: 63.6% HIGH / 92.9% LOW)
- Qwen2.5-7B: 80.0% accuracy (balanced: 81.8% HIGH / 78.6% LOW)
Goal: Beat 80% baseline through fine-tuning smaller models (0.5B, 1.5B) for efficient deployment.
Binary Classification (Clear cases only):
- HIGH: Tags with many unique values (timestamps, IDs, UUIDs, IP addresses, session IDs)
- LOW: Tags with few unique values (environment, region, status, method, service)
- Removed: BORDERLINE tags (pod_name, container_id, hostname), excluded as ambiguous training examples
- Total: 163 examples (28 BORDERLINE tags removed from original 191)
- Split: 70% train (114), 15% val (24), 15% test (25)
- Balance:
  - Train: 62 LOW, 52 HIGH
  - Val: 10 LOW, 14 HIGH
  - Test: 14 LOW, 11 HIGH
- Format: Instruction/Response format (based on LoRA best practices)
- Source: tag_cardinality_training_data.json
# Install dependencies
pip3 install torch transformers peft datasets accelerate
# Verify installation
python3 -c "import torch; print(f'PyTorch: {torch.__version__}')"python3 prepare_training_data.py \
--input tag_cardinality_training_data.json \
--output-dir .This generates:
train_data.jsonl(114 examples)val_data.jsonl(24 examples)test_data.jsonl(25 examples)
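As a quick sanity check, you can count examples and label balance in the generated files. A minimal sketch, assuming each JSONL line is an object with a "text" field that ends in the `### Response:` label (the format shown later in this README):

```python
# Count examples and HIGH/LOW balance in each generated split.
# Assumes the Instruction/Response format described in this README.
import json

for split in ("train_data.jsonl", "val_data.jsonl", "test_data.jsonl"):
    counts = {"HIGH": 0, "LOW": 0}
    with open(split) as f:
        for line in f:
            label = json.loads(line)["text"].rsplit("### Response:\n", 1)[-1].strip()
            counts[label] += 1
    print(f"{split}: {sum(counts.values())} examples, {counts}")
```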
Test zero-shot models with explicit prompting via Ollama:
# Ensure Ollama is running with models installed:
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
python3 test_baselines.py --test-data test_data.jsonl

Expected Output:
qwen2.5:0.5b: 48.0% overall (54.5% HIGH, 42.9% LOW)
qwen2.5:1.5b: 60.0% overall (9.1% HIGH, 100% LOW) - Model collapse
qwen2.5:3b: 80.0% overall (63.6% HIGH, 92.9% LOW) - Best baseline!
Train all three model sizes:
# 0.5B model (~5-10 minutes on Apple Silicon)
python3 train_lora.py \
--model Qwen/Qwen2.5-0.5B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints_0.5b_clean
# 1.5B model (~20-30 minutes on Apple Silicon)
python3 train_lora.py \
--model Qwen/Qwen2.5-1.5B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints_1.5b_clean
# 3B model (~40-60 minutes on Apple Silicon)
python3 train_lora.py \
--model Qwen/Qwen2.5-3B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints_3b_clean

Training Configuration:
- LoRA rank: 16, alpha: 32
- Batch size: 4 (effective 16 with gradient accumulation)
- Learning rate: 1e-4
- Early stopping: patience=2 eval steps (stops if no improvement for 2 consecutive evaluations)
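For reference, the configuration above maps roughly onto the following PEFT/Transformers setup. This is a sketch of the intent, not a copy of train_lora.py; in particular, target_modules and lora_dropout are assumptions.

```python
# Rough equivalent of the training configuration listed above.
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # --lora-rank 16
    lora_alpha=32,       # alpha 32
    lora_dropout=0.05,   # assumption: not specified in this README
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

training_args = TrainingArguments(
    output_dir="checkpoints_1.5b_clean",
    num_train_epochs=20,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size 16
    learning_rate=1e-4,
    eval_strategy="steps",            # "evaluation_strategy" on older transformers
    eval_steps=20,                    # evaluate every 20 steps
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,      # keep the best checkpoint by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```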
Compare base vs fine-tuned performance:
# Evaluate 0.5B
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-0.5B \
--adapter-path checkpoints_0.5b_clean/final_model \
--test-data test_data.jsonl \
--compare
# Evaluate 1.5B
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-1.5B \
--adapter-path checkpoints_1.5b_clean/final_model \
--test-data test_data.jsonl \
--compare
# Evaluate 3B
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-3B \
--adapter-path checkpoints_3b_clean/final_model \
--test-data test_data.jsonl \
--compare

Success Criteria:
- Overall accuracy: >80% (beat best baseline)
- Per-label accuracy: >75% for both HIGH and LOW
- No severe model collapse (both classes predicted)
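Under the hood, the comparison loads the base model, attaches the saved LoRA adapter, and completes the prompt for each test example. A minimal sketch, assuming test_data.jsonl stores the full prompt plus label in a "text" field (the format shown below); paths come from the commands above, and the generation settings are assumptions:

```python
# Load base model + LoRA adapter and classify one held-out example.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = PeftModel.from_pretrained(model, "checkpoints_1.5b_clean/final_model")
model.eval()

# Take one test example and strip the gold label that follows "### Response:\n"
record = json.loads(open("test_data.jsonl").readline())
prompt, sep, gold = record["text"].rpartition("### Response:\n")
prompt += sep

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
pred = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
print(f"gold={gold.strip()} predicted={pred}")
```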
Instruction/Response Format (matches training best practices):
{
"text": "### Instruction:\nClassify the cardinality of this observability metric tag: 'timestamp'\n\nRespond with ONLY ONE WORD:\n- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n- LOW (for tags with few unique values like environment, region, status)\n\nAnswer:\n\n### Response:\nHIGH"
}

This format:
- Makes task expectations crystal clear
- Provides inline examples in the prompt
- Uses explicit `### Instruction:` and `### Response:` markers
- Matches the format from LoRA fine-tuning best practices
qwen_tests/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore patterns
│
├── tag_cardinality_training_data.json # Source dataset (191 examples)
│
├── train_data.jsonl # Training data (114 examples)
├── val_data.jsonl # Validation data (24 examples)
├── test_data.jsonl # Test data (25 examples)
│
├── prepare_training_data.py # Generate training data
├── train_lora.py # Fine-tune with LoRA
├── evaluate_lora.py # Evaluate fine-tuned model
├── test_baselines.py # Test baseline via Ollama
└── export_to_ollama.py # Export to Ollama format (TODO)
Generate training data with binary classification (HIGH/LOW only).
python3 prepare_training_data.py \
--input tag_cardinality_training_data.json \
--output-dir .

Features:
- Filters out BORDERLINE tags (28 ambiguous examples removed)
- Generates Instruction/Response format
- Creates balanced 70/15/15 train/val/test split
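The core of the script amounts to filtering and splitting. A minimal sketch, assuming the source JSON is a list of objects with a "label" field (the exact field names are an assumption, and the real script additionally balances labels across splits):

```python
# Drop BORDERLINE tags and split the remainder 70/15/15.
import json
import random

with open("tag_cardinality_training_data.json") as f:
    examples = json.load(f)

clear = [ex for ex in examples if ex["label"] != "BORDERLINE"]  # 191 -> 163

random.seed(42)   # assumption: any fixed seed, for a reproducible split
random.shuffle(clear)

n = len(clear)
train = clear[: int(0.70 * n)]
val = clear[int(0.70 * n): int(0.85 * n)]
test = clear[int(0.85 * n):]
print(len(train), len(val), len(test))   # 114 / 24 / 25 for 163 examples
```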
Fine-tune Qwen models with LoRA.
python3 train_lora.py \
--model Qwen/Qwen2.5-1.5B \
--train-data train_data.jsonl \
--val-data val_data.jsonl \
--epochs 20 \
--batch-size 4 \
--learning-rate 1e-4 \
--lora-rank 16 \
--output-dir checkpoints

Key Features:
- Auto-detects device (MPS for Apple Silicon, CUDA, or CPU)
- Early stopping to prevent overfitting (patience=2)
- Gradient checkpointing to save memory
- Saves best model based on validation loss
- Evaluates every 20 steps
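Device auto-detection follows the standard PyTorch pattern, roughly:

```python
# Pick MPS (Apple Silicon), then CUDA, then CPU.
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Training on: {device}")
```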
Evaluate fine-tuned models on test data.
python3 evaluate_lora.py \
--base-model Qwen/Qwen2.5-1.5B \
--adapter-path checkpoints/final_model \
--test-data test_data.jsonl \
--compare    # Compare base vs fine-tuned

Output:
- Overall accuracy
- Per-label accuracy (HIGH/LOW)
- Detailed per-example results with raw model output
- Improvement summary (if --compare used)
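The per-label breakdown is just counting correct predictions per gold label; a self-contained sketch (accuracy_report is a hypothetical helper, not part of evaluate_lora.py):

```python
# Compute overall and per-label accuracy from (expected, predicted) pairs.
from collections import defaultdict

def accuracy_report(pairs):
    total = correct = 0
    per_label = defaultdict(lambda: [0, 0])   # label -> [correct, total]
    for expected, predicted in pairs:
        total += 1
        per_label[expected][1] += 1
        if predicted == expected:
            correct += 1
            per_label[expected][0] += 1
    print(f"Overall: {correct / total:.1%}")
    for label, (c, t) in sorted(per_label.items()):
        print(f"  {label}: {c / t:.1%} ({c}/{t})")

# Hypothetical results, for illustration only
accuracy_report([("HIGH", "HIGH"), ("HIGH", "LOW"), ("LOW", "LOW"), ("LOW", "LOW")])
```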
Test baseline models via Ollama API with explicit prompting.
python3 test_baselines.py --test-data test_data.jsonl

Requirements:
- Ollama running locally
- Models installed:
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b
Output:
- Baseline accuracy for all models
- Per-label breakdown
- Comparison table
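A single zero-shot call against the local Ollama API looks roughly like this (classify_zero_shot is a hypothetical helper, not the actual test_baselines.py code; the prompt mirrors the template used for training):

```python
# Zero-shot classification of one tag via Ollama's /api/generate endpoint.
import json
import urllib.request

def classify_zero_shot(model: str, tag: str) -> str:
    prompt = (
        f"Classify the cardinality of this observability metric tag: '{tag}'\n\n"
        "Respond with ONLY ONE WORD:\n"
        "- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n"
        "- LOW (for tags with few unique values like environment, region, status)\n\n"
        "Answer:"
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

print(classify_zero_shot("qwen2.5:3b", "user_id"))   # expected: HIGH
```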
Install dependencies:

pip3 install torch transformers peft datasets accelerate

If MPS/CUDA isn't available, training falls back to CPU automatically. To force CPU:

python3 train_lora.py ... --device cpu

If you run out of memory, reduce the batch size:

python3 train_lora.py ... --batch-size 2

If training stops early: this is expected when validation loss doesn't improve for 2 consecutive evaluations. The best model is saved automatically. This prevents overfitting.
If accuracy after fine-tuning is still low, check for:
- Dataset imbalance (should be roughly balanced: 52 HIGH / 62 LOW in training)
- Training too few epochs (increase --epochs)
- Learning rate too high (try 5e-5 instead of 1e-4)
- Prompt format mismatch between training and evaluation
| Model | Baseline | After Fine-tuning | Improvement |
|---|---|---|---|
| Qwen2.5-0.5B | 48.0% (near-random) | TBD | TBD |
| Qwen2.5-1.5B | 60.0% (model collapse) | TBD | TBD |
| Qwen2.5-3B | 80.0% | TBD | TBD |
Target: Beat 80% baseline with smaller models (0.5B, 1.5B) for efficient deployment.
- Evaluate fine-tuned models: Run the evaluation commands above
- Compare model sizes: which offers the best balance of accuracy vs. inference speed?
- Export best model: Use export_to_ollama.py to deploy
- Add more training data: Collect examples from production logs
- Re-introduce BORDERLINE: If needed, train separate 3-class model
We removed BORDERLINE tags (pod_name, hostname, container_id) because:
- They're ambiguous: High cardinality but legitimate
- Unclear training signal for the model
- Better to focus on clear HIGH vs LOW cases first
If you need to handle BORDERLINE tags:
- Keep using the binary classifier (HIGH/LOW)
- Add heuristics: if a tag matches patterns like `*_name`, `*_id`, or `host*`, classify it manually (see the sketch below)
- Or train a separate 3-class model later
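A tiny sketch of that heuristic (needs_manual_review and the pattern list are hypothetical, shown only to illustrate the idea):

```python
# Route BORDERLINE-looking tag keys to manual review; everything else goes to
# the binary HIGH/LOW classifier.
from fnmatch import fnmatch

BORDERLINE_PATTERNS = ["*_name", "*_id", "host*"]

def needs_manual_review(tag: str) -> bool:
    return any(fnmatch(tag, pattern) for pattern in BORDERLINE_PATTERNS)

print(needs_manual_review("pod_name"))     # True  -> classify manually
print(needs_manual_review("environment"))  # False -> binary classifier
```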
Based on LoRA best practices:
- Explicit markers (`### Instruction:`, `### Response:`) help the model understand task structure
- Inline examples in the prompt reduce the need for few-shot prompting
- Clearer than conversational format for classification tasks
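Concretely, the same template can be generated with a small helper (build_prompt is hypothetical; only the template text comes from this README):

```python
# Build the Instruction/Response prompt for a tag; append the label for
# training examples, leave it off at inference time.
def build_prompt(tag: str, label: str = "") -> str:
    return (
        "### Instruction:\n"
        f"Classify the cardinality of this observability metric tag: '{tag}'\n\n"
        "Respond with ONLY ONE WORD:\n"
        "- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n"
        "- LOW (for tags with few unique values like environment, region, status)\n\n"
        "Answer:\n\n"
        "### Response:\n"
    ) + label

print(build_prompt("timestamp", "HIGH"))   # training example
print(build_prompt("region"))              # inference prompt: the model completes the label
```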
Aggressive early stopping (patience=2) because:
- Small dataset (114 training examples) → overfits quickly
- Evaluation every 20 steps → 2 evals = ~40 steps without improvement
- Better to stop early than overfit on small dataset
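The stopping rule itself is simple; an illustration in plain Python (the actual script most likely relies on the Trainer's early-stopping callback rather than this function):

```python
# Stop once validation loss has failed to improve for `patience` consecutive evaluations.
def should_stop(eval_losses, patience=2):
    best = float("inf")
    evals_since_best = 0
    for loss in eval_losses:
        if loss < best:
            best, evals_since_best = loss, 0
        else:
            evals_since_best += 1
        if evals_since_best >= patience:
            return True
    return False

print(should_stop([0.90, 0.62, 0.55, 0.56, 0.57]))  # True: two evals without improvement
print(should_stop([0.90, 0.62, 0.55, 0.56, 0.51]))  # False: still improving
```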
MIT