Tag Cardinality Classification with Fine-tuned Qwen Models

Fine-tune small language models (Qwen2.5 0.5B/1.5B/3B) to classify observability metric tag keys by cardinality using LoRA.

Problem

Observability systems need to detect high-cardinality tags (such as timestamps and UUIDs) that can explode metric storage costs. Zero-shot, the small models either guess randomly or exhibit "model collapse": they ignore the input and predict the same class every time.

Baseline Results (Binary Classification - Zero-shot):

  • Qwen2.5-0.5B: 48.0% accuracy (random predictions, 54.5% HIGH / 42.9% LOW)
  • Qwen2.5-1.5B: 60.0% accuracy (severe model collapse: 9.1% HIGH / 100% LOW)
  • Qwen2.5-3B: 80.0% accuracy (best baseline: 63.6% HIGH / 92.9% LOW)
  • Qwen2.5-7B: 80.0% accuracy (balanced: 81.8% HIGH / 78.6% LOW)

Goal: Beat the 80% baseline by fine-tuning the smaller models (0.5B, 1.5B) for efficient deployment.

Classification Labels

Binary Classification (Clear cases only):

  • HIGH: Tags with many unique values (timestamps, IDs, UUIDs, IP addresses, session IDs)
  • LOW: Tags with few unique values (environment, region, status, method, service)

Removed: BORDERLINE tags (pod_name, container_id, hostname) were excluded as ambiguous training examples.

Dataset

  • Total: 163 examples (28 BORDERLINE tags removed from original 191)
  • Split: 70% train (114), 15% val (24), 15% test (25)
  • Balance:
    • Train: 62 LOW, 52 HIGH
    • Val: 10 LOW, 14 HIGH
    • Test: 14 LOW, 11 HIGH
  • Format: Instruction/Response format (based on LoRA best practices)
  • Source: tag_cardinality_training_data.json

Installation

# Install dependencies
pip3 install torch transformers peft datasets accelerate

# Verify installation
python3 -c "import torch; print(f'PyTorch: {torch.__version__}')"

Quick Start

1. Generate Training Data

python3 prepare_training_data.py \
  --input tag_cardinality_training_data.json \
  --output-dir .

This generates:

  • train_data.jsonl (114 examples)
  • val_data.jsonl (24 examples)
  • test_data.jsonl (25 examples)
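A quick sanity check on the generated splits (a minimal sketch; it assumes each JSONL record has a "text" field whose final word is the HIGH/LOW label, as shown in the Training Data Format section below):

import json
from collections import Counter

# Count examples and the label balance in each generated split.
for path in ["train_data.jsonl", "val_data.jsonl", "test_data.jsonl"]:
    labels = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Assumes the label is the last word of the "text" field.
            labels[record["text"].rsplit(None, 1)[-1]] += 1
    print(path, sum(labels.values()), dict(labels))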

2. Test Baseline Performance

Test zero-shot models with explicit prompting via Ollama:

# Ensure Ollama is running with models installed:
ollama pull qwen2.5:0.5b
ollama pull qwen2.5:1.5b
ollama pull qwen2.5:3b

python3 test_baselines.py --test-data test_data.jsonl

Expected Output:

qwen2.5:0.5b: 48.0% overall (54.5% HIGH, 42.9% LOW)
qwen2.5:1.5b: 60.0% overall (9.1% HIGH, 100% LOW) - Model collapse
qwen2.5:3b:   80.0% overall (63.6% HIGH, 92.9% LOW) - Best baseline!
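For reference, a single zero-shot query against the local Ollama API looks roughly like this (a minimal sketch using Ollama's default /api/generate endpoint; test_baselines.py may build its prompt differently):

import json
import urllib.request

# Build the explicit classification prompt for one tag key.
prompt = (
    "Classify the cardinality of this observability metric tag: 'timestamp'\n\n"
    "Respond with ONLY ONE WORD:\n"
    "- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n"
    "- LOW (for tags with few unique values like environment, region, status)\n\n"
    "Answer:"
)

# Send one non-streaming generation request to the local Ollama server.
payload = json.dumps({"model": "qwen2.5:0.5b", "prompt": prompt, "stream": False}).encode()
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"].strip())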

3. Fine-tune with LoRA

Train all three model sizes:

# 0.5B model (~5-10 minutes on Apple Silicon)
python3 train_lora.py \
  --model Qwen/Qwen2.5-0.5B \
  --train-data train_data.jsonl \
  --val-data val_data.jsonl \
  --epochs 20 \
  --batch-size 4 \
  --learning-rate 1e-4 \
  --lora-rank 16 \
  --output-dir checkpoints_0.5b_clean

# 1.5B model (~20-30 minutes on Apple Silicon)
python3 train_lora.py \
  --model Qwen/Qwen2.5-1.5B \
  --train-data train_data.jsonl \
  --val-data val_data.jsonl \
  --epochs 20 \
  --batch-size 4 \
  --learning-rate 1e-4 \
  --lora-rank 16 \
  --output-dir checkpoints_1.5b_clean

# 3B model (~40-60 minutes on Apple Silicon)
python3 train_lora.py \
  --model Qwen/Qwen2.5-3B \
  --train-data train_data.jsonl \
  --val-data val_data.jsonl \
  --epochs 20 \
  --batch-size 4 \
  --learning-rate 1e-4 \
  --lora-rank 16 \
  --output-dir checkpoints_3b_clean

Training Configuration:

  • LoRA rank: 16, alpha: 32
  • Batch size: 4 (effective 16 with gradient accumulation)
  • Learning rate: 1e-4
  • Early stopping: patience=2 eval steps (stops if no improvement for 2 consecutive evaluations)
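The configuration above corresponds roughly to the following PEFT/Transformers setup (a sketch, not the exact contents of train_lora.py; the target modules and dropout are assumptions):

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,                     # assumed dropout
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="checkpoints_0.5b_clean",
    num_train_epochs=20,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,         # effective batch size 16
    learning_rate=1e-4,
    eval_strategy="steps",                 # "evaluation_strategy" on older transformers versions
    eval_steps=20,
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,           # keep the best checkpoint by validation loss
    gradient_checkpointing=True,
)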

4. Evaluate Fine-tuned Models

Compare base vs fine-tuned performance:

# Evaluate 0.5B
python3 evaluate_lora.py \
  --base-model Qwen/Qwen2.5-0.5B \
  --adapter-path checkpoints_0.5b_clean/final_model \
  --test-data test_data.jsonl \
  --compare

# Evaluate 1.5B
python3 evaluate_lora.py \
  --base-model Qwen/Qwen2.5-1.5B \
  --adapter-path checkpoints_1.5b_clean/final_model \
  --test-data test_data.jsonl \
  --compare

# Evaluate 3B
python3 evaluate_lora.py \
  --base-model Qwen/Qwen2.5-3B \
  --adapter-path checkpoints_3b_clean/final_model \
  --test-data test_data.jsonl \
  --compare

Success Criteria:

  • Overall accuracy: >80% (beat best baseline)
  • Per-label accuracy: >75% for both HIGH and LOW
  • No severe model collapse (both classes predicted)
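Per-label accuracy is what exposes model collapse: a collapsed model can look acceptable overall while scoring 0% on one class. A minimal sketch of the check (the names here are illustrative, not taken from evaluate_lora.py):

from collections import Counter

def accuracy_report(pairs):
    """Compute overall and per-label accuracy from (expected, predicted) pairs."""
    correct, total = Counter(), Counter()
    for expected, predicted in pairs:
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_label = {label: correct[label] / total[label] for label in total}
    return overall, per_label

# A collapsed model that always answers LOW scores 0% on HIGH:
print(accuracy_report([("HIGH", "LOW"), ("LOW", "LOW"), ("LOW", "LOW")]))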

Training Data Format

Instruction/Response format (following LoRA fine-tuning best practices):

{
  "text": "### Instruction:\nClassify the cardinality of this observability metric tag: 'timestamp'\n\nRespond with ONLY ONE WORD:\n- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n- LOW (for tags with few unique values like environment, region, status)\n\nAnswer:\n\n### Response:\nHIGH"
}

This format:

  • Makes task expectations crystal clear
  • Provides inline examples in the prompt
  • Uses explicit ### Instruction: and ### Response: markers
  • Matches the format from LoRA fine-tuning best practices
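At evaluation time the same structure can be reused by prompting up to the ### Response: marker and reading the first HIGH/LOW token out of the generation (a sketch; evaluate_lora.py may parse output differently):

def build_prompt(tag: str) -> str:
    # Everything up to and including the "### Response:" marker is the prompt.
    return (
        "### Instruction:\n"
        f"Classify the cardinality of this observability metric tag: '{tag}'\n\n"
        "Respond with ONLY ONE WORD:\n"
        "- HIGH (for tags with many unique values like timestamps, IDs, UUIDs)\n"
        "- LOW (for tags with few unique values like environment, region, status)\n\n"
        "Answer:\n\n"
        "### Response:\n"
    )

def parse_label(generated: str) -> str:
    # Take the first HIGH/LOW token after the response marker; anything else
    # counts as an invalid prediction.
    answer = generated.split("### Response:")[-1].strip().upper()
    for token in answer.split():
        if token in ("HIGH", "LOW"):
            return token
    return "INVALID"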

File Structure

qwen_tests/
├── README.md                          # This file
├── requirements.txt                    # Python dependencies
├── .gitignore                          # Git ignore patterns
│
├── tag_cardinality_training_data.json # Source dataset (191 examples)
│
├── train_data.jsonl                   # Training data (114 examples)
├── val_data.jsonl                     # Validation data (24 examples)
├── test_data.jsonl                    # Test data (25 examples)
│
├── prepare_training_data.py           # Generate training data
├── train_lora.py                      # Fine-tune with LoRA
├── evaluate_lora.py                   # Evaluate fine-tuned model
├── test_baselines.py                  # Test baseline via Ollama
└── export_to_ollama.py                # Export to Ollama format (TODO)

Scripts Reference

prepare_training_data.py

Generate training data with binary classification (HIGH/LOW only).

python3 prepare_training_data.py \
  --input tag_cardinality_training_data.json \
  --output-dir .

Features:

  • Filters out BORDERLINE tags (28 ambiguous examples removed)
  • Generates Instruction/Response format
  • Creates balanced 70/15/15 train/val/test split
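The filtering and split logic amounts to something like the following (a sketch; the field names in tag_cardinality_training_data.json are assumptions, and the real script also wraps each example in the Instruction/Response format):

import json
import random

with open("tag_cardinality_training_data.json") as f:
    records = json.load(f)

# Drop the ambiguous BORDERLINE examples, keeping only clear HIGH/LOW cases.
clear = [r for r in records if r["label"] != "BORDERLINE"]
random.seed(42)
random.shuffle(clear)

# 70/15/15 train/val/test split.
n = len(clear)
train_end = int(n * 0.70)
val_end = train_end + int(n * 0.15)
splits = {
    "train_data.jsonl": clear[:train_end],
    "val_data.jsonl": clear[train_end:val_end],
    "test_data.jsonl": clear[val_end:],
}
for path, rows in splits.items():
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")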

train_lora.py

Fine-tune Qwen models with LoRA.

python3 train_lora.py \
  --model Qwen/Qwen2.5-1.5B \
  --train-data train_data.jsonl \
  --val-data val_data.jsonl \
  --epochs 20 \
  --batch-size 4 \
  --learning-rate 1e-4 \
  --lora-rank 16 \
  --output-dir checkpoints

Key Features:

  • Auto-detects device (MPS for Apple Silicon, CUDA, or CPU)
  • Early stopping to prevent overfitting (patience=2)
  • Gradient checkpointing to save memory
  • Saves best model based on validation loss
  • Evaluates every 20 steps
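Device auto-detection is a short check in PyTorch; a sketch of the expected behavior:

import torch

def detect_device() -> str:
    # Prefer Apple Silicon (MPS), then CUDA, then fall back to CPU.
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(f"Using device: {detect_device()}")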

evaluate_lora.py

Evaluate fine-tuned models on test data.

python3 evaluate_lora.py \
  --base-model Qwen/Qwen2.5-1.5B \
  --adapter-path checkpoints/final_model \
  --test-data test_data.jsonl \
  --compare  # Compare base vs fine-tuned

Output:

  • Overall accuracy
  • Per-label accuracy (HIGH/LOW)
  • Detailed per-example results with raw model output
  • Improvement summary (if --compare used)

test_baselines.py

Test baseline models via Ollama API with explicit prompting.

python3 test_baselines.py --test-data test_data.jsonl

Requirements:

  • Ollama running locally
  • Models installed: qwen2.5:0.5b, qwen2.5:1.5b, qwen2.5:3b (see the ollama pull commands in Quick Start)

Output:

  • Baseline accuracy for all models
  • Per-label breakdown
  • Comparison table

Troubleshooting

ModuleNotFoundError: No module named 'torch'

Install dependencies:

pip3 install torch transformers peft datasets accelerate

MPS backend error on Apple Silicon

The training script falls back to CPU automatically. To force CPU:

python3 train_lora.py ... --device cpu

CUDA out of memory

Reduce batch size:

python3 train_lora.py ... --batch-size 2

Training stops early (early stopping triggered)

This is expected if validation loss doesn't improve for 2 consecutive evaluations. The best model is saved automatically. This prevents overfitting.

Model still predicts everything as one class after fine-tuning

Check for:

  • Dataset imbalance (should be roughly balanced: 52 HIGH / 62 LOW in training)
  • Training too few epochs (increase --epochs)
  • Learning rate too high (try 5e-5 instead of 1e-4)
  • Prompt format mismatch between training and evaluation

Expected Results

Model          Baseline                     After Fine-tuning   Improvement
Qwen2.5-0.5B   48.0% (random predictions)   TBD                 TBD
Qwen2.5-1.5B   60.0% (model collapse)       TBD                 TBD
Qwen2.5-3B     80.0%                        TBD                 TBD

Target: Beat 80% baseline with smaller models (0.5B, 1.5B) for efficient deployment.

Next Steps

  1. Evaluate fine-tuned models: Run the evaluation commands above
  2. Compare model sizes: Which offers the best balance of accuracy vs. inference speed?
  3. Export best model: Use export_to_ollama.py to deploy
  4. Add more training data: Collect examples from production logs
  5. Re-introduce BORDERLINE: If needed, train separate 3-class model

Key Design Decisions

Why Binary Classification?

We removed BORDERLINE tags (pod_name, hostname, container_id) because:

  • They're ambiguous: high cardinality, but legitimate
  • They provide an unclear training signal for the model
  • Better to focus on clear HIGH vs LOW cases first

If you need to handle BORDERLINE tags:

  1. Keep using binary classifier (HIGH/LOW)
  2. Add heuristics: if a tag matches patterns like *_name, *_id, or host*, classify it manually (see the sketch below)
  3. OR train separate 3-class model later
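A minimal sketch of the heuristic route from item 2 (the patterns and function name are illustrative):

import fnmatch

BORDERLINE_PATTERNS = ["*_name", "*_id", "host*"]

def is_borderline(tag: str) -> bool:
    # Route known-ambiguous tags to a manual policy before calling
    # the binary HIGH/LOW classifier.
    return any(fnmatch.fnmatch(tag, pattern) for pattern in BORDERLINE_PATTERNS)

print(is_borderline("pod_name"))  # True  -> handle manually
print(is_borderline("region"))    # False -> send to the classifier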

Why Instruction/Response Format?

Based on LoRA best practices:

  • Explicit markers (### Instruction:, ### Response:) help the model understand the task structure
  • Inline examples in prompt reduce need for few-shot prompting
  • Clearer than conversational format for classification tasks

Why Early Stopping = 2?

Aggressive early stopping (patience=2) because:

  • Small dataset (114 training examples) → overfits quickly
  • Evaluation every 20 steps → 2 evals = ~40 steps without improvement
  • Better to stop early than overfit on small dataset
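In Transformers this is typically just an EarlyStoppingCallback passed to the Trainer (a sketch; it assumes load_best_model_at_end=True and evaluation every 20 steps, as configured above):

from transformers import EarlyStoppingCallback

# Stop after 2 evaluations without validation-loss improvement
# (with eval_steps=20, roughly 40 steps without progress).
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
# trainer = Trainer(..., callbacks=[early_stopping])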

License

MIT
