This project implements a multi-stage training pipeline for optimizing meeting recommendations at conferences using agentic reinforcement learning techniques. You'll build an intelligent agent that learns to predict successful meeting outcomes and make optimal recommendations.
This starter code provides the framework for implementing supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) using Direct Preference Optimization (DPO).
## Requirements

```text
torch>=2.0.0
transformers>=4.36.0
peft>=0.7.0
trl>=0.7.0
datasets>=2.14.0
pandas>=2.0.0
numpy>=1.24.0
npcpy>=0.1.0
```

## Setup

- Clone this repository to your local machine
- Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install required dependencies:

```bash
pip install torch transformers peft trl datasets pandas numpy npcpy
```

- Verify installation by checking imports:

```bash
python -c "import torch, transformers, peft, trl, npcpy; print('All dependencies installed successfully')"
```

## Project Structure

```text
starter/
├── data_classes.py            # Core data models and simulation logic
├── starter_sft.py             # Stage 1: Supervised Fine-Tuning implementation
├── starter_agentic_traces.py  # Stage 2: Agent trace collection
└── starter_agentic_rlft.py    # Stage 3: Reinforcement Learning Fine-Tuning
```
## Stage 1: Supervised Fine-Tuning (SFT)

- Open `starter/starter_sft.py`
- Complete the `SFTConfig` dataclass by filling in the `'YOUR CODE HERE'` placeholders (a hedged configuration sketch follows this list):
  - Set appropriate LoRA parameters (r, alpha, dropout)
  - Configure training hyperparameters (epochs, learning rate, weight decay)
  - Determine the overfitting threshold
  - Set `max_new_tokens` for generation
- Run the SFT training:

```bash
python starter/starter_sft.py
```

- Monitor training loss and validation metrics
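One possible way to fill in the `SFTConfig` placeholders. The field names and the specific values (LoRA rank, learning rate, thresholds) are illustrative assumptions, not required settings; match the field names to the actual dataclass in `starter_sft.py`.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    # LoRA parameters (illustrative starting points)
    lora_r: int = 16            # low-rank dimension
    lora_alpha: int = 32        # scaling factor, commonly 2 * r
    lora_dropout: float = 0.05

    # Training hyperparameters (conservative defaults to limit overfitting)
    num_epochs: int = 3
    learning_rate: float = 2e-4
    weight_decay: float = 0.01

    # Stop training if loss falls below this value (see Tips)
    overfitting_threshold: float = 0.01

    # Generation length when producing validation predictions
    max_new_tokens: int = 64
```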
## Stage 2: Agent Trace Collection

- Open `starter/starter_agentic_traces.py`
- Review the agent personas and their decision-making strategies
- Implement any missing tool functions (marked with `'YOUR CODE HERE'`)
- Generate agent traces:

```bash
python starter/starter_agentic_traces.py
```

- Examine the generated CSV file with trace data (a quick inspection sketch follows this list)
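A minimal sketch for inspecting the generated traces with pandas. The CSV filename and column names are assumptions based on the trace fields used elsewhere in this README; adjust them to whatever `starter_agentic_traces.py` actually writes.

```python
import pandas as pd

# Hypothetical output path; change to the file produced by starter_agentic_traces.py
traces = pd.read_csv("agent_traces.csv")

print(traces.shape)
print(traces.columns.tolist())

# Rough health checks on the collected traces
if "completed_naturally" in traces.columns:
    print("Completion rate:", traces["completed_naturally"].mean())
if "tools_used" in traces.columns:
    # tools_used is assumed to be a list serialized to a string, e.g. "['tool1']"
    print("Traces with tool calls:", (traces["tools_used"].astype(str).str.len() > 2).mean())
```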
## Stage 3: Reinforcement Learning Fine-Tuning (RLFT)

- Open `starter/starter_agentic_rlft.py`
- Complete the `calculate_reward()` function with appropriate reward values
- Implement the preference pairing strategy in `create_preference_dataset_from_traces()` (a hedged sketch of both functions follows this list)
- Run RLFT with DPO:

```bash
python starter/starter_agentic_rlft.py
```

- Evaluate the improved model's performance
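One possible shape for the two functions, based on the trace fields shown in the testing snippet below (`final_recommendation_parsed`, `tools_used`, `completed_naturally`, `ground_truth`). The reward weights, the 0.5 decision threshold, and the `prompt`/`response` fields used for pairing are assumptions to tune against the real trace schema.

```python
def calculate_reward(trace: dict) -> float:
    """Score a single agent trace in [-1.0, 1.0]; the weights are illustrative."""
    reward = 0.0

    # Agreement between the YES/NO recommendation and the ground-truth probability
    recommendation = trace["final_recommendation_parsed"].get("recommendation")
    should_recommend = trace["ground_truth"] >= 0.5
    reward += 0.6 if (recommendation == "YES") == should_recommend else -0.6

    # Smaller bonuses for using tools and finishing without being cut off
    reward += 0.2 if trace["tools_used"] else -0.2
    reward += 0.2 if trace["completed_naturally"] else -0.2

    return max(-1.0, min(1.0, reward))


def create_preference_dataset_from_traces(traces: list[dict]) -> list[dict]:
    """Pair high- and low-reward traces sharing a prompt into DPO preference examples."""
    by_prompt: dict[str, list[dict]] = {}
    for trace in traces:
        by_prompt.setdefault(trace["prompt"], []).append(trace)

    pairs = []
    for prompt, group in by_prompt.items():
        ranked = sorted(group, key=calculate_reward, reverse=True)
        if len(ranked) >= 2 and calculate_reward(ranked[0]) > calculate_reward(ranked[-1]):
            pairs.append({
                "prompt": prompt,
                "chosen": ranked[0]["response"],     # highest-reward completion
                "rejected": ranked[-1]["response"],  # lowest-reward completion
            })
    return pairs
```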
## Testing

Test individual components:

```python
# Test the reward function on a hand-built trace
from starter_agentic_rlft import calculate_reward

trace = {
    "final_recommendation_parsed": {"recommendation": "YES"},
    "tools_used": ["tool1"],
    "completed_naturally": True,
    "ground_truth": 0.8,
}
reward = calculate_reward(trace)
assert -1.0 <= reward <= 1.0  # rewards must stay within [-1, 1]
```

Then check the end-to-end artifacts:

- Verify the SFT model saves correctly to `models/sft_prediction_model_gemma_270m` (a reload sketch follows this list)
- Check that agent traces are saved to CSV with all required fields
- Confirm the RLFT model shows improved accuracy over the SFT baseline
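A minimal sketch for reloading the saved SFT checkpoint as a sanity check, assuming it was written as a PEFT/LoRA adapter with the tokenizer saved alongside it; the prompt string is purely hypothetical.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

path = "models/sft_prediction_model_gemma_270m"
model = AutoPeftModelForCausalLM.from_pretrained(path)  # loads the base model plus adapter
tokenizer = AutoTokenizer.from_pretrained(path)         # assumes tokenizer files were saved here

prompt = "Predict the probability that this meeting succeeds:"  # hypothetical prompt format
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```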
## Evaluation Metrics

- SFT: correlation and MAE between predicted and ground-truth probabilities (see the sketch after this list)
- Agent traces: completion rate, tool usage, and recommendation accuracy
- RLFT: final accuracy improvement over the SFT baseline
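A short sketch of the SFT metrics, assuming you have collected predicted and ground-truth probabilities as arrays (the values below are made up for illustration):

```python
import numpy as np

# Illustrative arrays of predicted vs. ground-truth meeting-success probabilities
predicted = np.array([0.7, 0.2, 0.9, 0.4])
ground_truth = np.array([0.8, 0.1, 0.85, 0.5])

correlation = np.corrcoef(predicted, ground_truth)[0, 1]  # Pearson correlation
mae = np.abs(predicted - ground_truth).mean()             # mean absolute error

print(f"Correlation: {correlation:.3f}, MAE: {mae:.3f}")
```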
## Deliverables

- Completed code: all `'YOUR CODE HERE'` placeholders filled with working implementations
- Trained models: SFT and RLFT model checkpoints
- Training report: a document including:
  - Chosen hyperparameters and their justification
  - Training curves (loss, validation metrics)
  - Performance comparison between stages
  - Analysis of agent behavior patterns
  - Lessons learned and challenges faced
## Tips

- Start with conservative hyperparameters to avoid overfitting
- Monitor training loss carefully and stop if it drops too low (below 0.01 is a suggested threshold); a callback sketch follows this list
- Experiment with different reward structures in RLFT
- Use smaller batch sizes if running on limited hardware
- Save checkpoints frequently during long training runs
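One way to enforce the loss threshold from the tip above, sketched as a Hugging Face `TrainerCallback`; wiring it into whichever trainer the starter scripts construct is an assumption left to you.

```python
from transformers import TrainerCallback

class StopOnLowLossCallback(TrainerCallback):
    """Stop training once the logged loss falls below a threshold (overfitting guard)."""

    def __init__(self, threshold: float = 0.01):
        self.threshold = threshold

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and logs.get("loss", float("inf")) < self.threshold:
            control.should_training_stop = True
        return control

# Usage, assuming a transformers/trl trainer instance named `trainer`:
# trainer.add_callback(StopOnLowLossCallback(threshold=0.01))
```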
## Dependencies

- PyTorch - Deep learning framework
- Transformers - Pre-trained models and training utilities
- PEFT - Parameter-efficient fine-tuning
- TRL - Transformer Reinforcement Learning
- NPCPy - Agent framework for NPC interactions