IFF - Instruction Following for Finance Framework

Overview

IFF (Instruction Following for Finance) is a specialized evaluation framework for assessing language models' ability to follow complex instructions in finance-specific contexts; it accompanies the paper Financial Instruction Following Evaluation (NeurIPS 2025). The framework provides over 40 specialized instruction checkers covering financial domains including equities, credit, FX, compliance, risk management, derivatives, and more.

Features

Core Capabilities

  • 40+ Finance-Specific Instructions: Comprehensive coverage of financial domains
  • Dual Evaluation Modes: Strict and loose evaluation strategies
  • Multi-Model Support: Tested with various LLMs (Kimi, Llama 3.1 8B, etc.)
  • Flexible Architecture: Modular design for easy extension
  • Detailed Reporting: Prompt-level and instruction-level accuracy metrics
  • Batch Processing: Efficient evaluation of large response sets
  • Type Safety: Full type hints with ty type checker integration

Financial Domains Covered

  • Equities & Trading: Market analysis, trading strategies, portfolio management
  • Credit & Fixed Income: Spread analysis, carry calculations, bond metrics
  • Foreign Exchange: FX calculations, cross-currency analysis
  • Compliance & Regulatory: Rule 10b-5, AML, regulatory reporting
  • Risk Management: VaR calculations, risk metrics, stress testing
  • Derivatives: Options pricing (Black-76), Greeks, structured products
  • Treasury Operations: Liquidity management, settlement processes
  • ESG & Climate Finance: Sustainability metrics, carbon accounting
  • Private Equity & VC: Deal structuring, valuation metrics
  • Quantitative Finance: Algorithmic strategies, pseudocode generation

Installation

Prerequisites

  • Python 3.12 or higher
  • Virtual environment (recommended)

Standard Installation

# Clone the repository
git clone <repository-url>
cd iff

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .

Using UV Package Manager (Recommended)

# Install UV if not already installed
pip install uv

# Install dependencies with UV
uv pip install .

Development Installation

# Install with development dependencies
pip install -e .

Quick Start

0. Configure API Keys

Create a .env file in the project root and add your model provider keys. For TogetherAI, set the TOGETHERAI_API_KEY variable:

cp .env.example .env
echo "TOGETHERAI_API_KEY=your_api_key" >> .env

1. Generate Test Inputs

python build_input_jsonl.py
# This creates examples/inputs.jsonl with test prompts
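
Each line of examples/inputs.jsonl is one test case. A hypothetical example line (field names match the InputExample dataclass documented under API Reference; the key, prompt, and kwargs values here are purely illustrative):

{"key": 1, "instruction_id_list": ["fin:risk:var_numbered_boldusd"], "prompt": "Report the desk's 1-day VaR as a numbered list with USD figures in bold.", "kwargs": [{}]}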

2. Generate Model Responses

# Using litellm for multi-provider support (OpenAI, Anthropic, Together, etc.)
# Other supported providers: openai, anthropic, azure, cohere, huggingface
python generate_responses.py \
  --input examples/instructions.jsonl \
  --provider together \
  --model "together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput"

# For reasoning models, add the --no-cot flag to keep only the final output
# (currently available with the Anthropic and TogetherAI providers).
# Note: untested with Anthropic models.
python generate_responses.py \
  --input examples/instructions.jsonl \
  --provider together \
  --model "deepseek-ai/DeepSeek-R1" \
  --no-cot

3. Run Evaluation

python evaluation_bin.py \
  --provider together \
  --model deepseek-v3.1
# or
python evaluation_bin.py --run_dir results/together/deepseek-v3.1/runs/2025-08-26_19-59-18

4. View Results

# Results are saved in:
results/<provider>/<model>/runs/
# e.g. results/together/gpt-oss-120b/runs/2025-08-26_18-23-14
# or   results/together/gpt-oss-120b/runs/latest

5. Analyze Results

python analyze_eval.py \
  --strict results/together/gpt-oss-120b/runs/latest/evaluations/strict.jsonl \
  --loose  results/together/gpt-oss-120b/runs/latest/evaluations/loose.jsonl \
  --out    eval_reports/gpt-oss-120b_latest

Architecture

Module Structure

iff/
├── Core Modules
│   ├── evaluation_lib.py        # Core evaluation engine
│   ├── evaluation_bin.py        # CLI evaluation runner
│   ├── instructions_registry.py # Instruction registration system
│   └── instructions_util.py     # Utility functions for text processing
│
├── Instruction Modules
│   ├── instructions_iff.py      # General instruction checkers
│   └── finance_instructions.py  # Finance-specific instruction checkers
│
├── Data Generation
│   ├── build_input_jsonl.py     # Generate test inputs
│   └── generate_responses.py    # Multi-provider response generation
│
├── Configuration
│   ├── pyproject.toml           # Project metadata and dependencies
│   └── requirements.txt         # Python dependencies
│
└── Data Directories
    ├── results/              # Responses and evaluation results
    └── eval_reports/         # Analysis of output responses

Data Flow

graph LR
    A[Input Prompts] --> B[LLM Response Generation]
    B --> C[Response Collection]
    C --> D[Evaluation Engine]
    D --> E[Instruction Checkers]
    E --> F[Results & Metrics]

Instruction Categories

Core Finance Instructions (Top 10)

  1. fin:equities:bold_intro_italic_risk - Equity analysis with formatting
  2. fin:credit:table_spread_vs_carry - Credit spread and carry analysis
  3. fin:fx:calc_codeblock_limit - FX calculations with code blocks
  4. fin:compliance:rule10b5_numbered - Compliance rule formatting
  5. fin:ops:settlement_checklist - Settlement process checklists
  6. fin:ir:six_bullets_verb_buyback - Investor relations bullet points
  7. fin:treasury:liquidity_risk_section - Treasury risk sections
  8. fin:deriv:black76_latex_sigma - Derivatives pricing formulas (Black-76 form shown below)
  9. fin:risk:var_numbered_boldusd - VaR risk metrics
  10. fin:pe:subheaders_dashes - Private equity formatting
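
For reference, the standard Black-76 call formula that the fin:deriv:black76_latex_sigma checker presumably targets (the exact LaTeX requirement lives in finance_instructions.py):

c = e^{-rT}\left[F\,N(d_1) - K\,N(d_2)\right],\qquad
d_1 = \frac{\ln(F/K) + \sigma^2 T/2}{\sigma\sqrt{T}},\qquad
d_2 = d_1 - \sigma\sqrt{T}

where F is the forward price, K the strike, σ the volatility, T the time to expiry, and r the discount rate.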

Extended Finance Instructions (11-40+)

  • Quantitative analysis and pseudocode generation
  • Cryptocurrency and digital asset reporting
  • Asset-backed securities analysis
  • REIT analysis with word limits
  • Structured products documentation
  • Central bank communications
  • Credit ratings analysis
  • Pension fund reporting
  • Margin and collateral management
  • ETF analysis with timestamps
  • And many more...

Usage Guide

Creating Custom Instructions

from instructions_util import InstructionChecker

class CustomFinanceInstruction(InstructionChecker):
    def build_description(self, **kwargs):
        self.inst_description = "Your instruction description"

    def check_following(self, text):
        # Implement your checking logic; return True only when the
        # response satisfies the instruction, e.g. require a bolded span:
        return "**" in text

Registering Instructions

# In instructions_registry.py
CANONICAL["fin:custom:instruction"] = CustomFinanceInstruction
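
A hypothetical usage sketch, assuming CANONICAL is a plain dict keyed by instruction ID (as above) and that checkers take no constructor arguments:

# Resolve a registered checker by ID and exercise it (illustrative)
from instructions_registry import CANONICAL

checker_cls = CANONICAL["fin:custom:instruction"]
checker = checker_cls()          # assumes a no-arg constructor
checker.build_description()      # set up the instruction text
print(checker.check_following("**Bold** summary of the position."))  # True/False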

Running Evaluations Programmatically

import evaluation_lib as eval_lib

# Load test data
inputs = eval_lib.read_prompt_list("examples/inputs.jsonl")
responses = eval_lib.read_prompt_to_response_dict("examples/responses.jsonl")

# Run evaluation
outputs = []
for inp in inputs:
    result = eval_lib.test_instruction_following_strict(inp, responses)
    outputs.append(result)

# Generate report
eval_lib.print_report(outputs)

Evaluation Modes

Strict Mode

  • Exact matching of instruction requirements
  • No tolerance for formatting deviations
  • Single response attempt

Loose Mode

  • Multiple response variants tested
  • Tolerates minor formatting issues
  • Removes markdown artifacts
  • Tests with trimmed versions (see the sketch below)
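
A minimal sketch of the loose-mode idea: an instruction counts as followed if any simple variant of the response passes the check. The exact variant set is defined in evaluation_lib.py; the one below is illustrative only.

def loose_variants(response: str) -> list[str]:
    # Illustrative variants: original, markdown-stripped, and trimmed copies.
    lines = response.split("\n")
    remove_first = "\n".join(lines[1:]).strip()   # drop a leading preamble line
    remove_last = "\n".join(lines[:-1]).strip()   # drop a trailing sign-off
    no_markdown = response.replace("*", "")       # strip bold/italic markers
    return [v for v in (response, no_markdown, remove_first, remove_last) if v]

# Loose pass: any(checker.check_following(v) for v in loose_variants(response))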

API Reference

Core Classes

InputExample

@dataclasses.dataclass
class InputExample:
    key: int                    # Unique identifier
    instruction_id_list: list   # List of instruction IDs to check
    prompt: str                 # The prompt text
    kwargs: list               # Parameters for each instruction

OutputExample

@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list      # Instructions checked
    prompt: str                    # Original prompt
    response: str                  # Model response
    follow_all_instructions: bool  # Overall success
    follow_instruction_list: list  # Per-instruction results
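
Each evaluation record serializes an OutputExample. A hypothetical line from evaluations/strict.jsonl (values illustrative):

{"instruction_id_list": ["fin:risk:var_numbered_boldusd"], "prompt": "...", "response": "...", "follow_all_instructions": true, "follow_instruction_list": [true]}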

Key Functions

evaluation_lib.py

  • read_prompt_list(path) - Load test prompts
  • read_prompt_to_response_dict(path) - Load responses
  • test_instruction_following_strict() - Strict evaluation
  • test_instruction_following_loose() - Loose evaluation
  • print_report(outputs) - Generate evaluation report

instructions_util.py

  • count_words(text) - Word counting
  • split_into_sentences(text) - Sentence tokenization
  • numbered_lines(text) - Extract numbered lists
  • bullet_lines(text) - Extract bullet points
  • Text formatting validators
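
For example (behavior inferred from the function names; see instructions_util.py for exact signatures):

import instructions_util as util

util.count_words("The spread widened by 12 bp.")           # -> number of words
util.split_into_sentences("Rates rose. Spreads widened.")  # -> list of two sentences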

Performance Metrics

Evaluation Metrics

  • Prompt-Level Accuracy: Percentage of prompts where all instructions are followed
  • Instruction-Level Accuracy: Percentage of individual instructions followed correctly (worked example below)
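
A tiny worked example: with 2 prompts of 2 instructions each and one failure on the second prompt, prompt-level accuracy is 1/2 and instruction-level accuracy is 3/4.

# Per-prompt, per-instruction pass/fail results (illustrative)
per_prompt = [[True, True], [True, False]]

prompt_level = sum(all(p) for p in per_prompt) / len(per_prompt)  # 0.50
flat = [ok for p in per_prompt for ok in p]
instruction_level = sum(flat) / len(flat)                         # 0.75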

Typical Results (Needs to be updated)

Model Performance (Example):
- Llama 3.1 8B:
  - Strict: prompt-level: 0.752, instruction-level: 0.891
  - Loose: prompt-level: 0.834, instruction-level: 0.923

- Kimi:
  - Strict: prompt-level: 0.698, instruction-level: 0.867
  - Loose: prompt-level: 0.792, instruction-level: 0.902

Development

Adding New Instructions

  1. Create instruction class in finance_instructions.py
  2. Register in instructions_registry.py
  3. Add test cases in build_input_jsonl.py
  4. Run evaluation pipeline

Testing

# Run unit tests
uv run pytest tests/

# Run specific test
uv run pytest tests/test_evaluation.py::test_strict_mode

# Run with coverage
uv run pytest tests/ --cov=. --cov-report=term-missing

Code Quality

# Type checking with ty (ultrafast Rust-based type checker)
uv run ty check .

# Format code with ruff
uv run ruff format .

# Lint with ruff
uv run ruff check .

# Run all checks
make all  # Runs lint, format, typecheck, and test

Type Checking

This project uses ty, Astral's ultrafast Python type checker written in Rust. All code includes comprehensive type hints.

# Install ty (included in dependencies)
uv add ty

# Run type checking
uv run ty check .

# Type check specific files
uv run ty check instructions_util.py finance_instructions.py

Configuration is in pyproject.toml under [tool.ty].

Troubleshooting

Common Issues

  1. NLTK Data Missing

    # The framework auto-downloads required NLTK data
    # Manual download if needed:
    import nltk
    nltk.download('punkt')
    nltk.download('punkt_tab')
    nltk.download('averaged_perceptron_tagger')
  2. Memory Issues with Large Datasets

    • Process in batches
    • Use --batch_size parameter
    • Increase system memory allocation
  3. API Rate Limits

    • Implement exponential backoff
    • Use --delay parameter between requests
    • Consider local model deployment

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit pull request

License

MIT License - See LICENSE file for details

Citation

If you use IFF in your research, please cite:

@software{iff2024,
  title = {IFF: Instruction Following for Finance},
  year = {2024},
  url = {https://github.com/gtfintechlab/IFF}
}

Contact

For questions and support, please open an issue on GitHub.

Acknowledgments

IFF is built upon and adapted from several open-source projects:

Core Framework

  • IFEval (Instruction Following Evaluation): The core evaluation framework is adapted from IFEval, originally developed by:

    • Google Research (evaluation_lib.py) - Licensed under Apache License 2.0
    • Allen Institute for AI (instructions_registry.py, instructions_util.py) - Licensed under Apache License 2.0

    We have adapted these components for finance-specific instruction evaluation while maintaining the original licensing terms.

Submodules

  • litellm-gateway: Custom gateway module for LLM integration
  • manuscript: Research paper and documentation

All original copyrights and licenses have been preserved in the respective source files. This project is distributed under the MIT License for new contributions while respecting the licensing terms of incorporated components.
