@srthkdev srthkdev commented Sep 8, 2025

Enhanced Metadata Export in vf-eval

Description

This PR enhances the metadata export in the vf-eval CLI tool so that running an evaluation with the -s flag exports more comprehensive information. The additional metadata will be useful for analysis and for display on the hub. Resolves #307.

Key Changes

  1. Runtime Tracking: Added evaluation runtime tracking via an eval_runtime_seconds field in the metadata
  2. Parser Results: Added a parsed_answer field containing the result of Parser.parse_answer()
  3. Rubric State Metadata: Added state_metadata field containing:
    • Judge responses from rubric state
    • Model responses from API calls
    • Tool call information
    • Other custom serializable state fields
  4. Additional Metrics: Added a total_rollouts field recording the total number of rollouts performed
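
As a rough illustration of how the four fields above fit together, here is a minimal sketch. It is not the code in this PR: build_metadata, serializable_state, and the rollout dict layout ("completion", "state") are hypothetical names chosen for the example, and placing parsed_answer and state_metadata per rollout is an assumption. Only the field names eval_runtime_seconds, parsed_answer, state_metadata, and total_rollouts come from the description.

```python
import json
import time
from typing import Any, Callable


def is_json_serializable(value: Any) -> bool:
    """Return True if json.dumps can encode the value."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


def serializable_state(state: dict) -> dict:
    """Keep only the state fields that can be written to JSON."""
    return {k: v for k, v in state.items() if is_json_serializable(v)}


def build_metadata(
    run_eval: Callable[[], list],
    parse_answer: Callable[[str], Any],
) -> dict:
    """Time a (hypothetical) evaluation callable and assemble export metadata."""
    start = time.perf_counter()
    rollouts = run_eval()  # assumed to return a list of rollout dicts
    elapsed = time.perf_counter() - start
    return {
        "eval_runtime_seconds": elapsed,
        "total_rollouts": len(rollouts),
        "rollouts": [
            {
                "parsed_answer": parse_answer(r["completion"]),
                "state_metadata": serializable_state(r["state"]),
            }
            for r in rollouts
        ],
    }
```

Filtering the state with json.dumps keeps the export from failing on non-serializable objects such as client handles, at the cost of silently dropping those fields.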

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

  • All existing tests pass
  • Code compiles successfully without syntax errors
  • Enhanced metadata fields are properly structured and exported

Test Coverage

  • Current coverage: Not changed
  • Coverage after changes: Not changed

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

Potential Additional Enhancements

More Detailed Performance Metrics:

  • Add breakdown of time spent in different phases (API calls, parsing, scoring)
  • Add throughput metrics (rollouts per second)
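
One possible shape for these two metrics is sketched below. PhaseTimer and rollouts_per_second are hypothetical helpers, not part of vf-eval; the phase names are whatever the caller passes in.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase (hypothetical helper)."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised an exception.
            self.totals[name] += time.perf_counter() - start


def rollouts_per_second(total_rollouts: int, runtime_seconds: float) -> float:
    """Throughput; returns 0.0 when the runtime is not positive."""
    if runtime_seconds <= 0:
        return 0.0
    return total_rollouts / runtime_seconds
```

A caller would wrap each phase in `with timer.phase("scoring"): ...` and export `dict(timer.totals)` alongside the throughput value.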

Environment-Specific Metadata:

  • Extract and export environment-specific configuration
  • Include dataset information (size, source, etc.)

Model Response Analysis:

  • Add token usage statistics from model responses
  • Include finish reasons from model responses
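
A sketch of the aggregation, assuming each model response has already been converted to a plain dict in the OpenAI chat completion layout (a "usage" object with prompt_tokens and completion_tokens, and "choices" entries with finish_reason). The function name summarize_responses is hypothetical, and other providers may use different field names.

```python
from collections import Counter


def summarize_responses(responses: list) -> dict:
    """Sum token usage and count finish reasons across response dicts."""
    prompt_tokens = 0
    completion_tokens = 0
    finish_reasons = Counter()
    for response in responses:
        usage = response.get("usage") or {}  # usage may be missing or None
        prompt_tokens += usage.get("prompt_tokens", 0)
        completion_tokens += usage.get("completion_tokens", 0)
        for choice in response.get("choices", []):
            finish_reasons[choice.get("finish_reason") or "unknown"] += 1
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "finish_reasons": dict(finish_reasons),
    }
```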

Enhanced Error Tracking:

  • Track and export information about failed rollouts
  • Include error types and counts
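
Failure tracking could look roughly like the following. This is a sketch under the assumption that each rollout can be modeled as a zero-argument callable; run_with_error_tracking and the output keys failed_rollouts and error_counts are hypothetical names.

```python
from collections import Counter
from typing import Any, Callable, Iterable


def run_with_error_tracking(rollout_fns: Iterable[Callable[[], Any]]):
    """Run each rollout, collecting results and a summary of failures."""
    results = []
    failures = []
    for index, fn in enumerate(rollout_fns):
        try:
            results.append(fn())
        except Exception as exc:  # record the failure and continue
            failures.append(
                {
                    "index": index,
                    "error_type": type(exc).__name__,
                    "message": str(exc),
                }
            )
    error_counts = Counter(f["error_type"] for f in failures)
    return results, {
        "failed_rollouts": failures,
        "error_counts": dict(error_counts),
    }
```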

System Information:

  • Export system information (Python version, library versions)
  • Include hardware information when available
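
Most of this information is available from the standard library. The sketch below uses platform and importlib.metadata; collect_system_info and the returned key names are hypothetical, and the caller decides which package names to look up.

```python
import platform
from importlib import metadata
from typing import Iterable


def collect_system_info(packages: Iterable[str] = ()) -> dict:
    """Gather Python, platform, and installed-package version information."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "library_versions": versions,
    }
```

Hardware details beyond the machine architecture (for example GPU model) are not covered by the standard library and would need an optional dependency.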
