feat(eval): add evaluation runtime tracking and detailed state metadata #308

srthkdev · 2025-09-08T12:49:07Z

Enhanced Metadata Export in vf-eval

Description

This PR enhances the metadata export in the vf-eval CLI tool so that running evaluations with the -s flag exports more comprehensive information. The additional metadata is intended for analysis and for display on the hub. Resolves #307.

Key Changes

Runtime Tracking: Added evaluation runtime tracking with eval_runtime_seconds field in metadata
Parser Results: Added parsed_answer field containing results from Parser.parse_answer()
Rubric State Metadata: Added state_metadata field containing:
- Judge responses from rubric state
- Model responses from API calls
- Tool call information
- Other custom serializable state fields
Additional Metrics: Added total_rollouts field to track total number of rollouts performed
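
To illustrate how the fields listed above might fit together, here is a minimal sketch. It is not the code in this PR: `build_metadata`, `run_eval`, and `is_json_serializable` are hypothetical names, and rollout states are assumed to be plain dictionaries. Only the field names `eval_runtime_seconds`, `total_rollouts`, and `state_metadata` come from the description.

```python
import json
import time


def is_json_serializable(value):
    """Return True if json.dumps can encode the value (hypothetical helper)."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


def build_metadata(states, run_eval):
    """Time a hypothetical run_eval callable and keep only serializable state.

    states: list of dicts, one per rollout (assumed shape).
    run_eval: zero-argument callable standing in for the evaluation itself.
    """
    start = time.perf_counter()
    run_eval()
    runtime = time.perf_counter() - start
    return {
        "eval_runtime_seconds": runtime,
        "total_rollouts": len(states),
        # Drop fields such as client objects that cannot be written to JSON.
        "state_metadata": [
            {key: value for key, value in state.items() if is_json_serializable(value)}
            for state in states
        ],
    }
```

Filtering with `json.dumps` is one simple way to honor the "custom serializable state fields" constraint; the actual implementation may decide serializability differently.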

Type of Change

New feature (non-breaking change which adds functionality)

Testing

All existing tests pass
Code compiles successfully without syntax errors
Enhanced metadata fields are properly structured and exported

Test Coverage

Current coverage: Not changed
Coverage after changes: Not changed

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Potential Additional Enhancements

More Detailed Performance Metrics:

Add breakdown of time spent in different phases (API calls, parsing, scoring)
Add throughput metrics (rollouts per second)
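
A rough sketch of what these two metrics could look like, assuming the evaluation loop can be wrapped in a context manager. `timed_phase`, `phase_seconds`, and `rollouts_per_second` are hypothetical names and are not part of vf-eval.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated seconds per phase name, e.g. "api_calls", "parsing", "scoring".
phase_seconds = defaultdict(float)


@contextmanager
def timed_phase(name):
    """Add the elapsed time of the wrapped block to phase_seconds[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start


def rollouts_per_second(total_rollouts, eval_runtime_seconds):
    """Throughput; returns 0.0 when no runtime was recorded."""
    if eval_runtime_seconds <= 0:
        return 0.0
    return total_rollouts / eval_runtime_seconds
```

Usage would be along the lines of `with timed_phase("parsing"): ...` around each phase, with `dict(phase_seconds)` exported next to `eval_runtime_seconds`.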

Environment-Specific Metadata:

Extract and export environment-specific configuration
Include dataset information (size, source, etc.)

Model Response Analysis:

Add token usage statistics from model responses
Include finish reasons from model responses
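
One possible shape for this, assuming each model response is available as a dictionary laid out like an OpenAI-style chat completion (`usage.prompt_tokens`, `usage.completion_tokens`, `choices[].finish_reason`). The function name `summarize_responses` and the output keys are hypothetical.

```python
from collections import Counter


def summarize_responses(responses):
    """Aggregate token usage and finish reasons over response dicts."""
    prompt_tokens = 0
    completion_tokens = 0
    finish_reasons = Counter()
    for response in responses:
        # Some providers omit usage entirely; treat that as zero tokens.
        usage = response.get("usage") or {}
        prompt_tokens += usage.get("prompt_tokens", 0)
        completion_tokens += usage.get("completion_tokens", 0)
        for choice in response.get("choices", []):
            finish_reasons[choice.get("finish_reason")] += 1
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "finish_reasons": dict(finish_reasons),
    }
```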

Enhanced Error Tracking:

Track and export information about failed rollouts
Include error types and counts
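
As a sketch of the idea, assuming each rollout records either `None` or the exception it raised; `summarize_failures` and its output keys are hypothetical.

```python
from collections import Counter


def summarize_failures(errors):
    """Count failed rollouts by exception type.

    errors: one entry per rollout, None on success or the raised exception.
    """
    counts = Counter(type(error).__name__ for error in errors if error is not None)
    return {
        "failed_rollouts": sum(counts.values()),
        "error_counts": dict(counts),
    }
```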

System Information:

Export system information (Python version, library versions)
Include hardware information when available
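
A minimal sketch using only the standard library. `collect_system_info` and its output keys are hypothetical, and the default package name `verifiers` is an assumption about which library version would be worth recording.

```python
import platform
from importlib import metadata


def collect_system_info(packages=("verifiers",)):
    """Gather interpreter, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly rather than failing the export.
            versions[name] = None
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "library_versions": versions,
    }
```

Hardware details beyond `platform.machine()` (GPU model, memory) generally need third-party packages, which is why the item above says "when available".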

feat(eval): add evaluation runtime tracking and detailed state metadata

ad52e16
