Fix MatrixDepot benchmark CI failures and improve error handling #1341

Open
wants to merge 1 commit into
base: master
ChrisRackauckas-Claude
Summary

  • Fixes the CI stalls in the MatrixDepot benchmark that caused PR #1035 ("Another MatrixDepot.jmd size bump") to fail
  • Implements graceful error handling to prevent huge error dumps
  • Adds progress tracking with heartbeats to prevent CI timeouts

Problem

PR #1035 attempted to increase the matrix size limit from 500 to 5000, but the CI system stalled because:

  1. Matrix factorization failures printed huge error stacktraces, flooding the logs
  2. No progress indication made CI think the job was frozen
  3. The benchmark output page became unreadable due to error dumps

Solution

This PR makes five changes:

1. Graceful Error Handling

  • Wraps individual algorithm benchmarks in try-catch blocks
  • Records NaN for failed algorithms without printing full stacktraces
  • Only logs brief error type information instead of full dumps
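A minimal sketch of this pattern, not the PR's actual code: `safe_benchmark` and its arguments are hypothetical names chosen for illustration. It times one solver call, returns `NaN` on failure, and logs only the exception type.

```julia
using LinearAlgebra

# Hypothetical helper (not from the PR): time a single solver call and
# return NaN instead of propagating the exception.
function safe_benchmark(solve, A, b; name = "solver")
    try
        return @elapsed solve(A, b)
    catch err
        # Log only the exception type, never the full stacktrace.
        @warn "$name failed" error_type = typeof(err)
        return NaN
    end
end

A_ok = [2.0 0.0; 0.0 3.0]
A_bad = zeros(2, 2)          # singular, so `lu` throws a SingularException
b = [1.0, 1.0]

t_ok = safe_benchmark((A, b) -> lu(A) \ b, A_ok, b; name = "LU")
t_bad = safe_benchmark((A, b) -> lu(A) \ b, A_bad, b; name = "LU")
```

Because `NaN` is still a `Float64`, failed runs can sit in the same results array as successful timings and be filtered out with `isnan` when plotting.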

2. Progress Tracking System

  • Adds heartbeat messages every 30 seconds to show CI the job is still running
  • Progress updates every 10 matrices showing:
    • Current matrix number and percentage complete
    • Running counts of successful/failed/skipped matrices
  • Final summary with total runtime and statistics
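The tracking described above could look roughly like the following sketch. The `BenchProgress` type and `record!` function are assumptions made for illustration; only the 30-second heartbeat, the 10-matrix interval, and the use of Dates come from this PR.

```julia
using Dates

# Hypothetical tracker (names are illustrative, not taken from the PR).
mutable struct BenchProgress
    total::Int
    done::Int
    succeeded::Int
    failed::Int
    skipped::Int
    last_heartbeat::DateTime
end

BenchProgress(total::Integer) = BenchProgress(total, 0, 0, 0, 0, now())

# Record one matrix outcome (:ok, :failed, or anything else for skipped).
function record!(p::BenchProgress, outcome::Symbol;
                 heartbeat = Second(30), every = 10)
    p.done += 1
    if outcome === :ok
        p.succeeded += 1
    elseif outcome === :failed
        p.failed += 1
    else
        p.skipped += 1
    end
    # Heartbeat: emit a short message if enough wall-clock time has passed.
    if now() - p.last_heartbeat >= heartbeat
        @info "Heartbeat: still running" done = p.done total = p.total
        p.last_heartbeat = now()
        flush(stdout); flush(stderr)
    end
    # Periodic progress line with running counts.
    if p.done % every == 0
        pct = round(100 * p.done / p.total; digits = 1)
        @info "Progress: $(p.done)/$(p.total) ($(pct)%)" succeeded = p.succeeded failed = p.failed skipped = p.skipped
    end
    return p
end

p = BenchProgress(20)
for i in 1:20
    record!(p, i % 5 == 0 ? :failed : :ok)
end
```

This version checks the clock only when a matrix finishes, so a single very slow factorization would still produce no output while it runs; a background task would be needed to cover that case.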

3. Conservative Size Limit

  • Increases limit from 100 to 1500 (more conservative than the 5000 proposed in PR #1035)
  • Tracks skipped matrices separately
  • Allows benchmarking larger matrices while avoiding CI timeouts
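As a rough sketch of the size filter: the PR does not say which dimension the limit applies to, so comparing the largest dimension against the limit is an assumption here, and `exceeds_limit` and `partition_by_size` are hypothetical names.

```julia
# The value 1500 comes from this PR; how it is compared against a matrix
# (largest dimension, as here) is an assumption for illustration.
const MAX_MATRIX_SIZE = 1500

exceeds_limit(A; limit = MAX_MATRIX_SIZE) = maximum(size(A)) > limit

# Split matrices into those to benchmark and those to skip, so skipped
# matrices can be counted separately from failures.
function partition_by_size(mats; limit = MAX_MATRIX_SIZE)
    kept = filter(A -> !exceeds_limit(A; limit = limit), mats)
    skipped = filter(A -> exceeds_limit(A; limit = limit), mats)
    return kept, skipped
end

mats = [zeros(2, 2), zeros(5, 3), zeros(1, 1)]
kept, skipped = partition_by_size(mats; limit = 3)
```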

4. Early Termination

  • Stops the benchmark once more than 100 matrices have failed, so a systematic problem does not run through the whole set
  • Prevents CI timeout from excessive failed attempts
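One way to express this cutoff, as a sketch under assumptions: `run_all`, `bench`, and `max_failures` are illustrative names, and a failure is represented as `NaN` to match the error-handling approach described earlier in this PR.

```julia
# Hypothetical driver loop (not the PR's code): stop once the number of
# failures exceeds `max_failures`.
function run_all(items, bench; max_failures = 100)
    results = Float64[]
    failures = 0
    for item in items
        t = try
            bench(item)
        catch
            NaN
        end
        push!(results, t)
        if isnan(t)
            failures += 1
            if failures > max_failures
                @warn "Too many failures, stopping early" failures
                break
            end
        end
    end
    return results, failures
end

# Odd inputs fail; with max_failures = 2 the loop stops at the third failure.
results, failures = run_all(1:10, x -> isodd(x) ? error("fail") : 1.0;
                            max_failures = 2)
```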

5. Better Logging

  • Uses @info, @warn, @debug for appropriate log levels
  • Flushes stdout/stderr regularly for real-time CI updates
  • Summarizes failed matrices at the end (up to 20 shown)
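The capped failure summary might be sketched as follows. `summarize_failures` is a hypothetical helper; only the cap of 20 and the use of `@info` with explicit flushing are taken from the description above.

```julia
# Hypothetical summary helper: list at most `max_shown` failed matrices.
function summarize_failures(failed::Vector{String}; max_shown = 20)
    shown = first(failed, max_shown)
    @info "Failed matrices: $(length(failed)) total, showing $(length(shown))"
    for name in shown
        @info "  failed: $name"
    end
    # Flush so the CI log shows the summary immediately.
    flush(stdout); flush(stderr)
    return shown
end

names = ["m$i" for i in 1:25]
shown = summarize_failures(names)
```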

Testing

Tested locally with a subset of matrices; error handling and progress tracking work correctly.

Related Issues

🤖 Generated with Claude Code

- Add graceful error handling for matrix factorization failures
- Implement progress tracking with regular heartbeats (every 30s)
- Add detailed progress logs every 10 matrices
- Increase matrix size limit from 100 to 1500 (more conservative than PR SciML#1035's 5000)
- Add early termination if >100 failures to prevent CI timeouts
- Capture failures silently without huge error dumps
- Add comprehensive summary statistics at benchmark completion
- Track successful, failed, and skipped matrices separately
- Use Dates package for timing and heartbeat mechanism

This fixes the CI stall issues by:
1. Preventing huge error printouts that flood logs
2. Providing regular heartbeats so CI knows job is still running
3. Limiting matrix sizes to avoid extremely long computations
4. Early termination on excessive failures

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>