@xlliu-scitix
Collaborator

…fer; unify event check logic; fix NCCL timeout detection

  • Refactor FileFilter:

    • Decouple file loading via a shared FileLoader and a centralized FileLoaderScheduler
    • Each EventFilter maintains its own read pointer and reads from the ring buffer
    • Return a structured Result with matched event info
  • Refactor event-based checkers:

    • Remove redundant Collector and Checker modules
    • Use EventFilter.Check() directly and integrate with spec-based check result
  • Fix NCCL event detection:

    • Change the nccl component to a podlog component, which provides more reusable event matching on pod logs
    • Only check events from the logs of running pods
  • Fix version print

  • Move nccl perftest to infiniband perftest
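
The FileFilter refactor above (one shared buffer, one read pointer per EventFilter) can be illustrated with a minimal Go sketch. This is not the code from this PR: `RingBuffer`, `NewRingBuffer`, the `Result` fields, the `Check(buf)` signature, and the substring-based `Pattern` are hypothetical stand-ins chosen for illustration, and the FileLoader and FileLoaderScheduler are reduced to direct `Write` calls. The sketch is single-threaded; a real loader running concurrently with the filters would need synchronization.

```go
package main

import (
	"fmt"
	"strings"
)

// RingBuffer is a hypothetical fixed-size line buffer filled by a single loader.
type RingBuffer struct {
	lines []string
	next  uint64 // total number of lines ever written
}

// NewRingBuffer allocates a buffer that keeps the most recent size lines.
func NewRingBuffer(size int) *RingBuffer {
	return &RingBuffer{lines: make([]string, size)}
}

// Write stores a line, overwriting the oldest one once the buffer is full.
func (b *RingBuffer) Write(line string) {
	b.lines[b.next%uint64(len(b.lines))] = line
	b.next++
}

// Result is a hypothetical structured match result.
type Result struct {
	Event string
	Line  string
}

// EventFilter keeps its own read pointer into the shared buffer,
// so filters advance independently of each other.
type EventFilter struct {
	Event   string
	Pattern string // plain substring match, an assumption for this sketch
	pos     uint64 // index of the next line this filter has not yet seen
}

// Check scans the lines written since the previous call and returns matches.
func (f *EventFilter) Check(b *RingBuffer) []Result {
	size := uint64(len(b.lines))
	// If the writer has lapped this reader, skip to the oldest surviving line.
	if b.next > size && f.pos < b.next-size {
		f.pos = b.next - size
	}
	var out []Result
	for ; f.pos < b.next; f.pos++ {
		line := b.lines[f.pos%size]
		if strings.Contains(line, f.Pattern) {
			out = append(out, Result{Event: f.Event, Line: line})
		}
	}
	return out
}

func main() {
	buf := NewRingBuffer(4)
	timeout := &EventFilter{Event: "NCCLTimeout", Pattern: "timeout"}
	oom := &EventFilter{Event: "OOM", Pattern: "out of memory"}

	buf.Write("NCCL watchdog timeout")
	buf.Write("step 10 done")
	fmt.Println(len(timeout.Check(buf)), len(oom.Check(buf)))

	buf.Write("CUDA out of memory")
	fmt.Println(len(timeout.Check(buf)), len(oom.Check(buf)))
}
```

Because each filter owns its pointer, the file content is loaded once and a filter that is checked rarely only loses lines if the writer overwrites them before its next `Check` call.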

@xlliu-scitix xlliu-scitix force-pushed the feat/refactor-eventfilter branch from 77ce817 to 868c4f6 Compare June 23, 2025 01:28