trevin-lee

Title

HeterogeneousCore/SonicTriton: add RetryActionDiffServer; expose connectToServer; update tests

Body

PR description

  • Add RetryActionDiffServer to switch to an alternative Triton server upon failure.
  • Expose TritonClient::connectToServer(std::string url) and add a testing constructor to enable unit tests.
  • Update BuildFile.xml; remove obsolete tritonRetryActionTest_cfg.py.
  • Expected physics/output changes: none in nominal operation; only affects behavior on retry path.
  • Builds on the retry framework work (related: fastmachinelearning/cmssw#19).
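The behavior described above (retry against a different server after a failure) can be sketched roughly as follows. This is a minimal, standalone illustration and not the CMSSW implementation: `ToyClient`, `ToyRetryDiffServer`, and the `infer` callback are hypothetical stand-ins, and only the `connectToServer` name is borrowed from the description.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a client that can be re-pointed at another server.
// This is NOT the real TritonClient; it only records the current URL.
class ToyClient {
public:
  explicit ToyClient(std::string url) : url_(std::move(url)) {}
  void connectToServer(const std::string& url) { url_ = url; }
  const std::string& url() const { return url_; }

private:
  std::string url_;
};

// Hypothetical retry action: if the first attempt fails, reconnect to the
// alternative server and try exactly once more.
class ToyRetryDiffServer {
public:
  explicit ToyRetryDiffServer(std::string altUrl) : altUrl_(std::move(altUrl)) {}

  // `infer` stands in for an inference call and returns true on success.
  bool run(ToyClient& client, const std::function<bool(const ToyClient&)>& infer) const {
    if (infer(client))
      return true;  // nominal path: the server is left untouched
    client.connectToServer(altUrl_);
    return infer(client);  // single retry on the alternative server
  }

private:
  std::string altUrl_;
};
```

In this sketch the nominal path never touches the server URL, which mirrors the claim that only the retry path changes behavior.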

PR validation

  • Builds in CMSSW_15_0_0_pre3 (scram b -j).
  • New retry action compiles; SonicTriton tests continue to pass.
  • Manually exercised retry by providing an alternative server URL; logs show server switch and successful inference.
  • No changes observed in nominal outputs.

Backport

  • Not a backport.

Reviewers: @jmduarte @kpedro88

@jmduarte jmduarte requested a review from kpedro88 August 12, 2025 03:12
@kpedro88

Preliminary comments:

  1. Do you mean 15_1_0_pre3 (not 15_0)?
  2. There are a few outstanding review comments on @kakwok's PR that should be addressed: Test PR for new Retry Framework #19 (review)

…lection; remove unused parameters and improve documentation.
…tructor and associated test file. Update BuildFile.xml to reflect changes in test structure and retry action configuration.
trevin-lee pushed a commit to trevin-lee/cmssw that referenced this pull request Aug 25, 2025
@trevin-lee
Author

Superseded by a new PR targeting 15_1_0_pre3: #22. Changes in the new PR:

  • Removes the altServer* params; uses the TritonService registry via TritonClient::updateServer(TritonService::Server::fallbackName).
  • Moves tests to HeterogeneousCore/SonicTriton/test with proper Catch2 setup; extends tritonTest_cfg.py with --retryAction {same,diff} and a log check.
  • Adds a protected default TritonClient ctor for unit testing.

Closing this PR; please continue review on the new one.
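
The move from explicit alternative-server parameters to a name-based registry lookup can be illustrated with a small standalone sketch. Everything here is an assumption for illustration: `ToyRegistry`, `RegistryClient`, and the `"fallback"` key are hypothetical and do not reproduce the TritonService or TritonClient interfaces.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical registry mapping server names to URLs.
class ToyRegistry {
public:
  void add(const std::string& name, const std::string& url) { servers_[name] = url; }

  // Returns the URL registered under `name`, or nullopt if the name is unknown.
  std::optional<std::string> urlFor(const std::string& name) const {
    auto it = servers_.find(name);
    if (it == servers_.end())
      return std::nullopt;
    return it->second;
  }

private:
  std::map<std::string, std::string> servers_;
};

// Hypothetical client that switches servers by registry name instead of
// carrying its own alternative-server parameters.
class RegistryClient {
public:
  explicit RegistryClient(std::string url) : url_(std::move(url)) {}

  // Returns false and leaves the current server unchanged if the name is unknown.
  bool updateServer(const ToyRegistry& registry, const std::string& name) {
    auto url = registry.urlFor(name);
    if (!url)
      return false;
    url_ = *url;
    return true;
  }

  const std::string& url() const { return url_; }

private:
  std::string url_;
};
```

With this shape, the retry action only needs to know a server name; the registry stays the single place where server addresses are configured.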

@trevin-lee trevin-lee closed this Aug 25, 2025
2 participants