GavinZhu-GMI
Contributor

Overview:

Fixes division-by-zero errors in the planner's autoscaling logic that prevented scaling decisions in disaggregated SGLang deployments.

Details:

Currently there are three division-by-zero scenarios in planner_core.py:make_adjustments():

  1. Line 375: self.p_correction_factor = self.last_metrics.ttft / expect_ttft
    • When expect_ttft = 0 from interpolation edge cases
  2. Line 384: self.d_correction_factor = self.last_metrics.itl / expect_itl
    • When expect_itl = 0 from interpolation edge cases
  3. Line 379: concurrency = self.last_metrics.num_req / len(self.d_endpoints)
    • When len(self.d_endpoints) = 0 (no decode workers initially)
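
The three scenarios above can be reproduced in isolation. This is a minimal sketch rather than the planner's code: `Metrics`, `fails_with_zero_division`, and the sample values are hypothetical stand-ins, and it assumes plain Python floats (other numeric types may behave differently when divided by zero).

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical stand-in for the planner's last_metrics object."""
    ttft: float
    itl: float
    num_req: float


def fails_with_zero_division(fn) -> bool:
    """Return True if calling fn raises ZeroDivisionError."""
    try:
        fn()
    except ZeroDivisionError:
        return True
    return False


last_metrics = Metrics(ttft=0.25, itl=0.02, num_req=8.0)
expect_ttft = 0.0  # assumed value for the interpolation edge case
expect_itl = 0.0   # assumed value for the interpolation edge case
d_endpoints = []   # no decode workers registered yet

results = [
    fails_with_zero_division(lambda: last_metrics.ttft / expect_ttft),
    fails_with_zero_division(lambda: last_metrics.itl / expect_itl),
    fails_with_zero_division(lambda: last_metrics.num_req / len(d_endpoints)),
]
print(results)  # [True, True, True]
```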

components/planner/src/dynamo/planner/utils/planner_core.py

1. TTFT Correction Factor (lines 375-379):

# Before
self.p_correction_factor = self.last_metrics.ttft / expect_ttft

# After
if expect_ttft > 0:
    self.p_correction_factor = self.last_metrics.ttft / expect_ttft
else:
    logger.warning(f"Expected TTFT is {expect_ttft}, using default correction factor 1.0")
    self.p_correction_factor = 1.0

2. Concurrency Calculation (lines 381-388):

# Before
concurrency = self.last_metrics.num_req / len(self.d_endpoints) * ...

# After
if len(self.d_endpoints) > 0:
    concurrency = (self.last_metrics.num_req / len(self.d_endpoints) * ...)
else:
    logger.warning("No decode workers available, using default concurrency of 1.0")
    concurrency = 1.0

3. ITL Correction Factor (lines 394-398):

# Before
self.d_correction_factor = self.last_metrics.itl / expect_itl

# After
if expect_itl > 0:
    self.d_correction_factor = self.last_metrics.itl / expect_itl
else:
    logger.warning(f"Expected ITL is {expect_itl}, using default correction factor 1.0")
    self.d_correction_factor = 1.0
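
All three guards follow the same pattern: divide only when the denominator is positive, otherwise log a warning and fall back to 1.0. As a possible follow-up, that pattern could be factored into one helper. The sketch below is an illustration only: `safe_ratio` is a hypothetical name and is not part of this PR or of planner_core.py.

```python
import logging

logger = logging.getLogger(__name__)


def safe_ratio(numerator: float, denominator: float, default: float, label: str) -> float:
    """Return numerator / denominator, or default when denominator is not positive."""
    if denominator > 0:
        return numerator / denominator
    # Zero or negative denominator: warn and use the fallback value.
    logger.warning("%s is %s, using default %s", label, denominator, default)
    return default


# Normal case: the ratio is computed as usual.
normal = safe_ratio(0.5, 0.25, default=1.0, label="Expected TTFT")

# Edge cases: fall back to 1.0 instead of raising ZeroDivisionError.
zero_expected = safe_ratio(0.5, 0.0, default=1.0, label="Expected ITL")
no_workers = safe_ratio(8, len([]), default=1.0, label="Decode worker count")

print(normal, zero_expected, no_workers)  # 2.0 1.0 1.0
```

With a helper like this, each call site would reduce to one line, and the fallback behavior would be defined in one place.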

Where should the reviewer start?

components/planner/src/dynamo/planner/utils/planner_core.py

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

copy-pr-bot bot commented Sep 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added external-contribution Pull request is from an external contributor fix labels Sep 18, 2025
@pull-request-size pull-request-size bot added size/M and removed size/S labels Sep 18, 2025
@GavinZhu-GMI
Contributor Author

After the fix, the planner handles the edge case correctly.
[image]
