fix: fixed Planner division by zero errors in sglang backend #3113

GavinZhu-GMI · 2025-09-18T10:00:03Z

Overview:

Fixes division-by-zero errors in the planner's autoscaling logic that prevented scaling decisions in disaggregated SGLang deployments.

Details:

There are currently three division-by-zero scenarios in planner_core.py:make_adjustments():

Line 375: self.p_correction_factor = self.last_metrics.ttft / expect_ttft
- When expect_ttft = 0 from interpolation edge cases
Line 384: self.d_correction_factor = self.last_metrics.itl / expect_itl
- When expect_itl = 0 from interpolation edge cases
Line 379: concurrency = self.last_metrics.num_req / len(self.d_endpoints)
- When len(self.d_endpoints) = 0 (no decode workers initially)
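The failure mode behind these three lines can be reproduced in isolation. The following is a minimal sketch, assuming the values involved are plain Python numbers (NumPy scalars would instead produce `inf` or `nan` with a RuntimeWarning rather than raise). The function names `ratio`, `per_worker_load`, and `raises_zero_division` are hypothetical stand-ins and do not come from planner_core.py.

```python
# Hypothetical stand-ins for the planner expressions; not the actual planner code.

def ratio(observed: float, expected: float) -> float:
    # Mirrors the shape of `self.last_metrics.ttft / expect_ttft`.
    return observed / expected


def per_worker_load(num_req: float, d_endpoints: list) -> float:
    # Mirrors the shape of `self.last_metrics.num_req / len(self.d_endpoints)`.
    return num_req / len(d_endpoints)


def raises_zero_division(fn, *args) -> bool:
    # Report whether calling fn(*args) raises ZeroDivisionError.
    try:
        fn(*args)
    except ZeroDivisionError:
        return True
    return False


# An expected latency of 0 and an empty decode-endpoint list both raise.
print(raises_zero_division(ratio, 0.25, 0.0))
print(raises_zero_division(per_worker_load, 8, []))
```

Because the exception is raised in the middle of the adjustment step, nothing after the failing line runs, which is consistent with the report that scaling decisions were blocked.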

`components/planner/src/dynamo/planner/utils/planner_core.py`

TTFT Correction Factor (lines 375-379):

# Before
self.p_correction_factor = self.last_metrics.ttft / expect_ttft

# After
if expect_ttft > 0:
    self.p_correction_factor = self.last_metrics.ttft / expect_ttft
else:
    logger.warning(f"Expected TTFT is {expect_ttft}, using default correction factor 1.0")
    self.p_correction_factor = 1.0

Concurrency Calculation (lines 381-388):

# Before
concurrency = self.last_metrics.num_req / len(self.d_endpoints) * ...

# After
if len(self.d_endpoints) > 0:
    concurrency = (self.last_metrics.num_req / len(self.d_endpoints) * ...)
else:
    logger.warning("No decode workers available, using default concurrency of 1.0")
    concurrency = 1.0

ITL Correction Factor (lines 394-398):

# Before
self.d_correction_factor = self.last_metrics.itl / expect_itl

# After
if expect_itl > 0:
    self.d_correction_factor = self.last_metrics.itl / expect_itl
else:
    logger.warning(f"Expected ITL is {expect_itl}, using default correction factor 1.0")
    self.d_correction_factor = 1.0
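The three guards share one pattern: divide only when the denominator is positive, otherwise log a warning and fall back to 1.0. As a sketch of that pattern only (this PR inlines the checks rather than adding a helper), it could be expressed as a single function. The name `safe_ratio`, its parameters, and the logger name are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("planner_sketch")  # hypothetical logger name


def safe_ratio(numerator: float, denominator: float, default: float, label: str) -> float:
    """Return numerator / denominator, or `default` when the denominator is not positive."""
    if denominator > 0:
        return numerator / denominator
    # Same fallback behavior as the guards above: warn and use the default.
    logger.warning("%s is %s, using default %s", label, denominator, default)
    return default


# Expected TTFT of 0 falls back to the default correction factor.
p_correction_factor = safe_ratio(0.30, 0.0, 1.0, "Expected TTFT")
# A positive expected ITL divides normally.
d_correction_factor = safe_ratio(0.5, 0.25, 1.0, "Expected ITL")
```

Like the `> 0` checks in the diff, this treats a negative denominator the same as zero and returns the default.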

Where should the reviewer start?

components/planner/src/dynamo/planner/utils/planner_core.py

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: [BUG]: Planner division by zero errors prevent autoscaling in disaggregated SGLang deployments #3112

Signed-off-by: Gavin.Zhu <[email protected]>

copy-pr-bot · 2025-09-18T10:00:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-09-18T10:00:11Z

👋 Hi GavinZhu-GMI! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

GavinZhu-GMI · 2025-09-18T12:23:01Z

After the fix, the planner handles these edge cases correctly.

fix: fixed Planner division by zero errors in sglang backend

5a1f4f9

Signed-off-by: Gavin.Zhu <[email protected]>

GavinZhu-GMI requested review from tedzhouhk, hhzhang16, jasonqinzhou, PeaBrane, Aphoh, alec-flowers and michaelshin as code owners September 18, 2025 10:00

pull-request-size bot added the size/S label Sep 18, 2025

github-actions bot added external-contribution Pull request is from an external contributor fix labels Sep 18, 2025

fix: fixed format for precommit

f7154d5

Signed-off-by: Gavin.Zhu <[email protected]>

pull-request-size bot added size/M and removed size/S labels Sep 18, 2025
