Conversation

@finbarrtimbers commented Aug 7, 2025

New & improved version of #807.

Now, we process individual prompts through the queues and can do in-flight weight updates (configurable with the inflight_updates flag, which defaults to False).
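
A hypothetical sketch of how such a flag might be declared alongside the other args (the field name follows the runs below; the docstring is illustrative and not taken from the diff):

    inflight_updates: bool = False
    """Whether to update model weights while generation requests are still in flight."""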

Runs with inflight_updates=False (which should recreate existing behaviour):

  1. Single GPU run: Beaker
  2. Single GPU run with tools: Beaker
  3. Multi-GPU run: Beaker

With inflight_updates=True:

  1. Single GPU run: Beaker
  2. Multi-GPU run: Beaker

This is 40% more efficient (tokens per second go from 349.22 -> 500.42 in the benchmark).

Benchmark results for main at HEAD:

Average results over 4 main benchmark batches:
Average tokens/second: 349.22
Average MFU: 4.76%
Average generation time per batch: 2422.32s
Average new tokens per sample: 13217.39 tokens
Wasted compute % (variable response length): 35.31%

Benchmark results in this PR:

Average results over 4 main benchmark batches:
Average tokens/second: 500.42
Average MFU: 5.43%
Average generation time per batch: 1691.32s
Average new tokens per sample: 13224.54 tokens

@hamishivi left a comment

Generally I think this is pretty good to merge, mostly just minor nits left, assuming the slowdown stuff is worked out!

vllm_top_p: float = 1.0
"""vLLM top p for nucleus sampling"""
inference_batch_size: Optional[int] = None
"""Number of inference requests to batch together for vLLM processing"""

maybe we could be more specific, something like "number of unique prompts sent to a single vLLM engine at once"?
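
A sketch of what that wording would look like in place (same field as above, only the docstring changes):

    inference_batch_size: Optional[int] = None
    """Number of unique prompts sent to a single vLLM engine at once."""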


finish_reasons += [finish_reasons[i] for i in sampled_indices]

print(

can we keep this logging?

raise ValueError(f"Unknown tool: {tool}")

actor_manager = vllm_utils3.ActorManager.remote()
actor_manager = ray.remote(vllm_utils3.ActorManager).remote()

Is this change necessary since ActorManager is already decorated with ray.remote?
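
For context, a minimal sketch of the two Ray patterns being compared (generic class names, not the actual module layout): if the class already carries the @ray.remote decorator, calling .remote() on it directly is enough, while wrapping with ray.remote(...) is what you need for an undecorated class.

    import ray

    @ray.remote
    class DecoratedManager:
        """Already an actor class: instantiate it with .remote() directly."""
        pass

    manager = DecoratedManager.remote()

    class PlainManager:
        """Plain class: wrap with ray.remote(...) before instantiating."""
        pass

    plain_manager = ray.remote(PlainManager).remote()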

if has_beaker_job:
logger.info(f"is_beaker_job: BEAKER_JOB_ID value: {os.environ.get('BEAKER_JOB_ID')}")
return has_beaker_job
return "BEAKER_JOB_ID" in os.environ

nice!
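
For reference, a self-contained version of the one-line check shown in the diff (the function name is_beaker_job is inferred from the log message and may differ):

    import os

    def is_beaker_job() -> bool:
        # Beaker sets BEAKER_JOB_ID for every job, so checking for the variable suffices.
        return "BEAKER_JOB_ID" in os.environ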

claude bot commented Aug 29, 2025

Code Review for PR #859: Change LLMRayActor to continually process individual prompts

Overview

This PR implements a significant architectural change from batch-based to individual prompt processing, introducing an inflight_updates flag and achieving a reported 40% performance improvement (349.22 → 500.42 tokens/second).

✅ Strengths

Performance Improvements:

  • Excellent 40% performance gain with clear benchmarking data
  • Smarter queue sizing based on actual workload: queue_size = args.num_unique_prompts_rollout * num_batches (see the sketch after this list)
  • Better memory management with comprehensive logging functions
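
A minimal sketch of that queue sizing, assuming a Ray queue and substituting concrete placeholder values for args.num_unique_prompts_rollout and num_batches:

    from ray.util import queue as ray_queue

    num_unique_prompts_rollout = 64   # placeholder for args.num_unique_prompts_rollout
    num_batches = 4                   # placeholder for the number of pending batches

    # Size the prompt queue so every pending batch fits without blocking producers.
    queue_size = num_unique_prompts_rollout * num_batches
    prompt_queue = ray_queue.Queue(maxsize=queue_size)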

Code Quality:

  • Good separation of concerns between submission and generation threads
  • Proper error handling and cleanup in finally blocks
  • Clear progress reporting and logging improvements

Architecture:

  • Clean transition from batch-based to individual prompt processing
  • Backwards compatibility with inflight_updates=False flag
  • Well-structured thread management with proper synchronization

⚠️ Issues and Concerns

Critical Bug - Division by Zero:
In grpo_fast.py line 442: self.inference_batch_size = self.num_unique_prompts_rollout // self.vllm_num_engines
When num_unique_prompts_rollout < vllm_num_engines, this becomes 0, which will cause issues downstream.
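
A minimal sketch of the suggested guard; the dataclass layout here is illustrative, and only the max(1, ...) clamp is the actual recommendation:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Args:
        num_unique_prompts_rollout: int
        vllm_num_engines: int
        inference_batch_size: Optional[int] = None

        def __post_init__(self):
            if self.inference_batch_size is None:
                # Clamp to at least 1 so that having fewer prompts than engines
                # does not produce a zero batch size.
                self.inference_batch_size = max(
                    1, self.num_unique_prompts_rollout // self.vllm_num_engines
                )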

Potential Race Conditions:
The dataset_idx calculation in benchmark_generators.py could lead to out-of-bounds access if dataset size assumptions are violated.

Memory Management:

  • New log_memory_usage() function is comprehensive but may impact performance in production due to frequent gc.get_objects() calls
  • Consider making memory logging optional or less frequent (see the throttling sketch below)
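
A throttling sketch, under the assumption that the expensive work lives in the PR's log_memory_usage() function (the wrapper and its name are hypothetical):

    import gc
    import logging
    import time

    logger = logging.getLogger(__name__)
    _LAST_MEMORY_LOG = 0.0

    def log_memory_usage() -> None:
        # Stand-in for the PR's logger; gc.get_objects() is the costly call.
        logger.info("Live Python objects: %d", len(gc.get_objects()))

    def maybe_log_memory_usage(min_interval_s: float = 60.0) -> None:
        """Run the expensive memory logging at most once per interval."""
        global _LAST_MEMORY_LOG
        now = time.monotonic()
        if now - _LAST_MEMORY_LOG >= min_interval_s:
            log_memory_usage()
            _LAST_MEMORY_LOG = now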

Thread Safety:

  • Multiple threads accessing shared queues - ensure proper synchronization
  • Thread pool size increased to len(vllm_engines) + 1; verify this doesn't cause resource contention

🔧 Recommended Fixes

  1. Fix Division by Zero: Use max(1, self.num_unique_prompts_rollout // self.vllm_num_engines)
  2. Add bounds checking for dataset access
  3. Optimize memory logging for production use

🧪 Testing Concerns

Limited Test Coverage:

  • Only basic unit tests for device name mapping
  • No integration tests for the new individual prompt processing logic
  • Missing tests for edge cases (empty queues, timeout scenarios)
  • No tests for the new inflight_updates functionality

🛡️ Security Assessment

Low Risk Areas:

  • No obvious security vulnerabilities
  • Proper input validation for queue operations
  • Safe memory management practices

📋 Summary

This is a solid performance-focused PR with a well-architected solution. The 40% performance gain is impressive and the code is generally well-structured. However, there are a few critical bugs that need addressing before merge:

  1. Fix the division by zero bug in inference batch size calculation
  2. Add bounds checking for dataset access
  3. Optimize memory logging for production use
  4. Expand test coverage for the new functionality

The architectural change is sound and the backwards compatibility approach is thoughtful. Once the critical issues are addressed, this will be a valuable improvement to the codebase.

🤖 Generated with Claude Code
