MasterPtato commented Jun 24, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.


cloudflare-workers-and-pages bot commented Jun 24, 2025

Deploying rivet with Cloudflare Pages

Latest commit: ca77778
Status: 🚫 Build failed.

View logs

@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from c1393f3 to 33c172e Compare June 24, 2025 20:14
@MasterPtato MasterPtato marked this pull request as ready for review June 24, 2025 20:14
@MasterPtato MasterPtato requested a review from NathanFlurry June 24, 2025 20:14
@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from 33c172e to fca7a80 Compare June 26, 2025 01:10
@MasterPtato MasterPtato force-pushed the 06-21-feat_get_multi_actors_working_e2e_on_docker_compose branch from 3fb5f26 to b48c234 Compare June 26, 2025 01:10

cloudflare-workers-and-pages bot commented Jun 26, 2025

Deploying rivet-studio with Cloudflare Pages

Latest commit: ca77778
Status: 🚫 Build failed.

View logs


cloudflare-workers-and-pages bot commented Jun 26, 2025

Deploying rivet-hub with Cloudflare Pages

Latest commit: ca77778
Status: 🚫 Build failed.

View logs

@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from fca7a80 to 9191ac2 Compare June 26, 2025 23:20
@MasterPtato MasterPtato force-pushed the 06-21-feat_get_multi_actors_working_e2e_on_docker_compose branch from b48c234 to 9544f05 Compare June 26, 2025 23:20

claude bot commented Jul 3, 2025

Claude finished @MasterPtato's task — View job


PR Review: feat: add actor kv to runners

I've completed a comprehensive code review focusing on code quality, potential bugs, performance, security, and test coverage.

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze actor-kv client changes
  • Review runner protocol modifications
  • Examine manager integration changes
  • Check database migration safety
  • Assess security implications
  • Evaluate performance considerations
  • Review error handling and edge cases
  • Check test coverage implications
  • Provide comprehensive feedback and recommendations

Architecture Overview

This PR implements actor key-value storage by integrating FoundationDB-backed KV operations into the runner system. The key changes include:

  1. ActorKv Client: Streamlined FoundationDB integration with chunking for large values (10KB chunks)
  2. Runner Protocol Extension: Added KV request/response message types with deduplication IDs
  3. Manager Integration: Each actor now gets dedicated ActorKv instances for KV operations
  4. FDB Connection Management: New FdbPool utility for managing FoundationDB connections
  5. Removed V8 Runner: Eliminated the standalone isolate-v8-runner (336 lines deleted)

Critical Issues Found

🔴 Race Condition in Storage Validation

Location: packages/edge/infra/client/actor-kv/src/lib.rs:308-310

let total_size = self.get_subspace_size(subspace).await? as usize;
validate_entries(&entries, total_size)?;

Issue: The size check runs before the write, so concurrent writes that land between the check and the commit can push total usage past the limit, bypassing the 1GB storage quota enforcement.

Impact: Actors can exceed their storage limit through concurrent operations.
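The recommended fix is to perform the check and the usage update atomically in one FDB transaction, which FDB retries on conflict. As a self-contained illustration of the check-and-reserve pattern (the real fix would live inside the transaction body), here is a sketch using an in-memory atomic counter; `try_reserve` and `STORAGE_LIMIT` are hypothetical names, not from the PR:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const STORAGE_LIMIT: usize = 1 << 30; // 1 GiB quota, per the review

// Stand-in for the per-actor usage counter. In the real fix this
// check-and-update would run inside a single FDB transaction, which is
// retried on conflict, giving the same atomicity guarantee.
fn try_reserve(used: &AtomicUsize, write_size: usize) -> bool {
    let mut cur = used.load(Ordering::Relaxed);
    loop {
        if cur + write_size > STORAGE_LIMIT {
            return false; // would exceed the quota; reject before writing
        }
        // Atomically claim the space; retry if another writer raced us.
        match used.compare_exchange_weak(cur, cur + write_size, Ordering::SeqCst, Ordering::Relaxed)
        {
            Ok(_) => return true,
            Err(actual) => cur = actual,
        }
    }
}
```

The key property is that two concurrent reservations can never both succeed when their sum exceeds the limit, which is exactly what the check-then-write sequence in the PR fails to guarantee.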

🔴 Missing Actor Access Control

Location: packages/edge/infra/client/manager/src/runner/mod.rs:212-218

runner_protocol::ToManager::Kv(req) => {
    let _ = self.actor_proxy_tx.send((
        req.actor_id,
        req.generation,
        runner_protocol::ToActor::Kv(req),
    ));
}

Issue: No validation that the requesting actor matches the KV operation's actor_id. Any actor could potentially access other actors' KV data if they know the actor ID.

Impact: Cross-actor data access vulnerability if actor IDs are predictable or discoverable.
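A minimal sketch of the missing check: since the manager knows which actor owns each runner connection, it can reject a KV request whose actor_id does not match before proxying it. The names (`authorize`, `KvRequest`, `KvError`) and the use of `u64` in place of the real actor ID type are illustrative assumptions:

```rust
// Hypothetical sketch of the validation the review asks for.
#[derive(Debug, PartialEq)]
enum KvError {
    ActorMismatch,
}

struct KvRequest {
    actor_id: u64, // the real type is likely a UUID; u64 keeps the sketch small
    generation: u32,
}

fn authorize(connection_actor_id: u64, req: &KvRequest) -> Result<(), KvError> {
    if req.actor_id != connection_actor_id {
        // Drop the request instead of forwarding it to actor_proxy_tx.
        return Err(KvError::ActorMismatch);
    }
    Ok(())
}
```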

🟡 Performance: Blocking KV Operations

Location: packages/edge/infra/client/manager/src/actor/mod.rs:210-277
Issue: KV operations are processed synchronously in the actor observation loop, which can block actor state updates and process monitoring.

Impact: Delays in actor lifecycle management when KV operations are slow.
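The code's own TODO ("Add queue and bg thread for processing kv ops") points at the shape of the fix: the observer loop only enqueues the operation and returns immediately, while a dedicated worker drains the queue. A minimal sketch with a standard-library channel (all names are hypothetical, and the worker body stands in for the actual FDB call):

```rust
use std::sync::mpsc;
use std::thread;

// KV ops are sent into a channel; a background worker processes them so
// the actor observation loop is never blocked on FDB latency.
enum KvOp {
    Put { key: String, value: Vec<u8> },
    Shutdown,
}

fn spawn_kv_worker() -> (mpsc::Sender<KvOp>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<KvOp>();
    let handle = thread::spawn(move || {
        let mut processed = 0;
        while let Ok(op) = rx.recv() {
            match op {
                // The real worker would perform the FDB transaction here.
                KvOp::Put { .. } => processed += 1,
                KvOp::Shutdown => break,
            }
        }
        processed
    });
    (tx, handle)
}
```

In the actual codebase this would more likely be a tokio task and an unbounded channel, but the decoupling is the same: `send` is cheap and non-blocking from the observer loop's perspective.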

Code Quality Issues

1. Expensive Data Cloning

Location: packages/edge/infra/client/actor-kv/src/lib.rs:314-315

// TODO: Costly clone
let entries = entries.clone();

Issue: Multiple expensive clones in hot paths that could impact performance with large payloads.

2. Imprecise Size Calculations

Location: packages/edge/infra/client/actor-kv/src/key.rs:33-36

pub fn len(&self) -> usize {
    // Arbitrary 4 accounting for nesting overhead
    self.0.iter().fold(0, |acc, x| acc + x.len()) + 4 * self.0.len()
}

Issue: Size calculation includes "arbitrary" overhead that may not match actual storage usage, affecting quota validation accuracy.
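An exact count is possible because FDB's tuple layer encoding is documented: a byte-string element is one type-code byte, the payload with every 0x00 escaped to 0x00 0xFF, and a 0x00 terminator. A sketch of a per-element calculation under that encoding (`packed_len` is a hypothetical replacement for the `+4`-per-element estimate):

```rust
// Exact packed size of a key's elements under the FDB tuple layer's
// byte-string encoding: type code + payload + one escape byte per 0x00
// in the payload + terminator.
fn packed_len(elements: &[&[u8]]) -> usize {
    elements
        .iter()
        .map(|e| {
            let escapes = e.iter().filter(|&&b| b == 0).count();
            1 + e.len() + escapes + 1
        })
        .sum()
}
```

This makes quota accounting match what FDB actually stores, instead of over- or under-counting by the arbitrary nesting constant.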

3. Hard-coded Limits

Location: packages/edge/infra/client/actor-kv/src/lib.rs:28-34

const MAX_PUT_PAYLOAD_SIZE: usize = 976 * 1024; // Why 976 KiB specifically?

Issue: Magic numbers should be documented or configurable per deployment.

4. TODO Comments Left Unresolved

  • packages/edge/infra/client/config/src/runner_protocol.rs:56: "This shouldn't require generation"
  • packages/edge/infra/client/manager/src/utils/fdb.rs:9: "Copied from rivet_pools"
  • packages/edge/infra/client/manager/src/actor/mod.rs:211: "Add queue and bg thread for processing kv ops"

Security Concerns

1. Resource Exhaustion

  • Concurrent writes can bypass storage limits (race condition above)
  • No rate limiting on KV operations per actor
  • Large batch operations (128 keys max) could consume significant resources

2. Data Isolation

  • No enforcement that actors can only access their own KV namespaces
  • Silent deduplication in entry processing could mask bugs or malicious behavior

Performance Issues

1. Synchronous Processing

KV operations block the main actor observer loop instead of being processed asynchronously.

2. Multiple FDB Transactions

Each KV operation creates separate transactions instead of batching when possible.

3. Polling-based Size Estimation

get_estimated_range_size_bytes may be inaccurate and could lead to incorrect quota enforcement.

Test Coverage Gap

Critical Finding: No unit tests found for the new KV functionality. The only testing appears to be the system test example in examples/system-test-actor/src/managerClient.ts.

Missing test coverage for:

  • KV operations (get, put, list, delete)
  • Error conditions (size limits, invalid keys)
  • Concurrent access scenarios
  • Storage quota enforcement
  • Cross-actor access prevention
  • Error handling edge cases

Database Migration Analysis

Location: packages/edge/services/pegboard/db/runner/migrations/20200101000000_init.up.sql

The migration looks clean with proper TTL settings and follows good ClickHouse practices.

Positive Aspects

  1. Well-structured Protocol: Clean request/response patterns with proper deduplication
  2. Comprehensive Validation: Good size limits and input validation throughout
  3. Robust Error Handling: Detailed error contexts and proper error propagation
  4. Chunking Implementation: Proper handling of large values with 10KB chunks
  5. FDB Integration: Solid connection management with service discovery support
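For reference, the chunking scheme praised above amounts to splitting a value at a fixed boundary and concatenating chunks back on read; in the real implementation each chunk is stored under a derived sub-key. A minimal sketch (function names are illustrative, not from the PR):

```rust
const CHUNK_SIZE: usize = 10 * 1024; // 10KB, matching the chunk size the review describes

// Split a large value into fixed-size chunks for storage...
fn split_chunks(value: &[u8]) -> Vec<&[u8]> {
    value.chunks(CHUNK_SIZE).collect()
}

// ...and reassemble them in order on read.
fn reassemble(chunks: &[&[u8]]) -> Vec<u8> {
    chunks.iter().flat_map(|c| c.iter().copied()).collect()
}
```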

Recommendations

High Priority (Security & Correctness)

  1. Fix race condition: Use FDB transactions to atomically check and update storage usage
  2. Add actor ID validation: Verify requesting actor matches KV operation actor_id
  3. Add comprehensive tests: Unit tests for all KV operations and error scenarios
  4. Implement async KV processing: Move KV operations to background tasks to prevent blocking

Medium Priority (Performance & Quality)

  1. Address expensive clones: Optimize data copying in hot paths (resolve TODOs)
  2. Make limits configurable: Allow per-deployment configuration of storage/size limits
  3. Add rate limiting: Implement per-actor rate limiting for KV operations
  4. Improve size calculation: Make Key::len() more accurate for quota enforcement

Low Priority (Code Quality)

  1. Resolve TODO comments: Address outstanding technical debt
  2. Add metrics/monitoring: Track KV operation performance and error rates
  3. Consider batching: Optimize FDB operations by batching when possible
  4. Improve documentation: Add more detailed comments for complex logic

Summary

This is a well-architected implementation of actor KV storage with solid FoundationDB integration. The code quality is generally good with comprehensive validation and error handling. However, there are critical security and correctness issues that must be addressed:

  1. Race condition in storage validation could allow quota bypass
  2. Missing access control could enable cross-actor data access
  3. Lack of test coverage is concerning for such a critical feature
  4. Performance issues from synchronous processing could impact actor lifecycle

Recommendation: Requires fixes for critical issues before merge, particularly the race condition and access control problems. Adding comprehensive tests is also essential.


@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from 9a19675 to 846e5a4 Compare July 4, 2025 01:27

coderabbitai bot commented Jul 4, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.




claude bot commented Jul 4, 2025


@MasterPtato MasterPtato force-pushed the 06-21-feat_get_multi_actors_working_e2e_on_docker_compose branch from 4801dea to 618f8f4 Compare July 11, 2025 22:23
@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from 846e5a4 to 7f5d1c3 Compare July 11, 2025 22:23

claude bot commented Jul 11, 2025



claude bot commented Jul 11, 2025

Claude finished @MasterPtato's task —— View job


PR Review: feat: add actor kv to runners

I've completed a comprehensive code review focusing on code quality, potential bugs, performance, security, and test coverage.

Review Checklist:

  • Understand the overall architecture and scope of changes
  • Review actor-kv client implementation
  • Analyze runner protocol modifications
  • Examine manager integration changes
  • Check for security vulnerabilities
  • Evaluate performance considerations
  • Assess test coverage implications
  • Provide detailed feedback and recommendations

Architecture Overview

This PR implements actor key-value storage by integrating FoundationDB-backed KV operations into the runner system. The key changes include:

  1. ActorKv Client: Streamlined FoundationDB integration with chunking for large values (10KB chunks)
  2. Runner Protocol Extension: Added KV request/response message types with deduplication IDs
  3. Manager Integration: Each actor now gets dedicated ActorKv instances for KV operations
  4. FDB Connection Management: New FdbPool utility for managing FoundationDB connections
  5. Removed V8 Runner: Eliminated the standalone isolate-v8-runner (336 lines deleted)

Critical Issues Found

🔴 Race Condition in Storage Validation

Location: packages/edge/infra/client/actor-kv/src/lib.rs:308-310

let total_size = self.get_subspace_size(subspace).await? as usize;
validate_entries(&entries, total_size)?;

Issue: Size check happens before write operations, but concurrent writes can exceed storage limits between the check and actual write. This bypasses the 1GB storage quota enforcement.

Impact: Actors can exceed their storage limit through concurrent operations.

🔴 Missing Actor Access Control

Location: packages/edge/infra/client/manager/src/runner/mod.rs:212-218

runner_protocol::ToManager::Kv(req) => {
    let _ = self.actor_proxy_tx.send((
        req.actor_id,
        req.generation,
        runner_protocol::ToActor::Kv(req),
    ));
}

Issue: No validation that the requesting actor matches the KV operation's actor_id. Any actor could potentially access other actors' KV data if they know the actor ID.

Impact: Cross-actor data access vulnerability if actor IDs are predictable or discoverable.

🟡 Performance: Blocking KV Operations

Location: packages/edge/infra/client/manager/src/actor/mod.rs:210-277
Issue: KV operations are processed synchronously in the actor observation loop, which can block actor state updates and process monitoring.

Impact: Delays in actor lifecycle management when KV operations are slow.
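One way to decouple slow KV work from the observer loop is a dedicated worker fed by a channel, so the loop only pays for a send. A std-only sketch of the shape; the real code is async and would use a tokio task plus an async channel, with each op carrying a reply channel:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical op type; real requests would carry keys, values, and a way to
// send the response back to the runner.
enum KvOp {
    Put(String, Vec<u8>),
    Shutdown,
}

// Spawn a background worker, feed it ops, and return how many it processed.
fn process_in_background(ops: Vec<KvOp>) -> usize {
    let (tx, rx) = mpsc::channel::<KvOp>();
    let worker = thread::spawn(move || {
        let mut processed = 0usize;
        while let Ok(op) = rx.recv() {
            match op {
                KvOp::Put(..) => processed += 1, // stand-in for the actual FDB write
                KvOp::Shutdown => break,
            }
        }
        processed
    });

    // Observer-loop side: each send returns immediately.
    for op in ops {
        tx.send(op).unwrap();
    }
    tx.send(KvOp::Shutdown).unwrap();
    worker.join().unwrap()
}

fn main() {
    let ops = vec![
        KvOp::Put("a".into(), vec![1]),
        KvOp::Put("b".into(), vec![2, 3]),
    ];
    assert_eq!(process_in_background(ops), 2);
}
```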

Code Quality Issues

1. Expensive Data Cloning

Location: packages/edge/infra/client/actor-kv/src/lib.rs:314-315

// TODO: Costly clone
let entries = entries.clone();

Issue: Multiple expensive clones in hot paths that could impact performance with large payloads.

2. Imprecise Size Calculations

Location: packages/edge/infra/client/actor-kv/src/key.rs:33-36

pub fn len(&self) -> usize {
    // Arbitrary 4 accounting for nesting overhead
    self.0.iter().fold(0, |acc, x| acc + x.len()) + 4 * self.0.len()
}

Issue: Size calculation includes "arbitrary" overhead that may not match actual storage usage, affecting quota validation accuracy.

3. Hard-coded Limits

Location: packages/edge/infra/client/actor-kv/src/lib.rs:28-34

const MAX_PUT_PAYLOAD_SIZE: usize = 976 * 1024; // Why 976 KiB specifically?

Issue: Magic numbers should be documented or configurable per deployment.
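One option is to lift the constants into a config struct with the current values as defaults, so deployments can override them. Field names below are illustrative:

```rust
// Sketch of deployment-configurable KV limits; defaults preserve the current
// hard-coded values. In practice this would derive serde::Deserialize and be
// loaded from the client config.
#[derive(Debug, Clone)]
struct KvLimits {
    max_put_payload_size: usize,
    max_keys_per_batch: usize,
    storage_quota: usize,
}

impl Default for KvLimits {
    fn default() -> Self {
        Self {
            // Carried over from MAX_PUT_PAYLOAD_SIZE; the derivation of 976 KiB
            // should be documented wherever the value is defined.
            max_put_payload_size: 976 * 1024,
            max_keys_per_batch: 128,
            storage_quota: 1024 * 1024 * 1024, // 1 GiB
        }
    }
}

fn main() {
    let defaults = KvLimits::default();
    assert_eq!(defaults.max_put_payload_size, 976 * 1024);

    // A stricter deployment overrides only what it needs:
    let strict = KvLimits { storage_quota: 64 * 1024 * 1024, ..KvLimits::default() };
    assert_eq!(strict.max_keys_per_batch, 128);
}
```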

4. TODO Comments Left Unresolved

  • packages/edge/infra/client/config/src/runner_protocol.rs:56: "This shouldn't require generation"
  • packages/edge/infra/client/manager/src/utils/fdb.rs:9: "Copied from rivet_pools"
  • packages/edge/infra/client/manager/src/actor/mod.rs:211: "Add queue and bg thread for processing kv ops"

Security Concerns

1. Resource Exhaustion

  • Concurrent writes can bypass storage limits (race condition above)
  • No rate limiting on KV operations per actor
  • Large batch operations (128 keys max) could consume significant resources

2. Data Isolation

  • No enforcement that actors can only access their own KV namespaces
  • Silent deduplication in entry processing could mask bugs or malicious behavior

Performance Issues

1. Synchronous Processing

KV operations block the main actor observer loop instead of being processed asynchronously.

2. Multiple FDB Transactions

Each KV operation creates separate transactions instead of batching when possible.

3. Polling-based Size Estimation

get_estimated_range_size_bytes may be inaccurate and could lead to incorrect quota enforcement.

Test Coverage Gap

Critical Finding: No unit tests found for the new KV functionality. The only testing appears to be the system test example in examples/system-test-actor/src/managerClient.ts.

Missing test coverage for:

  • KV operations (get, put, list, delete)
  • Error conditions (size limits, invalid keys)
  • Concurrent access scenarios
  • Storage quota enforcement
  • Cross-actor access prevention
  • Error handling edge cases
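As a starting point, even transport-free tests against an in-memory stand-in would catch regressions in the surrounding validation logic. The sketch below shows the shape; MemKv is a stand-in, not the real ActorKv, which would need an FDB test cluster or a mocked transaction layer:

```rust
use std::collections::HashMap;

// Trivial in-memory stand-in used only to illustrate test structure.
struct MemKv {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl MemKv {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.map.get(key)
    }
    fn delete(&mut self, key: &[u8]) -> bool {
        self.map.remove(key).is_some()
    }
}

fn main() {
    let mut kv = MemKv::new();

    // put/get round-trip
    kv.put(b"key", b"value");
    assert_eq!(kv.get(b"key").map(|v| v.as_slice()), Some(&b"value"[..]));

    // delete, then a miss
    assert!(kv.delete(b"key"));
    assert!(kv.get(b"key").is_none());
}
```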

Database Migration Analysis

Location: packages/edge/services/pegboard/db/runner/migrations/20200101000000_init.up.sql

The migration looks clean with proper TTL settings and follows good ClickHouse practices.

Positive Aspects

  1. Well-structured Protocol: Clean request/response patterns with proper deduplication
  2. Comprehensive Validation: Good size limits and input validation throughout
  3. Robust Error Handling: Detailed error contexts and proper error propagation
  4. Chunking Implementation: Proper handling of large values with 10KB chunks
  5. FDB Integration: Solid connection management with service discovery support
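The chunking approach noted in point 4 can be sketched for plain byte slices; the real implementation additionally stores each chunk under a key suffixed with its chunk index:

```rust
const CHUNK_SIZE: usize = 10 * 1024; // 10 KiB, matching the chunk size above

// Split a large value into fixed-size chunks; only the final chunk may be short.
fn split_chunks(value: &[u8]) -> Vec<&[u8]> {
    value.chunks(CHUNK_SIZE).collect()
}

// Reassemble the chunks read back from storage, in index order.
fn reassemble(chunks: &[&[u8]]) -> Vec<u8> {
    chunks.concat()
}

fn main() {
    let value = vec![7u8; 25 * 1024]; // 25 KiB splits into 10 + 10 + 5 KiB
    let chunks = split_chunks(&value);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[2].len(), 5 * 1024);
    assert_eq!(reassemble(&chunks), value);
}
```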

Recommendations

High Priority (Security & Correctness)

  1. Fix race condition: Use FDB transactions to atomically check and update storage usage
  2. Add actor ID validation: Verify requesting actor matches KV operation actor_id
  3. Add comprehensive tests: Unit tests for all KV operations and error scenarios
  4. Implement async KV processing: Move KV operations to background tasks to prevent blocking

Medium Priority (Performance & Quality)

  1. Address expensive clones: Optimize data copying in hot paths (resolve TODOs)
  2. Make limits configurable: Allow per-deployment configuration of storage/size limits
  3. Add rate limiting: Implement per-actor rate limiting for KV operations
  4. Improve size calculation: Make Key::len() more accurate for quota enforcement
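For item 3, a per-actor token bucket is a common shape for this. A minimal sketch with an explicit refill; a real limiter would refill on a timer and take its capacity from config:

```rust
// Minimal token bucket; capacity and refill policy are illustrative only.
struct TokenBucket {
    capacity: u32,
    tokens: u32,
}

impl TokenBucket {
    fn new(capacity: u32) -> Self {
        Self { capacity, tokens: capacity }
    }

    // In a real limiter this runs on a timer; here it is called explicitly.
    fn refill(&mut self, n: u32) {
        self.tokens = (self.tokens + n).min(self.capacity);
    }

    // Each KV request spends one token; with none left it is rejected (or
    // queued) before ever reaching FDB.
    fn try_acquire(&mut self) -> bool {
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(2);
    assert!(bucket.try_acquire());
    assert!(bucket.try_acquire());
    assert!(!bucket.try_acquire()); // over the limit, rejected
    bucket.refill(1);
    assert!(bucket.try_acquire()); // allowed again after refill
}
```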

Low Priority (Code Quality)

  1. Resolve TODO comments: Address outstanding technical debt
  2. Add metrics/monitoring: Track KV operation performance and error rates
  3. Consider batching: Optimize FDB operations by batching when possible
  4. Improve documentation: Add more detailed comments for complex logic

Summary

This is a well-architected implementation of actor KV storage with solid FoundationDB integration. The code quality is generally good with comprehensive validation and error handling. However, there are critical security and correctness issues that must be addressed:

  1. Race condition in storage validation could allow quota bypass
  2. Missing access control could enable cross-actor data access
  3. Lack of test coverage is concerning for such a critical feature
  4. Performance issues from synchronous processing could impact actor lifecycle

Recommendation: Requires fixes for critical issues before merge, particularly the race condition and access control problems. Adding comprehensive tests is also essential.



@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from 7f5d1c3 to 2b4037b Compare July 12, 2025 00:40

claude bot commented Jul 12, 2025

Claude finished @MasterPtato's task —— View job



@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from 2b4037b to c73f767 Compare July 18, 2025 18:51

vercel bot commented Jul 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivet-site Error Error Aug 21, 2025 8:21pm


claude bot commented Jul 18, 2025

Claude finished @MasterPtato's task —— View job




@MasterPtato MasterPtato force-pushed the 06-21-feat_get_multi_actors_working_e2e_on_docker_compose branch from 0bb001f to b2a0179 Compare August 21, 2025 20:11
@MasterPtato MasterPtato force-pushed the 06-24-feat_add_actor_kv_to_runners branch from c73f767 to ca77778 Compare August 21, 2025 20:11

claude bot commented Aug 21, 2025

Claude encountered an error —— View job


I'll analyze this and get back to you.

