v2.3.0

@CompuIves CompuIves released this 29 Sep 08:03
8a4d86a

2.3.0 (2025-09-29)

History of Hibernation and Persistence

In the original CodeSandbox design, we optimized sandbox lifecycles to feel like cloud-based laptops. When a user stepped away from a project, the sandbox would automatically hibernate after a short period of inactivity. On return, the sandbox would restore almost instantly to its previous state, including its memory and persisted data.

At the time, most sandboxes were small, short-lived projects. We managed persistence by storing each sandbox on disk for up to seven days before archiving. This provided a seamless user lifecycle: persistence was automatic, reliable, and required no manual intervention.

We also introduced Live Forking, which allowed users to fork a running sandbox while sharing memory with the original. This enabled smooth flows such as starting in a read-only, always-up-to-date main branch, then branching into a writable sandbox without interruption.

This approach delivered:

  • A simple mental model: “My sandbox is always where I left it.”
  • Cost efficiency: unused sandboxes were automatically hibernated or archived.
  • Low configuration overhead: persistence and storage were managed behind the scenes.

What We Have Learned

As CodeSandbox evolved into an SDK, use cases multiplied — many beyond what our original system anticipated. This introduced challenges around scalability, reliability, and flexibility.

The biggest friction today comes from our opinionated approach to hibernation and persistence.

Problems with Automatic Hibernation

  • Timeout fragility: Hibernation is triggered by a configurable timeout, but only certain SDK protocol messages extend it. Other activity (e.g. HTTP requests, file system operations) does not, making the design confusing and brittle.
  • In-sandbox management: Because the timeout is managed inside the sandbox itself, it can drift or fail. This sometimes keeps sandboxes alive longer than expected, and sometimes hibernates them too early.
  • Network triggers don’t reset timeouts: While SDK users can configure sandboxes to wake on HTTP or WebSocket activity, those interactions don’t extend the timeout. Developers often struggle to keep VMs alive reliably.
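
To make the timeout problem concrete, here is a minimal model of a hibernation deadline that only some kinds of activity extend. It is an illustrative sketch, not the SDK's implementation: the activity kinds, the hibernationTime function, and the 100-second timeout are all assumptions made for the example.

```typescript
// Hypothetical activity kinds; the real protocol defines its own message types.
type ActivityKind = "sdk-message" | "http-request" | "fs-operation";

interface Activity {
  atMs: number; // time of the event, in ms since the sandbox started
  kind: ActivityKind;
}

// Returns the time (ms since start) at which the sandbox would hibernate,
// assuming only the kinds listed in `extending` push the deadline forward.
function hibernationTime(
  activities: Activity[],
  timeoutMs: number,
  extending: readonly ActivityKind[] = ["sdk-message"],
): number {
  let deadline = timeoutMs; // the timer starts when the sandbox starts
  const sorted = activities.slice().sort((a, b) => a.atMs - b.atMs);
  for (const activity of sorted) {
    if (activity.atMs >= deadline) break; // the sandbox is already hibernated
    if (extending.indexOf(activity.kind) !== -1) {
      deadline = activity.atMs + timeoutMs;
    }
  }
  return deadline;
}

const traffic: Activity[] = [
  { atMs: 10_000, kind: "sdk-message" },
  { atMs: 60_000, kind: "http-request" },
  { atMs: 120_000, kind: "http-request" },
];

// Only SDK messages extend the deadline: the request at 120 s arrives too late.
console.log(hibernationTime(traffic, 100_000)); // 110000

// If every kind extended the deadline, the same traffic keeps the sandbox up.
console.log(
  hibernationTime(traffic, 100_000, ["sdk-message", "http-request", "fs-operation"]),
); // 220000
```

In this model, steady HTTP traffic alone never moves the deadline, which is the behavior the bullets above describe.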

Problems with Archiving

  • To stabilize clusters, we recently shortened the archive window from seven to four days (and sometimes just two days during peak load). While this improved system health, it introduced unpredictable resume behavior.
  • Resuming from an archived state (CLEAN) takes significantly longer (up to 60 seconds) than a normal resume (RESUME, ~1–3 seconds).
  • SDK users now must detect the boot type, handle longer startup times, and explain this inconsistency to their end users.
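
As a rough sketch of what handling the boot type can look like on the caller's side, the helper below picks a startup budget and a status message per boot type. The CLEAN and RESUME names and their typical durations come from the text above; the planStartup function, the StartupPlan shape, and the specific timeout values are assumptions chosen for illustration, and how the boot type is obtained from the SDK is not shown.

```typescript
// Boot types as named in these notes.
type BootType = "RESUME" | "CLEAN";

// Hypothetical shape: how long to wait and what to tell the end user.
interface StartupPlan {
  timeoutMs: number;
  userMessage: string;
}

// Chooses a startup budget with headroom over the typical durations:
// RESUME is usually 1-3 s, CLEAN can take up to 60 s.
function planStartup(bootType: BootType): StartupPlan {
  switch (bootType) {
    case "RESUME":
      return { timeoutMs: 10_000, userMessage: "Resuming your sandbox..." };
    case "CLEAN":
      return {
        timeoutMs: 90_000,
        userMessage: "Restoring an archived sandbox, this can take up to a minute...",
      };
  }
}

console.log(planStartup("CLEAN").userMessage);
```

Keeping the decision in one function means the timeout and the user-facing explanation cannot drift apart as the boot behavior changes.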

Problems with Live Forking

Live Forking introduced scalability bottlenecks. In some scenarios, thousands of sandboxes read simultaneously from a single “origin” sandbox’s memory, degrading performance across the platform.

In short, SDK users have worked hard to adapt our system to their needs. But the wide variety of use cases has proven incompatible with our current hibernation, persistence, and forking models. This mismatch has been a major source of reliability and scalability issues.

What We Are Shipping Today

SDK v2.3.0

SDK version 2.3.0 represents our Best Practices release. This is a NON-BREAKING change. We have also rewritten our docs to show you how to best take advantage of the current state of the service.

Feature

  • Add a delete method
  • Make id in connect and createSession an optional field, allowing you to default to the global user

Fix

  • Privacy now defaults to public-hosts
  • .gitignore is now included in the template build

What We Are Working On

A REST-based Sandbox Agent

The current SDK requires a WebSocket connection, adding complexity across different environments. Moving to a REST-based agent will:

  • Simplify the mental model
  • Eliminate connection management overhead
  • Make the interface easier and safer to use

Long-term persistence

When a sandbox is hibernated, we create a snapshot. If that snapshot isn’t resumed within 2–7 days (depending on cluster health), the sandbox is archived. This makes the resume process unpredictable: normally it takes 1–3 seconds, but resuming from an archived state can take up to 60 seconds.
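
The unpredictability can be expressed as a small calculation. The sketch below classifies a hibernated sandbox by how long it has been idle, using the 2–7 day window stated above; the expectedBoot function and the "EITHER" label are hypothetical and exist only to illustrate the range in which a caller cannot know which resume path it will get.

```typescript
// "EITHER" is a label invented for this example: the outcome depends on
// cluster health, which the caller cannot observe.
type ExpectedBoot = "RESUME" | "CLEAN" | "EITHER";

const DAY_MS = 24 * 60 * 60 * 1000;

// Classifies a hibernated sandbox by idle time, assuming archiving happens
// somewhere between minWindowDays and maxWindowDays after hibernation.
function expectedBoot(
  hibernatedAtMs: number,
  nowMs: number,
  minWindowDays = 2,
  maxWindowDays = 7,
): ExpectedBoot {
  const idleDays = (nowMs - hibernatedAtMs) / DAY_MS;
  if (idleDays < minWindowDays) return "RESUME"; // snapshot should still exist
  if (idleDays >= maxWindowDays) return "CLEAN"; // archived by now
  return "EITHER"; // may or may not have been archived yet
}

console.log(expectedBoot(0, 1 * DAY_MS)); // RESUME
console.log(expectedBoot(0, 3 * DAY_MS)); // EITHER
console.log(expectedBoot(0, 8 * DAY_MS)); // CLEAN
```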

With long-term snapshot persistence, our goal is to eliminate archiving altogether. This ensures predictable resume times of just 1–3 seconds, no matter how long a sandbox has been hibernated.

Replace the current hibernation timeout and automatic wakeup

For the best experience and the most control, we recommend adopting active lifecycle management. This applies both to the current service and to future updates.
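
One way to picture active lifecycle management is an application-side tracker that records whatever the application counts as activity and decides for itself when to hibernate. The sketch below is an assumption-laden illustration: LifecycleManager, touch, sweep, and the injected hibernate callback are hypothetical names, and in a real integration the callback would wrap whichever SDK call hibernates a sandbox.

```typescript
// Hypothetical callback; a real one would call the SDK to hibernate the sandbox.
type HibernateFn = (sandboxId: string) => void;

class LifecycleManager {
  private lastActivityMs: Record<string, number> = {};
  private readonly idleMs: number;
  private readonly hibernate: HibernateFn;

  constructor(idleMs: number, hibernate: HibernateFn) {
    this.idleMs = idleMs;
    this.hibernate = hibernate;
  }

  // Record activity of any kind the application cares about
  // (HTTP requests, file operations, user input, ...).
  touch(sandboxId: string, nowMs: number): void {
    this.lastActivityMs[sandboxId] = nowMs;
  }

  // Hibernate every tracked sandbox that has been idle for at least idleMs.
  // Returns the ids that were hibernated on this pass.
  sweep(nowMs: number): string[] {
    const hibernated: string[] = [];
    for (const sandboxId of Object.keys(this.lastActivityMs)) {
      if (nowMs - this.lastActivityMs[sandboxId] >= this.idleMs) {
        this.hibernate(sandboxId);
        delete this.lastActivityMs[sandboxId];
        hibernated.push(sandboxId);
      }
    }
    return hibernated;
  }
}

const hibernateCalls: string[] = [];
const manager = new LifecycleManager(5 * 60_000, (id) => {
  hibernateCalls.push(id); // stand-in for a real hibernate call
});

manager.touch("sandbox-a", 0);
manager.touch("sandbox-b", 200_000);
console.log(manager.sweep(300_000)); // [ 'sandbox-a' ]
```

Because the application owns the clock and the definition of activity, the sandbox never hibernates on a timer the application cannot see.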

That said, we also want to support a simple, reliable timeout and wakeup mechanism. Our goal is to provide a default behavior that’s easy to understand, while still allowing room for configuration where developers need it.

Here are some key questions to consider:

  • Should the hibernation timeout only be extended when calling the Sandbox Agent? (For example, using its health endpoint as a heartbeat.)
  • If the timeout should extend on any request to the sandbox, are there certain types of requests that should not extend it?
  • If any request can wake the sandbox, are there certain types of requests that should not trigger a wakeup?

Feedback & Collaboration

We deeply appreciate our users for supporting us through this transition. One of the clearest lessons has been that you want sandboxes to behave like low-level, controllable resources, not high-level “laptops in the cloud.”

Your feedback has been invaluable in shaping the future of the SDK.

As we move forward, we invite you to share:

  • Comments or concerns
  • Specific use cases you’d like us to consider
  • Interest in joining feedback sessions or implementation discussions

We’re committed to making this transition smooth and giving you the tools and flexibility you need.

With love, The CodeSandbox SDK Team ❤️