The Anatomy of DBOS Cloud

Anatomy of DBOS Cloud - how stateful serverless works
Max Demoulin
September 27, 2024

Since introducing DBOS Cloud in March 2024, we have received many inquiries about its architecture. This post presents that architecture and explains how it fits our long-term vision of tightly integrating the application, operating system, and database.

DBOS Cloud architecture

Building, securing and operating a multi-tenant serverless platform is challenging. Such a platform takes on the burden of managing infrastructure for users and provides an intuitive way to deploy and use applications. Specifically:

  1. Applications must be isolated from each other and the platform
  2. Applications must be highly available
  3. Applications must be reliable
  4. The platform must support frequent code updates

With these properties in mind, we designed a control plane/data plane architecture in which the platform’s intended state resides in a database and both planes are stateless.

  • The control plane is responsible for manipulating the system’s intended state, e.g., deciding how many resources to allocate to an application. 
  • The data plane is responsible for asynchronously implementing that intended state, e.g., running DBOS applications with their expected configuration.
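
To make the split concrete, here is a minimal sketch of the idea, not DBOS Cloud's actual code. It assumes a hypothetical intended_state table (SQLite stands in for the control plane database) and hypothetical function names. The control plane only writes intent; the data plane only reads it and acts later.

```python
import sqlite3


def open_state_db(path=":memory:"):
    """Create the (hypothetical) intended-state table if it is missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS intended_state ("
        "app TEXT PRIMARY KEY, version INTEGER NOT NULL, vms INTEGER NOT NULL)"
    )
    return db


def control_plane_set(db, app, version, vms):
    """Control plane: record what should be true. It starts nothing itself."""
    db.execute(
        "INSERT OR REPLACE INTO intended_state (app, version, vms) VALUES (?, ?, ?)",
        (app, version, vms),
    )
    db.commit()


def data_plane_read(db, app):
    """Data plane: read the intent so it can be enacted asynchronously."""
    row = db.execute(
        "SELECT version, vms FROM intended_state WHERE app = ?", (app,)
    ).fetchone()
    return None if row is None else {"version": row[0], "vms": row[1]}
```

Because neither function keeps anything in memory between calls, either side can be restarted at any time and pick up from whatever the table says.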

This post focuses on how the control and data planes work together to manage application deployment, and how the unique properties of DBOS Transact facilitate operating the platform.

Figure: DBOS Cloud stateful serverless computing architecture diagram

Deploying applications

When you deploy an application to DBOS Cloud, its code is uploaded to the control plane. Upon receiving a deploy request, the control plane increments the application version, configures some metadata, e.g., the amount of resources allocated, and finally schedules a build task on one of the data plane hosts. This task builds the application and runs its configured schema migration commands.
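
The deploy steps described above can be sketched roughly as follows. This is an illustration under stated assumptions: ControlPlane, AppRecord, build_queue, and the memory_mb key are hypothetical names, and plain in-memory containers stand in for the control plane database and the build scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class AppRecord:
    version: int = 0
    resources: dict = field(default_factory=dict)


@dataclass
class ControlPlane:
    apps: dict = field(default_factory=dict)         # stands in for the control plane database
    build_queue: list = field(default_factory=list)  # stands in for build-task scheduling

    def deploy(self, app_name, code_archive, resources):
        """Handle a deploy request; nothing is started here, only recorded."""
        record = self.apps.setdefault(app_name, AppRecord())
        record.version += 1                 # 1. increment the application version
        record.resources = dict(resources)  # 2. configure metadata
        # 3. schedule a build task for a data plane host to pick up
        self.build_queue.append(
            {"app": app_name, "version": record.version, "code": code_archive}
        )
        return record.version
```

For example, calling deploy twice for the same application yields versions 1 and 2 and leaves two build tasks queued, one per version.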

At this point, the application is not yet running: in the steps above, the control plane has only synchronously declared and persisted the intended state of the application in the control plane database. Data plane agents will asynchronously implement this intent and start creating virtual machines executing application code (described in the next section). Because the intended state is persisted in the database, DBOS Cloud processes can be updated and restarted without losing the state of deployed applications.

Running applications

Data plane hosts are where applications actually run. Applications are isolated in secure sandboxes implemented with Firecracker, a proven virtualization technology used in popular serverless platforms like AWS Lambda. 

The core of a data plane host is a resolving loop that periodically fetches the intended state of an application from the control plane database (e.g., how many resources to allocate and which version of the code to run) and enacts it.
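
A resolving loop of this kind can be sketched in a few lines. The version below is a simplified illustration, not the DBOS Cloud agent: fetch_intent, steps, period_s, and iterations are hypothetical parameters, and a VM is represented by nothing more than a list entry.

```python
import time


def resolve_once(intent, host_vms, steps):
    """Run each step in order; every step takes and returns the host's VM list."""
    for step in steps:
        host_vms = step(intent, host_vms)
    return host_vms


def resolving_loop(fetch_intent, steps, host_vms, period_s=1.0, iterations=None):
    """Periodically fetch the intended state and enact it.

    iterations=None loops forever; a number bounds the loop (handy for tests).
    """
    n = 0
    while iterations is None or n < iterations:
        intent = fetch_intent()  # always re-read: the agent holds no state of its own
        host_vms = resolve_once(intent, host_vms, steps)
        n += 1
        if iterations is None or n < iterations:
            time.sleep(period_s)
    return host_vms
```

Each pass starts from the freshly fetched intent, so running the loop again after the host already matches the intent changes nothing.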

Automatic restart and recovery

The first step in the resolving loop restarts and recovers any unresponsive VMs. Unlike a conventional system such as Kubernetes, DBOS Cloud not only restarts failing VMs but also recovers application state. After a new VM is started and DBOS Transact becomes available, the data plane agent calls the workflow recovery API with the IDs of the workflows that were running on the unresponsive VM. Workflow IDs act as idempotency keys and, of course, DBOS Transact workflows recovered this way will resume exactly where they left off 😉. This automatic restart and recovery makes DBOS applications highly available and reliable (addressing challenges 2 and 3 above).
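
To see why a workflow ID works as an idempotency key, consider the toy model below. It is emphatically not the DBOS Transact API: CheckpointStore, run_workflow, and recover are hypothetical names, and a dictionary stands in for the database in which completed step outputs are recorded. Re-running a workflow under the same ID skips every step that already has a checkpoint.

```python
class CheckpointStore:
    """Stands in for the database where completed step outputs are recorded."""

    def __init__(self):
        self.outputs = {}  # (workflow_id, step_index) -> step output


def run_workflow(store, workflow_id, steps, crash_before=None):
    """Run steps in order, skipping any step already checkpointed under this ID.

    Returns the indexes of the steps actually executed in this call.
    """
    executed = []
    for i, step in enumerate(steps):
        key = (workflow_id, i)
        if key in store.outputs:
            continue  # completed before the failure: do not run it again
        if crash_before == i:
            raise RuntimeError("simulated VM failure")
        store.outputs[key] = step()
        executed.append(i)
    return executed


def recover(store, pending_ids, steps_by_workflow):
    """After a replacement VM starts, re-run each workflow that was in flight."""
    return {
        wf_id: run_workflow(store, wf_id, steps_by_workflow[wf_id])
        for wf_id in pending_ids
    }
```

In this model, a three-step workflow that fails before its third step executes only that third step on recovery, and recovering it a second time executes nothing.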

Note how these capabilities facilitate code updates (further addressing challenge 4 above). Because data plane agents are stateless, they can be updated and restarted with new code without losing application state.

Auto-scaling

The second step in the resolving loop compares how many VMs the host should be running for this application with how many it is actually running, and scales up or down accordingly. The number of allocated VMs is a function of utilization and activity; idle applications are eventually scaled to 0. Because Firecracker VMs can boot in a hundred milliseconds, scaling up (or restarting) application resources is fast.
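
The post does not spell out the scaling function, so the sketch below uses an invented policy purely for illustration: target_utilization, idle_timeout_s, and max_vms are hypothetical knobs with made-up defaults. It shows the shape of the computation, a desired count derived from utilization and activity, then a delta against what is running.

```python
import math


def desired_vm_count(current, utilization, idle_seconds,
                     target_utilization=0.7, idle_timeout_s=300, max_vms=10):
    """Hypothetical policy; every threshold here is an assumed default."""
    if idle_seconds >= idle_timeout_s:
        return 0  # idle applications are eventually scaled to zero
    if current == 0:
        return 1  # activity resumed: bring one VM back
    # Size the fleet so average utilization lands near the target.
    wanted = math.ceil(current * utilization / target_utilization)
    return max(1, min(max_vms, wanted))


def scaling_delta(desired, running):
    """Positive: start that many VMs. Negative: stop that many."""
    return desired - running
```

A real policy would likely smooth utilization over time to avoid flapping; that is left out here to keep the comparison step visible.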

Decommissioning outdated VMs

The resolving loop accounts for application versioning and must handle decommissioning outdated VMs. When a new version is deployed through the control plane, the second step of the resolving loop starts new VMs running the latest version of the application code. Any new request destined for the application is served by these new VMs. In a third step, the resolving loop decommissions VMs running outdated versions of the code, but stops short of terminating VMs that are currently executing DBOS Transact workflows. This provides continuity of service but also means two versions of the code can run concurrently. To perform a backward-incompatible database schema migration, like renaming a column or deleting a table, you can use the DBOS Cloud API to perform multi-step rollouts.
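
The decommissioning rule can be sketched as a simple filter. The VM class, its active_workflows counter, and the function names below are hypothetical; the point is only that an outdated VM is terminated once it has no workflow in flight, while new requests are routed to the latest version.

```python
from dataclasses import dataclass


@dataclass
class VM:
    vm_id: str
    version: int
    active_workflows: int = 0


def route(vms, latest_version):
    """New requests go only to VMs running the latest version."""
    return [vm for vm in vms if vm.version == latest_version]


def decommission(vms, latest_version):
    """Split VMs into (kept, terminated).

    An outdated VM is kept only while it is still executing workflows.
    """
    kept, terminated = [], []
    for vm in vms:
        outdated = vm.version < latest_version
        if outdated and vm.active_workflows == 0:
            terminated.append(vm)
        else:
            kept.append(vm)
    return kept, terminated
```

An outdated VM that is still busy survives one pass and is removed on a later pass once its workflows finish, which is the window during which two versions run side by side.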

Closing words

Thanks to a control plane/data plane design in which all components are stateless, combined with the powerful state management features of DBOS Transact, DBOS Cloud provides a highly available, reliable and secure serverless environment for your cloud applications.

To get started with DBOS, just download DBOS Transact, the open source (MIT license) durable execution library (TypeScript or Python), and start running code locally. Check out the quickstart or download an example application for help.

Once your application is written and running, you can deploy it to DBOS Cloud and run it for free.

© DBOS, Inc. 2024