High Availability

RudderStack's High Availability design document

Any service can go down because of a hardware failure or a software bug. This document explains the engineering design and the deployment model that keep RudderStack highly available in either case.

Hardware Failures

We leverage Kubernetes and auto scaling groups to handle hardware failures and stay highly available.

To recover from node failures, we recommend provisioning the nodes with an auto scaling group. This is how a standard RudderStack production deployment is set up.

Deploying RudderStack in a single availability zone

If the node hosting RudderStack goes down, Kubernetes automatically schedules RudderStack on another available node. If you have an auto scaling group and Kubernetes cannot schedule RudderStack on any existing node, a new node is created and RudderStack is scheduled on it.

This is equivalent to the standard High Availability setup, where the infrastructure team keeps an extra backup node and uses a heartbeat mechanism to switch the master to it.

Deploying RudderStack in multiple availability zones

There could be situations, like data center failures, where an entire availability zone goes down. If you want RudderStack to be resilient to such failures, your Kubernetes cluster should span multiple availability zones.

Software Failures

RudderStack can switch between different running modes to stay resilient to software failures.

If RudderStack is not available, our web and mobile SDKs cache events on the customer's device and retry until they are delivered to the RudderStack server.

RudderStack Server Running Modes

RudderStack supports three running modes: normal, degraded, and maintenance.

Normal

In Normal mode, RudderStack receives events and forwards them to the destinations.

Degraded

If RudderStack keeps crashing while processing events, it enters degraded mode after a threshold number of restarts.

In Degraded mode, RudderStack receives events and stores them, but does not forward them to the destinations. All your events are still safe and will be sent to the destinations in their original order.
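
The store-then-forward behavior can be illustrated with a minimal sketch. This is not RudderStack's implementation: the `EventGateway` class, the `forward` callable, and the in-memory queue (standing in for durable storage) are all hypothetical, and the sketch assumes that stored events are forwarded when the server leaves degraded mode.

```python
from collections import deque


class EventGateway:
    """Toy model: always store events; forward them only outside degraded mode."""

    def __init__(self, forward):
        self.forward = forward   # hypothetical callable that delivers one event
        self.pending = deque()   # FIFO queue standing in for durable storage
        self.degraded = False

    def receive(self, event):
        # Events are accepted and stored in every mode.
        self.pending.append(event)
        if not self.degraded:
            self.drain()

    def drain(self):
        # Forward from the head of the queue so delivery order matches arrival order.
        while self.pending:
            self.forward(self.pending[0])
            self.pending.popleft()  # remove only after the forward call returned

    def set_degraded(self, degraded):
        self.degraded = degraded
        if not degraded:
            # Assumption: the backlog is forwarded once degraded mode ends.
            self.drain()
```

Because events are removed from the queue only after they are forwarded, nothing received during degraded mode is dropped, and the FIFO queue preserves arrival order.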

Maintenance

If RudderStack keeps crashing in degraded mode, it enters maintenance mode after a threshold number of restarts.

In Maintenance mode, RudderStack switches to a new database and receives events there. All your events are still safe. Send us the crash reports so that we can identify and fix the issue.
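
The transitions between the three modes can be sketched as a small state machine. This is an illustration under stated assumptions, not the server's actual logic: the `ModeTracker` class, the `RESTART_THRESHOLD` value, and the choice to reset the restart count on each mode change are hypothetical, since the real threshold and counting rules are not given here.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    MAINTENANCE = "maintenance"


# Hypothetical threshold; the real value is not stated in this document.
RESTART_THRESHOLD = 3

# Each mode falls back to the next one; maintenance is the last resort.
NEXT_MODE = {
    Mode.NORMAL: Mode.DEGRADED,
    Mode.DEGRADED: Mode.MAINTENANCE,
    Mode.MAINTENANCE: Mode.MAINTENANCE,
}


class ModeTracker:
    def __init__(self, threshold=RESTART_THRESHOLD):
        self.threshold = threshold
        self.mode = Mode.NORMAL
        self.restarts = 0  # crash restarts counted in the current mode

    def record_crash_restart(self):
        """Count one restart after a crash and return the mode to start in."""
        self.restarts += 1
        if self.restarts >= self.threshold:
            self.mode = NEXT_MODE[self.mode]
            self.restarts = 0  # assumed: the count starts over in the new mode
        return self.mode
```

With a threshold of 2, for example, the second crash restart moves the tracker from normal to degraded, and two further crash restarts move it to maintenance, where it stays.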

Alerting

RudderStack has a built-in alerting service that raises an alert when the server enters degraded or maintenance mode. The alerting service supports integrations with PagerDuty and VictorOps. You can configure it to alert you when something unexpected happens.

Please refer to this page for configuring alerts.

Client SDK Caching

Suppose the RudderStack service goes down unexpectedly and the SDKs cannot reach it. The web and mobile SDKs cache events in local storage, retry pending events with a backoff, and deliver them once the service is available again.

Even during unexpected downtime, all your events are safe and will be delivered to your destinations.
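
The cache-and-retry pattern can be sketched as follows. This is not the SDKs' actual code: the `CachingClient` class, the `send` callable, and the delay values are hypothetical, and exponential backoff is assumed because the type of backoff is not specified here. An in-memory queue stands in for the device's local storage.

```python
import time
from collections import deque


class CachingClient:
    """Toy SDK client: cache events locally and retry delivery with backoff."""

    def __init__(self, send, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
        self.send = send            # hypothetical callable; raises ConnectionError if the server is unreachable
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.sleep = sleep          # injectable so tests can skip the real wait
        self.cache = deque()        # stands in for the device's local storage

    def track(self, event):
        # Cache first so the event survives a failed send.
        self.cache.append(event)

    def flush(self, max_attempts=5):
        """Deliver cached events in order; return True once the cache is empty."""
        delay = self.base_delay
        failed_attempts = 0
        while self.cache and failed_attempts < max_attempts:
            try:
                self.send(self.cache[0])
            except ConnectionError:
                failed_attempts += 1
                self.sleep(delay)
                delay = min(delay * 2, self.max_delay)  # assumed exponential backoff, capped
                continue
            self.cache.popleft()     # drop the event only after it was delivered
            delay = self.base_delay  # reset the backoff after a success
        return not self.cache
```

If `flush` gives up after `max_attempts` failures, the undelivered events stay in the cache, so a later call can pick up where this one stopped.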

Contact Us

If you want to know more about RudderStack High Availability, feel free to contact us. You can also request a demo to see RudderStack in action.