Plumbee engineering

How we do Canary Deployments


Introduction

Testing is one of the most important parts of the development process. However, it is difficult to reproduce every real-world scenario by simulating user behaviour in a sandbox environment. Even when software is thoroughly tested in the traditional way, some problems appear only when real users use the software in production. To improve our confidence in the system beyond what testing can provide, at Plumbee we have implemented a mechanism that lets us push software changes to a small fraction of the user base in a controlled way - commonly known as Canary Testing or Canary Deployment. The name refers to the canaries that miners took into coal mines to warn them when toxic gases reached dangerous levels.

Our requirements for canary deployment are:

  1. The upgrade to a new version must occur with no downtime
  2. It must be possible to run both the old version and the new version of the software in parallel - also known as blue-green deployment
  3. It must be possible to run tests against the new version in the production environment
  4. It must be possible to control the percentage of the user base that sees each version
  5. Each user’s experience must remain consistent throughout the upgrade
  6. It must be possible to roll back the upgrade easily when required.

Plumbee Infrastructure

To give some context before proceeding, here is a very brief description of our infrastructure, which runs on Amazon Web Services.

Requests from our clients go through a load balancer (ELB) before reaching our services, which are deployed on EC2 instances. Each service has a reverse proxy layer which is used to handle authentication, canary deployments and A/B testing; and beyond that a layer that implements the service logic. We implemented our reverse proxy in-house, because to our knowledge no out-of-the-box load balancers support all of the features we require. We keep this component as lightweight as possible, so that it includes only the features we really need.

Challenges

There are three main challenges when implementing any Canary Deployment solution:

  1. Avoiding downtime
  2. Keeping the same user experience during the upgrade - this means that once a user sees the new version, they should always see that version, unless a rollback is necessary.
  3. Storing and managing the configuration that represents the progress of the upgrade, which we call the “canary deployment manifest”. This must be persisted and there must be an easy way to manipulate it.

We will now detail how we solved each of these challenges.

Canary Deployment Procedure

From a broad perspective, the deployment of a new version of a service involves the following steps:

  1. Create a new cluster of EC2 instances using CloudFormation with the new version of the service, which we call the “canary cluster”
  2. Make a request to the service's reverse proxy to update the canary deployment manifest, adding the canary cluster identifier and allowing Plumbee employee accounts to see the canary cluster
  3. Run acceptance tests against the canary cluster to make sure there are no misconfigurations in the production environment
  4. After acceptance tests pass, update the manifest to send a small fraction of the user base to the canary cluster
  5. Monitor the behaviour of the canary cluster once requests start arriving
  6. If the canary cluster behaves as expected, do as many manifest updates as necessary to gradually increase the percentage of users sent to the canary cluster to 100%.
  7. Destroy the cluster running the previous version of the software.
  8. If the canary cluster doesn't behave as expected, roll back to the previous version by updating the manifest accordingly. Destroy the canary cluster once all the data needed to debug the problem has been gathered. Fix the issues and go back to step 1.
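
The steps above can be sketched as a small deployment script. This is only an illustration: `manifest_api`, `health_check`, the step schedule, and the soak time are hypothetical stand-ins for our actual deployment tooling, not its real interface.

```python
import time

def rollout(manifest_api, health_check, primary, canary,
            steps=(1, 10, 50, 100), soak_seconds=600):
    """Gradually shift traffic to the canary cluster, rolling back on failure.

    `manifest_api.update(...)` stands in for updating the canary deployment
    manifest; `health_check()` stands in for the monitoring checks.
    """
    for pct in steps:
        # Steps 4 and 6: update the manifest to send pct% of users to the canary.
        manifest_api.update(primary_cluster_id=primary,
                            canary_cluster_id=canary,
                            canary_cluster_percentage=pct)
        # Step 5: let the canary soak so metrics can accumulate.
        time.sleep(soak_seconds)
        if not health_check():
            # Step 8: roll back by pointing 100% of traffic at the old version.
            manifest_api.update(primary_cluster_id=primary)
            return False
    # Step 7: the canary becomes the new primary.
    manifest_api.update(primary_cluster_id=canary)
    return True
```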

In the sections below, we describe this solution in more detail.

User Base Sampling

One way of choosing which users will see the new version is to assign users to either the old or new version randomly. Although this may sound straightforward, it requires persisting the assignment to ensure that each user always sees the correct version. Thus every request would require an access to persistent storage, either to look up which version the user should see or to make the assignment.

To avoid an access to persistent storage, we decided instead to use the following calculation to assign the user to either the old or new version:

string assignCluster(userID, primaryClusterID, canaryClusterID, percentage) {
    if ((md5(userID + "-" + primaryClusterID + "-" + canaryClusterID) mod 100) < percentage) {
        return canaryClusterID;
    }
    return primaryClusterID;
}

This calculation has the following properties:

  1. All users have the same chance of being assigned to the canary cluster, that is, the calculation guarantees an even distribution
  2. It produces the same result each time it is calculated during an upgrade, so that users have a consistent experience
  3. It changes for every upgrade, so that different users are assigned to the canary cluster for each upgrade.
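
As a sketch of this calculation and its properties, here is a runnable Python version (function and variable names are ours for illustration, not the production code):

```python
import hashlib

def assign_cluster(user_id: str, primary: str, canary: str, percentage: int) -> str:
    """Deterministically assign a user to the primary or canary cluster.

    Hashing the user ID together with both cluster IDs makes the result
    stable within one upgrade but reshuffled on the next one.
    """
    key = f"{user_id}-{primary}-{canary}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
    return canary if bucket < percentage else primary

# Property 2: the assignment never changes while the cluster IDs are fixed.
assert assign_cluster("user-42", "v2.5", "v2.6", 10) == \
       assign_cluster("user-42", "v2.5", "v2.6", 10)

# Property 1: roughly percentage% of users land on the canary cluster.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(assign_cluster(u, "v2.5", "v2.6", 10) == "v2.6"
                   for u in users) / len(users)
print(f"canary share: {canary_share:.1%}")
```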

Each canary deployment manifest follows this format:

service.name.version: {
    "primaryClusterID":<cluster ID of the current version>,
    "canaryClusterID":<cluster ID of the new version, optional>,
    "canaryClusterPercentage":<0-100, optional>
}

The "canaryClusterPercentage" parameter is a value between 0 and 100 that defines the percentage of users that should be forwarded to the canary cluster. The "canaryClusterID" attribute is optional; when it is omitted, 100% of the user base is served by the version defined in "primaryClusterID".

Let's now look at some concrete examples - imagine that Plumbee currently has a “Wallet Service” live. Our canary deployment manifest for the wallet service would look like this:

wallet.service.version: {
     "primaryClusterID":"wallet-service-v2.5"
}

Now say we want to roll out a new version of our wallet service and assign 10% of the user base to it - we would update the canary deployment manifest to:

wallet.service.version: {
    "primaryClusterID":"wallet-service-v2.5",
    "canaryClusterID":"wallet-service-v2.6",
    "canaryClusterPercentage":10
}

To finalize the deployment, assuming all the checks have passed, we only need to assign the new service version to 100% of the population:

wallet.service.version: {
    "primaryClusterID":"wallet-service-v2.6"
}

The next diagram illustrates the series of steps we just described:

Canary Deployment Manifest Management

We store the canary deployment manifest in a DynamoDB table as well as caching it in the memory of the reverse proxy. The cached version is refreshed via periodic polling of the underlying table.
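
A minimal sketch of such a polling cache, assuming a `fetch` callable that wraps the DynamoDB read (the class name and refresh interval are illustrative, not our proxy's actual code):

```python
import threading
import time

class ManifestCache:
    """In-memory copy of the canary deployment manifest, refreshed by
    periodic polling of the underlying store."""

    def __init__(self, fetch, interval_seconds=5.0):
        self._fetch = fetch
        self._interval = interval_seconds
        self._manifest = fetch()  # initial load before serving traffic
        self._lock = threading.Lock()
        poller = threading.Thread(target=self._poll, daemon=True)
        poller.start()

    def _poll(self):
        while True:
            time.sleep(self._interval)
            fresh = self._fetch()  # stands in for the DynamoDB read
            with self._lock:
                self._manifest = fresh

    def get(self):
        """Return the cached manifest; called on every proxied request."""
        with self._lock:
            return self._manifest
```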

To easily manipulate the canary deployment configuration we expose a protected API that is used by our deployment scripts. The API is also used by our monitoring tools to check the canary deployment status in real time. The operations exposed by this API are very simple: reading the current manifest and updating it.

Monitoring

Once the first phase of the canary deployment is completed, i.e. the new version has started receiving a small fraction of the total traffic, we run a few health checks to assess whether the canary is behaving as expected. If our expectations are met, we gradually increase the canary targeting until we reach the whole user base. On the other hand, if something goes wrong, we quickly switch 100% of the traffic back to the previous version. We monitor system metrics such as error rate, response times and throughput for all the components involved, in addition to product metrics.
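
A simplified error-rate gate along these lines might look as follows; the thresholds and names are illustrative, and a real check would also cover response times, throughput, and product metrics:

```python
def canary_healthy(canary_errors, canary_requests,
                   baseline_errors, baseline_requests,
                   max_ratio=2.0, min_requests=100):
    """Fail the canary if its error rate exceeds `max_ratio` times the
    baseline's. Returns None while there is too little traffic to judge.
    All thresholds here are illustrative, not our production values.
    """
    if canary_requests < min_requests:
        return None
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # Floor the baseline so a perfectly clean primary doesn't turn any
    # single canary error into an instant failure.
    return canary_rate <= max_ratio * max(baseline_rate, 0.001)
```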

A recent improvement to our Canary Deployment process is to send only Plumbee employee accounts to the canary cluster before any real user accounts are allowed. This has proved useful for making sure there are no obvious issues and for running acceptance tests against the canary cluster. We use both Facebook account roles and a whitelist to identify internal accounts; the whitelist covers cases where we want to test features without Facebook integration.
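
A sketch of that routing rule (the `is_employee` flag would be derived from Facebook account roles; the names and structure here are illustrative, not our proxy's actual code):

```python
def choose_cluster(user_id, is_employee, manifest,
                   whitelist=frozenset(), sample_to_canary=None):
    """Route internal accounts to the canary cluster before any real users
    are sampled in. `sample_to_canary` stands in for the hash-based
    sampling decision described earlier.
    """
    primary = manifest["primaryClusterID"]
    canary = manifest.get("canaryClusterID")
    if canary is None:
        return primary
    if is_employee or user_id in whitelist:
        return canary  # internal accounts see the canary first
    if sample_to_canary is not None and sample_to_canary(user_id):
        return canary
    return primary
```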

Keep in mind that running acceptance tests against the production environment, and having good monitoring tools that can compare the new metrics against a baseline, are both very important. Doing this correctly will give you confidence to automate the whole process and put Continuous Deployment in place, a path we continue to pursue.

- Ricardo Costa
