Today, any action that takes more than a few seconds to complete is considered slow. Software is among the most important assets of almost any organization, so the value of a high-availability solution is hard to overstate. If our software is not accessible and responsive, at both the server and the UI level, users will lose trust in it and abandon it.

This is relevant for software vendors in two main cases:

  • Responding to unplanned downtime incidents
  • Upgrading the software with new features or capabilities

This article focuses on the latter and attempts to answer an important question: how can you upgrade your software without downtime or customer impact? To illustrate, we will examine Slack as a use case.

If you haven’t heard of Slack, it’s an application that allows organizations to communicate across their business about technical topics and to automate processes. Slack runs a server in the cloud that acts as the platform for its web, desktop, and mobile client applications.

All communication is carried over SSL/TLS, so the data is encrypted in transit between the clients and the server. It’s important to remember that the same database server serves several organizations; therefore data segregation is mandatory.

Slack is a communication and automation platform used by large organizations such as NASA, so it must stay at the forefront of technology and deliver a "five nines" (99.999% availability) level of service.

Trying to achieve the highest level of service, Slack encountered several challenges. The first one is reliability during planned downtimes.

High Availability Architecture

When the Slack team wants to update the application, it has many components to deal with, and upgrading them all at once would mean taking the application down. Each of these components is crucial to the application's basic functionality, so all of them must stay available for users to have access to the software. This is where the need for a high-availability solution comes in.

A high-availability solution promises to keep software running at all times (even during upgrades), but requires professional design of the application structure and demands a different deployment of all server-side components. When an application is designed for high-availability, all server-side components are replicated, and a special tool to manage these instances of the same component is introduced.

This tool is usually a load balancer, which implements a predefined algorithm to decide which instance of a component the next message should be directed to. This design gives the application a lot of power in load management, which improves application speed and allows for a smooth, uniform upgrade process. Production teams can now upgrade each instance on its own while an identical instance stays “in the air,” responsive to users’ requests. The result is a highly available and scalable application that can be upgraded without affecting its users.
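The idea can be sketched in a few lines of Python. This is only an illustration, not a description of how Slack or any particular load balancer works: round-robin stands in for the "predefined algorithm," and the names `Instance`, `RoundRobinBalancer`, and `rolling_upgrade` are made up for this example.

```python
class Instance:
    """One replica of a server-side component (hypothetical model)."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.in_rotation = True  # False while the instance is being upgraded


class RoundRobinBalancer:
    """Hands out instances in turn, skipping any that are out of rotation."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def pick(self):
        # Try each instance at most once, starting from the round-robin cursor.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate.in_rotation:
                return candidate
        raise RuntimeError("no instance in rotation: requests would fail")


def rolling_upgrade(balancer, new_version):
    """Upgrade one instance at a time; return who served a probe at each step."""
    served_by = []
    for instance in balancer.instances:
        instance.in_rotation = False            # stop routing to it
        served_by.append(balancer.pick().name)  # another replica still answers
        instance.version = new_version          # stand-in for the real upgrade
        instance.in_rotation = True             # put it back in rotation
    return served_by
```

With three replicas, every probe during the upgrade is answered by an instance other than the one being upgraded. With a single replica, `pick()` raises, which is exactly why the components have to be replicated first.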

Continuous Deployment as part of Continuous Delivery

There are many phases of updating an application, including understanding which components are being upgraded, planning which regression testing is required for the relevant areas, and calculating the effect on the user – and minimizing it as much as possible.

Slack invested in an automatic upgrade process that can handle any type of upgrade. Slack built its backend on a microservices architecture, so each backend component is independent and can be deployed separately.

A dedicated process is started for each component: it upgrades the component independently, runs sanity-level tests, verifies with a suite of API tests that the code was updated correctly, and finally checks that the component is running and responsive to users’ requests.
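A minimal sketch of such a per-component process follows. It assumes each step can be expressed as a function that returns True on success, and that a rollback action exists; `Step`, `upgrade_component`, and the step names are hypothetical and not taken from Slack's tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # returns True when the step succeeded


def upgrade_component(component: str, steps: List[Step],
                      rollback: Callable[[], None]) -> Tuple[bool, List[str]]:
    """Run the steps in order; on the first failure, roll back and stop."""
    log = []
    for step in steps:
        if not step.run():
            log.append(f"{component}/{step.name}: FAILED")
            rollback()
            log.append(f"{component}/rollback: done")
            return False, log
        log.append(f"{component}/{step.name}: ok")
    return True, log


# Example wiring; the lambdas stand in for real deployment and test commands.
steps = [
    Step("deploy", lambda: True),
    Step("sanity-tests", lambda: True),
    Step("api-tests", lambda: True),
    Step("health-check", lambda: True),
]
ok, log = upgrade_component("search-service", steps, rollback=lambda: None)
```

Keeping every step behind the same small interface makes it easy to reuse the process for any component and to see exactly where an upgrade stopped.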

If Slack wants to update its web, desktop or mobile applications, a dedicated process for each app is started by first updating the website/store, then installing the new application on the relevant machines/devices. This is followed by a specific set of tests for the relevant application type and checking that users start downloading and upgrading.

Let’s dig deeper to understand what Continuous Delivery (CD) means. When a company wants to update its website, it wants to do so quickly and safely. CD means that every type of change (new features, configuration changes, bug fixes, or experiments) is automatically rolled out and verified all the way to the production environment. A trustworthy CD process, built on a stable continuous integration (CI) process, is the product of detailed planning that describes what every step in the process does, which components it affects, how long it should take, and which tests need to be run.

So when Slack wants to update its website, it starts the CD process by first building the latest code it has, then deploying the new site to a staging environment with the new modules and content. Next it verifies that the deployment was successful and the content is correct. After successful testing and verifying the staging environment, the same steps are repeated for the production environment.
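The staging-then-production flow can be sketched as follows. This is an assumption-laden illustration rather than Slack's pipeline: `run_pipeline` and its `build`, `deploy`, and `verify` callables are hypothetical, and the sketch builds once and promotes the same artifact, which is one common way to read "the same steps are repeated."

```python
from typing import Callable, List, Sequence


def run_pipeline(build: Callable[[], str],
                 deploy: Callable[[str, str], None],
                 verify: Callable[[str], bool],
                 environments: Sequence[str] = ("staging", "production")) -> List[str]:
    """Deploy to each environment in order; stop at the first failed check."""
    artifact = build()  # build once, then promote the same artifact
    promoted = []
    for env in environments:
        deploy(artifact, env)
        if not verify(env):
            break  # later environments are never touched
        promoted.append(env)
    return promoted
```

If verification fails on staging, the function returns an empty list and production is left untouched, which is the point of verifying staging first.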

Control Your Release

One method for implementing safer software updates is “controlled release.” A controlled release lets you verify that the software was upgraded correctly and that users like what they received. If that is not the case, only a small number of users are affected, and a rollback process can start.

A controlled release exposes the new content to only a small number of users, who act as a control group for it. When Slack introduced its new bot, it wasn’t sure how the bot would affect its existing users or attract new ones. This information was critical, as Slack needed to know whether its current infrastructure had to be reinforced to support a dramatic rise in usage.

Slack decided to expose the new bot to a small group of users. It added the bot to 5% of its active user teams and observed the effect. It measured communication in these teams and the usage of the bot, and asked users to answer a few questions about the new feature.
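One common way to pick such a group is to hash a stable identifier into a bucket and compare it with the rollout percentage. The sketch below shows that technique only; the article does not say how Slack selected its 5%, and `in_rollout`, the team IDs, and the feature name are hypothetical.

```python
import hashlib


def in_rollout(team_id: str, feature: str, percent: float) -> bool:
    """Return True if this team falls inside the first `percent` of buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash feature and team together so each feature gets its own 5%.
    digest = hashlib.sha256(f"{feature}:{team_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Teams exposed to the (hypothetical) "new-bot" feature at 5%.
teams = [f"team-{i}" for i in range(1000)]
exposed = [t for t in teams if in_rollout(t, "new-bot", 5)]
```

Because the decision depends only on the identifier, a team gets the same answer on every request, and raising the percentage later only adds teams without removing any that already have the feature.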

This information gave the Slack team an indication of what to expect when exposing the new bot to the rest of the Slack community. More importantly, it gave the team a dry run of the deployment process and a way to measure, in terms of load, how the new bot affects team communications.

In the mobile world, different versions of the same application can coexist. This means that controlled release can be achieved quite easily, as long as the release process knows how to update the correct store.

A few things to remember when planning and running a continuous upgrade process:

  1. “Keep it simple.” The building blocks of the process should be small and lightweight. Each step of the process should be dedicated to a specific component in the application, upgrade it quickly, and be autonomous.
  2. Test the process and continuously improve it. Running this process for the first time across a production environment is not the safest thing to do. A lot of things can go wrong, and production should never be jeopardized. The deployment process should be tried on different environments before being executed on any customer-facing environment. Improving the process goes hand-in-hand with the application changes.
  3. Test what you need, not what you feel. When performing continuous delivery of an application, the most important step is to test after an environment is deployed, to make sure the deployment was successful and the environment is working properly. The suite of tests executed after deployment should focus on the components that were updated. The test set must be small and fast so that feedback on the status of the deployment arrives quickly. A good practice is to have a small, separate set of tests for every component. Because the CD process knows which components are being upgraded, it can decide which tests to run.
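The third point, choosing tests from the list of upgraded components, can be sketched as a simple lookup. The component and test names below are invented for illustration, and `select_tests` is a hypothetical helper, not part of any CD tool.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping: one small, fast suite per component.
SUITES: Dict[str, List[str]] = {
    "messaging": ["test_send_message", "test_receive_message"],
    "search": ["test_search_returns_results"],
    "files": ["test_upload", "test_download"],
}


def select_tests(upgraded: Iterable[str], suites: Dict[str, List[str]]) -> List[str]:
    """Return the tests for the upgraded components only, without duplicates."""
    selected: List[str] = []
    for component in upgraded:
        if component not in suites:
            # A component without its own suite cannot be verified after deploy.
            raise ValueError(f"no post-deployment tests defined for {component!r}")
        for test in suites[component]:
            if test not in selected:
                selected.append(test)
    return selected
```

Failing loudly on a component with no suite is a deliberate choice in this sketch: silently running zero tests would make a deployment look verified when it was not.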

It’s all about Automation

Continuous delivery and continuous deployment must be completely automated. There are a few tools that give us the ability to build an automatic process. The first and most common one, which Slack has implemented as its tool of choice, is Jenkins.

Jenkins is an open source tool that allows the user to build a pipeline of several jobs executed in succession. These jobs can be configured to run almost any task.

Chef is another tool that was built specifically for CD processes. It supplies a variety of out-of-the-box deployment methods, has strong integration with the cloud, and helps build the right process for your application. Chef comes with a price tag, but it offers automated delivery.

Travis CI is another open source tool that allows users to easily execute deployment actions. In addition to its on-premises solution, it has a SaaS version that allows users to plan, deploy, and test environments without needing to install anything locally.

HPE Operations Orchestration (HPE OO) is a tool for enterprise organizations. It focuses on deployment, and features a UI which shows the status of each application component or server. It integrates with most of the common tools in the industry, providing an easy way to build the deployment process and validate the completed deployment steps.

To summarize, implementing and maintaining a successful continuous delivery process requires effort, but it moves the application and the company a few steps forward into the DevOps world. Each organization should decide which steps to focus on when delivering its products.

Nonetheless, it’s clear that continuous delivery is becoming a standard in the industry, and eventually everyone will have to make adjustments and implement such processes across their companies. Today, several organizations proudly present the continuous delivery processes they created, and are happy to share their knowledge and experience on how to do it right.