Feature Toggles and A/B testing

Overview

Here at FMP, we practice trunk-based development, with changes committed to the main trunk being deployed by our continuous integration server onto our acceptance, integration and finally production servers. We commit to trunk multiple times daily, and provided that the deployment pipeline is green, that also means we deploy those changes to our production servers multiple times daily.

Trunk-based development can be scary for developers and product owners: every change to the code base is deployed to live, so if a feature is not ready for release, how do we control the release of a new feature or hide a feature still in development?

The answer is feature toggling. Feature toggles allow us to safely commit all changes to trunk without accidentally exposing unwanted features to our end users. Trunk-based development along with feature toggling allows us to separate the process of deployment (automated by our continuous integration server) from the process of release (configured via feature toggles).

If you are unfamiliar with feature toggles, this Pete Hodgson article on feature toggles is a good place to start. For more details on trunk-based development, see this Martin Fowler article.

Feature toggle implementation in FMP

Feature toggles are, on the whole, used at the UI layer to either:

Hide a feature that is still under development. This we would call a feature toggle.
Perform A/B testing on UI elements to determine which approach works best. This we call an experimentation toggle.

Feature toggles are generally not used elsewhere in the stack. The UI is our point of contact with the end user, so it’s clear that any entry points into a feature still under development need to be hidden there. Code in the API and microservices layer is typically not controlled via toggles because it is far easier to hide code changes deeper within the stack. We could use toggles there; it’s just that, for feature development, we generally don’t need to.

Technology choices

Before we could start using feature toggles in our UI, we needed some way of managing them. In fact, we had a whole shopping list of requirements for the feature toggle service, including:

Easily create a toggle and set its state depending on the environment requesting it. For example, a toggle may be on for the integration environment but off for the production environment.
Canary releases – Roll out a new feature to a small subset of users and then to a larger group
Targeted release – Release a feature to only a specific cohort of users
A/B testing – Ability to perform experiments based upon the feature toggle.
An API or SDK available – we also practice repository driven development, so it’s important to have a service that exposes an API that we can leverage to programmatically create feature toggles and manage those toggles.

After a lot of internal discussion, including debates on whether we should develop our own feature toggle service, we eventually decided upon a SaaS called LaunchDarkly. It fits nicely into our technology stack and ticks a number of boxes from our shopping list, including an API that allows us to build automated tools to manage our feature toggles. It also has a number of SDKs available, including a Node.js SDK and a client-side JavaScript SDK.

Our web application is built using React and GraphQL/Relay. The LaunchDarkly JavaScript SDK client is embedded in the page, giving each component the ability to query LaunchDarkly for a feature toggle state. In practice, components rarely interact with the LaunchDarkly client directly; instead they use a module which wraps the JavaScript client and provides an easy way of displaying a component based upon the value of a toggle.

Let’s take a contrived example and assume that we are updating our registration page. We want this work to be visible to our product owners via our internal integration server but hidden from our customers on our production server. The entry point to the new registration pages is via a Free Registration button which looks exactly the same on both the production and integration servers.

First, we create a feature toggle in LaunchDarkly – let’s call it new-registration. A feature toggle created in LaunchDarkly is available for use across all environments but the state of the toggle can be configured per environment. In our example, we configure the feature toggle to be on for the integration environment and off for production.
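To make the per-environment idea concrete, here is a minimal sketch of how such a toggle could be modelled. The `toggles` object and the `stateFor` helper are hypothetical names used only for illustration; LaunchDarkly stores and serves this configuration, so the sketch is a mental model rather than code we run.

```javascript
// Illustrative model only: one toggle, with a separate state per environment.
// The shape of this object is an assumption, not LaunchDarkly's data format.
const toggles = {
  'new-registration': { integration: true, production: false },
};

// Look up the state of a toggle for an environment, falling back to a
// default (off) when the toggle or environment is unknown.
const stateFor = (toggleName, environment, defaultState = false) => {
  const perEnvironment = toggles[toggleName];
  if (!perEnvironment || !(environment in perEnvironment)) {
    return defaultState;
  }
  return perEnvironment[environment];
};

console.log(stateFor('new-registration', 'integration')); // true
console.log(stateFor('new-registration', 'production')); // false
```

Defaulting to off for anything unknown mirrors the safe behaviour we want: a missing toggle hides the feature rather than exposing it.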

Within the React page, we want to render a button that leads to either the existing registration page or the new registration page. Two button components are required here: one that, when clicked, navigates to the existing registration page, and another that navigates to the new registration page.

First, we bring in a module that wraps the LaunchDarkly JavaScript SDK client to avoid components having a dependency on LaunchDarkly.

import { toggleAB } from 'toggler';

The toggler module is pretty straightforward. Here it is in its entirety:

/* global featureToggleClient */

// Render ReactComponentA when the toggle is on; otherwise render nothing.
export const toggleOn = (ReactComponentA, toggleName) =>
  featureToggleClient.state(toggleName, false) ? ReactComponentA : () => null;

// Render ReactComponentA when the toggle is on; otherwise render ReactComponentB.
export const toggleAB = (ReactComponentA, ReactComponentB, toggleName) =>
  featureToggleClient.state(toggleName, false) ? ReactComponentA : ReactComponentB;

(featureToggleClient is just a wrapper around LaunchDarkly’s SDK.)
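For readers wondering what such a wrapper might look like, the following is a minimal sketch rather than our actual code. It assumes a LaunchDarkly JavaScript client instance that has already been initialised elsewhere, and relies only on the SDK's `variation(key, defaultValue)` method; `createFeatureToggleClient` is a hypothetical name, and the stand-in client exists purely so the example runs on its own.

```javascript
// Hypothetical factory: wraps an initialised LaunchDarkly client (ldClient)
// so that the rest of the application only ever calls state().
const createFeatureToggleClient = (ldClient) => ({
  // Return the toggle state, or defaultState if the lookup throws, so a
  // toggle problem cannot break rendering.
  state(toggleName, defaultState = false) {
    try {
      return ldClient.variation(toggleName, defaultState);
    } catch (error) {
      return defaultState;
    }
  },
});

// Stand-in client for demonstration; in the browser this would be the real SDK client.
const fakeLdClient = {
  variation: (key, fallback) => (key === 'new-registration' ? true : fallback),
};

const featureToggleClient = createFeatureToggleClient(fakeLdClient);
console.log(featureToggleClient.state('new-registration', false)); // true
console.log(featureToggleClient.state('unknown-toggle', false)); // false
```

Keeping the SDK behind a single `state()` call means a change of toggle provider would touch this one module and not every component.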

Next, we create our two buttons which are pretty similar except for the navigation address:

const existingRegButton = () => <Button className='register' to='/register'>Free Registration</Button>;
const newRegButton = () => <Button className='register' to='/new_register'>Free Registration</Button>;

We use the imported toggleAB to determine which component to render, depending on the state of the ‘new-registration’ feature toggle. If the flag is true then the newRegButton is returned, otherwise the existingRegButton is used.

const RegistrationButton = toggleAB(newRegButton, existingRegButton, 'new-registration');

Finally, inside the render function, render the returned RegistrationButton component:

  render() {
    return (
      <div>
        <div>
          ...page content here…
        </div>
        <div>
          <RegistrationButton />
        </div>
      </div>
    );
  }

That is essentially it from the React side of things. When the web application is running on our servers, it knows which environment it’s running under and configures the LaunchDarkly JavaScript client accordingly. So, when running under the integration environment, the LaunchDarkly client is configured for integration and gets the state of the new-registration feature toggle for the integration environment. In our example, the state is set to on and so the newRegButton is rendered. Similarly, when running on the production servers, the client is configured for production and gets the state of the feature toggle for the production environment. There the state is off, so the existingRegButton is rendered and the end user is directed to the existing registration page.

Once a feature is complete, we can set the feature toggle to on in the production environment (in which case all end users will see the new page), or we can canary release and roll out the feature to a certain percentage of users: say 20% get a feature toggle state of on while the remaining 80% get a state of off. We can do all of this via the LaunchDarkly UI (or even programmatically via the LaunchDarkly API) without any React code changes or even a re-deployment of the code. Making a feature live to our end users is as simple as flipping a switch on the LaunchDarkly UI, and that state change is reflected immediately on our web application.
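The idea behind a percentage rollout can be illustrated with a small sketch. This is not LaunchDarkly's actual bucketing algorithm, which is not reproduced here; it is a simplified stand-in showing how hashing a user key can give each user a stable bucket, so the same user consistently sees the same variation. The names `fnv1a`, `bucketFor` and `isInRollout` are hypothetical.

```javascript
// 32-bit FNV-1a hash of a string, used here only to spread users evenly.
const fnv1a = (text) => {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
};

// Map a user to a stable bucket in the range 0..99 for a given toggle.
const bucketFor = (toggleName, userKey) => fnv1a(`${toggleName}:${userKey}`) % 100;

// A user is in the rollout when their bucket falls below the percentage.
const isInRollout = (toggleName, userKey, percentage) =>
  bucketFor(toggleName, userKey) < percentage;

// Roughly 20% of these users should get the toggle switched on.
const users = Array.from({ length: 10000 }, (_, i) => `user-${i}`);
const enabled = users.filter((user) => isInRollout('new-registration', user, 20));
console.log(`${enabled.length} of ${users.length} users see the new page`);
```

Including the toggle name in the hashed value means a user who falls into the first 20% for one toggle is not automatically in the first 20% for every other toggle.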

The toggleAB function we use in the React page can also be used for A/B experiments. Instead of having a button navigate to a different page, we might want to run an experiment on the colour of a button, or the text displayed in the button. (For example, does the text Register for free drive more users to register than Register here?) LaunchDarkly has functionality to set goals and assign those goals against the feature toggles. The JavaScript client can track UI interactions (such as element clicks) and sends events to LaunchDarkly when a goal has been met (for example, a button has been clicked). LaunchDarkly tracks page impressions, element clicks, etc. and provides statistics on how many users clicked the button when the feature flag is on or off.
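The client reports click goals itself, as described above, but a goal can also be reported explicitly. The sketch below is a hedged illustration of that: it assumes the client exposes the SDK's `track(eventName)` method for custom events, while the `trackGoal` helper, the `register-clicked` event name and the stand-in client are all hypothetical.

```javascript
// Hypothetical helper: wraps a click handler so that a custom goal event is
// reported (via the client's track method) before the handler runs.
const trackGoal = (client, goalName, handler = () => {}) => (...args) => {
  client.track(goalName);
  return handler(...args);
};

// Stand-in client that records events instead of sending them anywhere.
const sentEvents = [];
const fakeClient = { track: (eventName) => sentEvents.push(eventName) };

const onRegisterClick = trackGoal(fakeClient, 'register-clicked', () => 'navigating');
console.log(onRegisterClick()); // navigating
console.log(sentEvents); // [ 'register-clicked' ]
```

The same wrapped handler can be attached to both variations of the button, so the experiment compares like with like.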

Feature toggle maintenance

One of the criticisms of feature toggles is that the code base gets covered in feature flag calls that are never tidied up. We could also end up with a lot of feature flags that are never removed because developers aren’t sure whether they are safe to remove. The end result is a mess of code that is difficult to maintain.

LaunchDarkly can help here – it provides useful information on the state of a feature toggle. The screenshot below shows the LaunchDarkly UI hinting that a feature toggle looks like it could be a candidate for removal.

LaunchDarkly feature toggle hints

FMP are developing a tool, code name nagger, which interacts with the LaunchDarkly API, retrieves information on the state of the feature toggles and reports those states back to the engineers. The idea is that any stale feature toggles are reported back to the engineers via a Slack channel. The tool starts to “nag” the engineers to remove stale toggles, and the nagging gets louder the longer a stale feature toggle is in the system. Nagger will fail a CI build if it detects stale feature toggles that are older than, say, one month. Hopefully, the constant reminders about stale toggles, along with the threat of a failed build, should ensure that toggles are removed from the source code and from LaunchDarkly when no longer required.
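As a rough sketch of the escalation idea, and not nagger's actual implementation, the functions below classify toggles by how long they have been stale. The input shape (`name`, `staleSince`), the seven-day nagging threshold and the function names are assumptions made for the example; only the one-month build-failure threshold comes from the description above, approximated here as 30 days.

```javascript
const DAY_IN_MS = 24 * 60 * 60 * 1000;

// Thresholds in days. NAG_AFTER_DAYS is an invented value for illustration.
const NAG_AFTER_DAYS = 7;
const FAIL_BUILD_AFTER_DAYS = 30;

// Decide what to do about one stale toggle: 'ok', 'nag' or 'fail-build'.
const classifyToggle = (toggle, now = new Date()) => {
  const staleDays = Math.floor((now - toggle.staleSince) / DAY_IN_MS);
  if (staleDays >= FAIL_BUILD_AFTER_DAYS) return 'fail-build';
  if (staleDays >= NAG_AFTER_DAYS) return 'nag';
  return 'ok';
};

// Build a report for a list of toggles; the build should fail if any
// toggle has been stale for too long.
const buildReport = (toggles, now = new Date()) => {
  const results = toggles.map((toggle) => ({
    name: toggle.name,
    action: classifyToggle(toggle, now),
  }));
  return {
    results,
    failBuild: results.some((result) => result.action === 'fail-build'),
  };
};

const now = new Date('2017-06-30T00:00:00Z');
const report = buildReport(
  [
    { name: 'new-registration', staleSince: new Date('2017-06-28T00:00:00Z') },
    { name: 'old-banner', staleSince: new Date('2017-06-10T00:00:00Z') },
    { name: 'legacy-search', staleSince: new Date('2017-05-01T00:00:00Z') },
  ],
  now
);
console.log(report.results.map((r) => `${r.name}: ${r.action}`));
console.log(report.failBuild); // true
```

In a CI step, a `failBuild` value of true would translate into a non-zero exit code, while the `nag` entries would feed the Slack messages.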

Feature toggle performance

One of the concerns from the engineers around feature toggling is the performance of the 3rd party service. How does it affect page load speeds and execution times? Will the requests for feature toggle states slow down code execution?

These are valid concerns: waiting on a 3rd party SaaS is time consuming and potentially brittle. What happens if the 3rd party site is unavailable, or network latency is high? Helpfully, LaunchDarkly has been designed to be fast and resilient; it even works if the LaunchDarkly website is unavailable. (See the LaunchDarkly FAQ for more details on performance.) Requests for feature toggle states are fast: all toggle states are cached within the client, so no remote requests are required to get a feature toggle state. Updates to the state of the toggle (for a user) are streamed from the LaunchDarkly servers to the client using server-sent events. This means that, provided we have a network connection, changes to the feature flag state set via the LaunchDarkly UI are reflected almost immediately in the JavaScript client.
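Code that needs to react to a live change can subscribe to the client's `change` event. The sketch below uses a hand-written stand-in for the browser client so that it runs on its own. The `watchToggle` and `createFakeClient` helpers are hypothetical, and the payload shape (an object keyed by flag name, with `current` and `previous` values) is an assumption about the SDK's `change` event that should be checked against the LaunchDarkly documentation.

```javascript
// Hypothetical helper: call onChange whenever the named toggle changes.
const watchToggle = (client, toggleName, onChange) => {
  client.on('change', (changes) => {
    if (toggleName in changes) {
      onChange(changes[toggleName].current, changes[toggleName].previous);
    }
  });
};

// Minimal stand-in for the LaunchDarkly client: it lets us register
// listeners and simulate an update streamed from the server.
const createFakeClient = () => {
  const listeners = {};
  return {
    on(event, listener) {
      (listeners[event] = listeners[event] || []).push(listener);
    },
    emit(event, payload) {
      (listeners[event] || []).forEach((listener) => listener(payload));
    },
  };
};

const fakeClient = createFakeClient();
const seen = [];
watchToggle(fakeClient, 'new-registration', (current, previous) => {
  seen.push({ current, previous });
});

// Simulate the toggle being switched on in the LaunchDarkly UI.
fakeClient.emit('change', { 'new-registration': { current: true, previous: false } });
console.log(seen); // [ { current: true, previous: false } ]
```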

However, how do the feature toggle states get into the JavaScript client in the first place? As part of the page load we need to initialise the LaunchDarkly client, and that does involve a remote call to LaunchDarkly. In fact, we need to wait for the client to emit the on('ready') event before we can start to ask for feature toggle states. This can slow down the page load considerably (the LaunchDarkly docs suggest 100ms or more latency here). The React component cannot render the page until the client is initialised, and a 100ms delay wasn’t acceptable.

Once again, LaunchDarkly has a solution. During initialisation of the client, we can bootstrap the client with default values for the feature toggle states. Bootstrapping means that the on('ready') event is fired immediately and we don’t need to wait around for any remote calls. But, where do we get the initial set of toggle states from? We need to get the correct toggle states because we want to ensure that the correct component is rendered, but the only way to get the correct states is to call LaunchDarkly.
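A minimal sketch of bootstrapped initialisation follows. It assumes the SDK's `initialize(environmentKey, user, options)` entry point with its `bootstrap` option; the `initToggleClient` helper, the environment key and the user object are placeholders, and the SDK is passed in as a parameter so the example can run against a stand-in.

```javascript
// Hypothetical helper: initialise the client with known toggle states so
// that no remote call is needed before the first render.
const initToggleClient = (sdk, environmentKey, user, bootstrapStates, onReady) => {
  const client = sdk.initialize(environmentKey, user, {
    bootstrap: bootstrapStates,
  });
  // With bootstrap values supplied, 'ready' fires without waiting on the network.
  client.on('ready', () => onReady(client));
  return client;
};

// Stand-in SDK that behaves like a bootstrapped client, for demonstration only.
const fakeSdk = {
  initialize: (environmentKey, user, options) => ({
    on: (event, listener) => {
      if (event === 'ready') listener();
    },
    variation: (key, fallback) =>
      key in options.bootstrap ? options.bootstrap[key] : fallback,
  }),
};

initToggleClient(
  fakeSdk,
  'example-environment-key',
  { key: 'user-123' },
  { 'new-registration': true },
  (client) => {
    console.log(client.variation('new-registration', false)); // true
  }
);
```

The open question of where the bootstrap values come from is exactly what the rest of this section addresses.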

We solved this problem using GraphQL/Relay and another microservice named Flipper. The diagram below shows the solution:

Bootstrapping LaunchDarkly JavaScript client

Flipper is a RESTful service that, among other things, serves feature flags gleaned from LaunchDarkly. (It also allows us to create feature flags, set state, etc.) Flipper exposes an endpoint that returns all the feature toggle states for a user. Because it is a microservice, the Node.js client within Flipper is already initialised with LaunchDarkly and so any requests for feature toggle states are super quick. (Response times from Flipper are < 5ms and we are working to make that faster.) The React app composes a GraphQL query that defines the list of feature toggle states for a user. GraphQL calls down to our API layer to return the data. The API layer, in turn, calls down to Flipper. The feature toggle states returned from the GraphQL query are then used to bootstrap the LaunchDarkly JavaScript client with the correct set of feature toggle states. Average response times from the GraphQL call are typically < 30ms.
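The last step, turning the GraphQL result into bootstrap values, might look something like the sketch below. The response shape (a `featureToggles` list of `name`/`state` pairs under `viewer`) is invented for the example, since the real query and schema are not shown here, and `toBootstrapStates` is a hypothetical helper.

```javascript
// Hypothetical helper: convert a list of { name, state } pairs into the
// { toggleName: state } map used to bootstrap the client.
const toBootstrapStates = (featureToggles = []) =>
  featureToggles.reduce((states, { name, state }) => {
    states[name] = state;
    return states;
  }, {});

// Example data in an assumed response shape, not our real schema.
const graphqlResponse = {
  viewer: {
    featureToggles: [
      { name: 'new-registration', state: true },
      { name: 'register-button-text', state: false },
    ],
  },
};

const bootstrapStates = toBootstrapStates(graphqlResponse.viewer.featureToggles);
console.log(bootstrapStates);
// { 'new-registration': true, 'register-button-text': false }
```

The resulting map is what would be handed to the client as its bootstrap values during initialisation.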

So, that’s how feature toggling is performed at FMP. We would love to hear from you on how you have solved the problem of continuous delivery and trunk-based development.