Introduction
One of the best things to come out of the modern DevOps movement is the aggressive push for “feature switches” (also known as “feature flags” or “feature toggles”). At All Around the World , we strongly recommend them for our clients and sometimes we implement them, but I’m still surprised that many companies don’t use them. They’re dead-simple to build and they’re a powerful tool for site reliability. Unfortunately, many articles discuss how to build them or use them, but best practices are neglected. So we’ll skip the implementation and instead focus on using feature switches effectively.
Note: the discussion below refers to “sites” because this is in the context of websites. Obviously, it doesn’t have to be.
A couple of decades ago when I was working for a small web-development firm in Portland, Oregon, we released a new version of our software that had a critical bug in a new feature. The bug was serious enough that it took down the entire web site. We had to revert to an earlier version (from CVS, the good ‘ol days!) and redeploy the web site. Our largest client’s site was down for an hour and that was not good. Had we used feature switches, we could simply have turned the feature off and the client might not have even noticed.
Note that you don’t need to implement everything I suggest in your first pass; you just need to get up and running and comfortable with feature switches. Once you do, you can better understand the additional tooling support needed for your use case.
Programming A New Feature
For the developer, using feature switches is straight-forward. Imagine that you always show a list of categories for products you sell, but you’re fetching it from the database every time. Developers argue for caching the list, but sometimes it changes and arguments over cache invalidation stall development. Finally, an agreement is reached, but you don’t want to roll the new functionality out blindly, so you use a feature switch. Here’s some pseudo-code in Perl.
my $categories;
if ( feature_on('cached_categories') ) {
$categories = $cache->category_names;
}
else {
$categories = $app->category_names;
}
Side note: the above is shown for simplicity. In reality,
$app
variables often imply some kind of global state and if it’s mutable, that can be dangerous, but we’re keeping it simple.
In the above, if the cached_categories
feature is on, we fetch features from
the cache. If it’s not, we fetch it from the database. There are numerous ways
this can be done on the front-end, but we’ll skip those for the purposes of this
article.
Why would you bother to use a feature switch for the above? You’ve tested it. The customer won’t see a difference. You know it works. Right?
No, you don’t. What if your cache invalidation doesn’t work quite as expected? What if your cache isn’t loaded? What if your brand-new caching system isn’t as reliable as you thought and you actually need the following?
my $categories;
if ( feature_on('cached_categories') ) {
$categories = $cache->category_names || $app->category_names;
}
else {
$categories = $app->category_names;
}
Sound paranoid? One definition of a good programmer is someone who looks both ways before crossing a one-way street. So you roll out your new code with your feature turned off and after you verify everything is good, you turn on your feature. How?
Use Control Panels
Typically, a good system for feature switches should have a web-based control panel that lists each feature and shows, amongst other things:
- The name of the feature
- A description of what the feature is for
- The current on-off state
- Whether the feature is permanent (more on that later)
- An “on-off” history of the feature
The “on-off” history is simply an audit log. At minimum, you want:
- Who toggled the feature
- The date it was turned on/off
- Why it was turned on/off
- Possibly the release-tag of the software
An audit log of the history can be immensely useful if you want to turn on a feature that was turned off for several months and you don’t remember why. For companies that make heavy use of feature switches, this can be invaluable. If your company is structured well enough, you might even want to include what team the feature belongs to, or what departments are responsible for it. Because so many companies are structured differently, it’s hard to give clear guidance without knowing the specifics.
Don’t Use Control Panels
Yes, feature switches are usually used via control panels and they make it easy to turn features on and off, but there are caveats. It sometimes happens that enabling a new feature causes every page on your site to crash, taking down the control panel! Oops. Having a control panel running on a dedicated server, separate from the servers using the features can help to mitigate this. Having command-line tools to turn features on or off can also be useful. Do not rely on control panels alone. Nothing’s worse than having everything web-based and losing web access.
Permanent Features
One thing often overlooked for features switches is that some of them should be permanent. Imagine that your site goes viral and you’re hitting unprecedented amounts of load and you’re struggling. Some sites handle this by having optional features on the site wrapped in permanent feature switches. If your site is struggling under a heavy load, being able to toggle off optional features such as autocomplete or “related searches” that aren’t critical can help you manage that load. Your site experience degrades gracefully instead of simply crashing.
“Sampled” Switches
Before turning on a switch, it’s often useful to turn it on only for a subset of requests. For example, you might only want 10% of users to experience it. If a feature turns out to be disastrous, this can limit the impact. However, this can be odd for someone who sees a new feature, refreshes a web page, and that feature disappears because they are no longer in the randomly selected subset. Making such features persistent per user can be tricky, though. Do you rely on cookies? Your mobile app may not use cookies. Or someone might switch browsers, such as switch from their work to their personal computer.
Don’t stress about this (unless you’re doing A/B testing). “Sampled” features are short-lived. As feature switches are primarily a site reliability tool, the limited sample run is just there to avoid disaster and you should go 100% as soon as is feasible.
Naming Conventions
For one of our clients, they have no naming convention for their features, but
they use features heavily. We wrote code to scan our CODEOWNERS
file to dump a
list of all features in our code in order to regularly monitor them. This
allowed us to do heavy cleanup when we found many features that had been running
successfully for years, but were never cleaned. Even this was limited because
many of “our” features were scattered in other parts of the codebase.
To deal with this, establishing a naming convention early on can make it easier
to search for features. Are you on the advertising team and you have a new
feature offering a discount for first-time buyers? You use ADV-
as a prefix:
ADV-FIRST_TIME_BUYER
. Searching for all features prefixed with ADV-
gives
you an overview of the features your team has running.
When you identify a naming convention, do not put ticket numbers in the feature name. Many companies switch ticketing systems and old ticket numbers become useless. Instead, having a “Ticket URL” field on your form when you create a feature allows others to click through and understand the intent of that feature.
Operations
Errors
If you dump out rich error data to error logs, include a list of running
features, sorted by last start time. Frequently, new errors show up due to
recently introduced features and this makes it much easier to debug, especially
if first time buyers sometimes can’t place orders and you see the
ADV-FIRST_TIME_BUYER
feature was just switched on. This makes life much
easier.
Communications
Let other teams know when you’re about to start a new features, especially your site reliability engineers (or equivalent). Give them a heads-up about the intent of the feature and what to look for if their are issues. Such communication can allow tracking down issues faster.
Simplicity
Feature switches are a site reliability tool. Keep them simple. You don’t want your site to break because someone made the feature switch tool too complicated. This is one of the reasons I prefer features to be “booleans” (in other words, “on or off”). Some recommend feature tools that allow you to do this:
my $categories;
if ( 'enhanced' eq feature_on('SOME_FEATURE') ) {
$categories = $app->categories_enhanced;
}
elsif ( 'expanded' eq feature_on('SOME_FEATURE') ) {
$categories = $app->categories_expanded;
}
elsif ( feature_on('SOME_FEATURE') ) {
warn "Unknown feature category for SOME_FEATURE";
}
$categories //= $app->category_names;
Does the above start to look complicated? Well, kinda. If you allow more than
just “on” or “off”, you need to check for the case where the feature value
doesn’t match expectations. Maybe someone added a new value to the feature?
Maybe expanded
was actually expandable
and you misspelled it in your code.
By allowing these multiple values, you’re complicating code that’s a site
reliability tool. KISS: Keep it simple, stupid. Feature switches are there to
make life easier. Don’t overthink them.
Clean Features Regularly
Aside from “permanent” features that you want to always be able to switch off, it’s important to regularly clean features. Otherwise, you get into messes like this:
if (feature_on('FEATURE_1')) {
# do something
if (feature_on('FEATURE_2')) {
# do something else
}
elsif (feature_on('FEATURE_3') {
# do yet another thing
if (feature_on('FEATURE_4')) {
...
}
}
}
Not only does that make the code harder to read, you can wind up with cases where features work well on their own, but there could be some unknown interaction between conflicting features that won’t be noticed unless you have the right combination of features toggled on or off. This is miserable to debug.
Don’t Check Features at Compile=Time
This one is a subtle trap. Feature switches are designed to allow you to quickly turn a feature on or off while the system is still running. However, if your code that checks the feature does so too early and not at runtime, you may find that you can’t turn your feature on or off! Here’s an example in Perl:
package Some::Package;
sub import ($class) {
if ( feature_on('MY_COOL_FEATURE') ) {
...
}
}
# elsewhere
package Another::Package;
use Some::Package;
...
In the above, the use Some::Package
statement will cause Some::Package
to
call import
and that will check the feature switch, but that will happen only
once. Toggling the feature switch later will likely not affect the code while
Another::Package
is resident in memory.
Worse, if some other code is loaded later, but the feature switch has been toggled, you may find that different parts of your codebase do not agree on whether or not a feature is running. This is not a good place to be.
Conclusion
Feature switches are an incredibly powerful site reliability tool which can often eliminate the need to redeploy your software if something goes wrong. This can improve your customer experience and dramatically cut costs. However, like any new system you implement, you want to take some time up front deciding how it will be used and what your best practices are.
You will probably find that the best practices you develop for your company differ from what I describe above. That’s OK. I’m not dogmatic. But planning for this in advance can give you greater confidence in pushing new code that might otherwise be cause for concern.
If you’d like top-notch experienced developers with decades of experience to come in and make your life easier, let me know. Our work at All Around the World , is focused on pragmatic development custom-tailored to client needs.