Availability is a fundamental design concept

Earlier today a Twitter conversation with Christopher Hoff (@Beaker), James Watters (@wattersjames), George Reese (@georgereese), Benjamin Black (@benjaminblack), and Shlomo Swidler (@ShlomoSwidler) turned to how many people assume that because clouds can scale and rapidly provision servers, they're always available, and that availability therefore no longer has to be a fundamental design concept. It kicked off with @Beaker's tweet about BitBucket: "Cloudifornication: 20+ hour outage due to EC2/EBS on BitBucket http://bit.ly/A8vCy". BitBucket ran into a problem with EC2/EBS that made their site unavailable for more than 20 hours (I'm linking to the discussion on Hacker News since the main BitBucket page is back to normal and no longer shows the explanation). [UPDATE: Adding BitBucket blog post on the outage.]

The purpose of this post isn't to analyze the BitBucket situation; it is to help people understand how to design an available architecture while keeping it cost-efficient. Given an unlimited (or nearly unlimited) budget, most IT architects can build a "bulletproof" configuration. Most of us don't work in that world, though, so compromises get made. Here I hope to outline how you can compromise effectively by thinking about availability early and often in the design process. The recommendations I'm going to outline are general in nature and, depending on your specific business and operational model, may not fit. I enjoy discussing specific use cases and designs, so if you'd like analysis directly related to your situation, comment on the post and let's discuss it.

With that disclaimer out of the way, here goes: a step-by-step guide to building a web application that will be available "almost all the time"… [Second disclaimer: I work for Rackspace Hosting, and we have a cloud (The Rackspace Cloud); the recommendations here are my opinions, not those of my employer.]

1. Start with DNS. This is overlooked quite a bit and is the easiest thing you can do to ensure availability. Get a reliable DNS provider that hosts its DNS servers in multiple data centers, each with multiple peering arrangements and documented BGP convergence times. The provider should let you set the TTL (time to live) on your A records to 5 minutes or less (some will let you go as low as 1 minute). Now you have the ability to redirect www.yoursite.com to a new IP address in 1-5 minutes. While this may not let you recover your site completely, worst case you can have a simplified version of your site up and running "somewhere" within 5 minutes. Being able to give your customers a "We're experiencing issues" message with a phone number or other information is invaluable. When customers believe you are working on recovering your site and/or have things under control, they're willing to trust you much more than if they get a 404 or 503 error page from their browser; if they are a new visitor and not yet a customer, a 404 most likely means they never come back. A minimal failover-watchdog sketch follows below.
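
To make the DNS point concrete, here's a minimal sketch (Python, standard library only) of the kind of watchdog that decides when to flip a low-TTL A record. The health-check URL, failover IP, and the update_a_record call are all illustrative assumptions; the DNS update itself would be whatever API your DNS provider offers.

```python
import time
import urllib.request
from urllib.error import URLError

PRIMARY_URL = "https://www.yoursite.com/health"  # hypothetical health endpoint
FAILOVER_IP = "203.0.113.10"                     # IP of the "we're experiencing issues" site
CHECK_EVERY = 60                                 # seconds between checks
FAILURES_BEFORE_FLIP = 3                         # avoid flapping on a single blip

def primary_is_healthy() -> bool:
    """Return True if the primary site answers with HTTP 200."""
    try:
        with urllib.request.urlopen(PRIMARY_URL, timeout=10) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def update_a_record(name: str, ip: str) -> None:
    """Hypothetical: repoint the A record via your DNS provider's API.
    With a 5-minute TTL, resolvers pick up the change within ~5 minutes."""
    raise NotImplementedError("replace with your DNS provider's API call")

failures = 0
while True:
    if primary_is_healthy():
        failures = 0
    else:
        failures += 1
        if failures == FAILURES_BEFORE_FLIP:
            update_a_record("www.yoursite.com", FAILOVER_IP)
    time.sleep(CHECK_EVERY)
```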

2. Design your application with portability in mind. Using a technology only available from a single provider may sound like a good idea, but it locks you into that provider. We all believe our hosting provider will be in business forever, yet five years ago we all thought we'd never see GM go bankrupt or Lehman Brothers cease to exist. Cloud computing makes portability much easier to test and implement than it used to be. Part of going from idea to launch should include deploying your application to a minimum of two providers, so that if something does happen to your provider you'll be able to continue to run your business. I don't recommend actually running your application on multiple providers at once, as that generally adds expense you shouldn't need; I do recommend having your code and data with multiple providers. This requirement means you should avoid customizing at the OS/kernel/filesystem level; those are the main items I see causing difficulty in portability. Next, if you want a hosting provider to support your application infrastructure stack (i.e. the HTTP server [Apache, IIS, etc.] and database server [Oracle, MySQL, MS SQL, Postgres, etc.]), pick standard versions or plan on hiring staff to support your customizations. While a single provider may agree to support your (or their) modifications, others probably won't. If your provider has their own special versions of the application platform, they may be trying to lock you in; beware! One simple portability habit is shown in the sketch below: keep provider-specific details in configuration, not code.
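
A minimal sketch of that habit, assuming an application configured entirely through environment variables (all variable names and hostnames here are illustrative, not a standard):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    """Everything that differs between providers lives here, not in application code."""
    db_host: str
    db_port: int
    static_url: str

def load_settings() -> Settings:
    # The same build runs at either provider; only the environment differs.
    return Settings(
        db_host=os.environ["DB_HOST"],
        db_port=int(os.environ.get("DB_PORT", "5432")),
        static_url=os.environ["STATIC_URL"],
    )

# Provider A:  DB_HOST=db1.provider-a.example  STATIC_URL=https://cdn-a.example/site
# Provider B:  DB_HOST=db1.provider-b.example  STATIC_URL=https://cdn-b.example/site
settings = load_settings()
```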

3. Spend some time on BCP/DR (Business Continuity Planning/Disaster Recovery). You've spent months (or years) going from idea to application; if you spend a day or two you'll have a fair BCP/DR plan, and if you have somebody with a background in this, you can have a good plan in that same day or two. After putting the plan together, TEST IT! I've helped a number of businesses put together a plan, and after we're done they check the box, put it in a filing cabinet, and pray they never have to get it out. That mindset is like a football team having a "2 minute drill" playbook but never practicing the plays, hoping they'll never need them. When it comes down to actually having to do it, if you haven't practiced, how well do you expect it to go with the added stress of an outage? "But Bret, I can't test it, we can't take our site offline for a test!" You don't have to take your main infrastructure offline (see #1, DNS). You can bring up the replacement site without ever impacting your real site by modifying the DNS on your test machines: either point them at a DNS server set up for the BCP test or modify their local hosts files. The sketch below shows one way to smoke-test the replacement site without touching DNS at all.
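
A small sketch of that smoke test, assuming a replacement site at a known IP (the IP and hostname are illustrative): connect to the BCP servers directly while presenting the production hostname, so no DNS change and no impact on real users.

```python
import http.client

BCP_IP = "203.0.113.20"        # hypothetical IP of the replacement (BCP) site
PROD_HOST = "www.yoursite.com" # your production hostname

def smoke_test(path: str = "/") -> bool:
    """Request a page from the BCP servers while sending the production
    Host header, exactly as a browser would after a DNS flip."""
    conn = http.client.HTTPConnection(BCP_IP, 80, timeout=10)
    try:
        conn.request("GET", path, headers={"Host": PROD_HOST})
        return conn.getresponse().status == 200
    finally:
        conn.close()

if __name__ == "__main__":
    print("BCP site OK" if smoke_test() else "BCP site FAILED")
```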

4. Backup your data, backup your data, backup your data. Customers will deal with service outages; they won't put up with you losing their data. You use Time Capsule, Jungle Disk, Mozy, Dropbox, or any number of other personal backup programs for your personal files, so if your house burned down you'd still have all of your own stuff. What would happen to your web site if the data center your servers are in burned to the ground? Is the data gone? If it isn't gone, how long will it take you to restore? Is that timeframe acceptable to you and your users? Two concepts to familiarize yourself with are RPO (recovery point objective) and RTO (recovery time objective). RPO is how much data you can lose: if you do a daily backup you have a 24-hour RPO; if you run a transaction-replicated database (such as Oracle with Data Guard) with the databases in separate geographic locations, your RPO may be under a second. On RTO: if you're restoring from a backup medium like tape, you'll be able to recover roughly 10-40 GB/hr (depending on the tape technology and the compression ratio of the backup), so if you have a 400 GB database you have an RTO of 10+ hours, even if cloud computing lets you instantly provision a new database server to put the data on. With a live database in a second geographic location, your RTO for the data is also potentially under a second, since there is no restore (this doesn't mean your whole site is automatically online in that same time). I won't go into detail here since we're talking availability and not integrity, but having a database replicated across geographic locations doesn't ensure integrity; you still need snapshots, transaction logs, or another way to go back to various points in time if you end up with bad or erased data (see my favorite XKCD, "Exploits of a Mom"). The small calculation below makes the tape math concrete.
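
Here's that back-of-the-envelope RTO arithmetic, using the 10-40 GB/hr tape range and the 400 GB database from above:

```python
# Hours to restore from a sequential backup medium, at a sustained rate.
def restore_hours(data_gb: float, gb_per_hour: float) -> float:
    return data_gb / gb_per_hour

db_size_gb = 400
for rate in (10, 20, 40):  # GB/hr: slow tape, mid-range, fast tape w/ compression
    print(f"{db_size_gb} GB at {rate} GB/hr -> {restore_hours(db_size_gb, rate):.0f} hours")

# Output:
# 400 GB at 10 GB/hr -> 40 hours
# 400 GB at 20 GB/hr -> 20 hours
# 400 GB at 40 GB/hr -> 10 hours
```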

So now that we’ve taken all of this into account — what do we do?  My recommendations…

1. Make a "gold build" of each of the server types in your application, and know how long it takes to get your necessary quantity of each server type online at various providers. Cloud makes this much easier; in the dedicated world you're typically looking at days to provision a new environment. A timing sketch follows below.
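
A sketch of measuring that provisioning time, assuming a hypothetical provision_server call standing in for your cloud provider's API (image names are illustrative):

```python
import time

def provision_server(image_id: str) -> str:
    """Hypothetical: launch a server from a gold image via your provider's
    API and block until it is reachable. Returns the new server's IP."""
    raise NotImplementedError("replace with your cloud provider's API")

# Time how long a full environment takes to come up from gold builds.
GOLD_IMAGES = {"web": "gold-web-v7", "app": "gold-app-v7", "db": "gold-db-v7"}

for role, image in GOLD_IMAGES.items():
    start = time.monotonic()
    ip = provision_server(image)
    print(f"{role}: {image} up at {ip} in {time.monotonic() - start:.0f}s")
```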

2. If your business relies on a fully functional web site as a primary revenue stream, have a live database at a secondary location with the ability to launch web and app servers to bring your environment online quickly in the event of a primary-provider failure. If you can continue to service your customers via phone and/or e-mail, have a static version of your web site running that you can switch to using DNS in the event of a primary-provider issue. Even a one-file static fallback works; see the sketch below.
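
A minimal sketch of that static fallback using Python's built-in http.server (the phone number and address are placeholders; in practice any static host works):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<html><body>
<h1>We're experiencing issues</h1>
<p>Our team is working on it. Call 555-0100 or email
support@yoursite.com while we restore service.</p>
</body></html>"""

class StatusPage(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells crawlers and monitors this outage is temporary.
        self.send_response(503)
        self.send_header("Content-Type", "text/html")
        self.send_header("Retry-After", "600")
        self.end_headers()
        self.wfile.write(PAGE)

# Port 8080 here; put it behind port-80 forwarding or run privileged on 80.
HTTPServer(("", 8080), StatusPage).serve_forever()
```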

3. Keep your source code in multiple locations, with multiple employees able to deploy the site in the event of an issue. I'm a huge fan of collaborative code repositories like GitHub and Beanstalk, but if your code lives on only one of them and it's down (or in a maintenance window) when you need that code to bring up a backup environment, you're stuck. It costs next to nothing to keep that code in multiple places; a small mirroring sketch follows.
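
One way to keep code in multiple places, sketched as a wrapper that pushes every branch update to more than one host (remote names are illustrative; the underlying git push commands are plain git):

```python
import subprocess
import sys

REMOTES = ["origin", "backup-mirror"]  # e.g. GitHub plus a second host

def push_all(branch: str = "main") -> None:
    """Push the branch to every configured remote; report any that fail."""
    failed = []
    for remote in REMOTES:
        result = subprocess.run(["git", "push", remote, branch])
        if result.returncode != 0:
            failed.append(remote)
    if failed:
        sys.exit(f"push failed for: {', '.join(failed)}")

if __name__ == "__main__":
    push_all()
```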

I understand that nowhere in this post do I mention HA (high availability), nor the things people generally think of when they hear HA. Having redundant switches, firewalls, routers, and servers all in a single location (what people generally picture as HA) will help ensure that location stays online, and you should certainly be doing that, but it puts all of your eggs in that one basket if you aren't looking at HA beyond the single infrastructure. Now that I've mentioned it, if you want to learn more about HA design in a single location, the Internet is full of good information on the topic.

I've also focused the discussion on architectures relevant to "most folks". If you're Facebook, eBay, or Google (the search engine), you don't want to rely on DNS alone to deal with outages at a specific location; you'll want to pair DNS with GLB (global load balancing) and BGP so you can have near real-time re-routing of users and potentially even sessions. My availability recommendations certainly aren't free to implement, but they also don't double your expenses. It is very possible to add just 5-25% to your hosting expense and significantly increase your availability (and decrease your RPO/RTO).

I'll also note that I didn't really cover systems management or monitoring here. Both are key to keeping an environment available, but they aren't directly tied to designing an available architecture. You'll need proper systems management tools and policies (or you'll cause outages yourself), and you'll need monitoring so you know when to put your BCP/DR plan into action.

