It would be nice, but cloud doesn’t have to be “interoperable”
Disclosure: If you’re coming straight here you may not know that I work for Rackspace Hosting and that I’ve been involved with OpenStack since the inception of the project. The opinions on this blog are my personal ones, not those of my employer.
This post is an assessment, a thought. I don’t really explore the meaning or outcomes completely. I may do that in a future rambling… on to the thought…
For the first decade or more of networking we had many competing technologies that didn’t interoperate: SNA, IPX/SPX, TCP/IP, AppleTalk, DECnet, NetBEUI, and more. The lack of a consistent and unified standard didn’t stop networking from succeeding any more than it will stop cloud computing. Cloud is a fundamental shift that dramatically increases productivity, just like networking did. Businesses love increases in productivity and will adopt anything that yields one, often the first one presented to them, and they’ll run it for a refresh cycle before switching to the “interoperable” platform. OK, so I’m a networking geek, but this isn’t the only analogy that holds true…
We have many programming languages.. compiled, interpreted, functional, object oriented.. all with major differences.
We have many types of processors from low energy mobile chips to super fast server chips all with different instruction sets.
We have a variety of operating systems all with a loyal following and a vastly different set of capabilities.
I believe cloud could see wider and more rapid adoption if interoperability is figured out but looking back at history, and even history specifically in the technology world, we have many successful markets without true interoperability as a fundamental capability.
Over time most of these markets have achieved the guise of interoperability through consolidation, and it looks like cloud computing is headed the same way. Networking is predominantly IP; programming is C/C++/C# for OS/infrastructure, Java for enterprise applications, and PHP for web applications; processors are x86 in desktops and servers, and ARM in mobile devices; operating systems are generally Windows for consumer and SMB/departmental large business IT, and Linux for web and larger business core IT.
With the pace of innovation and the foundation laid down by previous generational shifts, the cloud market will grow and reach a critical mass market share much more rapidly, as the technology companies involved know the path to follow. Microprocessors, operating systems, and networking took many decades. Java swept through the enterprise software development market in a decade, as did PHP across the web. The cloud market really started to emerge around the start of the decade, and by the current look of things we’ll have a clear picture of interoperability for clouds by the middle of it.
Three Predictions about Cloud Computing for 2011
With all the talk in 2010 about cloud computing you’d think the entire Internet was running on it. We’re at the point now with cloud computing as we were in the late ’80s through the mid-late ’90s with networking. Everyone can clearly see the benefits in cloud but the market is hyper fragmented as different pockets of users form a community around one of the available solutions. To ensure new readers are aware, while I am employed by Rackspace Hosting working on the OpenStack project, the opinions expressed on the blog are mine and I try to present a non-biased view of the market.
With that opening I’ll dive into the three items of importance to cloud computing for the coming year..
..with additional more minor predictions in italics.
1. Cloud computing needs will change as we move from early adopters to mainstream users
Thus far the primary users of cloud infrastructure as a service (IaaS) offerings have been early adopter technology savvy users. Those users may be founders of a Web 2.0 startup, consultants working for the R&D department of a system integrator, or forward thinking IT professionals on enterprise IT strategy teams. Based on the stats in the chart, we still have less than 2% of the top 500k websites hosted on IaaS. In 2011 this number will grow to 5-10% of the top 500k sites, more than doubling again like it did in 2010.

Source: http://www.jackofallclouds.com/ - Guy Rosen
Inside startup communities everyone will be using the cloud to start their business and in enterprises departmental innovation success stories will start to bubble up to corporate leadership. This doesn’t mean existing applications will be migrated — people will experiment with migration as disaster recovery innovation but it won’t be a major driver of cloud growth in 2011.
The major driver of growth will be new applications and much of this growth won’t be consumer Internet sites that are easy to track making the “who’s winning in cloud” leader board more difficult to display. This next wave of adopters will also require additional levels of support as they won’t have the same “DIY” mentality as early adopters. Cloud providers will need to raise their own service levels or spend significant effort building a system integrator and consulting ecosystem that can provide it for them.
The ecosystem of tools built on the IaaS cloud APIs will be a foundation to enable the higher service levels. They will be utilized by the practices of the SIs and consultants as well as the software development teams of many ISVs. For cloud providers that do not yet have an ecosystem built around their API this will be the year they move on to adopting one of the open APIs with market traction. Enough providers will use a group of 3-5 APIs that ISVs and startups will refuse to develop for others. API abstractions like jclouds, Libcloud, and Deltacloud will start to add depth rather than additional breadth.
2. Technology heavyweights with large developer communities will escalate their efforts to define and control PaaS
2006 brought us Rackspace Cloud Sites (originally branded as Mosso)…
2007 gave us Force.com…
2008 launched Microsoft Windows Azure and Google App Engine…
2009 delivered Heroku’s commercial release and moved VMware into the platform space with the acquisition of SpringSource…
2010 veterans like Red Hat and Oracle [PDF] start announcing platform strategies and making acquisitions such as the recent purchase of Makara (by Red Hat)…
Over the past few years many platforms and application frameworks simplified development by providing a foundation and abstracting away lower level details. This came with some drawbacks as most frameworks were not aware of their resource utilization nor did they have the ability to utilize the programmatic capabilities of IaaS to change their resource allocation based on load. In 2011 many platform solutions will become tightly integrated with IaaS APIs providing dynamic resource management — the auto-scaling cloud workload across public, community, and private cloud installations will become an early adopter reality.
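To make the idea of a platform adjusting its own resource allocation concrete, here is a minimal sketch of the decision such a platform would have to make. Everything in it is an assumption for illustration: the function name, the parameters, and the 25% headroom are hypothetical, and a real platform would still need to call its provider's IaaS API to act on the answer.

```python
import math


def desired_instances(requests_per_sec: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20,
                      headroom: float = 0.25) -> int:
    """Return how many instances to run for the observed load.

    All names and defaults are hypothetical. `headroom` is the spare
    capacity kept above the measured load (0.25 means 25% extra).
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Instances needed to serve the load plus the safety margin.
    needed = math.ceil(requests_per_sec * (1 + headroom) / capacity_per_instance)
    # Never scale below the floor or above the budgeted ceiling.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    for load in (0, 900, 10_000):
        print(load, "req/s ->", desired_instances(load, 100), "instances")
```

The floor and ceiling matter as much as the calculation itself: the floor keeps the application reachable when load drops to zero, and the ceiling is where the budget conversation with the business shows up in code.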
Building a PaaS solution is possible for a startup if they make it compatible with widely accepted development languages and frameworks as Mosso did by selecting PHP/Python/.NET with support for common applications like WordPress, Drupal, Django, and more. Another example is what Heroku did with Ruby and PostgreSQL. The heavyweights are the only players with deep enough pockets and the required patience to push new programming dynamics. Microsoft will look at how things are evolving and in order to defend .NET/C# will embrace the Mono Project creating a real threat to Java in the enterprise.
Java won’t take things lying down. Despite the fallout in 2010 over Oracle vs. Google, over the Apache Foundation first joking with Oracle and then stepping down from the JCP Executive Committee, and the many other examples such as James Gosling, the creator of Java, coming out against its new steward, Java still maintains the #1 position on the TIOBE index. Oracle, IBM, and VMware all have deep pockets and big revenue streams tied to the continued success of the language. IBM, while late to the PaaS party, will tie together Tivoli, WAS, and other components to build a robust platform for their customer base.
The blog-o-sphere will erupt in debate about “what is a true platform as a service” as much as they went on and on in 2010 about infrastructure as a service APIs. Despite what the pundits believe the majority of the enterprise IT dollars will go towards “false platform” private cloud solutions. Making the leap for major projects from current development methodologies and procedures to one of the new platforms will be too much — IT organizations need evolution, not revolution.
3. Enterprises will begin to evolve their virtualization deployments into private clouds and they’ll expect networking and audit controls beyond the capabilities of many current systems
Cloud deployments in enterprises will go in two different directions depending on the expectations of the project sponsoring executive team. Departmental usage of cloud that flies under the official process RADAR is not what I’m talking about. Over the past year I’ve had conversations with numerous people on Fortune 500 IT strategy teams and cloud is being looked at in a couple of different ways. One group looks at it only as a technology solution that will magically make their operations more efficient. Another group realizes that the automation cloud brings only benefits them if process changes happen in parallel with the systems improvements. Enterprise cloud projects that are only technology focused will not provide any meaningful savings, and enterprises going this route will become disillusioned. Virtualization was about consolidation, not automation, and because of that it didn’t include the business process changes that cloud requires. You can’t simply install a “cloud upgrade” to your virtualization system and instantly have a cloud.
The cloud projects are going to run into a second set of hurdles. Up to this point typically departmental non-audited, non-regulated applications have been deployed by enterprises on clouds. In 2011 projects will need to address the corporate risk management and IT audit requirements. Major public clouds such as Amazon Web Services and Rackspace have addressed and received various attestations such as SAS70 Type II (Rackspace example), ISO27001 (AWS example), and PCI DSS [PDF] (Visa Global List of Validated Service Providers) showing that it is possible to build cloud services that meet compliance requirements. For enterprise projects to be successful they need to involve risk and audit up front so the proper control mechanisms are in the deployment. Because cloud is about workflow automation having to insert a manual audit control late in the project in order to meet the launch plans will eliminate many, if not all, of the projected benefits.
Corporate risk and IT audit teams will invalidate a number of cloud software fabrics and those platforms will quickly try to re-engage on projects by announcing partnerships with security companies. Platforms that come from service provider and government backgrounds such as OpenStack have a head start along with platforms that have evolved from enterprise DNA such as VMware. As enterprises start to spend significant dollars on cloud, the R&D investment in cloud platforms will dwarf what has been spent to date, and any head start that exists today could easily vanish in 2011.
Audit controls for cloud platforms need to include both host and network services. It is possible to architect a cloud and map the controls into existing systems though this won’t be instantly turn-key or easy in 2011. Network virtualization will make cloud systems more flexible at a cost of making compliance controls more complicated. Process wise this also introduces another department into the cloud deployment discussions. Most clouds deployed in 2011 will focus on server automation and networking will be addressed in subsequent phases of the technology transition in 2012+.
Conclusion
2011 will be another big year for the adoption of cloud technology, but these fundamental shifts happen slowly, especially when they involve people learning new processes and not just transparent technology replacement. When done right, cloud will make IT vastly more efficient, and when the cost of services declines the demand for those services often skyrockets.
This post focused on cloud computing. I’ll be making other posts in January about distributed storage platforms (aka “cloud storage”) and why they’ll be important for enterprises to understand and have readily available to their users before the middle of the decade. It is a fundamentally different problem as many cloud storage systems can be installed transparently without end user process changes.
Why OpenStack matters to me
I’d like to start off with an apology to everyone whose email I didn’t reply to or whose phone call I didn’t answer over the past 9 months, and to anyone whose life I made less interesting by disappearing from Twitter and from sharing my thoughts on this blog. I’ll be out, alive, and available again now that OpenStack is a reality.
Life is about priorities and hopefully at some point in your life you have already had or will have in the future an opportunity to work on something that has the ability to really make an impact. At Rackspace we are a Strengths based organization. My top 5 are Learner, Achiever, Competition, Analytical, and Focus. I’ll use my strengths as a way to explain the past ~9 months.
When we started exploring the strategy around this all of us had lots to learn. We’d all used open source software. Some of us on the team had contributed to projects, but we all knew we had a lot to learn if we were going to get this right. The great thing about open source, the full history of all of it is on the Internet. You can go back and read mailing list archives, you can find out who contributed to a project, who led them, who had influence and you can reach out to those people and they’re often happy to talk about it. This is very different from trying to do research on businesses where information is hard to find — no corporation will share their full mailing list archive that covers the history of their decision making (heck most don’t even have one). The openness and ability to learn about things easily was a huge motivator for me.
So began the Learner->Analytical->Focus->Achiever “death spiral”, well, the “death” of my learning anything not involved in this project, that is. The good news is those 4 strengths together mean I really enjoy learning about new complex systems and figuring out the best way to navigate them; the bad news is the Focus->Achiever half may let me chase Alice all the way down the rabbit hole to Wonderland. Sometimes this is counterproductive, where a “good enough” decision could have been made with less analysis, but in this case I’m really happy about it. When forming an open source community you have a lot of choices to make, all of them have different benefits or drawbacks, and whether each is a benefit or a drawback depends on the perspective of the individual or group.
Forming this community is important enough to go all the way down the rabbit hole because thousands of people will become part of it and each potential member of the community is worth more than an hour of my time. This gives me a good segue to talk about scale. If you’re only going to use a piece of software once to solve a single need then you should make it just good enough to get the job done: you should optimize for min(time coding + time for code to run [where you have to pay attention to it]). The opposite end of the spectrum is a project like Linux (or like OpenStack will be; I dream big!) that runs on millions of machines 24/7 all around the globe. If you can make an operation one minute faster on something that runs on a million machines you save roughly 2 years’ worth of system time. With that same idea we spent all the time we could making sure we got the community started the right way because every hour we spent will be multiplied by each of you that join it.
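The one-minute-on-a-million-machines figure is easy to check. The short sketch below does the arithmetic; the function name is my own and the only assumption is a 365-day year.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes, ignoring leap years


def machine_years_saved(minutes_saved_per_machine: float, machines: int) -> float:
    """Total system time saved across a fleet, expressed in years."""
    return minutes_saved_per_machine * machines / MINUTES_PER_YEAR


if __name__ == "__main__":
    saved = machine_years_saved(1, 1_000_000)
    print(f"{saved:.2f} machine-years")  # prints "1.90 machine-years"
```

So the savings come to a little under two years of system time for every single run of that operation across the fleet, and the number scales linearly with both the minutes saved and the number of machines.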
So now here is where my Competition kicks in. I don’t want to make just an average community and then go watch reruns of “Everybody Loves Raymond” (Ray, hopefully you aren’t offended, you shouldn’t be, you were the first show that I know made it to rerun syndication that popped into my head!) on local TV — I want to make the best community ever. The problem is… the bar is really high.. it isn’t like I said, “I want to make the biggest ball of rainbow yarn a person with a 9 letter long name made on a Tuesday afternoon” — I want to make the best open source community around a distribution of projects out there — and a lot of people have done an excellent job at this. So to do this we’ve learned as much as we could from past projects to lay the proper foundation. With that let me lay out the “4 opens” (I’d like to credit Rick Clark on our team for summarizing these thoughts into a concise and clear manner we can all hopefully understand)…
Open Source: We are committed to creating truly open source software that is usable and scalable. Truly open source software is not feature or performance limited and is not crippled. We will utilize the Apache Software License 2.0 making the code freely available to all. [Personal commentary: What this means is "we accept patches", the project won't block a feature contribution because it competes with a commercial feature a community member has. This doesn't mean all of those commercial entities have to contribute all of their code -- it just means they aren't guaranteed exclusivity.]
Open Design: Every 6 months the development community will hold a design summit to gather requirements and write specifications for the upcoming release. [Personal commentary: The design summits have been great (so far we've had 2) to get people aligned and to really get the complicated items solved. An example of this is the large object support for Object Storage: members of the community had a number of different implementation ideas and through discussion we've come up with a great way to do it.]
Open Development: We will maintain a publicly available source code repository through the entire development process. This will be hosted on Launchpad, the same community used by 100s of projects including the Ubuntu Linux distribution. [Personal commentary: Getting code and designs out in the open as early as possible in the process allows everyone to benefit from the power of a community in the biggest way possible. This also makes finding and fixing big problems much easier as each patch can be tracked and its individual impact measured.]
Open Community: Our core goal is to produce a healthy, vibrant development and user community. Most decisions will be made using a lazy consensus model. All processes will be documented, open and transparent. [Personal commentary: Everyone should have a seat at the table at a level that corresponds to the effort and contributions they're putting into the project. With all of the decision making done in IRC meetings (with transcripts) and over mailing lists members of the community can see "how the sausage was made" rather than just the end result of the decision -- this is really important to build and maintain trust.]
We’re off to a fun and exciting start. Looking at the stats from this week I’m amazed at the amount of contribution we’re seeing from such a large group of developers (stats for the week of 12/3 to 12/9):
- OpenStack Compute (NOVA) Data
- 17 Active Reviews
- 97 Active Branches – owned by 34 people & 4 teams
- 472 commits by 26 people in last month
- OpenStack Object Storage (SWIFT) Data
- 5 Active Reviews
- 41 Active Branches – owned by 19 people & 2 teams
- 184 commits by 15 people in last month
This shows me what we’re doing is working and, given the time to continue to grow and bloom, OpenStack Compute can help IT make the move to automation the same way manufacturing has over the past 50 years. Yes, I’m saying IT isn’t automated right now. IT automates other tasks inside the Enterprise but they haven’t really automated many of their own tasks (this probably deserves a full post of its own).
Object Storage is potentially more important even than the automation. This is a topic I’ve been presenting on frequently because I’m very passionate about it (see the Strengths above) as it allows us to see an order of magnitude increase in efficiency over the TCO of “the average storage solution”. It doesn’t serve every storage use case but the use case it does serve is growing rapidly and over the next decade it’ll be clear to everyone that their largest storage platform (in terms of GB stored) will be object based.
I expect we’ll see additional projects as part of OpenStack over the next year but we should keep that bar high as a community on what is a major project. Both Compute and Object Storage are providing software for ubiquitous problems that are growing in importance to everyone. Some items that clear the bar for me (these are critical issues to all users and operators of clouds a decade from now):
“Networking as a Service”: This should be abstracted from the end-point computing service, as it can be utilized by all projects and can provide connection points to other inter-cloud and non-cloud services. Here we can define routing, switching, and filtering network devices and we can automate their integration with other cloud services.
“Inter-cloud Services”: As different clouds become available with varied services we need an automated way to discover and catalog them the same way routing protocols advertise network availability so we can have a loosely coupled global network (you may be familiar with it.. the Internet). OpenStack is a great place to define a reference implementation of the directory and advertising capabilities as all interested parties can have a seat at the table to contribute their needs.
Some items I’m on the fence about (the reason I’m on the fence isn’t that they aren’t extremely important to some implementations, it is that they aren’t important to all implementations):
“Host Provisioning Automation”: For service providers that are constantly growing and re-provisioning assets, automating these tasks is critical. For an SMB that is going to build a 2-6 cabinet cloud solution once, this isn’t nearly as important.
“Security & Compliance Services”: Everyone wants “some level” of security, but what that level is and how many resources get dedicated to providing it vary widely.
“Network Block Storage Services”: As the performance and size of local storage continue to increase, the need for network block storage decreases. I’m still a big believer in the benefits here for many use cases; it just doesn’t apply to every use case.
I really believe that in 2011 our community has a chance to deliver “the promise of cloud” to the masses through the efforts and commercial implementations created by the members of our community. As exciting as getting things off the ground in 2010 was, I’m even more excited about the future to come.
How to tell the difference between “cloud” and “virtualization”
Many people seem to think “cloud” is just off-premise “virtualization”. Cloud comes in a few flavors and I’ll argue that you can have “private cloud” either hosted off-premise in a provider’s facility or in your own. The fundamental difference between cloud and virtualization is the goal of cloud is to automate provisioning (this applies to IaaS, PaaS, and SaaS) and the goal of virtualization is resource utilization optimization. You can (and many providers do) use virtualization as the basis for building a cloud but it is not required.
If we take a look at the Reductive Labs presentation from OpsCamp, slide 3 illustrates the primary benefit of cloud. Cloud helps companies even if their minimum unit of work is larger than a single host machine, a case where virtualization just adds overhead. The difference between “cloud” and “grid computing” or HPC is that grid/HPC systems process jobs in a batch manner rather than serve interactive applications. You can build a compute grid on top of a cloud but not vice versa.
Other folks are saying “private clouds can’t exist because you can’t have rapid elasticity and pay for what you use”. For a small company you may not be able to have a private cloud but for a large enterprise with many business units you certainly can. An IT infrastructure BU can provide other organizations in the company all of the requirements of a cloud.
Depending on the current utilization across an enterprise’s infrastructure, they may be able to defer spending for a number of years by moving to a fully cloud enabled business. Right now many departments cling to servers they don’t need because they’re afraid that if they release them they’ll never get them back. With cloud removing that fear, resource hoarding ends and many enterprises will have a significant increase in available computing power.
Over the long term if the public computing clouds continue to grow, increase their transparency, and optimize their delivery models it will no longer make financial sense for enterprises to build their own infrastructure. Public cloud providers will need to prove over the next decade they can deliver on all three corners of the “impossible triangle”.
Public clouds and their features, followed by the future of cloud computing hardware
I’m going to break this post up into two sections, the first will discuss public clouds and their features focused on advanced networking as an example. The second portion will look at the future of cloud computing hardware — both networking and computing.
Public Clouds and Feature Selection
A discussion started on Twitter today after Werner Vogels (@Werner) tweeted about the future of networking through a blog post by James Hamilton entitled, “Networking: The Last Bastion of Mainframe Computing”. Christopher Hoff hasn’t been thrilled (understatement of 2009) with the networking features provided by cloud computing platforms both public and private. Unless I misunderstood his tweet he’d love to hear public cloud providers commit to a flexible API driven networking layer using technology such as OpenFlow.
I tossed back a question asking, “Are customers willing to pay for complex network customization in a cloud? If so, what percentage of them? Thoughts?” and he replied, “In terms of paying for parity in what I can do in even a basic enterprise today? No thanks. That’s on you as a provider in long term”. I threw this question out because herein lies the problem… Public clouds will only end up with the features that a broad market will pay for or a small market will pay a very significant premium for. The reason behind this is that when a cloud adds a core feature, it adds it everywhere. This leads providers to only invest in new features that enough of their customers are interested in to offset the cost of deployment and still yield a satisfactory return on capital.
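The investment test described above can be written down as a back-of-the-envelope check. This is only a sketch of the reasoning: the function, its parameters, and every number in the example are hypothetical and do not come from any provider's actual model.

```python
def feature_clears_bar(customers: int,
                       adoption_rate: float,
                       annual_price: float,
                       deployment_cost: float,
                       annual_operating_cost: float,
                       required_return: float) -> bool:
    """Decide whether a cloud-wide feature is worth building.

    Hypothetical model: the feature is deployed everywhere, so the
    deployment cost is paid in full, but only a fraction of customers
    (adoption_rate) pay annual_price for it. required_return is the
    minimum annual return on the capital spent deploying it.
    """
    annual_revenue = customers * adoption_rate * annual_price
    annual_profit = annual_revenue - annual_operating_cost
    return annual_profit >= required_return * deployment_cost


if __name__ == "__main__":
    common = dict(customers=100_000, deployment_cost=5_000_000,
                  annual_operating_cost=400_000, required_return=0.15)
    # Broad market at a modest price.
    print(feature_clears_bar(adoption_rate=0.20, annual_price=120, **common))
    # Small market at the same modest price.
    print(feature_clears_bar(adoption_rate=0.01, annual_price=120, **common))
    # Small market paying a very significant premium.
    print(feature_clears_bar(adoption_rate=0.01, annual_price=1_500, **common))
```

With these made-up numbers the first and third cases clear the bar and the second does not, which is the same shape as the argument: broad demand or a steep premium, but not a small market at a commodity price.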
Today at Rackspace customers that want advanced networking configurations are directed to our Private Cloud platform (I say our because I’m employed by Rackspace; the opinions expressed here however are mine alone). They can then create security zones, use IPS/IDS, and enable enhanced DDoS defense services all behind dedicated firewalls and load balancers. The private cloud environment can have bridged network segments that connect to a public Rackspace Cloud Servers(tm) configuration for workloads that do not require advanced networking. The current addressable market interested in both public cloud as a primary platform and advanced networking is small. The early adopter group of start-ups and SMBs doesn’t typically need or is not willing to pay for advanced networking and the enterprises that are willing generally aren’t first movers on new technology.
As the public cloud market matures the addressable market will grow and you’ll start to see public cloud providers adding advanced networking capabilities though the cloud definition of “advanced” won’t ever be truly “cutting edge” on a mass market cloud. I expect we’ll see niche clouds emerge that will cater to specific application use cases that will have advanced features for their target customer. Early examples of this are Force.com or the OpSource Cloud.
The Future of Cloud Computing Hardware
I’m now going to loop back to James’s post that kicked this whole thing off where he compared the current network device situation to mainframe and the vertical scale centralized systems. He asserted that we’ll see a commoditization of the networking layer similar to what we’ve seen in the storage layer through technologies like RAID and through servers with x86. The reason RAID and x86 have been successful is they are multi-purpose with the capabilities to serve a broad range of applications well with proper configuration.
Networking gear is very different because the workloads are all uniform, and when you have a uniform workload an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array) that is tailored to a specific type of workload will enable better performance per dollar. The second core difference between the server/storage markets and networking is that once you step into the “carrier/cloud class” networking equipment only a few hundred potential customers exist, and markets with fewer, stronger customers tend to be more consolidated. Networking gear has also been “cloud like” for over a decade now. Let’s look at the NIST requirements for a cloud:
On-demand self-service - This requirement is for a cloud to user relationship. I’ll translate this to a network cloud to network engineer relationship. For them, all carrier class networking gear supports SNMP along with other potential programmable configuration methods through management systems with APIs such as the Cisco Configuration Engine [PDF].
Rapid elasticity – This dates back to frame-relay, where the concept of a CIR (Committed Information Rate) was introduced. The space has continually evolved with QoS being introduced on ATM up through the advanced dynamic algorithmic traffic routing today over IP/MPLS networks.
Resource pooling - Doing this for computing is new outside of the HPC market — telecommunication networks have been multi-tenant since the point the 3rd phone was hooked up over 100 years ago.
Measured Service – Networking has been doing this for years as well, down to the minute or byte of data instead of the hour or GB (the smallest unit of measure any public cloud compute or storage platform bills in).
Broad network access – Service provider IP networks are the ultimate in heterogeneous access through standards based communication. They support connectivity over a number of layer 1 physical mediums using quite a few layer 2 communication protocols.
Cloud computing may actually end up bringing the server market closer to the current networking market than vice versa. An IBM Z-series is capable of very efficiently running Linux instances. It also supports I/O virtualization for both networking and storage with granular controls, features we still don’t have at the same quality level from x86 virtualization solutions. The Oracle Exadata V2 is another example; it supports 1 million I/O per second for non-sequential workloads on databases up to 140TB in size. How many commodity x86 servers does it take to match either of those configurations and how do they compare in capex and TCO (Total Cost of Ownership) to the IBM or Oracle specialized platforms? We even see specialized x86 platforms being developed and deployed by a number of players. Some examples are the Cisco UCS, SGI Ice Cube, and the Sun Modular Datacenter. These platforms are all designed to optimize spend for virtualization/cloud computing workloads and while they may be made up of x86 sub-components they are designed to function as a complete “mainframe” functional unit.
Conclusions
We’re still very early in the technology transition to a full utility style computing grid. As the transition progresses we’ll see more use cases served by a broader range of features. For the small verticals with complex configuration needs and a low willingness to pay a premium we’ll see niche providers.
Networking hardware has been cloud-like for more than a decade, and a few major players dominate the market because of the small number of strong buyers. Technologies such as OpenFlow in combination with Moore’s law have the potential to disrupt the market, but this isn’t a guarantee. The current clouds, built using a massive number of commodity x86 systems, are also not guaranteed to be the future: specialized computing platforms have the potential to deliver better unit economics, and in a commodity business it will come down to the financials in the end.
Availability is a fundamental design concept
Earlier today a conversation on Twitter with Christopher Hoff (@Beaker), James Watters (@wattersjames), George Reese (@georgereese), Benjamin Black (@benjaminblack), and Shlomo Swidler (@ShlomoSwidler) discussed how many people seem to assume that because clouds can scale and rapidly provision servers they’re always available, and that because of this, availability doesn’t have to be a fundamental design concept anymore. It kicked off with @Beaker’s tweet about BitBucket, “Cloudifornication: 20+ hour outage due to EC2/EBS on BitBucket http://bit.ly/A8vCy” BitBucket ran into a problem with EC2/EBS that made their site unavailable for 20+ hours (I’m linking to the comments discussing it on Hacker News since the main BitBucket page is back to normal and no longer shows the explanation now that the problem is fixed). [UPDATE: Adding BitBucket blog post on the outage.]
The purpose of this post isn’t to analyze the BitBucket situation; it is to help people understand how to design an available architecture while still keeping it efficient in terms of expense. Given an unlimited (or nearly unlimited) budget most IT architects will be able to build a “bullet proof” configuration. Most of us don’t function in that world though, so compromises are made. Here I hope to outline how you can compromise effectively by thinking about availability early and often in the design process. The design recommendations I’m going to outline are general in nature and, depending on your specific business and operational model, may not fit. I enjoy discussing specific use cases and designs, so if you’d like analysis directly related to your situation comment on the post and let’s discuss it.
With that disclaimer here goes…a step by step guide to building a web application that will be available “almost all the time”… [Second disclaimer, I work for Rackspace Hosting, we have a cloud (The Rackspace Cloud), the recommendations here are my opinions, not those of my employer.]
1. Start with DNS. This is overlooked quite a bit and is the easiest thing you can do to ensure availability. Get a reliable DNS provider that hosts their DNS servers in multiple data centers that each have multiple peering arrangements, with documentation on their BGP convergence times. This DNS provider should let you set the TTL (time to live) on your A records to 5 minutes or less (some will let you go as low as 1 minute). Now you have the ability to redirect www.yoursite.com to a new IP address in 1-5 minutes. While this may not let you recover your site completely, the worst case is that in 5 minutes you can have a simplified version of your site up and running “somewhere”. Being able to give your customers a “We’re experiencing issues” message with a phone number or other information is invaluable. When customers believe you are working on recovering your site and/or have things under control they’re willing to trust you much more than if they get a 404 or 503 error page from their browser. If they are a new visitor and not a customer, a 404 most likely means they never come back.
2. Design your application with portability in mind. Using a technology only available from a single provider may sound like a good idea but it locks you into that provider. While we all believe our hosting provider will be in business forever, 5 years ago we all thought we’d never see GM go bankrupt or Lehman Brothers cease to exist. Cloud computing makes this much easier to test and implement than it used to be. Part of going from idea to launch should include deploying your application to a minimum of two providers to ensure that if something does happen to your provider you’ll be able to continue to run your business. I don’t recommend trying to run your application on multiple providers as it’ll generally add expense you shouldn’t need; however, I do recommend having your code and data with multiple providers. This requirement means you should try to avoid customizing at the OS/kernel/filesystem level. Those are the main items I see causing difficulty in portability. Next, if you want a hosting provider to support your application infrastructure stack (i.e. the HTTP server [Apache, IIS, etc], database server [Oracle, MySQL, MS SQL, Postgres, etc]) pick standard versions or plan on hiring staff to support your customizations. While a single provider may agree to support your (or their) modifications, others probably won’t. If your provider has their own special versions of the application platform they may be trying to lock you in. Beware!
3. Spend some time on BCP/DR (Business Continuity Planning/Disaster Recovery). You’ve spent months (or years) going from idea to application. If you spend a day or two you’ll have a fair BCP/DR plan; if you have somebody with a background in this you can have a good plan in a day or two. After putting the plan together, TEST IT! I’ve helped a number of businesses put together a plan and after we’re done they check the box, put it in a filing cabinet and then pray they never have to get it out. That mindset is like a football team having a “2 minute drill” playbook but never practicing the plays, hoping that they’ll never need to use it. When it comes down to having to do it, if you haven’t practiced, how well do you expect it to go with the added stress of an outage? “But Bret, I can’t test it, we can’t take our site offline for a test!” You don’t have to go all the way to taking your main infrastructure offline (see #1 DNS). You can bring up the replacement site without ever impacting your real site by modifying the DNS on your test machines (either point them to a BCP system test DNS server or modify the local host files).


4. Backup your data, backup your data, backup your data. Customers will deal with service outages. They won’t put up with you losing their data. You use Time Capsule, Jungle Disk, Mozy, Dropbox, or any number of other personal backup programs for your personal files. If your house burned down you’d still have all of your own stuff. What would happen to your web site if the data center your servers are in burned to the ground? Is the data gone? If it isn’t gone, how long will it take you to restore? Is that timeframe acceptable to you and your users? A couple of concepts to familiarize yourself with are RPO (recovery point objective) and RTO (recovery time objective). RPO means how much data will be lost: if you do a daily backup you have a 24 hour RPO; if you run a transaction replicated database (such as Oracle with Data Guard) with the databases in separate geographic locations your RPO may be under a second. RTO means how long it takes to be back in service. If you’re restoring from a backup medium like tape you’ll be able to recover ~10-40GB/hr (depending on the tape technology and compression ratio of the backup), so if you have a 400GB database you have an RTO of 10+ hours, even if with cloud computing you can instantly have a new database server available to put the data on. With a live database in a second geographic location your RTO is also potentially under a second (for restoring data, since you don’t have a restore; this doesn’t mean your whole site is automatically online in that same time). I won’t go into detail here since we’re talking availability and not integrity, but having a multi-geographic location replicated database doesn’t ensure integrity. You still need snapshots or transaction logs or another way to go back to various points in time if you end up with bad or erased data (see my favorite XKCD, “Exploits of a Mom”).
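To make the RPO and RTO arithmetic in #4 concrete, here is a minimal Python sketch. The function names are mine and purely illustrative, and it assumes a constant restore throughput, which real restores rarely achieve, so treat the numbers as a rough floor rather than a promise.

```python
def restore_hours(data_gb: float, throughput_gb_per_hr: float) -> float:
    """Hours needed to restore data_gb at a sustained restore throughput."""
    if throughput_gb_per_hr <= 0:
        raise ValueError("throughput must be positive")
    return data_gb / throughput_gb_per_hr


def worst_case_rpo_hours(backup_interval_hours: float) -> float:
    """Worst-case data loss window: a failure just before the next backup."""
    return backup_interval_hours


# The 400GB database from the text, restored from tape at 10-40GB/hr.
slowest = restore_hours(400, 10)
fastest = restore_hours(400, 40)
daily_rpo = worst_case_rpo_hours(24)

print(f"RTO for the data restore alone: {fastest:.0f}-{slowest:.0f} hours")
print(f"RPO with daily backups: up to {daily_rpo:.0f} hours of data")
```

Plugging in your own database size and measured restore rate is a quick way to check whether the result is a timeframe you and your users can live with.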
So now that we’ve taken all of this into account — what do we do? My recommendations…
1. Make a “gold build” of each of the server types in your application and understand how long it takes you to have your necessary quantity of each server type online at various providers — cloud makes this much easier, in the dedicated world you’re looking at days typically to provision a new environment.
2. If your business relies on a fully functional web site as a primary revenue stream have a live database at a secondary location with the ability to launch web and app servers to bring your environment online quickly in the event of a primary provider failure. If you can continue to service your customers via phone and/or e-mail have a static version of your web site running that you can switch to using DNS in the event of a primary provider issue.
3. Keep your source code in multiple locations, with multiple employees able to deploy the site in the event of an issue. I’m a huge fan of collaborative code repositories like GitHub and Beanstalk, but if your code is only on one of them and they’re down (or in a maintenance window) when you need that code to bring up a backup environment, you’re stuck. It costs next to nothing to keep that code in multiple places.
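As a rough sketch of the DNS switch these recommendations lean on, the Python below estimates how long a visitor could keep hitting a dead IP and picks which address the A record should point at. Everything here is an assumption for illustration: the FailoverPlan name, the probe interval, the failure threshold, the time it takes your DNS provider to publish a change, and the IP addresses (taken from the ranges reserved for documentation). It also assumes resolvers honor your TTL, which not all of them do.

```python
from dataclasses import dataclass


@dataclass
class FailoverPlan:
    ttl_seconds: int             # TTL on the A record
    check_interval_seconds: int  # how often monitoring probes the primary
    failures_before_switch: int  # consecutive failed probes before acting
    dns_update_seconds: int      # assumed time for the provider to publish

    def worst_case_seconds(self) -> int:
        """Upper bound on how long a visitor may still reach the dead IP."""
        detection = self.check_interval_seconds * self.failures_before_switch
        return detection + self.dns_update_seconds + self.ttl_seconds


def choose_target(primary_healthy: bool, primary_ip: str, fallback_ip: str) -> str:
    """Return the IP the A record should point at right now."""
    return primary_ip if primary_healthy else fallback_ip


# Hypothetical numbers: 5 minute TTL, probe every minute, act after 3 misses.
plan = FailoverPlan(ttl_seconds=300, check_interval_seconds=60,
                    failures_before_switch=3, dns_update_seconds=30)
print(f"Worst case: {plan.worst_case_seconds() / 60:.1f} minutes")
print("Point www at:", choose_target(False, "192.0.2.10", "198.51.100.20"))
```

The point of writing it down is that the TTL is only part of the delay. Detection time and the DNS update itself add to it, which is one more reason to practice the switch before you need it.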
I understand that nowhere in this post do I mention HA (high availability) nor do I mention things people generally think of when they hear HA. Having redundant switches, firewalls, routers, and servers all in a single location (what people generally think of when they hear HA) will ensure that location stays online and you should certainly be doing that but it puts all of your eggs into that basket if you aren’t looking at HA beyond the single infrastructure. Now that I’ve mentioned it if you want to learn more about HA design in a single location the Internet is full of good information on the topic.
I’ve also focused the discussion on architectures relevant to “most folks”. If you’re Facebook, eBay, or Google (the search engine) you don’t want to rely on DNS to deal with outages at a specific location. You’ll want to pair DNS with GLB (global load balancing) and BGP so you can have near real-time re-routing of users and potentially even sessions. My availability recommendations certainly aren’t free to implement but they also don’t double your expenses. It is very possible to add 5-25% to your hosting expense and significantly increase your availability (and decrease your RPO/RTO).
I’m going to also note that I didn’t mention systems management or monitoring here really. Those are both key items to understand to have an available environment but aren’t directly tied to designing an available architecture. You’ll need to have proper systems management tools and policies (or you’ll cause outages yourself) and you’ll need monitoring so you know when to implement your BCP/DR plan.
Cloud Computing, “For Everyone, Not Everything”
Cloud computing is a broad term that covers Internet based services that provide SaaS (Software as a service), PaaS (Platform as a service), and IaaS (Infrastructure as a service). SaaS services are the most commonly used cloud solutions, and web based e-mail is the prime example. The most widely used PaaS offering is probably WordPress.org, unless you consider customizing your Facebook profile a very restricted PaaS. IaaS is the newest of the cloud services, with the best-known example being Amazon Web Services, which includes EC2 (cloud servers) and S3 (cloud storage).
Until Hotmail launched in 1996 we all pretty much had an e-mail client on our own system and potentially had to run our own mail server if we didn’t want to have a mailbox tied to our college or ISP. Now almost all of us use any number of SaaS e-mail services. Many of these e-mail services, such as Rackspace E-mail or Google Apps Enterprise, now include the full features that businesses expect.
Before cloud based services, if you wanted to have a website you had to run your own server. That changed when GeoCities launched in late 1995, and now PaaS providers from GoDaddy, for low price, to Mosso, for horizontal scale, provide very capable platforms to deploy a website without having your own server.
Now IaaS providers like Amazon, Terremark, and Rackspace are eliminating the need to always deploy and manage dedicated configurations for complex applications. Before these types of IaaS offerings, companies like Twitter would end up with their own datacenters and dedicated infrastructure, and load testing services from companies like SOASTA would have been cost prohibitive to offer.
So what about the title, “For everyone, not everything”? It sounds like cloud has the capability to do everything now, doesn’t it? In a broad sense, yes, it can do a bit of everything, but specific use cases in all service types aren’t a fit for cloud. In the e-mail world, if you want to do offline messaging on an airplane you want a mail client. At the platform service level, perhaps your application runs 10x faster if you can customize a couple of libraries, or it just doesn’t work at all without those changes. The infrastructure offerings force you to re-architect for horizontal over vertical scale to use them effectively.
Many other use cases aren’t a fit for the cloud yet. Take video rendering as an example: it is much less expensive to buy a video card capable of performing rendering than it is to stream the rendered video over a network as 30 JPGs per second. Another example is a retail POS system, where at least some of the functionality needs to be in the store, because you don’t want to stop selling things if network connectivity is lost. Many more reasonable examples abound.
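For a feel of the video rendering case, here is a back-of-the-envelope Python sketch of the bandwidth needed to ship rendered frames over a network. The 100KB per frame figure is a hypothetical assumption of mine, not a measurement; real JPG sizes vary widely with resolution and quality, so substitute your own.

```python
def stream_mbps(frames_per_second: int, frame_kilobytes: float) -> float:
    """Sustained bandwidth in megabits per second for an uncached frame stream."""
    bits_per_second = frames_per_second * frame_kilobytes * 1000 * 8
    return bits_per_second / 1_000_000


# 30 JPGs per second (from the text) at an assumed 100KB per frame.
print(f"{stream_mbps(30, 100):.0f} Mbit/s per viewer")
```

Under that assumption each viewer needs a sustained 24 Mbit/s, and the bill scales with every additional viewer, while a local video card is a one-time purchase.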
Will cloud ever be the answer for all computing needs? I doubt it, but over time it will be used to solve more problems because a centrally managed pool of resources provides greater efficiency and flexibility. An example of this is utility power: we use it almost exclusively now, but for a few use cases we still need generators. Cloud will succeed and it will be adopted for a wider set of use cases over time as it addresses those use cases better than previous generation solutions.
Cloud Computing forces IT “Evolve or Perish”
When I started this blog I thought I’d be talking about technology on a regular basis and so far I haven’t. This is still somewhat business related but it is also very tech heavy. The tech focused pieces I intend to explain at a level that an average “nerd” gets but the average adult can read.
Earlier today I spent an hour watching one of the Rackspace founders deliver a training video intended for new hires in 1999. In the video they go through the complexity of ensuring hardware works properly together, that the OS is installed properly, and that DNS is configured properly. Now, just 10 years later, much of this is significantly simplified. When was the last time you spent time dealing with an “IRQ conflict” or “checking jumper settings” (hardware related troubleshooting that is automagic today)?
Now as we move to cloud computing with pre-defined virtual machine images the “OS is installed properly” piece is going away. Projects like TurnKey Linux will lead to one-click application stacks on top of an OS. For much of the IT community their career has been performing these tasks. Now instead of an application developer needing a system administrator to “build the server” they go to a web based control panel, pick the system type they want and click “create” and the server is spawned.
It isn’t that the system administrator career is being completely eliminated; rather instead of every company needing their own system administrators in the future the computing providers will need them and general business will only need to have an IT staff that works on their specific business applications. Business won’t need to have many other “building block” level IT roles either: networking, desktop support, and storage/backup administrators.
Many in the IT industry think I’m taking things a bit far when we have this discussion. I don’t believe it’ll happen overnight, but during the next 10-20 years it will. Looking back, nobody still has a “typing pool” to type up hand written notes, a “courier” to deliver a message across town in a hurry, or a “research” department to go look up basic information we all have access to now through a search engine in a matter of seconds.
This is where the “evolve or perish” comes in. If you’re within 10 years of retirement and focused on the building blocks you may want to consider a job at an infrastructure company, or risk the business you work for now eliminating your position in a transition to cloud computing. If you’re at the start of your career and focused on those building blocks you need to be the best and brightest in your field so you can obtain one of the service provider jobs in a much smaller market going forward. Your other option is to evolve and move further up the application stack. This could mean learning how to properly architect an application to make the most cost effective use of the utility priced OS clouds, or it could mean going all the way up the stack to interface design.
This isn’t all doom and gloom. Evolution and automation like this increase productivity allowing us to focus on moving forward more rapidly. If you enjoy your IT industry job start asking your employer what you can learn above and beyond the building blocks to help out. While you may not need to today it is much better to be ahead of the game rather than waiting around for a layoff to start learning in panic mode.


