Making Multi-Cloud Actually Good
Multi-cloud is what we do, now. Didn’t you get the memo?
Literally every single industry research survey out there will tell you: around 80 to 90% of businesses (anywhere from SMB to Enterprise) are using more than one cloud, with an ever increasing number of them incorporating multi-cloud into their mission-critical workloads.
Does that upset you? If so, you might not be alone. Talk to experts (myself included) and they’ll tell you all about how multi-cloud is a misguided, short-sighted, naive, no good doo doo strategy.
In fact, it is outright difficult to find articles championing multi-cloud that aren’t being pushed by organizations with something to sell. Trust me, I tried. Regardless, even though it’s seemingly only propped up by sales pitches, multi-cloud is becoming (has become?) an industry standard.
How does this happen? How does an entire industry transform around a strategy that does not fit within the consensus view of cloud specialists? It can’t be that bad if it’s so widely adopted, can it?
And if it is actually that bad, is there a way to rethink the concept as a whole and make it actually good?
Today, I’ll answer both those questions. In a valiant effort to get everyone to hate me, I’ll try to both debunk and validate multi-cloud in one fell swoop.
What does multi-cloud look like?
Before going too deep into debating multi-cloud, let’s take a step back and assess what it actually is. It’s important not to confuse it with an organization consuming cloud-based services from multiple providers, be it Office365 or GSuite (you’d be surprised how often IT leaders make that mistake — buy me a Scotch and I’ll tell you who’s to blame for this).
When I talk about a multi-cloud strategy, I’m referring to a strategy where an organization actively deploys its workloads across several cloud platforms. Of note, the term “Cloud platforms” does not discriminate between private and public clouds, meaning that “hybrid cloud” is implicitly included within the concept of multi-cloud.
You may notice that this definition is very broad. As a result, there are many different ways to implement multi-cloud as a concept — in real life, though, the most common approach by far is based on a strategy meant to avoid vendor lock-in called cloud agnosticism.
Cloud agnosticism is the practice of designing solutions that solely make use of capabilities that are common to all cloud platforms in use by the organization, their functional lowest common denominator. This way, any workload can be migrated from cloud to cloud or even be operated on all of them simultaneously, theoretically allowing for seamless migrations to cloud platforms providing the best performance on a metric (typically cost).
Designing to the lowest common denominator can mean different things based on how keen an organization is on maintaining private cloud portability (as opposed to going all-in on public cloud) and whether it’s willing to relinquish some control to a cloud service provider (e.g. by choosing a managed database service for hosting databases instead of providing a DBA team deploying and managing database software on dedicated hosts).
Cloud agnostic strategies that aim to maintain private cloud portability will typically limit design options to capabilities close to the traditional datacenter feature set, essentially treating the cloud as an undifferentiated VM farm with burst capacity. Organizations unwilling to relinquish control by leveraging managed offerings on public clouds will also prefer this approach, regardless of their use of private cloud, as it is the “truest” form of cloud agnosticism.
Organizations letting go of private cloud portability and willing to adopt managed offerings have access to a broader catalog of eligible options — any capability shared across clouds is fair game. Their ability to leverage those services, however, is directly dependent on how much effort is invested in abstracting the different platforms’ APIs. For instance, every public cloud platform offers managed database services, but provisioning them requires interacting with drastically different APIs and choosing a database engine common to all of them. Abstracting platforms is usually accomplished by investing in agnostic Infrastructure-as-Code (IaC) tools like Terraform or Pulumi, or (less commonly) by leveraging multi-cloud management platforms such as Morpheus Data or Flexera/RightScale, whose stated objective is to abstract cloud-specific concepts and help define workloads as cloud-agnostic units of infrastructure.
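To give a feel for what “abstracting the different platforms’ APIs” costs, here is a minimal Python sketch of a lowest-common-denominator interface for managed databases. Everything in it is hypothetical: `ManagedDatabaseSpec`, `render_request` and the two translators are names made up for illustration, and the payload shapes are simplified stand-ins rather than actual provider API calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedDatabaseSpec:
    """Cloud-agnostic description: only options every platform supports."""
    name: str
    engine: str       # must be an engine offered on every platform in use
    storage_gb: int


# Hypothetical, simplified payload shapes; real provider APIs differ in detail.
def to_aws(spec):
    return {
        "DBInstanceIdentifier": spec.name,
        "Engine": spec.engine,
        "AllocatedStorage": spec.storage_gb,
    }


def to_gcp(spec):
    return {
        "name": spec.name,
        "databaseVersion": spec.engine.upper(),
        "settings": {"dataDiskSizeGb": spec.storage_gb},
    }


# One translator has to be written and maintained per platform.
TRANSLATORS = {"aws": to_aws, "gcp": to_gcp}


def render_request(platform, spec):
    """Translate the agnostic spec into one platform's request payload."""
    if platform not in TRANSLATORS:
        raise ValueError(f"no translator written for platform {platform!r}")
    return TRANSLATORS[platform](spec)
```

Note what the spec leaves out: anything only one platform offers has no field to live in, which is the lowest common denominator at work.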
Why did multi-cloud happen?
Now that we understand what multi-cloud is, let’s explore why it came to be. The reasoning behind going the multi-cloud route is sound, at least at first glance. The big ticket items you’ll often see include:
- Avoiding vendor lock-in — “I can leave whenever I want (because I’m cloud agnostic)”
- Access to best-of-breed services — “The right tool for the right job (which can either mean best or cheapest, depending on your priorities)”
- Negotiation leverage — “The other guys offered X, can you do better?”
- Resiliency in case of an outage — “In case A goes down, I’m still up on B”
- Geographic coverage — “Only one of them has a DC close to my customer/within my legal data residency parameters”
- Something vague about compliance — “I don’t really have a quote for this one, but I’ve seen it so often with no clear definition that I’m listing it here nonetheless”
They generally boil down to risk management, cost reduction and performance optimization measures. What’s wrong with that?
Well, as it turns out, a lot of those big ticket items stem from unfounded assumptions about cloud originating from traditional IT practices.
As so many experts out there will tell you (Corey Quinn more eloquently than most):
- As we’ve covered higher up, avoiding vendor lock-in by keeping to cloud agnosticism will limit your design choices to the lowest common denominators; in other words, neither the best nor cheapest, completely invalidating the best-of-breed argument.
- Splitting your spend across several clouds will only hinder your negotiation leverage, not help it. Enterprise discounts in the public cloud are fairly formulaic and get greater as your total expected cloud spend rises. “Threatening” a cloud service provider with less spend only serves to weaken your position.
- You can be plenty resilient within a single public cloud. Full region outages are rare (although Azure has tried its damnedest to prove me wrong lately), and deploying your workload across multiple availability zones over multiple regions should give your SLA enough nines to cover most if not all cases.
- The geographic coverage argument used to be valid in the early days of cloud, but the overlap across vendors has become so large that it’s essentially invalidated.
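The resiliency point in the list above can be checked with a back-of-the-envelope calculation. The sketch below assumes failures are fully independent (real zones and regions are not, so treat the result as an upper bound) and uses a made-up 99.9% availability figure rather than any provider’s published SLA.

```python
def combined_availability(single, replicas):
    """Availability of independent replicas when one survivor is enough."""
    return 1 - (1 - single) ** replicas


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes over a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


ZONE = 0.999  # hypothetical availability of the workload in a single zone

one_region = combined_availability(ZONE, 3)         # three zones, one region
two_regions = combined_availability(one_region, 2)  # same layout, two regions

print(f"single zone: {downtime_minutes_per_year(ZONE):.1f} min/year")
print(f"three zones: {downtime_minutes_per_year(one_region):.6f} min/year")
```

Under those assumptions, a single zone is down for roughly 525 minutes a year, while three zones in one region already push the expected downtime below a second.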
Alright, so if it is obvious to cloud specialists that those assumptions are wrong, why are they so widespread? Well, that’s quite simply because there aren’t enough cloud specialists to shoot down those assumptions everywhere they pop up. Don’t just take my word for it (although you should, I’ve experienced that specific pain point extensively): you can find survey after survey going back several years outlining the lack of cloud expertise in the industry. Contrast this to the ubiquity of cloud adoption and it becomes easy to understand how the industry might exhibit counter-intuitive trends.
This doesn’t fully explain away the sheer magnitude of multi-cloud adoption, however. No, for that, you have to turn to a much more nefarious force:
Late movers to the cloud market.
Unable to compete head-on with the big three on sheer breadth of services, late movers instead saw an opportunity to carve a piece of the market for themselves by playing into those widespread assumptions. By championing a multi-cloud approach, they could continue to sell their cloud services without needing to outdo the market leaders — cloud agnosticism effectively shielded them from differentiated service offerings. With most of those players already having the trust of IT leaders from their past dealings with them (be it in virtualization, hardware, colo or otherwise), positioning themselves as trusted cloud advisors was a relatively easy task.
To be clear, although I did joke about this higher up, I don’t mean to imply that those organizations were being intentionally misleading. Quite the contrary, they simply saw a convenient, highly in-demand gap in the market they were late to and opportunistically filled it.
And there you have it. At the confluence of a market in dire need of cloud expertise and a plethora of opportunistic vendors, you get the proliferation of multi-cloud.
So that’s it, right? Multi-cloud was born from ignorance, ergo it’s bad and everyone who champions it should feel bad, case closed? Of course not.
Why should multi-cloud happen?
While it’s easy to dismiss most of the often parroted multi-cloud talking points, one of them stands out as being much harder to summarily discard:
- Access to best-of-breed services — “The right tool for the right job (which can either mean best or cheapest, depending on your priorities)”
For this argument to validate multi-cloud as a strategy, cloud platforms would need to differentiate themselves in ways that are significant yet specific enough not to warrant simply moving the entire business over.
Knowing how each cloud provider tries very hard to cover the same ground the others do, are there even cases where one platform does one thing so dramatically better than the others that it would justify going multi-cloud over it instead of outright moving to it?
Back in April, Zoom made a deal to move a large percentage of their infrastructure to Oracle Cloud. The move raised eyebrows throughout the industry (mine included) as Oracle Cloud isn’t exactly a market leader. It took the brilliant Corey Quinn (yes, this guy again, he’s worth reading) to make me understand why. In short: Oracle Cloud is cheap when it comes to outbound data. Dirt cheap. Like, one order of magnitude cheaper than AWS, and basically all Zoom does is send data out to its users. Seven petabytes a day, according to Oracle, which is huge. At $0.0085/GB, every 30 PB sent out costs $255,000, versus $1,500,000 at AWS (using their most competitive public pricing tier). This represents a whopping 83% saving, more than you could ever save through enterprise discounts alone (if it was possible, Zoom would have pulled it off). Oh, and to top it off, that’s using public pricing, which for a deal this size Zoom absolutely isn’t paying.
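The arithmetic is easy to reproduce in a few lines of Python. Two assumptions are baked in: the $0.05/GB AWS rate is the one implied by the $1,500,000 total, and 30 PB is the volume that reproduces both dollar figures. At Oracle’s quoted seven petabytes a day (about 210 PB over 30 days), the totals would be seven times larger with the same percentage saving.

```python
GB_PER_PB = 1_000_000  # decimal units, the convention on cloud price lists


def egress_cost(petabytes, price_per_gb):
    """Cost in dollars of sending `petabytes` of data out at a flat rate."""
    return petabytes * GB_PER_PB * price_per_gb


ORACLE_RATE = 0.0085  # $/GB, from the article
AWS_RATE = 0.05       # $/GB, assumption implied by the $1,500,000 figure

oracle = egress_cost(30, ORACLE_RATE)
aws = egress_cost(30, AWS_RATE)
saving = 1 - oracle / aws

print(f"Oracle: ${oracle:,.0f}  AWS: ${aws:,.0f}  saving: {saving:.0%}")

# Same comparison at seven petabytes a day over a 30-day month.
oracle_7pb_day = egress_cost(7 * 30, ORACLE_RATE)
aws_7pb_day = egress_cost(7 * 30, AWS_RATE)
```

Because both rates are flat, the saving depends only on the ratio of the two prices, not on the volume.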
Of course, this does not mean that Oracle Cloud, as a whole, is better than the other platforms (it isn’t). What it does mean is that, for the specific requirements of their video streaming workload, Oracle Cloud’s drastically lower outgoing data pricing was a game changer — the right tool for the right job.
This shows us that there is at least some legitimacy to the best-of-breed argument. A cloud platform (that wasn’t a market leader!) exhibited a characteristic that made it starkly superior to the other choices for a specific type of workload. So superior in fact, that it made a solid business case for enabling multi-cloud in an organization already operating its workloads successfully on another cloud.
It might be tempting to dismiss this example as a fluke. The simple fact that the entire industry was surprised and confused by Zoom’s move should tell you that it is somewhat of an outlier. After all, the argument that multi-cloud enables best-of-breed selection of services has been made ever since its inception; if it were true, there would be hundreds of Zooms out there, moving around clouds making massive gains. What is different now?
Differentiated Clouds
Up until fairly recently, cloud vendors were racing towards the same goalpost. Every one of them wanted to become the platform, enticed by the prospect of making AWS money. What this meant was, basically, playing catch-up to AWS, tacking “as-a-service” onto your existing product catalog and using it as leverage to transform your existing clientele into a cloud market share. As a result, there was a tremendous amount of convergence around the concept of cloud — with a few exceptions (Google comes to mind), AWS had the leeway to innovate while its competitors mostly just mimicked it.
Eventually, Azure for all intents and purposes caught up to AWS, leaving the two of them constantly racing to out-do each other in an ever-increasing breadth of services. It is unlikely at this point that any player, leading or otherwise, might leave the others in the dust unequivocally.
One by one, cloud service providers are starting to accept that they don’t have the resources or traction required to achieve or maintain as broad a scope as the top two — even if they do, being “just” as good as them is not good enough to warrant a move. This leaves them with only two options: change their approach (like VMware going all-in on hybrid cloud) or specialize their offering (like GCP refocusing on data analytics and a small number of business verticals). In other words, differentiating themselves.
In fact, GCP’s decision to specialize is one that we need to take note of. As one of the “big three”, albeit with the smallest adoption rate, GCP’s behavior is the one most likely to tell us where Azure and AWS might go when their market shares stop growing as explosively. Beyond that, how GCP specializes is also especially noteworthy as it will likely shape how the smaller players in the market do it. How do they do it? Through an increasingly refined PaaS offering that embraces multi-cloud.
It makes sense that PaaS would be their way to go (as opposed to, say, trying to go hard on CPU/GPU performance and pricing alone). PaaS is the most convenient way they can leverage their vast internal expertise and package it as a pure managed cloud-native service for their customers. The market sees this as well — as organizations go deeper and deeper in their cloud adoption, we see an ever increasing adoption of PaaS offerings.
The reason why it resonates so much goes back to the drought of cloud expertise in the industry. Anything that can increase capabilities and reduce operational overhead is generally good, but in a resource-deprived industry, it becomes crucial. It doesn’t make sense to spend your precious few resources reinventing the wheel and operating your infrastructure when you can have them focus on creating new business value.
There is already momentum in the market validating GCP’s move towards differentiated PaaS. Smaller cloud service providers such as Snowflake and Wasabi are getting a lot of traction making highly specialized PaaS solutions and explicitly positioning themselves as competitors to individual services of the bigger guys, lending themselves to being used in a multi-cloud pattern.
As the market stops trying to simply emulate the big players and instead moves towards starkly differentiated offerings, we will reach a point where “doing multi-cloud” will mean transparently composing solutions from a catalog of best-of-breed cloud native services — an actually good multi-cloud strategy.
What should multi-cloud look like?
Of course, all we’ve done so far is argue that multi-cloud can be a good thing. The idea of cherry picking the best of each platform sounds great in theory and has been a selling point of multi-cloud from day one, but it is worthless if it can’t be realized without sacrificing operability. Native service interoperability is one of the greatest strengths of a platform like AWS, after all, so to make a solid case we need to establish a strategy founded squarely on taking advantage of differentiated offerings.
A multi-cloud playbook promoting interoperability
- Select a “home base” cloud platform
First of all, an organization should adopt a single cloud service provider as its home base, preferably a leading public cloud platform. Mastering even one cloud is hard, resources are scarce, and even an ideal world multi-cloud scenario involves at least some operational overhead. Before an organization can make an enlightened decision as to which multi-cloud services it should adopt (and whether it should even adopt them), it should have a very solid grasp on its home base. Selecting a home base also focuses most of the spend (at least initially) on one vendor, making access to discounted rates more feasible — in turn providing an optimized price point which external services have to beat to demonstrate cost-effectiveness.
Another non-negligible but less quantifiable advantage I would note, when it comes to committing to a home base, is the ability to build a tighter relationship with that provider’s resources (reps and customer architects). It is through those relationships that you will be granted access to expertise, funding programs and expedited support should the need arise.
- Embrace cloud native services
It goes without saying that to fully leverage differentiated cloud services, we must let go of cloud agnosticism as a design pattern. Extracting full value from cloud platforms cannot be achieved without committing to their technology, at least in some form. “Full value” here can mean anything from lower cost, better performance, or more importantly, lower operational overhead.
In an industry as strained for resources as ours, lowering operational overhead is proving ever more crucial. As a result, managed services should be leveraged as much as possible. I am, of course, not advocating for anything as dogmatic as blindly selecting managed services at every turn (you should always run some discovery to properly understand their limitations and cost structures, as they can prove daunting), but I am suggesting the adoption of a managed-first mentality.
- Adopt cloud agnostic tooling
As we will be introducing more variety in our cloud services, the only way to function at scale has to be through standardized tooling able to support the entirety of your ecosystem. Agnostic IaC technologies such as Terraform and Pulumi are already well established and extensible enough to cover most infrastructure provisioning needs. One thing I would highlight here, however, is the fact that I am drawing a line between tools and tooling infrastructure: while I am telling you to use agnostic tools, I am not telling you that the infrastructure supporting them (your CI/CD pipeline, for instance) should be agnostic.
I wouldn’t necessarily extend the cloud agnostic tooling rule to the application layer either. I believe a standard application deployment model is a Good Thing™, but in a model where we are solidly anchored in a home base platform, I wouldn’t preclude leveraging cloud native solutions for deployments (ECS on AWS, Cloud Run on GCP or Serverless in general, for example). With that said, I will deliberately leave some ambiguity around the prospective role of agnostic orchestration technologies like Kubernetes in this particular multi-cloud framework as I believe the topic warrants its own conversation.
- Establish automated agnostic governance layer
As multi-cloud democratizes the use of specialized SaaS/PaaS offerings across clouds, governance will become ever more fragmented (read: difficult). While reporting and auditing tools will remain relevant as reactive fail safes, the sheer number and spread of components in use will require more proactive regulatory compliance and CostOps management measures. The best way this can be enforced is through the use of Policy-as-Code (PaC) tools as part of CI/CD flows, essentially killing any derogation to policy before it can reach production.
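As a rough illustration of what a policy gate in a CI/CD flow does, here is a minimal Python sketch. It is not the syntax of any real PaC tool: the resource dictionaries, the `POLICIES` list and the `gate` function are all hypothetical, and the allowed regions are just example values.

```python
# Each policy is a (rule, message) pair; a rule returns True when compliant.
# Attribute names and allowed regions below are illustrative assumptions.
ALLOWED_REGIONS = {"ca-central-1", "northamerica-northeast1"}

POLICIES = [
    (lambda r: r.get("encrypted", False), "storage must be encrypted"),
    (lambda r: "cost-center" in r.get("tags", {}), "missing cost-center tag"),
    (lambda r: r.get("region") in ALLOWED_REGIONS, "region outside residency boundary"),
]


def check_policies(resource, policies=POLICIES):
    """Return the messages of every policy the resource violates."""
    return [message for rule, message in policies if not rule(resource)]


def gate(planned_resources, policies=POLICIES):
    """Map each non-compliant resource name to its violations.

    A pipeline step would fail the build whenever this is non-empty,
    stopping the change before it reaches production.
    """
    violations = {}
    for resource in planned_resources:
        failed = check_policies(resource, policies)
        if failed:
            violations[resource["name"]] = failed
    return violations
```

The point of the pattern is that compliance and cost rules are evaluated against the planned change, before anything is provisioned, instead of being discovered in an audit report afterwards.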
This isn’t currently a problem with a perfect solution, however. Both Terraform with Sentinel (within the context of the HashiCorp Enterprise suite, at least) and Pulumi with CrossGuard could fill that gap, but their respective coverage is still fairly limited and neither has been adopted as an industry standard by any means. Looking to the horizon, there is a promising CNCF-backed initiative called Open Policy Agent that may eventually provide a viable open PaC alternative, but it still has some ways to go before being solid enough for prime time.
- Commoditize all cloud services
Facilitating easy, safe and optimized access to cloud services (both native and multi-cloud) is accomplished through IaC — every cloud service can be packaged as a centrally-managed IaC module and made available through a catalog. Those modules can then be versioned, updated centrally and automatically distributed across the organization; they can also seamlessly include PaC artefacts along with any service-specific configuration.
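The catalog idea can be sketched as a small data structure. This is a toy model under stated assumptions, not how any IaC registry is implemented: `CatalogModule`, `Catalog` and the policy names are hypothetical, and versions are plain `(major, minor, patch)` tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogModule:
    """One published version of a packaged cloud service."""
    name: str
    version: tuple        # (major, minor, patch)
    platform: str         # where the service actually runs
    policies: tuple = ()  # names of PaC artefacts shipped with the module


class Catalog:
    """Central registry that application teams pull modules from."""

    def __init__(self):
        self._modules = {}

    def publish(self, module):
        self._modules.setdefault(module.name, []).append(module)

    def latest(self, name, major=None):
        """Newest version of a module, optionally pinned to a major version."""
        candidates = [
            m for m in self._modules.get(name, [])
            if major is None or m.version[0] == major
        ]
        if not candidates:
            raise LookupError(f"no module {name!r} matching major={major}")
        return max(candidates, key=lambda m: m.version)
```

Pinning to a major version lets the central team ship fixes and policy updates automatically while consumers opt in to breaking changes on their own schedule.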
While I won’t delve too deep into the nitty gritty of CI/CD in the context of this article, I will note that for the catalog model to work properly, IaC and PaC should be systematized as part of GitOps (or any pipeline-driven deployment methodology, really).
- Decouple identity management
Before spinning off into new cloud platforms, it’s essential that identity management be decoupled by using an Identity Provider (IdP) linked to every cloud platform through the System for Cross-domain Identity Management (SCIM) standard. SCIM ensures that identity constructs (users and groups) are replicated from your IdP to your various cloud platforms, making group membership coherent and centrally managed. Azure AD and Okta are both very solid candidates for this role.
- Reconsider whether you really need multi-cloud
Truthfully, most organizations would probably do well not to go beyond this stage without a solid use case. If you’ve only got a few straightforward workloads, you should probably commit to a single cloud platform and call it a day. While a singular multi-cloud service might prove attractive, building the interoperability required to use it will incur cost and overhead; the adoption of multi-cloud services has to provide enough value to offset those on top of being markedly superior to your foundational cloud’s native equivalents.
- Enable interoperability
The key enabler for a viable multi-cloud strategy is the ability to have cloud services interoperate across clouds at least to a level remotely comparable to native. With multi-cloud interoperability, multi-cloud services can be packaged just like any other native cloud services and used indiscriminately by application teams. While there isn’t a turnkey solution to solve that problem today, there are a few things that come to mind that may pave the way:
8.1. Cloud private interconnectivity
The first step to achieve interoperability is to make inter-cloud traffic fast, cost-efficient and secure; service providers such as Megaport, Equinix or Rackspace come to mind as offering pipes to the different cloud platforms (and your own DC) and allowing for a performant interconnect.
8.2. Cloud provider-directed connectivity
By leveraging provider-directed connectivity (such as Public VIFs in AWS or Microsoft Peering in Azure) and partner connection lines (PrivateLink in AWS), direct access to cloud native service APIs can be achieved across platforms.
8.3. Multi-cloud transitive connectivity
With private and provider-directed lines established, BGP or automated static routing can be leveraged to make multi-cloud IPs resolvable across clouds. SD-WAN or augmented VPN providers such as Cisco or Aviatrix can probably solve this piece of the puzzle as they abstract complex network configuration (a good thing!), but I would caveat this with the fact that non-native technologies for cloud networking feel like a step in the wrong direction as they hinder the native interoperability of our home base services.
8.4. Multi-cloud service API Resolution
DNS forwarding can be leveraged to expose service APIs across clouds, which, combined with private transitive connectivity, can provide near-native functionality. There may be a way to leverage Consul along with its DNS resolver to make this easier as well, although this is a hypothesis; I haven’t seen it implemented.
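The core of DNS forwarding is a simple decision: which resolver should answer for a given name. The Python sketch below models that decision only; the zone names and resolver addresses are invented for illustration, and a real setup would express the same rules in each cloud’s DNS service configuration rather than in application code.

```python
# Hypothetical conditional-forwarding rules: private zone -> resolver address.
FORWARDERS = {
    "aws.internal.example.com": "10.10.0.2",
    "gcp.internal.example.com": "10.20.0.2",
}


def pick_forwarder(hostname, forwarders=FORWARDERS, default=None):
    """Return the resolver for the longest zone suffix matching `hostname`.

    Falls back to `default` (for example, the local resolver) when no
    forwarding rule applies.
    """
    labels = hostname.rstrip(".").lower().split(".")
    # Try the full name first, then drop leading labels one at a time,
    # so the most specific zone wins.
    for start in range(len(labels)):
        zone = ".".join(labels[start:])
        if zone in forwarders:
            return forwarders[zone]
    return default
```

With rules like these on both sides of the interconnect, a workload in one cloud can resolve a private service endpoint in the other by name, and the routing described earlier takes care of actually reaching it.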
8.5. Multi-cloud ACL Compatibility
The final challenge to overcome would be to make ACLs play nice together across clouds. Something like Vault can provide a cloud agnostic point of truth which could be leveraged for access level translation across cloud platforms.
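One way to picture access level translation is as a lookup from an organization-wide access level to each platform’s closest built-in role. The sketch below is only that lookup, written in Python; it does not call Vault or any cloud API, the level names are hypothetical, and the role pairings are approximate since built-in roles do not line up exactly across platforms.

```python
# Organization-wide access levels mapped to each platform's closest
# built-in role. The pairings are approximations chosen for illustration.
ROLE_MAP = {
    "read-only": {"aws": "ReadOnlyAccess", "azure": "Reader", "gcp": "roles/viewer"},
    "admin": {"aws": "AdministratorAccess", "azure": "Owner", "gcp": "roles/owner"},
}


def translate(level, platform, role_map=ROLE_MAP):
    """Return the platform-specific role for an agnostic access level."""
    try:
        return role_map[level][platform]
    except KeyError:
        raise LookupError(f"no mapping for level {level!r} on {platform!r}")


def grants_for(group_levels, platform):
    """Translate a {group: level} assignment into {group: platform role}."""
    return {group: translate(level, platform) for group, level in group_levels.items()}
```

Keeping the group-to-level assignment in one place and deriving the per-cloud roles from it is what makes the central store the point of truth; the hard part in practice is agreeing on levels coarse enough to exist everywhere.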
Putting it all together
Successfully meeting every one of those challenges will leave you with a multi-cloud strategy optimized around teams with limited resources, clouds with differentiated capabilities and services with near-native interoperability. Application teams will be able to select the best available services for their needs from a catalog of centrally-managed IaC modules. The origin of the modules (whether native or multi-cloud) will be made irrelevant as interoperability will be managed transparently beyond their scope.
You’re not done, though.
One topic I have not touched on here is the organizational transformation required to support such a model. IT groups will have to transform into specialists providing cloud expertise and enabling highly empowered product teams (not unlike a Center of Excellence model) while developing a catalog of commodity service modules. CostOps will have to transform their view of services from fully cost-centric to something more understanding of value add, where a service might be more expensive but provide enough capability to offset its price tag through facilitated innovation and shortened iterative cycles. In other words, I may have attempted to solve a technical challenge here, but it is an organizational one as well.
Final Thoughts
Of course, I cannot stress enough that I have not seen this model implemented in real life. While I have seen enough of its components implemented to have confidence in its viability as a starting point for “good multi-cloud”, this model is still based on personal observations and legitimized by pre-empting some emerging trends in the market. I strongly encourage you to challenge my take, if only so I stop loudly arguing with myself while pacing around in my living room.
This is the first of many conversation pieces I intend to write on topics gravitating around cloud technologies. Follow me on Medium or find me on Twitter and LinkedIn for more.
Shoutout to @bquimper, @g_kima and @Wakeupnasko for their amazing feedback