Networking is the lifeblood of the cloud. In an environment that makes both server and storage resources available at hyperscale, provisioning and using cloud-based services is primarily a matter of moving data from place to place quickly and efficiently. But not all cloud network architectures are the same; performance can vary widely from provider to provider. Even if you aren't provisioning actual network services from a cloud provider, you should still include some basic networking items in your next SLA. Some of these may fall outside the providers' standard agreements, but they want your business, so it never hurts to ask.
In truth, all aspects of the cloud service should be subject to uptime requirements. Networking deserves special attention, however, given that when it goes down, so does the entire service. Networking is also much more difficult to restore properly once connectivity has been lost. At the moment, most cloud providers guarantee 99.9 percent uptime (so much for five- or even four-nines availability), which translates to about 43 minutes of lost access in a 30-day month. Some providers, however, offer 100 percent uptime. This doesn't mean they will never suffer an outage, just that any unplanned loss of service at all entitles the customer to compensation, the next item on our list.
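A quick way to check the arithmetic behind any uptime percentage is to convert it into a monthly downtime allowance. The sketch below assumes a 30-day billing month; the function name and parameters are illustrative and do not come from any provider's SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime per month permitted by an uptime guarantee.

    Assumes a fixed-length billing month (30 days by default).
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Compare common availability targets side by side.
    for target in (99.9, 99.99, 99.999, 100.0):
        print(f"{target}% uptime allows {allowed_downtime_minutes(target):.2f} min/month")
```

Running the same calculation against the number of days your provider actually uses for billing will show how much the allowance shifts from month to month.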
What happens when providers fail to maintain uptime levels? You'll find a wide range of remedies throughout the cloud universe, where providers strive to keep their customers happy without giving away the store. Contracts for part-time service usually feature a one-to-one return (or higher) of free service for time lost. Full-time contracts employ a variety of rebates, such as a day’s credit for each hour of downtime above SLA limits or a sliding scale that lowers monthly bills in proportion to time lost. Make the methods used in calculating compensation levels a key factor in the selection of your cloud provider.
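To compare compensation schemes, it can help to model them. The sketch below covers the two full-time rebate styles described above: a day's credit for each hour of downtime beyond the SLA limit, and a credit proportional to time lost. The rounding rule, the cap at one month's fee, and all function names are assumptions made for illustration; real contracts define their own terms.

```python
import math


def day_per_hour_credit(downtime_min: float, allowed_min: float,
                        monthly_fee: float, days_in_month: int = 30) -> float:
    """Credit one day's fee for each hour of downtime above the SLA allowance."""
    excess_min = max(0.0, downtime_min - allowed_min)
    # Assumption: a partial hour of excess downtime counts as a full hour.
    excess_hours = math.ceil(excess_min / 60)
    daily_rate = monthly_fee / days_in_month
    # Assumption: the total credit cannot exceed the monthly fee.
    return min(monthly_fee, excess_hours * daily_rate)


def proportional_credit(downtime_min: float, monthly_fee: float,
                        days_in_month: int = 30) -> float:
    """Reduce the monthly bill in proportion to the share of the month lost."""
    minutes_in_month = days_in_month * 24 * 60
    fraction_lost = min(1.0, max(0.0, downtime_min) / minutes_in_month)
    return monthly_fee * fraction_lost


if __name__ == "__main__":
    fee = 3000.0
    print(day_per_hour_credit(downtime_min=165, allowed_min=45, monthly_fee=fee))
    print(proportional_credit(downtime_min=165, monthly_fee=fee))
```

For the same outage, the two schemes can produce very different credits, which is why the calculation method deserves as much scrutiny as the headline uptime figure.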
Some argue that the only performance measurement that matters in the cloud is on the application layer, but the performance of the provider’s internal network infrastructure is crucial for the smooth and reliable delivery of services and hosted applications. So in that vein, both sides need to establish ahead of time what is and is not acceptable network performance. Unlike raw bandwidth or availability, however, performance can be measured in a number of ways, and metrics will vary depending on application and user requirements. Some of the more common items include packet loss, latency and jitter, all of which can be defined within certain parameters to ensure adequate functionality.
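As a rough illustration of how packet loss, latency and jitter might be derived from a series of probe measurements and checked against agreed limits, consider the sketch below. It assumes round-trip times in milliseconds with `None` marking a lost probe, and it uses one simple definition of jitter (the mean absolute difference between consecutive latencies); other definitions exist, and the thresholds and names here are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results; a value of None marks a lost probe."""
    if not rtts_ms:
        raise ValueError("at least one probe result is required")
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    latency_ms = mean(received) if received else None
    # Simple jitter estimate: mean absolute change between consecutive replies.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else None
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}


def meets_sla(summary: dict, max_loss_pct: float,
              max_latency_ms: float, max_jitter_ms: float) -> bool:
    """Return True only if every measured metric is within its agreed limit."""
    if summary["latency_ms"] is None:
        return False  # no replies at all cannot satisfy a latency target
    jitter = summary["jitter_ms"] or 0.0  # a single reply yields no jitter estimate
    return (summary["loss_pct"] <= max_loss_pct
            and summary["latency_ms"] <= max_latency_ms
            and jitter <= max_jitter_ms)


if __name__ == "__main__":
    samples = [20.0, 22.0, None, 21.0, 25.0]
    summary = summarize_probes(samples)
    print(summary)
    print(meets_sla(summary, max_loss_pct=1.0, max_latency_ms=50.0, max_jitter_ms=5.0))
```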
External Network Performance
The cloud provider cannot be expected to guarantee performance on the carrier network, but you and your provider should still agree upon levels of service that you both promise to maintain. Again, metrics can run the gamut from bandwidth and throughput to packet loss, jitter, and latency. Specific numbers for each metric will vary according to application and service needs. Also explore ways to establish visibility and analysis procedures to determine whose infrastructure is causing a bottleneck.
Few providers include scheduled or even unscheduled maintenance in their downtime calculations, so it is only fair that repairs that affect service should be made during off-hour, low-traffic periods as much as possible. The provider should also deliver formal notification of work to be done, along with the steps being taken to reroute traffic or otherwise limit the impact on service levels. Ultimately, though, the user should be able to provide a certain amount of flexibility when it comes to maintaining cloud infrastructure, as this allows the provider to enhance service in the future.
Discovering problems is one thing. Revealing them to clients is another. All SLAs should contain language describing the type and frequency of reports expected, as well as emergency notification procedures should unexpected downtime occur. Providers should also offer full disclosure following an outage, including the cause, resolution, and steps taken to prevent recurrence. Clients, meanwhile, should reserve the right to employ their own SLA management and compliance systems, if only to gain an independent means of verifying the provider's numbers.
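Independent verification can be as simple as comparing the provider's reported uptime with availability measured by your own probes. The following is a minimal sketch under the assumption that probes run at a fixed interval and each records success or failure; the tolerance value and function names are hypothetical.

```python
from typing import Sequence, Tuple


def measured_uptime_percent(probe_ok: Sequence[bool]) -> float:
    """Percentage of fixed-interval probes that succeeded."""
    if not probe_ok:
        raise ValueError("at least one probe result is required")
    return 100 * sum(probe_ok) / len(probe_ok)


def check_reported_uptime(reported_pct: float, probe_ok: Sequence[bool],
                          tolerance_pct: float = 0.05) -> Tuple[float, bool]:
    """Return the measured uptime and whether the provider's figure exceeds it
    by more than the tolerance (an assumed allowance for measurement error)."""
    measured = measured_uptime_percent(probe_ok)
    return measured, (reported_pct - measured) > tolerance_pct


if __name__ == "__main__":
    # 1,000 probes over the reporting period, three of which failed.
    probes = [True] * 997 + [False] * 3
    measured, disputed = check_reported_uptime(99.95, probes)
    print(f"measured {measured:.2f}%, worth raising with provider: {disputed}")
```

A failed probe can also reflect a problem on the client side or on the carrier network, so a discrepancy is a prompt for investigation rather than proof of an SLA breach.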
The broader service contract most likely includes a dispute resolution clause, but the subject is touchy enough that it bears inclusion in the SLA as well. Naturally, clarity in defining issues like uptime and performance, as well as reporting procedures, should help reduce both the severity and duration of disputes. That said, having a formal procedure in place, perhaps even including a third-party moderator in case things get ugly, can prove particularly useful for those who sign long-term contracts only to find that service is not what was promised. After all, the last thing anyone wants is for the lawyers to get involved.
Just about every SLA will include exceptions to protect providers from things beyond their control. These might include hacking (although network security systems and procedures should be disclosed up front), natural disasters, government action and the like. Providers also shouldn’t be liable for client-side connectivity issues, such as hardware failures or transport-layer problems. Again, though, adequate monitoring and broad visibility remain essential to determine where problems lie.
Service providers are increasingly outsourcing infrastructure, including networking, to third-party providers. This can affect performance and availability, and it also poses a potential security threat if the provider’s provider does not maintain the standards you expect. It also opens up the possibility of long chains of providers backing each other up, resulting in infrastructure distribution beyond your comfort level. The SLA, then, should have strict guidelines as to the use of third-party providers, with clear notification procedures detailing who, where, when and how.
Particularly in multitenant environments, resource utilization must be strictly monitored to protect against overload and ensure that only the resources needed to complete a task are provisioned and paid for. This matters even more in networking, given how rapidly data loads shift on shared networks. Most providers will have established thresholds for both physical and virtual resources, but try to write some guidelines into the SLA so both sides have a clear idea of what the data load-to-resource ratio will be.