Software-defined networks (SDNs) are bringing new issues to the forefront, from how tasks are accomplished to who does them. And now, another topic is making its way onto the radar: fault testing. Fault testing hasn't yet received as much airtime in the SDN conversation as, say, the scalability of these new networks, but it's an issue that is rapidly gaining attention.
QA issues in SDNs
A host of factors make QA in the SDN world worth talking about. One is the inherent fluidity of software-driven systems. Yes, software-defined networking gives administrators a tremendous amount of flexibility, but that flexibility also means users have much greater potential to deploy platforms in ways the designers and developers may not have envisioned. "It opens up a lot more variables in how the products are used," said Mat Mathews, co-founder and vice president of product management at SDN developer Plexxi, Inc. The old, straightforward scheme of building networks to satisfy a known number of use cases, and then testing for those specific deployments, is rapidly disappearing in the rearview mirror.
Another problem in SDN environments is that, as Simon Crosby, CTO at Bromium, described it, "one big brain" will now program the network infrastructure, giving it the states it needs for all sorts of different connections. "The challenge there is actually that you have abstracted this notion of the control plane," he said. The monumental task of reasoning out the correctness of the state, across all the switches in the network, has the potential to overwhelm traditional fault-testing protocols. In addition, the physical separation that may occur when software controllers aren't co-located with physical switches can introduce entirely new bugs that simply never presented problems in legacy network architectures.
Can Chaos Monkey help build more resilient software-defined networks?
But a new resiliency tool has come onto the scene: Chaos Monkey, developed by the team at Netflix. Its code, fittingly housed under the wider Simian Army banner, is now available under an Apache License on GitHub, and it may change how people think about fault testing in SDN environments. For those unfamiliar with the tool, the Netflix blog says, "Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group." In simple terms, the tool deliberately kills running instances at random so that weaknesses in how the system handles failure come to light. Those failures can then be examined and rectified, resulting in a more resilient system.
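To make the mechanism concrete, here is a minimal sketch of the idea in Python. It is not Netflix's code and makes no AWS calls: the groups are plain in-memory lists, and the function name `terminate_one_per_group` and its `probability` parameter are hypothetical, chosen only for illustration.

```python
import random

def terminate_one_per_group(groups, probability=1.0, rng=None):
    """Pick at most one random victim in each group.

    groups: dict mapping a group name to a list of instance IDs.
    probability: chance that a given group is hit on this run (an assumed knob).
    Returns (survivors, terminated) and leaves the input untouched.
    """
    rng = rng or random.Random()
    survivors, terminated = {}, {}
    for name, instances in groups.items():
        remaining = list(instances)  # copy so the caller's data is not modified
        if remaining and rng.random() < probability:
            victim = rng.choice(remaining)
            remaining.remove(victim)
            terminated[name] = victim
        survivors[name] = remaining
    return survivors, terminated

# Hypothetical groups standing in for Auto Scaling Groups.
groups = {
    "web": ["web-1", "web-2", "web-3"],
    "api": ["api-1", "api-2"],
}
survivors, terminated = terminate_one_per_group(groups, rng=random.Random(7))
print("terminated:", terminated)
print("survivors:", survivors)
```

In a real deployment, the interesting part comes after the termination: checking whether the service kept running with one instance gone from each group.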
For companies such as Netflix, with its enormous—and enormously scalable—infrastructure, it would be nearly impossible to do fault testing solely using legacy methods. Mathews said these large web properties have, out of necessity, come up with creative new ways to accomplish tasks that have proven unwieldy in a dynamic environment like cloud infrastructure. In addition, he said that by forcing hard failures, companies may uncover information that's more useful in building stronger networks than the data standard QA approaches typically provide. "What's easy to do is to recover from a failure you can see very clearly," he explained. "What's hard to do is recover from failures that you can't find or you can't figure out."
But while Crosby said Chaos Monkey has undoubtedly helped Netflix do amazing things, he doesn't believe tools like it should be viewed as the be-all and end-all of fault testing. He described them as "point-in-time" solutions and said that simulating random failures doesn't go far enough in making systems more resilient. "The problem with the Chaos Monkey approach is, what if Chaos Monkey didn't generate the particular problem that's actually going to happen?" he asked. "You can't rely on it as a way of checking that everything is going to work." The randomness of a tool like Chaos Monkey can be a benefit, but, as Crosby said, that randomness can miss potentially network-disabling bugs.
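Crosby's objection can be put in rough numbers. The sketch below rests on a simplifying assumption that is not from the interviews: each random trial picks one of `fault_modes` possible failures uniformly and independently, and only one of them triggers the bug. The function name `miss_probability` is hypothetical.

```python
def miss_probability(fault_modes, trials):
    """Chance that `trials` uniform, independent random faults never hit one specific mode."""
    return (1 - 1 / fault_modes) ** trials

# With 1,000 equally likely failure modes (an assumed figure), show how the
# chance of never triggering one particular bug shrinks as trials accumulate.
for trials in (10, 100, 1000):
    print(trials, round(miss_probability(1000, trials), 3))
```

Under these assumptions, even 1,000 random trials leave better than a one-in-three chance that the one failure that matters was never exercised. Real failure modes are neither uniform nor independent, so the figure is illustrative only.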
Programmability and automation in SDN fault testing
Rather than seeing SDN as an environment that makes glitches harder to find and fix, Rob Sherwood, CTO of controller technologies at Big Switch Networks, thinks the opposite is true. "There's actually more potential to help people debug these networks, simply because a lot of things are automated, there's a lot more programmability and there are more customizable ways of hooking into the system." Looking at SDN from that perspective, fault testing gets easier—not harder—when compared with legacy networks. Bugs may be hiding in all sorts of new and exotic places, but systematically rooting them out and dealing with them isn't impossible.
Automation is something Mathews said his team leverages heavily. "We have a framework that allows us to, very quickly, write scripts and things that actually exercise the products completely, 100 percent, without manual intervention." They've written simulator products for their hardware, reducing the need to invest in a lot of test gear. Even the potential for random faults has been built into an overall fault testing structure, Mathews said. "Rather than having a set number of use cases that we test and certify and say, 'This is the only way to use it,' we run the automation on the basic stuff," he explained. Then the team applies random testing to various use cases "just to exercise different parts of the system and see where we see things we might not expect."
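The two-stage pattern Mathews described, fixed automation first and random exercise second, might look something like the sketch below. Everything in it is a hypothetical stand-in rather than Plexxi's framework: `forward` is a toy lookup playing the role of the product under test, and `KNOWN_PORTS` is an invented forwarding table.

```python
import random

# Toy forwarding table; addresses and port names are made up for illustration.
KNOWN_PORTS = {"10.0.0.1": "port1", "10.0.0.2": "port2"}

def forward(dst):
    """Toy stand-in for the system under test: return the output port, or "drop"."""
    return KNOWN_PORTS.get(dst, "drop")

def run_suite(seed, random_cases=50):
    """Run the fixed cases, then seeded random ones; return the list of failures."""
    failures = []
    # Stage 1: the "basic stuff", a fixed list of cases with known answers.
    for dst, expected in [("10.0.0.1", "port1"), ("10.0.0.2", "port2")]:
        if forward(dst) != expected:
            failures.append(("fixed", dst))
    # Stage 2: random inputs checked against general rules, not fixed answers.
    rng = random.Random(seed)  # seeded so a failing run can be replayed exactly
    for _ in range(random_cases):
        dst = f"10.0.0.{rng.randint(0, 255)}"
        port = forward(dst)
        if dst in KNOWN_PORTS and port == "drop":
            failures.append(("random", dst))  # a known host must never be dropped
        if dst not in KNOWN_PORTS and port != "drop":
            failures.append(("random", dst))  # an unknown host must never be forwarded
    return failures

print(run_suite(seed=42))
```

Recording the seed is the design choice that matters here: it keeps the randomness while letting any surprising failure be reproduced on demand.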
For companies to truly take advantage of cloud technologies, developers need to know they can reliably manage SDN (and cloud-based infrastructures in general) and conduct robust—and ultimately useful—fault testing. "We have to be able to have a more methodological approach, whereby somebody developing an application can have a reasonable idea that the thing works without firing up some piece of random infrastructure to try and break it," Crosby said. Better design is what he'd prefer to see, with a structured fault testing methodology at its core. "Otherwise, the only way you can actually build and operate a large-scale cloud application is to throw huge bunches of humans at it who actually understand how the cloud is built, and who can go and figure out these problems in real time." And that, he said, just isn't going to fly.
Perhaps a structured methodology, coupled with the random nature of a Chaos Monkey-like tool, is a better answer than either approach used on its own. Sherwood said the two strategies play into the inverse pyramid of bugs detected by conventional software testing, where the person who wrote the code writes a test and catches the first batch of bugs. "Then a second guy, who didn't write the code, writes a test and catches more bugs," he explained. A tool like Chaos Monkey is employed to close out the process. "It catches bugs that nobody else wrote, but at the end of the day it's the long tail of bugs that get caught." For companies with little tolerance for network failures, a combination of these tools may give the best results.