As big data becomes increasingly prevalent in data center and cloud environments, the question of how to manage networks for the transfer of millions of records at a time is more important than ever.
It's not just a matter of size -- though size certainly does matter when approaching big-data networking solutions -- but also a matter of workflow. Big data environments simply don't behave the way typical data infrastructures used to. They really can't, given the complexity and speed of the work big-data applications have to perform.
"Traditional" data-analysis architectures assume that data won't be coming from a lot of sources and there will be plenty of time to neatly store that data in the correct table on the correct repository. When looking at networks and applications such as the ones used by Twitter, Facebook and Google, it immediately becomes clear that such an approach would make a "normal" database architecture pop like a single light bulb plugged into a nuclear reactor.
To overcome the hurdles of dealing with massive amounts of data in such a short period of time, big data users have devised a two-pronged approach. First, a large-scale real-time database is implemented, such as BigTable, OpenDremel, MongoDB, or Cassandra. These databases all share the feature of being non-relational: They don't depend on standardized query languages (hence their sobriquet "NoSQL") and they do not meet all of the Atomicity, Consistency, Isolation, and Durability (ACID) requirements that apply to transactions in a relational database.
The other half of the solution is using analytical frameworks, such as Hadoop, to do the work of sifting through the huge mass of data, categorizing it properly on the fly.
This means that the emphasis on the network and surrounding infrastructure will shift from optimizing storage to optimizing search. It has to, because storage is greatly simplified in a typical big-data environment, and all of the power is instead needed to sort data to come up with useful datasets that can then be appropriately analyzed for deep-dive results.
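To make the "sift and categorize" step concrete, here is a minimal sketch in the map-and-reduce style that Hadoop popularized. It is plain Python, not Hadoop code; the record layout and the names map_phase, reduce_phase and count_by_source are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (category, 1) pair per record. The 'source' field is hypothetical."""
    for record in records:
        yield record["source"], 1


def reduce_phase(pairs):
    """Sum the emitted values for each category key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_by_source(records):
    """Run the two phases back to back on an in-memory list of records."""
    return reduce_phase(map_phase(records))


if __name__ == "__main__":
    sample = [{"source": "web"}, {"source": "mobile"}, {"source": "web"}]
    print(count_by_source(sample))
```

Storage here is trivial (a list of loosely structured records); all of the work goes into sorting them into a useful dataset, which is the shift in emphasis described above.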
Unfortunately, this basic approach is about the only thing a lot of big-data networks have in common. Beneath this 20,000-foot view, the approaches used to come up with these kinds of data solutions are varied. And there are problems inherent in each approach that must be managed. Hadoop, for instance, uses a NameNode architecture that represents a single point of failure, something to which big data managers are very sensitive. If the NameNode device is lost on the network, the entire Hadoop system is busted, which puts a lot of pressure on network admins to ensure that particular server stays up.
There are non-network solutions to things like this, of course. Take, for example, Brisk, a product from DataStax that seeks to bridge the realtime capabilities of Apache Cassandra to the analytical features of Hadoop. Brisk merges Hadoop's filesystem with that of Cassandra, which means there's no more single point of failure problem.
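The difference between the two designs can be shown with a toy model. This is only a sketch of the general idea (one metadata holder versus copies on every peer); it is not how Hadoop or Brisk is actually implemented, and every class and method name below is hypothetical.

```python
class SingleMetadataCluster:
    """Toy model: one node holds the whole file-to-block map."""

    def __init__(self, file_map):
        self.file_map = dict(file_map)
        self.metadata_node_up = True

    def locate(self, path):
        # Without the one metadata node, nothing can be found.
        if not self.metadata_node_up:
            raise RuntimeError("metadata node down: no file can be located")
        return self.file_map[path]


class ReplicatedMetadataCluster:
    """Toy model: every peer keeps its own copy of the map."""

    def __init__(self, file_map, peers):
        self.copies = {peer: dict(file_map) for peer in peers}
        self.up = set(peers)

    def fail(self, peer):
        self.up.discard(peer)

    def locate(self, path):
        # Any surviving peer can answer the lookup.
        for peer in sorted(self.up):
            return self.copies[peer][path]
        raise RuntimeError("all peers down")


if __name__ == "__main__":
    files = {"/logs/day1": ["block-a", "block-b"]}
    replicated = ReplicatedMetadataCluster(files, ["n1", "n2", "n3"])
    replicated.fail("n1")
    print(replicated.locate("/logs/day1"))
```

In the first model a single outage stops every lookup; in the second, lookups keep working until the last peer is gone.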
Big Data and Network Architecture
These two options represent just the tip of the iceberg when coming up with potential big data architectures, and the network architectures for these solutions alone are already pretty different. How, then, is a network manager supposed to deal with the myriad possible big-data solutions out there, with more coming every day?
This is where solutions like OpenFlow can help. OpenFlow is a networking infrastructure protocol promoted and standardized by the Open Networking Foundation (ONF). The ONF's reason for being is to drive adoption of OpenFlow, a protocol built around the concept of software-defined networks (SDNs).
An SDN is designed to solve just the kinds of problems described here: Instead of setting up a one-size-fits-all networking solution that forces the applications within to work within that solution, the applications themselves define the topology of the network. OpenFlow enables network administrators to more easily configure their networks based on SDN principles by simplifying hardware and network management, thus decreasing network overhead in big-data networks.
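The core idea behind this can be sketched as a match-action table: controller software installs rules, and the switch simply looks packets up against them. The sketch below is a simplified model, not the OpenFlow wire protocol or any real controller's API; the field names, action strings, port number and the table-miss behavior are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FlowEntry:
    match: dict        # header fields that must all equal the packet's values
    action: str        # illustrative action label, e.g. "output:fast_path"
    priority: int = 0  # higher priority entries are checked first


@dataclass
class Switch:
    table: list = field(default_factory=list)

    def install(self, entry):
        """Called by the (hypothetical) controller to program the switch."""
        self.table.append(entry)
        self.table.sort(key=lambda e: e.priority, reverse=True)

    def forward(self, packet):
        """Return the action of the first entry whose match fields all agree."""
        for entry in self.table:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                return entry.action
        # Assumed table-miss behavior for this sketch: ask the controller.
        return "send_to_controller"


if __name__ == "__main__":
    sw = Switch()
    # Arbitrary port number standing in for a big-data application's traffic.
    sw.install(FlowEntry({"dst_port": 9000}, "output:fast_path", priority=10))
    sw.install(FlowEntry({}, "output:default", priority=0))  # empty match = wildcard
    print(sw.forward({"dst_port": 9000}))
    print(sw.forward({"dst_port": 80}))
```

Because the rules live in software, an application's needs can change the forwarding behavior without anyone touching the hardware.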
OpenFlow is a low-level spec, but already vendors are starting to see possibilities for layering their own software on top of OpenFlow. Imagine a network management tool, for example, that would sense a sudden huge shift in network traffic and packet workload and be able to automatically configure switching settings to compensate for the push -- and then go back to "normal" when the load was complete. In essence, if broadly adopted, OpenFlow will enable "cloud networking" -- on-demand, utility-based networking configurations.
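A rough sketch of what such a tool's decision logic might look like follows. It is purely hypothetical: the FabricController name, the two profile labels and the thresholds are invented for illustration, and a real tool would push actual switch configuration rather than return a string.

```python
from dataclasses import dataclass


@dataclass
class FabricController:
    """Hypothetical controller that toggles between two switching profiles."""
    burst_threshold_gbps: float = 8.0  # assumed level that triggers the burst profile
    calm_threshold_gbps: float = 4.0   # lower level for switching back (hysteresis)
    profile: str = "normal"

    def observe(self, load_gbps: float) -> str:
        """Update the active profile from one traffic sample and return it."""
        if self.profile == "normal" and load_gbps >= self.burst_threshold_gbps:
            self.profile = "burst"
        elif self.profile == "burst" and load_gbps <= self.calm_threshold_gbps:
            self.profile = "normal"
        return self.profile


def replay(samples):
    """Feed a sequence of load samples to a fresh controller."""
    controller = FabricController()
    return [controller.observe(s) for s in samples]


if __name__ == "__main__":
    print(replay([1.0, 9.5, 6.0, 3.0]))
    # prints ['normal', 'burst', 'burst', 'normal']
```

The gap between the two thresholds keeps the controller from flapping between profiles when traffic hovers around a single cutoff.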
This approach is important. The kinds of bandwidth being discussed here can't be handled by switches and routers in a typical tree-like topology. And the networks themselves are increasingly becoming part of the big data solutions, as network-as-a-platform solutions like the ones pushed by Cisco's IOS product line become more common. In the face of such complexity and size, a flexible fabric-like approach is rapidly becoming the meme for network architects, not the tiered tree.
OpenFlow solutions will let network administrators automatically control the size and shape of the network fabric as needed, enabling traffic shaping on levels undreamed of just a few years ago.
This is the type of approach that network managers will have to adjust to, and very soon. Widely available cloud-based computing (public, private, or hybrid) and big-data applications will pervade just about every corporate environment in the near future.