Editor's Note: Occasionally, Enterprise Networking Planet runs guest posts from authors in the field. Today Michael Leonard, senior product marketing manager at Juniper Networks, discusses the challenges that Big Data creates for enterprise networks.
By Michael Leonard
Imagine a world where broadcast networks can accurately predict how well a TV series will perform even before the first episode finishes airing. One where utilities can help residents reduce their energy bills by analyzing data from sensors in household appliances. Or one where shipping companies can optimize routes and fuel consumption through live tracking of packages.
Sound too futuristic? Actually, we’re pretty much there.
The rise in mobile applications, all-IP wireless networks, online commerce, point-of-sale systems, social media and the use of sensors for everything from traffic monitoring to inventory management generates data that, if properly managed, can provide critical intelligence to drive business decisions. Much of this data is gathered on the fly and, if quickly acted upon, can provide unique competitive advantages, reveal opportunities and solve problems for organizations.
But the volume of data is huge and the velocity increasing, pushing the envelope of networking requirements. Networks handle data all the time—that's what they do. But put the word “big” in front of "data," and network admins and CIOs must confront a whole new mess of challenges.
Changing data flows
Data is undoubtedly changing. Data volumes have moved beyond terabytes to petabytes; data relationships have gone from simple and known to complex and unknown. Data models have moved from fixed schemas to schema-less. Data sources have gone from simple data entry to live streams from a variety of sources, including handheld devices and machine sensors.
Big data in particular comes in a range of formats. For instance, a call recording looks different to a network than a credit card transaction does. Unlike structured data in traditional applications, big data includes semi-structured or unstructured data, such as text, audio, video, click streams, log files, and the output of sensors that measure and transmit geographic and environmental information.
And big data environments change the way data flows in the network. Big data generates far more east-west or server-to-server traffic than north-south or server-to-client traffic, and for every client interaction, there may be hundreds or thousands of server and data node interactions. Application architecture has evolved correspondingly from a centralized to a distributed model. This is counter to the traditional client/server network architectures built over the last 20 years.
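The fan-out effect described above can be illustrated with a rough traffic model. The node counts, replica counts and message sizes below are invented for illustration, not measurements from any real cluster:

```python
# Rough model of east-west traffic amplification in a distributed cluster.
# All numbers here are hypothetical illustration values.

def east_west_bytes(client_request_bytes, data_nodes, replicas, overhead=1.2):
    """Estimate the server-to-server traffic triggered by one client request.

    Each client request fans out to many data nodes, and each node's work
    involves replication and coordination traffic, multiplying the bytes
    moved inside the data center relative to the client-facing exchange.
    """
    return client_request_bytes * data_nodes * replicas * overhead

north_south = 10_000  # a single 10 KB client request/response
east_west = east_west_bytes(north_south, data_nodes=200, replicas=3)

print(f"north-south: {north_south:,} bytes")
print(f"east-west:   {east_west:,.0f} bytes")
print(f"amplification: {east_west / north_south:.0f}x")
```

Even with modest assumed parameters, internal traffic dwarfs the client-facing exchange by orders of magnitude, which is why network capacity planning for big data starts with the east-west paths.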
Big data's impact on the network
Pulling data from a variety of sources, big data systems run on server clusters distributed over multiple network nodes. These clusters run tasks in a parallel scale-out fashion. Traffic patterns can range from 1-to-1 (a telephone call) and 1-to-many (a TV broadcast) to many-to-1 (a concert audience) and many-to-many (CB radio), in a combination of unicast and multicast flows between multiple nodes that run in parallel. Network admins need to cope with this combination of traffic patterns, some of which create a single stream, some of which create many.
Additionally, when data is served to the compute nodes, it generates high volumes of network traffic. Data shuffle-and-sort operations between the distributed nodes require fast and predictable transfers. And while analytics systems use direct-attached storage for processing, intermediate storage is still needed to stage the data.
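The shuffle-and-sort step mentioned above can be sketched in miniature. This is a toy, single-process illustration of the MapReduce-style pattern (the "node" data slices are invented); in a real cluster, the shuffle step is precisely where intermediate data crosses the network between nodes, generating the east-west traffic the article describes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    """Map: each 'node' emits (key, 1) pairs from its local data slice."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped):
    """Shuffle-and-sort: group all pairs by key.

    In a distributed system this grouping forces intermediate data to move
    between nodes, which is why it needs fast, predictable transfers.
    """
    groups = defaultdict(list)
    for key, value in sorted(mapped):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Two hypothetical nodes, each holding part of the data set.
node_a = ["big data big network"]
node_b = ["big traffic"]

mapped = list(chain(map_phase(node_a), map_phase(node_b)))
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 1, 'network': 1, 'traffic': 1}
```

The map and reduce phases run on local data, but the shuffle cannot: every node holding a given key's intermediate values must send them to the node responsible for reducing that key.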
Data needs to be moved around in the network and operated upon efficiently during the analysis process. As new data sets grow and sources are added, workloads grow. So does the need to quickly add capacity. As a result, it is critical to prioritize locality, high performance, horizontal scalability and direct server node-to-server node connectivity in the network architecture.
The need for a new network model
One design model involves building on low-end commodity hardware and letting the analytics software react to issues with the network, such as restarting jobs that time out due to congestion. This model is used for non-real-time processing, where completion time isn’t critical and data comes primarily from one source.
Another model involves building on a hardware-based system that provides deterministic performance to ensure continuous processing. This model is used for doing near-real-time analysis on data from multiple sources.
Real-time big data systems, meanwhile, benefit from a topology where network nodes connect with each other in an any-to-any model with a single hop between them, providing a dedicated system for processing multiple large data streams with low loss and deterministic performance. A switch fabric can provide this model.
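The any-to-any, single-hop property can be expressed as a simple topology check. The sketch below compares shortest-path hop counts in a hypothetical three-tier tree (edge, aggregation, core) against a flat any-to-any fabric, modeled here as a full mesh for illustration; the topologies are invented examples, not any specific product:

```python
from collections import deque

def hops(adj, src, dst):
    """Shortest hop count between two nodes via breadth-first search."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return None  # unreachable

nodes = ["n1", "n2", "n3", "n4"]

# Hypothetical three-tier tree: nodes hang off edge switches (e1, e2),
# which uplink through an aggregation/core switch (a1).
tree = {
    "n1": ["e1"], "n2": ["e1"], "n3": ["e2"], "n4": ["e2"],
    "e1": ["n1", "n2", "a1"], "e2": ["n3", "n4", "a1"],
    "a1": ["e1", "e2"],
}

# Flat any-to-any fabric: every node one hop from every other.
fabric = {a: [b for b in nodes if b != a] for a in nodes}

print(hops(tree, "n1", "n4"))    # 4 hops up and down the tree
print(hops(fabric, "n1", "n4"))  # 1 hop across the fabric
```

Fewer hops means lower and, just as important, more uniform latency between any two nodes, which is what gives the fabric its deterministic performance for parallel workloads.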
The switch fabric offers the advantages of overall system bandwidth and performance, especially reduced latency. Location independence allows clusters and data to achieve optimum performance from anywhere in the architecture. This architecture also enables seamless merging of new data sources into the cluster without rewiring and significantly eases the expansion of the system. It provides convergence that allows server clusters and the storage area network to communicate across one network. Resources are managed as one entity, and policies are easily applied across the entire switching infrastructure.
Implementing big data solutions
Big data represents a tremendous opportunity for businesses to capture and analyze data like never before. As IT organizations begin to test and evolve their solutions, network administrators must consider the impact of these technologies on their server, storage, networking and operations infrastructure. How can enterprises best develop new infrastructure to leverage and analyze the increasing flow of big data? Consider the following questions in developing the networking topology:
• Is the analysis of data streams being done in real time?
• Are there multiple data sources, and are they static or streaming?
• If the pilot is successful, how big will the cluster need to be?
• How easily and quickly can more capacity be added?
• Will big data applications require integration with other applications?
Answering these questions will help frame the discussion. They'll indicate how your big data workloads will influence data center architecture and interconnect requirements.
Big data empowers organizations to develop new strategies built on real-time analytics and fresh business insights. Amid the rapid changes that businesses are seeing, however, it is necessary to consider the technologies that provide the scale, performance and headroom for tomorrow's requirements while delivering investment protection, business agility and fast time to market.
Michael Leonard is a senior product marketing manager at Juniper Networks, where he focuses on the data center. He brings more than 15 years of experience working for networking equipment vendors in both the enterprise and service provider markets. Before that, he spent 15 years in IT, managing networks, applications and data systems and dealing with many of the same issues customers face in protecting their data and ensuring application performance.