This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter’s approach.
Over the past few years organizations have awakened to the fact that there is knowledge hidden in Big Data, and vendors are feverishly working to develop technologies such as Hadoop Map/Reduce, Dryad, Spark and HBase to efficiently turn this data into information capital. That push will benefit from the emergence of another technology – Software Defined Networking (SDN).
Much of what constitutes Big Data is actually unstructured data. While structured data fits neatly into traditional database schemas, unstructured data is much harder to wrangle. Take, for example, video storage. The video file type, file size, and source IP address are all structured data, but the video content itself, which doesn’t fit in fixed-length fields, is unstructured. Much of the value obtained from Big Data analytics now comes from the ability to search and query unstructured data -- for example, the ability to pick out an individual from a video clip with thousands of faces using facial recognition algorithms.
The technologies aimed at the problem achieve the required speed and efficiency by parallelizing the analytic computations across clusters of hundreds of thousands of servers connected via high-speed Ethernet networks. The process of mining intelligence from Big Data therefore involves three fundamental steps: 1) Split the data across multiple server nodes; 2) Analyze each data block in parallel; 3) Merge the results.
These operations are repeated through successive stages until the entire dataset has been analyzed.
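The three steps can be sketched in a few lines of Python. This is only an illustration, not how Hadoop or any other framework is implemented: the function names (split, analyze, merge, run_job) are hypothetical, worker threads stand in for server nodes, and a simple word count stands in for the analysis.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, num_nodes):
    """Step 1: divide the records into blocks, one per (simulated) node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def analyze(block):
    """Step 2: analyze one block; here, count the words it contains."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def merge(partials):
    """Step 3: combine the per-block results into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(records, num_nodes=4):
    """Run one Split-Analyze-Merge stage over the records."""
    blocks = split(records, num_nodes)
    # Threads are an assumption of this sketch; a real cluster would ship
    # each block over the network to a separate server.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(analyze, blocks))
    return merge(partials)
```

In a real cluster, the hand-off of blocks in the Split step and of partial results in the Merge step both cross the network, which is where the transfer cost discussed in this article arises.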
Owing to the Split-Merge nature of these parallel computations, Big Data analytics can place a significant burden on the underlying network. Even with the fastest servers in the world, data processing – the biggest bottleneck for Big Data – can only be as fast as the network’s ability to transfer data between servers in both the Split and Merge phases. For example, a study of Facebook traces shows that this data transfer between successive stages accounted for 33% of the total running time, and for many jobs the communication phase took up more than 50% of the running time.
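A rough way to see what those percentages imply is to estimate how much a job speeds up when only the communication phase gets faster. The sketch below assumes a simplified model that is not taken from the study: communication and computation do not overlap, and a network improvement shortens only the communication share. The function name job_speedup is hypothetical.

```python
def job_speedup(comm_fraction, network_speedup):
    """Estimate overall job speedup when only data transfer gets faster.

    comm_fraction:   share of running time spent on data transfer (0 to 1)
    network_speedup: factor by which the transfer time shrinks (> 0)

    Assumes transfer and computation do not overlap (a simplification).
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if network_speedup <= 0:
        raise ValueError("network_speedup must be positive")
    # New running time as a fraction of the old one.
    remaining = (1.0 - comm_fraction) + comm_fraction / network_speedup
    return 1.0 / remaining
```

Under these assumptions, halving the transfer time of a job that spends 50% of its running time communicating gives a speedup of about 1.33x, and a job at the 33% figure gains about 1.2x. Even an infinitely fast network would cap the 33% case at roughly 1.5x, since the computation itself is untouched.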
Addressing this network bottleneck can significantly speed up Big Data analytics, which has two implications: 1) better cluster utilization, which reduces TCO for the cloud provider that manages the infrastructure; and 2) faster job completion times, which enable real-time analytics for the customer that rents the infrastructure.
What we need is an intelligent network that, through each stage of the computation, adaptively scales to suit the bandwidth requirements of the data transfer in the Split and Merge phases, thereby improving both speed-up and utilization.
The role of SDN
SDN has huge potential to build the intelligent adaptive network for Big Data analytics. Due to the separation of the control and data plane, SDN provides a well-defined programmatic interface for software intelligence to program networks that are highly customizable, scalable and agile, to meet the requirements of Big Data on-demand.
SDN can configure the network on demand to the right size and shape for compute VMs to communicate optimally with one another. This directly addresses the biggest challenge that Big Data, a massively parallel application, faces: slow processing speeds. Processing is slow because most compute VMs in a Big Data application spend a significant amount of time waiting for massive amounts of data to arrive during scatter-gather operations before they can begin processing. With SDN, the network can create secure pathways on demand and scale capacity up during the scatter-gather operations, thereby significantly reducing the waiting time and hence the overall processing time.
This software intelligence, which is fundamentally an understanding of what the application needs from the network, can be derived with much precision and efficiency for Big Data applications. The reason is two-fold: 1) the existence of well-defined computation and communication patterns, such as Hadoop’s Split-Merge or Map-Reduce paradigm; and 2) the existence of a centralized management structure that makes it possible to leverage application-level information, e.g. Hadoop Scheduler or HBase Master.
With the aid of the SDN Controller, which has a global view of the underlying network (its state, its utilization, etc.), the software intelligence can accurately translate the application’s needs into network behavior by programming the network on demand.
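As a toy illustration of that idea, the sketch below places data transfers on network paths using a global view of spare capacity. Everything in it is an assumption made for illustration: the function place_flows, the flow and path names, and the greedy policy are hypothetical, and no real controller or OpenFlow API is involved. The flow demands stand in for what a job scheduler might report about an upcoming Merge phase.

```python
def place_flows(path_capacity_gbps, flows_gbps):
    """Greedily assign each flow to the path with the most spare capacity.

    path_capacity_gbps: spare capacity per path, e.g. {"p1": 10, "p2": 10}
    flows_gbps:         bandwidth demand per flow, as reported by the
                        application (hypothetical input)

    Returns (placement, spare). Spare capacity can go negative if the
    total demand exceeds what the paths offer.
    """
    spare = dict(path_capacity_gbps)  # global view of the network state
    placement = {}
    # Place the largest flows first so they land on the emptiest paths.
    ordered = sorted(flows_gbps.items(), key=lambda kv: kv[1], reverse=True)
    for flow, demand in ordered:
        best_path = max(spare, key=spare.get)
        placement[flow] = best_path
        spare[best_path] -= demand
    return placement, spare


placement, spare = place_flows(
    {"p1": 10, "p2": 10},
    {"m1->r1": 6, "m2->r1": 5, "m3->r2": 3},
)
```

Here the two largest transfers end up on different paths and the smallest one shares the path with more room left. A production controller would also have to install forwarding rules on the switches and react as utilization changes, which this sketch leaves out.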
SDN also offers other features that assist with management, integration and analysis of Big Data. New SDN-oriented technologies, including the OpenFlow protocol and the OpenStack cloud platform, promise to make network management easier, more intelligent and highly automated. OpenStack enables the set-up and configuration of network elements with far less manual effort, and OpenFlow assists in network automation, providing greater flexibility to support new pressures such as data center automation, BYOD, security and application acceleration.
From a scale standpoint, SDN also plays a critical role in developing network infrastructure for Big Data, facilitating streamlined management of thousands of switches, as well as the interoperability between vendors that lays the groundwork for accelerated network build-out and application development. OpenFlow, a vendor-agnostic protocol that works with any vendor’s OpenFlow-enabled devices, enables this interoperability, freeing organizations from the proprietary solutions that could hinder them as they work to transform Big Data into information capital.
As the implications and potential of Big Data become increasingly clear, ensuring that the network is prepared to scale to these emerging demands will be a critical step in guaranteeing long-term success. A successful solution will leverage two key elements: the existence of patterns in Big Data applications and the programmability of the network that SDN offers. From that vantage point, SDN is indeed poised to play an important role in enabling the network to adapt further and faster, driving the pace of knowledge and innovation.
About the Author: Bithika Khargharia is a senior engineer focusing on vertical solutions and architecture at Extreme Networks. With more than a decade in the field of technology research and development with companies including Cisco, Bithika’s experience in Systems Engineering spans sectors including green technology, manageability and performance; server, network, and large-scale data center architectures; distributed (grid) computing; autonomic computing; and Software-Defined Networking.