For those of you just tuning in, Spark, an open source cluster computing framework, was originally developed by Matei Zaharia at U.C. Berkeley's AMPLab in 2009, and later open-sourced and donated to the Apache Foundation. Part of the motivation for creating Spark is that MapReduce only allows a single pass through the data, while machine learning (ML) and graphing algorithms generally need to perform multiple passes.
Spark is billed as a “fast and general engine for large-scale data processing,” with a tagline of “Lightning-fast cluster computing.” In the world of big data, Spark has been attracting attention and investment because it provides a powerful in-memory data-processing component within Hadoop that deals with both real-time and batch events. In addition to Databricks, Spark has been embraced by the likes of IBM, Microsoft, Amazon, Huawei, and Yahoo.
To continue reading this article register now