Microsoft kicked off the Spark Summit in San Francisco with news of "an extensive commitment for Spark to power Microsoft's big data and analytics offerings, including Cortana Intelligence Suite, Power BI and Microsoft R Server."
Spark started as an open source project at the University of California, Berkeley's AMPLab in 2009 and was donated to the Apache Software Foundation in 2013. A company called Databricks was formed to further Spark development.
Spark is a significant accelerator for Hadoop, the primary software used in big data analytics, because it does most of its work in memory. Hadoop has traditionally run as a disk-based batch process, using a framework called MapReduce to execute jobs, often overnight, so you got your insight the next day. That's why, despite big data's promise of real-time analytics, it often couldn't deliver.
Because Spark runs in memory, it can run some Hadoop workloads up to 100 times faster than MapReduce, thus making good on the promise of real-time analytics.
Microsoft launched Spark for Azure HDInsight as a public preview last July. The final version is available today as "a fully managed Spark service from Hortonworks that has been hardened for the enterprise and made simpler for you to use," according to the blog post announcing the news.
Other big data announcements
In addition to the Azure HDInsight news, Microsoft made a number of other big data-related announcements at the show and in the blog post:
- R Server for HDInsight in the cloud powered by Spark, previously announced as public preview. R Server for HDInsight will be generally available in the summer. This will make Spark available both on premises and in the cloud. Code can be moved from on premises to the cloud with a few clicks.
- R Server for Hadoop on premises now powered by Spark. R Server for Hadoop will support both Microsoft R and native Spark execution frameworks, with availability in June. Combining R Server with Spark gives users the ability to run R functions over thousands of Spark nodes, letting them train models on data 1,000 times larger and 100 times faster than was possible with open source R, and nearly two times faster than Spark's own MLlib.
- Power BI support for Spark Streaming, previously announced with Power BI General Availability. Spark support in Power BI is now expanded with new support for Spark Streaming scenarios. This allows you to publish real-time events from Spark Streaming directly into one of the fastest-growing visualization tools in the market today.
- Free R Client for Data Scientists. Microsoft introduced R Client, a new, freely available tool for data scientists to build high-performance analytics using R. R Client allows you to use any of the open source R functions to analyze the data on your local workstation, and it will also allow for analysis of remote big data. It can perform analytics on any Microsoft R Server, such as SQL Server R Services, R Server for Hadoop and HDInsight with Spark.