As I wrote in March of this year, the Databricks service is an excellent product for data scientists. It has a full assortment of ingestion, feature selection, model building, and evaluation functions, plus great integration with data sources and excellent scalability. The Databricks service provides a superset of Spark as a cloud service. Databricks the company was founded by the original developer of Spark, Matei Zaharia, and others from U.C. Berkeley’s AMPLab. Meanwhile, Databricks continues to be a major contributor to the Apache Spark project.
In this review, I’ll discuss Spark ML, the open source machine learning library for Spark. To be more accurate, Spark ML is the newer of two machine learning libraries for Spark. As of Spark 1.6, the DataFrame-based API in the Spark ML package was recommended over the RDD-based API in the Spark MLlib package for most functionality, but was incomplete. Now, as of Spark 2.0, Spark ML is primary and complete and Spark MLlib is in maintenance mode.
To continue reading this article register now