At the heart of Spark is the aptly named Spark Core. In addition to coordinating and scheduling jobs, Spark Core provides the basic abstraction for data handling in Spark, known as the Resilient Distributed Dataset (RDD).
RDDs support two kinds of operations: transformations and actions. A transformation derives a new RDD from an existing one (RDDs themselves are immutable, so nothing is changed in place); an action computes a concrete result from an RDD, such as a count of its elements, and returns it to the driver.
Spark is fast in large part because intermediate data can be kept in memory rather than written to disk between steps. Transformations are lazily evaluated, meaning they're only computed when an action actually needs their results; a downside of that laziness is that it can be hard to pin down which step of a job is the slow one.
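Since the transformation/action split is what makes lazy evaluation work, here's a minimal pure-Python sketch of the idea. The names (`TinyRDD`, `collect`, and so on) are hypothetical stand-ins, not Spark's actual API:

```python
# A toy illustration of lazy transformations vs. eager actions.
# This is NOT Spark; it only mimics the evaluation model.

class TinyRDD:
    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []   # deferred transformations

    # Transformations: record work and return a new TinyRDD; compute nothing.
    def map(self, fn):
        return TinyRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        return TinyRDD(self._data, self._pipeline + [("filter", pred)])

    # Actions: replay the recorded pipeline and return a concrete result.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._pipeline:
            if kind == "map":
                items = (fn(x) for x in items)
            else:  # "filter"
                items = (x for x in items if fn(x))
        return list(items)

    def count(self):
        return len(self.collect())

# Building the chain does no work; only the action at the end triggers it.
rdd = TinyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.count())  # 5  (squares 0, 4, 16, 36, 64 are even)
```

Note that the whole pipeline runs only inside `count()`; until then, the chained calls merely record what to do. That deferral is also why slow steps are hard to attribute: the time for every transformation in the chain shows up at the action that triggers it.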
Spark’s speed is still a work in progress. The JVM’s memory management tends to gum up the works for Spark, so Project Tungsten aims to improve memory efficiency by managing memory explicitly, sidestepping the JVM’s object model and garbage collector.