One of the critical decisions facing companies embarking on big data projects is which database to use, and often that decision swings between SQL and NoSQL. SQL has the impressive track record, the large installed base, but NoSQL is making impressive gains and has many proponents. We put the question to experts in both camps.
CTO, of VoltDB says SQL has already earned its stripes in large organizations and big data is just one more job that this stalwart can shoulder. View debate
CEO of Couchbase says NoSQL is viable alternative and, in fact, down right preferable for big data on many fronts, especially when it comes to scalability. View debate
SQL is time-tested and still flourishing
Structured Query Language (SQL) is a proven winner that has dominated for several decades and is currently being aggressively invested in by big data companies and organizations such as Google, Facebook, Cloudera and Apache.
Once a technology becomes as dominant as SQL, the reasons for its ascendency are sometimes forgotten. SQL wins because of a unique combination of strengths:
- SQL enables increased interaction with data and allows a broad set of questions to be asked against a single database design. That’s key since data that’s not interactive is essentially useless, and increased interactions lead to new insight, new questions and more meaningful future interactions.
- SQL is standardized, allowing users to apply their knowledge across systems and providing support for third-party add-ons and tools.
- SQL scales, and is versatile and proven, solving problems ranging from fast write-oriented transactions, to scan-intensive deep analytics.
- SQL is orthogonal to data representation and storage. Some SQL systems support JSON and other structured object formats with better performance and more features than NoSQL implementations.
Although NoSQL has generated some noise of late, SQL continues to win in the marketplace and continues to earn investment and adoption throughout the big data problem space.
The term “NoSQL” is ambiguous, but for this discussion I use Dr. Rick Cattell’s NoSQL definition to mean “systems that provide simple operations like key/value storage or simple records and indexes, and that focus on horizontal scalability for those simple operations.”
It’s clear that the many new databases available are not all alike – and recognizing how the DNA behind each helps and hinders problem solvers is the key to success. NoSQL’s key features make it more appropriate for use in specific problem sets. For example, graph databases are better suited for those situations where data is organized by relationships vs. by row or document, and specialized text search systems should be considered appropriate in situations requiring real time search as users enter terms.
Here, I’m going to draw out the important benefits and differentiation of SQL systems vs. simple key/value and JSON object stores that do not innovate beyond storage format and scalability.
* SQL Enables Interaction. SQL is a declarative query language. Users state what they want, (e.g., display the geographies of top customers during the month of March for the prior five years) and the database internally assembles an algorithm and extracts the requested results. By contrast, NoSQL programming innovation MapReduce is a procedural query technique. MapReduce requires the user to not just know what they want, but additionally requires them to state how to produce the answer.
This might sound like an uninteresting technical difference, but it is critical for two reasons: First, declarative SQL queries are much easier to build via graphical tooling and point-and-click report builders. This opens up database querying to analysts, operators, managers and others with core competencies outside of software programming. Second, abstracting what from how allows the database engine to use internal information to select the most efficient algorithm. Change the physical layout or indexing of the database and an optimal algorithm will still be computed. In a procedural system, a programmer needs to revisit and reprogram the original how. This is expensive and error-prone.
The marketplace understands this critical difference. In 2010 Google announced a SQL implementation to complement MapReduce, driven by internal user demand. More recently, Facebook released Presto, a SQL implementation to query its petabyte HDFS clusters. According to Facebook: “As our warehouse grew to petabyte scale and our needs evolved, it became clear that we needed an interactive system optimized for low query latency.” Furthermore, Cloudera is building Impala, another SQL implementation on top of HDFS. All of these are advances over Hive, a long-standing and broadly adopted SQL façade for Hadoop.
* SQL is standardized. Although vendors sometimes specialize and introduce dialects to their SQL interface, the core of SQL is well standardized and additional specifications, such as ODBC and JDBC, provide broadly available stable interfaces to SQL stores. This enables an ecosystem of management and operator tools to help design, monitor, inspect, explore, and build applications on top of SQL systems.
SQL users and programmers can therefore reuse their API and UI knowledge across multiple backend systems, reducing application development time. Standardization also allows declarative third-party Extract, Transform, Load (ETL) tools that enable enterprises to flow data between databases and across systems.
* SQL scales. It is absolutely false to assume SQL must be sacrificed to gain scalability. As noted, Facebook created an SQL interface to query petabytes of data. SQL is equally effective at running blazingly fast ACID transactions. The abstraction that SQL provides from the storage and indexing of data allows uniform use across problems and data set sizes, allowing SQL to run efficiently across clustered replicated data stores. Using SQL as an interface is independent from building a cloud, scale or HA system, and there is nothing inherent in SQL that prohibits or restricts fault tolerance, high availability and replication. In fact, all modern SQL systems support cloud-friendly horizontal scalability, replication and fault tolerance.
* SQL supports JSON. Several years ago many SQL systems added XML document support. Now, as JSON becomes a popular data interchange format, SQL vendors are adding JSON-type support as well. There are good arguments for structured data type support given today’s agile programming processes and the uptime requirements of web-exposed infrastructure. Oracle 12c, PostgreSQL 9.2, VoltDB and others support JSON – often with performance benchmarks superior to “native” JSON NoSQL stores.
SQL will continue to win market share and will continue to see new investment and implementation. NoSQL Databases offering proprietary query languages or simple key-value semantics without deeper technical differentiation are in a challenging position. Modern SQL systems match or exceed their scalability while supporting richer query semantics, established and trained user bases, broad eco-system integration and deep enterprise adoption.
Betts is CTO at VoltDB and one of the initial developers of VoltDB’s commercial product. VoltDB provides the world’s fastest operational database, delivering high-speed data processing and real-time, in-memory analytics in a single database system. VoltDB is a relational database that gives organizations an unprecedented ability to build ultra-fast applications that can extract insights from massive volumes of dynamic data and enable real-time decision-making. Organizations in markets including telco, financial services, networking, energy and e-commerce use VoltDB to maximize the business value of data at every interaction. VoltDB is privately held with offices in Bedford, Mass. and Santa Clara, Calif.
NoSQL is better for your big data applications
NoSQL is increasingly being considered a viable alternative to relational databases, especially for Big Data applications, as more enterprises recognize that operating at scale is better achieved on clusters of standard, commodity servers. In addition, a schema-less data model is often better for the variety and type of data captured and processed today.
When we talk about Big Data in the NoSQL space, we’re referring to reads and writes from operational databases – that is, the online transaction processing that people interact with and engage in on a daily basis (e.g. having the Big Data needed to book a flight online). Operational databases are not to be confused with analytical databases, which generally look at a large amount of data and collect insights from that data (e.g. having the Big Data needed to find out how many people will book a flight on a given day).
While the Big Data of operational databases might not appear to be as analytical when scratching the surface, operational databases generally host large datasets with ultra-large numbers of users that are constantly accessing the data to execute on transactions in real time. The scale to which databases must operate to manage Big Data explains the critical nature of NoSQL, and thus why NoSQL is key for Big Data applications.
NoSQL is Critical for Scalability
Every time the technology industry experiences a fundamental shift in hardware developments, there’s an inflection point. In the database space, the shift from scale-up to scale-out architectures is what fueled the NoSQL movement. Relational databases, including those from giants Oracle and IBM, scale up. That is, they are centralized, share-everything technologies that can only scale through the addition of more expensive hardware.
Conversely, NoSQL databases were built from the ground up to be distributed, scale-out technologies. They use a distributed set of nodes, known as a cluster, to provide a highly elastic scaling capability that lets users add nodes to handle load on-the-fly.
A distributed scale-out approach also usually ends up being cheaper than the scale-up alternative. This is a consequence of large, complex, fault-tolerant servers being expensive to design, build and support. Licensing costs of commercial relational databases can also be prohibitive because they are priced with a single server in mind. NoSQL databases on the other hand are generally open source, priced to operate on a cluster of servers and are relatively inexpensive.
NoSQL is Critical for Flexibility
Relational and NoSQL data models are very different. The relational model takes data and separates it into many interrelated tables that contain rows and columns. These tables reference each other through foreign keys that are stored in columns as well.
When a user needs to run a query on a set of data, the desired information needs to be collected from many tables – often hundreds in today’s enterprise applications – and combined before it can be provided to the application. Similarly, when writing data, the write needs to be coordinated and performed on many tables. When data is relatively low-volume, and when it is flowing into a database at a low velocity, a relational database is usually able to capture and store the information. But today’s applications are often built on the expectation that massive volumes of data can be written (and read) at speeds near real-time.
NoSQL databases have a very different model. At the core, NoSQL databases are really “NoREL,” or non-relational, meaning they do not rely on tables and the links between tables in order to store and organize information. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might take all the data stored in a row that spans 25 tables of a relational database and aggregate it into a single document/object.
Aggregating this information may lead to duplication of information, but since storage is no longer a cost issue, the resulting data model flexibility, ease of efficiently distributing the resulting documents, and read and write performance improvements make it an easy trade-off for Web-based applications.
NoSQL is Critical for Big Data Applications
Data is becoming increasingly easier to capture and access through third parties, including social media sites. Personal user information, geographic location data, user-generated content, machine-logging data and sensor-generated data are just a few examples of the ever-expanding array being captured. Enterprises are also relying on Big Data to drive their mission-critical applications. Across the board, organizations are turning to NoSQL databases because they are uniquely suited for these new classes of data emerging today.