It's refreshing to see Microsoft shed the last bits of its not-invented-here mentality and embrace new industry standards without conditions, unlike its handling of Java 20 years ago. You see it rather clearly in its support for Hadoop and Big Data.
Earlier this year, Microsoft announced plans for a Hadoop Distributed File System (HDFS)-compatible data store called Azure Data Lake Store that could support large analytics workloads. "Data lake" is a term coined by the Big Data industry for massive data stores whose contents are meant to be acted on at a later time. While some Big Data is meant for real-time or immediate processing, data lakes follow more of a "set it aside and we'll get to it later" approach.
That is how Microsoft describes Azure Data Lake Store. In a blog post, T. K. "Ranga" Rengarajan, Microsoft's corporate vice president for data platform, laid out the three parts of Azure Data Lake, of which Store is one.
It's a single repository that lets users capture data of any size, type, or format without requiring changes to the application as the data scales. Data can be securely stored and shared, and it can be processed and queried from HDFS-based applications and tools.
Rengarajan also announced Azure Data Lake Analytics, an Apache YARN-based service that's designed to scale dynamically to handle large Big Data workloads. The Azure Data Lake Analytics service will be based on U-SQL, a language that will "unify the benefits of SQL with the power of expressive code," as Rengarajan put it.
U-SQL's scalable distributed query capability enables you to efficiently analyze data in the store and across SQL Servers in Azure, Azure SQL Database, and Azure SQL Data Warehouse.
Finally, there is Azure HDInsight, a fully managed Apache Hadoop cluster service with a broad range of open source analytics engines, including Hive, Spark, HBase, and Storm. Microsoft announced the general availability of managed clusters on Linux with an industry-leading 99.9% uptime SLA.
"Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages," Rengarajan wrote.
Michael Rys, a principal program manager for Big Data at Microsoft, explained Microsoft's new language and why it's needed for Azure Data Lake Analytics in his own blog post. He noted that Big Data analytics requires the ability to process any type of data, to use custom code easily to express complex, often proprietary business algorithms, and to scale efficiently to any size of data without the developer having to worry about it.
The problem is that SQL and procedural languages are different animals, so Microsoft designed U-SQL from the ground up as an evolution of the declarative SQL language with native extensibility through user code written in C#.
"This unifies both paradigms, unifies structured, unstructured, and remote data processing, unifies the declarative and custom imperative coding experience, and unifies the experience around extending your language capabilities," Rys wrote.
U-SQL is built on Microsoft's internal experience with SCOPE and existing languages such as T-SQL, ANSI SQL, and Hive. It uses C# data types and the C# expression language, so you can seamlessly write C# predicates and expressions inside SELECT statements and use C# to add your custom logic.
"In short, basing U-SQL language on these existing languages and experiences should make it easy for you to get started and powerful enough for the hardest problems," Rys wrote.
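The idea of mixing a declarative query with custom imperative code can be sketched by analogy. The example below is not U-SQL and does not use any Azure service; it is a minimal Python illustration that registers an ordinary function with the standard-library sqlite3 module and then calls it as a predicate inside a SELECT statement. The table name, column names, sample rows, and the five-minute threshold are all hypothetical and chosen only for illustration.

```python
import sqlite3


def is_long_session(duration_seconds):
    """Custom imperative logic: flag sessions longer than five minutes.

    Returns 1 or 0 because SQLite has no separate boolean type.
    The 300-second threshold is an arbitrary, illustrative choice.
    """
    if duration_seconds is not None and duration_seconds > 300:
        return 1
    return 0


def long_sessions_by_region(rows):
    """Count long sessions per region using SQL plus a user-defined predicate."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE searchlog (region TEXT, query TEXT, duration INTEGER)"
    )
    conn.executemany("INSERT INTO searchlog VALUES (?, ?, ?)", rows)

    # Register the Python function so the SQL engine can call it by name.
    conn.create_function("is_long_session", 1, is_long_session)

    # The query stays declarative; the filtering rule lives in user code.
    result = conn.execute(
        "SELECT region, COUNT(*) FROM searchlog "
        "WHERE is_long_session(duration) = 1 "
        "GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data: (region, query, duration in seconds).
SAMPLE_ROWS = [
    ("en-gb", "hadoop", 420),
    ("en-gb", "yarn", 30),
    ("en-us", "u-sql", 610),
    ("en-us", "hive", 900),
    ("de-de", "spark", None),
]

if __name__ == "__main__":
    print(long_sessions_by_region(SAMPLE_ROWS))
```

In this sketch the query describes what to compute, while the predicate carries the custom rule. That is roughly the division of labor Rys describes for U-SQL, where the custom logic would be written in C# rather than Python.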
As part of this, Microsoft announced Azure Data Lake Tools for Visual Studio, which provide an integrated development environment that spans the Azure Data Lake line and simplifies authoring, debugging, and optimization for processing and analytics at any scale.