ALIF Consulting

What is Azure Databricks?

Updated: Dec 5, 2024

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments for developing data-intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.

Databricks SQL provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.



Databricks Data Science & Engineering provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. In a big data pipeline, the data (raw or structured) is ingested into Azure in batches through Azure Data Factory, or streamed in near real time using Apache Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights using Spark.

Databricks Machine Learning is an integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.

For example, in the Microsoft reference architecture below, Databricks is used for ETL and machine learning, while Synapse and Azure Analysis Services serve the line-of-business and ad hoc BI workloads. That said, you can still use Databricks through Power BI to perform ad hoc queries on your data lake.

This example is still consistent with the vision – making your Data Lake a centralized and democratized asset that can serve all downstream data processes.


[Figure: Azure Databricks reference architecture]

Databricks Core Components


Collaborative Workspace

Developers will mainly interact with Databricks through its collaborative and interactive workspace. This is a notebook-based environment that has some of the following key features:

  • Code collaboratively, in real-time, in notebooks that support SQL, Python, Scala, and R

  • Built-in version control and integration with Git / GitHub and other source control

  • Enterprise-level security

  • Visualize queries, build algorithms and create dashboards

  • Create and schedule ETL / Data Science workloads from various data sources to be run as jobs

  • Track and manage the machine learning lifecycle from development to production

Here is a screenshot of a Databricks Notebook and the Databricks Workspace.


[Screenshot: Databricks notebook and workspace]

Managed Infrastructure

One of the key original value propositions of Databricks is its managed infrastructure, which takes the form of managed clusters. A cluster is a group of virtual machines that divide up the work of a query to return the results faster. By filling out 5-10 fields and clicking a button, you can spin up a Spark cluster that is optimized well beyond open-source Spark, includes many common data science and data analytics libraries, and can auto-scale to meet the needs of a given workload. You only pay for Databricks when a cluster is live, and there is a lot of built-in functionality to reduce this cost. For example, a job cluster spins up to complete a specific job or task and shuts down as soon as it finishes.
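To make the "pay only while the cluster is live" point concrete, here is a small arithmetic sketch in Python. It is not a pricing calculator: the `ClusterPlan` class, the node count, and the hourly rate are all made-up placeholders, and the only thing it assumes is that cost scales with the hours a cluster is running.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPlan:
    """Hypothetical cost model: every figure here is a placeholder, not a real price."""

    nodes: int
    rate_per_node_hour: float  # assumed blended hourly rate per node

    def cost(self, live_hours: float) -> float:
        # Cost only accrues for the hours the cluster is live.
        return self.nodes * self.rate_per_node_hour * live_hours


def monthly_cost_always_on(plan: ClusterPlan, days: int = 30) -> float:
    """Cluster left running around the clock."""
    return plan.cost(live_hours=24 * days)


def monthly_cost_job_cluster(
    plan: ClusterPlan, job_hours_per_day: float, days: int = 30
) -> float:
    """Job cluster that is live only while its daily job runs."""
    return plan.cost(live_hours=job_hours_per_day * days)


plan = ClusterPlan(nodes=4, rate_per_node_hour=1.0)  # illustrative numbers only
always_on = monthly_cost_always_on(plan)
job_only = monthly_cost_job_cluster(plan, job_hours_per_day=2)
print(f"always on: {always_on:.2f}, job cluster: {job_only:.2f}")
```

With these placeholder numbers, a job that runs two hours a day is billed for 60 live hours a month instead of 720, which is the saving the job cluster pattern is aiming for.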




Spark

Spark is an open-source distributed processing engine that processes data in memory – making it extremely popular for big data processing and machine learning. Spark is the core engine that executes workloads and queries on the Databricks platform. Databricks was founded by the original creators of Spark and continues to be the largest contributor to open-source Spark today.


Delta

Delta is an open-source file format that was built specifically to address the limitations of traditional data lake file formats. Under the hood, Delta is composed of Parquet, a columnar format optimized for big data workloads, with added metadata and transaction logs. Delta offers the following key features, which plain file formats such as Parquet and ORC lack:

  • ACID Transactions

  • Ability to perform upserts

  • Indexing for faster queries

  • Unifies streaming and batch workloads without a complex Lambda architecture

  • Schema validation and expectations
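The idea behind several of these features, data plus a transaction log, can be sketched in a few lines of plain Python. This is a toy in-memory model, not the Delta Lake API: the `ToyDeltaTable` class and its method names are invented for illustration. It shows how an append-only log of commits can give all-or-nothing writes, upserts by key, schema validation, and the ability to read an older version.

```python
from typing import Dict, List, Optional


class ToyDeltaTable:
    """Toy in-memory table backed by an append-only transaction log.

    Hypothetical illustration only; real Delta tables store Parquet files
    plus a log on disk and expose a different API.
    """

    def __init__(self, key: str, schema: Dict[str, type]) -> None:
        self.key = key
        self.schema = schema
        self.log: List[List[dict]] = []  # one entry per committed transaction

    def _validate(self, row: dict) -> None:
        # Schema validation: reject rows with unexpected columns or types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"{column!r} must be {expected.__name__}")

    def upsert(self, rows: List[dict]) -> int:
        """Validate every row first, then commit them all as one log entry."""
        for row in rows:
            self._validate(row)  # any failure aborts before the log is touched
        self.log.append([dict(row) for row in rows])
        return len(self.log) - 1  # version number of the new commit

    def snapshot(self, version: Optional[int] = None) -> Dict[object, dict]:
        """Replay the log up to `version` (latest if omitted)."""
        last = len(self.log) - 1 if version is None else version
        state: Dict[object, dict] = {}
        for commit in self.log[: last + 1]:
            for row in commit:
                state[row[self.key]] = row  # insert new key or overwrite existing
        return state


table = ToyDeltaTable(key="id", schema={"id": int, "city": str})
table.upsert([{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}])
table.upsert([{"id": 2, "city": "Mumbai"}, {"id": 3, "city": "Chennai"}])

try:
    # One bad row (city is not a string) rejects the whole batch.
    table.upsert([{"id": 4, "city": "Goa"}, {"id": 5, "city": 42}])
except ValueError:
    pass

latest = table.snapshot()          # id 2 now reads "Mumbai"
first = table.snapshot(version=0)  # id 2 still reads "Delhi"
```

The second `upsert` updates id 2 and inserts id 3 in one commit, while the failed third batch leaves no trace in the log, not even its valid row for id 4.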



A common misconception is that if you choose to build a 'Delta Lake', all of your data needs to be in the Delta format. This is not true: your raw data can stay in its original format, and if you have other specific file format requirements, you can store whatever file type you would like in the data lake. Delta is a tool to be used in the data lake where it makes sense.

MLflow

MLflow is an open-source machine learning framework that was built to manage the ML lifecycle. A common challenge within data science is that it is hard to get machine learning into production. MLflow addresses this challenge with the following features:


[Figure: MLflow components]

All of the above components are part of open-source MLflow. On the Databricks platform, you get the following additional benefits:

  1. Workspaces – Collaboratively track and organize experiments from the Databricks Workspace

  2. Jobs – Execute runs as Databricks jobs, either remotely or directly from Databricks notebooks

  3. Big Data Snapshots – Track large-scale data sets that feed models with Databricks Delta snapshots

  4. Security – Take advantage of one common security model for the entire ML lifecycle.

  5. Serving – Quickly deploy an ML model to a REST endpoint for testing during the development process

Essentially, on Databricks, there is no management of MLflow as a separate tool. Everything is built right into the UI to create a seamless experience.
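For readers new to experiment tracking, the core idea is simple: record the parameters and metrics of every training run so that runs can be compared later. The sketch below is a toy stand-in written in plain Python, not the MLflow API. The `Experiment` and `Run` classes, the experiment name, and the metric values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One training run: the settings used and the results observed."""

    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class Experiment:
    """Hypothetical tracker that keeps every run so they can be compared."""

    name: str
    runs: List[Run] = field(default_factory=list)

    def log_run(self, params: Dict[str, float], metrics: Dict[str, float]) -> Run:
        # Copy the dicts so later changes by the caller do not alter the record.
        run = Run(params=dict(params), metrics=dict(metrics))
        self.runs.append(run)
        return run

    def best_run(self, metric: str, higher_is_better: bool = True) -> Run:
        pick = max if higher_is_better else min
        return pick(self.runs, key=lambda run: run.metrics[metric])


experiment = Experiment(name="churn-model")  # made-up experiment name
# Metric values below are invented purely to exercise the tracker.
experiment.log_run({"learning_rate": 0.1}, {"accuracy": 0.81})
experiment.log_run({"learning_rate": 0.01}, {"accuracy": 0.88})
experiment.log_run({"learning_rate": 0.001}, {"accuracy": 0.85})

best = experiment.best_run("accuracy")
print(best.params)
```

A real tracking service adds persistence, artifacts, and a UI on top of this, but the record-then-compare loop is the part that helps a model move from development towards production.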


SQL Analytics

SQL Analytics (since renamed Databricks SQL) is an offering that gives SQL analysts a home within Databricks. Switching views from the traditional Databricks workspace to the SQL Analytics workspace gives an experience like that of a conventional SQL workbench. Users can:

  • Write SQL queries against the data lake

  • Visualize query results inline

  • Build dashboards and share them with the business

  • Create alerts based on SQL queries

The backend of SQL Analytics is powered by SQL Endpoints, which are Spark clusters optimized for SQL workloads. These endpoints are not limited to the SQL Analytics UI within Databricks: you can also connect to them from your favourite BI tools, such as Tableau and Power BI, and query all of the data in your lake from there.
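An alert based on a SQL query, as listed above, boils down to running a query on a schedule and comparing the result to a threshold. The sketch below shows that logic using Python's built-in `sqlite3` module as a local stand-in for a SQL endpoint. The `orders` table, its rows, the `check_alert` helper, and the threshold are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for a SQL endpoint; table, columns, and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 30.0)],
)


def check_alert(connection, query, threshold):
    """Run a query returning (label, value) rows; return labels below the threshold."""
    return [label for label, value in connection.execute(query) if value < threshold]


QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
triggered = check_alert(conn, QUERY, threshold=50.0)
print(triggered)
```

Here only the "south" region falls below the threshold, so it is the only one that would raise an alert. In Databricks the scheduling and notification parts are handled by the platform rather than by code like this.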


When to use Databricks

  1. Modernize your Data Lake – if you face challenges around performance and reliability in your data lake, or your data lake has become a data swamp, consider Delta as an option to modernize it.

  2. Production Machine Learning – if your organization is doing data science work but is having trouble getting that work into the hands of business users, the Databricks platform was built to enable data scientists to get their work from Development to Production.

  3. Big Data ETL – from a cost/performance perspective, Databricks is the best in its class.

  4. Opening your Data Lake to BI users – If your analyst / BI group is consistently slowed down by the major lift of the engineering team having to build a pipeline every time they want to access new data, it might make sense to open the Data Lake to these users through a tool like SQL Analytics within Databricks.

When not to use Databricks

There are a few scenarios when using Databricks is probably not the best fit for your use case:

  1. Sub-second queries – Spark, being a distributed engine, has processing overhead that makes sub-second query responses nearly impossible. Your data can still live in the data lake, but you will likely want to use a highly tuned speed layer for sub-second queries.

  2. Small data – Similar to the first point, you won't get the majority of the benefits of Databricks if you are dealing with very small data (think GBs).

  3. Pure BI without a supporting data engineering team – Databricks and SQL Analytics do not erase the need for a data engineering team; in fact, that team is more critical than ever in unlocking the potential of the Data Lake. That said, Databricks offers tools that enable the data engineering team itself.

  4. Teams requiring drag-and-drop ETL – Databricks has many UI components, but drag-and-drop pipeline building is not currently one of them.

