Welcome back folks!
In our blogs so far, we have discussed the Unified Analytics Platform in depth and covered the Databricks architecture along with the various technologies leveraged on the platform.
This is the last blog of our series, and we will cover some important topics to give you a holistic understanding of Databricks and its capabilities-
About Lakehouse
Continuous Integration Continuous Delivery
Deep Learning with Databricks
Data Quality
Infrastructure Management
Data Governance on Databricks
We also have an exciting announcement to make at the end. Happy reading!
Data warehouses have a long history in decision support and business intelligence applications, and they have continued to evolve ever since. However, while warehouses were ideal for structured data, dealing with unstructured data, semi-structured data, and data with high variety, velocity, and volume was a concern. With companies collecting huge amounts of data from different sources, architects started to envision a single system to house data for analytic products and workloads. This led to the creation of repositories that store raw data in a variety of formats, commonly known as ‘data lakes’. However, data lakes lack some critical features:
They do not support transactions.
They do not enforce data quality.
Their lack of consistency/isolation makes it hard to mix appends and reads, and batch and streaming jobs.
Companies, on the other hand, required systems for diverse data applications including SQL analytics, real-time monitoring, data science, and machine learning. Most of the recent advances in AI have been in better models for processing unstructured data (text, images, video, audio), but these are exactly the data types a data warehouse is not optimized for. The common approach, then, was to use multiple systems- a data lake, several data warehouses, and other specialized systems such as streaming, time-series, graph, and image databases. However, having a multitude of systems introduced complexity and, more specifically, delayed processing, as data professionals were invariably required to move and copy data between them. This led to the creation of the lakehouse.
A lakehouse is an open architecture that combines the best elements of data lakes and data warehouses and is designed to overcome the limitations of data lakes. Lakehouses are enabled by a new, open, and standardized system design: implementing data structures and data management features similar to those in a data warehouse directly on the kind of low-cost storage used for data lakes. In simple words, it is what you would get if you redesigned the data warehouse for the modern world, now that inexpensive and highly reliable storage (in the form of object stores) is available.
Typically, the key features of a lakehouse are as follows:
Support for diverse data types ranging from unstructured to structured data: Lakehouse is designed to store, refine, analyze, and access data types required for new data applications, including images, video, audio, semi-structured data, and text.
Transaction support: The data pipelines are capable of reading and writing data concurrently. They support ACID transactions that ensure consistency as multiple parties concurrently read or write data, typically using SQL.
Schema enforcement and governance: The lakehouse supports schema enforcement and evolution as well as data warehouse schema architectures such as star/snowflake schemas. It provides robust governance and auditing mechanisms, with capabilities such as auditing, retention, and lineage. (Schema enforcement on a Delta table is illustrated in the sketch after this list.)
BI support: It enables using BI tools directly on the source data, which reduces staleness, improves recency, reduces latency, and lowers the cost of operationalizing two copies of the data in both a data lake and a warehouse.
Storage is decoupled from compute: In a lakehouse setup, storage and compute use separate clusters, so these systems can scale to more concurrent users and larger data sizes. This feature is also seen in some modern data warehouses.
Openness: The lakehouse leverages storage formats such as Parquet that are open and standardized, and provides an API so that a variety of tools and engines, including machine learning and Python/R libraries, can access the data directly.
Support for diverse workloads: This includes data science, machine learning, and SQL and analytics workloads. One might have to leverage multiple tools to support all these workloads, but they all rely on the same data repository.
End-to-end streaming: The demand for real-time reporting continues to grow. The lakehouse supports streaming, which eliminates the need for separate systems dedicated to real-time data applications.
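To make some of these properties concrete, here is a minimal PySpark sketch, assuming a Databricks environment (or a local Spark session with the Delta Lake libraries configured) and a hypothetical storage path. It shows an ACID write to a Delta table and the schema enforcement behavior described above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# On Databricks a configured SparkSession already exists as `spark`;
# locally you would need the delta-spark package and its session extensions.
spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

table_path = "/tmp/lakehouse_demo/trades"  # hypothetical storage location

# ACID write: the Delta table is created atomically.
trades = spark.createDataFrame(
    [("AAPL", 190.5), ("MSFT", 410.2)], ["symbol", "price"]
)
trades.write.format("delta").mode("overwrite").save(table_path)

# Schema enforcement: appending data with a mismatched schema is rejected
# instead of silently corrupting the table.
bad_batch = spark.createDataFrame(
    [("GOOG", "not-a-number", 42)], ["symbol", "price", "extra_column"]
)
try:
    bad_batch.write.format("delta").mode("append").save(table_path)
except AnalysisException as err:
    print(f"Write rejected by schema enforcement: {err}")

# Readers always see a consistent snapshot of the table.
spark.read.format("delta").load(table_path).show()
```

The same table also supports updates, deletes, merges, and time travel, which is what gives the lakehouse its warehouse-like transactional behavior on inexpensive object storage.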
You can read more about Lakehouse using the link below.
About Lakehouse
Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
CI/CD can be leveraged to productionize and automate your data platform at scale. Data-driven innovation is of utmost importance to stay competitive in today’s marketplace. Organizations that bring data, analytics, and ML-based products to market first can stay ahead of the competition and gain a first-mover advantage. While many companies have streamlined CI/CD processes for application development, not many have well-defined processes for developing data and ML products. Hence, it is critically important to have production-ready, reliable, and scalable data pipelines to feed the analytics dashboards and ML applications.
With the introduction of new features, data practitioners need a consistent toolset and consistent environments to help them rapidly iterate on ideas. As these ideas progress, they are tested and taken from development to production. Once they are in production, the ML models and analytics need to be constantly monitored for effectiveness, stability, and scale. This process can be intimidating: the pace at which new features are added to the tool suite is high, and repeatedly iterating around the development process can be time consuming. Additionally, consumers must have confidence in the validity of outcomes within these products. Hence, if you want to accelerate the creation of new and innovative data products, you will need to rely heavily on automation to overcome the following challenges:
Lack of consistent and collaborative development environments
Lack of consistent DevOps processes
Limited visibility into data pipeline and ML model performance
Automation can help you address these challenges and deliver the following:
Fully configured data environments on-demand
Deploy workspace
Connect data sources
Provision users and groups
Create clusters and cluster policies
Add permissions for users and groups
The CI/CD pipeline
Development environment
Staging/Integration environment
Production environment
Streamlined operations
Take innovations to market faster
That was a brief overview of continuous integration and delivery. Let us now understand CI/CD on Azure Databricks using Azure DevOps.
CI/CD refers to the process of developing and delivering software in short, frequent cycles leveraging automation pipelines.
Continuous integration begins with the practice of committing your code with some frequency to a branch within a source code repository. Each commit is then merged with the commits from other developers to ensure that no conflicts are introduced. The changes are further validated by creating a build and running automated tests against that build. This process ultimately results in an artifact, or deployment bundle, that will eventually be deployed to a target environment, which is an Azure Databricks workspace in this case.
We will now give you a brief overview of a typical Azure Databricks CI/CD pipeline. The pipeline can vary based on your needs, but a typical configuration for an Azure Databricks pipeline includes the following steps:
Continuous integration:
Code:
Develop code and unit tests in an Azure Databricks notebook or using an external IDE.
Manually run tests.
Commit code and tests to a git branch.
Build:
Gather new and updated code and tests.
Run automated tests.
Build libraries and non-notebook Apache Spark code.
Release: Generate a release artifact.
Continuous delivery:
Deploy:
Deploy notebooks (see the sketch after these pipeline steps).
Deploy libraries.
Test: Run automated tests and report results.
Operate: Programmatically schedule data engineering, analytics, and machine learning workflows.
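To make the notebook deployment step concrete, below is a minimal sketch that pushes a notebook into the target workspace using the Databricks Workspace API. The workspace URL, environment variable names, and paths are illustrative assumptions; in a real pipeline they would come from your CI/CD tool’s variable groups or secret store.

```python
import base64
import os

import requests

# Assumed to be provided by the pipeline (e.g. an Azure DevOps variable group).
host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-1234567890123456.7.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

def deploy_notebook(local_path: str, workspace_path: str) -> None:
    """Import a local notebook source file into the target workspace."""
    with open(local_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        f"{host}/api/2.0/workspace/import",
        headers=headers,
        json={
            "path": workspace_path,
            "format": "SOURCE",
            "language": "PYTHON",
            "content": content,
            "overwrite": True,
        },
        timeout=30,
    )
    resp.raise_for_status()

# Hypothetical artifact produced by the build stage.
deploy_notebook("artifacts/etl_job.py", "/Production/etl_job")
```

In an Azure DevOps release pipeline this script would run in the deploy stage after the build artifact is downloaded; the Databricks CLI (`databricks workspace import_dir`) is a common alternative to calling the API directly.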
Furthermore, based on the project, you would be required to develop and commit your code, define and set up the pipeline, configure the build agent, conduct unit tests in Azure Databricks notebooks, test library code using Databricks Connect, and finally publish your test results. You can read more about these stages and get the reference code from the link below.
CI/CD on Azure Databricks using Azure DevOps
Furthermore, you can automate CI/CD on Databricks using Databricks Labs CI/CD Templates.
Databricks Labs CI/CD Templates is an open source tool that makes it easy for software development teams to use existing CI tooling with Databricks Jobs. It includes pipeline templates with Databricks’ best practices baked in, and it runs on both Azure and AWS. This makes it convenient for developers to devote all their attention to writing code without having to worry about setting up testing, integration, and deployment systems from scratch.
Now you must be wondering: why do we need another deployment framework?
As projects on Databricks grow, users may find themselves struggling to keep up with the numerous notebooks containing ETL, data science experimentation, dashboards, and more. Although there are various short-term workarounds, such as using the %run command to call other notebooks from within your current notebook, it is beneficial to follow the traditional software engineering best practice of separating reusable code from the pipelines that call it. Additionally, building tests around your pipelines to verify that they work as intended is another important step towards production-grade development processes.
Finally, when changes are made in the code, being able to automatically run jobs in real time without having to manually trigger the job or manually install libraries on clusters is important to achieve scalability and stability of your overall pipeline. In a nutshell, to scale and stabilize our production pipelines, we will have to move away from running code manually in a notebook and move towards automated packaging, testing, and code deployment using traditional software engineering tools such as IDEs and continuous integration tools.
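As a small illustration of those practices, here is a hedged sketch of how reusable pipeline logic might be factored out of notebooks and covered by a unit test. The module and function names are hypothetical, and the test runs with plain pytest on any machine with PySpark installed, so a CI server can execute it on every commit.

```python
# In practice the function below would live in a module such as
# transformations.py, and the test in tests/test_transformations.py.
import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def add_total_amount(orders: DataFrame) -> DataFrame:
    """Add a total_amount column computed as quantity * unit_price."""
    return orders.withColumn("total_amount", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Small local session; no Databricks cluster is needed to run the tests.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_add_total_amount(spark):
    orders = spark.createDataFrame([(2, 5.0), (3, 1.5)], ["quantity", "unit_price"])
    result = add_total_amount(orders).collect()
    assert [row["total_amount"] for row in result] == [10.0, 4.5]
```

A tool such as CI/CD Templates can then wire a suite like this into your CI system so it runs automatically on every push, before anything is deployed to a workspace.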
Read more about automating CI/CD from the links below-
https://databricks.com/blog/2020/03/16/productionize-and-automate.html
As data volume and complexity continue to grow, there arises the need for increased processing power and advanced graphics processors. Deep learning has proved to be an ideal way to provide predictive analytics on big data.
With deep learning, it is easier for organizations to harness the power of unstructured data such as images, text, and voice to deliver transformative use cases that leverage techniques like AI, image interpretation, automatic translation, natural language processing, and more. Some common use cases are-
Image classification: Recognize and categorize images for easy sorting and more accurate search.
Object detection: Fast object detection to make autonomous cars and face recognition a reality.
Natural Language Processing: Accurately understanding spoken words to power new tools like speech-to-text and home automation.
However, deep learning too has its own set of challenges. While big data and AI offer a plethora of capabilities, identifying actionable insights from big data is not an ordinary task. The large and rapidly growing body of information hidden in unstructured data (images, sound, text, etc.) demands both advanced technologies and interdisciplinary teams- data engineering, data science, and business teams working in close collaboration. The major challenges with respect to deep learning are-
Disjointed technology: Reliance on separate frameworks and tools (TensorFlow, Keras, PyTorch, MXNet, Caffe, CNTK, Theano) that offer low level APIs with steep learning curves.
Costly infrastructure: Providing the infrastructure to support deep learning can require significant amounts of costly resources and computational power to scale.
Data science complexity: Training an accurate deep learning model can be labor intensive for data scientists, often requiring manual labelling of data and tuning of parameters.
You can read more about using Databricks with Deep learning from the link below.
Democratizing Deep Learning:
The Databricks Unified Analytics Platform powered by Apache Spark allows you to build reliable, performant, and scalable deep learning pipelines that enable data scientists to build, train, and deploy deep learning applications with ease. Some of its leading capabilities include-
Unified infrastructure: Fully managed, serverless cloud infrastructure for isolation, cost control and elasticity. Provides an interactive environment to make it easy to work with major frameworks such as TensorFlow, Keras, PyTorch, MXNet, Caffe, CNTK, and Theano.
End-to-end workflows: A single platform to handle data preparation, exploration, model training, and large-scale prediction. High level APIs and example applications let you easily leverage state of the art models.
Performance optimized: A highly performant Databricks Runtime powered by Apache Spark and built to run on powerful GPU hardware at scale.
Interactive data science: Collaborate with your team across multiple programming languages to explore data and train deep learning models against real-time data sets. (A minimal training sketch follows this list.)
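To give a flavor of what a training job looks like on the platform, here is a minimal, hedged TensorFlow/Keras sketch. It uses a toy synthetic dataset rather than a real workload; on Databricks it would simply run in a notebook cell on a GPU-enabled cluster with the Databricks Runtime for Machine Learning.

```python
import numpy as np
import tensorflow as tf

# Toy synthetic data standing in for a real feature table
# (in practice this would be loaded from a Delta table via Spark).
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 20)).astype("float32")
labels = (features.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# On a GPU-enabled cluster this training loop automatically uses the GPU.
model.fit(features, labels, epochs=5, batch_size=32, validation_split=0.2)
```

To scale the same Keras code across a cluster, distributed-training libraries such as Horovod (exposed as HorovodRunner in the Databricks Runtime for ML) can be layered on top with only minor changes.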
You can read more about Deep Learning from https://docs.databricks.com/applications/machine-learning/train-model/deep-learning.html.
Data Quality Monitoring on Streaming Data Using Spark Streaming and Delta Lake:
In the era of technology, streaming data is no longer an outlier- instead, it is becoming the norm. Customers are no longer anxious about delays in streaming data across channels, and the pervasiveness of technologies such as Kafka and Delta Lake underlines this momentum. On one end of this streaming spectrum is what we consider “traditional” streaming workloads- data that arrives with high velocity, usually in semi-structured or unstructured formats such as JSON, and often in small payloads. This type of workload cuts across verticals. One customer example is a major stock exchange and data provider who was responsible for streaming hundreds of thousands of events per minute- stock ticks, news, quotes, and other financial data. This customer leveraged Databricks, Delta Lake, and Structured Streaming to process and analyze these streams in real time with high availability. With increasing regularity, however, we see customers on the other end of the spectrum, using streaming for low-frequency, “batch-style” processing. In this architecture, streaming acts as a way to monitor a specific directory, S3 bucket, or other landing zone, and automatically process data as soon as it lands- such an architecture removes much of the burden of traditional scheduling, particularly in the case of job failures or partial processing.
While the emergence of streaming in the mainstream is a net positive, there are some challenges that come along with this architecture. The first is the historical trade-off between high-quality data and high-velocity data. In practice, however, this is a false choice: quality must be coupled with velocity, and to achieve high velocity, we need high-quality data.
Implementing Quality Monitoring for Streaming Data:
Suppose we simulate the data flow by running a small Kafka producer on an EC2 instance that feeds simulated transactional stock information into a topic, and use native Databricks connectors to bring this data into a Delta Lake table. To show the capabilities of data quality checks in Spark Streaming, we chose to utilize different features of Deequ throughout the pipeline:
Generate constraint suggestions based on historical ingest data
Run an incremental quality analysis on arriving data using foreachBatch
Run a (small) unit test on arriving data using foreachBatch, and quarantine bad batches into a bad records table
Write the latest metric state into a delta table for each arriving batch
Perform a periodic (larger) unit test on the entire dataset and track the results in MLflow
Send notifications (e.g., via email or Slack) based on validation results
Capture the metrics in MLflow for visualization and logging
Graphically, this pipeline is shown in the image below: MLflow is used to track data quality performance indicators over time and across versions of the Delta table, and a Slack connector is used for notifications and alerts. This is one way to implement quality monitoring for streaming data; a minimal sketch of the incremental foreachBatch checks follows.
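The original implementation relies on Deequ for constraint suggestion and verification. As a simplified, hedged sketch of the same foreachBatch pattern, the example below applies a basic null/range check to each arriving micro-batch and quarantines failing batches into a bad-records Delta table; the paths, column names, and threshold are illustrative assumptions.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

GOOD_PATH = "/delta/stock_ticks"            # hypothetical target table
QUARANTINE_PATH = "/delta/stock_ticks_bad"  # hypothetical bad-records table


def validate_and_write(batch_df: DataFrame, batch_id: int) -> None:
    """Incremental quality check applied to every arriving micro-batch."""
    total = batch_df.count()
    if total == 0:
        return

    # Simple stand-in for a Deequ verification suite: symbols must not be
    # null and prices must be positive.
    bad = batch_df.filter(F.col("symbol").isNull() | (F.col("price") <= 0)).count()

    if bad / total > 0.05:  # tolerate up to 5% bad records per batch
        batch_df.write.format("delta").mode("append").save(QUARANTINE_PATH)
    else:
        batch_df.write.format("delta").mode("append").save(GOOD_PATH)


# `spark` is the ambient SparkSession on Databricks; the landing zone is
# assumed to be a Delta table fed by the Kafka ingest described above.
raw_stream = spark.readStream.format("delta").load("/delta/landing_zone")

(raw_stream.writeStream
    .foreachBatch(validate_and_write)
    .option("checkpointLocation", "/delta/checkpoints/quality")
    .start())
```

In the full solution, the per-batch metrics would also be logged to MLflow and a Slack webhook notified whenever a batch lands in the quarantine table.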
Furthermore, you can read more on implementing quality monitoring for streaming data and various use cases for the same using the link below.
Managing cloud infrastructure and provisioning resources can be a tedious task for DevOps engineers. Even expert cloud admins can get bogged down with managing a bewildering number of interconnected cloud resources such as data streams, storage, compute power, and analytics tools. Let us take an example to help you understand better. Say you have your Databricks workspace ready, and now you want to connect your Databricks cluster to a Redshift cluster in AWS. The architecture diagram below demonstrates how this can be achieved.
Undoubtedly, cloud automation simplifies and speeds up the deployment of cloud resources. However, setting it up is a time-consuming process and requires some complex configuration. Some major challenges with respect to cloud automation include-
Scalability: Addition of new resources to an existing cloud deployment can become exponentially difficult and cumbersome due to resolving dependencies between cloud resources.
Modularity: Many deployment processes are repeatable and inter-dependent (For instance: In AWS, deploying to Redshift also requires a connection to S3 for staging the results).
Consistency: Tracking deployment state may simplify remediation and reduce risk, but it is sometimes challenging to maintain and resolve.
Lifecycle management: Even though you can audit changes to some cloud resources, it may be unclear what actions are necessary to update an entire end-to-end state.
To address these issues, Databricks is introducing a solution to automate your cloud infrastructure. Databricks Cloud Automation leverages the power of Terraform, an open source tool for building, changing, and versioning cloud infrastructure safely and efficiently. It offers an intuitive graphical user interface along with pre-built, “batteries included” Terraform modules that make it easier to connect common cloud resources to Databricks.
The image below will give you a glimpse of Databricks cloud infrastructure.
The underlying idea behind developing Databricks Cloud Automation is to:
Accelerate the deployment process through automation.
Democratize the cloud infrastructure deployment process to non-DevOps/cloud specialists.
Reduce risk by maintaining a replicable state of your infrastructure.
Provide a universal, “cloud-agnostic” solution.
With this new tool, connecting your cloud resources to Databricks is faster and simpler than ever. The growing number of customers using Databricks Cloud Automation is a result of the following capabilities-
A graphical user interface to democratize Databricks cloud deployments.
An elegant solution for tracking infrastructure state.
A modular framework for your cloud infrastructure.
Modules that can be shared, versioned and reused.
Connect to any IaaS provider.
You can read in depth about Databricks Cloud Automation from the link below-
Interesting, right? You must have realized the importance of using Terraform by now. Databricks can also be coupled with Azure to simplify your automation processes. Let us now learn about automating the Azure Databricks platform.
Enterprises need consistent and scalable solutions that can reuse templates to seamlessly comply with enterprise governance policies, with a goal to bootstrap unified data analytics environments across data teams. With Microsoft Azure Databricks, we use an API-first approach for all objects that enables quick provisioning & bootstrapping of cloud computing data environments, by integrating into existing enterprise DevOps tooling without requiring customers to reinvent the wheel. We will walk you through such a cloud deployment automation process using different Azure Databricks APIs.
The process for configuring an Azure Databricks data environment is as follows:
Deploy Azure Databricks Workspace
Provision users and groups
Create clusters policies and clusters
Add permissions for users and groups
Secure access to workspace within corporate network (IP Access List)
Platform access token management
To accomplish the above, we will be using APIs for the following IaaS features or capabilities available as part of Azure Databricks:
Token Management API allows admins to manage their users’ cloud service provider personal access tokens (PAT), including:
Monitor and revoke users’ personal access tokens (sketched in code after this list).
Control the lifetime of future tokens in your public cloud workspace.
Control which users can create and use PATs.
AAD Token Support allows the use of AAD tokens to invoke the Azure Databricks APIs. One could also use Service Principals as first-class identities.
IP Access Lists ensure that users can only connect to Azure Databricks through privileged networks thus forming a secure perimeter.
Cluster policies are a construct that simplifies cluster management across workspace users, and admins can also use them to enforce different security & cost control measures.
Permissions API allows automation to set access control on different Azure Databricks objects like Clusters, Jobs, Pools, Notebooks, Models etc.
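As a quick, hedged illustration of the Token Management API referenced above, the sketch below lists the personal access tokens in a workspace and revokes one. It assumes an admin-level bearer token and a workspace URL supplied via environment variables; field names and endpoint paths should be verified against the current API reference.

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_ADMIN_TOKEN']}"}

# List all personal access tokens in the workspace (admin only).
resp = requests.get(f"{host}/api/2.0/token-management/tokens",
                    headers=headers, timeout=30)
resp.raise_for_status()
tokens = resp.json().get("token_infos", [])

for info in tokens:
    print(info.get("token_id"), info.get("created_by_username"), info.get("comment"))

# Revoke the first token in the list (illustrative choice only).
if tokens:
    token_id = tokens[0]["token_id"]
    requests.delete(f"{host}/api/2.0/token-management/tokens/{token_id}",
                    headers=headers, timeout=30).raise_for_status()
```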
The three leading Automation options to use the Azure Databricks APIs are:
Databricks Terraform Resource Provider could be combined with the Azure provider to create an end-to-end architecture, utilizing Terraform’s dependency and state management features.
Python (or any other programming language) could be used to invoke the APIs (sample solution) providing a way to integrate with third-party or homegrown DevOps tooling.
A ready-made API client like Postman could be used to invoke the API directly.
However, to keep the workflow simple, we’ll use the Postman approach which is as follows-
Use an Azure AD Service Principal to create an Azure Databricks workspace.
Use the service principal identity to set up IP Access Lists to ensure that the workspace can only be accessed from privileged networks (this step is sketched in code after the list).
Use the service principal identity to set up cluster policies to simplify the cluster creation workflow. Admins can define a set of policies that could be assigned to specific users or groups.
Use the service principal identity to provision users and groups using the SCIM API (an alternative to SCIM provisioning from AAD)
Use the service principal identity to limit user personal access token (PAT) permissions using the Token Management API
All users (non-service principal identities) will use Azure AD tokens to connect to workspace APIs. This ensures conditional access (and MFA) is always enforced.
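Here is a minimal sketch of the token acquisition and IP Access List steps above, assuming the service principal has already been added to the workspace as an admin. The tenant ID, workspace URL, and CIDR range are placeholders; the Azure Databricks resource ID used in the token request is the well-known value documented at the time of writing.

```python
import os

import requests

TENANT_ID = os.environ["AZURE_TENANT_ID"]
CLIENT_ID = os.environ["AZURE_CLIENT_ID"]          # service principal application id
CLIENT_SECRET = os.environ["AZURE_CLIENT_SECRET"]
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder

# 1. Acquire an AAD token for the service principal, scoped to the
#    Azure Databricks resource.
token_resp = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
    },
    timeout=30,
)
token_resp.raise_for_status()
headers = {"Authorization": f"Bearer {token_resp.json()['access_token']}"}

# 2. Create an IP access list so the workspace only accepts connections
#    from the corporate network (CIDR is illustrative).
requests.post(
    f"{WORKSPACE_URL}/api/2.0/ip-access-lists",
    headers=headers,
    json={
        "label": "corp-vpn",
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24"],
    },
    timeout=30,
).raise_for_status()
```

Note that the IP access list feature must first be enabled through the workspace configuration settings before the list takes effect.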
This was just a high-level overview of Azure Databricks automation. You can use the link below to read in detail about the various configurations, workspace provisioning, cluster policies, and IP Access Lists.
Azure Databricks Automation
You must be amazed reading about the vast range of capabilities offered by Databricks, right? Let us now dive into the most important aspect of dealing with data- data governance.
Why is data governance important?
Data governance is an umbrella term that encapsulates all the policies and practices implemented to securely manage the data within an organization. As one of the key tenets of any successful data governance practice, data security is of utmost importance to any organization. Key to data security is the ability for data teams to have superior visibility and auditability of user data access patterns across their organization. Therefore, implementing an effective data governance solution helps companies protect their data from unauthorized access and ensures that all the rules are in place to comply with regulatory requirements.
But why is it laborious to manage data? You will have to read about data governance challenges to understand that.
Whether you are managing the data of a startup or a large corporation, security teams and platform owners have the singular challenge of ensuring that this data is secure and is being managed according to the internal controls of the organization. Regulatory bodies across the globe are changing the way we think about how data is both captured and stored. These compliance risks only add further complexity to an already tough situation. How, then, can we open our data to those who can drive the use cases of the future? Ultimately, we need to adopt data policies and practices that help the business realize value through the meaningful application of what can often be vast stores of data- stores that are growing all the time. But what are the challenges that hold us back from implementing effective data governance policies? The typical challenges when considering the security and availability of your data in the cloud are:
Do your current data and analytics tools support access controls on your data in the cloud? Do they provide robust logging of actions taken on the data as it moves through the given tool?
Will the security and monitoring solution you put in place now scale as demand on the data in your data lake grows? It can be easy enough to provision and monitor data access for a small number of users. What happens when you want to open up your data lake to hundreds of users? To thousands?
Is there anything you can do to be proactive in ensuring that your data access policies are being observed? It is not enough to simply monitor- that just produces more data. Beyond making data available, data security requires a solution in place to actively monitor and track access to this information across the organization.
What steps can you take to identify gaps in your existing data governance solution?
Those are a lot of issues to address, right?
But data users will not have to worry about them anymore. Azure Databricks solves this for you. It offers the following capabilities to make it convenient for you to secure your data.
Access control: A rich suite of access controls all the way down to the storage layer. Azure Databricks can take advantage of its cloud backbone by utilizing state-of-the-art Azure security services right in the platform. Enable Azure Active Directory credential passthrough on your Spark clusters to control access to your data lake.
Cluster policies: Enable administrators to control access to compute resources (see the API sketch after this list).
API first: Automate provisioning and permission management with the Databricks REST API.
Audit logging: Robust audit logs on actions and operations taken across the workspace, delivered to your data lake. Azure Databricks can leverage the power of Azure to provide data access information across your deployment account and any others you configure. You can then use this information to power alerts that tip you off to potential wrongdoing.
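As one hedged example of that API-first approach, the sketch below creates a simple cluster policy over the REST API. The policy rules shown (pinning a node type and bounding auto-termination) are illustrative, and the exact rule schema should be checked against the Cluster Policies API documentation.

```python
import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Policy definition: fix the node type and force auto-termination to lie
# between 10 and 120 minutes.
policy_definition = {
    "node_type_id": {"type": "fixed", "value": "Standard_DS3_v2"},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 120},
}

resp = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers=headers,
    json={
        "name": "cost-controlled-clusters",
        "definition": json.dumps(policy_definition),  # definition is passed as a JSON string
    },
    timeout=30,
)
resp.raise_for_status()
print("Created policy:", resp.json().get("policy_id"))
```

The returned policy ID can then be granted to specific users or groups, which is how administrators scope who gets to create which kind of compute.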
You can read more about implementation of governance solutions on Azure Databricks using the link below-
By now, you must have gained a good understanding of the various aspects concerning data governance. We will now introduce you to some third-party tools for governance and security.
End-to-end Data Governance with Databricks and Immuta:
Businesses are consuming data at a staggering rate, but when it comes to getting insights from this data, they grapple with secure data access and data sharing along with ensuring compliance. With new customer data privacy regulations like GDPR and the upcoming CCPA, the leash on data security policies is getting tighter, which results in slowing down analytics and machine learning (ML) projects. Databricks and Immuta have therefore partnered to provide an end-to-end data governance solution with enterprise data security for analytics, data science, and machine learning. Their joint solution is centered around fine-grained security and secure data discovery and search, allowing teams to securely share data and perform compliant analytics and ML on their data lakes.
We will now help you understand the process of enabling scalable analytics and ML on sensitive data in Data Lakes.
Immuta’s automated governance solution integrates natively with Databricks Unified Data Analytics Platform. The advanced data governance controls give users a simplified, end-to-end process to manage access to Delta Lake and meet their organization’s security and data stewardship directives.
Regulatory Compliance: Immuta offers fine-grained access control that provides row, column and cell-level access to data in Databricks. This makes it possible to make more data assets available to users without restricting entire table level access. All data security policies are enforced dynamically as users run their jobs in Databricks.
Secure Data Sharing: By building a self-service data catalog, Immuta makes it easy to perform secure data discovery and search in Databricks. The integration comes with features like programmatic data access that automatically enables global and local policies on Spark jobs in Databricks. Data engineers and data scientists can securely subscribe to and collaborate on sensitive data while having the peace of mind for all their data security and privacy needs.
Compliant Analytics and ML: Using anonymization and masking techniques in Immuta, Databricks users can perform compliant data analytics and ML in Delta tables within the context under which they need to act. With automated policy application, Immuta eliminates the need to check for permissions each time data is accessed to speed up analytics workloads while preserving historical data.
That was a high-level overview of Immuta. You can refer to the link below to read more about it.
End-to-end data governance with Immuta
ANNOUNCEMENT ALERT!!!
We are thrilled to announce the launch of Databricks on Google Cloud.
This jointly developed service provides a simple, open lakehouse platform for data engineering, data science, analytics, and machine learning. You can implement the Databricks Lakehouse Platform on Google Cloud by leveraging Delta Lake on Databricks.
Databricks on Google Cloud offers enterprise flexibility for AI-driven analytics. The key benefits offered are-
Faster innovation with Databricks using Google Cloud
Enable scale and efficiency for your analytics
Simplify data analytics infrastructure and security
Databricks on Google Cloud offers containerized deployment featuring tight integrations with Google Cloud’s analytics services. Key features it offers are-
Delta Lake on Databricks and fully managed Spark experience
Databricks integrated with BigQuery
Databricks containerization with Google Kubernetes Engine
Alternatively, you can read more about Databricks on Google Cloud using the links below-
Databricks on Google Cloud_Key Features & Benefits
We know you are intrigued. So why wait?
Explore how Databricks can help individuals and organizations adopt a Unified Data Analytics approach for better performance and to keep ahead of the competition.
Sign up for the Community Edition of Databricks and dive into a plethora of computing capabilities.
Databricks Sign up
Alternatively, you can read more about Databricks from here:
Managing your Databricks Account
Databricks website
Databricks concepts
Video content on Databricks