This article covers the creation of Spark clusters in Databricks. Databricks is a software platform built on Apache Spark that provides a workspace for running jobs on a cluster.
Databricks was created by members of the original Apache Spark team and is used for data analysis, data processing, and data engineering. Data scientists, engineers, and analysts process data by running jobs, and applications are executed on the cluster through the Jobs API. You can run jobs from notebooks on the cluster and manage job execution results through the CLI, the API, and alerts.
A Databricks cluster is used for analysis, streaming analytics, ad hoc analytics, and ETL data workflows. A notebook in Databricks holds a set of commands that run on a cluster. There are two distinct cluster types: all-purpose clusters, which are used for interactive data analysis with notebooks, and job clusters, which are used for executing jobs. All-purpose clusters can be created through the user interface, the command-line interface, or the REST API, and multiple users can collaborate on them.
To access clusters, start from the Azure Databricks Home page:
You can create clusters from the user interface. All-purpose clusters run notebooks and stay active until you stop them. Job clusters, by contrast, are created for job execution and are terminated once the job finishes; to get a job cluster, you create a job that declares one. Ensure that you have both job-creation and cluster-creation permissions. A hedged sketch of creating a job with its own job cluster through the Jobs API follows.
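As an illustration, the sketch below creates a job whose single task runs a notebook on a fresh job cluster through the Databricks Jobs API (2.1). The workspace URL, token, notebook path, node type, and runtime version are placeholder assumptions; check the Jobs API reference for the exact fields supported in your workspace.

```python
import requests

HOST = "https://<databricks-instance>"      # placeholder: your workspace URL
TOKEN = "<personal-access-token>"           # placeholder: a PAT with job/cluster permissions

# Define a job with one notebook task that runs on its own job cluster.
# The cluster is created when the job starts and terminated when it finishes.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Users/me@example.com/etl"},   # placeholder notebook
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",   # example runtime version
                "node_type_id": "Standard_DS3_v2",     # example Azure node type
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())   # returns the job_id on success
```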
From the Databricks Home page (shown above), click the Clusters icon in the sidebar. To create a cluster, click the Create Cluster button (as shown in the figure below).
Databricks Cluster
Give the cluster a name and configure it using the configuration tab shown in the figure above. After the cluster is created and starts, the progress spinner turns green. You can then attach notebooks to execute commands and queries.
Clusters can be edited through the cluster configuration tab, as shown below.
You can also edit a cluster using the Clusters Edit API. Notebooks and jobs attached to the cluster remain attached after edits, as do installed libraries. After editing, you can restart the cluster for the changes to take effect. A minimal sketch of calling the edit endpoint is shown below.
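The sketch below edits an existing cluster through the Clusters API. Note that the edit endpoint generally expects the full desired cluster specification, not just the changed fields; the values here are placeholders.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

# Edit an existing cluster: the edit endpoint takes the complete desired spec.
# A running cluster is restarted so the new configuration takes effect.
edit_spec = {
    "cluster_id": "<cluster-id>",            # placeholder: id of the cluster to edit
    "cluster_name": "analytics-cluster",
    "spark_version": "11.3.x-scala2.12",     # example runtime version
    "node_type_id": "Standard_DS4_v2",       # example: move to a larger node type
    "num_workers": 4,                        # example: grow from 2 to 4 workers
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=edit_spec,
)
resp.raise_for_status()
```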
You can constrain cluster configuration by setting a cluster policy. A cluster policy is a rule set that governs how clusters can be configured: each rule constrains an attribute that can be set when a cluster is created. Access control lists set limits on which users and groups the policy applies to. If the Unrestricted policy is selected, no attributes are limited. A hedged example of a policy definition follows.
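As a hedged example, the policy below limits auto-termination and restricts the allowed node types. The rule syntax follows the documented cluster-policy format, but the attribute names and values chosen here are illustrative.

```python
import json
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

# A policy definition is a JSON rule set: each key constrains one cluster attribute.
policy_rules = {
    "autotermination_minutes": {"type": "range", "maxValue": 120},         # cap auto-termination
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2"]},  # approved node types only
    "spark_version": {"type": "unlimited", "defaultValue": "11.3.x-scala2.12"},
}

resp = requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "limited-clusters", "definition": json.dumps(policy_rules)},
)
print(resp.json())   # returns the policy_id that users reference when creating clusters
```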
Clusters support several modes: Standard, High Concurrency, and Single Node, with Standard as the default. Cluster configuration includes an auto-termination setting whose default depends on the mode: Standard and Single Node clusters default to 120 minutes, while High Concurrency clusters default to no auto-termination. The cluster mode cannot be changed once the cluster is created.
Standard clusters run workloads written in Python, R, Scala, and SQL. High Concurrency clusters are managed cloud resources intended for SQL, Python, and R workloads, and they support table access control. Single Node clusters run Spark jobs on the driver node only. A cluster can also draw from a predefined pool of idle instances, with worker nodes allocated from the pool to the cluster. A hedged sketch of requesting a Single Node cluster follows.
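For example, a Single Node cluster can be requested through the Clusters API by setting the worker count to zero and the single-node Spark profile. The configuration keys below reflect commonly documented settings, but treat them as assumptions to verify against your workspace.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

# A Single Node cluster runs Spark on the driver only: no workers are allocated.
single_node_spec = {
    "cluster_name": "single-node-dev",
    "spark_version": "11.3.x-scala2.12",        # example runtime version
    "node_type_id": "Standard_DS3_v2",          # example Azure node type
    "num_workers": 0,                           # no worker nodes
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",             # run Spark locally on the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "autotermination_minutes": 120,             # default for Standard/Single Node modes
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=single_node_spec,
)
print(resp.json())   # returns the new cluster_id
```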
Every cluster runs a Databricks Runtime, which bundles Apache Spark with additional components that improve performance, security, and usability. You select the runtime from a drop-down when creating or editing the cluster, and you can also specify a custom Docker image at creation time, as sketched below.
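The runtime version and an optional custom container image are both part of the cluster spec. The sketch below shows where they go; the image URL and registry credentials are placeholders, and using a custom image assumes Databricks Container Services is enabled in the workspace.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

cluster_spec = {
    "cluster_name": "custom-runtime-cluster",
    "spark_version": "11.3.x-scala2.12",      # the Databricks Runtime is selected here
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Optional: run the cluster nodes from a custom Docker image.
    "docker_image": {
        "url": "myregistry.azurecr.io/spark-image:latest",             # placeholder image
        "basic_auth": {"username": "<user>", "password": "<password>"},  # placeholder credentials
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())
```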
Cluster nodes can be initialized with cluster-scoped and global init scripts. Cluster-scoped scripts run on a single cluster, while global scripts run on every cluster in the Databricks workspace, which helps keep cluster configuration consistent. Global init scripts are created by admin users and do not run on model serving clusters. Init scripts are executed in the following order:
Legacy global (deprecated)
Cluster-named (deprecated)
Global
Cluster-scoped
Cluster-scoped init scripts are part of the cluster configuration and apply both to clusters you create and to clusters that run jobs. Permission to change these scripts is controlled through the cluster configuration, and the scripts themselves can be set through the user interface, the command line, or the Clusters API, as sketched below.
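A cluster-scoped init script is attached through the init_scripts field of the cluster spec. In the hedged sketch below, the script is assumed to live in the workspace files area, and its path is a placeholder.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

cluster_spec = {
    "cluster_name": "cluster-with-init",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Cluster-scoped init scripts run on every node when the cluster starts.
    "init_scripts": [
        {"workspace": {"destination": "/Users/me@example.com/install-libs.sh"}}  # placeholder path
    ],
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())
```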
Clusters are managed in the Databricks workspace, where both cluster types (job and all-purpose) can be created. For each cluster you can see its name, state, number of nodes, driver and worker node types, Databricks Runtime version, creator, and number of attached notebooks.
A cluster can be in one of the following states:
Pinned
Starting
Terminating
Running
Terminated
Running Serverless
Terminated Serverless
Access Denied
Running Locked
Terminated Locked
Table ACLs enabled
Running Table ACLs
Terminated Table ACLs
You can filter the cluster list and perform actions on individual clusters. The available filters are listed below (a sketch of filtering through the API follows the list):
Created by the user
Accessible to the user
String value in the field
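The same kind of filtering can be done programmatically by listing clusters through the API and filtering the response client-side, as in the hedged sketch below. The response fields creator_user_name and state are taken from the Clusters API, but verify them against your API version; the user name is a placeholder.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

resp = requests.get(
    f"{HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
clusters = resp.json().get("clusters", [])

# Filter client-side: clusters created by a given user that are currently running.
mine_and_running = [
    c for c in clusters
    if c.get("creator_user_name") == "me@example.com"   # placeholder user
    and c.get("state") == "RUNNING"
]

for c in mine_and_running:
    print(c["cluster_id"], c["cluster_name"])
```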
A cluster is permanently deleted thirty days after it is terminated. To keep an all-purpose cluster beyond that window, an admin can pin it. Cluster configurations can also be exported to JSON format. A sketch of pinning and exporting through the API follows.
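Pinning and exporting can also be done through the API, as in the sketch below: the pin endpoint keeps the cluster from being permanently deleted, and the get endpoint returns the full cluster specification as JSON, which can be saved as an export. The endpoint paths follow the Clusters API 2.0 layout; the cluster id is a placeholder.

```python
import json
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
CLUSTER_ID = "<cluster-id>"              # placeholder cluster id

# Pin the cluster so it is kept beyond the thirty-day retention window.
requests.post(f"{HOST}/api/2.0/clusters/pin", headers=HEADERS,
              json={"cluster_id": CLUSTER_ID}).raise_for_status()

# Export the cluster configuration: the get endpoint returns the full spec as JSON.
spec = requests.get(f"{HOST}/api/2.0/clusters/get", headers=HEADERS,
                    params={"cluster_id": CLUSTER_ID}).json()

with open("cluster-export.json", "w") as f:
    json.dump(spec, f, indent=2)
```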
You can use pools to reduce cluster start and scale-up times, which helps process data quickly while keeping costs down. When you attach a cluster to a pool, the cluster acquires its instances from the pool; when the cluster releases an instance, the pool retains it. Pools are created with specific Databricks Runtime versions and instance types. A pool can hold spot instances to lower cost and on-demand instances to shorten job run times. The Pool and Cluster tabs are used for tracking costs, and the pool configuration itself can be tuned to reduce cost. Pools are pre-populated for use by clusters and are created based on the workloads they will serve. A sketch of creating a pool and attaching a cluster to it follows.
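The sketch below creates an instance pool and then attaches a new cluster to it by id. The pool fields follow the Instance Pools API, but the sizes, node type, and runtime version are illustrative assumptions.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Create a pool that keeps a few idle instances warm for fast cluster starts.
pool = requests.post(f"{HOST}/api/2.0/instance-pools/create", headers=HEADERS, json={
    "instance_pool_name": "etl-pool",
    "node_type_id": "Standard_DS3_v2",               # example node type
    "min_idle_instances": 2,                         # pre-warmed instances
    "idle_instance_autotermination_minutes": 30,
    "preloaded_spark_versions": ["11.3.x-scala2.12"],
}).json()

# Attach a new cluster to the pool: workers are drawn from the pool's idle instances,
# so the cluster spec references the pool instead of a node type.
cluster = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json={
    "cluster_name": "pool-backed-cluster",
    "spark_version": "11.3.x-scala2.12",
    "instance_pool_id": pool["instance_pool_id"],
    "num_workers": 2,
}).json()
print(cluster)
```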
Now let us look at some best practices for Spark clusters, starting with a checklist of questions:
What are the different user types involved in cluster management?
What types of workloads are executed on the cluster?
What are the SLAs?
What is your budget?
How do you plan to utilize the clusters?
How large is the workload?
What is the workload complexity?
What are the data sources?
What are the data partitions in external storage?
How many parallel nodes are required?
The goal is to configure clusters for high performance at low cost. Databricks Units (DBUs) drive the cost of a cluster, so once your data processing logic is implemented, switch to job clusters to reduce cost. Use a Standard cluster to process large amounts of data with Spark, a Single Node cluster for smaller datasets, and a High Concurrency cluster when you want autoscaling for ad hoc workloads.
On-demand and spot instances on Azure help cut the cost of data processing, and a cluster can mix the two. Launching the driver on an on-demand instance keeps the cluster state safe even if spot workers are reclaimed, so a hybrid cluster typically backs its spot instances with on-demand capacity. How much of each to use depends on the importance of the jobs and the data, the timelines, and the budget.
You can enable autoscaling so the cluster resizes automatically with the workload, which cuts cost and improves performance. A hedged sketch of a hybrid, autoscaling cluster spec follows.
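Putting the last two points together, the sketch below requests a cluster whose driver runs on an on-demand instance, whose workers are spot instances that fall back to on-demand when spot capacity is unavailable, and whose size autoscales with the workload. The azure_attributes and autoscale fields follow the Azure Databricks Clusters API, with illustrative values.

```python
import requests

HOST = "https://<databricks-instance>"   # placeholder workspace URL
TOKEN = "<personal-access-token>"        # placeholder token

cluster_spec = {
    "cluster_name": "hybrid-autoscaling-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    # Autoscaling: Databricks grows and shrinks the worker count within these bounds.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Hybrid instances on Azure: keep the first node (the driver) on-demand so the
    # cluster state survives spot evictions; use spot for the rest, with fallback
    # to on-demand when spot capacity is unavailable.
    "azure_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,      # -1 means pay up to the on-demand price
    },
    "autotermination_minutes": 60,     # stop the cluster after an hour of inactivity
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())
```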
The following recommendations help you get high performance:
You can enable auto termination to stop clusters after a period of inactivity.
You can enable autoscaling based on workload.
You can use pools to restrict clusters to pre-approved instance types.
You can use storage autoscaling.
You can avoid deploying clusters with large amounts of RAM configured for each instance to minimize the impact of long garbage collection sweeps.
In the next part of this series, we will look at data transformations with Databricks.
Azure Databricks platform: https://docs.microsoft.com/en-us/azure/databricks/