Parallel computing is an architecture in which several processors execute an application or computation simultaneously. It helps with extensive calculations by dividing the workload between multiple processors, all of which work on the computation at the same time. The primary goal of parallel computing is to increase the available computation power for faster application processing and problem solving. In sequential computing, instructions run one after another without overlapping, whereas in parallel computing instructions run concurrently to complete the given task faster.
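To make the contrast concrete, here is a minimal sketch using Python's standard concurrent.futures module rather than any particular framework; slow_square is a made-up stand-in for an expensive function, and the timings are only what we would roughly expect -

import time
from concurrent.futures import ProcessPoolExecutor

def slow_square(n):
    time.sleep(1)              # pretend this is an expensive computation
    return n * n

if __name__ == "__main__":
    inputs = [1, 2, 3, 4]

    # Sequential: each call waits for the previous one (~4 seconds in total).
    start = time.time()
    print([slow_square(n) for n in inputs], round(time.time() - start, 1), "s")

    # Parallel: the calls run in separate processes (~1 second in total).
    start = time.time()
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(slow_square, inputs)), round(time.time() - start, 1), "s")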
Following are some of the applications of parallel computing:
Database and data mining
Extensive graphics processing, such as for video games and augmented reality
Medical imaging and diagnosis
Web search engines
Electronic circuit design
Dask is a free and open-source library used to achieve parallel computing in Python. It works well with all the popular Python libraries like Pandas, Numpy, scikit-learn, etc. With Pandas alone we can’t handle very large datasets (unless we have plenty of RAM) because they use a lot of memory. It is not possible to use Pandas or Numpy for our data processing tasks once our data outgrows memory. We may try other distributed computing libraries like Apache Spark, but those are not written in Python, so we won’t be able to utilize Pandas, Numpy or other machine learning libraries from them directly. However, Dask provides data frames that can handle large datasets with minimal code changes. Its ability to scale with our data in terms of processing time and memory usage gives it an advantage over other libraries. Dask helps us utilize the power of distributed computing with seamless integration with other data science libraries like Pandas or Numpy.
Dask vs Spark - Spark is a popular name in the domain of distributed computing. In comparison to Spark, Dask is lightweight and smaller, which means it has a more limited feature set. One major difference is that Spark is written in Scala and supports Python through a wrapper library, whereas Dask is written in Python and couples with almost all Python libraries used for data analysis. Spark doesn’t support multi-dimensional arrays, whereas Dask supports scalable multi-dimensional arrays. Both scale from a single node to thousand-node clusters. If you want to write your project in Python and are looking to add machine learning, then you can go with Dask. If you are a Scala or SQL developer working on JVM infrastructure and doing mostly ETL work, then you can go with Spark. In terms of ease of use, Dask is far easier to get started with than Spark.
Dask vs Ray vs Modin - Dask and Ray differ in their scheduling mechanisms. Dask uses a centralized scheduler which handles all tasks for a cluster. Ray is decentralized, which means every machine runs its own scheduler, so scheduling issues are handled at the level of the individual machine, not at the level of the whole cluster. Dask has extensive high-level collection APIs (e.g., dataframes, distributed arrays, etc.), whereas Ray does not. Modin, on the other hand, runs on top of Dask or Ray. We can easily scale our Pandas workflow through Modin by adding a single line of code: import modin.pandas as pd. Dask DataFrame does not cover the entire Pandas API, but Modin attempts to parallelize as much of the Pandas API as possible. Dask DataFrame uses row-based partitioning, similar to Spark, while Modin is more of a column store, an approach it inherits from modern database systems. Amongst developers, Ray and Dask are more popular as they are older and more mature.
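As a quick illustration of the Modin one-liner (the CSV file name and column name below are hypothetical placeholders), an existing Pandas script becomes a Modin one by changing only the import -

# Swap "import pandas as pd" for Modin; the rest of the code stays the same.
# "sales.csv" and the "amount" column are hypothetical placeholders.
import modin.pandas as pd

df = pd.read_csv("sales.csv")
print(df["amount"].sum())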
Dask use cases are divided into two parts -
Dynamic task scheduling - which helps us to optimize our computations.
“Big Data” collections - like parallel arrays and dataframes to handle large datasets.
Dask collections are used to create a Task Graph, which is a visual representation of the structure of our data processing tasks. Dask schedulers execute this task graph.
Dask uses parallel programming to execute these tasks. Parallel programming means executing multiple tasks at the same time. This way we can properly utilize our resources and finish multiple steps simultaneously.
Let’s see some of the data collections provided by Dask.
dask.array
- It uses the Numpy interface and cuts a large array into smaller chunks so that we can compute on arrays larger than our system’s memory.
dask.bag
- It provides operations like map, filter, fold, and group by on collections of generic Python objects.
dask.dataframe
- A distributed data frame like Pandas: a large parallel data frame made up of many smaller Pandas data frames. A short usage sketch of all three collections follows below.
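Here is a rough sketch of how these collections feel in practice. The array shape, chunk sizes, file pattern, and column name are illustrative assumptions, not a real dataset -

import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# dask.array: a 10,000 x 10,000 array processed in 1,000 x 1,000 chunks.
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())

# dask.bag: map/filter over generic Python objects, split into partitions.
b = db.from_sequence(range(10), npartitions=2)
print(b.filter(lambda n: n % 2 == 0).map(lambda n: n * n).compute())

# dask.dataframe: a Pandas-like dataframe built from many smaller ones.
# "data-*.csv" and the "amount" column are hypothetical placeholders.
df = dd.read_csv("data-*.csv")
print(df["amount"].sum().compute())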
Dask can be installed using the following command -
pip install dask[complete]
Let’s see some examples of parallel computing using this library. dask.delayed is used to achieve parallelism in our code. It delays function calls into a task graph with dependencies.
import time
import random

def profit(x, y):
    time.sleep(random.random())   # simulate some work with a random delay
    return x + y

def loss(x, y):
    time.sleep(random.random())
    return x - y

def total(x, y):
    time.sleep(random.random())
    return x + y

%%time
x = profit(45, 10)
y = loss(20, 15)
z = total(x, y)
z
Output will be -
CPU times: user 1.33 ms, sys: 2.16 ms, total: 3.49 ms
Wall time: 1.29 s
60
These functions run sequentially one after another, even though they are not dependent on each other. So we can run them in parallel to save time.
import dask

profit = dask.delayed(profit)
loss = dask.delayed(loss)
total = dask.delayed(total)

%%time
x = profit(45, 10)
y = loss(20, 15)
z = total(x, y)
z
Output will be -
CPU times: user 903 µs, sys: 0 ns, total: 903 µs
Wall time: 557 µs
Delayed('total-b0bcbf09-79e4-4dfb-9b91-0f69af46f759')
Even in this small example we can see the improvement in running time. Note that the result is a Delayed object: Dask has only built the task graph at this point and has not actually run the functions yet. We can also visualize the task graph like this -
z.visualize(rankdir='LR')
It will look something like this -
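To actually run the task graph and get the result (60, the same value as the sequential run), we call compute() on the delayed object. Because profit and loss do not depend on each other, Dask can execute them at the same time, so the wall time is roughly the longer of their two sleeps plus the sleep inside total, rather than the sum of all three -

%%time
z.compute()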
That’s all for now. You should definitely try this library in your next project. Happy coding :)