Data engineering is one of the fastest-growing sub-domains of data science. It is the process of designing scalable systems for collecting raw data, storing it, and analyzing it to retrieve information.
Data engineering brings more and more organizations, regardless of sector, to the practice of data science. It helps organizations make data-driven decisions to boost their performance.
Because data grows rapidly every day, it is in the best interest of data science professionals to automate their projects so that ML models can work with live data and make better decisions.
Practicing data engineering is one of the crucial responsibilities of a data scientist, and building projects can take a while. In this article, we will unravel the steps to build a data engineering project in just five minutes: the five standard steps involved in building any good data engineering project for your portfolio, which can help you land a data engineering job, each illustrated with a worked example. Without further ado, let's swing into action.
Creating any data engineering project involves the five steps outlined below.
The first step in building the project is finding a live data source you are interested in analyzing. Many data APIs are available: Covid-19 case data, Covid-19 vaccination data, Twitter data, equity/currency exchange market data, and much more.
For this experiment, daily Covid-19 case data is chosen. Many data sources, such as weather, stock market, and currency exchange rate services, provide their information via APIs, so we can access live data from the areas mentioned above simply and effortlessly.
The second step is identifying how to extract the data of interest. This article's data source is provided by the NovelCOVID / disease.sh API. Below are the GET requests for extracting the daily worldwide Covid-19 cases and the Covid-19 vaccination coverage for the past thirty days in India.
GET https://corona.lmao.ninja/v2/all
JSON Response of the GET request
{
  "updated": 1650172623567,
  "cases": 504394510,
  "todayCases": 201250,
  "deaths": 6221989,
  "todayDeaths": 441,
  "recovered": 455188181,
  "todayRecovered": 343253,
  "active": 42984340,
  "critical": 42497,
  "casesPerOneMillion": 64709,
  "deathsPerOneMillion": 798.2,
  "tests": 6183333841,
  "testsPerOneMillion": 783281.04,
  "population": 7894144670,
  "oneCasePerPeople": 0,
  "oneDeathPerPeople": 0,
  "oneTestPerPeople": 0,
  "activePerOneMillion": 5445.09,
  "recoveredPerOneMillion": 57661.49,
  "criticalPerOneMillion": 5.38,
  "affectedCountries": 228
}
GET https://disease.sh/v3/covid-19/vaccine/coverage/countries/india?lastdays=30
JSON Response of the GET request
{
  "country": "India",
  "timeline": {
    "3/23/22": 1820574288,
    "3/24/22": 1824268730,
    "3/25/22": 1826265830,
    "3/26/22": 1830285290,
    "3/27/22": 1831347741,
    "3/28/22": 1833040014,
    "3/29/22": 1834500657,
    "3/30/22": 1838519822,
    "3/31/22": 1841947045,
    "4/1/22": 1843313964,
    "4/2/22": 1845350769,
    "4/3/22": 1847411361,
    "4/4/22": 1847000842,
    "4/5/22": 1848699443,
    "4/6/22": 1850895496,
    "4/7/22": 1852472643,
    "4/8/22": 1853783376,
    "4/9/22": 1855432129,
    "4/10/22": 1856267473,
    "4/11/22": 1857308103,
    "4/12/22": 1859559964,
    "4/13/22": 1860601312,
    "4/14/22": 1862055068,
    "4/15/22": 1862475218,
    "4/16/22": 1863626877,
    "4/17/22": 1864804704,
    "4/18/22": 1865530788,
    "4/19/22": 1867730851,
    "4/20/22": 1868475798,
    "4/21/22": 1868475798
  }
}
After picking a suitable API, the third step is to create an ETL pipeline for it. A lightweight cron job can serve as the ETL here. The main idea, as in any data science project, is that the data source needs to be refreshed after a certain point, at which time the cron job must re-run. In our case, the cron job needs to re-run daily, as in the sketch below.
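For illustration, a minimal crontab entry might look like the following; the script name etl.py and its path are assumptions for this example, standing in for wherever you save the extraction logic shown in the next step.
# Run the ETL script every day at midnight (fields: minute hour day-of-month month day-of-week command)
0 0 * * * /usr/bin/python3 /home/user/etl.py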
The fourth step is finding someplace to write the data generated by the ETL. Writing the data to a database is the best option for retrieving records quickly; any relational database or cloud storage tool can be used (a small persistence sketch follows the script's output below). When dealing with a small volume of data or low-latency data requests, steps three and four can be replaced with the quick Python script below.
import pandas as pd
import matplotlib.pyplot as plt
import requests

# Fetch the global Covid-19 summary statistics
url1 = "https://corona.lmao.ninja/v2/all"
source1 = requests.get(url1).json()

# Fetch the last 30 days of vaccination coverage for India
url2 = "https://disease.sh/v3/covid-19/vaccine/coverage/countries/india?lastdays=30"
source2 = requests.get(url2).json()

# Load the vaccination response into a DataFrame and move the dates into a column
df2 = pd.DataFrame.from_records(source2)
df2.reset_index(level=0, inplace=True)
The output of the code
{
  'updated': 1650180423900,
  'cases': 504397709,
  'todayCases': 204440,
  'deaths': 6222008,
  'todayDeaths': 453,
  'recovered': 455194353,
  'todayRecovered': 349503,
  'active': 42981348,
  'critical': 42495,
  'casesPerOneMillion': 64710,
  'deathsPerOneMillion': 798.2,
  'tests': 6183413292,
  'testsPerOneMillion': 783291.1,
  'population': 7894144670,
  'oneCasePerPeople': 0,
  'oneDeathPerPeople': 0,
  'oneTestPerPeople': 0,
  'activePerOneMillion': 5444.71,
  'recoveredPerOneMillion': 57662.28,
  'criticalPerOneMillion': 5.38,
  'affectedCountries': 228
}
{
  "country": "India",
  "timeline": {
    "3/23/22": 1820574288,
    "3/24/22": 1824268730,
    "3/25/22": 1826265830,
    "3/26/22": 1830285290,
    "3/27/22": 1831347741,
    "3/28/22": 1833040014,
    "3/29/22": 1834500657,
    "3/30/22": 1838519822,
    "3/31/22": 1841947045,
    "4/1/22": 1843313964,
    "4/2/22": 1845350769,
    "4/3/22": 1847411361,
    "4/4/22": 1847000842,
    "4/5/22": 1848699443,
    "4/6/22": 1850895496,
    "4/7/22": 1852472643,
    "4/8/22": 1853783376,
    "4/9/22": 1855432129,
    "4/10/22": 1856267473,
    "4/11/22": 1857308103,
    "4/12/22": 1859559964,
    "4/13/22": 1860601312,
    "4/14/22": 1862055068,
    "4/15/22": 1862475218,
    "4/16/22": 1863626877,
    "4/17/22": 1864804704,
    "4/18/22": 1865530788,
    "4/19/22": 1867730851,
    "4/20/22": 1868475798,
    "4/21/22": 1868475798
  }
}
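If you do choose to persist the extracted data (step four) rather than keep it in memory, below is a minimal sketch that writes the vaccination DataFrame from the script above to a local SQLite database via pandas' to_sql; the file name covid_data.db and table name vaccination_coverage are illustrative assumptions, not part of the original project.
import sqlite3

# Open (or create) a local SQLite database file; the name is illustrative
conn = sqlite3.connect("covid_data.db")
# Write df2 from the script above; replace the table on each daily re-run
df2.to_sql("vaccination_coverage", conn, if_exists="replace", index=False)
conn.close()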
The final step is to visualize/analyze the stored data. We can connect visualization tools like Tableau or Power BI to the data, or use Python libraries such as pandas and Matplotlib to render the visualizations. Below is a sample snippet visualizing the global Covid-19 data.
# Bar chart of selected global metrics from the first API response
fig = plt.figure(figsize=(9, 6))
df = pd.DataFrame({
    "value": [source1['deathsPerOneMillion'], source1['todayDeaths'], source1['affectedCountries']]
},
    index=["Death Per One Million", "Today Deaths", "Affected Countries"])
df['value'].plot(kind="bar")
plt.title("Global covid-19 daily statistics")
plt.xlabel("Metrics")
plt.ylabel("Covid-19 Numbers")
plt.xticks(rotation=45)
plt.savefig('output.png', bbox_inches="tight")
Below is the visualization output of the code snippet.
Below is the sample snippet of visualizing the vaccination coverage data.
# Build a date range matching the 30-day window of the API response
time = pd.date_range('03/23/2022', '4/21/2022')
fig, ax = plt.subplots()
# Plot vaccination totals against the dates on the x-axis
ax.plot(time, df2['timeline'], color='purple')
ax.xaxis_date()
fig.autofmt_xdate()
ax.set(xlabel="Date", ylabel="Vaccination numbers",
       title="Daily Vaccine coverage in India for the last 30 days")
plt.savefig('output2.png', bbox_inches="tight")
Below is the visualization output of the code snippet.
In this article, we learned about:
Data engineering and the impact of practicing it,
The importance of building a data engineering project as a data science professional, and
The five steps involved in creating a data engineering project.
Below are the steps discussed in the article:
Picking the data source
Extracting data from the source
Creating the ETL
Writing the data from the ETL
Visualizing the data