Impala and Hive are both SQL query tools built on Hadoop, but they target different workloads. From the client's perspective they share a great deal: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, and a common storage resource pool. Hive is suited to long-running batch queries and analysis, while Impala is suited to real-time interactive SQL queries; it gives data analysts a big data tool for quickly experimenting with and verifying ideas.
A common pattern is to use Hive first for data transformation, then use Impala to run fast analysis on the result set that Hive produced.
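A minimal sketch of that division of labor, with hypothetical database, table, and column names (raw.orders, dw.sales_daily):

-- Step 1, run in Hive: a batch transformation that aggregates raw data.
INSERT OVERWRITE TABLE dw.sales_daily
SELECT order_date, SUM(amount) AS total_amount
FROM raw.orders
GROUP BY order_date;

-- Step 2, run in impala-shell: pick up the files Hive just wrote,
-- then explore the result set interactively.
REFRESH dw.sales_daily;
SELECT order_date, total_amount
FROM dw.sales_daily
ORDER BY total_amount DESC
LIMIT 10;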
Impala’s optimizations compared with Hive
Execution without MapReduce. MapReduce is a fine parallel computing framework, but it is batch-oriented rather than built for interactive SQL execution. Instead of compiling a query into a series of MapReduce jobs, Impala compiles it into an execution plan tree. After distributing the plan, Impala pulls results up the tree, merging them stage by stage, which avoids the overhead of writing intermediate results to disk and reading them back. And because Impala runs as a long-lived service, there is no per-query startup cost; unlike Hive, no MapReduce job has to be launched for each query.
Impala uses LLVM to generate machine code at run time, producing code specialized for each query and inlining functions to reduce call overhead and speed up execution.
Make full use of available hardware instructions (SSE4.2).
Better I/O scheduling. Impala knows which disk each data block lives on and can exploit multiple disks, and it supports direct (short-circuit) block reads and native-code checksum computation.
Choosing an appropriate storage format yields the best performance, and Impala supports several (see the sketch after this list).
Impala makes maximal use of memory: intermediate results are never written to disk but are streamed between nodes over the network as soon as they are produced.
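For example, a fact table intended for analytic scans is often stored as Parquet, a columnar format that Impala scans efficiently. A minimal sketch with a hypothetical table:

CREATE TABLE dw.fact_orders (
  order_id BIGINT,
  order_date STRING,
  amount DOUBLE
)
STORED AS PARQUET;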
Data storage: both use the same storage pool and can store data in HDFS or HBase.
Metadata: both use the same metastore (a sketch follows this list).
SQL interpretation: both work similarly, generating an execution plan through lexical and syntax analysis.
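Because the metastore is shared, a table defined on one side is immediately usable on the other. A small sketch with a hypothetical table (run the CREATE in Hive, the rest in impala-shell):

-- In Hive:
CREATE TABLE demo.events (id BIGINT, payload STRING);

-- In impala-shell: load the new table's metadata, then query it.
INVALIDATE METADATA demo.events;
SELECT COUNT(*) FROM demo.events;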
Execution plan
Hive: depends on the MapReduce execution framework. The plan is compiled into a map->shuffle->reduce->map->shuffle->reduce pipeline. If a query compiles into multiple rounds of MapReduce, more intermediate results are written out, and given how the MapReduce framework works, the extra intermediate stages lengthen the query's total execution time.
Impala: represents the execution plan as a complete plan tree, which distributes naturally to each Impalad for execution instead of being forced into Hive's chained map->reduce shape. This gives Impala better parallelism and avoids unnecessary intermediate sorts and shuffles.
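You can inspect this plan tree directly: EXPLAIN in impala-shell prints the tree of scan, aggregation, and exchange nodes without running the query. The example below reuses the DB-layer table loaded later in this article:

EXPLAIN
SELECT ACCOUNT_CODE, SUM(ACCOUNT_AMOUNT)
FROM DLA_DB.EBS_CUX_PA_CONTRACT_RPT_D_ALL
GROUP BY ACCOUNT_CODE;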
Data flow
Hive: uses a push model; each compute node actively pushes its output to downstream nodes once its computation finishes.
Impala: uses a pull model; downstream nodes request data from upstream nodes via getNext. Data can therefore stream back to the client as soon as the first rows are processed, instead of after all processing finishes, which suits interactive SQL queries much better.
Memory usage
Hive: if the working set does not fit in memory during execution, Hive spills to external storage so the query can still run to completion. After each round of MapReduce, intermediate results are written to HDFS, and because of how the MapReduce architecture works, the shuffle phase also writes to local disk.
Impala: when the data does not fit in memory, the current version (1.0.1) simply returns an error instead of spilling to external storage; future versions should improve this. It places real limits on the queries Impala can handle today, so Impala is best used together with Hive. Impala moves data between stages over the network and performs no disk writes during execution (except for inserts).
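Later Impala releases expose a MEM_LIMIT query option in impala-shell, so a session can cap per-node memory and fail fast rather than exhaust the cluster; the 1.x behavior described above simply aborts once memory runs out. A sketch:

-- Cap this session's queries at 2 GB of memory per node.
SET MEM_LIMIT=2gb;
SELECT COUNT(*) FROM DLA_DB.EBS_CUX_PA_CONTRACT_RPT_D_ALL;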
Scheduling
Hive: task scheduling depends on Hadoop's scheduling strategy.
Impala: does its own scheduling. There is currently only one scheduler, simple-schedule, which tries to satisfy data locality so that data is scanned as close as possible to the machines that physically hold it. The scheduler is still fairly simple; as SimpleScheduler::GetBackend shows, it does not yet consider load, network I/O status, or similar factors. Impala does, however, already gather performance statistics for query execution, which future versions should feed into scheduling decisions.
Fault tolerance
Hive: relies on Hadoop's fault tolerance.
Impala: has no fault-tolerance logic during a query; if a failure occurs mid-execution, an error is returned. This matches Impala's design: it targets real-time queries, and if one fails, simply rerunning it is cheap. As a service, however, Impala is quite fault-tolerant. All Impalads are peers, and users can submit a query to any of them; if one Impalad fails, the queries running on it fail, but users can resubmit them to another Impalad without losing the service. There is currently a single State Store, and when it fails the service keeps running because each Impalad caches the State Store's information; but the cluster state can no longer be updated, so tasks may still be assigned to the failed Impalad, causing those queries to fail.
Applicability
Hive: complex batch queries and data transformation jobs.
Impala: real-time data analysis. Because UDFs are not supported, the problem domain it can handle is limited; it pairs well with Hive, running real-time analysis over Hive's result data sets.
Application in Business Intelligence (BI)
You interact with Impala and Hive mainly through the Impala and Hive shells or through the Hue web UI, which provides a query editor for each.
Both Hive and Impala are data query tools built on Hadoop, so how do they load and store data in real applications? Taking Impala as the example, let's walk through how data is processed in BI. The whole process is as follows.
Extract data
First, extract the data from the source system into Hadoop with a Sqoop script, creating a corresponding table in the Hadoop DB layer to hold the extracted data, and then refresh the Impala table.
Code:
#!/bin/sh
source /etc/profile
source ~/.bash_profile

# Pull the source table from Oracle into HDFS (free-form query import).
sqoop import \
  --connect jdbc:oracle:thin:@10.8.17.23:1521/FINDM \
  --username test --password test123 \
  --target-dir /FINBI/DB/EBS_CUX_PA_CONTRACT_RPT_D_ALL \
  --query "select SUBCON_TRACK_ID,ORG_ID,SEGMENT1,CONTRACT_NUMBER,PERIOD_NAME,ACCOUNT_CODE,ACCOUNT_NAME,ACCOUNT_AMOUNT,LAST_UPDATE_DATE,LAST_UPDATED_BY,CREATION_DATE,CREATED_BY,LAST_UPDATE_LOGIN,REQUEST_ID from CUX_PA_CONTRACT_RPT_D_ALL@FINDM_EBS where \$CONDITIONS" \
  --delete-target-dir \
  --fields-terminated-by '\001' \
  --null-string '\\N' --null-non-string '\\N' \
  --hive-drop-import-delims --lines-terminated-by '\n' \
  -m 1
[[ $? -eq 0 ]] || exit

# Make the newly imported files visible to Impala.
impala-shell -q 'refresh DLA_DB.EBS_CUX_PA_CONTRACT_RPT_D_ALL;'
Create partition
Partitioning data greatly improves query efficiency. The ODS layer in Hadoop uses PARTITIONED BY to create partitioned tables.
Code:
INSERT OVERWRITE TABLE DLA_ODS.ODS_EBS_PA_CONTRACT_RPT_D_ALL PARTITION(DT)
SELECT *, NOW(), NOW(), '2020-07-31' DT
FROM DLA_DB.EBS_CUX_PA_CONTRACT_RPT_D_ALL;
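The INSERT above assumes the ODS table was created with PARTITIONED BY. A sketch of such a declaration, abbreviated to a few columns and given a hypothetical _SKETCH name (the real table mirrors the DB-layer source plus audit columns), followed by a query that benefits from partition pruning:

CREATE TABLE DLA_ODS.ODS_EBS_PA_CONTRACT_RPT_D_ALL_SKETCH (
  SUBCON_TRACK_ID DOUBLE,
  CONTRACT_NUMBER STRING,
  ACCOUNT_AMOUNT DOUBLE
)
PARTITIONED BY (DT STRING)
STORED AS PARQUET;

-- Filtering on the partition key lets Impala scan only that partition's files.
SELECT COUNT(*)
FROM DLA_ODS.ODS_EBS_PA_CONTRACT_RPT_D_ALL
WHERE DT = '2020-07-31';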
Logical calculation
Aggregations and other logical calculations are usually performed at the DWD layer, and the aggregated data is stored in a newly created table.
Code:
CREATE TABLE FINBI_DWD.DWD_TMP_CUX_PA_CONTRACT_ALL (
  SUBCON_TRACK_ID DOUBLE,
  PERIOD_NAME STRING,
  ORG_ID DOUBLE,
  ORG_NAME STRING,
  CONTRACT_HEADER_ID DOUBLE,
  CONTRACT_NUMBER STRING,
  CUSTOMER_ID DOUBLE,
  CUSTOMER_NUMBER STRING,
  CUSTOMER_NAME STRING,
  COMMENTS STRING,
  CONT_AMT DOUBLE,
  UNTAX_CONT_AMT DOUBLE,
  CON_GROSS_MARGIN DOUBLE,
  LAST_UPDATE_DATE TIMESTAMP,
  LAST_UPDATED_BY DOUBLE,
  LAST_UPDATE_LOGIN DOUBLE,
  CREATION_DATE TIMESTAMP,
  CREATED_BY DOUBLE,
  ORIG_CONTRACT_NUMBER STRING,
  CONTRACT_END_DATE TIMESTAMP,
  SOURCE_NAME STRING,
  CONT_AMOUNT DOUBLE,
  W_INSERT_DATE TIMESTAMP,
  W_UPDATE_DATE TIMESTAMP
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n'
WITH SERDEPROPERTIES ('field.delim' = '\u0001', 'line.delim' = '\n', 'serialization.format' = '\u0001')
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/FINBI/DWD/DWD_TMP_CUX_PA_CONTRACT_ALL';
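The article does not show the aggregation statement itself, so here is a minimal sketch of a dynamic-partition aggregation load, using a hypothetical summary table (FINBI_DWD.DWD_CONTRACT_AMT_SUM) rather than the full table above:

CREATE TABLE FINBI_DWD.DWD_CONTRACT_AMT_SUM (
  CONTRACT_NUMBER STRING,
  TOTAL_AMOUNT DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Aggregate the ODS data; the last SELECT expression supplies the partition value.
INSERT OVERWRITE TABLE FINBI_DWD.DWD_CONTRACT_AMT_SUM PARTITION(dt)
SELECT CONTRACT_NUMBER, SUM(ACCOUNT_AMOUNT) AS TOTAL_AMOUNT, '2020-07-31'
FROM DLA_ODS.ODS_EBS_PA_CONTRACT_RPT_D_ALL
GROUP BY CONTRACT_NUMBER;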
The DM layer in Hadoop generally stores only the latest partition of the data. DM tables are built much like DWD tables.
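A sketch of such a DM load, continuing with the hypothetical tables from the previous step and assuming a FINBI_DM.DM_CONTRACT_AMT table created like the DWD one (the scalar subquery picking the newest DWD partition is supported in Impala 2.0 and later):

INSERT OVERWRITE TABLE FINBI_DM.DM_CONTRACT_AMT PARTITION(dt)
SELECT CONTRACT_NUMBER, TOTAL_AMOUNT, dt
FROM FINBI_DWD.DWD_CONTRACT_AMT_SUM
WHERE dt = (SELECT MAX(dt) FROM FINBI_DWD.DWD_CONTRACT_AMT_SUM);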
Transfer data to Oracle or another BI system database
Transfer the processed data from Hadoop's DM layer to Oracle or another BI system database with a Sqoop script. The BI system used in this article is Oracle BIEE.
Code:
#!/bin/bash
source /etc/profile
source ~/.bash_profile

# Truncate the Oracle target table first (a helper class shipped in the BIEE jar).
java -cp /opt/obiee/warehouseOracle-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
  com.TruncateTable 10.8.17.23 1521 FINDM test test123 DM_F_COST_SYNERGISM1
[[ $? -eq 0 ]] || exit

# Export the DM-layer table from Hadoop into Oracle via HCatalog.
sqoop export \
  --connect jdbc:oracle:thin:@//10.8.17.23:1521/FINDM \
  --username test --password test123 \
  --table DM_F_COST_SYNERGISM1 \
  --columns "MONTH_WID,SCENES_DATA,PERIOD_DATA,ASSESS1_CENTER_WID,ASSESS2_CENTER_WID,CURRENCY_TYPE_WID,DATA_SIZE_WID,MARKET_CLASSIFY,INDUSTRY,DISTRIBUTION_TYPE_WID,FOCUS_FLAG,EXPENSE_ACCOUNT_WID,INCOMES_BUSINESS,GROSS_AMOUNT,EMPLOYEE_AMOUNT,OVERSEAS_COST,CHINA_COST,GROUP_COST,INSERT_DATE,UPDATE_DATE" \
  --fields-terminated-by '\001' \
  --input-null-string '\\N' --input-null-non-string '\\N' \
  --hcatalog-database FINBI_DM --hcatalog-table DM_F_COST_SYNERGISM \
  --num-mappers 20
Show on the dashboard
Finally, display the data on the BI system interface.