Challenge Overview
We're starting a new challenge series for building a tool to migrate HBase tables. In previous challenges we have built a tool to create sample tables, generate sample data and a first version of the tool to migrate smaller amounts of data. In this challenge we will improve the tool so it can be used to migrate larger amounts of data.
The basic idea for improving the performance is to parallelize data migration. You can use either Apache Spark or MapReduce. The data migration job will be run on a Cloudera cluster (Cloudera quickstart docker/VM during review) and the core requirement in this challenge is extracting as much performance out of the used resources as possible. Performance will be 70% of the total score during review. We strongly suggest reading hbase performance tuning chapter of the Apache HBase book. Especially pay attention to Tuning read and write performance section.
In the forums you will find two tools: a tool to generate sample data in HBase and the Migration tool that will be used as the base code.
The time series data (metric id, timestamp, value) that should be migrated is relevant to a specific tables structure. There are two tables. One is the meta table to store metric id and device sensor path, and data type information. The other is the data table to store the actual time series data.
The following is the HBASE meta table schema:
The same data are stored in both I-Row and P-Row, because the data can be queried via both the integer metric id and path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path
Rowkey: 7 bytes
Notes
There will be two modes for the tool: export and import (same as in the base migration tool). In export mode, the tool will dump the table data from HBASE table to a CSV file and an Apache Avro file, or a set of CSV and/or Avro files. Both CSV and Apache Avro file schemas should be well designed to be as generic and reusable as possible, and clearly documented. The size of the generated files need to be manageable. For example, it can be opened in Excel or be read by some program without freezing the user session. Each file size should not be too big. The reason to support both CSV and Avro file format is to satisfy both readability/portability and the performance reasons for different use cases.
In import mode, user can choose one of the two formats as the input. User can then import data from the files back to the same or new HBASE table. User can define a list of devices UUIDs and/or a list of channelIds, and a time range to the utility. The utility shall manage the very lengthy operation and be able to resume the task if some failure happens during hours of executions. These exceptions can happen and the utility shall handle them and be able to resume the work automatically or manually:
Sample shell commands to create both tables are available in the data generation tool docs.
Spark/MapReduce jobs will produce proper logs, and no regressions should be introduced (comparing to the base data migration tool).
The tool will be benchmarked against the CDH quickstart HBASE table. You should document the amount of time used to migrate X number of devices, with Y number of channels, with Z number of values per day, and D days, with ranges X [1,1000], Y [1-100], Z[1-500], D[1-365].
Acceptance criteria: The tool must migrate proper amount of data for the specified input parameters
The basic idea for improving the performance is to parallelize data migration. You can use either Apache Spark or MapReduce. The data migration job will be run on a Cloudera cluster (Cloudera quickstart docker/VM during review) and the core requirement in this challenge is extracting as much performance out of the used resources as possible. Performance will be 70% of the total score during review. We strongly suggest reading hbase performance tuning chapter of the Apache HBase book. Especially pay attention to Tuning read and write performance section.
In the forums you will find two tools: a tool to generate sample data in HBase and the Migration tool that will be used as the base code.
The time series data (metric id, timestamp, value) that should be migrated is relevant to a specific tables structure. There are two tables. One is the meta table to store metric id and device sensor path, and data type information. The other is the data table to store the actual time series data.
The following is the HBASE meta table schema:
The same data are stored in both I-Row and P-Row, because the data can be queried via both the integer metric id and path string. Each device instance is identified by a device UUID. Each sensor is identified by a channelId, which is an integer. The trait describes the characteristic of the data, such as min, max, avg, and v.
I-Row Schema: Stores mapping from the metric id to the device sensor path
- Rowkey: 'i' + 4-byte metric id integer.
- Columns (ColumnFamily:ColumnQualifier):
- m:x (ASCII String) Stores the path string (deviceUUID\channelId\trait). e.g. 8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- m:p (1 byte) Stores the prefix 'A', 'B', ... to help split data among multiple regions.
- m:d (1 byte) Stores the data type (0: UNKNOWN, 1: FLOAT, 2: DOUBLE, 3: LONG, 4: BOOLEAN, 5: STRING).
- m:v (1 byte) Stores the schema version (always set to 5)
- Rowkey: 'p' + string. e.g. p8bedf2c8-a164-4f94-9d04-93d88a01d87b\237\v
- Columns (ColumnFamily:ColumnQualifier):
- m:x (4 byte integer) Stores the metric id integer.
- m:p Same as above.
- m:d Same as above.
- m:v Same as above.
- IN_MEMORY => 'true'
- COMPRESSION => 'SNAPPY' - (or LZO)
- Row Timestamp/Version: milliseconds since epoch
- The numeric values are all Big-Endian encoded.
Rowkey: 7 bytes
- Prefix (1 byte): The ascii code for A, B, C, ..., Z, a randomly assigned prefix to avoid hot spot. The same prefix is always assigned to the same metricId.
- MetricId (4-byte): An integer generated to represent a path (e.g. deviceUUID\channelId\trait)
- DaysSinceEpoch (2-byte): Days index since 1/1/1970 UTC.
- t:offset The offset is a 4-byte integer representing the seconds offset during the current UTC day.
- Row Timestamp/Version: milliseconds since epoch (not exposed in the query)
Notes
- The numeric values are all Big-Endian encoded.
- Rowkey: A+9303+17295, Column:300, Column:600, Column:900, …, Column:second offset
- Values: float, float, float, …
- Note: The A is the prefix. The 9303 is the metric id. The 17295 daysSinceEpoch is 5/9/2017 UTC.
There will be two modes for the tool: export and import (same as in the base migration tool). In export mode, the tool will dump the table data from HBASE table to a CSV file and an Apache Avro file, or a set of CSV and/or Avro files. Both CSV and Apache Avro file schemas should be well designed to be as generic and reusable as possible, and clearly documented. The size of the generated files need to be manageable. For example, it can be opened in Excel or be read by some program without freezing the user session. Each file size should not be too big. The reason to support both CSV and Avro file format is to satisfy both readability/portability and the performance reasons for different use cases.
In import mode, user can choose one of the two formats as the input. User can then import data from the files back to the same or new HBASE table. User can define a list of devices UUIDs and/or a list of channelIds, and a time range to the utility. The utility shall manage the very lengthy operation and be able to resume the task if some failure happens during hours of executions. These exceptions can happen and the utility shall handle them and be able to resume the work automatically or manually:
- The HBASE is too busy, so that the write operation can fail.
- The network can be interrupted.
- The program can crash/shut down.
Sample shell commands to create both tables are available in the data generation tool docs.
Spark/MapReduce jobs will produce proper logs, and no regressions should be introduced (comparing to the base data migration tool).
The tool will be benchmarked against the CDH quickstart HBASE table. You should document the amount of time used to migrate X number of devices, with Y number of channels, with Z number of values per day, and D days, with ranges X [1,1000], Y [1-100], Z[1-500], D[1-365].
Acceptance criteria: The tool must migrate proper amount of data for the specified input parameters
Final Submission Guidelines
Submit the complete source code for the tool
Submit a deployment guide.
Verification guide and benchmark results.
Submit a short demonstration video (unlisted Youtube link)